CR + LF file corruption after FTP transfer

Started by GigaWatt, February 26, 2018, 07:30:45 PM

GigaWatt

I don't know exactly where to post this... it's not exactly related to SMF itself, but I was Googling around the past few weeks for a solution to this problem to no avail.

The problem happened after the main admin decided (for some unknown reason ::)) to just... abandon the forum. And not just abandon it, he removed it from the hosting service (he just removed the top level domain and... you get the idea, nothing loads ::)). I was a mod, but I decided to keep the forum going, along with some of the other admins of the forum. Since none of us had access to the hosting, I asked one of them to (politely ::)) ask the main admin if we could have a copy of the script and the database. I personally knew the main admin... I was surprised when I received a copy of the database and the script.

Anyhow, I had to update the script and the database, it was running an older version of SMF (I think 1.1.16), blah blah blah... to cut a long story short, everything turned out well... except for the attachments. Something must have happened during the FTP transfer (the main admin wrote me that he did the backup of the script via FTP). After some file analysis and hex editing, binary compares and whatnot, I learned that this is called a "CR + LF file corruption". Lucky me ::).

I have no idea if he (the main admin) did this on purpose or not... in the end, it doesn't matter. What matters to me is... is there a way to fix the files?

Now, after some online searching, I found out that the most common type of corruption is just a simple "0D 0A" binary modification. For those who have no idea what these bytes represent, 0D represents carriage return (CR) and 0A represents line feed (LF). Unix/Linux (and modern macOS) uses LF (0A) to end a line (and start a new one), classic Mac OS used CR (0D) and Windows uses both of them, CR + LF (0D 0A). Most FTP programs, when set to transfer files in ASCII mode, just make a simple binary change (change 0D to 0A or vice versa, in some cases add 00 before or after the line end), but they don't "eat" bytes. In this particular case, whole bytes were missing... which, in turn, means that the file downloaded via FTP had a smaller file size than the original file.

Here's a screen shot of a binary compare of an original (left) and a corrupted (right) file.



Basically, you have to add bytes to the file based on a certain criterion (binary string) that appears in the file.

Now, I searched and searched... I didn't find any program that can do this. I can do it manually for some of the attachments, but I can't do it for all of them (around 7000 attachments).

Basically, what I'm looking for is a program that can do this in bulk. Give it a bunch of files, set the binary criteria, patch the files.

Or... if someone has a better idea, I'm open to suggestions. In case you're thinking of suggesting another backup of the forum's script and files... already tried to force that with the main admin... it's a no go. He's a coder... and an admin, so... you know... he's god ::)... he knows best ::)... he thinks I'm not doing something right and either won't make another backup or... he's deleted the forum from the server (I think this is the more probable scenario). In either case, I can't get another copy of the files.
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Sir Osis of Liver

If you used FileZilla to download/upload attachments, and transfer type was not set to binary, the files were transferred as ascii and are corrupt.  Several schemes were posted to fix the files, but none of the ones I tried ever worked.  I don't know of any software that will do what you're suggesting, possibly someone else does, otherwise you would have to write it yourself.

Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

Aleksi "Lex" Kilpinen

Are the originals corrupt for sure, will they not work at all if you give them a proper filetype on your local harddrive? The attachments folder(s) should be transferred as binary both ways to avoid this problem.
If they are corrupt already, I haven't heard of a good way to save them.
Slava
Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

GigaWatt

Quote from: Sir Osis of Liver on February 26, 2018, 09:37:55 PM
If you used FileZilla to download/upload attachments, and transfer type was not set to binary, the files were transferred as ascii and are corrupt.

I have no idea which software was used. I wasn't the one who made the backup.

But there was a MACOSX folder inside the archive (the backup of the script)... I guess it could have been done on a Mac.

Quote from: Aleksi "Lex" Kilpinen on February 26, 2018, 11:42:36 PM
Are the originals corrupt for sure, will they not work at all if you give them a proper filetype on your local harddrive?

No. Actually, that's how I got to the corrupted file in the example. I checked the hash in the database and searched for that string in the attachments, just added the proper extension (pdf in the example) and compared it to the "original" (I had a local copy of the file).

And yes, all of them are corrupt. I have local copies of some of the files... all of the attachments that I compared to the "original" files were corrupt. I have no idea if all of them are, but at least those that contain "0D 0A" in sequence as a binary string are corrupt for sure.

Quote from: Aleksi "Lex" Kilpinen on February 26, 2018, 11:42:36 PM
The attachments folder(s) should be transferred as binary both ways to avoid this problem.

That's what I thought ::). As it turns out, the default setting for most FTP programs is to transfer files in "Auto" mode in both ways. In the process, they analyze the file extension and decide whether they should transfer the file and change the line endings (CR, LF or CR + LF) or transfer the file untouched. And get this... some of them assume that if a file has no file extension (like the attachments), that it's a text file ??? :o. I assume this is how they got corrupted.

Quote from: Aleksi "Lex" Kilpinen on February 26, 2018, 11:42:36 PM
If they are corrupt already, I haven't heard of a good way to save them.

I thought that might be the case...

I'm not a coder by occupation, I work with low level stuff mostly (microcontrollers, electronics and whatnot), but I guess I can code something like this... it shouldn't be too hard. Except for one thing :S.

Let's say you have a binary string somewhere and let's say this binary string contains the problematic string "0D 0A".

... 00 FF 00 FF 0D 0A 00 FF 00 FF ...

Now, here is what the FTP program did.

... 00 FF 00 FF 0A 00 FF 00 FF ...

Compare the upper with the lower sequence. It's clear that 0D is missing. The FTP program literally "ate" a byte each time it encountered the string 0D 0A.
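
The damage described above can be reproduced in a few lines. This is only a minimal Python sketch, assuming the transfer did nothing except turn every 0D 0A pair into a lone 0A; the function name `simulate_ascii_damage` is made up for illustration.

```python
def simulate_ascii_damage(data: bytes) -> bytes:
    """Mimic the observed corruption: every CR LF (0D 0A) pair loses its CR."""
    return data.replace(b"\r\n", b"\n")


# The example sequence from the post, before and after the transfer.
original = bytes.fromhex("00FF00FF0D0A00FF00FF")
damaged = simulate_ascii_damage(original)

print(original.hex(" ").upper())
print(damaged.hex(" ").upper())
print(len(original) - len(damaged), "byte(s) lost")
```

Each 0D 0A pair in the original costs exactly one byte under this assumption, so the size difference between the two files equals the number of pairs that were hit.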

This would be easily fixable if we knew for certain that each time we encounter 0A in the corrupted file, we have to add 0D in front of it and just move all of the other bytes "to the right"... but, it's not that simple.

Let's say we encounter the following string in the corrupted file.

... 00 FF 00 FF 0A 00 FF 00 FF 0A 00 FF 00 FF ...

We would assume that the original should be like this.

... 00 FF 00 FF 0D 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

But what if this string in the original file was actually like this.

... 00 FF 00 FF 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

We would add an extra 0D in the file and, once more, the file is corrupt :S.
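
How many originals are possible can be estimated when the size of the uncorrupted file is known. The following is a rough Python sketch, not a finished tool. It assumes the transfer did nothing except drop the 0D from each 0D 0A pair, and that the original size is recorded somewhere (in the forum database, for example). `count_candidates` is a made-up name.

```python
from math import comb


def count_candidates(damaged: bytes, original_size: int) -> int:
    """Count the originals that would turn into `damaged` after the transfer."""
    missing = original_size - len(damaged)   # number of 0D bytes that were dropped
    # A 0D 0A pair that survived can only come from 0D 0D 0A, so it must get a 0D back.
    forced = damaged.count(b"\r\n")
    free = damaged.count(b"\n") - forced     # 0A bytes where a 0D may or may not go
    remaining = missing - forced
    if remaining < 0 or remaining > free:
        return 0                             # the size does not fit this damage model
    return comb(free, remaining)


damaged = bytes.fromhex("00FF00FF0A00FF00FF0A00FF00FF")
print(count_candidates(damaged, len(damaged) + 2))  # both 0A bytes lost a 0D
print(count_candidates(damaged, len(damaged) + 1))  # only one did, but which one?
```

When the size difference equals the number of 0A bytes, there is exactly one candidate and the file can be rebuilt with certainty. The guesswork only starts when the difference is smaller.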

Based on the assumption that these are reserved characters and that most of the time they are used in combination, I guess my question is, what are my chances of being wrong and adding that extra 0D byte if there wasn't one present in the original ???. I would code something like this, but only if I knew that I could save a large portion of the attachments (let's say 20% and above). If more than 90% would still be corrupt, I wouldn't bother.

Aleksi "Lex" Kilpinen

I'm not sure of this, but on a hunch I'd say you would be playing a lottery - it might work for some files by chance, might not for others. Because we can't know what the files actually contained originally, reconstructing them is not really easy - even if your assumption sounds logical.

GigaWatt

Thought so :S.

Well, I'll try and write something, see if it does something useful with the attachments and if it does... I'll post results.

I'm keeping my fingers crossed, but I'm also not expecting too much based on my assumptions.

Aleksi "Lex" Kilpinen

Well, good luck :) And I am curious, so do tell us if you get results.

GigaWatt

*Update*

I noticed that most of the files that are images are actually intact :). I guess I was right, these are reserved characters and the chances of one being next to the other in a media file are very slim. Luckily, most of the attachments are images :), and if some of them are corrupt, I guess it doesn't affect the decoding process (preview of the image) that much. I binary checked about 10 image attachments, most of them were JPEG, 2 or 3 were PNG and 1 was BMP. Only the bitmap file had the problematic string 0D 0A in 3 or 4 places in the file (I guess it has something to do with the coding of bitmaps, how they are saved ???, maybe it uses line endings to separate settings or file info ???), but the preview of the image was identical to the original file (the uncorrupted one). I was using FastStone Image Viewer... I have no idea if other programs will display the image properly, but FastStone Image Viewer doesn't seem to mind. If I save the corrupted image (recode it) as a BMP file in FastStone Image Viewer, it adds the 0D 0A string in (almost) the same places that the original (not corrupted) image had them.

Maybe different programs under different OSes use different line endings to separate file info in image files, but I suppose most image viewing/editing programs are equipped to handle this ???. For example, if you save a BMP image under Linux in a native Linux application, it'll use LF (0A) to separate line endings in the file info. I haven't tried this, it's just a guess, but I'll try it ;).

On the other hand, I'm almost certain that all binary (EXE) and archive (ZIP, RAR, 7Z, GZIP, XZ...) files are corrupt :S. I tried opening about 5 or 6 archive files (most were RAR, 1 was ZIP and 1 was 7Z), all of them reported that they were corrupt. Two of the RAR files had a 5% recovery record. One of them reported that it was corrupt, but opened the archive, and after I repaired it, it extracted the content of the archive. The files in the archive didn't seem to be damaged, but most of them were images (they previewed properly) and I didn't have the original ones to make a binary check, so I just declared them as OK (undamaged). The other RAR archive also reported corruption, but the 5% recovery record didn't repair the archive :S. I guess the recovery record in RAR archives works with a 50/50 chance of recovering the archive. I know, it probably depends on what's been damaged and how much of the archive has been damaged, but since no one can predict in which part of the RAR archive the 0D 0A string is going to appear (I guess these are not reserved characters in archives), your chances of recovering a RAR archive with an embedded recovery record are 50/50. Well...I guess it's better than nothing.

The binary files are a different story. I haven't checked, but I suppose almost all of them are Windows binaries (EXE). During the testing, I was working with one corrupted binary. It was an MS Visual C++ 6.0 application (a small application, few hundred KB). It had numerous corruptions in the file, but the interesting thing was that all of them were in ASCII strings. I haven't checked this, but I assume that the compiler also uses 0D 0A to end/start lines in text strings. The good thing was that I managed to repair the file :) by "replacing" 0A with 0D 0A (add one byte after each 0A, change 0A to 0D, change the added byte to 0A). I guess in this particular case (IDE/language/compiler combination), this binary string is reserved, so there is no chance of 0D or 0A being alone in the binary, they will always go in pairs. I have no idea if it will work on any other combination (compressed binary, different IDE/language/compiler), but I plan on digging a little deeper ;).
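
The manual fix described above can be applied to a whole folder at once. This is a minimal Python sketch of such a bulk patcher, assuming (as in the Visual C++ case) that 0A never stood alone in the original files. The names `naive_repair`, `repair_folder` and `SKIP` are made up for illustration, and patched copies go to a separate folder so the damaged files stay untouched.

```python
from pathlib import Path

# Files that live in the attachments folder but are not attachments.
SKIP = {".htaccess", "index.php"}


def naive_repair(damaged: bytes) -> bytes:
    """Put a 0D in front of every 0A (only correct if 0A never stood alone)."""
    return damaged.replace(b"\n", b"\r\n")


def repair_folder(src: Path, dst: Path) -> int:
    """Write a patched copy of every file in src to dst; return the file count."""
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.iterdir()):
        if path.is_file() and path.name not in SKIP:
            (dst / path.name).write_bytes(naive_repair(path.read_bytes()))
            count += 1
    return count
```

A file patched this way is only trustworthy if its new size matches the original size. If the patched copy comes out larger than the original was, at least one 0A in it really did stand alone.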
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Aleksi "Lex" Kilpinen

That's interesting, and nice to hear you actually have found a chance of saving some of them. :)

GigaWatt

Yeah, I'm also relieved :), since most of the attachments were image files (JPG, PNG, GIF...).

In the SMF 2.0.x database, I noticed that the script also makes a checksum of all the attachments. I'll try and upload the attachments on the updated version of the forum (2.0.15), run the Attachment Integrity tool and see what it reports. The files that show a checksum mismatch are corrupted of course.

Can SMF 2.0.15 list the corrupted files? An even better option would be to list them in different data types (images, binary, archives, etc.).

Aleksi "Lex" Kilpinen

I'm not truly familiar with the attachment handling on that level, but I don't think SMF offers a report option like that. But - if the checksums are there, and if we already have a tool to check them - I don't think building on that foundation would be too difficult, and it should be possible to do as a mod. (If something doesn't already exist).

GigaWatt

It would be a nice mod ;). Even if the admin doesn't want/know how to fix the files, he could just delete the corrupted ones and their links to posts in the database... sort of a clean up :).

How about links to posts with attachments. Can SMF 2.0.x delete files that have links in posts, but the attachment doesn't exist in the attachments folder and vice versa (there are files in the attachments folder with something that resembles a hash value, but there are no links to them in the database, excluding .htaccess and index.php of course)?

Aleksi "Lex" Kilpinen

Yeah, there is an adminside tool to browse attachments, and it also shows where they are linked to.
The maintenance can also automatically prune links that are broken. The tools are just not so refined as to do everything you wanted.

GigaWatt

Hmmm... OK, I guess I'll work with what I've got ;).

I'll try and post on the attachment issue as soon as I have more results ;).

richardwbb

I like your effort of repair. I've put FileZilla in binary mode and after that upload and download ascii files without problems. I really think the auto transfer option should be set at binary at all times. And what helped me was opening the directory with attachments on a linux environment, that way it can show thumbnails for every image that wasn't damaged. This way I quickly found that a low percentage became damaged, which I had backups for. For Windows you might need a viewer application or backup all those files in attachments and rename all of that to .jpg, which should produce positive results. But again I've always put ftp programs in binary and wonder why that option is there, is it old maybe? Ascii files do transfer properly as binary. The linux 'file' command might be helpful too. There probably is a Windows equivalent I am not aware of.
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

GigaWatt

Quote from: richardwbb on March 09, 2018, 02:43:37 PM
I really think the auto transfer option should be set at binary at all times.

Exactly ;). I had a conversation with the previous "uber admin". He told me that each time the forum was transferred via FTP (we changed hosts a few times), the program was set in Auto mode ::). This is a perfect example that auto mode should be avoided at all costs and you should always set your FTP program to run the transfer in binary mode.
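
For scripted backups, the transfer mode can be fixed in code instead of being left to a client's auto detection. This is a minimal sketch using Python's standard `ftplib`; the helper name `download_binary` is made up for illustration.

```python
from ftplib import FTP


def download_binary(ftp: FTP, remote_name: str, local_path: str) -> None:
    """Fetch one file in binary (image) mode so that no byte is rewritten."""
    with open(local_path, "wb") as handle:
        # retrbinary() switches the connection to TYPE I before sending RETR.
        ftp.retrbinary(f"RETR {remote_name}", handle.write)
```

It would be called on a connection that is already logged in, for example one created with `FTP("ftp.example.com")` followed by `login()`; the host name here is only a placeholder.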

Quote from: richardwbb on March 09, 2018, 02:43:37 PM
And what helped me was opening the directory with attachments on a linux environment, that way it can show thumbnails for every image that wasn't damaged. This way I quickly found that a low percentage became damaged, which I had backups for.

Yeah, I was also thinking the same. I thought, since Linux reads MIME types, not extensions, it should show a preview of the images, at least the ones that aren't corrupt. This was one of my next tests ;). I'll post when I have some new info ;).
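
The same kind of content-based detection can be sketched on any OS, since most formats start with a fixed signature. This is a rough Python sketch, assuming that the first few bytes of an attachment are enough for a coarse sort; the signature list is short and far from complete, and `sniff` is a made-up name. The PNG signature itself contains 0D 0A, so it is also listed in the shortened form it would have after this kind of transfer.

```python
# (signature, label) pairs; the first match wins.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\x89PNG\n\x1a\n", "PNG image, signature lost its 0D"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"BM", "BMP image"),
    (b"%PDF", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"Rar!\x1a\x07", "RAR archive"),
    (b"7z\xbc\xaf\x27\x1c", "7z archive"),
    (b"\x1f\x8b", "gzip archive"),
    (b"MZ", "Windows executable (MZ header)"),
]


def sniff(data: bytes) -> str:
    """Guess a file type from its leading bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Short signatures such as `BM` and `MZ` can match by accident, so the result is a hint for sorting the attachments, not a verdict.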

Quote from: richardwbb on March 09, 2018, 02:43:37 PM
The linux 'file' command might be helpful too. There probably is a Windows equivalent I am not aware of.

I don't think so... and even if there is one, I don't think it'll be as detailed as the Linux one (distinguish media, documents, binaries, etc.), I think it'll only distinguish between text and... everything else... not to mention that most CAD and image manipulation programs basically use text files (xml or something else... doesn't really matter) as their project file, so it might report, let's say, a CorelDRAW file as a text file ::). Linux will do the same in this case, but at least I can easily determine which files are media files (most of the attachments), text files (most probably project files of some sort), documents (Linux should also report them as something other than a regular text file), archives and everything else (mostly binaries in the case of my forum).

richardwbb

I still wonder what ancient use auto or ASCII mode has. I'd been using FileZilla for a long time before I noticed some avatars had been damaged because of the default setting which I suspect is ancient. I forgot about GnuWin32, the file command has been ported, but probably there is a .NET or PowerShell solution I am not aware of.

GigaWatt

PowerShell is a different story... yeah, I suspect that it might have a command that will do what any Linux distro does out of the box ;).

Aleksi "Lex" Kilpinen

At least in theory, the ascii mode is there to protect you from mistakes. For example, upload a config file with wrong line endings, and you could potentially render a server inoperable. It's also said to be faster than binary.

GigaWatt

Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
For example, upload a config file with wrong line endings, and you could potentially render a server inoperable.

As far as I know, both Linux and BSD support CR + LF (Windows) line endings. Everything should work without a problem if the server is set up correctly... even if the line endings don't match the native Linux/BSD ones.

Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
It's also said to be faster than binary.

This doesn't make any sense. In binary mode, the FTP client just transfers the files without analyzing them first to see if they are a text file or not... ASCII mode should be slower.

Anyway, there is some progress on the topic at hand ;).

I did some thinking about the corrupt attachment issue... I may have a solution ;). I just need some information on how the hash of the attachments is calculated (i.e. their file name in the attachments folder). I noticed that even if I upload the same file twice in a certain topic, the hash values for both of the files differ in the database. This was kind of disappointing, since I thought that the hash values are only check sum based, but I guess I was wrong.

My idea was to let the tool (I started coding something) do every possible reconstruction scenario of a file, calculate the check sum and each time a reconstruction scenario is complete, compare the check sum with the generated check sum in SMF. Since there is an attachment integrity tool, I assume there must be some sort of a check sum calculated by the script or at least taken into account when calculating the hash value of the attachment. I just need the algorithm that calculates the hash value, reverse it, discard anything that is not the check sum of the file and compare this value to the check sum of the reconstructed file, which is of course calculated in the same way that the script calculates the check sums ;). I know there could be hundreds of possible reconstruction scenarios for each file, especially if the file contained lots of 0D 0A binary strings, but with modern CPUs, I don't think it'll be such a big problem ;).
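
The idea can be sketched as follows. This is a hypothetical Python outline, not a working recovery tool. It assumes a checksum of the original contents is available (MD5 is used here purely as an example) and that the original file size is known. Whether SMF stores such a checksum is exactly the open question, and `rebuild` and `brute_force` are made-up names.

```python
import hashlib
from itertools import combinations


def rebuild(damaged: bytes, slots: tuple) -> bytes:
    """Insert a 0D in front of the 0A bytes found at the given offsets."""
    out = bytearray()
    previous = 0
    for offset in slots:                      # offsets arrive in ascending order
        out += damaged[previous:offset] + b"\r"
        previous = offset
    out += damaged[previous:]
    return bytes(out)


def brute_force(damaged: bytes, original_size: int, known_md5: str):
    """Try every placement of the missing 0D bytes until the checksum matches."""
    missing = original_size - len(damaged)
    if missing < 0:
        return None
    candidates = [i for i, byte in enumerate(damaged) if byte == 0x0A]
    for slots in combinations(candidates, missing):
        attempt = rebuild(damaged, slots)
        if hashlib.md5(attempt).hexdigest() == known_md5:
            return attempt
    return None                               # no placement reproduces the checksum
```

Knowing the original size keeps the search to placements of exactly the right number of bytes. Even so, the number of combinations grows very quickly with the number of 0A bytes, so for large files it can run far beyond a few hundred attempts.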
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Aleksi "Lex" Kilpinen

Quote from: GigaWatt on March 16, 2018, 06:26:11 PM
Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
For example, upload a config file with wrong line endings, and you could potentially render a server inoperable.

As far as I know, both Linux and BSD support CR + LF (Windows) line endings. Everything should work without a problem if the server is set up correctly... even if the line endings don't match the native Linux/BSD ones.
But do it the other way around, and Windows boxes will crash.
Quote from: GigaWatt on March 16, 2018, 06:26:11 PM
Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
It's also said to be faster than binary.

This doesn't make any sense. In binary mode, the FTP client just transfers the files without analyzing them first to see if they are a text file or not... ASCII mode should be slower.
I'm not sure of this either way, but that's what I have heard. The whole protocol was originally designed for transferring text files.

Quote from: GigaWatt on March 16, 2018, 06:26:11 PM
Anyway, there is some progress on the topic at hand ;).

I did some thinking about the corrupt attachment issue... I may have a solution ;). I just need some information on how the hash of the attachments is calculated (i.e. their file name in the attachments folder). I noticed that even if I upload the same file twice in a certain topic, the hash values for both of the files differ in the database. This was kind of disappointing, since I thought that the hash values are only check sum based, but I guess I was wrong.

My idea was to let the tool (I started coding something) do every possible reconstruction scenario of a file, calculate the check sum and each time a reconstruction scenario is complete, compare the check sum with the generated check sum in SMF. Since there is an attachment integrity tool, I assume there must be some sort of a check sum calculated by the script or at least taken into account when calculating the hash value of the attachment. I just need the algorithm that calculates the hash value, reverse it, discard anything that is not the check sum of the file and compare this value to the check sum of the reconstructed file, which is of course calculated in the same way that the script calculates the check sums ;). I know there could be hundreds of possible reconstruction scenarios for each file, especially if the file contained lots of 0D 0A binary strings, but with modern CPUs, I don't think it'll be such a big problem ;).
This sounds like a cool experiment really, might even work.

GigaWatt

Quote from: Aleksi "Lex" Kilpinen on March 17, 2018, 02:31:59 AM
But do it the other way around, and Windows boxes will crash.

Yeah, true ;).

But, on the other hand, I don't think there are many Windows servers out there. Personally, I wouldn't run my forum on a Windows box :D.

In any case, you might be right... it might be safer to work in ASCII mode, but do it only if you're certain that the files you're transferring are text.

Quote from: Aleksi "Lex" Kilpinen on March 17, 2018, 02:31:59 AM
I'm not sure of this either way, but that's what I have heard. The whole protocol was originally designed for transferring text files.

Had no idea this is the case. File Transfer Protocol (at least in my head) associates with... well... just that, a protocol for transferring files, not a protocol specifically designed for transferring text files.... although the Wiki page mentions both ASCII and binary (image) mode so, I guess historically, both modes might have been a part of the original standard.

Quote from: Aleksi "Lex" Kilpinen on March 17, 2018, 02:31:59 AM
This sounds like a cool experiment really, might even work.

Yeah, I'm hoping it will ;).

I'll start digging in the script, see how the hash is calculated and will write back as soon as I've made some progress ;).

Kindred

You'd be wrong.
  Most corporate businesses run on Windows and IIS.
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

GigaWatt

Aleksi "Lex" Kilpinen


GigaWatt

Some. I'm working with a member on my forum on the issue. Apparently, SMF uses SHA1 and MD5 for its hash values, but it also uses a time stamp and a random value to generate the hash. I (we) think this is the part that actually generates the hash (located in Subs.php).

// Just make up a nice hash...
if ($new)
    return sha1(md5($filename . time()) . mt_rand());


Sure, I could get the time stamp value (it's stored in the database), but what about the random value :S. I can't recreate it, thus, I can't get the MD5 and SHA1 signature, which of course means that the one the program will calculate is practically useless :S.
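
A rough Python transliteration of the PHP line above makes the problem easier to see. It is only an approximation for illustration: `smf_style_hash` is a made-up name and `random.randint` stands in for PHP's `mt_rand()`. The contents of the file never enter the calculation, only its name, the clock and a random number.

```python
import hashlib
import random
import time


def smf_style_hash(filename: str) -> str:
    """Approximate sha1(md5($filename . time()) . mt_rand()) in Python."""
    inner = hashlib.md5(f"{filename}{int(time.time())}".encode()).hexdigest()
    salt = random.randint(0, 2**31 - 1)       # stand-in for PHP's mt_rand()
    return hashlib.sha1(f"{inner}{salt}".encode()).hexdigest()


# Same file name, two uploads: the stored names will almost certainly differ.
print(smf_style_hash("manual.pdf"))
print(smf_style_hash("manual.pdf"))
```

This matches the earlier observation that uploading the same file twice gives two different hash values: the value works as a unique file name, not as a checksum of the contents.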

Is the mt_rand value stored somewhere in the database? From what we've analyzed, it's not, but just in case, I thought I'd ask.

GigaWatt

Just an update on this issue, which is still present, but less painful, since most active members were nice enough to reupload most of the affected files... a big thanks guys ;). The forum wouldn't be the same without you :).

In any case, I might have mentioned before in this thread that I'm working with a member of the forum on this issue (not a team member, a regular one, but he offered to help since he is a lot more involved in web design, PHP, Java, ASP.NET, etc.), and when I saw the mt_rand command... I knew things weren't looking good. So, just in case I was wrong (and not to bias the guy), I provided this info (part of the code in my previous post) to the member who offered a hand... his thoughts were the same as mine... if the value from the mt_rand command is not kept somewhere in the database, there is no way to get to the original CRC of the file :S. I've searched a bunch of times on different occasions for info on this, but I came to the conclusion that this value is not kept in the database... and if that value should be stored somewhere, it should be in the attachments table, but the attachments table doesn't contain any column that might indicate that it holds a value like this.

So... in short, I gave up on the project.

The good news is that the old admin somehow put together a compilation of old backups and reuploaded them to the forum, so the number of corrupted files fell to about 20% (at least that's what the attachment integrity tool reported). Members also chipped in and, 3 months later, I think the value has fallen to about 10%... which is not that bad actually :).

For future reference, I think it should be taken into consideration to include this random value in a column in the attachments table. This will help a lot in problems like this one. I've given up on writing this tool, but only because there is no point in doing it if I can't get that random value.

Just a thought for the developers to think about. I know they've got bigger issues at the moment, but it might be a good idea to include something like this in future releases of SMF.

I'll just mark this thread as solved ;). Thanks to everyone who joined the discussion ;).

Chen Zhen

#27
The issue happens because the SMF 2.0.X branch saves its attachments without a file extension.
Most attachments are usually either an image or compressed file and therefore require binary transfer mode.
An FTP client should be manually set to binary transfer mode so it knows what to use for files without extensions.
The devs already changed this with the SMF 2.1.X branch long ago to a .DAT extension standard which triggers binary mode.

I will give you some advice that is my opinion based on my own experiences.
Do not use the FileZilla FTP client even if you set it to binary mode. One time I did an update that may have changed that setting back to auto.
It caused me some problems such as the ones you describe here in this thread, because I neglected to double-check various backed-up files.
I now use and prefer FTP Rush.




My SMF Mods & Plug-Ins

WebDev

"Either you repeat the same conventional doctrines everybody is saying, or else you say something true, and it will sound like it's from Neptune." - Noam Chomsky

GigaWatt

I use Core FTP Pro in binary mode. I wasn't the one who made the mistake, the previous admin did... I just tried to correct the mistake he did during the last backup (when he finally gave up on being the admin on the forum).

Chen Zhen

There is a way to fix corrupted compressed file line endings with Linux.
I adapted a script (with a loop on the attachments folder path) to do it using C a while back and can try to dig it up if you want it.
It will possibly fix any of the corrupted compressed files (zip, gzip, etc).


GigaWatt

Don't know if you've read the whole thread, but it's not that simple. There are numerous corruptions in almost every file which had 0D 0A anywhere in the file. This is pretty common in DOC, PDF and TXT files... not to mention other file types, which also use some sort of a textual file to store the project, plan, vector drawing, etc. Not just that, the FTP program actually "ate" bytes from the file, leaving the file with a smaller file size than it actually had.
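
Because the damaged copies are smaller than the originals, a size comparison is a cheap first filter for finding affected attachments. Below is a minimal sketch in C, assuming the original size of each attachment has already been read from the database. The function name check_size and its return convention are made up for illustration and are not part of SMF.

```c
#include <stdio.h>

/*
 * Compare the size of the file at `path` with the size recorded for it.
 * `expected` is assumed to come from the forum database (hypothetical input).
 * On success, *missing receives expected - actual, i.e. the number of bytes
 * the transfer dropped when the value is positive.
 * Returns 0 if the sizes match, 1 if they differ, -1 if the file is unreadable.
 */
int check_size(const char *path, long expected, long *missing)
{
    FILE *f = fopen(path, "rb");
    long actual;

    if (f == NULL)
        return -1;
    if (fseek(f, 0L, SEEK_END) != 0 || (actual = ftell(f)) < 0) {
        fclose(f);
        return -1;
    }
    fclose(f);

    *missing = expected - actual;
    return *missing != 0;
}
```

If the transfer really dropped exactly one 0D per 0D 0A pair, a positive *missing is also the number of pairs that were hit in that file.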

Quote from: GigaWatt on February 27, 2018, 01:34:07 PM
Let's say you have a binary string somewhere and let's say this binary string contains the problematic string "0D 0A".

... 00 FF 00 FF 0D 0A 00 FF 00 FF ...

Now, here is what the FTP program did.

... 00 FF 00 FF 0A 00 FF 00 FF ...

Compare the upper with the lower sequence. It's clear that 0D is missing. The FTP program literally "ate" a byte each time it encountered the string 0D 0A.

This is easily fixable if we knew for certain that each time we encounter 0A in the corrupted file, we have to add 0D in front of it and just move all of the other bytes "to the right"... but, it's not that simple.

Let's say we encounter the following string in the corrupted file.

... 00 FF 00 FF 0A 00 FF 00 FF 0A 00 FF 00 FF ...

We would assume that the original should be like this.

... 00 FF 00 FF 0D 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

But what if this string in the original file was actually like this.

... 00 FF 00 FF 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

We would add an extra 0D in the file and, once more, the file is corrupt :S.

Based on the assumption that these are reserved characters and that most of the time they are used in combination, I guess my question is, what are my chances of being wrong and adding that extra 0D byte if there wasn't one present in the original ???. I would code something like this, but only if I knew that I could save a large portion of the attachments (let's say 20% and above). If more than 90% would still be corrupt, I wouldn't bother.
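
To put a rough number on that question: every 0A in the damaged file that is not preceded by 0D is a spot where a 0D may or may not have been eaten, so a file with n such bytes has up to 2^n possible originals. Here is a minimal sketch in C that counts those spots. The name count_bare_lf is made up for illustration, and the sketch assumes the FTP program only ever dropped a 0D that sat directly in front of a 0A.

```c
#include <stddef.h>

/*
 * Count the LF bytes (0x0A) in buf that are NOT preceded by CR (0x0D).
 * Each one is an ambiguous position: the original may have held a bare
 * 0A there, or 0D 0A with the 0D removed during the transfer.
 */
size_t count_bare_lf(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x0A && (i == 0 || buf[i - 1] != 0x0D))
            n++;
    }
    return n;
}
```

For the 14-byte damaged string in the example above it returns 2, which means four possible originals. If the original file size is known, only the combinations that restore exactly the missing number of bytes need to be considered.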

That's why I needed the CRCs. I have no way of knowing if I've made the appropriate changes to the file if I don't know the original CRC. Sure, as I wrote before, I could reverse it from the hash... if it wasn't for the mt_rand value, which is not stored in the database. That's why I suggested including it in a separate column in the attachments table in future releases of SMF, so these sorts of things can be solved easily and accurately (when you can reverse the CRC, you can be 100% sure that what you've corrected in the file is exactly what should have been corrected).

If you can find the script, post it ;). It wouldn't do me much good now, since I have no idea which attachments are affected (the attachment integrity tool doesn't report that... also a good feature to have, list all attachments that have different file sizes than the ones stored in the database) and I wouldn't run the script on all of the files since it might corrupt files that are OK, but it could be a good reference to anyone having similar problems ;).

Chen Zhen

The attachments table holds the file extension data along with the actual file name & the current file name that has no extension.
One could write a PHP script that copies all files that have db entries listed as zip & gz to another folder (ensure to name them with the appropriate extension).
After that you run the fixgz program on the new directory.

I dug up something recursive regarding the fixgz but I'm not sure if it's what I used in the end.
This was a while ago & I can't seem to find where I put what I wrote.
fixgz was written in C & the commands are for Linux.
There were some minor dependencies ( stdio.h ) but without testing it I can't remember atm.

fixgz.c

/* fixgz attempts to fix a binary file transferred in ascii mode by
* removing each extra CR when it is followed by LF.
* usage: fixgz  bad.gz fixed.gz

* Copyright 1998 Jean-loup Gailly <[email protected]>
*   This software is provided 'as-is', without any express or implied
* warranty.  In no event will the author be held liable for any damages
* arising from the use of this software.

* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely.
*/

#include <stdio.h>
#include <stdlib.h> /* exit() */

int main(int argc, char **argv)
{
    int c1, c2; /* input bytes */
    FILE *in;   /* corrupted input file */
    FILE *out;  /* fixed output file */

    if (argc <= 2) {
        fprintf(stderr, "usage: fixgz bad.gz fixed.gz\n");
        exit(1);
    }
    in  = fopen(argv[1], "rb");
    if (in == NULL) {
        fprintf(stderr, "fixgz: cannot open %s\n", argv[1]);
        exit(1);
    }
    out = fopen(argv[2], "wb");
    if (out == NULL) {
        fprintf(stderr, "fixgz: cannot create %s\n", argv[2]);
        exit(1);
    }

    c1 = fgetc(in);

    /* copy every byte except a CR that is immediately followed by LF */
    while ((c2 = fgetc(in)) != EOF) {
        if (c1 != '\r' || c2 != '\n') {
            fputc(c1, out);
        }
        c1 = c2;
    }
    if (c1 != EOF) {
        fputc(c1, out);
    }
    exit(0);
    return 0; /* avoid warning */
}


possible recursive command ~ use Linux (CentOS/Ubuntu) from related parent directory:
Assuming compressed files were transferred to: ../attachments2

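        cc -o fixgz fixgz.c
        # "src" is only a placeholder for the folder holding the damaged copies; adjust it
        src=attachments
        mkdir -p attachments2
        for file in "$src"/*; do
            fname=${file##*/}
            ./fixgz "$file" "attachments2/${fname}"
        done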



GigaWatt

#32
Quote from: Chen Zhen on June 04, 2018, 09:36:06 AM
One could write a PHP script that copies all files that have db entries listed as zip & gz to another folder (ensure to name them with the appropriate extension). After that you run the fixgz program on the new directory.

That's only for archive files... and only zip and gzip... which are not the most common desktop archives used these days (almost 100% of the users visiting my forum use Windows, so they usually upload a rar or a 7z archive). What about every other file type, as I mentioned, documents (doc, docx, pdf, proprietary vector file types, even some image formats are affected)? In proprietary file formats, it's not uncommon to use CR, LF or both, CR + LF in different combinations to separate certain things from the header or the end of the file. fixgz can't help in those cases.

Back when I was tinkering with this problem, I checked what archive types are usually uploaded by users. 80% were rar archives, the rest were 7z archives. There were only 20 zip archives and 2 gzip archives :D.

This will work on forums that have a large Linux userbase, like programming or Linux support forums. Mine is about electronics... the primary (and in almost all of the cases, only) OS of the users visiting my forum is Windows.

It's great that you posted the code to fixgz ;). I also ran into this program when I was searching for a solution to my problem. The only problem is that, in its original form, this program only strips a 0D that sits in front of a 0A; it can't put missing bytes back, which is what my files needed. Anyhow, it was a great reference when I started writing my code in C++ to try and repair the files, but mine had to be a lot more complicated: computing every possible combination of CR, LF or CR + LF line endings in a file, adding and removing bytes whenever it encountered an 0D in the file, calculating the CRC of that particular file (all of this done in memory) and comparing it with the CRC/hash of the file in the database... which, as I later discovered, is calculated by taking into account the name of the file, the date and time of upload and, unfortunately, a random value which is not stored anywhere in the database :S.
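
For reference, the brute-force idea described above could look roughly like this in C. It is only a sketch under loud assumptions: a CRC-32 of the original contents is known (which, per this thread, SMF does not provide), the number of missing bytes is known from the file sizes, and only a 0D directly in front of a 0A was ever dropped. The names crc32_buf, restore_crlf and MAX_BARE_LF are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BARE_LF 20 /* arbitrary cap for this sketch: at most 2^20 candidates */

/* Standard CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit. */
uint32_t crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Try every way of putting `missing` CR bytes back in front of bare LF bytes.
 * `out` must have room for len + missing bytes.
 * Returns 1 if a candidate matches want_crc (it is left in `out`),
 * 0 if none matches, -1 if the search is not feasible.
 */
int restore_crlf(const unsigned char *in, size_t len, size_t missing,
                 uint32_t want_crc, unsigned char *out)
{
    size_t n = 0; /* number of LF bytes not preceded by CR */

    for (size_t i = 0; i < len; i++)
        if (in[i] == 0x0A && (i == 0 || in[i - 1] != 0x0D))
            n++;
    if (n > MAX_BARE_LF || missing > n)
        return -1;

    /* bit j of mask set means: put a CR in front of the j-th bare LF */
    for (uint32_t mask = 0; mask < ((uint32_t)1 << n); mask++) {
        size_t bits = 0, o = 0, lf = 0;

        for (uint32_t m = mask; m != 0; m >>= 1)
            bits += m & 1u;
        if (bits != missing)
            continue; /* candidate would not have the original size */

        for (size_t i = 0; i < len; i++) {
            if (in[i] == 0x0A && (i == 0 || in[i - 1] != 0x0D)) {
                if (mask & ((uint32_t)1 << lf))
                    out[o++] = 0x0D;
                lf++;
            }
            out[o++] = in[i];
        }
        if (crc32_buf(out, o) == want_crc)
            return 1;
    }
    return 0;
}
```

The cap matters because the number of candidates doubles with every bare 0A, so real documents with hundreds of line endings are out of reach for this naive loop. Without a stored checksum of the file contents there is also nothing to compare the candidates against, which is the dead end described in this thread.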

That's why I proposed to the developers (hope some of them read this thread) to make a separate column in the attachments table and to store this random value, so that someone could hopefully recreate his files in cases like this one. In that case, a tool could be made to correct all of the files affected by any kind of FTP corruption. I'd be willing to give it another shot, writing this tool, but only if the algorithm for generating the hash is changed and doesn't include a random value, or if that random value is stored in the database. In any other case, reversing the hash is pointless.

You can download a precompiled Windows binary along with the source here.