CR + LF file corruption after FTP transfer

Started by GigaWatt, February 26, 2018, 07:30:45 PM

GigaWatt

I don't know exactly where to post this... it's not exactly related to SMF itself, but I was Googling around the past few weeks for a solution to this problem to no avail.

The problem started after the main admin decided (for some unknown reason ::)) to just... abandon the forum. And not just abandon it: he removed it from the hosting service (he just removed the top level domain and... you get the idea, nothing loads ::)). I was a mod, but I decided to keep the forum going, along with some of the other admins of the forum. Since none of us had access to the hosting, I asked one of them to (politely ::)) ask the main admin if we could have a copy of the script and the database. I personally knew the main admin... I was surprised when I received a copy of the database and the script.

Anyhow, I had to update the script and the database, it was running an older version of SMF (I think 1.1.16), blah blah blah... to cut a long story short, everything turned out well... except for the attachments. Something must have happened during the FTP transfer (the main admin wrote me that he did the backup of the script via FTP). After some file analysis and hex editing, binary compares and whatnot, I learned that this is called a "CR + LF file corruption". Lucky me ::).

I have no idea if he (the main admin) did this on purpose or not... in the end, it doesn't matter. What matters to me is... is there a way to fix the files?

Now, after some online searching, I found out that the most common type of corruption is just a simple "0D 0A" binary modification. For those who have no idea what these bytes represent, 0D represents carriage return (CR) and 0A represents line feed (LF). Unix/Linux uses LF (0A) to end a line (and start a new one), classic Mac OS used CR (0D) while modern macOS uses LF like Linux, and Windows uses both of them, CR + LF (0D 0A). Most FTP programs, when set to transfer files in ASCII mode, just make a simple binary change (change 0D to 0A or vice versa, in some cases add 00 before or after the line end), but they don't "eat" bytes. In this particular case, whole bytes were missing... which, in turn, means that the file downloaded via FTP had a smaller file size than the original file.

Here's a screen shot of a binary compare of an original (left) and a corrupted (right) file.



Basically, you have to add bytes to the file based on a certain criterion (a binary string) that appears in the file.

Now, I searched and searched... I didn't find any program that can do this. I can do it manually for some of the attachments, but I can't do it for all of them (around 7000 attachments).

Basically, what I'm looking for is a program that can do this in bulk. Give it a bunch of files, set the binary criteria, patch the files.

Or... if someone has a better idea, I'm open to suggestions. In case you're thinking of suggesting another backup of the forum's script and files... already tried to force that with the main admin... it's a no go. He's a coder... and an admin, so... you know... he's god ::)... he knows best ::)... he thinks I'm not doing something right and either won't make another backup or... he's deleted the forum from the server (I think this is the more probable scenario). In either case, I can't get another copy of the files.
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Sir Osis of Liver

If you used FileZilla to download/upload attachments, and transfer type was not set to binary, the files were transferred as ascii and are corrupt.  Several schemes were posted to fix the files, but none of the ones I tried ever worked.  I don't know of any software that will do what you're suggesting, possibly someone else does, otherwise you would have to write it yourself.

Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

Aleksi "Lex" Kilpinen

Are the originals corrupt for sure, will they not work at all if you give them a proper filetype on your local harddrive? The attachments folder(s) should be transferred as binary both ways to avoid this problem.
If they are corrupt already, I haven't heard of a good way to save them.
Slava
Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

GigaWatt

Quote from: Sir Osis of Liver on February 26, 2018, 09:37:55 PM
If you used FileZilla to download/upload attachments, and transfer type was not set to binary, the files were transferred as ascii and are corrupt.

I have no idea which software was used. I wasn't the one who made the backup.

But there was a MACOSX folder inside the archive (the backup of the script)... I guess it could have been done on a Mac.

Quote from: Aleksi "Lex" Kilpinen on February 26, 2018, 11:42:36 PM
Are the originals corrupt for sure, will they not work at all if you give them a proper filetype on your local harddrive?

No. Actually, that's how I got to the corrupted file in the example. I checked the hash in the database and searched for that string in the attachments, just added the proper extension (pdf in the example) and compared it to the "original" (I had a local copy of the file).

And yes, all of them are corrupt. I have local copies of some of the files... all of the attachments that I compared to the "original" files were corrupt. I have no idea if all of them are, but at least those that contain "0D 0A" in sequence as a binary string are corrupt for sure.

Quote from: Aleksi "Lex" Kilpinen on February 26, 2018, 11:42:36 PM
The attachments folder(s) should be transferred as binary both ways to avoid this problem.

That's what I thought ::). As it turns out, the default setting for most FTP programs is to transfer files in "Auto" mode in both directions. In the process, they analyze the file extension and decide whether they should transfer the file and change the line endings (CR, LF or CR + LF) or transfer the file untouched. And get this... some of them assume that if a file has no file extension (like the attachments), it's a text file ??? :o. I assume this is how they got corrupted.

Quote from: Aleksi "Lex" Kilpinen on February 26, 2018, 11:42:36 PM
If they are corrupt already, I haven't heard of a good way to save them.

I thought that might be the case...

I'm not a coder by occupation, I work with low level stuff mostly (microcontrollers, electronics and whatnot), but I guess I can code something like this... it shouldn't be too hard. Except for one thing :S.

Let's say you have a binary string somewhere and let's say this binary string contains the problematic string "0D 0A".

... 00 FF 00 FF 0D 0A 00 FF 00 FF ...

Now, here is what the FTP program did.

... 00 FF 00 FF 0A 00 FF 00 FF ...

Compare the upper with the lower sequence. It's clear that 0D is missing. The FTP program literally "ate" a byte each time it encountered the string 0D 0A.
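The damage described above can be reproduced in a few lines. This is a minimal Python sketch of the observed behaviour only (every 0D 0A pair loses its 0D); `ascii_mode_strip` is a made-up name, and which FTP client or setting actually caused it is not known.

```python
def ascii_mode_strip(data: bytes) -> bytes:
    """Simulate the observed damage: every CR LF pair loses its CR."""
    return data.replace(b"\r\n", b"\n")


original = bytes.fromhex("00 FF 00 FF 0D 0A 00 FF 00 FF")
corrupted = ascii_mode_strip(original)

print(corrupted.hex(" ").upper())        # 00 FF 00 FF 0A 00 FF 00 FF
print(len(original) - len(corrupted))    # 1 byte was "eaten"
```

Running real, known-good files through the same function and comparing the result with the damaged attachment is a quick way to confirm that this is the only change that was made.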

This is easily fixable if we knew for certain that each time we encounter 0A in the corrupted file, we have to add 0D in front of it and just move all of the other bytes "to the right"... but, it's not that simple.

Let's say we encounter the following string in the corrupted file.

... 00 FF 00 FF 0A 00 FF 00 FF 0A 00 FF 00 FF ...

We would assume that the original should be like this.

... 00 FF 00 FF 0D 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

But what if this string in the original file was actually like this.

... 00 FF 00 FF 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

We would add an extra 0D in the file and, once more, the file is corrupt :S.
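The ambiguity can be shown directly. A minimal Python sketch, assuming the only damage is the removal of 0D in front of 0A; `strip_cr` and `naive_repair` are illustrative names, not an existing tool.

```python
def strip_cr(data: bytes) -> bytes:
    """The damage: every CR LF pair becomes a lone LF."""
    return data.replace(b"\r\n", b"\n")


def naive_repair(data: bytes) -> bytes:
    """Put 0D in front of every 0A, assuming every LF used to be CR LF."""
    return data.replace(b"\n", b"\r\n")


# Two different originals: one with two CR LF pairs, one with a lone LF.
original_a = bytes.fromhex("00 FF 0D 0A 00 FF 0D 0A 00")
original_b = bytes.fromhex("00 FF 0A 00 FF 0D 0A 00")

# Both collapse to the same corrupted bytes, so the damage is not reversible
# from the corrupted file alone.
print(strip_cr(original_a) == strip_cr(original_b))        # True

# The naive repair restores the first original but not the second.
print(naive_repair(strip_cr(original_a)) == original_a)    # True
print(naive_repair(strip_cr(original_b)) == original_b)    # False
```

The naive repair is exact only for files in which every 0A was originally preceded by 0D.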

Based on the assumption that these are reserved characters and that most of the time they are used in combination, I guess my question is, what are my chances of being wrong and adding that extra 0D byte if there wasn't one present in the original ???. I would code something like this, but only if I knew that I could save a large portion of the attachments (let's say 20% and above). If more than 90% would still be corrupt, I wouldn't bother.
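One way to put a number on those chances is to measure them on the local, uncorrupted copies. A minimal Python sketch (`lf_statistics` is an illustrative name): it counts how many 0A bytes in a good file are preceded by 0D and how many stand alone. A file with zero lone LFs would be restored exactly by putting 0D in front of every 0A.

```python
def lf_statistics(data: bytes) -> tuple[int, int]:
    """Return (LFs preceded by CR, lone LFs) for an uncorrupted file."""
    total_lf = data.count(b"\n")
    paired = data.count(b"\r\n")
    return paired, total_lf - paired


sample = bytes.fromhex("00 FF 0A 00 FF 0D 0A 00")
print(lf_statistics(sample))    # (1, 1)
```

As a rough guide, if the bytes of a file were uniformly random (compressed data comes close), only about 1 in 256 of its 0A bytes would be preceded by 0D, so blindly adding 0D everywhere would be wrong far more often than right for archives, while text-like data that uses Windows line endings sits at the other extreme.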

Aleksi "Lex" Kilpinen

I'm not sure of this, but on a hunch I'd say you would be playing a lottery - it might work for some files by chance, might not for others. Because we can't know what the files actually contained originally, reconstructing them is not really easy - even if your assumption sounds logical.

GigaWatt

Thought so :S.

Well, I'll try and write something, see if it does something useful with the attachments and if it does... I'll post results.

I'm keeping my fingers crossed, but I'm also not expecting too much based on my assumptions.

Aleksi "Lex" Kilpinen

Well, good luck :) And I am curious, so do tell us if you get results.

GigaWatt

*Update*

I noticed that most of the files that are images are actually intact :). I guess I was right, these are reserved characters and the chances of one being next to the other in a media file are very slim. Luckily, most of the attachments are images :), and if some of them are corrupt, I guess it doesn't affect the decoding process (preview of the image) that much. I binary checked about 10 image attachments, most of them were JPEG, 2 or 3 were PNG and 1 was BMP. Only the bitmap file had the problematic string 0D 0A in 3 or 4 places in the file (I guess it has something to do with the coding of bitmaps, how they are saved ???, maybe it uses line endings to separate settings or file info ???), but the preview of the image was identical to the original file (the uncorrupted one). I was using FastStone Image Viewer... I have no idea if other programs will display the image properly, but FastStone Image Viewer doesn't seem to mind. If I save the corrupted image (recode it) as a BMP file in FastStone Image Viewer, it adds the 0D 0A string in (almost) the same places that the original (not corrupted) image had them.

Maybe different programs under different OSes use different line endings to separate file info in image files, but I suppose most image viewing/editing programs are equipped to handle this ???. For example, if you save a BMP image under Linux in a native Linux application, it'll use LF (0A) to separate line endings in the file info. I haven't tried this, it's just a guess, but I'll try it ;).

On the other hand, I'm almost certain that all binary (EXE) and archive (ZIP, RAR, 7Z, GZIP, XZ...) files are corrupt :S. I tried opening about 5 or 6 archive files (most were RAR, 1 was ZIP and 1 was 7Z), all of them reported that they were corrupt. Two of the RAR files had a 5% recovery record. One of them reported that it was corrupt, but opened the archive, and after I repaired it, it extracted the content of the archive. The files in the archive didn't seem to be damaged, but most of them were images (they previewed properly) and I didn't have the original ones to make a binary check, so I just declared them as OK (undamaged). The other RAR archive also reported corruption, but the 5% recovery record didn't repair the archive :S. I guess the recovery record in RAR archives works with a 50/50 chance of recovering the archive. I know, it probably depends on what's been damaged and how much of the archive has been damaged, but since no one can predict in which part of the RAR archive the 0D 0A string is going to appear (I guess these are not reserved characters in archives), your chances of recovering a RAR archive with an embedded recovery record are 50/50. Well...I guess it's better than nothing.

The binary files are a different story. I haven't checked, but I suppose almost all of them are Windows binaries (EXE). During the testing, I was working with one corrupted binary. It was an MS Visual C++ 6.0 application (a small application, few hundred KB). It had numerous corruptions in the file, but the interesting thing was that all of them were in ASCII strings. I haven't checked this, but I assume that the compiler also uses 0D 0A to end/start lines in text strings. The good thing was that I managed to repair the file :) by "replacing" 0A with 0D 0A (add one byte after each 0A, change 0A to 0D, change the added byte to 0A). I guess in this particular case (IDE/language/compiler combination), this binary string is reserved, so there is no chance of 0D or 0A being alone in the binary, they will always go in pairs. I have no idea if it will work on any other combination (compressed binary, different IDE/language/compiler), but I plan on digging a little deeper ;).
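The same "replace 0A with 0D 0A" repair can be applied to a whole folder at once. A minimal Python sketch, assuming every 0A in the damaged files was originally 0D 0A (which held for the one EXE tested, but is not guaranteed for other files); `repair_folder` and the skip list are illustrative. It writes repaired copies to a separate folder and never touches the damaged originals.

```python
from pathlib import Path

# Helper files that live in the attachments folder and are not attachments.
SKIP = {".htaccess", "index.php"}


def repair_folder(src: Path, dst: Path) -> int:
    """Write repaired copies of every file in src to dst.

    Returns how many files were actually changed by the repair.
    """
    dst.mkdir(parents=True, exist_ok=True)
    changed = 0
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.name in SKIP:
            continue
        data = path.read_bytes()
        fixed = data.replace(b"\n", b"\r\n")   # 0A -> 0D 0A everywhere
        (dst / path.name).write_bytes(fixed)
        if fixed != data:
            changed += 1
    return changed
```

Something like `repair_folder(Path("attachments"), Path("attachments_repaired"))` would process the whole folder (both paths are placeholders). The repaired copies still need to be verified one by one, for example by opening them or by comparing their size with the expected size.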

Aleksi "Lex" Kilpinen

That's interesting, and nice to hear you actually have found a chance of saving some of them. :)

GigaWatt

Yeah, I'm also relieved :), since most of the attachments were image files (JPG, PNG, GIF...).

In the SMF 2.0.x database, I noticed that the script also makes a checksum of all the attachments. I'll try and upload the attachments to the updated version of the forum (2.0.15), run the Attachment Integrity tool and see what it reports. The files that show a checksum mismatch are corrupted, of course.
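Since the damaged files are smaller than the originals, a size comparison is a cheap first check that does not depend on any checksum. A minimal Python sketch, assuming the original size of each attachment is recorded somewhere (for example in the database) and has been exported into a name-to-size mapping; that export, and the name `find_shrunken`, are assumptions for illustration.

```python
from pathlib import Path


def find_shrunken(folder: Path, expected_sizes: dict[str, int]) -> dict[str, int]:
    """Map file name -> number of missing bytes, for files smaller than expected."""
    missing = {}
    for name, expected in expected_sizes.items():
        path = folder / name
        if not path.is_file():
            continue                      # missing files are a separate problem
        shortfall = expected - path.stat().st_size
        if shortfall > 0:
            missing[name] = shortfall
    return missing
```

If the only damage is the removal of 0D in front of 0A, the shortfall for a file equals the number of 0D bytes that have to be put back, which is useful information for any repair attempt.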

Can SMF 2.0.15 list the corrupted files? An even better option would be to list them in different data types (images, binary, archives, etc.).

Aleksi "Lex" Kilpinen

I'm not truly familiar with the attachment handling on that level, but I don't think SMF offers a report option like that. But - if the checksums are there, and if we already have a tool to check them - I don't think building on that foundation would be too difficult, and it should be possible to do as a mod. (If something doesn't already exist).

GigaWatt

It would be a nice mod ;). Even if the admin doesn't want/know how to fix the files, he could just delete the corrupted ones and their links to posts in the database... sort of a clean up :).

How about links to posts with attachments? Can SMF 2.0.x delete files that have links in posts, but the attachment doesn't exist in the attachments folder and vice versa (there are files in the attachments folder with something that resembles a hash value, but there are no links to them in the database, excluding .htaccess and index.php of course)?

Aleksi "Lex" Kilpinen

Yeah, there is an adminside tool to browse attachments, and it also shows where they are linked to.
The maintenance can also automatically prune links that are broken. The tools are just not so refined as to do everything you wanted.

GigaWatt

Hmmm... OK, I guess I'll work with what I've got ;).

I'll try and post on the attachment issue as soon as I have more results ;).

richardwbb

I like your effort of repair. I've put FileZilla in binary mode, and since then I upload and download ASCII files without problems. I really think the auto transfer option should be set at binary at all times. And what helped me was opening the directory with attachments on a linux environment, that way it can show thumbnails for every image that wasn't damaged. This way I quickly found that a low percentage had become damaged, which I had backups for. For Windows you might need a viewer application, or you could back up all the files in attachments and rename the copies to .jpg, which should produce positive results. But again, I've always put FTP programs in binary and wonder why that option is there; is it old, maybe? ASCII files do transfer properly as binary. The linux 'file' command might be helpful too. There probably is a Windows equivalent I am not aware of.
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

GigaWatt

Quote from: richardwbb on March 09, 2018, 02:43:37 PM
I really think the auto transfer option should be set at binary at all times.

Exactly ;). I had a conversation with the previous "uber admin". He told me that each time the forum was transferred via FTP (we changed hosts a few times), the program was set in Auto mode ::). This is a perfect example that auto mode should be avoided at all costs and you should always set your FTP program to run the transfer in binary mode.
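For anyone who scripts the transfer instead of using a GUI client, binary mode can be forced explicitly. A minimal sketch using `ftplib` from the Python standard library; `download_binary` is an illustrative helper name, and the host and credentials in the comment are placeholders.

```python
from ftplib import FTP


def download_binary(ftp: FTP, remote_name: str, local_path: str) -> None:
    """Fetch one file in binary (image) mode, so no byte is rewritten."""
    # The local file is opened in "wb" so Python does not translate
    # line endings either.
    with open(local_path, "wb") as handle:
        ftp.retrbinary(f"RETR {remote_name}", handle.write)


# Typical use (placeholder host and credentials):
#   ftp = FTP("ftp.example.com")
#   ftp.login("user", "password")
#   ftp.cwd("attachments")
#   for name in ftp.nlst():
#       download_binary(ftp, name, name)
#   ftp.quit()
```

`retrbinary` switches the connection to binary mode before the transfer, so it does not matter whether the file has an extension or what it contains.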

Quote from: richardwbb on March 09, 2018, 02:43:37 PM
And what helped me was opening the directory with attachments on a linux environment, that way it can show thumbnails for every image that wasn't damaged. This way I quickly found that a low percentage had become damaged, which I had backups for.

Yeah, I was also thinking the same. I thought, since Linux reads MIME types, not extensions, it should show a preview of the images, at least the ones that aren't corrupt. This was one of my next tests ;). I'll post when I have some new info ;).

Quote from: richardwbb on March 09, 2018, 02:43:37 PM
The linux 'file' command might be helpful too. There probably is a Windows equivalent I am not aware of.

I don't think so... and even if there is one, I don't think it'll be as detailed as the Linux one (distinguish media, documents, binaries, etc.), I think it'll only distinguish between text and... everything else... not to mention that most CAD and image manipulation programs basically use text files (xml or something else... doesn't really matter) as their project file, so it might report, let's say, a CorelDRAW file as a text file ::). Linux will do the same in this case, but at least I can easily determine which files are media files (most of the attachments), text files (most probably project files of some sort), documents (Linux should also report them as something other than a regular text file), archives and everything else (mostly binaries in the case of my forum).
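A rough, portable stand-in for the `file` command is to look at the first few bytes of each attachment, which works the same on Windows and Linux. A minimal Python sketch with a handful of well-known signatures; the list is deliberately short, and `sniff` / `sniff_file` are illustrative names. The PNG signature itself contains a 0D 0A pair, so a PNG that went through the stripping described in this thread would start with the shortened signature listed second.

```python
# (magic bytes, label) pairs, checked in order.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\x89PNG\n\x1a\n", "PNG image (CR stripped from signature)"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (or ZIP-based document)"),
    (b"Rar!\x1a\x07", "RAR archive"),
    (b"7z\xbc\xaf\x27\x1c", "7z archive"),
    (b"\x1f\x8b", "GZIP archive"),
    (b"BM", "BMP image"),
    (b"MZ", "Windows executable"),
]


def sniff(header: bytes) -> str:
    """Guess a file type from its first bytes; 'unknown' if nothing matches."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))
```

The two-byte signatures (BM, MZ) can match plain text that happens to start with those letters, so the result is a hint for sorting the attachments into groups rather than a verdict.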

richardwbb

I still wonder what ancient use auto or ASCII mode has. I had been using FileZilla for a long time before I noticed some avatars had been damaged because of the default setting, which I suspect is ancient. I forgot about GnuWin32: the file command has been ported there, but there is probably a .NET or PowerShell solution I am not aware of.

GigaWatt

PowerShell is a different story... yeah, I suspect that it might have a command that will do what any Linux distro does out of the box ;).

Aleksi "Lex" Kilpinen

At least in theory, the ascii mode is there to protect you from mistakes. For example, upload a config file with wrong line endings, and you could potentially render a server inoperable. It's also said to be faster than binary.

GigaWatt

Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
For example, upload a config file with wrong line endings, and you could potentially render a server inoperable.

As far as I know, both Linux and BSD support CR + LF (Windows) line endings. Everything should work without a problem if the server is set up correctly... even if the line endings don't match the native Linux/BSD ones.

Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
It's also said to be faster than binary.

This doesn't make any sense. In binary mode, the FTP client just transfers the files without analyzing them first to see if they are a text file or not... ASCII mode should be slower.

Anyway, there is some progress on the topic at hand ;).

I did some thinking about the corrupt attachment issue... I may have a solution ;). I just need some information on how the hash of the attachments is calculated (i.e. their file name in the attachments folder). I noticed that even if I upload the same file twice in a certain topic, the hash values for both of the files differ in the database. This was kind of disappointing, since I thought that the hash values are only check sum based, but I guess I was wrong.

My idea was to let the tool (I started coding something) do every possible reconstruction scenario of a file, calculate the check sum and each time a reconstruction scenario is complete, compare the check sum with the generated check sum in SMF. Since there is an attachment integrity tool, I assume there must be some sort of a check sum calculated by the script or at least taken into account when calculating the hash value of the attachment. I just need the algorithm that calculates the hash value, reverse it, discard anything that is not the check sum of the file and compare this value to the check sum of the reconstructed file, which is of course calculated in the same way that the script calculates the check sums ;). I know there could be hundreds of possible reconstruction scenarios for each file, especially if the file contained lots of 0D 0A binary strings, but with modern CPUs, I don't think it'll be such a big problem ;).
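The "try every reconstruction scenario" idea can be sketched as follows, assuming the number of missing bytes is known (original size minus damaged size) and that some test exists to recognise the right candidate. The thread does not establish what checksum, if any, SMF stores, so the test is left as a callable; the SHA-1 in the demo is a stand-in, and all names are illustrative.

```python
import hashlib
from itertools import combinations
from math import comb
from typing import Callable, Iterator, Optional


def candidates(corrupted: bytes, missing: int) -> Iterator[bytes]:
    """Yield every way to put `missing` CR bytes in front of existing LF bytes."""
    lf_positions = [i for i, byte in enumerate(corrupted) if byte == 0x0A]
    for chosen in combinations(lf_positions, missing):
        rebuilt = bytearray()
        previous = 0
        for position in chosen:
            rebuilt += corrupted[previous:position] + b"\r"
            previous = position
        rebuilt += corrupted[previous:]
        yield bytes(rebuilt)


def reconstruct(corrupted: bytes, missing: int,
                is_valid: Callable[[bytes], bool]) -> Optional[bytes]:
    """Return the first candidate accepted by is_valid, or None."""
    for candidate in candidates(corrupted, missing):
        if is_valid(candidate):
            return candidate
    return None


# Demo with made-up data; known_sha1 stands in for whatever check is available.
original = bytes.fromhex("00 FF 0A 00 FF 0D 0A 00")
known_sha1 = hashlib.sha1(original).hexdigest()
corrupted = original.replace(b"\r\n", b"\n")
missing = len(original) - len(corrupted)

print(comb(corrupted.count(b"\n"), missing))    # 2 scenarios to try
fixed = reconstruct(corrupted, missing,
                    lambda c: hashlib.sha1(c).hexdigest() == known_sha1)
print(fixed == original)                         # True
```

The number of scenarios is "n choose k" (n LF bytes in the damaged file, k missing bytes), which grows very quickly: 40 LFs with 20 missing bytes already gives more than 100 billion candidates. Checking `comb(...)` before starting a search shows whether a given file is within reach.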
