CR + LF file corruption after FTP transfer

Started by GigaWatt, February 26, 2018, 07:30:45 PM

Aleksi "Lex" Kilpinen

Quote from: GigaWatt on March 16, 2018, 06:26:11 PM
Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
For example, upload a config file with wrong line endings, and you could potentially render a server inoperable.

As far as I know, both Linux and BSD support CR + LF (Windows) line endings. Everything should work without a problem if the server is set up correctly... even if the line endings don't match the native Linux/BSD ones.
But do it the other way around, and Windows boxes will crash.
Quote from: GigaWatt on March 16, 2018, 06:26:11 PM
Quote from: Aleksi "Lex" Kilpinen on March 16, 2018, 06:22:06 AM
It's also said to be faster than binary.

This doesn't make any sense. In binary mode, the FTP client just transfers the files without first analyzing them to see whether they are text files or not... ASCII mode should be slower.
I'm not sure of this either way, but that's what I have heard. The whole protocol was originally designed for transferring text files.
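As a side note on what ASCII mode actually does to a binary file: here is a minimal sketch (a hypothetical helper, not code from SMF or from any FTP client) of the CR + LF → LF rewrite performed when Windows line endings are translated to Unix ones. Applied to binary data, this is exactly how bytes get "eaten":

```c
#include <stddef.h>

/* Drop each 0x0D (CR) that is immediately followed by 0x0A (LF), the way
 * an ASCII-mode transfer rewrites Windows line endings to Unix ones.
 * Writes the result to out (which must hold at least n bytes) and returns
 * the new length. On binary data this silently shortens the file. */
size_t ascii_mode_strip(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x0D && i + 1 < n && in[i + 1] == 0x0A)
            continue; /* the CR of a CR + LF pair is discarded */
        out[j++] = in[i];
    }
    return j;
}
```

Feeding it the 10-byte sequence 00 FF 00 FF 0D 0A 00 FF 00 FF returns only 9 bytes, which is precisely the kind of corruption discussed in this thread; a lone CR or a lone LF passes through untouched.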

Quote from: GigaWatt on March 16, 2018, 06:26:11 PM
Anyway, there is some progress on the topic at hand ;).

I did some thinking about the corrupt attachment issue... I may have a solution ;). I just need some information on how the hash of the attachments is calculated (i.e. their file name in the attachments folder). I noticed that even if I upload the same file twice in a certain topic, the hash values for the two files differ in the database. This was kind of disappointing, since I thought the hash values were only checksum based, but I guess I was wrong.

My idea was to let the tool (I started coding something) run through every possible reconstruction scenario of a file, calculate the checksum each time a reconstruction scenario is complete, and compare it with the checksum generated by SMF. Since there is an attachment integrity tool, I assume some sort of checksum is calculated by the script, or at least taken into account when calculating the hash value of the attachment. I just need the algorithm that calculates the hash value, so I can reverse it, discard anything that is not the checksum of the file, and compare that value to the checksum of the reconstructed file, calculated of course in the same way the script calculates its checksums ;). I know there could be hundreds of possible reconstruction scenarios for each file, especially if the file contains lots of 0D 0A byte sequences, but with modern CPUs, I don't think it'll be such a big problem ;).
This sounds like a cool experiment really, might even work.
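For what it's worth, the brute-force idea quoted above can be sketched in a few lines of C. This is only an illustration under two assumptions that, as the thread later establishes, don't actually hold in SMF: that a plain checksum of the original file is recoverable (CRC-32 stands in for it here), and that the only damage is CR bytes eaten in front of LF bytes. Both function names are made up for the example:

```c
#include <stddef.h>
#include <string.h>

/* Plain bitwise CRC-32 (IEEE polynomial), standing in for whatever
 * checksum of the original file would really be available. */
unsigned long crc32_buf(const unsigned char *p, size_t n)
{
    unsigned long c = 0xFFFFFFFFUL;
    for (size_t i = 0; i < n; i++) {
        c ^= p[i];
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320UL & (0UL - (c & 1UL)));
    }
    return c ^ 0xFFFFFFFFUL;
}

/* Depth-first search: for every LF in the corrupted data, try both
 * "reinsert a CR before it" and "leave it alone", and keep the first
 * candidate whose CRC matches. cur and out must each hold up to 2*n
 * bytes. Returns 1 on success and fills out/outlen. */
int try_fix(const unsigned char *in, size_t n, size_t i,
            unsigned char *cur, size_t len,
            unsigned long want, unsigned char *out, size_t *outlen)
{
    if (i == n) {
        if (crc32_buf(cur, len) != want)
            return 0;
        memcpy(out, cur, len);
        *outlen = len;
        return 1;
    }
    if (in[i] == 0x0A) { /* the original may have had a CR eaten here */
        cur[len] = 0x0D;
        cur[len + 1] = 0x0A;
        if (try_fix(in, n, i + 1, cur, len + 2, want, out, outlen))
            return 1;
    }
    cur[len] = in[i]; /* or the byte may be fine as-is */
    return try_fix(in, n, i + 1, cur, len + 1, want, out, outlen);
}
```

On real attachments the search space explodes with the number of LF bytes, so candidates would need heavy pruning, but for a short buffer this recovers the original exactly whenever the original checksum is known.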
Slava
Ukraini!


"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

GigaWatt

Quote from: Aleksi "Lex" Kilpinen on March 17, 2018, 02:31:59 AM
But do it the other way around, and Windows boxes will crash.

Yeah, true ;).

But, on the other hand, I don't think there are many Windows servers out there. Personally, I wouldn't run my forum on a Windows box :D.

In any case, you might be right... it might be safer to work in ASCII mode, but only do it if you're certain that the files you're transferring are text.

Quote from: Aleksi "Lex" Kilpinen on March 17, 2018, 02:31:59 AM
I'm not sure of this either way, but that's what I have heard. The whole protocol was originally designed for transferring text files.

Had no idea that was the case. "File Transfer Protocol" (at least in my head) suggests... well... just that, a protocol for transferring files, not a protocol specifically designed for transferring text files... although the Wiki page mentions both ASCII and binary (image) modes, so I guess historically both might have been part of the original standard.

Quote from: Aleksi "Lex" Kilpinen on March 17, 2018, 02:31:59 AM
This sounds like a cool experiment really, might even work.

Yeah, I'm hoping it will ;).

I'll start digging into the script, see how the hash is calculated, and I'll write back as soon as I've made some progress ;).
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Kindred

You'd be wrong. Most corporate businesses run on Windows and IIS.
Slava
Ukraini!

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."



GigaWatt

Some. I'm working with a member on my forum on the issue. Apparently, SMF uses SHA1 and MD5 for its hash values, but it also uses a time stamp and a random value to generate the hash. I (we) think this is the part that actually generates the hash (located in Subs.php).

// Just make up a nice hash...
if ($new)
    return sha1(md5($filename . time()) . mt_rand());


Sure, I could get the time stamp value (it's stored in the database), but what about the random value? :S I can't recreate it, and thus I can't get the MD5 and SHA1 signature, which of course means that the one the program calculates is practically useless :S.

Is the mt_rand value stored somewhere in the database? From what we've analyzed, it's not, but just in case, I thought I'd ask.

GigaWatt

Just an update on this issue, which is still present, but less painful, since most active members were nice enough to reupload most of the affected files... a big thanks guys ;). The forum wouldn't be the same without you :).

In any case, I might have mentioned before in this thread that I'm working with a member of the forum on this issue (not a team member, a regular one, but he offered to help since he is a lot more involved in web design, PHP, Java, ASP.NET, etc.), and when I saw the mt_rand call... I knew things weren't looking good. So, just in case I was wrong (and so as not to bias the guy), I provided this info (part of the code in my previous post) to the member who offered a hand... his thoughts were the same as mine... if the value from the mt_rand call is not kept somewhere in the database, there is no way to get to the original CRC of the file :S. I've searched a bunch of times on different occasions for info on this, and I came to the conclusion that this value is not kept in the database... if it were stored anywhere, it would be in the attachments table, but the attachments table doesn't contain any column that might hold a value like this.

So... in short, I gave up on the project.

The good news is that the old admin somehow put together a compilation of old backups and reuploaded them to the forum, so the number of corrupted files dropped to about 20% (at least that's what the attachment integrity tool reported). Members also chipped in, and over 3 months, I think the figure has fallen to about 10%... which is not that bad, actually :).

For future reference, I think it should be taken into consideration to include this random value in a column in the attachments table. This will help a lot in problems like this one. I've given up on writing this tool, but only because there is no point in doing it if I can't get that random value.

Just a thought for the developers to think about. I know they've got bigger issues at the moment, but it might be a good idea to include something like this in future releases of SMF.

I'll just mark this thread as solved ;). Thanks to everyone who joined the discussion ;).

Chen Zhen

The issue happens because the SMF 2.0.x branch saves its attachments without a file extension.
Most attachments are usually either images or compressed files and therefore require binary transfer mode.
FTP clients should be manually set to binary transfer mode so they know what to use for files without extensions.
The devs changed this long ago in the SMF 2.1.x branch, which uses a .dat extension standard that triggers binary mode.

I will give you some advice that is my opinion based on my own experiences.
Do not use the FileZilla FTP client, even if you set it to binary mode... one time I did an update that may have changed that setting back to auto.
It caused me some of the problems you describe here in this thread, compounded by me neglecting to double-check various backed up files.
I now use and prefer FTP Rush.




My SMF Mods & Plug-Ins

WebDev

"Either you repeat the same conventional doctrines everybody is saying, or else you say something true, and it will sound like it's from Neptune." - Noam Chomsky

GigaWatt

I use Core FTP Pro in binary mode. I wasn't the one who made the mistake, the previous admin did... I just tried to correct the mistake he made during the last backup (when he finally gave up on being admin of the forum).

Chen Zhen

There is a way to fix corrupted compressed files' line endings with Linux.
A while back I adapted a C program (with a loop over the attachments folder path) to do it, and I can try to dig it up if you want it.
It can possibly fix any of the corrupted compressed files (zip, gzip, etc.).


GigaWatt

Don't know if you've read the whole thread, but it's not that simple. There are numerous corruptions in almost every file that had 0D 0A anywhere in it. This is pretty common in DOC, PDF and TXT files... not to mention other file types that also use some sort of textual structure to store the project, plan, vector drawing, etc. Not just that, the FTP program actually "ate" bytes from each file, leaving it with a smaller file size than it originally had.

Quote from: GigaWatt on February 27, 2018, 01:34:07 PM
Let's say you have a binary string somewhere and let's say this binary string contains the problematic string "0D 0A".

... 00 FF 00 FF 0D 0A 00 FF 00 FF ...

Now, here is what the FTP program did.

... 00 FF 00 FF 0A 00 FF 00 FF ...

Compare the upper with the lower sequence. It's clear that 0D is missing. The FTP program literally "ate" a byte each time it encountered the string 0D 0A.

This is easily fixable if we knew for certain that each time we encounter 0A in the corrupted file, we have to add 0D in front of it and just move all of the other bytes "to the right"... but, it's not that simple.

Let's say we encounter the following string in the corrupted file.

... 00 FF 00 FF 0A 00 FF 00 FF 0A 00 FF 00 FF ...

We would assume that the original should be like this.

... 00 FF 00 FF 0D 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

But what if this string in the original file was actually like this.

... 00 FF 00 FF 0A 00 FF 00 FF 0D 0A 00 FF 00 FF ...

We would add an extra 0D in the file and, once more, the file is corrupt :S.

Based on the assumption that these are reserved characters and that most of the time they are used in combination, I guess my question is, what are my chances of being wrong and adding an extra 0D byte where there wasn't one in the original ???. I would code something like this, but only if I knew that I could save a large portion of the attachments (let's say 20% and above). If more than 90% would still be corrupt, I wouldn't bother.

That's why I needed the CRCs. I have no way of knowing whether I've made the appropriate changes to a file if I don't know its original CRC. Sure, as I wrote before, I could reverse it from the hash... if it weren't for mt_rand, which is not stored in the database. That's why I suggested it be included in a separate column in the attachments table in future releases of SMF, so these sorts of things can be solved easily and accurately (when you can reverse the CRC, you can be 100% sure that what you've corrected in the file is exactly what should have been corrected).

If you can find the script, post it ;). It wouldn't do me much good now, since I have no idea which attachments are affected (the attachment integrity tool doesn't report that... also a good feature to have, list all attachments that have different file sizes than the ones stored in the database) and I wouldn't run the script on all of the files since it might corrupt files that are OK, but it could be a good reference to anyone having similar problems ;).
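To put a number on the ambiguity GigaWatt describes: every standalone LF in the corrupted file may or may not have lost a preceding CR, so k such bytes give 2^k candidate reconstructions. A throwaway helper (not from any of the tools mentioned in this thread) to count them:

```c
#include <limits.h>
#include <stddef.h>

/* Count the LF (0x0A) bytes in a buffer and return 2^k, the number of
 * possible CR-reinsertion patterns. Saturates at 64+ LFs, which is
 * itself the point: a real document is far beyond brute force. */
unsigned long long candidate_count(const unsigned char *p, size_t n)
{
    unsigned k = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] == 0x0A)
            k++;
    if (k >= 64)
        return ULLONG_MAX; /* too many to represent, let alone enumerate */
    return 1ULL << k;
}
```

A buffer with just 3 LF bytes already has 8 candidates; with only 64 of them, the candidates can no longer even be counted in a 64-bit integer, which is why a known checksum (or pruning) is essential for this approach.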

Chen Zhen

The attachments table holds the file extension data along with the original file name & the current file name that has no extension.
One could write a PHP script that copies all files whose db entries are listed as zip & gz to another folder (be sure to name them with the appropriate extension).
After that, you run the fixgz program on the new directory.

I dug up something recursive regarding fixgz, but I'm not sure if it's what I used in the end.
This was a while ago & I can't seem to find where I put what I wrote.
fixgz is written in C & the commands are for Linux.
There were some minor dependencies (stdio.h), but without testing it I can't remember atm.

fixgz.c

/* fixgz attempts to fix a binary file transferred in ascii mode by
 * removing each extra CR when it is followed by LF.
 * usage: fixgz bad.gz fixed.gz

 * Copyright 1998 Jean-loup Gailly <[email protected]>
 *   This software is provided 'as-is', without any express or implied
 * warranty.  In no event will the author be held liable for any damages
 * arising from the use of this software.

 * Permission is granted to anyone to use this software for any purpose,
 * including commercial applications, and to alter it and redistribute it
 * freely.
 */

#include <stdio.h>
#include <stdlib.h> /* for exit() */

int main(int argc, char **argv)
{
    int c1, c2; /* input bytes */
    FILE *in;   /* corrupted input file */
    FILE *out;  /* fixed output file */

    if (argc <= 2) {
        fprintf(stderr, "usage: fixgz bad.gz fixed.gz\n");
        exit(1);
    }
    in = fopen(argv[1], "rb");
    if (in == NULL) {
        fprintf(stderr, "fixgz: cannot open %s\n", argv[1]);
        exit(1);
    }
    out = fopen(argv[2], "wb");
    if (out == NULL) {
        fprintf(stderr, "fixgz: cannot create %s\n", argv[2]);
        exit(1);
    }

    c1 = fgetc(in);

    /* drop every CR that is immediately followed by LF */
    while ((c2 = fgetc(in)) != EOF) {
        if (c1 != '\r' || c2 != '\n') {
            fputc(c1, out);
        }
        c1 = c2;
    }
    if (c1 != EOF) {
        fputc(c1, out);
    }
    return 0;
}


possible recursive command ~ use Linux (CentOS/Ubuntu) from the related parent directory.
Assuming the compressed files were copied to: ../attachments2

        cc -o fixgz fixgz.c
        mkdir -p attachments2fixed
        for file in attachments2/*; do
            fname=${file##*/}
            ./fixgz "$file" attachments2fixed/"$fname"
        done



GigaWatt

Quote from: Chen Zhen on June 04, 2018, 09:36:06 AM
One could write a PHP script that copies all files that have db entries listed as zip & gz to another folder (ensure to name them with the appropriate extension). After that you run the fixgz program on the new directory.

That's only for archive files... and only zip and gzip... which are not the most common desktop archive formats these days (almost 100% of the users visiting my forum use Windows, so they usually upload a rar or a 7z archive). What about every other file type I mentioned: documents (doc, docx, pdf), proprietary vector file types, even some affected image formats? In proprietary file formats, it's not uncommon to use CR, LF or both (CR + LF) in different combinations to separate certain things in the header or at the end of the file. fixgz can't help in those cases.

Back when I was tinkering with this problem, I checked what archive types are usually uploaded by users. 80% were rar archives, the rest were 7z archives. There were only 20 zip archives and 2 gzip archives :D.

This would work on forums with a large Linux userbase, like programming or Linux support forums. Mine is about electronics... the primary (and in almost all cases, only) OS of the users visiting my forum is Windows.

It's great that you posted the code to fixgz ;). I also ran into this program when I was searching for a solution to my problem. The only problem is that, in its original form, this program only removes extra CR bytes that were added in front of LF bytes... which is not what happened to the files in my case, where bytes were removed. Anyhow, it was a great reference when I started writing my own code in C++ to try and repair the files, but mine had to be a lot more complicated: computing every possible combination of CR, LF or CR + LF line endings in a file, adding and removing bytes whenever it encountered an 0D in the file, calculating the CRC of each candidate (all of this done in memory) and comparing it with the CRC/hash of the file in the database... which, as I later discovered, is calculated from the name of the file, the date and time of upload and, unfortunately, a random value that is not stored anywhere in the database :S.

That's why I proposed to the developers (hope some of them read this thread) to add a separate column to the attachments table and store this random value, so that someone could hopefully recreate their files in cases like this one. With that, a tool could be made to correct all of the files affected by any kind of FTP corruption. I'd be willing to give writing this tool another shot, but only if the algorithm for generating the hash is changed so it doesn't include a random value, or if that random value is stored in the database. Otherwise, reversing the hash is pointless.

You can download a precompiled Windows binary along with the source here.
