Important Notice: Do you use Filezilla? Avatars and Attachments lost?

Arantor · May 03, 2010, 06:18:13 AM

It doesn't actually hurt to transfer everything as binary, really.

The whole concept of transferring text as text is outdated and really not an issue with modern editors; it dates from the olden days when editors couldn't handle both Linux and Windows line endings (like Windows Notepad still can't, heh). If you're using a modern editor like Notepad++, it's a non-issue.

~DS~ · May 03, 2010, 06:23:47 AM

Quote from: Arantor on May 03, 2010, 06:18:13 AM
It doesn't actually hurt to transfer everything as binary, really.

The whole concept of transferring text as text is outdated and really not an issue with modern editors; it dates from the olden days when editors couldn't handle both Linux and Windows line endings (like Windows Notepad still can't, heh). If you're using a modern editor like Notepad++, it's a non-issue.

I will try everything as binary the next time I back up. Better safe than sorry, so I'll try both methods. For now I am leaving it on AUTO with "Treat files without extension as ASCII file" checked, and unchecking it for attachments and avatars only.

When you say binary, do you mean not auto? And what about "Treat files without extension as ASCII file"?

Arantor · May 03, 2010, 06:30:01 AM

Text mode == ASCII mode, they're one and the same.

When I'm using either FileZilla or WinSCP, I have it set to treat *everything* as binary.

AncientDragonfly · May 03, 2010, 06:22:16 PM

Quote from: Dismal Shadow on May 02, 2010, 06:26:11 PM
OK, I haven't backed up in a week, so I am going to back up today, but I want to make sure the image shown at
http://www.simplemachines.org/community/index.php?topic=377117.msg2594267#msg2594267
is correct, so that it doesn't mess up the avatars and attachments.

No, that image in your other post is incorrect.

This image from the first post in this thread is correct:

As Arantor says, however, you should be ok to download ASCII (plain text) files using binary mode. Auto mode is just a list of extensions that specifies which files should be treated as ASCII but since the files we are talking about don't have extensions, it can't be used in this situation.

Basically, if you open a file in a text editor and it looks like non-alphabetic, non-numeric, non-standard-punctuation characters, it is a binary file (see attachment).

ASCII can be transferred as binary without corruption.
Binary files will be corrupted if they are not transferred as binary.

Arantor · May 03, 2010, 06:27:38 PM

Quote
ASCII can be transferred as binary without corruption.

Not entirely true; in fact, it depends on what you define as 'corruption'.

If by 'without corruption' you mean the actual, byte-for-byte file, it will be corrupted.

The file will be silently converted for line endings. For example, go grab the SMF install package and open any SMF file in Windows Notepad. Not Notepad++, not anything fancy, just normal Notepad. You get a mess where everything is on one line.

Now download the same file from FTP in ASCII mode and open it. Now it's not a mess. Why? Because FileZilla has silently converted the line endings on your behalf.

If you have ANY doubt, use binary mode. I expect the files I send and receive to be the actual files, not a modified-in-ANY-way version.

~DS~ · May 03, 2010, 06:39:07 PM

OK, I unchecked the option from the image and set it to binary mode instead of auto. I opened the files with TextWrangler and they seem fine. No lines messed up.

Arantor · May 03, 2010, 06:40:09 PM

That could be because TextWrangler can handle Windows, Linux and Mac line endings, which would hide the fact if the file format had been changed without you being aware of it.

AncientDragonfly · May 03, 2010, 06:43:07 PM

Quote from: Arantor on May 03, 2010, 06:27:38 PM
Quote
ASCII can be transferred as binary without corruption.

Not entirely true; in fact, it depends on what you define as 'corruption'.

If by 'without corruption' you mean the actual, byte-for-byte file, it will be corrupted.

The file will be silently converted for line endings. For example, go grab the SMF install package and open any SMF file in Windows Notepad. Not Notepad++, not anything fancy, just normal Notepad. You get a mess where everything is on one line.

Well, yes, that is true in the sense that it wouldn't pass a CRC (or some other kind of integrity) check, but it is recoverable with one of the many line-ending converter programs, unlike a binary file such as an image, which is gone for good once it has lost every 8th bit.
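To make the CRC point concrete, here is a minimal Python sketch. It assumes a text-mode download simply rewrites each LF as CR LF, which is a simplification, and the sample file contents are made up for illustration.

```python
import zlib

# A small text file as it might sit on a Linux server (LF line endings).
original = b"<?php\necho 'hello';\n?>\n"

# Assumed effect of a text-mode download to Windows: each LF becomes CR LF.
converted = original.replace(b"\n", b"\r\n")

# The checksums are computed over different bytes, so an integrity check
# against the server copy would almost certainly flag the download.
print(hex(zlib.crc32(original)), len(original))
print(hex(zlib.crc32(converted)), len(converted))

# A line-ending converter undoes the change and restores the exact bytes.
recovered = converted.replace(b"\r\n", b"\n")
print(recovered == original)  # True
```

For plain text the damage is reversible because the converter knows every CR LF pair was originally a line ending. A binary file gives it no such guarantee.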

Quote
Now download the same file from FTP in ASCII mode and open it. Now it's not a mess. Why? Because FileZilla has silently converted the line endings on your behalf.

If you have ANY doubt, use binary mode. I expect the files I send and receive to be the actual files, not a modified-in-ANY-way version.

But you said:

Quote from: Arantor on May 03, 2010, 06:30:01 AM
Text mode == ASCII mode, they're one and the same.

When I'm using either FileZilla or WinSCP, I have it set to treat *everything* as binary.

(I'm on FreeBSD - binary doesn't mess up my line endings.)

Quote from: Dismal Shadow on May 03, 2010, 06:39:07 PM
OK, I unchecked the option from the image and set it to binary mode instead of auto. I opened the files with TextWrangler and they seem fine. No lines messed up.

Good. Glad to hear it.

Arantor · May 03, 2010, 06:49:14 PM

* Arantor thinks people don't understand what text mode is.

It DOESN'T eat every 8th bit.

What it does is convert line endings depending on what the server and client are using.

Windows uses \r\n (ASCII codes 13, 10, aka carriage return and line feed) to mean end of line.
Linux and *nix generally, including niche OSes derived from that heritage (and AmigaOS, randomly), use \n.
Mac historically used \r (classic Mac OS); Mac OS X uses \n.

Some images are fine, in fact, because they contain neither \r nor \n, but it's incredibly rare for that to be the case.

Thus, text mode == ASCII mode: it isn't stripping bit 7 (unless the client is REALLY stupid), because it doesn't foul up extended 8-bit characters, or UTF-8 characters for that matter. The only bytes it touches are line endings, which are the same in UTF-8 because UTF-8 reuses the 0-127 range unchanged.

Binary wouldn't mess up your line endings. It would preserve them, that's kind of the point!! Text mode WILL screw your line endings unless you happen to be on the same OS as the server, in which case I'd hope your client was smart enough to realise.
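A rough Python sketch of that behaviour follows. The function name ascii_mode_lf_to_crlf is made up, and it models only the simplest case (every LF rewritten as CR LF), not what any particular client actually does.

```python
def ascii_mode_lf_to_crlf(data: bytes) -> bytes:
    """Simplified model of text mode: rewrite each LF (10) as CR LF (13, 10)."""
    return data.replace(b"\n", b"\r\n")

# The 8-byte signature at the start of every PNG file. It contains a
# CR LF pair and a lone LF, so line-ending rewrites show up immediately.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
mangled = ascii_mode_lf_to_crlf(png_signature)

print(png_signature.hex())  # 89504e470d0a1a0a
print(mangled.hex())        # 89504e470d0d0a1a0d0a

# Data with no LF bytes passes through untouched, 8-bit values included.
jpeg_start = bytes([0xFF, 0xD8, 0xFF, 0xE0])
utf8_text = "é".encode("utf-8")
print(ascii_mode_lf_to_crlf(jpeg_start) == jpeg_start)  # True
print(ascii_mode_lf_to_crlf(utf8_text) == utf8_text)    # True
```

Note that the 0x89 byte keeps its top bit in the mangled output. Under this model the damage is two extra CR bytes, not a stripped 8th bit.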

AncientDragonfly · May 05, 2010, 05:59:38 PM

* AncientDragonfly doesn't want to argue with Arantor, so will just include a few links with accompanying text.

http://www.faqs.org/rfcs/rfc959.html

Quote
RFC 959
File Transfer Protocol
[snip]
3.4.3. COMPRESSED MODE

There are three kinds of information to be sent: regular data,
sent in a byte string; compressed data, consisting of
replications or filler; and control information, sent in a
two-byte escape sequence. If n>0 bytes (up to 127) of regular
data are sent, these n bytes are preceded by a byte with the
left-most bit set to 0 and the right-most 7 bits containing the
number n.

http://www.faqs.org/rfcs/rfc1635.html

Quote
RFC1635 - How to Use Anonymous FTP

- You may set BINARY mode to transfer executable programs or files of data. Type "binary" to do so. Usually FTP programs assume files use only 7 bits per byte, the norm for standard ASCII-encoded files. The BINARY command allows you to transfer files that use the full 8 bits per byte without error, but this may have implications on how the file is transferred to your local system.

http://courses.wccnet.edu/computer/mod/na36c.htm

Quote
FTP was developed at a time when typical modem speeds were 110 to 300 bits per second (as compared with 28,000 to 56,000 today). Since ASCII only used 7 bits, long files could be transmitted more quickly by not sending all the unused bits.

The big drawback is that if a file that uses all 8 bits in each byte is accidentally sent using ASCII transfer, it will lose 1/8 of its information content. In most files, even the loss of one bit is enough to make it invalid, and losing 1/8 makes them totally unreadable. So ASCII transfer can be fatal to a file's health!

With today's higher speeds, the time lost by sending all 8 bits of an ASCII file is practically unnoticeable. But FTP has incorporated features into ASCII transfer that make it useful for other reasons, so the two modes remain.

http://mc-computing.com/winexplorer/ftp.html

Quote
Binary vs ASCII
The internet was developed to transfer information in 7-bit packets.

Computers store data in 8-bit bytes.

Normal ASCII text data is stored using only 7 bits per byte (the 8th bit is always zero). (This kind of document is created using programs like notepad.exe) As a result, it is fairly easy to transfer this type of data over the internet. This is one reason the html pages are always written in ASCII.

Other types of data (images, programs, MS Word documents, and the like) use all 8 bits to store data. As a result, special algorithms are required to transfer 8-bit data over a 7-bit interface.

When you are using FTP, you need to be aware of the type of data being transferred - ASCII or binary (7-bit or 8-bit).
If you try to transfer 8-bit data in ASCII mode, data WILL be lost (the 8th bit will be set to zero).

http://www.ietf.org/mail-archive/web/ietf/current/msg40628.html

Quote
To: Sandeep Srivastava , ietf at ietf . org
Subject: Re: [RFC 959] FTP in ASCII mode
From: John C Klensin
Date: Tue, 21 Feb 2006 03:34:27 -0500
[snip]
> I knew that a FTP transfer in ASCII mode does EOL and EOF
> conversions based on the OS of the receiving system.

No, it doesn't. That was part of the point. It does no EOF
conversions at all. The command and data channels were
separated for several reasons, but the desire to stay out of the
EOF business was an important one. And the server is required
to convert whatever line-end convention it uses to CRLF, and any
characters it uses to ASCII, and transmit that over the wire.
If the client then converts from CRLF and ASCII to some local
convention, that is its business, not that of the protocol. In
other words, there are, at most, conversions to and from CRLF
and ASCII. There are no FTP-specified conversions based on the
properties of the receiving system.

* AncientDragonfly thinks this esoteric discussion may confuse people who just want to make sure they don't lose their files.

Bottom line, or tl;dr: Transfer your avatars and attachments directory as binary files.
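For anyone scripting their backups instead of using FileZilla, the same advice can be sketched with Python's standard ftplib module. The helper name download_binary is made up here, and the host, login and directory in the usage comment are placeholders, not real values.

```python
from ftplib import FTP


def download_binary(ftp, remote_name, local_path):
    """Fetch one file in binary (image) mode so no line endings are rewritten."""
    with open(local_path, "wb") as out:
        # retrbinary sends TYPE I (binary) before issuing the RETR command.
        ftp.retrbinary(f"RETR {remote_name}", out.write)


# Usage sketch (hypothetical host, credentials and directory):
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     ftp.cwd("attachments")
#     for name in ftp.nlst():
#         download_binary(ftp, name, name)
```

The local file is opened in "wb" mode as well, so Python itself does not translate line endings when writing.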

Arantor · May 05, 2010, 06:07:58 PM

Originally, yes, that was the case. But 959 has been superseded multiple times since 1985, meaning that under the current specifications it's not a strictly 7-bit-safe environment.

If it were, every instance of SMF would be broken by FTP clients that treat PHP files as 'text' because there is a function in Subs.php that has 8-bit characters in it. As would any language file that isn't English.

AncientDragonfly · May 05, 2010, 08:30:54 PM

Quote from: Arantor on May 05, 2010, 06:07:58 PM
Originally, yes, that was the case. But 959 has been superseded multiple times since 1985, meaning that under the current specifications it's not a strictly 7-bit-safe environment.

Do you mean to say that the only difference between an ASCII transfer and a binary transfer is what happens with line endings? John Klensin says that any translation besides that to CR/LF for transfer (specified by network ASCII) is performed by the client for the local system, not by the server (if the server follows protocol). So wouldn't that imply that between two identical systems, there would be no difference in line endings once the transfer is complete? So if I transfer a binary file between 2 FreeBSD systems in ASCII, the LFs will be converted to CR/LFs for transfer, then the client will convert them back to LFs, so the file would essentially be the same, wouldn't it? So a binary file (say, like an avatar) transferred as ASCII between identical systems will not be corrupted? (No, I haven't tested this yet, I'll have to do that later sometime.)
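The round trip described above can be modelled in a few lines of Python. This is only a thought experiment: the function names are made up, and real servers and clients may treat a bare CR or an existing CR LF pair differently, so it does not predict what any particular software will do.

```python
def server_to_wire(data: bytes) -> bytes:
    """Sending side on a Unix-like host: each local LF becomes CR LF on the wire."""
    return data.replace(b"\n", b"\r\n")


def wire_to_client(data: bytes) -> bytes:
    """Receiving side on a Unix-like host: each CR LF on the wire becomes LF."""
    return data.replace(b"\r\n", b"\n")


def wire_to_client_strip_cr(data: bytes) -> bytes:
    """Hypothetical sloppier receiver that simply drops every CR byte."""
    return data.replace(b"\r", b"")


# A binary sample containing a CR LF pair, a lone LF and some 8-bit bytes.
sample = b"\x89PNG\r\n\x1a\n\x00\xff"
wire = server_to_wire(sample)

print(wire_to_client(wire) == sample)           # True
print(wire_to_client_strip_cr(wire) == sample)  # False
```

In this idealised model the round trip is lossless, because the receiver removes exactly the CR bytes the sender inserted. The second receiver shows how one small implementation difference is enough to break a binary file, which is why testing with real software matters more than the model.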

Arantor, I am the kind of person who likes to understand things, not just take someone else's word for them (not at all implying, though, that your word isn't good), and there is no question that you are far smarter than I am or ever will be, so how about sharing the source of these new (to me) revelations?

I have found RFC 2640, which is titled "Internationalization of the File Transfer Protocol," and says this:

Quote
Abstract

The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
1123 Section 4 [RFC1123], is one of the oldest and widely used
protocols on the Internet. The protocol's primary character set, 7
bit ASCII, has served the protocol well through the early growth
years of the Internet. However, as the Internet becomes more global,
there is a need to support character sets beyond 7 bit ASCII.

This document addresses the internationalization (I18n) of FTP, which
includes supporting the multiple character sets and languages found
throughout the Internet community. This is achieved by extending the
FTP specification and giving recommendations for proper
internationalization support.

but it also says this:

Quote
1 Introduction

As the Internet grows throughout the world the requirement to support
character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859]
character set becomes ever more urgent. For FTP, because of the
large installed base, it is paramount that this is done without
breaking existing clients and servers. This document addresses this
need. In doing so it defines a solution which will still allow the
installed base to interoperate with new clients and servers.

This document enhances the capabilities of the File Transfer Protocol
by removing the 7-bit restrictions on pathnames used in client
commands and server responses, RECOMMENDs the use of a Universal
Character Set (UCS) ISO/IEC 10646 [ISO-10646], RECOMMENDs a UCS
transformation format (UTF) UTF-8 [UTF-8], and defines a new command
for language negotiation.

The recommendations made in this document are consistent with the
recommendations expressed by the IETF policy related to character
sets and languages as defined in RFC 2277 [RFC2277].

and this:

Quote
2 Internationalization

The File Transfer Protocol was developed when the predominate
character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
character sets cannot support the wide range of characters needed by
multinational systems. Given that there are a number of character
sets in current use that provide more characters than 7-bit ASCII, it
makes sense to decide on a convenient way to represent the union of
those possibilities. To work globally either requires support of a
number of character sets and to be able to convert between them, or
the use of a single preferred character set. To assure global
interoperability this document RECOMMENDS the latter approach and
defines a single character set, in addition to NVT ASCII and EBCDIC,
which is understandable by all systems. For FTP this character set
SHALL be ISO/IEC 10646:1993. For support of global compatibility it
is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
when exchanging pathnames. Clients and servers are, however, under
no obligation to perform any conversion on the contents of a file for
operations such as STOR or RETR.

The character set used to store files SHALL remain a local decision
and MAY depend on the capability of local operating systems. Prior to
the exchange of pathnames they SHOULD be converted into a ISO/IEC
10646 format and UTF-8 encoded. This approach, while allowing
international exchange of pathnames, will still allow backward
compatibility with older systems because the code set positions for
ASCII characters are identical to the one byte sequence in UTF-8.

This seems to mean it is only talking about pathnames changing, not the ASCII spec. But are the local systems changing the way things are done depending on their localization? They know that some characters in that localization require all 8 bits? Or the local system determines that a file has non-ASCII characters, and adapts accordingly?

I have found RFC 3659, which updates (but does not obsolete) 959, and it says this:

Quote
This document also uses notation defined in STD 9, RFC 959 [3]. In
particular, the terms "reply", "user", "NVFS" (Network Virtual File
System), "file", "pathname", "FTP commands", "DTP" (data transfer
process), "user-FTP process", "user-PI" (user protocol interpreter),
"user-DTP", "server-FTP process", "server-PI", "server-DTP", "mode",
"type", "NVT" (Network Virtual Terminal), "control connection", "data
connection", and "ASCII", are all used here as defined there.

which is defined there as:

Quote
ASCII

The ASCII character set is as defined in the ARPA-Internet
Protocol Handbook. In FTP, ASCII characters are defined to be
the lower half of an eight-bit code set (i.e., the most
significant bit is zero).

This is knowledge-seeking on my part, not a "my * is bigger than your *" challenge to you.

Quote from: Arantor on May 05, 2010, 06:07:58 PM
If it were, every instance of SMF would be broken by FTP clients that treat PHP files as 'text' because there is a function in Subs.php that has 8-bit characters in it. As would any language file that isn't English.

Then if the 8th bit in ASCII transfers can now be 0 or 1, how come the images in the Avatars and Attachments directory get broken?

Arantor · May 05, 2010, 08:37:13 PM

OK, hereafter I'm going to throw out the rulebook and the specs and go purely on what I have seen and observed because when it comes down to it, quoting spec after spec doesn't solve the issue - because it's how things implement the spec that matters.

(Heck, if everyone observed the spec right, how come IE6 is such a pile of ****?)

Anyway.

There are two VERY DIFFERENT issues here.

One is 8-bit sanctity, one is line ending sanctity.

The former is essentially a non issue for modern FTP servers and clients because funnily enough the spec is ignored, and they send 8 bit files as is. In today's UTF-8 world, it's pretty much a necessity to send 8-bit files, because UTF-8 is a full 8-bit text format.

The latter is still very much an issue because line ending conversions get done. THIS is what breaks attachments, not loss of the 8th bit.

How do I know this? I've seen attachments broken by this; specifically, I've observed the differences in the files and viewed them in more tolerant viewers. The files get broken part way through, not totally. If 8th bit loss were the cause, on average 50% of the file would be damaged, since the odds of any one bit being set are roughly 50%. But that's not the case.

From experience, FTP between two Linux servers hasn't damaged such files. That isn't to say it wouldn't but in my experience that was the case.
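The 50% estimate above can be sanity-checked with a quick Python sketch. It uses random bytes as a stand-in for compressed image data, which is an assumption, and the seed is arbitrary.

```python
import random

rng = random.Random(2010)  # fixed seed so the run is repeatable
data = bytes(rng.getrandbits(8) for _ in range(100_000))

# Bytes that would change if the top bit were cleared (values 128-255).
high_bit_fraction = sum(b >= 0x80 for b in data) / len(data)

# Bytes that an LF -> CR LF rewrite would touch (value 10 only).
lf_fraction = data.count(b"\n") / len(data)

print(f"top bit set: {high_bit_fraction:.1%}")
print(f"LF bytes:    {lf_fraction:.1%}")
```

Roughly half the bytes have the top bit set, while only about 1 in 256 is an LF. Damage scattered at a handful of points fits the line-ending explanation far better than 8th-bit stripping.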

AncientDragonfly · May 07, 2010, 02:32:45 PM

Quote from: Arantor on May 05, 2010, 08:37:13 PM
OK, hereafter I'm going to throw out the rulebook and the specs and go purely on what I have seen and observed because when it comes down to it, quoting spec after spec doesn't solve the issue - because it's how things implement the spec that matters.

Ok, fair enough.

Quote
(Heck, if everyone observed the spec right, how come IE6 is such a pile of ****?)

Microsoft wanted to write their own spec for the internet - they thought they were an exception to the existing specs, that they were powerful enough to write their own, and that everyone else would adjust to it because they were Microsoft.

This was obvious in their early implementations of FrontPage - I was working tech support at a web hosting company at the time, and we had many, many customers call to find out why their pages written with FrontPage which displayed perfectly in IE (3? 4?) would not display in Netscape. Try telling a bunch of customers who thought Microsoft invented the internet that FrontPage and IE didn't follow the already established standards!

Microsoft has since realized that they need to work with the internet technical community, instead of trying to take it over.

You might find this thread on the ASCII issue interesting (that is, unless you're bored with the whole thing by now): http://www.securityfocus.com/archive/1/437948/100/0/threaded

Quote
Anyway.

There are two VERY DIFFERENT issues here.

One is 8-bit sanctity, one is line ending sanctity.

The former is essentially a non issue for modern FTP servers and clients because funnily enough the spec is ignored, and they send 8 bit files as is. In today's UTF-8 world, it's pretty much a necessity to send 8-bit files, because UTF-8 is a full 8-bit text format.

The latter is still very much an issue because line ending conversions get done. THIS is what breaks attachments, not loss of the 8th bit.

How do I know this? I've seen attachments broken by this; specifically, I've observed the differences in the files and viewed them in more tolerant viewers. The files get broken part way through, not totally. If 8th bit loss were the cause, on average 50% of the file would be damaged, since the odds of any one bit being set are roughly 50%. But that's not the case.

From experience, FTP between two Linux servers hasn't damaged such files. That isn't to say it wouldn't but in my experience that was the case.

Ok, fwiw, I tested 2 files on my BSD machines. I did this with both the command line FTP client and with Filezilla, with the same result. One was a plain text file, one was a png file. I downloaded both using both ASCII and binary. The ASCII downloaded png file wouldn't open in GIMP; GIMP suggested that the file might be corrupt. I then opened both the ASCII-downloaded and the binary-downloaded png files with Diffuse, a file comparison editor (similar to Diff, but graphical), and the only differences in the two files that I could see were some blank lines in one but not the other. There were about 8-10 instances of this in the whole file. With the plain text file, there were no differences apparent in Diffuse.

This suggests you're right about the line ending translation being what messes up the binary files instead of the 8th bit, since every other character was the same, and none of the 8-bit characters had been converted to standard ASCII characters. If the 8th bit had been changed to 0 on all the bytes, I would have expected to see only alpha-numeric and punctuation characters in the ASCII-downloaded binary file.
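The same kind of comparison can be done without a graphical tool. Below is a small Python sketch using the standard difflib module; the function name byte_differences and the sample bytes are made up, and the "ASCII download" is simulated with a plain LF to CR LF rewrite, not taken from a real transfer.

```python
from difflib import SequenceMatcher


def byte_differences(a: bytes, b: bytes):
    """Return (tag, offset_in_a, bytes_from_a, bytes_from_b) for each differing region."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


binary_copy = b"\x89PNG\r\n\x1a\n\x00\xff"
ascii_copy = binary_copy.replace(b"\n", b"\r\n")  # simulated text-mode download

for tag, offset, old, new in byte_differences(binary_copy, ascii_copy):
    print(tag, offset, old, new)
```

If line-ending translation is the only thing going on, every reported region is an inserted CR byte and the 8-bit bytes appear in no difference at all.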

Anyway, thanks for the stimulating discussion, and for bringing my FTP/ASCII knowledge up to date. I never mind having what I think I know challenged if it might be wrong.

Arantor · May 07, 2010, 06:47:36 PM

*nods* Experience definitely wins - and it's great to get more experience on the matter, and good to have references to things to read up on.

That's the thing - you appreciate the journey as well as the destination itself, and it's been a pleasure discussing this.

jimbouk1977 · May 20, 2010, 03:51:03 PM

LOL, so on or off, auto or binary: that's all we needed to know!

Arantor · May 20, 2010, 03:53:39 PM

If in doubt, binary.

bahgheera · May 31, 2010, 03:17:34 AM

Here's a question: in the original post, it says the option should be UNCHECKED in the image. Then directly after, there are some instructions that state the option should be CHECKED. So which is it, checked or unchecked??

EDIT: Doh... I'm an idiot. Guess I couldn't make my brain go past what I thought was a confusing error.

Forum Guy · May 31, 2010, 03:52:35 AM

nice glitch, really!

That checkbox could be removed completely, as the client should transfer everything as BINARY unless a known extension calls for ASCII.

Consequently, this is a complete mess and they got it all wrong.

knightofdoom · June 10, 2010, 05:44:33 AM

thanks for sharing
