Important Notice: Do you use Filezilla? Avatars and Attachments lost?

Started by AncientDragonfly, April 02, 2010, 11:52:18 AM


Arantor

It doesn't actually hurt to transfer everything as binary, really.

The whole concept of transferring text as text is outdated and really not an issue with modern editors. It only mattered in the olden days when editors couldn't handle both Linux and Windows line endings (like Windows Notepad still can't, heh). If you're using a modern editor like Notepad++, it's a non-issue.
Holder of controversial views, all of which my own.


~DS~

Quote from: Arantor on May 03, 2010, 06:18:13 AM
It doesn't actually hurt to transfer everything as binary, really.

The whole concept of transferring text as text is outdated and really not an issue with modern editors. It only mattered in the olden days when editors couldn't handle both Linux and Windows line endings (like Windows Notepad still can't, heh). If you're using a modern editor like Notepad++, it's a non-issue.
I will try everything as binary the next time I back up. Better safe than sorry, so I'll try both methods. For now I am leaving it on AUTO with "Treat files without extension as ASCII file" checked, and unchecking it only for attachments and avatars.

When you say binary... you mean not auto? And what about "Treat files without extension as ASCII file"?
"There is no god, and that's the simple truth. If every trace of any single religion were wiped out and nothing were passed on, it would never be created exactly that way again. There might be some other nonsense in its place, but not that exact nonsense. If all of science were wiped out, it would still be true and someone would find a way to figure it all out again."
~Penn Jillette – God, NO! – 2011

Arantor

Text mode == ASCII mode, they're one and the same.

When I'm using either FileZilla or WinSCP, I have it set to treat *everything* as binary.


AncientDragonfly

Quote from: Dismal Shadow on May 02, 2010, 06:26:11 PM
Ok, I haven't backed up in a week so I am gonna back up today, but I want to make sure the image shown here:
http://www.simplemachines.org/community/index.php?topic=377117.msg2594267#msg2594267
is correct, so that it doesn't mess up the avatars and attachments.

No, that image in your other post is incorrect.

This image from the first post in this thread is correct:



As Arantor says, however, you should be ok to download ASCII (plain text) files using binary mode.  Auto mode is just a list of extensions that specifies which files should be treated as ASCII but since the files we are talking about don't have extensions, it can't be used in this situation.
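
As a rough sketch of how an auto mode like this might pick a transfer type: the extension list and the function name below are made up for illustration and are not FileZilla's actual list or code.

```python
from pathlib import PurePosixPath

# Hypothetical list of extensions an "auto" mode might treat as text.
ASCII_EXTENSIONS = {".txt", ".php", ".html", ".css", ".js"}

def auto_transfer_type(filename, extensionless_as_ascii):
    """Return "ascii" or "binary" for a file name, mimicking an auto mode."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix == "":
        # No extension: the list cannot help, so only the checkbox decides.
        return "ascii" if extensionless_as_ascii else "binary"
    return "ascii" if suffix in ASCII_EXTENSIONS else "binary"

print(auto_transfer_type("index.php", True))   # ascii
print(auto_transfer_type("avatar_12", True))   # ascii (the dangerous case)
print(auto_transfer_type("avatar_12", False))  # binary
```

The file name "avatar_12" is just a stand-in for an extensionless avatar or attachment file.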

Basically, if you open a file in a text editor and it looks like non-alphabetic, non-numeric, non-standard-punctuation characters, it is a binary file (see attachment).

ASCII can be transferred as binary without corruption.
Binary files will be corrupted if they are not transferred as binary.
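
The "open it in a text editor" test can be roughly approximated in code. This is only a heuristic sketch with a made-up function name: it treats anything outside printable ASCII and ordinary whitespace as a sign of binary data, so UTF-8 text with accented characters would also be flagged.

```python
import string

# Bytes a plain ASCII text file would normally contain:
# letters, digits, punctuation, plus space, tab, CR, LF and similar whitespace.
TEXT_BYTES = set(string.printable.encode("ascii"))

def looks_binary(data):
    """Return True if any byte falls outside printable ASCII and whitespace."""
    return any(byte not in TEXT_BYTES for byte in data)

print(looks_binary(b"Hello, world\r\n"))     # False
print(looks_binary(b"\x89PNG\r\n\x1a\n"))    # True (start of a PNG file)
```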

Arantor

Quote
ASCII can be transferred as binary without corruption.

Not entirely true, in fact, it depends what you define as 'corruption'.

If by 'without corruption' you mean getting the actual, byte-for-byte file, then it will be corrupted.

The file will be silently converted for line endings. For example, go grab the SMF install package and open any SMF file in Windows Notepad. Not Notepad++, not anything fancy, just normal Notepad. You get a mess where everything is on one line.

Now download the same file from FTP in ASCII mode and open it. Now it's not a mess. Why? Because FileZilla has silently converted the line endings on your behalf.

If you have ANY doubt, use binary mode. I expect the files I send and receive to be the actual files, not a modified-in-ANY-way version.
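
A minimal sketch of that silent conversion, assuming a simplified model in which a text-mode download to Windows turns every LF into CR LF (the function name and sample content are made up):

```python
def lf_to_crlf(data):
    """Simplified model of a text-mode download to Windows: LF becomes CR LF."""
    return data.replace(b"\n", b"\r\n")

original = b"<?php\necho 'hi';\n"   # stand-in for a file stored with Unix line endings
converted = lf_to_crlf(original)

print(original == converted)                            # False: the bytes changed
print(original.splitlines() == converted.splitlines())  # True: same lines of text
print(len(original), len(converted))                    # 17 19
```

The text reads the same, which is why nobody notices, but it is no longer the file that was on the server.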


~DS~

OK, I unchecked the one from the image and set it to binary mode instead of auto. I opened a file with TextWrangler and it seems fine. No lines messed up.

Arantor

That could be because TextWrangler can handle Windows, Linux and Mac line endings, which would hide it if the line endings had been changed on you without you being aware of it.


AncientDragonfly

Quote from: Arantor on May 03, 2010, 06:27:38 PM
Quote
ASCII can be transferred as binary without corruption.

Not entirely true, in fact, it depends what you define as 'corruption'.

If by 'without corruption' you mean getting the actual, byte-for-byte file, then it will be corrupted.

The file will be silently converted for line endings. For example, go grab the SMF install package and open any SMF file in Windows Notepad. Not Notepad++, not anything fancy, just normal Notepad. You get a mess where everything is on one line.

Well, yes, that is true in the sense that it wouldn't pass a CRC (or some other kind of integrity) check, but it will be recoverable with one of the many line-ending converter programs. A binary file like an image, on the other hand, is gone for good once it's lost every 8th bit.

Quote
Now download the same file from FTP in ASCII mode and open it. Now it's not a mess. Why? Because FileZilla has silently converted the line endings on your behalf.

If you have ANY doubt, use binary mode. I expect the files I send and receive to be the actual files, not a modified-in-ANY-way version.

But you said:
Quote from: Arantor on May 03, 2010, 06:30:01 AM
Text mode == ASCII mode, they're one and the same.

When I'm using either FileZilla or WinSCP, I have it set to treat *everything* as binary.

:P  (I'm on FreeBSD - binary doesn't mess up my line endings.)

Quote from: Dismal Shadow on May 03, 2010, 06:39:07 PM
OK, I unchecked the one from the image and set it to binary mode instead of auto. I opened a file with TextWrangler and it seems fine. No lines messed up.

Good.  Glad to hear it.

Arantor

* Arantor thinks people don't understand what text mode is.

It DOESN'T eat every 8th bit.

What it does is convert line endings depending on what the server and client are using.

Windows uses \r\n (ASCII codes 13, 10, aka carriage return and newline) to mean end of line.
Linux and *nix generally, including niche OSes derived from that heritage (including AmigaOS, randomly), use \n.
Mac historically used \r; I don't know if it still does.

Some images are fine, in fact, because they contain neither \r nor \n, but it's incredibly rare for that to be the case.

Thus, text == ASCII mode. It's not stripping bit 7 unless it's REALLY stupid: it doesn't foul up extended 8-bit characters, or UTF-8 characters for that matter. (Line endings are the exception, and they are the same bytes in UTF-8, because UTF-8 reuses the same 0-127 range.)


Binary wouldn't mess up your line endings. It would preserve them, that's kind of the point!! Text mode WILL screw your line endings unless you happen to be on the same OS as the server, in which case I'd hope your client was smart enough to realise.
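
To make that concrete, here is a sketch using the fixed 8-byte signature that every PNG file starts with, which happens to contain both a \r\n and a lone \n. The two conversion functions are simplified models with made-up names, not any particular client's code.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file

def text_mode_to_windows(data):
    """Simplified model: every LF becomes CR LF."""
    return data.replace(b"\n", b"\r\n")

def text_mode_to_unix(data):
    """Simplified model: every CR LF becomes LF."""
    return data.replace(b"\r\n", b"\n")

to_windows = text_mode_to_windows(PNG_SIGNATURE)
to_unix = text_mode_to_unix(PNG_SIGNATURE)

print(len(PNG_SIGNATURE), len(to_windows), len(to_unix))  # 8 10 7
print(to_windows[0] == 0x89 and to_unix[0] == 0x89)       # True: the 8-bit byte survives
```

Either way the signature no longer matches, so an image viewer will reject the file even though no high bit was touched.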


AncientDragonfly

* AncientDragonfly doesn't want to argue with Arantor, so will just include a few links with accompanying text.

http://www.faqs.org/rfcs/rfc959.html
Quote
RFC 959
File Transfer Protocol
[snip]
3.4.3.  COMPRESSED MODE

         There are three kinds of information to be sent:  regular data,
         sent in a byte string; compressed data, consisting of
         replications or filler; and control information, sent in a
         two-byte escape sequence.  If n>0 bytes (up to 127) of regular
         data are sent, these n bytes are preceded by a byte with the
         left-most bit set to 0 and the right-most 7 bits containing the
         number n.

http://www.faqs.org/rfcs/rfc1635.html
Quote
RFC1635 - How to Use Anonymous FTP

- You may set BINARY mode to transfer executable programs or files of data. Type "binary" to do so. Usually FTP programs assume files use only 7 bits per byte, the norm for standard ASCII-encoded files. The BINARY command allows you to transfer files that use the full 8 bits per byte without error, but this may have implications on how the file is transferred to your local system.

http://courses.wccnet.edu/computer/mod/na36c.htm
Quote
FTP was developed at a time when typical modem speeds were 110 to 300 bits per second (as compared with 28,000 to 56,000 today). Since ASCII only used 7 bits, long files could be transmitted more quickly by not sending all the unused bits.

The big drawback is that if a file that uses all 8 bits in each byte is accidentally sent using ASCII transfer, it will lose 1/8 of its information content. In most files, even the loss of one bit is enough to make it invalid, and losing 1/8 makes them totally unreadable. So ASCII transfer can be fatal to a file's health!

With today's higher speeds, the time lost by sending all 8 bits of an ASCII file is practically unnoticeable. But FTP has incorporated features into ASCII transfer that make it useful for other reasons, so the two modes remain.

http://mc-computing.com/winexplorer/ftp.html
Quote
Binary vs ASCII
The internet was developed to transfer information in 7-bit packets.

Computers store data in 8-bit bytes.

Normal ASCII text data is stored using only 7 bits per byte (the 8th bit is always zero). (This kind of document is created using programs like notepad.exe) As a result, it is fairly easy to transfer this type of data over the internet. This is one reason the html pages are always written in ASCII.

Other types of data (images, programs, MS Word documents, and the like) use all 8 bits to store data. As a result, special algorithms are required to transfer 8-bit data over a 7-bit interface.

When you are using FTP, you need to be aware of the type of data being transferred - ASCII or binary (7-bit or 8-bit).
If you try to transfer 8-bit data in ASCII mode, data WILL be lost (the 8th bit will be set to zero).

http://www.ietf.org/mail-archive/web/ietf/current/msg40628.html
Quote
To: Sandeep Srivastava , ietf at ietf . org
Subject: Re: [RFC 959] FTP in ASCII mode
From: John C Klensin
Date: Tue, 21 Feb 2006 03:34:27 -0500
[snip]
> I knew that a FTP transfer in ASCII mode does EOL and EOF
> conversions based on the OS of the receiving system.

No, it doesn't.  That was part of the point.  It does no EOF
conversions at all.   The command and data channels were
separated for several reasons, but the desire to stay out of the
EOF business was an important one.  And the server is required
to convert whatever line-end convention it uses to CRLF, and any
characters it uses to ASCII, and transmit that over the wire.
If the client then converts from CRLF and ASCII to some local
convention, that is its business, not that of the protocol.  In
other words, there are, at most, conversions to and from CRLF
and ASCII. There are no FTP-specified conversions based on the
properties of the receiving system. 

* AncientDragonfly thinks this esoteric discussion may confuse people who just want to make sure they don't lose their files.

Bottom line, or tl;dr: Transfer your avatars and attachments directory as binary files. 
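
For anyone scripting their backups instead of using a GUI client, a minimal sketch with Python's standard ftplib, whose retrbinary() method requests a binary transfer. The host, login details and file names are placeholders, and download_binary is just a helper name for this example.

```python
from ftplib import FTP

def download_binary(ftp, remote_name, local_path):
    """Fetch one file in binary mode and write the bytes to disk unchanged."""
    with open(local_path, "wb") as local_file:
        # retrbinary switches the connection to binary (TYPE I) before RETR.
        ftp.retrbinary("RETR " + remote_name, local_file.write)

# Usage (placeholder host, credentials and paths):
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     ftp.cwd("attachments")
#     download_binary(ftp, "1_abc123", "1_abc123")
```

Opening the local file with "wb" matters as well, so that Python itself does not translate line endings on the way to disk.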

Arantor

Originally, yes that was the case. But 959 has been superseded multiple times since 1985, meaning that in the current specifications, it's not a strict 7-bit safe only environment.

If it were, every instance of SMF would be broken by FTP clients that treat PHP files as 'text' because there is a function in Subs.php that has 8-bit characters in it. As would any language file that isn't English.


AncientDragonfly

Quote from: Arantor on May 05, 2010, 06:07:58 PM
Originally, yes that was the case. But 959 has been superseded multiple times since 1985, meaning that in the current specifications, it's not a strict 7-bit safe only environment.

Do you mean to say that the only difference between an ASCII transfer and a binary transfer is what happens with line endings?  John Klensin says that any translation besides that to CR/LF for transfer (specified by network ASCII) is performed by the client for the local system, not by the server (if the server follows protocol).  So wouldn't that imply that between two identical systems, there would be no difference in line endings once the transfer is complete?  So if I transfer a binary file between 2 FreeBSD systems in ASCII, the LFs will be converted to CR/LFs for transfer, then the client will convert them back to LFs, so the file would essentially be the same, wouldn't it?  So a binary file (say, like an avatar) transferred as ASCII between identical systems will not be corrupted?  (No, I haven't tested this yet, I'll have to do that later sometime.)
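
The round-trip question can at least be explored on paper. The sketch below uses two hypothetical sender models and one receiver model (none of them claimed to match a real server or client) and suggests the answer depends on whether the two conversions are exact inverses of each other.

```python
def send_naive(data):
    """Sender model 1: put a CR in front of every LF."""
    return data.replace(b"\n", b"\r\n")

def send_skip_existing_crlf(data):
    """Sender model 2: only add a CR to LFs that don't already have one."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

def receive(data):
    """Receiver model: turn every CR LF back into a bare LF."""
    return data.replace(b"\r\n", b"\n")

sample = b"\x89PNG\r\n\x1a\n\x00\xff"  # binary data containing both CR LF and a lone LF

print(receive(send_naive(sample)) == sample)               # True
print(receive(send_skip_existing_crlf(sample)) == sample)  # False
```

Under the first model every inserted CR is removed again and the file survives; under the second, a CR that was already part of the binary data is lost, so the file comes out one byte short.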

Arantor,  I am the kind of person who likes to understand things, not just take someone else's word for them (not at all implying, though, that your word isn't good), and there is no question that you are far smarter than I am or ever will be, so how about sharing the source of these new (to me) revelations? 

I have found RFC 2640, which is titled "Internationalization of the File Transfer Protocol," and says this:
Quote
Abstract

   The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
   1123 Section 4 [RFC1123], is one of the oldest and widely used
   protocols on the Internet. The protocol's primary character set, 7
   bit ASCII, has served the protocol well through the early growth
   years of the Internet. However, as the Internet becomes more global,
   there is a need to support character sets beyond 7 bit ASCII.

   This document addresses the internationalization (I18n) of FTP, which
   includes supporting the multiple character sets and languages found
   throughout the Internet community.  This is achieved by extending the
   FTP specification and giving recommendations for proper
   internationalization support.

but it also says this:
Quote
1 Introduction


   As the Internet grows throughout the world the requirement to support
   character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859]
   character set becomes ever more urgent.  For FTP, because of the
   large installed base, it is paramount that this is done without
   breaking existing clients and servers. This document addresses this
   need. In doing so it defines a solution which will still allow the
   installed base to interoperate with new clients and servers.

   This document enhances the capabilities of the File Transfer Protocol
   by removing the 7-bit restrictions on pathnames used in client
   commands and server responses, RECOMMENDs the use of a Universal
   Character Set (UCS) ISO/IEC 10646 [ISO-10646], RECOMMENDs a UCS
   transformation format (UTF) UTF-8 [UTF-8], and defines a new command
   for language negotiation.

   The recommendations made in this document are consistent with the
   recommendations expressed by the IETF policy related to character
   sets and languages as defined in RFC 2277 [RFC2277].

and this:

Quote
2 Internationalization


   The File Transfer Protocol was developed when the predominate
   character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
   character sets cannot support the wide range of characters needed by
   multinational systems. Given that there are a number of character
   sets in current use that provide more characters than 7-bit ASCII, it
   makes sense to decide on a convenient way to represent the union of
   those possibilities. To work globally either requires support of a
   number of character sets and to be able to convert between them, or
   the use of a single preferred character set. To assure global
   interoperability this document RECOMMENDS the latter approach and
   defines a single character set, in addition to NVT ASCII and EBCDIC,
   which is understandable by all systems. For FTP this character set
   SHALL be ISO/IEC 10646:1993.  For support of global compatibility it
   is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
   when exchanging pathnames.  Clients and servers are, however, under
   no obligation to perform any conversion on the contents of a file for
   operations such as STOR or RETR.


   The character set used to store files SHALL remain a local decision
   and MAY depend on the capability of local operating systems.  Prior to
   the exchange of pathnames they SHOULD be converted into a ISO/IEC
   10646 format and UTF-8 encoded. This approach, while allowing
   international exchange of pathnames, will still allow backward
   compatibility with older systems because the code set positions for
   ASCII characters are identical to the one byte sequence in UTF-8.

This seems to mean it is only talking about pathnames changing, not the ASCII spec.  But are the local systems changing the way things are done depending on their localization?  Do they know that some characters in that localization require all 8 bits?  Or does the local system determine that a file has non-ASCII characters, and adapt accordingly?

I have found RFC 3659, which updates (but does not obsolete) 959, and it says this:

Quote
   This document also uses notation defined in STD 9, RFC 959 [3].  In
   particular, the terms "reply", "user", "NVFS" (Network Virtual File
   System), "file", "pathname", "FTP commands", "DTP" (data transfer
   process), "user-FTP process", "user-PI" (user protocol interpreter),
   "user-DTP", "server-FTP process", "server-PI", "server-DTP", "mode",
   "type", "NVT" (Network Virtual Terminal), "control connection", "data
   connection", and "ASCII", are all used here as defined there.

which is defined there as:
Quote
ASCII

         The ASCII character set is as defined in the ARPA-Internet
         Protocol Handbook.  In FTP, ASCII characters are defined to be
         the lower half of an eight-bit code set (i.e., the most
         significant bit is zero).

This  is knowledge-seeking on my part, not a "my * is bigger than your *" challenge to you.

Quote from: Arantor on May 05, 2010, 06:07:58 PM
If it were, every instance of SMF would be broken by FTP clients that treat PHP files as 'text' because there is a function in Subs.php that has 8-bit characters in it. As would any language file that isn't English.

Then if the 8th bit in ASCII transfers can now be 0 or 1, how come the images in the Avatars and Attachments directory get broken?


Arantor

OK, hereafter I'm going to throw out the rulebook and the specs and go purely on what I have seen and observed because when it comes down to it, quoting spec after spec doesn't solve the issue - because it's how things implement the spec that matters.

(Heck, if everyone observed the spec right, how come IE6 is such a pile of ****?)

Anyway.

There are two VERY DIFFERENT issues here.

One is 8-bit sanctity, one is line ending sanctity.

The former is essentially a non issue for modern FTP servers and clients because funnily enough the spec is ignored, and they send 8 bit files as is. In today's UTF-8 world, it's pretty much a necessity to send 8-bit files, because UTF-8 is a full 8-bit text format.

The latter is still very much an issue because line ending conversions get done. THIS is what breaks attachments, not loss of the 8th bit.

How do I know this? I've seen attachments broken by this, specifically I've observed the differences in the files, and viewed them in more tolerant viewers. The files get broken part way through, not totally. If 8th bit loss were the cause, on average about 50% of the file would be damaged, since any given bit has roughly a 50% chance of being set. But that's not the case.

From experience, FTP between two Linux servers hasn't damaged such files. That isn't to say it wouldn't but in my experience that was the case.
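
A back-of-envelope check of that 50% argument, assuming the bytes of a compressed image behave roughly like uniformly random bytes (an assumption, and the helper names are made up):

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
data = bytes(rng.getrandbits(8) for _ in range(100_000))

def fraction_with_high_bit(data):
    """Share of bytes that stripping the 8th bit would change."""
    return sum(1 for b in data if b & 0x80) / len(data)

def fraction_cr_or_lf(data):
    """Share of bytes a line-ending conversion could possibly touch."""
    return sum(1 for b in data if b in (0x0A, 0x0D)) / len(data)

print(f"8th-bit stripping would alter about {fraction_with_high_bit(data):.1%} of bytes")
print(f"CR or LF bytes make up about {fraction_cr_or_lf(data):.1%} of bytes")
```

Roughly half the bytes against well under one in a hundred: a file that is only broken in a handful of places fits the line-ending explanation far better.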


AncientDragonfly

Quote from: Arantor on May 05, 2010, 08:37:13 PM
OK, hereafter I'm going to throw out the rulebook and the specs and go purely on what I have seen and observed because when it comes down to it, quoting spec after spec doesn't solve the issue - because it's how things implement the spec that matters.

Ok, fair enough.

Quote
(Heck, if everyone observed the spec right, how come IE6 is such a pile of ****?)

Microsoft wanted to write their own spec for the internet - they thought they were an exception to the existing specs, that they were powerful enough to write their own, and that everyone else would adjust to it because they were Microsoft. 

This was obvious in their early implementations of FrontPage - I was working tech support at a web hosting company at the time, and we had many, many customers call to find out why their pages written with FrontPage which displayed perfectly in IE (3? 4?) would not display in Netscape.  Try telling a bunch of customers who thought Microsoft invented the internet that FrontPage and IE didn't follow the already established standards!   ::)

Microsoft has since realized that they need to work with the internet technical community, instead of trying to take it over.   :)

You might find this thread on the ASCII issue interesting (that is, unless you're bored with the whole thing by now): http://www.securityfocus.com/archive/1/437948/100/0/threaded

Quote
Anyway.

There are two VERY DIFFERENT issues here.

One is 8-bit sanctity, one is line ending sanctity.

The former is essentially a non issue for modern FTP servers and clients because funnily enough the spec is ignored, and they send 8 bit files as is. In today's UTF-8 world, it's pretty much a necessity to send 8-bit files, because UTF-8 is a full 8-bit text format.

The latter is still very much an issue because line ending conversions get done. THIS is what breaks attachments, not loss of the 8th bit.

How do I know this? I've seen attachments broken by this, specifically I've observed the differences in the files, and viewed them in more tolerant viewers. The files get broken part way through, not totally. If 8th bit loss were the cause, on average about 50% of the file would be damaged, since any given bit has roughly a 50% chance of being set. But that's not the case.

From experience, FTP between two Linux servers hasn't damaged such files. That isn't to say it wouldn't but in my experience that was the case.

Ok, fwiw, I tested 2 files on my BSD machines.  I did this with both the command line FTP client and with Filezilla, with the same result.  One was a plain text file, one was a png file.  I downloaded both using both ASCII and binary.  The ASCII downloaded png file wouldn't open in GIMP; GIMP suggested that the file might be corrupt.  I then opened both the ASCII-downloaded and the binary-downloaded png files with Diffuse, a file comparison editor (similar to Diff, but graphical), and the only differences in the two files that I could see were some blank lines in one but not the other.  There were about 8-10 instances of this in the whole file.  With the plain text file, there were no differences apparent in Diffuse.

This suggests you're right about the line ending translation being what messes up the binary files instead of the 8th bit, since every other character was the same, and none of the 8-bit characters had been converted to standard ASCII characters.  If the 8th bit had been changed to 0 on all the bytes, I would have expected to see only alpha-numeric and punctuation characters in the ASCII-downloaded binary file.
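
For anyone who wants to repeat the comparison without a graphical tool, a small sketch using Python's standard difflib. The sample bytes are made up, and the "ASCII" copy is produced by a simplified CR LF to LF conversion rather than by a real transfer.

```python
from difflib import SequenceMatcher

def byte_changes(a, b):
    """List (tag, bytes_from_a, bytes_from_b) for each region where a and b differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

binary_copy = b"\x89PNG\r\n\x1a\n\xde\xad\xbe\xef"   # stand-in for the binary download
ascii_copy = binary_copy.replace(b"\r\n", b"\n")     # stand-in for the ASCII download

for change in byte_changes(binary_copy, ascii_copy):
    print(change)   # ('delete', b'\r', b'')
```

SequenceMatcher is slow on large inputs, so this is only practical for small files or for the first few kilobytes of a larger one.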

Anyway, thanks for the stimulating discussion, and for bringing my FTP/ASCII knowledge up to date.  I never mind having what I think I know challenged if it might be wrong. 

Arantor

*nods* Experience definitely wins - and it's great to get more experience on the matter, and good to have references to things to read up on.

That's the thing - you appreciate the journey as well as the destination itself, and it's been a pleasure discussing this :)


jimbouk1977

LOL so on or off, auto or binary, that's all we needed to know! :P
Might be a stupid question but it's how you learn

Using SMF 2.0.1  & SP


bahgheera

Here's a question: in the original post, it says the option should be UNCHECKED in the image. Then directly after, there are some instructions that state the option should be CHECKED. So which is it, checked or unchecked??

EDIT: Doh... I'm an idiot. Guess I couldn't make my brain go past what I thought was a confusing error.  ::)

Forum Guy

nice glitch, really!

That check box could be removed completely: it should transfer everything in BINARY except for known extensions, which transfer in ASCII.

Consequently, this is a complete mess and they got it all wrong  :o



