[SMF Converter] vBulletin 3.5

Started by JayBachatero, January 11, 2008, 02:07:55 AM


Red G. Brown

I figured out what's wrong. SMF isn't converting HTML reserved characters into HTML entities. In other words, a line like this:

<nickname> Hi

is displayed by the forum verbatim, and the web browser will render it like this:

Hi

The forum needs to convert the HTML reserved characters < and > into HTML entities that will be displayed by the browser. Right now, browsers are treating <nickname> as an unknown HTML tag and not displaying it. SMF needs to convert them according to this:

http://www.w3schools.com/tags/ref_entities.asp
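As a rough illustration of the escaping described above, here is a minimal sketch in Python. It is not SMF's actual code (SMF is written in PHP), and the function name `escape_post_text` is made up for this example; it only shows what converting the reserved characters looks like.

```python
import html

def escape_post_text(raw: str) -> str:
    """Convert HTML reserved characters so the browser displays them literally."""
    # quote=False converts only &, < and >; quotation marks are left alone.
    return html.escape(raw, quote=False)

line = "<nickname> Hi"
print(escape_post_text(line))  # &lt;nickname&gt; Hi
```

With the escaped form in the page source, the browser shows the nickname instead of treating it as an unknown tag.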

I noticed that the problem doesn't appear in THIS forum, so I checked my freshly installed forum for some option to turn off HTML in posts, and I found "Enable basic HTML in posts", which is already off by default.

The "tags" are in the forum page source, so the pages will no longer validate. It looks like an easy problem to fix, even though it's quite serious.

Norv

Actually, SMF does convert HTML entities in every post, PM, etc. made using SMF.
However, it remains to be seen what the post looked like in the former database and what it looks like after conversion (the raw database data, that is).

I'll test it myself too, as soon as possible.
To-do lists are for deferral. The more things you write down the later they're done... until you have 100s of lists of things you don't do.

File a security report | Developers' Blog | Bug Tracker


Also known as Norv on D* | Norv N. on G+ | Norv on Github

Red G. Brown

#42
Database before conversion:

[23:21] <nickname> hey...\r\n

Database after conversion:

[23:21] <nickname> hey...\r\n

Source code of the forum webpage:

[23:21] <nickname> hey...<br />

It's that source code part that's the problem. It should be:

[23:21] &lt;nickname&gt; hey...<br />

UPDATE:

I just found that in the forum title section, if I do

<a href='websitemainpage.com'>Website Title</a>

It will put that string into a meta tag without escaping the < and > with &lt; and &gt; just as before, which will cause the page to not validate. I'm curious what it would do if the URL had an ampersand in it, like many URLs do. They all have to be escaped with &amp; to keep the site valid, even when they're in URLs.

UPDATE:

I just tried putting & in a forum title, and the board converts it to &amp; fine. It looks like the way SMF handles that issue is inconsistent. It ought to check everything, but it looks as though some things are checked and fixed, and some aren't.

UPDATE:

I found this post

http://www.simplemachines.org/community/index.php?topic=306815.0

If I understand it correctly, it seems that this might be a pervasive bug in SMF, and it's not related exclusively to conversions from other software.

Red G. Brown

I found another bug, but I don't know if it's related to conversion or not. My VB forum used BB code, but I tried to have BB code turned off in my new SMF forum. When it's completely off, paragraph formatting isn't obeyed in displaying the posts. If I "modify" the post, I can see clearly that the formatting is still there, just not being displayed properly. If I then save the "modification" without actually changing anything, it fixes it. If I turn just one BB code item on, that fixes it too. Is this a bug in SMF as a whole, or just in the conversion?

Red G. Brown

I've found another problem. Even in post text, reserved HTML characters are not being converted to HTML entities. For example, someone saying in their post " 5 > 4" will cause a validation error on the page. Here's an example:

http://www.livebusinesschat.com/smf/index.php/topic,85.0.html

Is this a problem with an SMF bug, or is it a problem with the conversion?

Norv

It's definitely a problem with the conversion; I will look into it as soon as possible. You can see for yourself that if you post a new reply saying " 5 > 4", SMF will handle it. It seems it's only in the converted posts that it's not saved properly. I don't have reports for any other converter, so it might be a problem restricted to this one.
I will have to see if I can replicate it. Sorry for the delay on this one.

Red G. Brown

OK, thank you. Please let me know how to fix this. It's really giving me headaches, since I'm getting bad data out all over the place.

Red G. Brown

The main page is suddenly not validating anymore. It says there's a misplaced </table> tag that's probably nested improperly. I would be willing to bet that's because of the HTML entity problem in the conversion, where we already know there's a > character that didn't get converted. See for yourself:

http://www.livebusinesschat.com/smf/index.php

Norv

#48
Please try using the converter for 3.6 or even 3.7. I'm not aware of database changes between these vBulletin versions (I can't be sure there are none, since I don't have a 3.5 version myself, but I used the 3.7 converter successfully for both vBulletin 3.6 and 3.8). Still, those converters have been fixed to work without such issues.

Red G. Brown

#49
The forum has been active since before this bug was discovered. Is there a way to do this without losing the current data set?

UPDATE:

Since it seems to be only a few rare instances of the < and > characters causing me problems, would it be safe to make a backup of the sql file, put the forum in maintenance mode, and then do a search and replace on the SQL file? Should I replace the characters with HTML entities?

Norv

That's what I would do, yes: replace the occurrences directly in an SQL backup file.

There is no easy way to redo the conversion and correctly reimport the posts, members, etc. made in the meantime (it's not impossible, but it would take quite a lot of manual database work).
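The search and replace discussed above could be sketched roughly like this in Python. This is only an illustration under some assumptions: `<br />` is taken to be the only HTML tag that legitimately appears in a stored post body, and the function name `escape_stray_brackets` is made up here. It deliberately leaves `&` alone so that entities already present are not escaped twice. Work on a backup and check each change by hand.

```python
import re

# Assumption: <br /> (in any of its spellings) is the only tag to preserve.
# The capturing group makes re.split() keep the tags in its result.
BR_TAG = re.compile(r"(<br\s*/?>)")

def escape_stray_brackets(body: str) -> str:
    """Replace < and > with entities everywhere except inside <br /> tags."""
    pieces = BR_TAG.split(body)
    for i, piece in enumerate(pieces):
        # Odd indexes hold the captured <br /> tags; leave them untouched.
        if i % 2 == 0:
            pieces[i] = piece.replace("<", "&lt;").replace(">", "&gt;")
    return "".join(pieces)

print(escape_stray_brackets("[23:21] <nickname> hey...<br />"))
# [23:21] &lt;nickname&gt; hey...<br />
```

Running it a second time on the same text changes nothing, since no bare < or > is left outside the preserved tags.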

Red G. Brown

#51
OK, since there are few occurrences, I should be able to manually check each replacement to make sure it's good before proceeding. I should replace with HTML entities of the form &amp; ? Or, should it be something like #039; ? I'm not sure what the difference is. I converted to UTF-8 from ISO 8859-1 (I think).

UPDATE:

I just had an idea. SMF should include this in its "check for errors" maintenance system. Right now it tells me there are no errors, when clearly there are. This functionality should be added to at least identify such problems in the future (repair is easy after that).

Norv

#52
I just tested: I can actually reproduce the problem on a 3.6 database even with the latest converter, so switching converters wouldn't help for now. Hopefully a fix can be considered for the next version of the converter.
Perhaps you're right that there are generally rather few cases, and it simply hadn't been reported yet, so we didn't know about this bug.
In any case, SMF stores &lt; and &gt; in the raw database data for any occurrence of '<' and '>', and expects to find them in that form in the database.

UPDATE:
Quote from: qwasty on October 23, 2009, 04:38:52 PM
I should replace with HTML entities of the form &amp; ? Or, should it be something like #039; ? I'm not sure what the difference is. I converted to UTF-8 from ISO 8859-1 (i think).
Either is okay (it's simply the name, for those entities which have names, or the underlying character code). There is also a task in the maintenance area of the forum to convert HTML entities to UTF-8 characters; it might be useful to run it afterwards.
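To illustrate the point about names versus codes, a small Python sketch shows that a named entity and its numeric form decode to the same character. The helper `same_character` is made up for this example and has nothing to do with SMF itself.

```python
import html

def same_character(named: str, numeric: str) -> bool:
    """True when both entity spellings decode to the same character."""
    return html.unescape(named) == html.unescape(numeric)

print(same_character("&lt;", "&#60;"))   # True
print(same_character("&amp;", "&#38;"))  # True
print(html.unescape("&#039;"))           # '
```

The apostrophe is a case where the numeric form is the common choice, because its named form (&apos;) was not part of older HTML versions.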

Red G. Brown

#53
I think I might have found another bug, and I don't know if I should change it or not. Usually I see \r\n\r\n in the database everywhere that someone makes a new paragraph. I just stumbled across the (wrong) HTML equivalent, <br /><br />, in the actual database, where I think it should be \r\n\r\n instead. Or vice versa. It looks like the converter missed something involving that, correct? What should be changed, and what should it be changed to, if anything?

It doesn't appear to be related, but I made a post about a similar topic here:

http://www.simplemachines.org/community/index.php?topic=344103.msg2324495#msg2324495

In that topic, I think the problem is in one of the functions in SSI.php, not in the actual operation of the forum, or the conversion process, like in this case. The similarities might just be coincidental.

UPDATE:

Now that I look closer at the database, it seems only the posts made with SMF originally, after the conversion, are using <br />. It appears \r\n was used in vBulletin, but never got converted to the SMF way. If that's correct, then my related post mentioned above suddenly becomes relevant.
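If the goal is to bring the converted posts in line with how posts written in SMF appear to be stored, the normalization could look roughly like the Python sketch below. This is an assumption-laden illustration, not what the converter does: the function name `newlines_to_br` is made up, and it operates on the actual stored text (real carriage return and line feed characters), not on the escaped \r\n sequences as they appear in an SQL dump file.

```python
def newlines_to_br(body: str) -> str:
    """Replace line endings with <br /> tags, one tag per line ending."""
    # Handle \r\n first so a Windows line ending becomes one tag, not two.
    return (
        body.replace("\r\n", "<br />")
            .replace("\r", "<br />")
            .replace("\n", "<br />")
    )

print(newlines_to_br("first paragraph\r\n\r\nsecond paragraph"))
# first paragraph<br /><br />second paragraph
```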

<br /> is fine for line endings, but <br /><br /> ought to be replaced with <p></p> tags around the block that was ended with <br /><br />. That's a flaw, if not a bug per se, and is relevant to conversion only because the converter didn't keep everything consistent, flawed or otherwise.

UPDATE:

I found some more problems with the conversion. This bizarre undefined character, , shows up in one of the posts and breaks validation on this page:

http://www.livebusinesschat.com/smf/index.php/topic,16.0/all.html

I don't know where it came from or why it's there, but I don't think it was there before the conversion. Undefined characters are another thing the "find errors" function ought to find. They could be a sign of data corruption, although in this case that does not appear to be so.
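A check of the kind suggested above might be sketched like this in Python. It is only an illustration, not an existing SMF feature: the function name `find_suspect_chars` and the choice of which character categories count as suspect are assumptions made for this example.

```python
import unicodedata

# Assumption: control, unassigned, private-use and surrogate characters
# are the "undefined" ones worth flagging in a post body.
SUSPECT_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}

def find_suspect_chars(text: str) -> list:
    """Return (offset, code point, category) for each suspicious character."""
    suspects = []
    for offset, ch in enumerate(text):
        if ch in "\r\n\t":
            continue  # ordinary whitespace controls are expected in posts
        category = unicodedata.category(ch)
        # U+FFFD is the replacement character left behind by bad decoding.
        if category in SUSPECT_CATEGORIES or ch == "\ufffd":
            suspects.append((offset, f"U+{ord(ch):04X}", category))
    return suspects

print(find_suspect_chars("ok\x1a!"))  # [(2, 'U+001A', 'Cc')]
```

Reporting the offset and code point makes it possible to look at the exact spot in the post before deciding whether to delete or replace the character.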

The other problem I have found looks like BB code that doesn't do anything, like [ right ][/ right ], which probably ought to have been removed during the conversion or the settings changes, but wasn't. And then there are odd things like BB code closing tags that don't seem to have an opening tag. I'm not certain, but it looks like some or all opening [ quote ] tags have been lost too. It looks like the converter tried to remove them, but only got half of it right, as with [/ b ]. And one last problem seems to be BB code that nothing gets done with, like [ snapback ][/ snapback ]. All of those issues are visible here:

http://www.livebusinesschat.com/smf/index.php/topic,9.0/all.html
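One way to locate closing tags without an opener (or openers without a closer) is sketched below in Python. This is a simplified illustration rather than SMF's own BB code parser: the function name `find_unbalanced_bbcode` is made up, and it assumes every tag is meant to be paired, so tags that never take a closer would show up as false positives. Note that the tags in the post above were typed with extra spaces so the forum would not interpret them; the sketch looks for the real form without spaces.

```python
import re

# Matches [tag], [tag=value], [tag attr=value] and [/tag].
TAG = re.compile(r"\[(/?)([a-zA-Z]+)(?:[ =][^\]]*)?\]")

def find_unbalanced_bbcode(body: str) -> list:
    """Return (kind, tag name, offset) for every unmatched BB code tag."""
    open_tags = []   # (name, offset) of openers still waiting for a closer
    problems = []
    for match in TAG.finditer(body):
        name = match.group(2).lower()
        if match.group(1) != "/":
            open_tags.append((name, match.start()))
            continue
        # Closing tag: pair it with the most recent opener of the same name.
        for idx in range(len(open_tags) - 1, -1, -1):
            if open_tags[idx][0] == name:
                del open_tags[idx]
                break
        else:
            problems.append(("unopened", name, match.start()))
    problems.extend(("unclosed", name, pos) for name, pos in open_tags)
    return problems

print(find_unbalanced_bbcode("bold[/b] text"))  # [('unopened', 'b', 4)]
```

Running this over each post body would give a list of posts to inspect by hand, which seems safer than removing tags automatically.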

How do I get all of these problems fixed? I managed to fix the < and > html entities by hand, and it doesn't appear that I've accidentally screwed up the <br /> tags that SMF is gratuitously littering everywhere, but there weren't many of those, and all of the postings with the problem were from VB so there were no competing < and > characters to interfere with find and replace. I'm not sure how to fix these new problems, since I'm not sure what's going on with them.

Can I add here that, after this experience, I have found \r\n much preferable to HTML tags in the database, since it makes it SO MUCH easier to find and fix problems without HTML everywhere. The HTML ought to be added after reading the data from the database.

Red G. Brown


Norv

Sorry for the delay. I've been dealing with other things, but I'll make time for this converter in the coming days. On some issues I am quite at a loss as to how to replicate the problems (like the stray closing "[/b]"; I'm not sure why that is there, and I wonder what the full original text of that post is). Other issues, however, should be dealt with, as HTML entities should generally be replaced properly.
I'll post an update as soon as possible.

Red G. Brown

Here's my original VB database backup that I converted from, to SMF:

http://www.livebusinesschat.com/vbulletinbackup.zip

Let me know when you've got it so I can take it down.

Norv


Red G. Brown

OK, I've taken it down. Let me know what you find out.

Norv

