Converter shiz: WTF does this mean?

Started by Antechinus, April 17, 2022, 03:32:49 AM

Antechinus

Lol. Well whatever the reason, they start and end post content with a custom "r" tag, and the converter is letting it through, so it needs to be rewritten to kill those tags. No big deal, just another thing on the list.

Antechinus

Aha! Just checked a bit further. These guys are even more bonkers than I suspected.

The post content appears to have an <r> and </r> at the beginning and end of every alternate post. At least in some parts of the db. I suspect it gets assigned when the post is submitted, so that if a post is deleted later the sequence for the entire db is not updated.

Anyway, first post in the db has actual post content that begins and ends with <r> and </r>. Alternating posts in some parts of the db have actual post content that begins and ends with <t> and </t>. So that is four superfluous tags the converter needs to eliminate, not just two. Still easy enough to do.
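A minimal sketch of killing those wrappers, assuming the tags only ever appear as the very first and very last thing in the post body. The function name `strip_wrapper_tags` is made up for illustration and is not part of the converter:

```php
<?php
// Hypothetical helper: removes a leading <r> or <t> and a trailing </r> or </t>.
// Assumes the wrapper tags sit at the exact start and end of the post body.
function strip_wrapper_tags(string $body): string
{
    return preg_replace(
        array('~^<[rt]>~', '~</[rt]>$~'),
        '',
        $body
    );
}

echo strip_wrapper_tags('<r>Hello <B>world</B></r>'), "\n";
echo strip_wrapper_tags('<t>Plain post</t>'), "\n";
```

Anchoring with `^` and `$` leaves any tags in the middle of the post alone, so the rest of the converter still gets to deal with them.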

Antechinus

Oh yeah, re possibly sorting this thing to go straight to 2.1.x: does anyone know offhand if edits would be required to convert.php?

Or would convert.php be ok as is, with all the work to be done limited to the phpbb_to_smf.sql?

ETA: Well, apart from a couple of obvious minor details, like minimum PHP and MySQL version. And IIRC 2.1.x doesn't support PostgreSQL, so that would be dropped from the convert.php globals.

Arantor

Quote from: Antechinus on April 18, 2022, 08:55:54 PM
And IIRC 2.1.x doesn't support PostgreSQL

Yes it does; it was only SQLite that was dropped.

The converter really deserves a lot more time and effort than anyone can give it right now. Though honestly I feel like throwing the time and energy at OpenImporter to make that more awesome would be better than trying to patch up the converters because god knows how old and crusty they really are.

Antechinus

Hmm. Well I'm prepared to have a crack at the converter, just so I can use it myself, but as you can tell I'll need some guidance here and there. And if making it go straight to 2.1 is really going to be a drama, going from 2.0 to 2.1 within SMF is a piece of cake anyway.

Personally I'm wary of looking at OI, because at the moment it doesn't have any version of SMF as a destination, so that is probably well above my pay grade unless I go to Elk instead. Whereas I can probably hack the existing phpBB_to_smf converter to work well enough for my purposes.

ETA: Although (obvious stupid thought which will probably not work) since early Elk was basically 2.1 Alpha anyway, with the same db structure, would it be relatively easy to adapt the existing OI Elk script to SMF 2.1? IIRC db changes are not much, and for conversions it's presumably the db changes that count. Things like BBC syntax weren't changed.

Antechinus

Question about the 2.0.x regex for font-size, in Subs.php:
array(
	'tag' => 'size',
	'type' => 'unparsed_equals',
	'test' => '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9]|[1-9](\.[\d][\d]?)?)?em)\]',
	'before' => '<span style="font-size: $1;" class="bbc_size">',
	'after' => '</span>',
),

Just checking that I understand this correctly: is it saying that for font-size values < 1em it will only accept a single decimal place (ie: .9em, but not .85em) while for font-size > 1em it will accept two decimal places (ie: 1.85em will work)?

Arantor

So...

* 1-9 followed by an optional digit followed by px or pt
* or small (optionally followed by er)
* or large (optionally followed by r)
* or x-small, x-large, xx-small, xx-large
* or medium
* or 0. with 1 decimal place followed by em
* or 1-9, optionally followed by a decimal point and one or two digits, followed by em

Yup, you have it right.

If you want 0.85em I think this should do it:
array(
	'tag' => 'size',
	'type' => 'unparsed_equals',
	'test' => '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9][\d]?|[1-9](\.[\d][\d]?)?)?em)\]',
	'before' => '<span style="font-size: $1;" class="bbc_size">',
	'after' => '</span>',
),
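A quick way to see the difference between the two 'test' patterns is to run a few values through both. This is only a sketch: `size_is_accepted` is a made-up helper, and it assumes the parser anchors the test at the start of whatever follows `[size=`, which is roughly how Subs.php appears to use it.

```php
<?php
// The two 'test' patterns from the posts above, copied verbatim.
$original = '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9]|[1-9](\.[\d][\d]?)?)?em)\]';
$modified = '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9][\d]?|[1-9](\.[\d][\d]?)?)?em)\]';

// Hypothetical helper: checks a size value the way the parser is assumed to,
// by anchoring the test at the start of "value]".
function size_is_accepted(string $test, string $value): bool
{
    return preg_match('~^' . $test . '~', $value . ']') === 1;
}

foreach (array('0.9em', '0.85em', '1.85em', '12px') as $value) {
    printf(
        "%-7s original: %-3s modified: %s\n",
        $value,
        size_is_accepted($original, $value) ? 'yes' : 'no',
        size_is_accepted($modified, $value) ? 'yes' : 'no'
    );
}
```

Both patterns require the leading zero, so a value written as `.9em` would be rejected either way.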

Antechinus

Ok cool. At least I understood it all. This is progress. :D

I'm not sure I need two decimal places for < 1em, but OTOH it is an odd inconsistency in the default code. Offhand it seems allowing .75em makes as much sense as allowing 1.75em.

I did test .85em on a 2.0.x test site a while back, and noticed that it appeared to just default to 1em. Now I know why. That value would be dropped by the regex, resulting in no change in font-size.

Reason I asked is that the phpBB db I'm looking at contains some examples of size = 85%, and a few other odd sizes. I may just write something that changes them to a more convenient value (SQL query before conversion, or find/replace on the db dump, or new regex in hacked converter).
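For the "new regex in hacked converter" option, something along these lines might do. It is a sketch under assumptions: that the odd sizes show up as `[size=NN]` with NN being a percentage, and that one decimal place is enough. `percent_size_to_em` is a made-up name, not anything in the converter.

```php
<?php
// Hypothetical helper: rewrites [size=NN] (assumed to be a percentage) as an em value
// with one decimal place, so the stock 2.0.x test regex would accept it.
function percent_size_to_em(string $body): string
{
    return preg_replace_callback(
        '~\[size=(\d{1,3})\]~i',
        function (array $m): string {
            // Round to the nearest tenth using integer maths (85 -> 9 tenths -> 0.9em).
            $tenths = intdiv((int) $m[1] + 5, 10);
            return '[size=' . intdiv($tenths, 10) . '.' . ($tenths % 10) . 'em]';
        },
        $body
    );
}

echo percent_size_to_em('[size=85]small print[/size]'), "\n";
```

Very small or very large percentages would still end up outside what the test regex accepts, so those would need separate handling.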

ETA: On second thought, the easiest option would be to just change the test regex in Subs.php. Less code to mess with, less chance of anything screwing up, and more versatility for the future.

Antechinus

Ok, next question. This:
'~\[b\:(.+?)\]~is',
I know what it is saying now, but I don't understand why it was written to look for literal [b then a literal colon. Seems to me that this would do exactly the same job:
'~\[b:(.+?)\]~is',
Am I missing something?

Arantor

There are times when : can make up part of an expression in regex syntax (namely the ?: and its variant syntaxes) so escaping it with the backslash makes it very clear it's not part of any of those, not even accidentally.
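A small check that the escaped and unescaped colon behave identically here. The sample string and its `:3f9a1c` suffix are made up for the test:

```php
<?php
// Both patterns from the question above, copied verbatim.
$escaped   = '~\[b\:(.+?)\]~is';
$unescaped = '~\[b:(.+?)\]~is';

// Made-up sample input where the tag carries a ":uid" style suffix.
$sample = 'Some [b:3f9a1c]bold[/b:3f9a1c] text';

preg_match($escaped, $sample, $a);
preg_match($unescaped, $sample, $b);

// true when both patterns produced the same match and the same capture group
var_dump($a === $b);
```

So the backslash is purely defensive in this pattern: it changes nothing for the regex engine, it just makes the intent obvious to whoever reads it next.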

Tyrsson

I cannot even begin to believe what I'm reading in this thread.... Ant, deep diving into the world of hardcore PHP.... Never thought I would see it.... Historically it was always.... "*#$*#@* that, I can't be bothered with *#&@^$ *#*#($ #&#&@ *# $$*$*$ that s*#t!!!!" and that was on his good days roflmao..

So far, well done Ant!!!  :D  :D

If you can get your head around regex... You can do just about any of it.
PM at your own risk, some I answer, if they are interesting, some I ignore.

Antechinus

:P :P :P :P :P :P :P

I've always been fine with learning whatever PHP I needed to do whatever job I needed to do at the time. I just don't want to waste time learning everything about PHP all in one go, when I only want to get something specific sorted so I can move on.

Sesquipedalian

I promise you nothing.

Sesqu... Sesqui... what?
Sesquipedalian, the best word in the English language.

Antechinus

Bonzer. Swing past on yer Tarzan rope and have a squiz at this. :D

phpBB quote tag, for an all-bells-and-whistles quote, goes like this:
<QUOTE author=\"Username\" post_id=\"12345\" time=\"1234567890\" user_id=\"1234\"><s>[quote=Username post_id=12345 time=1234567890 user_id=1234]</s>
Actual quoted text.
<e>[/quote]</e></QUOTE>

So obviously to convert that you need to grab the username, post_id and time (user_id is not relevant to SMF quotes). And with the \ already being in the db to escape all the " that means both the \ and the " need to be designated as literal in the regex. Ends up with quite a few backslashes running around like headless chickens. :P

As far as I can figure, it should look like this in the converter SQL file:
$row['body'] = preg_replace(
	array(
		'~\<QUOTE author=\\\"(.+?)\\\" post_id=\\\"(.+?)\\\" time=\\\"(.+?)\\\" user_id=\\\"(.+?)\\\"\>\<s\>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)\]</s>~s',
		'<e>[/quote]</e></QUOTE>',
	),
	array(
		'[quote author=$1 link=msg=$2 date=$3]',
		'[/quote]',
	), $row['body']);

This seems to not give the magic regex checker any indigestion. The closing tag shouldn't need regex, because it's always exactly the same content.
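To keep track of the headless chickens, it helps to see what actually reaches the regex engine. A sketch, assuming the backslashes really are stored literally in the post text (the sample string is made up):

```php
<?php
// In a single-quoted PHP string only \\ and \' are escape sequences,
// so '\\\"' is the three characters \ \ " once PHP has parsed it.
$pattern = '~author=\\\"(.+?)\\\"~';

// PCRE then reads \\ as one literal backslash, followed by a literal quote.
// Sample text as it is assumed to sit in the db, with the backslashes stored literally.
$stored = '<QUOTE author=\"Username\" post_id=\"12345\">';

preg_match($pattern, $stored, $m);

echo strlen('\\\"'), "\n"; // length of the string after PHP has parsed it
echo $m[1], "\n";          // the captured author name
```

PHP strips one layer of escaping and PCRE strips the next, which is why a single backslash in the data needs three in the source.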

A basic find/replace should sort the closing tag, but I'm not sure if it can be done just like that, as part of the preg_replace. Would it need to be done like these?
/* This just does the stuff that it isn't worth parsing in a regex. */
$row['body'] = strtr($row['body'], array(
	'[list type=1]' => '[list type=decimal]',
	'[list type=a]' => '[list type=lower-alpha]',
));

Antechinus

Ok, that seemed to be TMI. :D  Let me put it like this:

Since all current phpBB closing tags will not require regex, is it more sensible to keep them in the same preg_replace array as the opening tags (which will require regex)?

Or is it more sensible to split the closing tags out to another array that uses strtr?

Doesn't faze me either way. Happy to go with whatever is best for performance/sanity/etc.

Sesquipedalian

Your regexes look okay when I read them, but I haven't tested them, so I promise you nothing.

As for dealing with the closing tags, it won't make much difference which way you do it. If you weren't already using any regexes, it would be faster to use simple string replacement. But once you've paid the initial overhead of calling preg_replace(), it won't make any difference whether the PCRE engine or the simple string engine is the one stepping through the string looking for a basic substring.
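For reference, the two approaches side by side on a made-up sample. This is just a sketch to show they give the same output, not a benchmark:

```php
<?php
$body = 'Actual quoted text.<e>[/quote]</e></QUOTE>';

// Plain string replacement: no pattern syntax, nothing to escape.
$viaStrtr = strtr($body, array('<e>[/quote]</e></QUOTE>' => '[/quote]'));

// The same replacement as a regex; preg_quote() escapes the brackets
// and any other characters that are special in a pattern.
$pattern = '~' . preg_quote('<e>[/quote]</e></QUOTE>', '~') . '~';
$viaPreg = preg_replace($pattern, '[/quote]', $body);

var_dump($viaStrtr === $viaPreg);
```

Using `preg_quote()` is optional, but it saves working out by hand which characters in a fixed string need a backslash.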

Antechinus

Ok cool. That's what I wanted to know. Thanks.

And yep, I get I'll have to test my own regex. ;)

Antechinus

Magic regex checker thing says it should be like this:

$row['body'] = preg_replace(
	array(
		'~\<QUOTE author=\\\"(.+?)\\\" post_id=\\\"(.+?)\\\" time=\\\"(.+?)\\\" user_id=\\\"(.+?)\\\"\>\<s\>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)\]</s>~s',
		'~<e>\[\/quote]<\/e><\/QUOTE>',
	),
	array(
		'[quote author=$1 link=msg=$2 date=$3]',
		'[/quote]',
	), $row['body']);

Which is cool. I think I have the hang of escaping all the necessary basics. Time to whip up a hacked copy of the script and a basic test db, and run some live tests. :)

Sesquipedalian

You are missing the closing ~ in your second regex.
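A short demonstration of why that matters, using a made-up sample string. With the delimiter missing, PHP cannot compile the pattern, raises a warning, and `preg_replace()` returns null instead of the text:

```php
<?php
$body = 'text<e>[/quote]</e></QUOTE>';

// Pattern as posted, without the closing ~. The @ only hides the warning for this demo.
$broken = @preg_replace('~<e>\[\/quote]<\/e><\/QUOTE>', '[/quote]', $body);

// Same pattern with the closing delimiter added.
$fixed = preg_replace('~<e>\[\/quote]<\/e><\/QUOTE>~', '[/quote]', $body);

var_dump($broken); // null signals that the pattern failed to compile
var_dump($fixed);
```

As far as I know the same applies when the bad pattern is one entry in an array of patterns, in which case assigning the result straight back to `$row['body']` would wipe the post content. Worth checking for null while testing.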

Antechinus

Yup. I got a little bit more done on them last night. Turns out the syntax can be simpler than I expected. According to the magic regex checker, it's only necessary to escape the first [, and it doesn't require you to also escape the following ] bracket. Also, no need to escape any instances of " AFAICT, or a few other things. This passes the checker:

$row['body'] = preg_replace(
	array(
		'~<QUOTE author=\\"(.+?)\\" post_id=\\"(.+?)\\" time=\\"(.+?)\\" user_id=\\"(.+?)\\"><s>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)]</s>~s',
		'~<QUOTE author=\\"(.+?)\\"><s>\[quote=\\"(.+?)\\"]</s>~s',
		'~<QUOTE><s>\[quote]</s>~',
		'~<e>\[/quote]</e></QUOTE>~',
		'~<B><s>\[b]</s>~',
		'~<e>\[/b]</e></B>~',
		'~<I><s>\[i]</s>~',
		'~<e>\[/i]</e></I>~',
		'~<U><s>\[u]</s>~',
		'~<e>\[/u]</e></U>~',
	),
	array(
		'[quote author=$1 link=msg=$2 date=$3]',
		'[quote author=$1]',
		'[quote]',
		'[/quote]',
		'[b]',
		'[/b]',
		'[i]',
		'[/i]',
		'[u]',
		'[/u]',
	), $row['body']);
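A standalone way to try a few of these outside the converter, as a sketch. The sample post is made up, and only a subset of the patterns is included to keep it short:

```php
<?php
// A subset of the patterns above, copied verbatim.
$patterns = array(
    '~<QUOTE author=\\"(.+?)\\" post_id=\\"(.+?)\\" time=\\"(.+?)\\" user_id=\\"(.+?)\\"><s>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)]</s>~s',
    '~<e>\[/quote]</e></QUOTE>~',
    '~<B><s>\[b]</s>~',
    '~<e>\[/b]</e></B>~',
);
$replacements = array(
    '[quote author=$1 link=msg=$2 date=$3]',
    '[/quote]',
    '[b]',
    '[/b]',
);

// Made-up sample post. The quotes are deliberately NOT backslash-escaped here:
// '\\"' in a single-quoted PHP string reaches PCRE as \" which matches a bare ".
$body = '<QUOTE author="Username" post_id="12345" time="1234567890" user_id="1234">'
    . '<s>[quote=Username post_id=12345 time=1234567890 user_id=1234]</s>'
    . 'Actual <B><s>[b]</s>quoted<e>[/b]</e></B> text.'
    . '<e>[/quote]</e></QUOTE>';

echo preg_replace($patterns, $replacements, $body), "\n";
```

One thing to watch: with two backslashes in the source, these patterns match a plain `"` rather than a backslash followed by `"`. If the quotes really are stored with a leading backslash in the post text, the three-backslash form would be needed instead, so it is worth checking against the actual data rather than the dump file.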