Converter shiz: WTF does this mean?

Antechinus · April 17, 2022, 03:32:49 AM

Ok, since I am apparently daft enough to try a conversion from phpBB (may its innards rot in Hell) to SMF 2.0.x (required step on the way to 2.1.x)...

...with a converter that is already known to not be fully functional...

...I wanna know some stuffz. Like, what does this mean, exactly?

Code:

$row['body'] = preg_replace(
	array(
		'~\[b\:(.+?)\]~is',
		'~\[/b\:(.+?)\]~is',
	),
	array(
		'[b]',
		'[/b]',
	), $row['body']);

That is a shortened example of the whole kaboodle, just to get my head around the basics.

ETA: What I am trying to understand (and yes, I have searched the web) is what the ~\ means, and what the \: (.+?)\ means.

I assume the ~\ means something like "any crud in front of this" but do not understand how it decides how much crud (since it should only target the actual phpBB B tag and not everything in the post).

I have no idea what the \: (.+?)\ is doing there (since in practice that chunk of the phpBB tag is ever only [ b ] and nothing else).

I also do not get what ~is is supposed to be doing, exactly.

phpBB syntax for a basic b tag is this:

Code:

<B><s>[b]</s>Actual text content here<e>[/b]</e></B>

Which for SMF has to end up as this...

Code:

[b]Actual text content here[/b]

...but apparently doesn't, if all the grumbling in the converter board is true.

Edited for typos.

live627 · April 17, 2022, 03:57:27 AM

Looks like two regexes to convert [b:stuff] to [b]

live627 · April 17, 2022, 04:22:34 AM

\[ matches the character [
b matches the character b
\: matches the character :
1st Capturing Group (.+?)
- . matches any character
- +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
\] matches the character ]

Global pattern flags

i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
s modifier: single line. Dot matches newline characters
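
A minimal sketch of that breakdown in action, assuming an old-style phpBB body where every tag carries a ":code" suffix. The sample string is made up for illustration; the two patterns are the ones from the first post.

```php
<?php
// Hypothetical sample: old-style phpBB body, mixed case on purpose.
$body = '[b:abcd1234]Bold text[/b:abcd1234] and [B:abcd1234]more[/B:abcd1234]';

$converted = preg_replace(
	array(
		'~\[b\:(.+?)\]~is',   // [b:anything]
		'~\[/b\:(.+?)\]~is',  // [/b:anything]
	),
	array(
		'[b]',
		'[/b]',
	), $body);

// The "i" modifier is why [B:...] is caught as well as [b:...].
echo $converted, "\n";   // [b]Bold text[/b] and [b]more[/b]
```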

Arantor · April 17, 2022, 04:25:47 AM

Yes, it means [b:stuff] where stuff can be anything. phpBB went through a phase of writing tags as [b:abcd1234] where the alphanumeric code was per-tag for some reason.

~ marks the start of the expression, then the expression itself happens, then the second ~ says "right, I'm done defining the regular expression" and the letters after it indicate what options to apply. i means "case insensitive" and s means "dots match all characters including line breaks" (something they do not do by default; dots normally match any character except a newline).

~\[b\:(.+?)\]~is means literal [, literal b, literal colon, then match any character 1 or more times but the fewest number of times you can before you hit a literal ]. (Otherwise it would match from the first instance of [b: until the *last* ] rather than the first, which is what this is doing.)
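
To see the "fewest number of times" part in practice, here is a small comparison of the lazy pattern against a greedy variant. The sample body and the greedy pattern are made up for the comparison; only the lazy pattern comes from the converter.

```php
<?php
// Made-up sample with two tags, so the difference is visible.
$body = '[b:aaa]one[/b:aaa] [b:bbb]two[/b:bbb]';

// Lazy (.+?): stops at the first ] it can reach, so each opening tag is matched on its own.
$lazy = preg_replace('~\[b\:(.+?)\]~is', '[b]', $body);

// Greedy (.+): runs on to the last ] in the string, swallowing everything in between.
$greedy = preg_replace('~\[b\:(.+)\]~is', '[b]', $body);

echo $lazy, "\n";    // [b]one[/b:aaa] [b]two[/b:bbb]
echo $greedy, "\n";  // [b]
```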

Crazy amount of complexity, very small space. Go figure the Perl community would come up with that.

Antechinus · April 17, 2022, 04:32:04 AM

Ok, thanks, both of you.

Yeah I figured it looked crazy bonkers for what it had to do.

So really, you could do it like this:

Code:

$row['body'] = preg_replace(
	array(
		'<B><s>[b]</s>',
		'<e>[/b]</e></B>',
	),
	array(
		'[b]',
		'[/b]',
	), $row['body']);

That should work, yes? Since it is exactly what it has to do in practice. Or is that likely to screw up somehow?
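
One catch with the snippet above: preg_replace() expects each pattern to be wrapped in delimiters, and as far as I can tell a leading < would be read as a bracket-style delimiter rather than as part of the text to match. For fixed strings with no wildcards, a plain str_replace() is one option. This is only a sketch, using the sample body from the first post, not what the converter currently does.

```php
<?php
// Sample body in the phpBB 3.3.x storage format quoted earlier in the thread.
$body = '<B><s>[b]</s>Actual text content here<e>[/b]</e></B>';

// str_replace() does plain literal matching: no delimiters, no escaping.
// It is case-sensitive, so it only catches this exact spelling of the tags.
$converted = str_replace(
	array('<B><s>[b]</s>', '<e>[/b]</e></B>'),
	array('[b]', '[/b]'),
	$body
);

echo $converted, "\n";   // [b]Actual text content here[/b]
```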

Also, if this converter script needs to be frigged around with anyway, is there any obvious reason why it could not be made to convert straight to 2.1.x?

Arantor · April 17, 2022, 04:33:41 AM

I'd leave both in there to be honest, as phpBB isn't stellar at updating legacy posts.

It would be nice to make the importer go direct to 2.1, I don't remember how much work that is.

Antechinus · April 17, 2022, 04:42:47 AM

Ok but, from the grumblings in the other board, the catch is that the current code is skipping a lot of crud. For example it's leaving in <s > and </s > tags, which the browser renders as standard HTML strikethrough text, and it also leaves <e > and </e > tags, which are just pointless crud in HTML.

So leaving the existing preg_replace is just going to create problems with post content after conversion. It really needs to be cleaned up somehow.

ETA: I suspect it is also leaving the uppercase B tags as well, which in practice has no ill effect, since HTML is case-insensitive and nested b tags are just bold anyway. But still, it's extra crud in the markup which does not need to be there, and really should be dealt with by the conversion script.
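
A rough sketch of what that clean-up could look like, assuming the lowercase s and e tags only ever appear as markers around the BBC tags themselves and that the uppercase B only wraps bold spans. Those are assumptions taken from the one sample in this thread and are worth checking against a real dump before trusting.

```php
<?php
// Sample body in the phpBB 3.3.x storage format quoted earlier in the thread.
$body = '<B><s>[b]</s>Actual text content here<e>[/b]</e></B>';

// Step 1: drop the lowercase <s>/<e> marker tags but keep what they wrap.
// No "i" modifier here, so only the exact lowercase forms are touched.
$body = preg_replace('~</?[se]>~', '', $body);

// Step 2: drop the uppercase <B> wrapper around the whole bold span.
$body = preg_replace('~</?B>~', '', $body);

echo $body, "\n";   // [b]Actual text content here[/b]
```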

Sesquipedalian · April 17, 2022, 04:47:38 AM

Before you try to mess with those regular expressions, you really should understand what the current ones do. Futzing around with them when you don't is a recipe for pain.

https://www.php.net/manual/en/reference.pcre.pattern.syntax.php

https://www.regular-expressions.info/

Antechinus · April 17, 2022, 04:51:00 AM

Ahem. 'Tis SMF coding. Pain is inevitable.

But ok, fair point. Will take a look at the lynx.

live627 · April 17, 2022, 05:13:51 AM

a good online tester that I use a lot is https://regex101.com/

Antechinus · April 17, 2022, 05:23:25 AM

Ta. Will take a look at that too.

Was also thinking the smart move is probably to set up a new phpBB 3.3.x installation with minimal content, just to get all the BBC tags sorted as a starting point. Should convert in next to no time, and then comparisons can be made.

Arantor · April 17, 2022, 05:26:58 AM

This one isn't SMF's fault

Antechinus · April 17, 2022, 05:31:24 AM

Meh.

As long as the damned thing ends up working one way or another, it doesn't much matter.

Antechinus · April 17, 2022, 05:51:13 AM

Speaking of which, there are also the infamous quote tags. The current script does this:

Code:

$row['body'] = preg_replace(
    array(
        '~\[quote=&quot;(.+?)&quot;\:(.+?)\]~is',
        '~\[quote\:(.+?)\]~is',
        '~\[/quote\:(.+?)\]~is',
    ),
    array(
        '[quote author="$1"]',
        '[quote]',
        '[/quote]',
    ), $row['body']);

Which is clearly inadequate, since the phpBB code allows a link to the original (quoted) post, and SMF allows this too, but the conversion script is limiting it to author only, with no link to the original post.

So that needs extending, to allow a better conversion. And no, I'm not asking for a solution on a platter at this stage. I'm pretty sure I can figure it out with enough swearing. It doesn't look particularly complex, and it should be done.

Probably can't pull the date of the original, like SMF does, because phpBB quote tags do not seem to include that information, but a functioning link would be sufficient.
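
For reference, this is what the current quote patterns produce on an old-style quote. The sample body (author "Bob", suffix "abc123") is invented for illustration; the patterns and replacements are copied from the script above.

```php
<?php
// Made-up old-style phpBB body: quoted author name plus a ":code" suffix on each tag.
$body = '[quote=&quot;Bob&quot;:abc123]Hello there[/quote:abc123]';

$converted = preg_replace(
    array(
        '~\[quote=&quot;(.+?)&quot;\:(.+?)\]~is',
        '~\[quote\:(.+?)\]~is',
        '~\[/quote\:(.+?)\]~is',
    ),
    array(
        '[quote author="$1"]',   // $1 is the first capture group (the author name)
        '[quote]',
        '[/quote]',
    ), $body);

// Only the author survives; nothing here carries a link or a date across.
echo $converted, "\n";   // [quote author="Bob"]Hello there[/quote]
```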

Antechinus · April 17, 2022, 06:38:47 PM

Speaking of times: how does the ten figure timestamp in SMF quote headers work?
Example from a test site:

Code:

	[quote author=antechinus link=topic=1.msg3#msg3 date=1420864115]
		The custom textareas for some of the settings are potentially a security risk.
	[/quote]

Yep, have looked around the web for PHP and SQL ten figure time codes, but found nothing that makes sense.

(Oh and it turns out that some phpBB quote headers do contain a similar timestamp, so that should be doable as part of a conversion.)

Arantor · April 17, 2022, 06:41:02 PM

The 10-figure thing is a UNIX timestamp. It's the number of seconds since 1 January 1970 00:00:00 UTC, and there are plenty of libraries around to shuffle dates to and from that format. It's the same format that SMF stores against almost any time measurement in the database.
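
A quick sketch of converting in both directions with PHP's built-in date functions, using the timestamp from the quote header example above. The variable names are just for illustration.

```php
<?php
// Timestamp taken from the quote header example above.
$timestamp = 1420864115;

// gmdate() formats a UNIX timestamp as UTC, whatever the server timezone is.
$readable = gmdate('Y-m-d H:i:s', $timestamp);
echo $readable, " UTC\n";   // 2015-01-10 04:28:35 UTC

// Going the other way: a UTC date string back to a UNIX timestamp.
$roundTrip = (new DateTimeImmutable($readable, new DateTimeZone('UTC')))->getTimestamp();
echo $roundTrip, "\n";      // 1420864115
```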

Antechinus · April 17, 2022, 06:53:30 PM

Ok thanks. Makes sense (didn't think of looking under Unix). So that would be directly convertible from phpBB to SMF. No problem.

Antechinus · April 18, 2022, 06:13:35 AM

Been digging around a bit...

Quote from: Arantor on April 17, 2022, 04:25:47 AM
Yes, it means [b:stuff] where stuff can be anything. phpBB went through a phase of writing tags as [b:abcd1234] where the alphanumeric code was per-tag for some reason.

I've checked the db dump for the site I want to convert. Earliest post is in late 2006, when phpBB 2.0.x was current. It's possible the forum started off with a beta of 3.0.x (first RC wasn't released until May 2007). I can check that with the old admin.

So, the thing is the BBC tags in the earliest posts exactly match the syntax of the same tags in the latest posts. IOW, no literal colon and no alphanumeric stuff per tag. This indicates that phpBB maybe haven't been doing anything that silly for many years. Or, alternatively, they were being that silly more recently (wouldn't surprise me) but somehow the upgrade script has dealt with the problem.

Which then leads to: since the current regex is looking to match a literal colon (in all BBC tags) which will never exist, the current converter is specifically set up to not match any existing phpBB BBC tags. It's specifically set up to skip them entirely and leave broken markup everywhere. Which, according to all the grumbles in other threads, is exactly what it does. So it's working perfectly, if perfectly means "totally screwing everything, as intended".
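
That is easy to confirm with the b tag patterns from the first post and the 3.3.x sample body: with no colon in the tag, neither pattern matches and the body comes back untouched.

```php
<?php
// Sample body in the phpBB 3.3.x storage format quoted earlier in the thread.
$body = '<B><s>[b]</s>Actual text content here<e>[/b]</e></B>';

$converted = preg_replace(
	array(
		'~\[b\:(.+?)\]~is',   // needs a literal colon after the b
		'~\[/b\:(.+?)\]~is',
	),
	array(
		'[b]',
		'[/b]',
	), $body);

// Nothing matched, so nothing was replaced.
var_dump($converted === $body);   // bool(true)
```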

Anyway, point is that all of the current code for converting BBC tags is utterly useless for the (3.3.*) site I want to convert. It all has to be rewritten. Which makes me wonder when this converter script was written, and why it has taken until 2022 for someone to notice the bleeding obvious.

Way to go, SMF team.

And if all of the code for converting BBC tags has to be rewritten it's a safe bet there will be other parts of the code that need rewriting too. In which case, not making it convert straight to 2.1.x really would be kinda stupid. It can't be that much more work, really.

Antechinus · April 18, 2022, 06:42:46 PM

Just making a list of all potential crapfights, and there's this:

Quote from: shadav on December 01, 2021, 02:50:00 PM
...what I found was it was mostly url's, embedded videos, and images that got messed up for me it threw in random s, r, and e tags

The "s" and "e" tags are obvious: those are part of the current phpBB tag syntax. The "r" tags are not, but I know what they are. Example (completely innocuous) from a phpBB 3.3.x db dump follows:

Code:

(46745, 4261, 1, 1640, 0, '50.45.209.43', 1572934068, 0, 1, 1, 1, 1, '', 'Re: A few small changes.', '<r>Thanks for the strike code. <E>:D</E><br/>\n<br/>\nOff-topic: What are the restrictions on the advanced search and ignored words? Some words like \"industry\" are ignored. Which maybe is fine because who would search just that. But sometimes when I do a search I want to put something like \"industry profit\" and it\'s not possible from what I can tell. Is that something that saves server bandwidth (a good thing). I have no idea, just wondering if the update did anything for search capabilities?</r>', 'd221291048038a68825e43feffef919c', 0, '', '1mtkv0nj', 1, 0, '', 0, 0, 0, 1, 0, '', 0, 0),

The "r" tags appear to be phpBB's way of signalling the beginning and end of another row in the posts table, for the post content itself. At least, that's the only logic I can think of for choosing "r" there (obviously each db row includes all the other crud, before and after the actual post content).

Anyway, point is that these are being missed by the current conversion code. Getting rid of them is simple enough. It's just not being done at the moment.
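
Something along these lines would strip the outer wrapper. It is a sketch only, assuming the r tags always sit at the very start and very end of the stored body, as in the dump above. The sample body is a shortened, partly made-up version of that row.

```php
<?php
// Shortened stand-in for the post body from the dump above.
$body = "<r>Thanks for the strike code. <E>:D</E><br/>\nOff-topic: What are the restrictions?</r>";

// Remove only the outermost <r> wrapper: ^ and $ anchor to the start and end of the body,
// so an "<r>" appearing anywhere else would be left alone.
$body = preg_replace('~^<r>|</r>$~', '', $body);

echo $body, "\n";
```

The E and br tags are deliberately left in place here; they would need their own handling.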

Arantor · April 18, 2022, 07:00:09 PM

I don't think it means that at all. Another row in the posts table would be, well, another row in the posts table, not multiple things inside one row, even phpBB isn't that out there.
