Simple Machines Community Forum

SMF Support => Converting to SMF => phpBB => Topic started by: Antechinus on April 17, 2022, 03:32:49 AM

Title: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 03:32:49 AM
Ok, since I am apparently daft enough to try a conversion from phpBB (may its innards rot in Hell) to SMF 2.0.x (required step on the way to 2.1.x)...

...with a converter that is already known to not be fully functional... :P

...I wanna know some stuffz. Like, what does this mean, exactly?
$row['body'] = preg_replace(
array(
'~\[b\:(.+?)\]~is',
'~\[/b\:(.+?)\]~is',
),
array(
'[b]',
'[/b]',
), $row['body']);

That is a shortened example of the whole kaboodle, just to get my head around the basics.

ETA: What I am trying to understand (and yes, I have searched the web) is what the ~\ means, and what the \: (.+?)\ means.

I assume the ~\ means something like "any crud in front of this" but do not understand how it decides how much crud (since it should only target the actual phpBB B tag and not everything in the post).

I have no idea what the \: (.+?)\ is doing there (since in practice that chunk of the phpBB tag is ever only [ b ] and nothing else).

I also do not get what ~is is supposed to be doing, exactly. 

phpBB syntax for a basic b tag is this:
<B><s>[b]</s>Actual text content here<e>[/b]</e></B>
Which for SMF has to end up as this...
[b]Actual text content here[/b]
...but apparently doesn't, if all the grumbling in the converter board is true. :P

Edited for typos.
Title: Re: Converter shiz: WTF does this mean?
Post by: live627 on April 17, 2022, 03:57:27 AM
Looks like two regexes to convert [b:stuff] to [b]
Title: Re: Converter shiz: WTF does this mean?
Post by: live627 on April 17, 2022, 04:22:34 AM

Global pattern flags
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 17, 2022, 04:25:47 AM
Yes, it means [b:stuff] where stuff can be anything. phpBB went through a phase of writing tags as [b:abcd1234] where the alphanumeric code was per-tag for some reason.

~ marks the start of the expression, then the expression itself happens, then the second ~ says "right, I'm done defining the regular expression" and the letters after it indicate what options to apply. i means "case insensitive" and s means "dots match all characters including line breaks" (something they do not do by default; dots normally mean match "almost" any character)

~\[b\:(.+?)\]~is means literal [, literal b, literal colon, then match any character 1 or more times but the fewest number of times you can before you hit a literal ]. (Otherwise it would match from the first instance of [b: until the *last* ] rather than stopping at the first, which is what this one does.)

Crazy amount of complexity, very small space. Go figure the Perl community would come up with that.
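
A minimal sketch of the lazy-versus-greedy point above. The sample string is made up for illustration; the two lazy patterns are the ones from the converter, and the greedy variant is only there for comparison.

```php
<?php
// Hypothetical sample in the old phpBB style, where each tag carries a ":code" suffix.
$body = "[b:abcd1234]bold[/b:abcd1234] and [B:abcd1234]shouting[/B:abcd1234]";

// Lazy (.+?): each match stops at the first ] after the colon.
// The i flag lets [B:...] match as well as [b:...].
$lazy = preg_replace(
    array('~\[b\:(.+?)\]~is', '~\[/b\:(.+?)\]~is'),
    array('[b]', '[/b]'),
    $body
);

// Greedy (.+): the match runs from the first [b: to the LAST ] in the string.
$greedy = preg_replace('~\[b\:(.+)\]~is', '[b]', $body);

echo $lazy, "\n";   // [b]bold[/b] and [b]shouting[/b]
echo $greedy, "\n"; // [b]
```

The greedy version swallows the whole post into one match, which is why the converter uses the lazy form.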
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 04:32:04 AM
Ok, thanks, both of you.

Yeah I figured it looked crazy bonkers for what it had to do. :D
So really, you could do it like this:
$row['body'] = preg_replace(
array(
'<B><s>[b]</s>',
'<e>[/b]</e></B>',
),
array(
'[b]',
'[/b]',
), $row['body']);
That should work, yes? Since it is exactly what it has to do in practice. Or is that likely to screw up somehow?

Also, if this converter script needs to be frigged around with anyway, is there any obvious reason why it could not be made to convert straight to 2.1.x?
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 17, 2022, 04:33:41 AM
I'd leave both in there to be honest, as phpBB isn't stellar at updating legacy posts.

It would be nice to make the importer go direct to 2.1, I don't remember how much work that is.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 04:42:47 AM
Ok but, from the grumblings in the other board, the catch is that the current code is skipping a lot of crud. For example it's leaving in <s > and </s > tags, which the browser renders as standard HTML strikethrough text, and it also leaves <e > and </e > tags, which are just pointless crud in HTML.

So leaving the existing preg_replace is just going to create problems with post content after conversion. It really needs to be cleaned up somehow.

ETA: I suspect it is also leaving the uppercase B tags as well, which in practice has no ill effect, since HTML is case-insensitive and nested b tags are just bold anyway. But still, it's extra crud in the markup which does not need to be there, and really should be dealt with by the conversion script.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 17, 2022, 04:47:38 AM
Before you try to mess with those regular expressions, you really should understand what the current ones do. Futzing around with them when you don't is a recipe for pain.

https://www.php.net/manual/en/reference.pcre.pattern.syntax.php

https://www.regular-expressions.info/

Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 04:51:00 AM
Ahem. 'Tis SMF coding. Pain is inevitable. :D

But ok, fair point. Will take a look at the lynx.
Title: Re: Converter shiz: WTF does this mean?
Post by: live627 on April 17, 2022, 05:13:51 AM
a good online tester that I use a lot is  https://regex101.com/
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 05:23:25 AM
Ta. Will take a look at that too.

Was also thinking the smart move is probably to set up a new phpBB 3.3.x installation with minimal content, just to get all the BBC tags sorted as a starting point. Should convert in next to no time, and then comparisons can be made.
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 17, 2022, 05:26:58 AM
This one isn't SMF's fault :P
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 05:31:24 AM
Meh. :P
As long as the damned thing ends up working one way or another, it doesn't much matter.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 05:51:13 AM
Speaking of which, there are also the infamous quote tags. The current script does this:

$row['body'] = preg_replace(
    array(
        '~\[quote=&quot;(.+?)&quot;\:(.+?)\]~is',
        '~\[quote\:(.+?)\]~is',
        '~\[/quote\:(.+?)\]~is',
    ),
    array(
        '[quote author="$1"]',
        '[quote]',
        '[/quote]',
    ), $row['body']);

Which is clearly inadequate, since the phpBB code allows a link to the original (quoted) post, and SMF allows this too, but the conversion script is limiting it to author only, with no link to the original post.

So that needs extending, to allow a better conversion. And no, I'm not asking for a solution on a platter at this stage. I'm pretty sure I can figure it out with enough swearing. It doesn't look particularly complex, and it should be done.

Probably can't pull the date of the original, like SMF does, because phpBB quote tags do not seem to include that information, but a functioning link would be sufficient.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 06:38:47 PM
Speaking of times: how does the ten figure timestamp in SMF quote headers work?
Example from a test site:
[quote author=antechinus link=topic=1.msg3#msg3 date=1420864115]
The custom textareas for some of the settings are potentially a security risk.
[/quote]

Yep, have looked around the web for PHP and SQL ten figure time codes, but found nothing that makes sense.

(Oh and it turns out that some phpBB quote headers do contain a similar timestamp, so that should be doable as part of a conversion.)
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 17, 2022, 06:41:02 PM
The 10-figure thing is a UNIX timestamp. It's the number of seconds since 1/1/1970 00:00:00 and there are plenty of libraries around to shuffle dates to and from that format. It's the same format that SMF stores against almost any time measurement in the database.
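
As a quick illustration, the timestamp from the quote header above can be turned into a readable UTC date and back with PHP's built-in functions. Nothing here is SMF-specific; it is just a sketch using gmdate() and gmmktime().

```php
<?php
// The value from the example quote header.
$timestamp = 1420864115;

// Seconds since 1970-01-01 00:00:00 UTC, formatted as a UTC date.
$readable = gmdate('Y-m-d H:i:s', $timestamp);

// And back again: hour, minute, second, month, day, year (all UTC).
$round_trip = gmmktime(4, 28, 35, 1, 10, 2015);

echo $readable, "\n";   // 2015-01-10 04:28:35
echo $round_trip, "\n"; // 1420864115
```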
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 17, 2022, 06:53:30 PM
Ok thanks. Makes sense (didn't think of looking under Unix). So that would be directly convertible from phpBB to SMF. No problem.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 18, 2022, 06:13:35 AM
Been digging around a bit...
Quote from: Arantor on April 17, 2022, 04:25:47 AM
Yes, it means [b:stuff] where stuff can be anything. phpBB went through a phase of writing tags as [b:abcd1234] where the alphanumeric code was per-tag for some reason.
I've checked the db dump for the site I want to convert. Earliest post is in late 2006, when phpBB 2.0.x was current. It's possible the forum started off with a beta of 3.0.x (first RC wasn't released until May 2007). I can check that with the old admin.

So, the thing is the BBC tags in the earliest posts exactly match the syntax of the same tags in the latest posts. IOW, no literal colon and no alphanumeric stuff per tag. This indicates that phpBB maybe haven't been doing anything that silly for many years. Or, alternatively, they were being that silly more recently (wouldn't surprise me) but somehow the upgrade script has dealt with the problem.

Which then leads to: since the current regex is looking to match a literal colon (in all BBC tags) which will never exist, the current converter is specifically set up to not match any existing phpBB BBC tags. It's specifically set up to skip them entirely and leave broken markup everywhere. Which, according to all the grumbles in other threads, is exactly what it does. So it's working perfectly, if perfectly means "totally screwing everything, as intended". :D

Anyway, point is that all of the current code for converting BBC tags is utterly useless for the (3.3.*) site I want to convert. It all has to be rewritten. Which makes me wonder when this converter script was written, and why it has taken until 2022 for someone to notice the bleeding obvious.

Way to go, SMF team. :P

And if all of the code for converting BBC tags has to be rewritten it's a safe bet there will be other parts of the code that need rewriting too. In which case, not making it convert straight to 2.1.x really would be kinda stupid. It can't be that much more work, really.
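
The mismatch described above can be sketched like this. Both sample strings are illustrative: the first uses the old colon-suffixed form, the second the phpBB 3.3.x form quoted earlier in the topic.

```php
<?php
// Old-style tag with the ":code" suffix, and the 3.3.x storage format.
$old_style = '[b:abcd1234]Actual text content here[/b:abcd1234]';
$new_style = '<B><s>[b]</s>Actual text content here<e>[/b]</e></B>';

// The existing converter patterns, which insist on a literal colon.
$patterns = array('~\[b\:(.+?)\]~is', '~\[/b\:(.+?)\]~is');
$replacements = array('[b]', '[/b]');

$converted_old = preg_replace($patterns, $replacements, $old_style);
$converted_new = preg_replace($patterns, $replacements, $new_style);

echo $converted_old, "\n"; // [b]Actual text content here[/b]
echo $converted_new, "\n"; // unchanged: the <B>, <s> and <e> wrappers all survive
```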
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 18, 2022, 06:42:46 PM
Just making a list of all potential crapfights, and there's this:

Quote from: shadav on December 01, 2021, 02:50:00 PM
...what I found was it was mostly url's, embedded videos, and images that got messed up for me it threw in random s, r, and e tags  ???

The "s" and "e" tags are obvious: those are part of the current phpBB tag syntax. The "r" tags are not, but I know what they are. Example (completely innocuous) from a phpBB 3.3.x db dump follows:

(46745, 4261, 1, 1640, 0, '50.45.209.43', 1572934068, 0, 1, 1, 1, 1, '', 'Re: A few small changes.', '<r>Thanks for the strike code. <E>:D</E><br/>\n<br/>\nOff-topic: What are the restrictions on the advanced search and ignored words? Some words like \"industry\" are ignored. Which maybe is fine because who would search just that. But sometimes when I do a search I want to put something like \"industry profit\" and it\'s not possible from what I can tell. Is that something that saves server bandwidth (a good thing). I have no idea, just wondering if the update did anything for search capabilities?</r>', 'd221291048038a68825e43feffef919c', 0, '', '1mtkv0nj', 1, 0, '', 0, 0, 0, 1, 0, '', 0, 0),
The "r" tags appear to be phpBB's way of signalling the beginning and end of another row in the posts table, for the post content itself. At least, that's the only logic I can think of for choosing "r" there (obviously each db row includes all the other crud, before and after the actual post content).

Anyway, point is that these are being missed by the current conversion code. Getting rid of them is simple enough. It's just not being done at the moment.
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 18, 2022, 07:00:09 PM
I don't think it means that at all. Another row in the posts table would be, well, another row in the posts table, not multiple things inside one row, even phpBB isn't that out there.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 18, 2022, 07:02:20 PM
Lol. Well whatever the reason, they start and end post content with a custom "r" tag, and the converter is letting it through, so it needs to be rewritten to kill those tags. No big deal, just another thing on the list.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 18, 2022, 07:33:03 PM
Aha! Just checked a bit further. These guys are even more bonkers than I suspected.

The post content appears to have an <r> and </r> at the beginning and end of every alternate post. At least in some parts of the db. I suspect it gets assigned when the post is submitted, so that if a post is deleted later the sequence for the entire db is not updated.

Anyway, first post in the db has actual post content that begins and ends with <r> and </r>. Alternating posts in some parts of the db have actual post content that begins and ends with <t> and </t>. So that is four superfluous tags the converter needs to eliminate, not just two. Still easy enough to do.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 18, 2022, 08:55:54 PM
Oh yeah, re possibly sorting this thing to go straight to 2.1.x: does anyone know offhand if edits would be required to convert.php?

Or would convert.php be ok as is, with all the work to be done limited to the phpbb_to_smf.sql?

ETA: Well, apart from a couple of obvious minor details, like minimum PHP and MySQL version. And IIRC 2.1.x doesn't support PostgreSQL, so that would be dropped from the convert.php globals.
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 18, 2022, 09:11:27 PM
Quote from: Antechinus on April 18, 2022, 08:55:54 PM
And IIRC 2.1.x doesn't support PostgreSQL

Yes it does; it was only SQLite that was dropped.

The converter really deserves a lot more time and effort than anyone can give it right now. Though honestly I feel like throwing the time and energy at OpenImporter to make that more awesome would be better than trying to patch up the converters because god knows how old and crusty they really are.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 18, 2022, 09:22:54 PM
Hmm. Well I'm prepared to have a crack at the converter, just so I can use it myself, but as you can tell I'll need some guidance here and there. And if making it go straight to 2.1 is really going to be a drama, going from 2.0 to 2.1 within SMF is a piece of cake anyway.

Personally I'm wary of looking at OI, because at the moment it doesn't have any version of SMF as a destination, so that is probably well above my pay grade unless I go to Elk instead. Whereas I can probably hack the existing phpBB_to_smf converter to work well enough for my purposes.

ETA: Although (obvious stupid thought which will probably not work) since early Elk was basically 2.1 Alpha anyway, with the same db structure, would it be relatively easy to adapt the existing OI Elk script to SMF 2.1? IIRC db changes are not much, and for conversions it's presumably the db changes that count. Things like BBC syntax weren't changed.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 19, 2022, 06:15:25 PM
Question about the 2.0.x regex for font-size, in Subs.php:
array(
'tag' => 'size',
'type' => 'unparsed_equals',
'test' => '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9]|[1-9](\.[\d][\d]?)?)?em)\]',
'before' => '<span style="font-size: $1;" class="bbc_size">',
'after' => '</span>',
),

Just checking that I understand this correctly: is it saying that for font-size values < 1em it will only accept a single decimal place (ie: .9em, but not .85em) while for font-size > 1em it will accept two decimal places (ie: 1.85em will work)?
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 19, 2022, 06:19:15 PM
So...

* 1-9 followed by an optional digit followed by px or pt
* or small (optionally followed by er)
* or large (optionally followed by r)
* or x-small, x-large, xx-small, xx-large
* or medium
* or 0. followed by a single digit 1-9, followed by em
* or 1-9, optionally followed by a decimal point and one or two digits, followed by em

Yup, you have it right.

if you want 0.85em I think this should do it:
array(
'tag' => 'size',
'type' => 'unparsed_equals',
'test' => '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9][\d]?|[1-9](\.[\d][\d]?)?)?em)\]',
'before' => '<span style="font-size: $1;" class="bbc_size">',
'after' => '</span>',
),
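
A rough way to check both versions of the test without touching a live forum. This sketch assumes the test is applied anchored at the start of whatever follows "[size=", and size_passes() is a hypothetical helper written for this example, not an SMF function.

```php
<?php
// The default 2.0.x test and the relaxed variant from this post.
$default_test = '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9]|[1-9](\.[\d][\d]?)?)?em)\]';
$relaxed_test = '([1-9][\d]?p[xt]|small(?:er)?|large[r]?|x[x]?-(?:small|large)|medium|(0\.[1-9][\d]?|[1-9](\.[\d][\d]?)?)?em)\]';

// Hypothetical helper: does "[size=VALUE]" satisfy the given test?
// Assumes the test is anchored at the start of the text after "[size=".
function size_passes($test, $value)
{
    return preg_match('~^' . $test . '~', $value . ']') === 1;
}

foreach (array('0.9em', '0.85em', '1.85em', '12px') as $value)
{
    echo $value, ': default=', var_export(size_passes($default_test, $value), true),
        ' relaxed=', var_export(size_passes($relaxed_test, $value), true), "\n";
}
```

Only 0.85em should differ between the two: rejected by the default test, accepted by the relaxed one.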
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 19, 2022, 06:30:26 PM
Ok cool. At least I understood it all. This is progress. :D

I'm not sure I need two decimal places for < 1em, but OTOH it is an odd inconsistency in the default code. Offhand it seems allowing .75em makes as much sense as allowing 1.75em.

I did test .85em on a 2.0.x test site a while back, and noticed that it appeared to just default to 1em. Now I know why. That value would be dropped by the regex, resulting in no change in font-size.

Reason I asked is that the phpBB db I'm looking at contains some examples of size = 85%, and a few other odd sizes. I may just write something that changes them to a more convenient value (SQL query before conversion, or find/replace on the db dump, or new regex in hacked converter).

ETA: On second thought, the easiest option would be to just change the test regex in Subs.php. Less code to mess with, less chance of anything screwing up, and more versatility for the future.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 19, 2022, 08:23:01 PM
Ok, next question. This:
'~\[b\:(.+?)\]~is',
I know what it is saying now, but I don't understand why it was written to look for literal [b then a literal colon. Seems to me that this would do exactly the same job:
'~\[b:(.+?)\]~is',
Am I missing something?
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 19, 2022, 08:37:07 PM
There are times when : can make up part of an expression in regex syntax (namely the ?: and its variant syntaxes) so escaping it with the backslash makes it very clear it's not part of any of those, not even accidentally.
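
A small sketch to confirm that the escaped and unescaped colon behave identically here, alongside a case where the colon really is part of regex syntax. The sample strings are made up.

```php
<?php
$body = '[b:abcd1234]text[/b:abcd1234]';

// Escaped and unescaped colon: both are a literal colon in this position.
$escaped   = preg_replace('~\[b\:(.+?)\]~is', '[b]', $body);
$unescaped = preg_replace('~\[b:(.+?)\]~is', '[b]', $body);

// Here the colon is syntax: (?:...) is a non-capturing group,
// so this matches one or more repetitions of "ab" with no colon involved.
$non_capturing = preg_match('~^(?:ab)+$~', 'abab');

echo $escaped, "\n";       // [b]text[/b:abcd1234]
echo $unescaped, "\n";     // [b]text[/b:abcd1234]
echo $non_capturing, "\n"; // 1
```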
Title: Re: Converter shiz: WTF does this mean?
Post by: Tyrsson on April 19, 2022, 09:45:00 PM
I can not even begin to believe what Im reading in this thread.... Ant, deep diving into the world of hardcore php.... Never thought I would see it.... Historically it was always.... "*#$*#@* that, I cant be bothered with *#&@^$ *#*#($ #&#&@ *# $$*$*$ that s*#t!!!!" and that was on his good days roflmao..

So far, well done Ant!!!  :D  :D

If you can get your head around regex... You can do just bout any of it.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 19, 2022, 11:17:07 PM
:P :P :P :P :P :P :P

I've always been fine with learning whatever PHP I needed to do whatever job I needed to do at the time. I just don't want to waste time learning everything about PHP all in one go, when I only want to get something specific sorted so I can move on.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 20, 2022, 12:16:54 AM
(https://imgs.xkcd.com/comics/regular_expressions.png)
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 20, 2022, 06:50:43 AM
Bonzer. Swing past on yer Tarzan rope and have a squiz at this. :D

phpBB quote tag, for an all-bells-and-whistles quote, goes like this:
<QUOTE author=\"Username\" post_id=\"12345\" time=\"1234567890\" user_id=\"1234\"><s>[quote=Username post_id=12345 time=1234567890 user_id=1234]</s>
Actual quoted text.
<e>[/quote]</e></QUOTE>

So obviously to convert that you need to grab the username, post_id and time (user_id is not relevant to SMF quotes). And with the \ already being in the db to escape all the " that means both the \ and the " need to be designated as literal in the regex. Ends up with quite a few backslashes running around like headless chickens. :P

As far as I can figure, it should look like this in the converter SQL file:
$row['body'] = preg_replace(
array(
'~\<QUOTE author=\\\"(.+?)\\\" post_id=\\\"(.+?)\\\" time=\\\"(.+?)\\\" user_id=\\\"(.+?)\\\"\>\<s\>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)\]</s>~s',
'<e>[/quote]</e></QUOTE>',
),
array(
'[quote author=$1 link=msg=$2 date=$3]',
'[/quote]',
), $row['body']);

This seems to not give the magic regex checker any indigestion. The closing tag shouldn't need regex, because it's always exactly the same content.

A basic find/replace should sort the closing tag, but I'm not sure if it can be done just like that, as part of the preg_replace. Would it need to be done like these?
/* This just does the stuff that isn't worth parsing in a regex. */
$row['body'] = strtr($row['body'], array(
'[list type=1]' => '[list type=decimal]',
'[list type=a]' => '[list type=lower-alpha]',
));
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 20, 2022, 07:53:41 PM
Ok, that seemed to be TMI. :D  Let me put it like this:

Since all current phpBB closing tags will not require regex, is it more sensible to keep them in the same preg_replace array as the opening tags (which will require regex)?

Or is it more sensible to split the closing tags out to another array that uses strtr?

Doesn't faze me either way. Happy to go with whatever is best for performance/sanity/etc.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 20, 2022, 09:51:54 PM
Your regexes look okay when I read them, but I haven't tested them, so I promise you nothing.

As for dealing with the closing tags, it won't make much difference which way you do it. If you weren't already using any regexes, it would be faster to use simple string replacement. But once you've paid the initial overhead of calling preg_replace(), it won't make any difference whether the PCRE engine or the simple string engine is the one stepping through the string looking for a basic substring.
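
For comparison, here are both ways of handling a fixed closing tag side by side. This is only a sketch on a made-up sample; it shows that the results are the same, not which is faster.

```php
<?php
$body = '<B><s>[b]</s>bold<e>[/b]</e></B>';

// Regex route: the closing tag is a fixed string, so only the [ needs escaping.
$via_regex = preg_replace('~<e>\[/b]</e></B>~', '[/b]', $body);

// Plain string route: strtr() with a search => replace array, no escaping at all.
$via_strtr = strtr($body, array('<e>[/b]</e></B>' => '[/b]'));

echo $via_regex, "\n"; // <B><s>[b]</s>bold[/b]
echo $via_strtr, "\n"; // <B><s>[b]</s>bold[/b]
```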
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 20, 2022, 09:57:06 PM
Ok cool. That's what I wanted to know. Thanks.

And yep, I get I'll have to test my own regex. ;)
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 21, 2022, 08:51:35 PM
Magic regex checker thing says it should be like this:

$row['body'] = preg_replace(
array(
'~\<QUOTE author=\\\"(.+?)\\\" post_id=\\\"(.+?)\\\" time=\\\"(.+?)\\\" user_id=\\\"(.+?)\\\"\>\<s\>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)\]</s>~s',
'~<e>\[\/quote]<\/e><\/QUOTE>',
),
array(
'[quote author=$1 link=msg=$2 date=$3]',
'[/quote]',
), $row['body']);

Which is cool. I think I have the hang of escaping all the necessary basics. Time to whip up a hacked copy of the script and a basic test db, and run some live tests. :)
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 22, 2022, 01:58:52 AM
You are missing the closing ~ in your second regex.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 22, 2022, 07:14:37 PM
Yup. I got a little bit more done on them last night. Turns out the syntax can be simpler than I expected. According to the magic regex checker, it's only necessary to escape the first [, and it doesn't require you to also escape the following ] bracket. Also, no need to escape any instances of " AFAICT, or a few other things. This passes the checker:

$row['body'] = preg_replace(
array(
'~<QUOTE author=\\"(.+?)\\" post_id=\\"(.+?)\\" time=\\"(.+?)\\" user_id=\\"(.+?)\\"><s>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)]</s>~s',
'~<QUOTE author=\\"(.+?)\\"><s>\[quote=\\"(.+?)\\"]</s>~s',
'~<QUOTE><s>\[quote]</s>~',
'~<e>\[/quote]</e></QUOTE>~',
'~<B><s>\[b]</s>~',
'~<e>\[/b]</e></B>~',
'~<I><s>\[i]</s>~',
'~<e>\[/i]</e></I>~',
'~<U><s>\[u]</s>~',
'~<e>\[/u]</e></U>~',
),
array(
'[quote author=$1 link=msg=$2 date=$3]',
'[quote author=$1]',
'[quote]',
'[/quote]',
'[b]',
'[/b]',
'[i]',
'[/i]',
'[u]',
'[/u]',
), $row['body']);
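
A quick sanity check of the full quote pattern, run outside the converter. One assumption to flag: the sample uses plain double quotes, on the theory that the backslashes in the db dump are only SQL escaping. Note that inside a single-quoted PHP string, \\" reaches the regex engine as \", which matches a plain double quote.

```php
<?php
// Sample built from the phpBB quote syntax shown earlier, with plain quotes (assumed).
$body = '<QUOTE author="Username" post_id="12345" time="1234567890" user_id="1234">'
    . '<s>[quote=Username post_id=12345 time=1234567890 user_id=1234]</s>'
    . "\nActual quoted text.\n"
    . '<e>[/quote]</e></QUOTE>';

$converted = preg_replace(
    array(
        '~<QUOTE author=\\"(.+?)\\" post_id=\\"(.+?)\\" time=\\"(.+?)\\" user_id=\\"(.+?)\\"><s>\[quote=(.+?) post_id=(.+?) time=(.+?) user_id=(.+?)]</s>~s',
        '~<e>\[/quote]</e></QUOTE>~',
    ),
    array(
        '[quote author=$1 link=msg=$2 date=$3]',
        '[/quote]',
    ),
    $body
);

echo $converted, "\n";
```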
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 22, 2022, 11:15:27 PM
This seems to be the most sensible way of dealing with font-size:

$row['body'] = preg_replace(
array(
'~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
'~<e>\[/size]</e></SIZE><e>~',
),
array(
'[size=' . (round($1 / 100, 1)) . 'em]',
'[/size]',
), $row['body']);

Will handle phpBB sizes < 100% or > 100%, and will fit into SMF's default test for sizes < 1em. Easy.

ETA: Figured out what the dodgy <t > </t > and <r > </r > tags are in the database. The OP in any topic has its post body content wrapped in the "t" tags, for "topic". Any replies to that topic get the "r" tags, for "reply". They're still no use to SMF though. Just more crud to get rid of when converting.
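
Whatever the wrappers signify, stripping them could look something like this. It is a sketch only: strip_phpbb_wrapper() is a hypothetical helper name, and it assumes the wrapper is always the outermost pair around the post body.

```php
<?php
// Hypothetical helper: remove a single outer <r>...</r> or <t>...</t> wrapper.
function strip_phpbb_wrapper($body)
{
    // ([rt]) captures which wrapper opened; \1 requires the same letter to close it.
    // The s flag lets (.*) span line breaks inside the post.
    return preg_replace('~^<([rt])>(.*)</\1>$~s', '$2', $body);
}

echo strip_phpbb_wrapper('<r>Thanks for the strike code. <E>:D</E></r>'), "\n";
echo strip_phpbb_wrapper('<t>Plain first post</t>'), "\n";
```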
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 23, 2022, 12:42:17 AM
Quote from: Antechinus on April 22, 2022, 11:15:27 PM
This seems to be the most sensible way of dealing with font-size:

$row['body'] = preg_replace(
array(
'~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
'~<e>\[/size]</e></SIZE><e>~',
),
array(
'[size=' . (round($1 / 100, 1)) . 'em]',
'[/size]',
), $row['body']);

Will handle phpBB sizes < 100% or > 100%, and will fit into SMF's default test for sizes < 1em. Easy.

That won't work. If you want to perform calculations using the matched data like that, you will need to use preg_replace_callback() instead of preg_replace().
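
Something along these lines, as an untested-on-a-real-board sketch. convert_size_tags() is a made-up name, the sample input assumes plain double quotes in the stored text, and the pattern only accepts whole-number percentages.

```php
<?php
// Hypothetical helper: convert phpBB 3.3.x percentage sizes to SMF em sizes.
function convert_size_tags($body)
{
    // The callback receives the match array, so the calculation runs per match.
    $body = preg_replace_callback(
        '~<SIZE size="(\d+)"><s>\[size=\d+]</s>~',
        function ($matches)
        {
            return '[size=' . round($matches[1] / 100, 1) . 'em]';
        },
        $body
    );

    // The closing tag is fixed text, so a plain string replacement is enough.
    return str_replace('<e>[/size]</e></SIZE>', '[/size]', $body);
}

echo convert_size_tags('<SIZE size="150"><s>[size=150]</s>big<e>[/size]</e></SIZE>'), "\n";
echo convert_size_tags('<SIZE size="80"><s>[size=80]</s>small<e>[/size]</e></SIZE>'), "\n";
```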
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 23, 2022, 01:20:55 AM
Meh. Ok. Can mess with that some more. Sounds like an easy fix.

However, there is a trickier one. I have all the default phpBB tags, except one, sorted for correct regex (AFAICT without live testing a conversion). They should all go to the SMF equivalents without any issues (famous last words). Will test it soon, but...

The tricky one is the BBC URL tag. The reason it is tricky is that when converting the phpBB board to SMF you will obviously want to change all internal links from ***/viewtopic.php?*** to ***/index.php?*** but, if some of the links are external links to another phpBB board you obviously do not want to convert those. Those have to stay in default phpBB syntax to remain functional.

So... I will need to write regex that will distinguish between this:
<URL url=\"http://the_board_to_convert.com/viewtopic.php?f=231&amp;t=379&amp;p=44733#p44733\">
and this:
<URL url=\"http://the_other_board.com/viewtopic.php?f=31&amp;t=3879&amp;p=40470#p40470\">
IOW: it has to know and remember the existing board's url ($scripturl, in SMF terms) so it can wallop anything starting with that, while not walloping links to the other board(s). Although links to the other board(s) still need some walloping to get their tags into SMF BBC URL syntax. They just don't want the actual URL being changed.

Obviously it would be possible to manually enter the existing board's base url into a regex. Wouldn't bother me doing it that way, but it might be a bit messy for general use. Better if it can just pick it up from somewhere handy, without the user having to frig around editing the SQL file.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 23, 2022, 01:32:35 AM
Here again preg_replace_callback() is your friend. Set up the regular expression to find the url tags, and make sure to capture the URL string itself in a capturing group. Then inside the callback function you can analyze the URL string further in order to choose your replacement dynamically.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 23, 2022, 01:41:16 AM
(https://bookriot.com/wp-content/uploads/2012/09/what-dogs-hear-863x1024.jpg)
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 23, 2022, 01:49:47 AM
Oh yeah, here's another pretty obvious thing...

The current conversion script has the BBC conversion arrays in two places: one for converting signatures, and the other for converting posts. That's ok, but it is missing one for converting PM content, so AFAICT it needs a third instance added for that purpose.

That's easy enough (I reckon I could handle that myself) but it's getting a tad silly, because then you'd have three identical arrays in three different places in the SQL file. That means chasing and editing all three of them if, for example, you want to add in code for custom BBC tags.

It would be saner to write one function for converting all BBC tags in all database tables, then call that function where appropriate. This is actually how it is done in OpenImporter (I know coz I peeked). Same result, less fuss. :)
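
Roughly what that single function might look like, as a sketch. convert_phpbb_bbc() is a hypothetical name (not taken from convert.php or OpenImporter), and the map only covers the b, i and u tags from earlier in the topic.

```php
<?php
// Hypothetical shared helper: one pattern => replacement map, used everywhere.
function convert_phpbb_bbc($text)
{
    static $map = array(
        '~<B><s>\[b]</s>~'   => '[b]',
        '~<e>\[/b]</e></B>~' => '[/b]',
        '~<I><s>\[i]</s>~'   => '[i]',
        '~<e>\[/i]</e></I>~' => '[/i]',
        '~<U><s>\[u]</s>~'   => '[u]',
        '~<e>\[/u]</e></U>~' => '[/u]',
    );

    return preg_replace(array_keys($map), array_values($map), $text);
}

// The same call would then serve posts, signatures and PMs alike.
$post      = convert_phpbb_bbc('<B><s>[b]</s>post text<e>[/b]</e></B>');
$signature = convert_phpbb_bbc('<I><s>[i]</s>sig<e>[/i]</e></I>');
$pm        = convert_phpbb_bbc('<U><s>[u]</s>pm text<e>[/u]</e></U>');

echo $post, "\n", $signature, "\n", $pm, "\n";
```

Adding a custom BBC tag then means editing one map instead of three arrays.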
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 23, 2022, 02:43:57 AM
Heh.

This should get you started. But I have no idea how phpBB specifies its equivalent to $boardurl, so you will need to replace $phpBB_boardurl with whatever the actual variable is. If you also want to handle viewforum.php URLs, you can add an elseif (...) block to deal with those before the "// Not a URL for this forum, so do nothing to it" comment.


$row['body'] = preg_replace_callback(
    '~<URL url=\"(.+?)\">~',
    function ($matches) use ($scripturl, $phpBB_boardurl)
    {
        // Extract the query string from the URL.
        $query = parse_url($matches[1], PHP_URL_QUERY);

        // If the URL points to viewtopic.php on this forum,
        // rewrite it to the corresponding URL for SMF.
        if (strpos($matches[1], $phpBB_boardurl . '/viewtopic.php') === 0)
        {
            // Need to find the topic id.
            if (preg_match('~\bt=(\d+)~', $query, $sub_matches))
            {
                $topic = $sub_matches[1];

                // Next find the specific post, if present.
                if (preg_match('~\bp=(\d+)~', $query, $sub_matches))
                    $msg = $sub_matches[1];
                else
                    $msg = 0;

                // Now find the start, if present.
                if (preg_match('~\bstart=(\d+)~', $query, $sub_matches))
                    $start = $sub_matches[1];
                else
                    $start = 0;

                // Build the new URL.
                // First part is simple.
                $new_url = $scripturl . '?topic=' . $topic;

                // Append the msg bit if we have one.
                if (!empty($msg))
                    $new_url .= '.msg' . $msg . '#msg' . $msg;
                // Otherwise, append the start.
                else
                    $new_url .= '.' . $start;
            }
            // If no topic id was found, we can't do anything.
            else
                $new_url = $matches[1];
        }
        // Not a URL for this forum, so do nothing to it.
        else
        {
            $new_url = $matches[1];
        }

        return '[' . 'url=' . $new_url . ']';
    },
    $row['body']
);

Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 23, 2022, 03:06:20 AM
Ok thanks. I'll mess around with that. I had an idea about the font size conversion too. The existing script uses this:
$row['body'] = preg_replace(
array(
'~\[size=(.+?)\:(.+?)\]~is',
'~\[/size\:(.+?)?\]~is',
),
array(
'[size=' . convert_percent_to_px("\1") . 'px]',
'[/size]',
), $row['body']);

I assume the "\1" is the equivalent, for this purpose, of the usual $1, and convert_percent_to_px is called in from convert.php:
// Convert percent to pixels. Thanks Elberet.
function convert_percent_to_px($percent)
{
return intval(11*(intval($percent)/100.0));
}

So, offhand I see no reason why another function could not be added to convert.php:
// Convert percent to em, mate.
function convert_percent_to_em_mate($percent)
{
return round($percent / 100, 1);
}

Or something similar (not sure if I have the correct syntax). Then the revamped array just becomes:
$row['body'] = preg_replace(
array(
'~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
'~<e>\[/size]</e></SIZE><e>~',
),
array(
'[size=' . convert_percent_to_em_mate("\1") . 'em]',
'[/size]',
), $row['body']);

Which should work, AFAICT. :)
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 23, 2022, 11:42:35 AM
The function you want to add should work, but again that preg_replace() needs to be a preg_replace_callback(). If what you posted above is what currently exists in the converter script, then the converter script is wrong and won't work as intended.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 23, 2022, 04:36:06 PM
Lol. That does not surprise me at all, given we already know the converter script is wrong* and does not work as intended. :D

*Well, wrong for phpBB 3.2.x and 3.3.x anyway.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 23, 2022, 06:42:09 PM
And also apparently wrong in how it thinks preg_replace() works.

Once upon a time, preg_replace() supported an 'e' modifier that would treat the replacement as a string of PHP code to eval() in order to generate the final replacement string. I suspect that what happened here is that when the 'e' modifier became deprecated, someone tried to update this converter script to deal with that, but they botched the job.
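
To illustrate why the current replacement cannot work: PHP runs convert_percent_to_px("\1") once, while the replacement array is being built, long before preg_replace() has matched anything. A minimal sketch, using the function quoted above and a made-up sample post in the old [size=NNN:code] format:

```php
<?php
// Copied from the converter's convert.php, as quoted earlier in the thread.
function convert_percent_to_px($percent)
{
    return intval(11 * (intval($percent) / 100.0));
}

// In a double-quoted string "\1" is an octal escape (the byte 0x01), not a backreference.
// intval() of that byte is 0, so the function returns 0 before any regex has run.
$replacement = '[size=' . convert_percent_to_px("\1") . 'px]';

// Hypothetical sample text, for illustration only.
$body = '[size=150:abc123]Big text[/size:abc123]';
$converted = preg_replace('~\[size=(.+?)\:(.+?)\]~is', $replacement, $body);

echo $replacement, "\n";
echo $converted, "\n";
```

Every opening size tag ends up with the same fixed replacement, whatever percentage the post actually contained.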
Title: Re: Converter shiz: WTF does this mean?
Post by: tinoest on April 24, 2022, 02:18:10 PM
Quote from: Sesquipedalian on April 23, 2022, 11:42:35 AMThe function you want to add should work, but again that preg_replace() needs to be a preg_replace_callback(). If what you posted above is what currently exists in the converter script, then the converter script is wrong and won't work as intended.

Why should it be the preg_replace_callback? I don't see any reason that preg_replace wouldn't work in that example. Unless I'm missing the obvious. It accepts an array as the regex, and one for replacement then will do it on the subject.
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 24, 2022, 02:32:02 PM
Because the replacement requires a computation to be done on the text to be replaced.

There's a rounding + division by 100 in there, which means you're operating on the replacement before replacing it into the result, which can't be put into the result using preg_replace.

At least, if you're converting 120% to 1.2em or similar - you could possibly refactor this into multiple matches but then you need to accommodate where you have 2 digits percentage (e.g. size=80 -> [size=0.8em]) as well as 3 digits (e.g. size=120 -> [size=1.2em])
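
A rough sketch of the callback approach, assuming the old [size=NNN:code] tag format and the convert_percent_to_em_mate() helper proposed above. The sample text and the tightened patterns are illustrative, not taken from the converter:

```php
<?php
// The helper proposed earlier in the thread.
function convert_percent_to_em_mate($percent)
{
    return round($percent / 100, 1);
}

// Hypothetical sample text in the old phpBB [size=NNN:code] format.
$body = '[size=120:abc123]Big[/size:abc123] and [size=80:abc123]small[/size:abc123]';

// Opening tags need a computation per match, so they go through a callback.
$body = preg_replace_callback(
    '~\[size=(\d+):[^\]]*\]~i',
    function ($matches) {
        // $matches[1] is the captured percentage, e.g. "120".
        return '[size=' . convert_percent_to_em_mate($matches[1]) . 'em]';
    },
    $body
);

// Closing tags are a fixed string, so plain preg_replace() is enough.
$body = preg_replace('~\[/size(?::[^\]]*)?\]~i', '[/size]', $body);

echo $body, "\n";
```

The callback handles two-digit and three-digit percentages alike, because the arithmetic runs on whatever number each match captured.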
Title: Re: Converter shiz: WTF does this mean?
Post by: tinoest on April 24, 2022, 02:45:33 PM
Quote from: Arantor on April 24, 2022, 02:32:02 PMBecause the replacement requires a computation to be done on the text to be replaced.

There's a rounding + division by 100 in there, which means you're operating on the replacement before replacing it into the result, which can't be put into the result using preg_replace.

At least, if you're converting 120% to 1.2em or similar - you could possibly refactor this into multiple matches but then you need to accommodate where you have 2 digits percentage (e.g. size=80 -> [size=0.8em]) as well as 3 digits (e.g. size=120 -> [size=1.2em])

Ahh I missed the function call on the replace. Thanks for highlighting that.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 06:32:12 PM
Ok, here's some fun and games. The internal links on the forum to be converted need to go from phpBB f=**;t=***;p****#p**** format to SMF c=**;board=***;topic=****;msg=***** format. There are several variations that need to be dealt with. For added fun and games, phpBB uses the same format for categories and boards: both are coded as f=** in the database. Yay!

So, really it needs some way of distinguishing which are phpBB categories and which are phpBB boards. IOW, somehow the ID's for categories have to be entered into an array first, so those links can be converted as category links, then a second regex deals with the remainder as boards. Whoopee. :P

And, it needs a way of entering the existing forum's path, so that can be converted to SMF's *****/index.php?etc... from phpBB's *****/index.php?etc... and *****/viewtopic.php?etc... and whatever else (hence the humungous capture groups in some of these expressions - kill anything not required) without touching any links to other phpBB boards.

So https:// existing \.com/forums needs to be stashed somewhere once, so it can be called where needed.

And TBH I think this is the simplest way of dealing with it, or at least the simplest that I can actually understand. Sesqui's code earlier did my head in (see cartoon about dogs). :D

$row['body'] = preg_replace_callback(
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories (Need to know ID's -phpBB quirk!). ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)f=(\d+?)]</s>~s',
/*--- Internal links to boards (Need to know ID's -phpBB quirk!). ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)f=(\d+?)]</s>~s',
/*--- Internal links to topics. ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)t=(\d+?)]</s>~s',
/*--- Internal links to posts. ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)#p(\d+?)]</s>~s',
/*--- Internal links to members. ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?);u=(\d+?)]</s>~s',
/*--- End of internal links. ---*/
/*--- External links: automatically truncated url as linked text. ---*/
'~<URL url=\\"(.+?)\\"(.+?)LINK_TEXT text=\\"(.+?)\\"(.+?)</URL>~s',
/*--- External links: with linked text not truncated. ---*/
'~<URL url=\\"(.+?)\\"><s>(.+?)</s>(.+?)<e>\[/url]</e></URL>~s',
/*--- External links: legacy format. ---*/
'~<URL url=\\"(.+?)\\">(.+?)</URL>~s',
/*--- Stray url end tags. ---*/
'~<e>\[/url]</e></URL>~s',
),
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories (Need to know ID's -phpBB quirk!). ---*/
'[url=https://existing.com/forums/index.php?c=$3]',
/*--- Internal links to boards (Need to know ID's -phpBB quirk!). ---*/
'[url=https://existing.com/forums/index.php?board=$3]',
/*--- Internal links to topics. ---*/
'[url=https://existing.com/forums/index.php?topic=$3]',
/*--- Internal links to posts. ---*/
'[url=https://existing.com/forums/index.php?msg=$3]',
/*--- Internal links to members. ---*/
'[url=https://existing.com/forums/index.php?action=profile;u=$3]',
/*--- End of internal links. ---*/
/*--- External links: automatically truncated url as linked text. ---*/
'[url=$1]$3[/url]',
/*--- External links: with linked text not truncated. ---*/
'[url=$1]$3[/url]',
/*--- External links: legacy format. ---*/
'[url=$1]$2[/url]',
/*--- Stray url end tags. ---*/
'[/url]',
), $row['body']);
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 07:42:51 PM
I can probably figure out the above by myself (it's pretty simple, and I can check the PHP docs). But, there's another potential issue.

Converting the phpBB BBC for inline attachments is straightforward in terms of actual tag syntax. The catch is that SMF 2.1 (which is where any sane person will be ending up) does the inline attachments numbered relative to the existing number of images in the attachments directory:
[attach id=58]banner_audi.jpg[/attach]

[attach id=60]banner_aurora.jpg[/attach]

[attach id=62]banner_creek.jpg[/attach]

[attach id=64]banner_fjord.jpg[/attach]

[attach id=66]banner_fluffy.jpg[/attach]

[attach id=68]banner_gnarly.jpg[/attach]
With the numbering for inline attachments jumping 2 at a time, due to a corresponding thumbnail automatically being stashed for each image.

On the other hand, phpBB has the inline attachment tags numbered relative to the parent post:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_5.jpg\" index=\"1\"><s>[attachment=1]</s>banner_5.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_4.jpg\" index=\"2\"><s>[attachment=2]</s>banner_4.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_3.jpg\" index=\"3\"><s>[attachment=3]</s>banner_3.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_2.jpg\" index=\"4\"><s>[attachment=4]</s>banner_2.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_1.jpg\" index=\"5\"><s>[attachment=5]</s>banner_1.jpg<e>[/attachment]</e></ATTACHMENT>

This means the two tag systems are fundamentally incompatible. To make a conversion work, it would be necessary to change the attachment id's to match up with what SMF expects. I don't doubt that it is possible (almost anything is) but it doesn't appear to be simple.

Hey ho. :P
Title: Re: Converter shiz: WTF does this mean?
Post by: Diego Andrés on April 24, 2022, 08:12:20 PM
That's odd, would it not use the attachment id somewhere, just like smf does?
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 08:31:31 PM
Quote from: Diego Andrés on April 24, 2022, 08:12:20 PMThat's odd...
It's phpBB. :D

Quote...would it not use the attachment id somewhere, just like smf does?
It does, somewhere, but not in the actual BBC. Not in the database dump for the post either, AFAICT. It just has an integer for the number of attachments in that post. I assume assigning attachments to posts must be done via post ID in the attachments table.

ETA: Table structure for attachments.
CREATE TABLE `phpbb_attachments` (
  `attach_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `post_msg_id` int(10) unsigned NOT NULL DEFAULT '0',
  `topic_id` int(10) unsigned NOT NULL DEFAULT '0',
  `in_message` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `poster_id` int(10) unsigned NOT NULL DEFAULT '0',
  `is_orphan` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `physical_filename` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
  `real_filename` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
  `download_count` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `attach_comment` text COLLATE utf8_bin NOT NULL,
  `extension` varchar(100) COLLATE utf8_bin NOT NULL DEFAULT '',
  `mimetype` varchar(100) COLLATE utf8_bin NOT NULL DEFAULT '',
  `filesize` int(20) unsigned NOT NULL DEFAULT '0',
  `filetime` int(11) unsigned NOT NULL DEFAULT '0',
  `thumbnail` tinyint(1) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`attach_id`),
  KEY `filetime` (`filetime`),
  KEY `post_msg_id` (`post_msg_id`),
  KEY `topic_id` (`topic_id`),
  KEY `poster_id` (`poster_id`),
  KEY `is_orphan` (`is_orphan`)
) ENGINE=MyISAM AUTO_INCREMENT=6530 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

My guess is this determines if it's an inline attachment, and what order it is in:
`in_message` tinyint(1) unsigned NOT NULL DEFAULT '0',
Although that is still odd, because the first inline attachment in any post is always indexed as 0. Like this:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
ETA: Nope, that's not it either. Got me stumped. Looks like it just indexes them relative to the post, in the reverse of the order they are uploaded, starting at zero.
Attachments table:
(6461,48130,4261,0,1358,0,'1358_c4553767e8b0ee9078d6e4e3c39a6255','banner_1.jpg',3556,'','jpg','image/jpeg',105555,1614122796,1),
(6462,48130,4261,0,1358,0,'1358_c45dc4e3c29b8870a95186ab6f1d85bb','banner_2.jpg',3556,'','jpg','image/jpeg',69510,1614122801,1),
(6463,48130,4261,0,1358,0,'1358_37e8f4dd7f291b23a84796487bd966e8','banner_3.jpg',3556,'','jpg','image/jpeg',63468,1614122806,1),
(6464,48130,4261,0,1358,0,'1358_1b767931acc6cd0ff356978e4b93f6c6','banner_4.jpg',3556,'','jpg','image/jpeg',108579,1614122812,1),
(6465,48130,4261,0,1358,0,'1358_f8f7cc790ce69a57188d97e2fbbca71d','banner_5.jpg',3556,'','jpg','image/jpeg',122395,1614122817,1),
(6466,48130,4261,0,1358,0,'1358_7ff630b44c160801967fbcbb52ffcf27','banner_6.jpg',3556,'','jpg','image/jpeg',74076,1614122821,1),

Posts table - BBC tags:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_5.jpg\" index=\"1\"><s>[attachment=1]</s>banner_5.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_4.jpg\" index=\"2\"><s>[attachment=2]</s>banner_4.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_3.jpg\" index=\"3\"><s>[attachment=3]</s>banner_3.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_2.jpg\" index=\"4\"><s>[attachment=4]</s>banner_2.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_1.jpg\" index=\"5\"><s>[attachment=5]</s>banner_1.jpg<e>[/attachment]</e></ATTACHMENT>

So pretty much impossible to translate to SMF 2.1 format, AFAICT at the moment.
Title: Re: Converter shiz: WTF does this mean?
Post by: Tyrsson on April 24, 2022, 08:53:13 PM
Is there a lookup table that relates all of the attachments to each post maybe? It's just a wild guess.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 08:56:33 PM
That's done in the attachments table, AFAICT:
CREATE TABLE `phpbb_attachments` (
  `attach_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `post_msg_id` int(10) unsigned NOT NULL DEFAULT '0',
  `topic_id` int(10) unsigned NOT NULL DEFAULT '0',
    blah blah blah...

Like I said, it just bungs them into post_msg_id in the reverse of the order in which they were uploaded. Why reverse order? No idea, but that's what the BBC tags for that post are saying (indexing relative to post is arse about to attachment ID numbers).

ETA: Easy solution would be to just delete all inline attachment BBC on conversion. Not ideal, but should be workable if attachments themselves convert.
$row['body'] = preg_replace_callback(
array(
/*--- Note: Will need something to convert inline attachments ---*/
/*--- May be impossible, due to incompatibility in numbering. ---*/
/*--- Worst case scenario: delete all inline attachments BBC. ---*/
'~<ATTACHMENT(.+?)</ATTACHMENT>~s',
),
array(
'',
), $row['body']);
Title: Re: Converter shiz: WTF does this mean?
Post by: Tyrsson on April 24, 2022, 09:05:14 PM
LIFO, Last in, first out.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 09:13:49 PM
SNAFU :D
Title: Re: Converter shiz: WTF does this mean?
Post by: Tyrsson on April 24, 2022, 09:29:32 PM
Quote from: Antechinus on April 24, 2022, 09:13:49 PMSNAFU :D
Lol, yea. I was jokin, cause I really do not see how that would simplify anything there other than, that is just "how" they do it, so it's expected to be that way. I'm sure there is a reason, I just don't know it roflmao.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 24, 2022, 09:42:08 PM
Once again you will need preg_replace_callback(). This time, however, you will need to perform a database query inside the callback function in order to find the correct attachment ID number.

Since the string itself doesn't include the post ID, you'll need to pass that to your callback function via the use keyword. If you look at https://www.php.net/manual/en/function.preg-replace-callback.php, the first user comment shows you exactly how to do this.
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 09:51:08 PM
Ooooooo kkkkkkkkk. So, in essence...

1/ Get the post ID
2/ Use that to query the attachments table
3/ Get the array of attachments that match that post ID.
4/ Reverse sort them by attachment ID number
5/ That should, if plugged back into the posts table, give you the right attachments in the right order.
6/ BBC tags can then have numbers assigned by attachment ID number, a la SMF 2.1.

I get the general idea, I think.
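
The steps above could be sketched roughly like this. Big assumptions: convert_inline_attachments() is a hypothetical helper, the list of attachment IDs for the post is handed in as a plain array (a real converter would get it from a database query and would need the new SMF IDs, not the phpBB ones), the body is the unescaped text rather than the SQL dump form, and the reverse ordering is only what was observed in the sample post above:

```php
<?php
/**
 * Hypothetical helper: rewrite phpBB inline attachment tags in one post.
 * $attach_ids is assumed to hold the attachment IDs belonging to this post.
 */
function convert_inline_attachments($body, array $attach_ids)
{
    // Step 4: highest ID first, matching the reverse order observed above.
    rsort($attach_ids);

    return preg_replace_callback(
        '~<ATTACHMENT filename="([^"]*)" index="(\d+)"><s>\[attachment=\d+\]</s>.*?<e>\[/attachment\]</e></ATTACHMENT>~s',
        function ($matches) use ($attach_ids) {
            $index = (int) $matches[2];

            // No matching attachment: fall back to showing the file name only.
            if (!isset($attach_ids[$index]))
                return $matches[1];

            return '[attach id=' . $attach_ids[$index] . ']' . $matches[1] . '[/attach]';
        },
        $body
    );
}

$body = '<ATTACHMENT filename="banner_6.jpg" index="0"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>'
    . ' <ATTACHMENT filename="banner_5.jpg" index="1"><s>[attachment=1]</s>banner_5.jpg<e>[/attachment]</e></ATTACHMENT>';

echo convert_inline_attachments($body, array(6461, 6462, 6463, 6464, 6465, 6466)), "\n";
```

With the sample IDs from the attachments table, index 0 maps to the highest ID and index 1 to the next one down. Thumbnail IDs on the SMF side are not handled here.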
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 24, 2022, 10:06:42 PM
Ok, simple stuff (because I can just tell the attachments are going to be "fun")...

I want to call in the existing forum directory url as a variable. I assume it can be done like this:

/* NOTE: Escape any full stops, etc in the url (as \. ). */
$old_forum = 'https://existing\.com/forums';
/* NOTE: Do NOT escape them in this one! */
$new_forum = 'https://existing.com/forums';
/* This does the major work first. */
$row['body'] = preg_replace_callback(
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories (Need to know ID's -phpBB quirk!). ---*/
'~<URL (.+?)<s>\[url=' . $old_forum . '/(.+?)f=(\d+?)]</s>~s',
/*--- Internal links to boards (Need to know ID's -phpBB quirk!). ---*/
'~<URL (.+?)<s>\[url=' . $old_forum . '/(.+?)f=(\d+?)]</s>~s',
/*--- Internal links to topics. ---*/
'~<URL (.+?)<s>\[url=' . $old_forum . '/(.+?)t=(\d+?)]</s>~s',
/*--- Internal links to posts. ---*/
'~<URL (.+?)<s>\[url=' . $old_forum . '/(.+?)#p(\d+?)]</s>~s',
/*--- Internal links to members. ---*/
'~<URL (.+?)<s>\[url=' . $old_forum . '/(.+?);u=(\d+?)]</s>~s',
/*--- End of internal links. ---*/
),
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories (Need to know ID's -phpBB quirk!). ---*/
'[url=' . $new_forum . '/index.php?c=$3]',
/*--- Internal links to boards (Need to know ID's -phpBB quirk!). ---*/
'[url=' . $new_forum . '/index.php?board=$3]',
/*--- Internal links to topics. ---*/
'[url=' . $new_forum . '/index.php?topic=$3]',
/*--- Internal links to posts. ---*/
'[url=' . $new_forum . '/index.php?msg=$3]',
/*--- Internal links to members. ---*/
'[url=' . $new_forum . '/index.php?action=profile;u=$3]',
/*--- End of internal links. ---*/
), $row['body']);
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 25, 2022, 12:14:57 AM
Er, no. When using preg_replace_callback(), the second parameter needs to be a function, not just an array of strings. Look again at the example I gave you here (https://www.simplemachines.org/community/index.php?msg=4121653), and at the manual page (https://www.php.net/manual/en/function.preg-replace-callback.php) for preg_replace_callback().
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 25, 2022, 12:53:05 AM
I did. They don't make a damned bit of sense to me (see cartoon about dogs). :D

Ok, the only reason I went to preg_replace_callback was because of the font-size conversion. TBH, if it's going to mean writing functions everywhere instead of being able to drop in a simple variable, I'd be inclined to just strip the font-size tags, or default them all to 1em. IMO they're not critical content anyway. 99% of people won't care if old posts are all standard font-size.

I'm just wanting this thing usable, not perfect.
Title: Re: Converter shiz: WTF does this mean?
Post by: live627 on April 25, 2022, 04:54:11 AM
Quote from: Antechinus on April 24, 2022, 09:51:08 PMOoooooo kkkkkkkkk. So, in essence...

1/ Get the post ID
2/ Use that to query the attachments table
3/ Get the array of attachments that match that post ID.
4/ Reverse sort them by attachment ID number
5/ That should, if plugged back into the posts table, give you the right attachments in the right order.
6/ BBC tags can then have numbers assigned by attachment ID number, a la SMF 2.1.

I get the general idea, I think.
This is the same idea employed by the old ILA mod by @Spuds (and I think is in his forum platform)
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 25, 2022, 06:47:34 PM
I'm sure it's very cool, but frankly I have no idea how to write such code, and at the moment I don't really want to have to spend a lot of time learning how to write such code. I'm after a quick and effective compromise, mainly based on code I can realistically write myself (with a few simple pointers from others).

At the moment I'm thinking that as long as attachments per se convert without issues, it may be best to deal with the inline attachment tags by just converting them to display the actual file name. That way editing a post to get the right attachments back in the right places will be easy. Editing would only be done on posts where people think it matters, with the rest being ignored, but if you have the relevant file name where the inline attachment used to be you can still follow what was intended by referring to the standard (not inline) attachments beneath a post.

So offhand, convert this:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
To this:
[attach]banner_6.jpg[/attach]
Which is fairly tidy, and IMO is probably good enough.
Title: Re: Converter shiz: WTF does this mean?
Post by: Sesquipedalian on April 25, 2022, 07:59:09 PM
Quote from: Antechinus on April 25, 2022, 12:53:05 AMI did. They don't make a damned bit of sense to me (see cartoon about dogs). :D

:) Okay then. So, for the URL replacements you posted here (https://www.simplemachines.org/community/index.php?msg=4121981), just stick with plain old preg_replace() rather than preg_replace_callback(). You can do that because you are simply replacing one static string that you already know with another that you already know.
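
A minimal sketch of that, for topic links only. It assumes the phpBB 3.2+ storage format shown earlier in the thread, uses a made-up sample post, and swaps the hand-escaped URL for preg_quote() so only one unescaped copy of the forum URL is needed. The patterns are tightened versions for illustration, not the ones from the converter:

```php
<?php
// Plain, unescaped URLs; both values are placeholders.
$old_forum = 'https://existing.com/forums';
$new_forum = 'https://existing.com/forums';

// preg_quote() escapes the dots (and the ~ delimiter, if present) for us.
$old_quoted = preg_quote($old_forum, '~');

// Hypothetical post body in the phpBB 3.2+ storage format.
$body = '<URL url="https://existing.com/forums/viewtopic.php?f=3&amp;t=42"><s>[url=https://existing.com/forums/viewtopic.php?f=3&amp;t=42]</s>that topic<e>[/url]</e></URL>';

$body = preg_replace(
    array(
        // Internal links to topics.
        '~<URL [^>]*><s>\[url=' . $old_quoted . '/[^\]]*?[?;&]t=(\d+)[^\]]*\]</s>~s',
        // Closing url tags.
        '~<e>\[/url\]</e></URL>~s',
    ),
    array(
        '[url=' . $new_forum . '/index.php?topic=$1]',
        '[/url]',
    ),
    $body
);

echo $body, "\n";
```

The replacement strings are fixed apart from the $1 backreference, which is why no callback is needed here.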

Quote from: Antechinus on April 25, 2022, 06:47:34 PMAt the moment I'm thinking that as long as attachments per se convert without issues, it may be best to deal with the inline attachment tags by just converting them to display the actual file name. [snip...]

Well, you could do that, but you'll just end up with literal "[attach]banner_6.jpg[/attach]" strings being displayed in the posts, because SMF will skip any [attach] BBCode that doesn't have an id attribute. So if that's the plan, you might as well remove the BBCode tags entirely.

But don't give up on a proper conversion yet. ;)

Have the attachments already been converted before the post text is converted?
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 25, 2022, 08:09:38 PM
Quote from: Sesquipedalian on April 25, 2022, 07:59:09 PMWell, you could do that, but you'll just end up with literal "[attach]banner_6.jpg[/attach]" strings being displayed in the posts...
I know. :) My thinking was...
QuoteThat way editing a post to get the right attachments back in the right places will be easy. Editing would only be done on posts where people think it matters, with the rest being ignored, but if you have the relevant file name where the inline attachment used to be you can still follow what was intended by referring to the standard (not inline) attachments beneath a post.
IOW, I think it is better than deleting the tags entirely (which was my first thought).


QuoteBut don't give up on a proper conversion yet. ;)

Have the attachments already been converted before the post text is converted?
Not in the existing script. It does attachments last. However, this may just be the order someone threw things together. I don't know if attachments have to be done last, or if they can be done earlier. Just looking at what the script contains, my gut is telling me it could be done earlier.
/******************************************************************************/
--- Converting attachments...
/******************************************************************************/

---* {$to_prefix}attachments
---{
$no_add = true;

if (!isset($oldAttachmentDir))
{
$result = convert_query("
SELECT config_value
FROM {$from_prefix}config
WHERE config_name = 'upload_path'
LIMIT 1");
list ($oldAttachmentDir) = convert_fetch_row($result);
convert_free_result($result);

if (empty($oldAttachmentDir) || !file_exists($_POST['path_from'] . '/' . $oldAttachmentDir))
$oldAttachmentDir = $_POST['path_from'] . '/file';
else
$oldAttachmentDir = $_POST['path_from'] . '/' . $oldAttachmentDir;
}

/* Get $id_attach. */
if (empty($id_attach))
{
$result = convert_query("
SELECT MAX(id_attach) + 1
FROM {$to_prefix}attachments");
list ($id_attach) = convert_fetch_row($result);
convert_free_result($result);

$id_attach = empty($id_attach) ? 1 : $id_attach;
}

/* Set the default empty values. */
$width = 0;
$height = 0;

/* Is it an image? */
$attachmentExtension = strtolower(substr(strrchr($row['filename'], '.'), 1));
if (!in_array($attachmentExtension, array('jpg', 'jpeg', 'gif', 'png', 'bmp')))
$attachmentExtension = '';

$file_hash = getAttachmentFilename($row['filename'], $id_attach, null, true);
$physical_filename = $id_attach . '_' . $file_hash;

if (strlen($physical_filename) > 255)
return;
if (copy($oldAttachmentDir . '/' . $row['physical_filename'], $attachmentUploadDir . '/' . $physical_filename))
{
/* Is it an image? */
if (!empty($attachmentExtension))
{
list ($width, $height) = getimagesize($attachmentUploadDir . '/' . $physical_filename);
/* This shouldn't happen but apparently it might. */
if(empty($width))
$width = 0;
if(empty($height))
$height = 0;
}
$rows[] = array(
'id_attach' => $id_attach,
'size' => filesize($attachmentUploadDir . '/' . $physical_filename),
'filename' => $row['filename'],
'file_hash' => $file_hash,
'id_msg' => $row['id_msg'],
'downloads' => $row['downloads'],
'width' => $width,
'height' => $height,
);
$id_attach++;
}
---}
SELECT
post_msg_id AS id_msg, download_count AS downloads,
real_filename AS filename, physical_filename, filesize AS size
FROM {$from_prefix}attachments;
---*

AFAICT any variables called in that lot should keep the same values on conversion, so no drama.

Although it would be necessary to create the required db columns for inline attachments if you want a full conversion. This script currently only converts to 2.0.x, so no inline attachments stuff in the db by default. It'd need a custom query to add a couple of 2.1 bits (easy enough).
Title: Re: Converter shiz: WTF does this mean?
Post by: Tyrsson on April 25, 2022, 08:32:19 PM
@Antechinus To simplify it: the use keyword just makes the variables you pass to it usable in the function's scope (inside the function, or closure, anonymous function, yada yada).

In the reference on php.net this is the callback in the example:
function ($matches) {
            return strtolower($matches[0]);
        }
It's been a while since I looked, but I think a closure can only accept two arguments: $matches would be arg1 and you could pass another ($matches, $arg2); after that you need to use the "use":
function($matches, $arg2) use($other, $variables, $here): string {
   // if you include the return type ): string
   // you must ensure that the function returns a string
   // or php is gonna throw a fit ;)
}
Title: Re: Converter shiz: WTF does this mean?
Post by: Arantor on April 26, 2022, 03:29:00 AM
preg_replace_callback only gets one argument supplied to it, the $matches array.

Closures in general are normal functions and can accept any number of arguments, including variadics, while the use clause is to import things from current scope into the closure.
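
A small sketch of the difference, with a made-up $post_id standing in for whatever value the callback needs from the surrounding code:

```php
<?php
// Hypothetical value that the callback needs but the regex cannot supply.
$post_id = 48130;

$result = preg_replace_callback(
    '~\[attachment=(\d+)\]~',
    // The function receives only $matches; $post_id is imported via use.
    function ($matches) use ($post_id) {
        return '[post ' . $post_id . ', attachment index ' . $matches[1] . ']';
    },
    'See [attachment=0] and [attachment=1]'
);

echo $result, "\n";
```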
Title: Re: Converter shiz: WTF does this mean?
Post by: Tyrsson on April 26, 2022, 01:52:36 PM
Quote from: Arantor on April 26, 2022, 03:29:00 AMpreg_replace_callback only gets one argument supplied to it, the $matches array.

Closures in general are normal functions and can accept any number of arguments, including variadics, while the use clause is to import things from current scope into the closure.
Thanks for the clarification @Arantor, not sure why I had it stuck in my head that they could only accept two arguments...

The variadics have been added since I last did any coding, which tells ya how long that has been lmao... Way back round php 5.2 - 5.3
Title: Re: Converter shiz: WTF does this mean?
Post by: Antechinus on April 26, 2022, 06:59:20 PM
Did some more thinking about font sizes. Really, trying to convert them to a format that SMF understands by default is a bit stupid, particularly given that any members using the converted forum will be used to using phpBB syntax for font sizes.

So, on that basis, the sanest option is to install Doug's mod: phpBB-style Font Size BBCode (https://custom.simplemachines.org/index.php?mod=3714). Then the conversion just becomes:
$row['body'] = preg_replace(
    array(
        /*--- Font size: install Doug's mod - https://custom.simplemachines.org/index.php?mod=3714 ;) ---*/
        '~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
        '~<e>\[/size]</e></SIZE>~s',
    ),
    array(
        /*--- Font size: install Doug's mod - https://custom.simplemachines.org/index.php?mod=3714 ;) ---*/
        '[size=$1]',
        '[/size]',
    ), $row['body']);

Which means anyone can use the normal range of font size definitions that SMF allows by default, or use the phpBB version (eg: size=150) or use standard CSS ( eg: size=150%). I think this is an instance where requiring the use of a small mod after conversion really makes the most sense. I'd be inclined to include this advice in any instructions for the script.

Also, regarding conversion of category links vs board links: it really needs something like this:
$row['body'] = preg_replace(
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories. Need to know ID's (phpBB quirk!): ID's are f=(1|2|3|4|5|6). ---*/
'~<URL (.+?)<s>\[url=https://www\.old_forum\.com/(.+?)f=(1|2|3|4|5|6)]</s>~s',
/*--- Internal links to boards. ---*/
'~<URL (.+?)<s>\[url=https://www\.old_forum\.com/(.+?)f=(\d+?)]</s>~s',
),
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories. ---*/
'[url=https://www.new_forum.com/index.php?c=$3]',
/*--- Internal links to boards. ---*/
'[url=https://www.new_forum.com/index.php?board=$3]',
), $row['body']);

That will deal with any BBC links directly to categories, before any BBC links directly to boards are dealt with. This is necessary due to phpBB using the same syntax (f=****) for category links and board links (they are distinguished in the back end, by forum type).
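
An alternative sketch, where the category IDs live in an array and a single callback picks c= or board= per link. convert_forum_links() is a hypothetical helper, the ID list and URLs are placeholders, and the target URL shapes are simply copied from the post above rather than checked against SMF:

```php
<?php
/**
 * Hypothetical helper: convert phpBB f=NN links to category or board links.
 * $category_ids lists the phpBB forum IDs that are really categories.
 */
function convert_forum_links($body, $old_forum, $new_forum, array $category_ids)
{
    $pattern = '~<URL [^>]*><s>\[url=' . preg_quote($old_forum, '~') . '/[^\]]*?\bf=(\d+)\]</s>~s';

    return preg_replace_callback(
        $pattern,
        function ($matches) use ($new_forum, $category_ids) {
            // Same f=NN syntax in phpBB, so the ID list decides which kind of link it is.
            $key = in_array((int) $matches[1], $category_ids, true) ? 'c' : 'board';

            return '[url=' . $new_forum . '/index.php?' . $key . '=' . $matches[1] . ']';
        },
        $body
    );
}

$old = 'https://www.old_forum.com';
$new = 'https://www.new_forum.com';
$cats = array(1, 2, 3, 4, 5, 6);

$category_link = '<URL url="https://www.old_forum.com/viewforum.php?f=4"><s>[url=https://www.old_forum.com/viewforum.php?f=4]</s>News<e>[/url]</e></URL>';
$board_link = '<URL url="https://www.old_forum.com/viewforum.php?f=12"><s>[url=https://www.old_forum.com/viewforum.php?f=12]</s>Chat<e>[/url]</e></URL>';

echo convert_forum_links($category_link, $old, $new, $cats), "\n";
echo convert_forum_links($board_link, $old, $new, $cats), "\n";
```

This only rewrites the opening part of each link; the closing <e>[/url]</e></URL> still needs its own replacement, as in the earlier arrays.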