Converter shiz: WTF does this mean?

Started by Antechinus, April 17, 2022, 03:32:49 AM

Antechinus

#40
This seems to be the most sensible way of dealing with font-size:

$row['body'] = preg_replace(
array(
'~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
'~<e>\[/size]</e></SIZE><e>~',
),
array(
'[size=' . (round($1 / 100, 1)) . 'em]',
'[/size]',
), $row['body']);

Will handle phpBB sizes < 100% or > 100%, and will fit into SMF's default test for sizes < 1em. Easy.

ETA: Figured out what the dodgy <t > </t > and <r > </r > tags are in the database. The OP in any topic has its post body content wrapped in the "t" tags, for "topic". Any replies to that topic get the "r" tags, for "reply". They're still no use to SMF though. Just more crud to get rid of when converting.

Sesquipedalian

Quote from: Antechinus on April 22, 2022, 11:15:27 PM
This seems to be the most sensible way of dealing with font-size:

$row['body'] = preg_replace(
array(
'~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
'~<e>\[/size]</e></SIZE><e>~',
),
array(
'[size=' . (round($1 / 100, 1)) . 'em]',
'[/size]',
), $row['body']);

Will handle phpBB sizes < 100% or > 100%, and will fit into SMF's default test for sizes < 1em. Easy.

That won't work. If you want to perform calculations using the matched data like that, you will need to use preg_replace_callback() instead of preg_replace().
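
A minimal sketch of what that could look like, not a drop-in replacement for the converter. It assumes the stored post text uses plain double quotes around the size attribute, and the sample string is made up for illustration. The closing tag needs no calculation, so it stays with plain preg_replace():

```php
<?php
// Sample post body in the phpBB 3.2+ storage format (assumed for illustration).
$body = '<SIZE size="150"><s>[size=150]</s>Big text<e>[/size]</e></SIZE>';

// Opening tag: the callback receives the matches, so the percentage can be
// divided and rounded before it goes into the replacement.
$body = preg_replace_callback(
    '~<SIZE size="(\d+)"><s>\[size=\d+]</s>~',
    function ($matches) {
        return '[size=' . round($matches[1] / 100, 1) . 'em]';
    },
    $body
);

// Closing tag: a fixed replacement needs no callback.
$body = preg_replace('~<e>\[/size]</e></SIZE>~', '[/size]', $body);
```

After both calls $body holds [size=1.5em]Big text[/size].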
I promise you nothing.

Sesqu... Sesqui... what?
Sesquipedalian, the best word in the English language.

Antechinus

Meh. Ok. Can mess with that some more. Sounds like an easy fix.

However, there is a trickier one. I have all the default phpBB tags, except one, sorted for correct regex (AFAICT without live testing a conversion). They should all go to the SMF equivalents without any issues (famous last words). Will test it soon, but...

The tricky one is the BBC URL tag. The reason it is tricky is that when converting the phpBB board to SMF you will obviously want to change all internal links from ***/viewtopic.php?*** to ***/index.php?*** but, if some of the links are external links to another phpBB board you obviously do not want to convert those. Those have to stay in default phpBB syntax to remain functional.

So... I will need to write regex that will distinguish between this:
<URL url=\"http://the_board_to_convert.com/viewtopic.php?f=231&amp;t=379&amp;p=44733#p44733\">
and this:
<URL url=\"http://the_other_board.com/viewtopic.php?f=31&amp;t=3879&amp;p=40470#p40470\">
IOW: it has to know and remember the existing board's url ($scripturl, in SMF terms) so it can wallop anything starting with that, while not walloping links to the other board(s). Although links to the other board(s) still need some walloping to get their tags into SMF BBC URL syntax. They just don't want the actual URL being changed.

Obviously it would be possible to manually enter the existing board's base url into a regex. Wouldn't bother me doing it that way, but it might be a bit messy for general use. Better if it can just pick it up from somewhere handy, without the user having to frig around editing the SQL file.

Sesquipedalian

Here again preg_replace_callback() is your friend. Set up the regular expression to find the url tags, and make sure to capture the URL string itself in a capturing group. Then inside the callback function you can analyze the URL string further in order to choose your replacement dynamically.


Antechinus

Oh yeah, here's another pretty obvious thing...

The current conversion script has the BBC conversion arrays in two places: one for converting signatures, and the other for converting posts. That's ok, but it is missing one for converting PM content, so AFAICT it needs a third instance added for that purpose.

That's easy enough (I reckon I could handle that myself) but it's getting a tad silly, because then you'd have three identical arrays in three different places in the SQL file. That means chasing and editing all three of them if, for example, you want to add in code for custom BBC tags.

It would be saner to write one function for converting all BBC tags in all database tables, then call that function where appropriate. This is actually how it is done in OpenImporter (I know coz I peeked). Same result, less fuss. :)
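
A rough sketch of the single-function idea, under stated assumptions: the function name convert_phpbb_bbc() is hypothetical, the bold-tag markup is assumed to follow the same pattern as the size tags shown earlier, and only a couple of rules are included to show the shape:

```php
<?php
// Hypothetical helper: one place for every BBC conversion rule.
function convert_phpbb_bbc($text)
{
    // Fixed replacements live in one pair of pattern/replacement arrays.
    return preg_replace(
        array(
            // Bold tags (markup assumed to mirror the size tag layout).
            '~<B><s>\[b]</s>~',
            '~<e>\[/b]</e></B>~',
            // The <t> and <r> wrappers are of no use to SMF.
            '~</?[tr]>~',
        ),
        array(
            '[b]',
            '[/b]',
            '',
        ),
        $text
    );
}

// The same function is then called for posts, signatures and PMs.
$post      = convert_phpbb_bbc('<r><B><s>[b]</s>Hello<e>[/b]</e></B></r>');
$signature = convert_phpbb_bbc('<t>Plain signature</t>');
```

Adding a custom BBC tag then means editing one array instead of three.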

Sesquipedalian

#46
Heh.

This should get you started. But I have no idea how phpBB specifies its equivalent to $boardurl, so you will need to replace $phpBB_boardurl with whatever the actual variable is. If you also want to handle viewforum.php URLs, you can add an elseif (...) block to deal with those before the "// Not a URL for this forum, so do nothing to it." comment.


$row['body'] = preg_replace_callback(
    '~<URL url=\"(.+?)\">~',
    function ($matches) use ($scripturl, $phpBB_boardurl)
    {
        // Extract the query string from the URL.
        $query = parse_url($matches[1], PHP_URL_QUERY);

        // If the URL points to viewtopic.php on this forum,
        // rewrite it to the corresponding URL for SMF.
        if (strpos($matches[1], $phpBB_boardurl . '/viewtopic.php') === 0)
        {
            // Need to find the topic id.
            if (preg_match('~\bt=(\d+)~', $query, $sub_matches))
            {
                $topic = $sub_matches[1];

                // Next find the specific post, if present.
                if (preg_match('~\bp=(\d+)~', $query, $sub_matches))
                    $msg = $sub_matches[1];
                else
                    $msg = 0;

                // Now find the start, if present.
                if (preg_match('~\bstart=(\d+)~', $query, $sub_matches))
                    $start = $sub_matches[1];
                else
                    $start = 0;

                // Build the new URL.
                // First part is simple.
                $new_url = $scripturl . '?topic=' . $topic;

                // Append the msg bit if we have one.
                if (!empty($msg))
                    $new_url .= '.msg' . $msg . '#msg' . $msg;
                // Otherwise, append the start.
                else
                    $new_url .= '.' . $start;
            }
            // If no topic id was found, we can't do anything.
            else
                $new_url = $matches[1];
        }
        // Not a URL for this forum, so do nothing to it.
        else
        {
            $new_url = $matches[1];
        }

        return '[' . 'url=' . $new_url . ']';
    },
    $row['body']
);


Antechinus

#47
Ok thanks. I'll mess around with that. I had an idea about the font size conversion too. The existing script uses this:
$row['body'] = preg_replace(
array(
'~\[size=(.+?)\:(.+?)\]~is',
'~\[/size\:(.+?)?\]~is',
),
array(
'[size=' . convert_percent_to_px("\1") . 'px]',
'[/size]',
), $row['body']);

I assume the "\1" is the equivalent, for this purpose, of the usual $1, and convert_percent_to_px is called in from convert.php:
// Convert percent to pixels. Thanks Elberet.
function convert_percent_to_px($percent)
{
return intval(11*(intval($percent)/100.0));
}

So, offhand I see no reason why another function could not be added to convert.php:
// Convert percent to em, mate.
function convert_percent_to_em_mate($percent)
{
return round($percent / 100, 1);
}

Or something similar (not sure if I have the correct syntax). Then the revamped array just becomes:
$row['body'] = preg_replace(
array(
'~<SIZE size=\\"(.+?)\\"><s>\[size=(.+?)]</s>~s',
'~<e>\[/size]</e></SIZE><e>~',
),
array(
'[size=' . convert_percent_to_em_mate("\1") . 'em]',
'[/size]',
), $row['body']);

Which should work, AFAICT. :)

Sesquipedalian

The function you want to add should work, but again that preg_replace() needs to be a preg_replace_callback(). If what you posted above is what currently exists in the converter script, then the converter script is wrong and won't work as intended.
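
To see why, it may help to evaluate the replacement expression on its own. This is a small sketch using the convert_percent_to_px() function quoted above; the sample input string is made up. PHP builds the replacement string before preg_replace() ever runs, and inside a double-quoted PHP string "\1" is an octal escape for the character with code 1, not a backreference:

```php
<?php
// Copied from convert.php as quoted in this thread.
function convert_percent_to_px($percent)
{
    return intval(11 * (intval($percent) / 100.0));
}

// "\1" in a double-quoted PHP string is the single byte 0x01.
$argument = "\1";

// This expression is evaluated once, before any matching happens.
$replacement = '[size=' . convert_percent_to_px($argument) . 'px]';

// Every size tag therefore receives the same constant replacement.
$result = preg_replace('~\[size=(.+?)\:(.+?)\]~is', $replacement, '[size=150:abc123]text');
```

Here $replacement is already the fixed string [size=0px] by the time preg_replace() is called, so the 150 in the sample input never reaches the function.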

Antechinus

Lol. That does not surprise me at all, given we already know the converter script is wrong* and does not work as intended. :D

*Well, wrong for phpBB 3.2.x and 3.3.x anyway.

Sesquipedalian

And also apparently wrong in how it thinks preg_replace() works.

Once upon a time, preg_replace() supported an 'e' modifier that would treat the replacement as a string of PHP code to eval() in order to generate the final replacement string. I suspect that what happened here is that when the 'e' modifier became deprecated, someone tried to update this converter script to deal with that, but they botched the job.

tinoest

Quote from: Sesquipedalian on April 23, 2022, 11:42:35 AM
The function you want to add should work, but again that preg_replace() needs to be a preg_replace_callback(). If what you posted above is what currently exists in the converter script, then the converter script is wrong and won't work as intended.

Why should it be the preg_replace_callback? I don't see any reason that preg_replace wouldn't work in that example. Unless I'm missing the obvious. It accepts an array as the regex, and one for replacement then will do it on the subject.

Arantor

Because the replacement requires a computation to be done on the text to be replaced.

There's a rounding + division by 100 in there, which means you're operating on the replacement before replacing it into the result, which can't be put into the result using preg_replace.

At least, if you're converting 120% to 1.2em or similar - you could possibly refactor this into multiple matches but then you need to accommodate where you have 2 digits percentage (e.g. size=80 -> [size=0.8em]) as well as 3 digits (e.g. size=120 -> [size=1.2em])
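
For what it's worth, a sketch of that multiple-match refactor might look like the following. The function name is made up, it assumes sizes are whole percentages from 10 to 999 written as bare [size=NNN] tags, and it drops the last digit instead of rounding (so 125 would become 1.2em):

```php
<?php
// Hypothetical callback-free alternative: one pattern per digit count.
function size_percent_to_em_without_callback($text)
{
    return preg_replace(
        array(
            // Three digits, e.g. size=120 -> 1.2em.
            '~\[size=(\d)(\d)\d]~',
            // Two digits, e.g. size=80 -> 0.8em.
            '~\[size=(\d)\d]~',
        ),
        array(
            '[size=$1.$2em]',
            '[size=0.$1em]',
        ),
        $text
    );
}

$large = size_percent_to_em_without_callback('[size=120]a[/size]');
$small = size_percent_to_em_without_callback('[size=80]b[/size]');
```

It works for these two cases, but every extra digit count or rounding rule needs another pattern, which is why the callback is the tidier route.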

tinoest

Quote from: Arantor on April 24, 2022, 02:32:02 PM
Because the replacement requires a computation to be done on the text to be replaced.

There's a rounding + division by 100 in there, which means you're operating on the replacement before replacing it into the result, which can't be put into the result using preg_replace.

At least, if you're converting 120% to 1.2em or similar - you could possibly refactor this into multiple matches but then you need to accommodate where you have 2 digits percentage (e.g. size=80 -> [size=0.8em]) as well as 3 digits (e.g. size=120 -> [size=1.2em])

Ahh I missed the function call on the replace. Thanks for highlighting that.

Antechinus

Ok, here's some fun and games. The internal links on the forum to be converted need to go from phpBB f=**;t=***;p****#p**** format to SMF c=**;board=***;topic=****;msg=***** format. There are several variations that need to be dealt with. For added fun and games, phpBB uses the same format for categories and boards: both are coded as f=** in the database. Yay!

So, really it needs some way of distinguishing which are phpBB categories and which are phpBB boards. IOW, somehow the ID's for categories have to be entered into an array first, so those links can be converted as category links, then a second regex deals with the remainder as boards. Whoopee. :P

And, it needs a way of entering the existing forum's path, so that can be converted to SMF's *****/index.php?etc... from phpBB's *****/index.php?etc... and *****/viewtopic.php?etc... and whatever else (hence the humungous capture groups in some of these expressions - kill anything not required) without touching any links to other phpBB boards.

So https:// existing \.com/forums needs to be stashed somewhere once, so it can be called where needed.

And TBH I think this is the simplest way of dealing with it, or at least the simplest that I can actually understand. Sesqui's code earlier did my head in (see cartoon about dogs). :D

$row['body'] = preg_replace(
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories (Need to know ID's -phpBB quirk!). ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)f=(\d+?)]</s>~s',
/*--- Internal links to boards (Need to know ID's -phpBB quirk!). ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)f=(\d+?)]</s>~s',
/*--- Internal links to topics. ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)t=(\d+?)]</s>~s',
/*--- Internal links to posts. ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?)#p(\d+?)]</s>~s',
/*--- Internal links to members. ---*/
'~<URL (.+?)<s>\[url=https://existing\.com/forums/(.+?);u=(\d+?)]</s>~s',
/*--- End of internal links. ---*/
/*--- External links: automatically truncated url as linked text. ---*/
'~<URL url=\\"(.+?)\\"(.+?)LINK_TEXT text=\\"(.+?)\\"(.+?)</URL>~s',
/*--- External links: with linked text not truncated. ---*/
'~<URL url=\\"(.+?)\\"><s>(.+?)</s>(.+?)<e>\[/url]</e></URL>~s',
/*--- External links: legacy format. ---*/
'~<URL url=\\"(.+?)\\">(.+?)</URL>~s',
/*--- Stray url end tags. ---*/
'~<e>\[/url]</e></URL>~s',
),
array(
/*--- Internal links: to be converted to SMF c/board/topic/msg format. ---*/
/*--- Internal links to categories (Need to know ID's -phpBB quirk!). ---*/
'[url=https://existing.com/forums/index.php?c=$3]',
/*--- Internal links to boards (Need to know ID's -phpBB quirk!). ---*/
'[url=https://existing.com/forums/index.php?board=$3]',
/*--- Internal links to topics. ---*/
'[url=https://existing.com/forums/index.php?topic=$3]',
/*--- Internal links to posts. ---*/
'[url=https://existing.com/forums/index.php?msg=$3]',
/*--- Internal links to members. ---*/
'[url=https://existing.com/forums/index.php?action=profile;u=$3]',
/*--- End of internal links. ---*/
/*--- External links: automatically truncated url as linked text. ---*/
'[url=$1]$3[/url]',
/*--- External links: with linked text not truncated. ---*/
'[url=$1]$3[/url]',
/*--- External links: legacy format. ---*/
'[url=$1]$2[/url]',
/*--- Stray url end tags. ---*/
'[/url]',
), $row['body']);
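
One possible way to handle the category/board split, offered as a sketch and not tested against a real conversion: keep the category IDs in an array and decide inside a callback. The helper name convert_internal_url(), the three variables at the top and the sample body are all placeholders, and it assumes IDs survive the conversion unchanged:

```php
<?php
// Placeholder values, assumed for illustration only.
$phpbb_url     = 'https://existing.com/forums';
$smf_scripturl = 'https://existing.com/forums/index.php';
$category_ids  = array(1, 5); // phpBB f= ids that are really categories

// Hypothetical helper: map one phpBB URL to its SMF equivalent.
function convert_internal_url($url, $phpbb_url, $smf_scripturl, array $category_ids)
{
    // Anything outside this forum, including other phpBB boards, is untouched.
    if (strpos($url, $phpbb_url . '/') !== 0)
        return $url;

    // Check the most specific target first: post, topic, member, then forum.
    if (preg_match('~#p(\d+)$~', $url, $m))
        return $smf_scripturl . '?msg=' . $m[1];
    if (preg_match('~[?&;]t=(\d+)~', $url, $m))
        return $smf_scripturl . '?topic=' . $m[1];
    if (preg_match('~[?&;]u=(\d+)~', $url, $m))
        return $smf_scripturl . '?action=profile;u=' . $m[1];
    if (preg_match('~[?&;]f=(\d+)~', $url, $m))
    {
        // phpBB uses f= for both; the lookup array decides which it is.
        $type = in_array((int) $m[1], $category_ids, true) ? 'c' : 'board';
        return $smf_scripturl . '?' . $type . '=' . $m[1];
    }

    // An internal URL we do not recognise is left as it was.
    return $url;
}

// Used from a callback, so the captured URL can be inspected first.
$body = '<URL url="https://existing.com/forums/viewforum.php?f=5"><s>[url=https://existing.com/forums/viewforum.php?f=5]</s>';
$body = preg_replace_callback(
    '~<URL url="([^"]+)"><s>\[url=[^\]]+]</s>~',
    function ($matches) use ($phpbb_url, $smf_scripturl, $category_ids) {
        return '[url=' . convert_internal_url($matches[1], $phpbb_url, $smf_scripturl, $category_ids) . ']';
    },
    $body
);
```

The forum path is stored once in $phpbb_url, and links to other phpBB boards fall through the first check with their URL intact.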

Antechinus

I can probably figure out the above by myself (it's pretty simple, and I can check the PHP docs). But, there's another potential issue.

Converting the phpBB BBC for inline attachments is straightforward in terms of actual tag syntax. The catch is that SMF 2.1 (which is where any sane person will be ending up) numbers its inline attachments relative to the existing number of images in the attachments directory:
[attach id=58]banner_audi.jpg[/attach]

[attach id=60]banner_aurora.jpg[/attach]

[attach id=62]banner_creek.jpg[/attach]

[attach id=64]banner_fjord.jpg[/attach]

[attach id=66]banner_fluffy.jpg[/attach]

[attach id=68]banner_gnarly.jpg[/attach]
With the numbering for inline attachments jumping 2 at a time, due to a corresponding thumbnail automatically being stashed for each image.

On the other hand, phpBB has the inline attachment tags numbered relative to the parent post:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_5.jpg\" index=\"1\"><s>[attachment=1]</s>banner_5.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_4.jpg\" index=\"2\"><s>[attachment=2]</s>banner_4.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_3.jpg\" index=\"3\"><s>[attachment=3]</s>banner_3.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_2.jpg\" index=\"4\"><s>[attachment=4]</s>banner_2.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_1.jpg\" index=\"5\"><s>[attachment=5]</s>banner_1.jpg<e>[/attachment]</e></ATTACHMENT>

This means the two tag systems are fundamentally incompatible. To make a conversion work, it would be necessary to change the attachment id's to match up with what SMF expects. I don't doubt that it is possible (almost anything is) but it doesn't appear to be simple.

Hey ho. :P

Diego Andrés

That's odd, would it not use the attachment id somewhere, just like smf does?


Antechinus

#57
Quote from: Diego Andrés on April 24, 2022, 08:12:20 PM
That's odd...

It's phpBB. :D

Quote
...would it not use the attachment id somewhere, just like smf does?

It does, somewhere, but not in the actual BBC. Not in the database dump for the post either, AFAICT. It just has an integer for the number of attachments in that post. I assume assigning attachments to posts must be done via post ID in the attachments table.

ETA: Table structure for attachments.
CREATE TABLE `phpbb_attachments` (
  `attach_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `post_msg_id` int(10) unsigned NOT NULL DEFAULT '0',
  `topic_id` int(10) unsigned NOT NULL DEFAULT '0',
  `in_message` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `poster_id` int(10) unsigned NOT NULL DEFAULT '0',
  `is_orphan` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `physical_filename` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
  `real_filename` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
  `download_count` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `attach_comment` text COLLATE utf8_bin NOT NULL,
  `extension` varchar(100) COLLATE utf8_bin NOT NULL DEFAULT '',
  `mimetype` varchar(100) COLLATE utf8_bin NOT NULL DEFAULT '',
  `filesize` int(20) unsigned NOT NULL DEFAULT '0',
  `filetime` int(11) unsigned NOT NULL DEFAULT '0',
  `thumbnail` tinyint(1) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`attach_id`),
  KEY `filetime` (`filetime`),
  KEY `post_msg_id` (`post_msg_id`),
  KEY `topic_id` (`topic_id`),
  KEY `poster_id` (`poster_id`),
  KEY `is_orphan` (`is_orphan`)
) ENGINE=MyISAM AUTO_INCREMENT=6530 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

My guess is this determines if it's an inline attachment, and what order it is in:
`in_message` tinyint(1) unsigned NOT NULL DEFAULT '0',
Although that is still odd, because the first inline attachment in any post is always indexed as 0. Like this:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
ETA: Nope, that's not it either. Got me stumped. Looks like it just indexes them relative to the post, in the reverse of the order they are uploaded, starting at zero.
Attachments table:
(6461,48130,4261,0,1358,0,'1358_c4553767e8b0ee9078d6e4e3c39a6255','banner_1.jpg',3556,'','jpg','image/jpeg',105555,1614122796,1),
(6462,48130,4261,0,1358,0,'1358_c45dc4e3c29b8870a95186ab6f1d85bb','banner_2.jpg',3556,'','jpg','image/jpeg',69510,1614122801,1),
(6463,48130,4261,0,1358,0,'1358_37e8f4dd7f291b23a84796487bd966e8','banner_3.jpg',3556,'','jpg','image/jpeg',63468,1614122806,1),
(6464,48130,4261,0,1358,0,'1358_1b767931acc6cd0ff356978e4b93f6c6','banner_4.jpg',3556,'','jpg','image/jpeg',108579,1614122812,1),
(6465,48130,4261,0,1358,0,'1358_f8f7cc790ce69a57188d97e2fbbca71d','banner_5.jpg',3556,'','jpg','image/jpeg',122395,1614122817,1),
(6466,48130,4261,0,1358,0,'1358_7ff630b44c160801967fbcbb52ffcf27','banner_6.jpg',3556,'','jpg','image/jpeg',74076,1614122821,1),

Posts table - BBC tags:
<ATTACHMENT filename=\"banner_6.jpg\" index=\"0\"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_5.jpg\" index=\"1\"><s>[attachment=1]</s>banner_5.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_4.jpg\" index=\"2\"><s>[attachment=2]</s>banner_4.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_3.jpg\" index=\"3\"><s>[attachment=3]</s>banner_3.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_2.jpg\" index=\"4\"><s>[attachment=4]</s>banner_2.jpg<e>[/attachment]</e></ATTACHMENT>
<ATTACHMENT filename=\"banner_1.jpg\" index=\"5\"><s>[attachment=5]</s>banner_1.jpg<e>[/attachment]</e></ATTACHMENT>

So pretty much impossible to translate to SMF 2.1 format, AFAICT at the moment.

Tyrsson

Is there a lookup table that relates all of the attachments to each post maybe? It's just a wild guess.

Antechinus

That's done in the attachments table, AFAICT:
CREATE TABLE `phpbb_attachments` (
  `attach_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `post_msg_id` int(10) unsigned NOT NULL DEFAULT '0',
  `topic_id` int(10) unsigned NOT NULL DEFAULT '0',
    blah blah blah...

Like I said, it just bungs them into post_msg_id in the reverse of the order in which they were uploaded. Why reverse order? No idea, but that's what the BBC tags for that post are saying (indexing relative to post is arse about to attachment ID numbers).

ETA: Easy solution would be to just delete all inline attachment BBC on conversion. Not ideal, but should be workable if attachments themselves convert.
$row['body'] = preg_replace(
array(
/*--- Note: Will need something to convert inline attachments ---*/
/*--- May be impossible, due to incompatibility in numbering. ---*/
/*--- Worst case scenario: delete all inline attachments BBC. ---*/
'~<ATTACHMENT(.+?)</ATTACHMENT>~s',
),
array(
'',
), $row['body']);
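
If deleting them turns out to be too blunt, the reverse-order observation could in principle drive a mapping instead. This is only a sketch under big assumptions: the helper name is made up, the "index 0 is the newest upload" rule is inferred from the one sample post above, and the caller has to supply the SMF attachment ids for the post in upload order (which may not match phpBB's attach_id values once thumbnails are counted):

```php
<?php
// Hypothetical helper. $attach_ids holds the attachment ids for one post
// in ascending (upload) order; how they are looked up is left open here.
function convert_inline_attachments($body, array $attach_ids)
{
    // Assumption from the sample data: index 0 is the newest attachment.
    $newest_first = array_reverse($attach_ids);

    return preg_replace_callback(
        '~<ATTACHMENT filename="([^"]*)" index="(\d+)"><s>\[attachment=\d+]</s>.*?<e>\[/attachment]</e></ATTACHMENT>~s',
        function ($matches) use ($newest_first) {
            $index = (int) $matches[2];

            // No matching attachment: drop the tag entirely.
            if (!isset($newest_first[$index]))
                return '';

            return '[attach id=' . $newest_first[$index] . ']' . $matches[1] . '[/attach]';
        },
        $body
    );
}

// Ids taken from the attachments dump quoted earlier in the thread.
$ids   = array(6461, 6462, 6463, 6464, 6465, 6466);
$first = convert_inline_attachments('<ATTACHMENT filename="banner_6.jpg" index="0"><s>[attachment=0]</s>banner_6.jpg<e>[/attachment]</e></ATTACHMENT>', $ids);
$last  = convert_inline_attachments('<ATTACHMENT filename="banner_1.jpg" index="5"><s>[attachment=5]</s>banner_1.jpg<e>[/attachment]</e></ATTACHMENT>', $ids);
```

With the sample ids, index 0 maps to 6466 and index 5 to 6461, which matches the filenames in the dump. Whether that ordering holds for every post would need checking on real data.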
