Need help with parsing message => HTML

Started by HunterP, February 16, 2011, 09:43:35 AM

HunterP


Hi there,

I've written a little mod which parses specific message lines into a more readable table. Example:

http://www.hulpverleningsforum.nl/forum/index.php?topic=49050.msg850035#msg850035

What this mod does :

1. Recognizes these lines with a preg_match in preparsecode() in Subs-Post.php and puts the recognized data in a customized BBcode table
2. In Subs-Post.php the "fix mistakes" array is modified for the customized TABLEs and TDs
3. In Subs.php the $code-array in parse_bbc() is modified for these new TABLEs and TDs

Example #1 :

if (preg_match('~(?:\n|\r|^)(\d\d:\d\d:\d\d\s\d\d-\d\d-\d\d)\s*(?:\d\d/\d\d\d\s*|)([-| ]ALPHA[-| ]|GROUP-\d|[-| ]StNUM[-| ])\s*(.*?)(?:\n|\r)~', $message) != 0)
{
$message = preg_replace('~(?:\n|\r|^)(\d\d:\d\d:\d\d\s\d\d-\d\d-\d\d)\s*(?:\d\d/\d\d\d\s*|)([-| ]ALPHA[-| ]|GROUP-\d|[-| ]StNUM[-| ])\s*(.*?)(?:\n|\r)~', "\n" . '[table=pdw][tr][td=time]$1[/td][td=type][color=#999999]$2[/color][/td][td=message]$3[/td][/tr]' . "\n", $message);
$message = preg_replace('~(?:[\x{0A}-\x{0D}])(?!\[)(.*?)\s*(\d{7})\s*(.*?)(?<!\])(?=[\x{00}-\x{19}]|$)~', "\n" . '[tr][td=extra]$1[/td][td=capcode]$2[/td][td=label]$3[/td][/tr]', $message);

$message = preg_replace('~(\[td=extra\])\s*(.*?)\s*(\[/td\])~', '$1$2$3', $message);
}


This gives me :

[table=pdw][tr][td=time]17:57:44 15-02-11[/td][td=type][color=#999999]GROUP-2[/color][/td][td=message]A2 19105 HOUTKADE GOES rit 3804 [/td][/tr]
[tr][td=extra][/td][td=capcode]1320105[/td][td=label]Ambu 19-105 (Goes)[/td][/tr][/table]


Example #2 :

'~\[tr\](?![\s' . $non_breaking_space . ']*\[(td|td=\S{0,9})\])~s' . ($context['utf8'] ? 'u' : '') => '[tr][td]',

Example #3 :

array(
	'tag' => 'table',
	'type' => 'unparsed_commas',
	'test' => '\w+,\w+\]',
	'before' => '<table class="$1" id="$2">',
	'after' => '</table>',
	'require_children' => array('tr'),
	'block_level' => true,
),


Not all the code is shown, just some examples. Maybe my coding is a bit crappy, but on my 1.1.13 forum it does exactly what I want (see the URL at the beginning of this post).

When upgrading a non-live 1.1.13 forum to 2.0 RC5, I can't get my mod working. I've made some changes so that all operations succeed, but somehow all default tags (like TR, /TR, /TD, /TABLE) disappear and only the custom tags (TABLE=foo and TD=fighters) remain.

Maybe it is better not to fix this, but to improve the modification instead: skip the BBcode part (I only did this because I wrote a little program in C++ which can output these messages in BBcode format) and output HTML directly. But I have no clue where to add this code.

What I want is not to modify the postings anymore, but to change the layout at the moment it gets displayed.

I want to scan the posting for some patterns (like in example #1) and to preg_replace them with HTML.

Very basic, like this :

17:57:44 15-02-11 GROUP-2 A2 19105 HOUTKADE GOES rit 3804
     1320105 Ambu 19-105 (Goes)

Will be recognized by my preg_match and modified into :

<table>
<tr><td class="time">17:57:44 15-02-11</td><td class="type">GROUP-2</td><td class="message">A2 19105 HOUTKADE GOES rit 3804</td></tr>
<tr><td class="capcode">1320105</td><td class="label">Ambu 19-105 (Goes)</td></tr>
</table>


The main question is: where can I put a preg_match() which (pre)parses a posting into HTML without messing up anything else?

HunterP


Not for bumping, think I found something.

In Display.php :

   // Run BBC interpreter on the message.
   $message['body'] = parse_bbc($message['body'], $message['smileys_enabled'], $message['id_msg']);
   $message['body'] = preg_replace('~(1234567890)~', '<b>$1</b>', $message['body']);

I've added the second line for testing, and when writing Test 1234567890 in a message, it appears as Test 1234567890 with the number in bold.
So far, so good, but as I'm determined to do it as well as possible this time, I'd like to know whether this is the best place for my "custom HTML parsing"?

Arantor

You'd be best doing it inside parse_bbc directly, to be honest; that way it's covered wherever the bbc parser is invoked.

HunterP

Quote from: Arantor on February 16, 2011, 01:21:50 PM
You'd be best doing it inside parse_bbc directly, to be honest; that way it's covered wherever the bbc parser is invoked.

Thanks... Actually I'm running into some difficulties, as I'm suddenly confronted with &nbsp; and <br />, and I think I'll have to find the point before these are added...

Arantor

They're added in preparsecode.

HunterP

Quote from: Arantor on February 16, 2011, 01:39:50 PM
They're added in preparsecode.

Thanks, but that's difficult, as they are (of course) being put into the database. So I decided to strtr all nbsp's and <br />'s to make it a bit easier, and to put them back when I've finished modifying. After this, it was very easy, as I could use most of the previous code. The only case where it doesn't work is when a message gets AJAX-modified. The message then appears in plain text. I have to refresh before the table(s) become visible, but I can live with that :)

Arantor

That's because quick modify is also doing parse_bbc and preparsecode in transit. It's really not wise to mess with nbsps and <br />s though. Stuff has a nasty habit of breaking unexpectedly.

HunterP

Quote from: Arantor on February 16, 2011, 08:48:41 PM
That's because quick modify is also doing parse_bbc and preparsecode in transit. It's really not wise to mess with nbsps and <br />s though. Stuff has a nasty habit of breaking unexpectedly.

What are the risks?

This is what I do:

$message['body'] = strtr($message['body'], array('&nbsp;' => ' ', '<br />' => "\n"));

mycode();

$message['body'] = strtr($message['body'], array("\n" => '<br />'));

I'm not putting the spaces back  ::)

Arantor

It's not a case of risk, more the fact that stuff has a habit of breaking unexpectedly. There are, of course, reasons why nbsp gets used rather than just spaces, formatting for one (now, if anyone hits a double space, it will be compressed to a single space, for example)

HunterP


First, I put this mod inside prepareDisplayContext() in Display.php, but that only affected regular postings. I've moved it to parse_bbc() in Subs.php, and now PMs and previews of PMs are OK as well. The only thing I'm missing is the preview of a posting. How does that BBC get parsed? Or is it the same function and am I missing something...

Arantor

parse_bbc isn't discriminating, it's applied for previews too. And signatures.

HunterP

Quote from: Arantor on February 28, 2011, 03:01:42 AM
parse_bbc isn't discriminating, it's applied for previews too. And signatures.

Weird... It doesn't seem to work correctly only when previewing a message. Maybe my regex is crappy, but it works perfectly in all other situations and the strings are the same. Let's make 2 screenshots :

Okay wait a sec... I've just discovered something really weird. When previewing a NEW posting, it works OK. When editing an existing posting it does NOT work ok when previewing. What is the difference??

Previewing a new posting


Previewing an edited posting


The string(s) :

19:02:12 28-02-11   00/070   GROUP-2    Prio 2 Bev gaarne tel contact RAC ND Woensdrechthof 118 Grip (Grip 1: ja) 3132
         GRIP 1       1500010    BRW Veiligheidsregio Haaglanden (Monitorcode)
         GRIP 1       1500412    BRW Pijnacker - Nootdorp (Kazernealarm)
         GRIP 1       1500520    BRW Pijnacker - Nootdorp (Lichtkrant)

My RegEx :

if (preg_match('~^(?: ?)(\d{1,2}:\d{2}:\d{2})\s(\d{2}-\d{2}-\d{2})\s+(?:\d{2}/\d{3}\s+|)([- ]ALPHA[- ]|GROUP-\d|[- ]StNUM[- ]|-GROUP-)\s+(.+)$\s*.*?\s*(\d{7})\s+(.+?)$~m', $message) != 0)
{
$message = preg_replace('~^(?: ?)(\d{1,2}:\d{2}:\d{2})\s(\d{2}-\d{2}-\d{2})\s+(?:\d{2}/\d{3}\s+|)([- ]ALPHA[- ]|GROUP-\d|[- ]StNUM[- ]|-GROUP-)\s+(.+)$~m', '<div class="monitor">p2kflex</div><table class="pdw"><tr><td class="time">$1 $2</td><td class="type"><span style="color:#999999;">$3</span></td><td class="message">$4</td></tr>', $message);
$message = preg_replace('~^(.{0,25})(\d{7})\s+(.+?)$~m', '<tr><td class="extra">$1</td><td class="capcode">$2</td><td class="label">$3</td></tr>', $message);
}


Maybe my RegEx'es can be improved, I'm already happy that I got this working :)

First I test for the existence of this specific pattern, then the first line is preg_replaced, then all following lines (there can be an unlimited number) are preg_replaced. Please note that the strings can differ a little bit. For some members, the string can contain tabs instead of spaces. That's the reason for \s+; unfortunately it can't be any more specific.
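The test-then-replace steps described above could also be folded into a single pass, sketched below. This is only an illustration under some assumptions: the function name pdw_to_table is hypothetical (it is not an SMF function), the text is assumed to be HTML-escaped already and to use "\n" line endings, and the 25-character limit for the "extra" column is taken from the regex above. Unlike the two separate preg_replace calls, it also closes the table.

```php
<?php
// Hypothetical helper (not an SMF function): converts pager blocks to HTML tables.
// Assumes $text is already HTML-escaped and uses "\n" line endings.
function pdw_to_table($text)
{
    // Header line: time, date, optional "00/070" counter, type, message text.
    $header = '(\d{1,2}:\d{2}:\d{2})\s(\d{2}-\d{2}-\d{2})\s+(?:\d{2}/\d{3}\s+)?'
        . '(GROUP-\d|-GROUP-|[- ]ALPHA[- ]|[- ]StNUM[- ])\s+([^\n]+)';
    // Follow-up line: up to 25 characters of "extra" text, a 7-digit capcode, a label.
    $row = '[^\n]{0,25}?\d{7}[ \t]+[^\n]+';

    return preg_replace_callback(
        '~^ ?' . $header . '((?:\n' . $row . ')+)~m',
        function ($m) {
            $html = '<table class="pdw"><tr><td class="time">' . $m[1] . ' ' . $m[2] . '</td>'
                . '<td class="type">' . trim($m[3]) . '</td>'
                . '<td class="message">' . rtrim($m[4]) . '</td></tr>';

            // Split the captured follow-up lines into individual rows.
            preg_match_all('~^([^\n]{0,25}?)(\d{7})[ \t]+([^\n]+)$~m', $m[5], $rows, PREG_SET_ORDER);
            foreach ($rows as $r) {
                $html .= '<tr><td class="extra">' . trim($r[1]) . '</td>'
                    . '<td class="capcode">' . $r[2] . '</td>'
                    . '<td class="label">' . rtrim($r[3]) . '</td></tr>';
            }

            return $html . '</table>';
        },
        $text
    );
}
```

Because the header and its follow-up lines are matched as one block, text that does not contain the full pattern is returned unchanged, so the separate preg_match test is not needed in this variant.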

For some 'live' examples :

http://www.hulpverleningsforum.nl/forum/index.php?topic=49967.0

Arantor

That would imply that newlines are being handled differently between the two. Newlines will have been modified into br tags by SMF after saving.

HunterP

Quote from: Arantor on February 28, 2011, 01:54:21 PM
That would imply that newlines are being handled differently between the two. Newlines will have been modified into br tags by SMF after saving.

Before my regex, I modify <br> into \n and &nbsp; into a space (and back afterwards).

Arantor


HunterP


HunterP

Quote from: HunterP on February 28, 2011, 01:48:33 PM
When previewing a NEW posting, it works OK. When editing an existing posting it does NOT work ok when previewing. What is the difference??

Coming closer, if I preview this posting :

19:02:12 28-02-11   00/070   GROUP-2    Prio 2 Bev gaarne tel contact RAC ND Woensdrechthof 118 Grip (Grip 1: ja) 3132
         GRIP 1         1500010    BRW Veiligheidsregio Haaglanden (Monitorcode) (#1)
      GRIP 1       1500412    BRW Pijnacker - Nootdorp (Kazernealarm) (#2)
         GRIP 1       1500520    BRW Pijnacker - Nootdorp (Lichtkrant) (#3)
GRIP 1   1500010  BRW Veiligheidsregio Haaglanden (Monitorcode) (#4)

I've numbered the lines at the end, not to break the regex :)

#1 Starts with several spaces, GRIP and again spaces => does not work
#2 Starts with several tabs, GRIP and again tabs => DOES work!
#3 Starts with several spaces, GRIP and again tabs => does not work
#4 Starts with no tabs/spaces => DOES work!

It seems as if the spaces are somehow different in this preview. Any suggestions?

HunterP

Quote from: HunterP on February 16, 2011, 08:54:11 PM
$message['body'] = strtr($message['body'], array('&nbsp;' => ' ', '<br />' => "\n"));

$message = strtr($message, array('&nbsp;' => ' ', '&#160;' => ' ', '<br />' => "\n"));

Seems to do the trick.....

Arantor

Ahhhh, yeah, it applies a conversion of double space to something else.
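The normalise, transform, restore sequence from the posts above can be wrapped in one small helper, sketched here. The function name with_plain_whitespace and its callback parameter are hypothetical (nothing in SMF is called that). The sketch handles both &nbsp; and &#160;, as in the fix above, and, like the original snippet, it only puts the line breaks back, not the non-breaking spaces.

```php
<?php
// Hypothetical wrapper (the name is illustrative, not SMF API).
function with_plain_whitespace($body, $transform)
{
    // Undo the display encoding: both non-breaking space entities and <br />.
    $plain = strtr($body, array('&nbsp;' => ' ', '&#160;' => ' ', '<br />' => "\n"));

    // Run the custom parsing on plain spaces and newlines.
    $plain = call_user_func($transform, $plain);

    // Only line breaks are restored; runs of spaces stay plain and collapse in HTML.
    return strtr($plain, array("\n" => '<br />'));
}
```

Usage would look like with_plain_whitespace($message, 'mycode'), assuming mycode() takes the text and returns the modified text. Since the spaces are not put back, double spaces in ordinary text will be rendered as single spaces, which is the formatting loss mentioned earlier in the thread.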

HunterP

Another question: once the HTML has been parsed the way I want, I'd like to do an extra check:

01:53:41 05-03-11  GROUP-1  13-188 A1  MIDDENBEEMSTER   INSULINDEWEG  0  RIT:32
                   0120188  CPA Amsterdam (Ambu 188 - VZA Zaandam)
                   0120998  CPA Amsterdam (Monitorcode)

Gets converted into :

<table class="pdw"><tr><td class="time">01:53:41 05-03-11</td><td class="type">GROUP-1</td><td class="message">13-188 A1 &nbsp;MIDDENBEEMSTER &nbsp; INSULINDEWEG &nbsp;0 &nbsp;RIT:32</td></tr><tr><td class="extra"></td><td class="capcode">0120188</td><td class="label">CPA Amsterdam (Ambu 188 - VZA Zaandam)</td></tr><tr><td class="extra"></td><td class="capcode">0120998</td><td class="label">CPA Amsterdam (Monitorcode)</td></tr></table>

There can be several tables in one message, and a table can (theoretically) contain an unlimited number of lines. Only when one of the lines in a table contains '0120188' would I like to add an ID next to that table's CLASS:

$message = preg_replace('~(<table class="pdw">)(((?!/table).)*0120188<)~', '<table class="pdw" id="foo">$2', $message);

Something seems to be very wrong with this regex: one forum generates a 500 error, and another forum displays an empty page when trying to view several topics. However, when '0120188' appears, the correct ID is added (if the page is visible).

Maybe there is a different (better) way for my regex?

EDIT: What I'm trying to do is to find '0120188', and modify the first <table> tag before it.
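One possible explanation, offered as a guess rather than a diagnosis: the group ((?!/table).)* advances one character at a time with a lookahead at every step, which on long pages can exhaust PCRE's backtracking or recursion limits. When that happens preg_replace() returns NULL, which would fit the empty page. A regex-free alternative is sketched below; the function name tag_tables_with_capcode is hypothetical, and it assumes the tables are written exactly as in the generated HTML shown above.

```php
<?php
// Hypothetical helper: adds an id to every pdw table that contains the given capcode.
// Uses plain string functions only, so no PCRE limits are involved.
function tag_tables_with_capcode($html, $capcode, $id)
{
    $open = '<table class="pdw">';
    $parts = explode($open, $html);

    // Everything before the first table is copied as-is.
    $out = array_shift($parts);

    foreach ($parts as $part) {
        // Look only at this table's own contents, up to its closing tag.
        $end = strpos($part, '</table>');
        $table = $end === false ? $part : substr($part, 0, $end);
        $hit = strpos($table, '<td class="capcode">' . $capcode . '</td>') !== false;

        $out .= ($hit ? '<table class="pdw" id="' . $id . '">' : $open) . $part;
    }

    return $out;
}
```

It would be called as $message = tag_tables_with_capcode($message, '0120188', 'foo');. Note that an id should be unique within a page, so if several tables in one topic can contain the same capcode, an extra class might be the safer choice.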
