Automatically link certain words?

Started by MrMike, May 27, 2012, 12:40:51 PM


MrMike

Is there a mod that will automatically link certain words? I've looked through the mods and don't see anything that quite fits that description.

In short, I have a list of keywords and when they appear in a forum post, I'd like to automatically link them to a specified page. The censor feature can do this, but it interferes with the Anti-Spam Links mod.

So, essentially a second censor feature I can use to turn specific words into specific links. Is there a mod like that?

MrPhil

This has been asked many times (as you're probably aware), and the usual answer is to use word censor. That's not a good solution, even if you're not using Anti-Spam Links, because it will probably end up turning every occurrence of a given word into a link. Preferably, only the first on a page would be linked. There is a question about what text would be so treated -- just post body? signatures? topic subjects? page title?

Anyway, you could probably have a list of words and actions very similar to the censor function. I don't know why word censor and Anti-Spam Links don't play nicely together, so that problem would have to be looked for and addressed. When you read in the list at the top of the page, you would initialize an array of flags for all words in the list. Do the censor-type scan of all applicable text, replacing the found word (if it hasn't already been marked on this page) with whatever the replacement is. Mark the word as already done (for this page). It would be possible to do this globally (only link the first time in a session), but that would involve tracking the list of words seen in a new database table.
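For anyone who wants to see the flag-array idea in code, here is a minimal sketch of it, not a finished mod. The keyword list, the URLs, and the function name firstTimeReplace() are all made up for illustration; a real version would load the list from settings or the database.

```php
<?php
// Hypothetical keyword list: word => replacement HTML (names and URLs are illustrative).
$keywords = array(
    'custody' => '<a href="/custody.php">custody</a>',
    'Brillo'  => 'Brillo&trade;',
);

// One flag per word, initialised once at the top of the page.
$alreadyDone = array_fill_keys(array_keys($keywords), false);

// Replace only the first occurrence of each word across all texts on the page.
function firstTimeReplace($text, array $keywords, array &$alreadyDone)
{
    foreach ($keywords as $word => $replacement) {
        if ($alreadyDone[$word]) {
            continue; // already handled earlier on this page
        }
        $pattern = '/\b' . preg_quote($word, '/') . '\b/';
        $text = preg_replace($pattern, $replacement, $text, 1, $count);
        if ($count > 0) {
            $alreadyDone[$word] = true; // mark as done for the rest of the page
        }
    }
    return $text;
}

// Two posts on the same page share the same flag array.
$posts = array('I want custody.', 'More about custody and Brillo pads.');
foreach ($posts as $i => $post) {
    $posts[$i] = firstTimeReplace($post, $keywords, $alreadyDone);
}
```

Because the flags are passed by reference, the second post leaves "custody" alone but still marks "Brillo". One known gap in this sketch: a later keyword could match text inside a replacement inserted for an earlier keyword.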

Whoever writes this mod might make it more general purpose... not just links, but also trademark signs (tm or P or R) could be handled too (insert symbol, not link). Future actions could be added, such as affiliate linking (there's a mod for that). Maybe the mod should be called FirstTime?

MrMike

I'll write up a function to do something like this, but I'm not familiar with how to turn it into an official mod package. Perhaps you or someone else would be able to do that if I provided the code?



MrPhil

Well, I'm a virgin, too  :-[  Is there a good tutorial on turning code edits into real mods?

Arantor

http://www.simplemachines.org/community/index.php?topic=20319.0

Basically it's just a case of putting together some find/replaces in XML, and just looking through existing mods should cover everything you need to know.

MrMike

I have a working example of the word-linker code that I want to clean up and (possibly) improve a bit, but I should have something within the next day if anyone is interested.

My thought is to use it as an include file, with the include line inserted just before the post body is output. In SMF 2.0.2, in the Themes/default/Display.template.php file, this would be right around line 507, just before this bit of code:

echo

    $message['body'], '</div>


(Someone who knows something about making mod packages could probably suggest an alternative method if this way of doing it isn't the best way.)

In a nutshell, the var $message['body'] gets intercepted, run through the word linker, and the changed text is put back into the $message['body'] var, where it's then displayed.

Right now the word-linker links only the first occurrence of any given word, but this could be configured on a per-word basis through an option setting. That is, you could make it link the word 'dog' only once, link 3 instances of the word 'cat' , link 1 instance of the word 'house', and so on.

It also avoids linking words that are inside an existing link, so you won't get double-linked words or mangled links. :)
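The actual word-linker code isn't posted in this thread, so the following is only a rough sketch of one way to get the two behaviours described above: a per-word limit and skipping text that is already inside a link. The function name and its parameters are assumptions for illustration.

```php
<?php
// Link up to $limit occurrences of $word in $html, skipping existing <a>...</a>
// blocks and the insides of other tags. $url and $limit are illustrative parameters.
function linkWordOutsideAnchors($html, $word, $url, $limit = 1)
{
    // Split into plain text and delimiters (existing anchors or any other tag).
    // With one capture group, even indexes are text and odd indexes are delimiters.
    $parts = preg_split('~(<a\b[^>]*>.*?</a>|<[^>]+>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
    $link = '<a href="' . htmlspecialchars($url) . '">$1</a>';

    foreach ($parts as $i => $part) {
        if ($limit <= 0) {
            break; // quota for this word is used up
        }
        if ($i % 2 === 1) {
            continue; // captured tag or anchor: leave untouched
        }
        $parts[$i] = preg_replace($pattern, $link, $part, $limit, $count);
        $limit -= $count;
    }
    return implode('', $parts);
}

$in = 'See <a href="/x">the cat page</a>. My cat and your Cat.';
$once = linkWordOutsideAnchors($in, 'cat', '/search?q=cat', 1);
$twice = linkWordOutsideAnchors($in, 'cat', '/search?q=cat', 2);
```

The "cat" inside the existing link is never touched, and $1 keeps whatever capitalisation the poster used.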

Arantor

I wouldn't put it in the display template if at all possible. Aside from the fact you're doing processing and not pure display (which means it's semantically, if not technically, the wrong place), what you end up doing is applying the changes to all themes.

Instead, attach the code into Display.php, inside prepareDisplayContext() and adjust $output['body'] there before it's returned back to Display.template.php as $message.

MrMike

Quote from: Arantor on May 28, 2012, 08:47:22 AM
Aside from the fact you're doing processing and not pure display (which means it's semantically, if not technically, the wrong place), what you end up doing is applying the changes to all themes.

You're correct- it's both technically and semantically the wrong place. :)

Quote from: Arantor on May 28, 2012, 08:47:22 AM
Instead, attach the code into Display.php, inside prepareDisplayContext() and adjust $output['body'] there before it's returned back to Display.template.php as $message.

Yes, this is the proper spot to do the transform...thank you for this, Arantor. It looks like the best spot in Display.php is immediately after the line where the BBC interpreter is run.

Currently I've got it working so you can specify the number of times each word is linked, plus you can include an optional style or class tag so the altered links can be made to look a little different (if you want).

You can see a demo of it working on these pages:

Test Page:
http://deltabravo.net/forum/index.php?topic=39855.msg318677#msg318677

(and here  and here as well.)

I can post the code and instructions for the modifications required (minor), but I'd really like to work with someone who can turn this into a mod package.

There are only a few words linked (don't want to overdo it), but I'll be adding entries for the 50 or so most commonly used terms worth linking.

MrPhil

That looks pretty nice. You mentioned doing it after BBCode processing -- have you confirmed that you generate links only once per page (not once per post)? Are all links of the same format (in your case, fed to search), or could you have some words that generate, say, links to external sites, and some that just get a &trade; or &reg; mark added? Since each word in the list gets its corresponding output listed, there's probably no need to get fancy with variable substitutions and such:

'Brillo'  'after'  ''  '&trade;'

would just be entered as

'Brillo'  'Brillo&trade;'

unless you wanted to match both upper and lower case forms and use the original in the output:

'Brillo'  'Brillo&trade;'
'brillo'  'brillo&trade;'

What if someone miscapitalizes, or obFusCates in order to hide their usage? In that case, you might want to do something like

'/(Brillo)/i'   '$1&trade;'

using preg_replace(). Something to think about.
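A quick sketch of that pattern in use, with preg_replace() doing the substitution (preg_match() only tests for a match). The sample sentence is made up; the \b anchors are an addition so that longer words containing "brillo" are left alone.

```php
<?php
// Case-insensitive match; $1 re-inserts the text exactly as the poster typed it.
$text = 'I like brillo, BRILLO and obFusCated BrIlLo.';
$marked = preg_replace('/\b(Brillo)\b/i', '$1&trade;', $text);
echo $marked, "\n";
```

Every spelling gets the mark appended, and none of them is rewritten to the "official" capitalisation.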

Does this thing work only on BBCode-processed text? Should it be done on topic subject lines, or in the breadcrumb trail or page title? Do signatures get normal BBCode processing?

As a labor saving device, maybe for similar but not identical outputs:

'term1,term2,term3,term4'  'function definition to process'   '<a href="/$1/$2.php">$3</a>'

where $1 and $2 might be generated from the matched term, and $3 is the original term.
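One possible reading of that idea, as a sketch only: a comma-separated term list shares a single callback that builds the output from the matched term. The rule format, the /glossary/ path, and the function name applyGroupedRules() are all hypothetical.

```php
<?php
// Hypothetical rule: several terms share one callback that builds each link.
$rules = array(
    'custody,visitation,mediation' => function ($term) {
        $slug = strtolower($term);
        // "/glossary/" is an illustrative path, not a real site layout.
        return '<a href="/glossary/' . rawurlencode($slug) . '.php">' . $term . '</a>';
    },
);

function applyGroupedRules($text, array $rules)
{
    foreach ($rules as $termList => $build) {
        // Turn "a,b,c" into a safely quoted alternation: a|b|c
        $terms = array_map(function ($t) {
            return preg_quote(trim($t), '/');
        }, explode(',', $termList));
        $pattern = '/\b(' . implode('|', $terms) . ')\b/i';

        // The callback receives the term as typed, so its case is preserved.
        $text = preg_replace_callback($pattern, function ($m) use ($build) {
            return $build($m[1]);
        }, $text);
    }
    return $text;
}

$out = applyGroupedRules('Custody and mediation.', $rules);
```

All terms in a group are matched in a single pass, so text inserted for one term is not rescanned for the others in the same group.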

MrMike

Quote from: MrPhil on May 28, 2012, 03:45:59 PM
That looks pretty nice. You mentioned doing it after BBCode processing -- have you confirmed that you generate links only once per page (not once per post)?

Right now it's running per post; I'm working on making it generate words once per page. Trying to save a few queries if I can.


Quote from: MrPhil on May 28, 2012, 03:45:59 PM
Are all links of the same format (in your case, fed to search), or could you have some words that generate, say, links to external sites,

Right now they just go to various hand-selected searches or in some cases a specific page. The links can go anywhere you want, on a per word basis.


Quote from: MrPhil on May 28, 2012, 03:45:59 PM
What if someone miscapitalizes, or obFusCates in order to hide their usage?

It's doing a case-insensitive search, so I'm not too concerned about people obfuscating their words. For the target phrase 'court order' it'll catch 'court order', 'Court order', 'Court Order', 'COURT Order', and so on, and it'll replace the phrase with the same case as whatever it was before it got linked.


Quote from: MrPhil on May 28, 2012, 03:45:59 PM
Does this thing work only on BBCode-processed text? Should it be done on topic subject lines, or in the breadcrumb trail or page title? Do signatures get normal BBCode processing?

I think it'll work on any text or HTML as far as I know.

Arantor

Well, if you're doing it after bbcode is performed, it's being done once per post but applied to HTML.

If you really wanted to get clever and do it based on items not within existing links, you could attach it to the output buffer code and do it once per page. Similar sort of principle, of course.
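As a rough illustration of the output-buffer route, here is a plain-PHP sketch using ob_start() with a callback. It is not SMF's own buffer handling, the function name and URL are invented, and it deliberately leaves out the "not within existing links" check to stay short.

```php
<?php
// Hypothetical page-level filter: links the first occurrence of one word in the
// whole page. A real version would also need to skip existing links and tags.
function linkOncePerPage($buffer)
{
    return preg_replace('/\b(custody)\b/i', '<a href="/custody.php">$1</a>', $buffer, 1);
}

ob_start('linkOncePerPage');   // everything echoed from here on is captured
echo '<p>Custody questions</p>';
echo '<p>More on custody</p>';
ob_end_flush();                // the callback runs once on the full buffer
```

Since the callback sees the whole page at once, "once per page" falls out naturally, but so does the downside: menus, subjects, and signatures all pass through it too.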

Going down that particular road is complicated depending on whether you'd prefer to do it cleanly or more simply (the cleaner version uses hooks and never touches any core files but will be 2.0 only)

emanuele

Well, if you don't care about the 0.5% (or less) of people browsing with javascript disabled, javascript could be another option.



Arantor

I wouldn't necessarily be inclined to do it in JavaScript on the client side, traversing the DOM is not going to be particularly wonderful, especially as you'd have to do it through full traversal and noting whether at any point the hierarchy you're within is encapsulated by an a tag... doable but going to be messy.

MrMike

Quote from: Arantor on May 28, 2012, 07:31:30 PM
Well, if you're doing it after bbcode is performed, it's being done once per post but applied to HTML.

That's why I'm doing it there- I want the transforms to take place after all the other processing is done so I'm working with the "finished product", so to speak.

However, I don't want to run the query for each post. I'd prefer to save the keywords array and reuse them, but I'm having a little trouble doing this so far.


Quote from: Arantor on May 28, 2012, 07:31:30 PM
If you really wanted to get clever and do it based on items not within existing links, you could attach it to the output buffer code and do it once per page. Similar sort of principle, of course.

Yep, but I don't want to do the whole page, just the text in the posts.

Quote from: Arantor on May 28, 2012, 07:31:30 PM
...depending on whether you'd prefer to do it cleanly or more simply

I'm a big fan of simple. :)  I like clean and elegant, but sometimes it's more trouble than it's worth.

Arantor

In which case, declare your keywords array as static inside prepareDisplayContext, if it's empty, perform the query. (Since it's static, it'll be preserved between runs)
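A stripped-down sketch of the static-array approach, using stand-in functions rather than the real prepareDisplayContext() or a real database call. Both function names and the keyword data are placeholders.

```php
<?php
// Stand-in for the real database query; counts how often it is called.
function loadKeywordsFromDb()
{
    $GLOBALS['queryCount']++;
    return array('custody' => '/custody.php', 'court order' => '/orders.php');
}

// Stand-in for the per-post callback (prepareDisplayContext in the thread).
function preparePost($body)
{
    static $keywords = array();   // survives between calls within one request

    if (empty($keywords)) {
        $keywords = loadKeywordsFromDb();   // runs only for the first post
    }
    foreach ($keywords as $word => $url) {
        $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
        $body = preg_replace($pattern, '<a href="' . $url . '">$1</a>', $body, 1);
    }
    return $body;
}

$GLOBALS['queryCount'] = 0;
$first = preparePost('A custody case.');
$second = preparePost('The Court Order arrived.');
```

Both posts are processed but the query stand-in runs only once. If the keyword list can legitimately be empty, a null sentinel instead of empty() would avoid re-querying on every post.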

I get what you mean about running it per post, it was more a suggestion of a possible course of action rather than a recommended course, as it were.

MrMike

Actually I may have to run it for each post, as I remove keywords from the array once they've been used (so they only link x number of occurrences per post). That means posts further down the page don't get the keywords linked if they've already been found in an earlier post. But I can probably just copy the array to a throw-away array used once for each post. Let me see what I can do.



MrMike

Quote from: MrPhil on May 28, 2012, 03:45:59 PM
What if someone miscapitalizes, or obFusCates in order to hide their usage?

Correction: I'm using a case-insensitive search, but it doesn't appear to be finding words with capitalization or mixed-case. I'm not sure why it isn't, frankly.

Arantor

How exactly are you performing that search?

MrMike

Quote from: Arantor on May 29, 2012, 08:59:13 AM
How exactly are you performing that search?

This is the basic search line:

$post_text = preg_replace('/([\W|\s]+)('.preg_quote($word).')([\W|\s]+)/i','$1<a href="'.$rep_string.'"'.$pp_style.' title="Related pages for \''.$word.'\'">$2</a>$3',$post_text, $rep_count);

The "/i" should make it case-insensitive.

Arantor

It should, yes.

However, I think you have more issues than that looking at the code, which may well be the problem. Might I suggest:

$post_text = preg_replace('/\b(' . preg_quote($word, '/') . ')\b/i', '<a href="' . $rep_string . '"' . $pp_style . ' title="Related pages for \'' . $word . '\'">$1</a>', $post_text, $rep_count);

You don't specifically have to parenthesise the spacing, you just have to take it into account for matching purposes. \W or \s+ on either side is essentially a way of saying "word boundary", which is \b, so you only need to match the actual term itself.

I actually suspect part of the trouble you were having was less about matching capitalisation and more about not hitting certain combinations of characters. Also note that you don't need the | operator when you're inside [] since you're indicating a class, and in fact [\W|\s] means to look for \W or | or \s.
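A small comparison of the two patterns, on a made-up sentence. The original pattern requires at least one character on both sides of the word, so a keyword at the very start (or end) of the text can never match; \b is zero-width and has no such requirement. That is one plausible contributor to the "missing" capitalised words, since a post often opens with one.

```php
<?php
$text = 'Custody is decided first. Later, custody changes.';

// Original pattern: needs at least one non-word character on BOTH sides.
$old = '/([\W|\s]+)(custody)([\W|\s]+)/i';
// Suggested pattern: \b is zero-width, so the start and end of the text also count.
$new = '/\b(custody)\b/i';

$oldCount = preg_match_all($old, $text, $m1);
$newCount = preg_match_all($new, $text, $m2);
echo "old: $oldCount, new: $newCount\n";
```

The old pattern finds only the second occurrence here, while the \b version finds both.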
