Hashtag parsing

Started by L24, July 16, 2013, 05:51:18 AM


L24

Hello. I'm writing a modification for my SMF-based message board, and I've run into a serious dilemma.
The mod simply takes all hashtags (words prefixed by a '#' sign) in messages and turns them into URLs, for example:

message:
Extra prizes in a #fishing contest

transforms to:
Extra prizes in a <a href="scripturl?action=tag;tag=fishing">#fishing</a> contest

The main question is: when and where should I perform the transformation? When the message is ready to display (Display.template)? Or somewhere else (e.g. where the BBC is parsed)?

Another problem is that I plan to store the hashtags in a database. What is a good moment to grab them from the message body? The createPost() function in Subs-Post?

I really care about security issues, so any clues are helpful. Thanks in advance for your answers. Greets :).

MrPhil

After parse_bbc(), all BBCode should have been turned into HTML. Some time around that point, the Word Censor is thrown at it. That would probably be a good point to process hashtags (a new function, not a modified parse_bbc or word_censor). Try to make sure that the term is not already within an <a> tag, and do your transformation. If you do it cleanly, that might be a mod of general interest to others.

You would want to extract/store hashtags around the time that a post is created/edited (saved into the database), not when it's being displayed (too much drag on performance to do it at each display). Is this information to be used in the hashtag expansion discussed before, or is it totally separate (e.g., just doing statistics)?

Regarding security, on hashtag expansion you might check that it's just simple alphanumerics and not HTML code such as an iframe or script tag (the Admin could probably do that, but that could mess up the hashtag link anyway). On putting a hashtag into the database, or otherwise using it in SQL, you want to sanitize it and make sure no SQL injection can be done with it. Again, checking that a hashtag is a simple alphanumeric string ought to be sufficient.
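The two checks described above (accept only plain alphanumerics, and never build SQL by pasting the tag into the query string) can be sketched as follows. This is a language-neutral illustration written in Python, not SMF code: the `message_tags` table, its columns, the 16-character limit, and the helper names are all assumptions, and `sqlite3` merely stands in for whatever database layer the mod would really use.

```python
import re
import sqlite3

# Assumption: a tag is 1-16 ASCII letters or digits (the limit is illustrative).
TAG_RE = re.compile(r"[A-Za-z0-9]{1,16}")


def is_valid_tag(tag):
    """Return True only if the whole string is a plain alphanumeric tag."""
    return TAG_RE.fullmatch(tag) is not None


def store_tag(conn, message_id, tag):
    """Store a tag using bound parameters; refuse anything that fails validation."""
    if not is_valid_tag(tag):
        return False
    # The values are passed separately from the SQL text, so they are never
    # interpreted as SQL even if validation were loosened later.
    conn.execute(
        "INSERT INTO message_tags (id_msg, tag) VALUES (?, ?)",
        (message_id, tag.lower()),
    )
    return True


# Hypothetical table, created in memory just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message_tags (id_msg INTEGER, tag TEXT)")
store_tag(conn, 1, "fishing")                         # accepted
store_tag(conn, 1, "x'; DROP TABLE message_tags;--")  # rejected by validation
```

Validation and bound parameters overlap on purpose: the whitelist keeps HTML out of the rendered link, while the bound parameters protect the query independently of the whitelist.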

L24

Quote from: MrPhil on July 16, 2013, 09:48:23 AM
You would want to extract/store hashtags around the time that a post is created/edited (saved into the database), not when it's being displayed (too much drag on performance to do it at each display). Is this information to be used in the hashtag expansion discussed before, or is it totally separate (e.g., just doing statistics)?

It could be useful when a hashtag page is displayed. Such a page would list all topics and messages that carry the given tag. Storing hashtags in the database would probably save the time of searching through all message bodies (e.g. when only a few messages contain the displayed tag).

Hashtags will be matched by a regular expression (that should improve security, right?). Max 16-32 characters, alphanumeric only. Is it very hard to avoid matching a wrong tag, as you said? For example, I have to recognize this situation after URL tag parsing:
<a href=?>Some #test text</a>

Even worse is trying not to parse a tag when it sits inside an IMG BBCode, or inside other BBCodes.

Tags should be omitted when they appear inside HTML markup. That seems hard to implement. Any tips for that? :)

Thanks for your response!

MrPhil

To make sure you're not in the middle of a link (per your example), you'd probably have to scan upstream to make sure you hit a </a> before you hit a <a ..., or downstream to hit <a ... before you hit </a>. I don't know if there would be any advantage to doing it before parse_bbc (i.e., still as BBCode). Admins can put in direct HTML, so you'd probably still have to scan the HTML (to be thorough). I almost wonder if it would be advantageous to feed the text to an XML or HTML parser that turns the text into a tag tree. I shudder to think how expensive that would be in server processing time.

Besides <a> tags, could any other tag that includes a URL (<img>, et al.) have a # in it? In an <img> tag, alt= and title= attributes could include #. Almost any tag can have title=. Getting ugly fast, isn't it?

Are there any other tags that <a> could not be nested within (that <a> cannot be a child of -- directly or indirectly)? You might check the Word Censor code to see if there's anything in there to prevent the insertion of <a> when the child of certain tags -- there probably isn't. That's a good general problem to detect, as many people use Word Censor, glossary tags, or similar constructs to add links within text.

I don't know if "HTML markups" in general are a good reason to omit hashtag expansions. It would be harmless to expand <b>some #term</b>. It's just certain tags that you don't want to stick an <a> inside of.
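One cheap way to do the scan described above in a single pass is to split the HTML into tag tokens and text tokens, keep a counter of open <a> elements, and expand hashtags only in text that is outside any link. Attribute values such as alt= and title= are never touched, because they live inside the tag tokens. The sketch below is Python rather than SMF's PHP, and `link_hashtags`, `script_url`, and the 16-character limit are hypothetical. It assumes a literal `<` or `>` never appears unescaped in text or attribute values, which is a simplification compared with a real HTML parser.

```python
import re

# Splitting on this pattern yields text at even indices and tags at odd indices.
TAG_SPLIT_RE = re.compile(r"(<[^>]*>)")
ANCHOR_OPEN_RE = re.compile(r"<a[\s>]", re.IGNORECASE)
ANCHOR_CLOSE_RE = re.compile(r"</a\s*>", re.IGNORECASE)
# '#' must not follow a letter, digit, or '&' (so '&#039;' is left alone),
# and the tag must not run on into further alphanumerics.
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9&])#([A-Za-z0-9]{1,16})(?![A-Za-z0-9])")


def link_hashtags(html, script_url):
    """Turn #tags into links, skipping text inside <a> and anything inside tags."""

    def to_link(match):
        tag = match.group(1)
        return f'<a href="{script_url}?action=tag;tag={tag}">#{tag}</a>'

    out = []
    anchor_depth = 0
    for index, token in enumerate(TAG_SPLIT_RE.split(html)):
        if index % 2 == 1:  # a tag token: copy it through, tracking <a> nesting
            if ANCHOR_OPEN_RE.match(token):
                anchor_depth += 1
            elif ANCHOR_CLOSE_RE.match(token):
                anchor_depth = max(0, anchor_depth - 1)
            out.append(token)
        elif anchor_depth == 0:  # plain text outside any link
            out.append(HASHTAG_RE.sub(to_link, token))
        else:  # plain text inside a link: leave untouched
            out.append(token)
    return "".join(out)
```

With this approach `<b>some #term</b>` is expanded, while `<a href=?>Some #test text</a>` and `<img alt="#x">` come back unchanged. The cost is one linear pass over the message, far less than building a full tag tree.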

emanuele

Since I'm playing with that too (not for SMF, even though the code looks similar, so we are not in competition :P), this is what I'm doing:
https://github.com/emanuele45/topic_tags/blob/hashtags/Tags.subs.php#L242

1) a first preg_replace_callback to "protect" all the links (replaced by a placeholder, hopefully unusual enough not to appear in a message...now that I think again about it, I could even use an md5 hash instead of the counter, will consider),
2) then a second preg_replace_callback to replace the hashtags,
3) and finally the restore of the links.
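The three steps above can be sketched roughly like this. It is not the code from the linked repository, just a Python illustration of the same protect/replace/restore idea. The placeholder format (a counter wrapped in NUL characters), the function name, and `script_url` are assumptions, and the protection pattern is widened here to cover every tag as well as whole links, so that attributes such as alt= are shielded too.

```python
import re

# Step 1 pattern: a whole <a>...</a> element, or failing that any single tag.
PROTECT_RE = re.compile(r"<a\b[^>]*>.*?</a>|<[^>]+>", re.IGNORECASE | re.DOTALL)
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9&])#([A-Za-z0-9]{1,16})(?![A-Za-z0-9])")
PLACEHOLDER_RE = re.compile(r"\x00LINK(\d+)\x00")


def link_hashtags_with_placeholders(html, script_url):
    # Assumption: NUL never appears in a legitimate message, so strip it
    # to guarantee a user cannot forge a placeholder.
    html = html.replace("\x00", "")
    protected = []

    def protect(match):
        protected.append(match.group(0))
        return f"\x00LINK{len(protected) - 1}\x00"

    def to_link(match):
        tag = match.group(1)
        return f'<a href="{script_url}?action=tag;tag={tag}">#{tag}</a>'

    def restore(match):
        return protected[int(match.group(1))]

    text = PROTECT_RE.sub(protect, html)      # 1) hide links and tags
    text = HASHTAG_RE.sub(to_link, text)      # 2) expand the remaining hashtags
    return PLACEHOLDER_RE.sub(restore, text)  # 3) put the hidden pieces back
```

Stripping the delimiter character from the input first is one way to address the worry about a placeholder appearing in a message; a hash-based placeholder would make collisions unlikely, while stripping makes them impossible for this particular format.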

TBH I committed this yesterday evening and I didn't test it much, just a quick try, seems to work.

About the when...I'm doing it after parse_bbc.

ETA: and this is the function that extracts the tags from the body.
It can be almost anywhere I think, just be sure it is run when you are sure the message is inserted (and don't forget about modified messages ;)).



L24

Thank you, fellas, for the advice ;)

@emanuele:
Your hashtag module seems really vast :). I mean to do something simpler. I don't need to create a list of available tags; users do that on their own. They need to be consistent if they want to keep the board clean and easy to navigate, so they have to use proper tags. When an abusive hashtag appears in my mod, a moderator can easily erase it from the message :D, and any connections with that tag will be lost. Your solution is big and keeps everything under control, so big respect for that :).

I'm planning to do this in a few steps:
1. Recognize valid tags (up to 16 characters, only digits and letters).
2. Store them with the message in the messages table (a separate string of recognized tags for each message).
3. Completely independent of the previous steps: render the hashtags that match the rules described above.
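Steps 1 and 2 could look something like the sketch below, written in Python for illustration rather than as SMF code. The function names, the comma-separated storage format, and the lower-casing of tags are assumptions, not anything the plan above prescribes.

```python
import re

# Step 1: up to 16 letters or digits, not glued to other alphanumerics or to '&'.
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9&])#([A-Za-z0-9]{1,16})(?![A-Za-z0-9])")


def extract_tags(body):
    """Return the unique, lower-cased tags in order of first appearance."""
    found = (match.group(1).lower() for match in HASHTAG_RE.finditer(body))
    return list(dict.fromkeys(found))


def serialize_tags(tags):
    """Step 2: one string per message; safe because tags cannot contain commas."""
    return ",".join(tags)


def has_tag(stored, tag):
    """Check a stored tag string for one tag without substring false positives."""
    return bool(stored) and tag.lower() in stored.split(",")


tags = extract_tags("Extra prizes in a #fishing contest, #Fishing and #bait!")
stored = serialize_tags(tags)
```

Two caveats apply if this runs on the raw message body. A BBCode parameter such as `[color=#ff0000]` also matches the pattern, so tag parameters would need to be skipped. And a single string column makes "all messages with tag X" a pattern search over every row, whereas a separate tag table would allow an indexed lookup.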

Thanks again and greetings :).

emanuele

I consider it basic... :P
Well, it's a mixture because it started as "normal tagging" (in the sense of "blog tagging" let's say, so that you add tags to a topic writing them into another text box), then I decided to introduce hashtags too.

BTW, the two functions I linked are just the ones you were asking about. That's what I did (obviously), and it doesn't mean it is correct or the best way; it's just a way. ;)

All the rest is just bells and whistles, nothing more: a way to show all the topics with a certain tag in the exact same way a board would be displayed, the tags cloud, etc. ;D



gorbi

Were you able to finish the job and write a working mod?

DarkTexas

