Edit Censored Words List?

Started by shin111, January 30, 2010, 03:52:08 PM

Arantor

This is a flaw in PCRE you cannot get around.

The core reason is that SMF, and thus PHP, uses regular expressions based on \w and \W, which test whether a character can or cannot be part of a word.

To quote the PHP manual:

Quote: Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

You're using UTF-8, a form of Unicode (yes yes, I know, let's just keep it simple), and the \w approach used here doesn't treat non-ASCII letters as word characters.
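
To make the effect concrete, here is a minimal sketch. It uses Python's re module with the ASCII flag as a stand-in, not PHP's PCRE and not SMF's actual censor code, and the function name censor_ascii is made up for the example. The point is only that when \w is ASCII-only, \b never finds a word boundary around a Cyrillic word, so a whole-word pattern never matches it.

```python
import re

def censor_ascii(text, word, replacement="***"):
    # Hypothetical helper: whole-word replacement where \w and \b only
    # know about ASCII letters, digits and underscore (re.ASCII).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.ASCII)
    return pattern.sub(replacement, text)

# An ASCII word is found and replaced.
print(censor_ascii("this is spam here", "spam"))   # this is *** here

# The Cyrillic word is left alone: every letter counts as a non-word
# character, so there is no boundary between the space and the word.
print(censor_ascii("это спам здесь", "спам"))      # это спам здесь

# With ASCII-only \w, Cyrillic letters are not word characters at all.
print(re.findall(r"\w+", "спам", re.ASCII))        # []
```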

Options: one, rewrite it in a different, and significantly slower, form; or two, accept that it can't be dealt with, as it's a limitation of PHP, not SMF.
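
For illustration only, this is roughly what option one looks like in principle. It is again a Python sketch and not a patch for SMF; censor_unicode is a made-up name. It assumes a regex engine whose \w is Unicode-aware (Python's default for text patterns), which is exactly the kind of matching the quote above describes as not fast.

```python
import re

def censor_unicode(text, word, replacement="***"):
    # Hypothetical helper: whole-word replacement using Unicode-aware \w.
    # The lookarounds require that the match is not glued to another
    # word character on either side.
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)")
    return pattern.sub(replacement, text)

# The Cyrillic word is now recognised as a whole word.
print(censor_unicode("это спам здесь", "спам"))   # это *** здесь

# A longer word that merely starts with the censored word is untouched.
print(censor_unicode("спамер", "спам"))           # спамер
```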
Holder of controversial views, all of which my own.


ispasov

OK, could you help me rewrite it in some other form?
I hope that the users will still be able to see Cyrillic characters, right?

Arantor

I don't think you understand.

It will slow your forum down by a MASSIVE amount, no matter how cleverly written it is, because censortext is called in LOTS of places, and using regular expressions with PCRE means it's using compiled code, which will always be faster than non-compiled code.

I'm not prepared to rewrite you a non-native (interpreted, not compiled) version that is UTF-8 safe because it will KILL performance of your forum.


ispasov

OK, so let's leave it like this and wait for better times.  O:)
Thanks, I will keep following the posts on this forum for more information.

Arantor

It won't ever change. Let me explain the problem.

You have your body text, and you're comparing it, letter by letter by letter by letter, one at a time, against a list of about 15,000 different known characters that can make up a word, to find a match.

And you're doing this for every letter of every post, every letter of every subject, every letter of every signature - and other places. It can't ever be fast, which is why the PHP PCRE library (which is what's doing the work) doesn't support it.
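
As a rough sketch of what "one at a time" means, here is the per-character check written out by hand. It is in Python using the standard unicodedata module, not PCRE's internals, and is_word_char and word_flags are made-up names. Treating Unicode letters, numbers and underscore as word characters is an approximation chosen for the example.

```python
import unicodedata

def is_word_char(ch):
    # Hypothetical helper: look up the character's Unicode category and
    # treat letters (L*), numbers (N*) and "_" as word characters.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")

def word_flags(text):
    # One property lookup per character of the text, which is the work
    # that has to be repeated for every post, subject and signature.
    return [is_word_char(ch) for ch in text]

print(word_flags("да!"))   # [True, True, False]
```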


ispasov

