Word Censor List

Started by dougiefresh, November 05, 2013, 03:00:43 AM


dougiefresh


WORD CENSOR LIST v1.5
By Dougiefresh -> Link to Mod



Introduction
So, you want to run a family-friendly community without any vulgar words appearing on your site. The easiest way to do that is to use SMF's word censor feature, but you have an empty list of words and don't want to spend an hour filling in every naughty word you know and some you don't.

Word Censor List will help you by adding a list of some commonly censored words and some uncommon ones to your forum.

Compatibility Notes
This mod was tested on SMF 2.0.5, but should work on earlier versions of SMF 2.0.x.  SMF 1.x is not and will not be supported.

Changelog
The changelog can be viewed at XPtsp.com.

License
Copyright (c) 2015 - 2018, Douglas Orend
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

dougiefresh

Updated to v1.1.  Upgrading from v1.0 to v1.1 is not necessary, as it does not replace the functionality provided, only fixes the settings installer.

TonyG

I have a list of censor words that I carry around from one family-oriented site to another. Interested in an update to the list you have in edit_db.php? Do you already have a place for this or some preferred mechanism for doing this?
Thanks!

Kindred

Quote from: dougiefresh on November 05, 2013, 03:00:43 AM
. The easiest way to prevent that is to use phpBB's word censor feature,

really? :)

dougiefresh

Always interested in submissions....  Please share!

Biology Forums

Always wanted a mod like this, thanks.

dougiefresh

Quote from: Kindred on December 22, 2014, 04:26:12 PM
Quote from: dougiefresh on November 05, 2013, 03:00:43 AM
. The easiest way to prevent that is to use phpBB's word censor feature,

really? :)
:o Whoops!!!  I meant that you should use SMF's word censor feature.....  Fixed that in the first post!  ::)  I guess I should admit that I copied the description from a phpBB mod and didn't pay that much attention....

TonyG

I just updated the list. Based on other entries, I added and modified a lot of words to include RegExp tests, but it doesn't look like any of those are working. I'm using the Advanced Censor mod, which uses the PHP function strstr, and Block Censor Words.

Has anyone here modified their filter to do regex tests with the censor list?

Thanks!

Arantor

Considering that the internals of the censor function already use regex, I wish you the *very* best of luck performing the rewrite required to make that work as intended.

TonyG

So am I to understand that this Word Censor List was invalid from the start?

I'll have to look at the regexp code because it doesn't look like it's working with the masks being used.

So which is it? Are we using the wrong kind of regex? Is the regex not working? Is there any documentation for the syntax supported by the current regex mechanism? If that's a preg_match, can we assume that if the word list in the database has a string that can be interpreted by preg_match, we'll get good censor matching?

And now that I'm thinking about this I'm thinking that the mods might be using strstr() while SMF might be using preg_match, which leaves text to get filtered in different ways along the chain of execution - that can just lead to confusion and embarrassment.

Let's not leave this unresolved - what SHOULD work there?

Thanks.

Arantor

I don't know what you understood from what I said, to be honest, but clearly there is some misunderstanding somewhere.

This adds them to the database in the way SMF's own interface does. This is then internally converted into a regex for processing purposes. It doesn't support full regex syntax for this reason. Hence my comment.

But multiple times I have seen comments... you clearly know best, of course. Best of luck to you.

TonyG

We do have a misunderstanding. I'm trying to understand how this stuff is working so that we can do better filtering.
I completely understand that this Word Censor List mod just inserts text strings into the database.
From there, what happens to each string?

The list already includes some strings with regexp. I just asked if that was valid or not.
You said "This is then internally converted into a regex for processing purposes. It doesn't support full regex syntax for this reason. "
OK, so what sort of conversion is done there? Knowing that will allow us to make better improvements to this list.

From the examples already in the list, it seemed to me that "b[4a@][!l][!l][0o][0o]n" should match balloon, b@l!00n, and b4l!0on. Is that not correct? If not then all I was saying is that a number of entries already in the list are bad and we need to change how this is approached.
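A quick way to check this outside the forum is the sketch below. It assumes whole-word matching is off and case-insensitive matching is on, and it relies on the preg_quote() call from Load.php that is quoted later in this topic. The strtr() step is left out because this entry contains no * or & characters. The variable names are only for illustration.

```php
<?php
// Assumption: censorWholeWord is off and censorIgnoreCase is on.
$entry = 'b[4a@][!l][!l][0o][0o]n';

// What the censor effectively runs: every bracket is escaped by preg_quote().
$asStored = '/' . preg_quote($entry, '/') . '/i';

// What the list entry was presumably meant to be: a raw regex.
$asIntended = '/' . $entry . '/i';

var_dump(preg_match($asStored, 'balloon'));    // int(0)
var_dump(preg_match($asStored, $entry));       // int(1)
var_dump(preg_match($asIntended, 'b@l!00n'));  // int(1)
```

So with the quoting in place, the entry only matches its own literal text, brackets included.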

TonyG

Coming back to this topic. Can anyone tell us exactly what Regex syntax is supported for words found in the censor list?

I see the Load.php code referred to by @Arantor:

if ($censor_vulgar == null)
{
    $censor_vulgar = explode("\n", $modSettings['censor_vulgar']);
    $censor_proper = explode("\n", $modSettings['censor_proper']);

    // Quote them for use in regular expressions.
    for ($i = 0, $n = count($censor_vulgar); $i < $n; $i++)
    {
        $censor_vulgar[$i] = strtr(preg_quote($censor_vulgar[$i], '/'), array('\\\\\\*' => '[*]', '\\*' => '[^\s]*?', '&' => '&amp;'));
        $censor_vulgar[$i] =
            (empty($modSettings['censorWholeWord']) ?
                '/' . $censor_vulgar[$i] . '/' :
                '/(?<=^|\W)' . $censor_vulgar[$i] . '(?=$|\W)/') .
            (empty($modSettings['censorIgnoreCase']) ?
                '' :
                'i') .
            ((empty($modSettings['global_character_set']) ?
                $txt['lang_character_set'] :
                $modSettings['global_character_set']) === 'UTF-8' ? 'u' : '');

        if (strpos($censor_vulgar[$i], '\'') !== false)
        {
            $censor_proper[count($censor_vulgar)] = $censor_proper[$i];
            $censor_vulgar[count($censor_vulgar)] = strtr($censor_vulgar[$i], array('\'' => '&#039;'));
        }
    }
}

// Censoring isn't so very complicated :P.
$text = preg_replace($censor_vulgar, $censor_proper, $text);


I broke up that meaty assignment statement just for readability. I understand that's adjusting each word element to account for server-specific settings. But can anyone explain exactly what the reformatting code is doing which might preclude using Regex syntax in elements of $modSettings['censor_vulgar'] ?
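To see what each list entry turns into, the conversion can be pulled out into a standalone function and run from the command line. This is only a sketch: build_censor_pattern() and its three boolean parameters are made-up names standing in for censorWholeWord, censorIgnoreCase and the UTF-8 character set check, and the apostrophe handling is left out.

```php
<?php
// Hypothetical stand-in for the per-word conversion in Load.php.
function build_censor_pattern(string $word, bool $wholeWord, bool $ignoreCase, bool $utf8): string
{
    // preg_quote() escapes every regex metacharacter. strtr() then turns an
    // escaped "*" back into a wildcard and an escaped "\*" into a literal star.
    $quoted = strtr(preg_quote($word, '/'), array(
        '\\\\\\*' => '[*]',
        '\\*' => '[^\s]*?',
        '&' => '&amp;',
    ));

    $body = $wholeWord ? '(?<=^|\W)' . $quoted . '(?=$|\W)' : $quoted;

    return '/' . $body . '/' . ($ignoreCase ? 'i' : '') . ($utf8 ? 'u' : '');
}

echo build_censor_pattern('d*n', false, false, false), "\n";   // /d[^\s]*?n/
echo build_censor_pattern('a\\*b', false, true, false), "\n";  // /a[*]b/i
echo build_censor_pattern('(x|y)+', true, true, true), "\n";   // /(?<=^|\W)\(x\|y\)\+(?=$|\W)/iu
```

Going by this output, the only regex-like syntax that survives in a list entry is *, which matches any run of non-whitespace characters. Brackets, alternation, and quantifiers all come out escaped.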

Note: I just looked at the Advanced Censor mod. It does not process the $modSettings['censor_vulgar'] list as regex the way the code above does. It looks for the literal text:
if (strstr($pMessageBody, $vCensorVulgar[$i])) return true;

However, I believe that code could easily be retrofitted with the code from Load.php.

Thanks.

dougiefresh

Hmmmm....  I don't have a copy of version 1.0 of this mod, so I'm gonna have to figure something out regarding the broken censor list....

TonyG

I don't understand @dougiefresh.

To get the Regexp in your word list to work, I think one just needs to understand what's being done in that core code to each element before it does the final preg_replace. It might be helpful to write that data to a file to see what's been done to it. Then we can revise each element to confirm.

As to the Advanced Censor mod, it returns true before posting if the text contains a censored word. So all that's needed there is the same code from Load.php, and a final check:
if (preg_replace($censor_vulgar, $censor_proper, $text) !== $text) return true;
Someone should advise him that his mod is invalid if the wordlist contains Regexp. I guess I'll do this after we get through this discussion.
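Put together, that check could look roughly like the sketch below. contains_censored_word() is a hypothetical name, not a function from the Advanced Censor mod or from SMF, and whole-word, case-insensitive matching is assumed rather than read from $modSettings.

```php
<?php
// Hypothetical helper: true when any censor pattern would alter $text.
function contains_censored_word(string $text, array $vulgar, array $proper): bool
{
    $patterns = array();
    foreach ($vulgar as $word) {
        // Same quoting step as the Load.php code quoted earlier in this topic.
        $quoted = strtr(preg_quote($word, '/'), array(
            '\\\\\\*' => '[*]',
            '\\*' => '[^\s]*?',
            '&' => '&amp;',
        ));
        // Assumption: whole-word and case-insensitive matching are both on.
        $patterns[] = '/(?<=^|\W)' . $quoted . '(?=$|\W)/i';
    }

    // If the censor changes anything, the text contained a listed word.
    return preg_replace($patterns, $proper, $text) !== $text;
}

var_dump(contains_censored_word('what the heck', array('heck'), array('*'))); // bool(true)
var_dump(contains_censored_word('checkers', array('heck'), array('*')));      // bool(false)
var_dump(contains_censored_word('darn it', array('d*n'), array('*')));        // bool(true)
```

Unlike the strstr() test, this treats * in a list entry as a wildcard, so both code paths would agree on what counts as a censored word.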

HTH

dougiefresh

Uploaded v1.2 - January 12th, 2015
o Removed most wildcards from the word censor list.
o Corrected link to the mod in the descriptions.

TonyG

So is the answer to the ongoing question that regex is simply not supported at all for censored words?
If so, then removing the wildcards from the list in this mod is the right solution for this mod.

I think the better long-term solution however is to find out what regexp is possible in the code from Load.php, and then get words in the list to conform within the constraints.

metallicgloss

I installed this package and it is now turning all 'hello' into *o and it is REALLY ANNOYING.
I edited the file in the pack but nothing has changed. I re-installed my forum and re-added a couple of packages, with the exception of this one. It is still doing it, so it must be doing something with the database. Where can I remove it so it no longer counts "hell" as a swear word?

dougiefresh

@metallicgloss: Go into the Admin panel, under Forum => Posts and Topics => Censored Words.  Put a check next to the option labelled "Check only whole words".  That should solve the problem....
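The difference that option makes can be seen in a small sketch, using the two pattern shapes from the Load.php code quoted earlier in this topic. Case-insensitive matching is assumed, and the variable names are only for illustration.

```php
<?php
$word = preg_quote('hell', '/');

// "Check only whole words" off: the word matches anywhere, even inside "hello".
$substring = '/' . $word . '/i';

// "Check only whole words" on: the match must be bounded by non-word
// characters or the start/end of the text.
$wholeWord = '/(?<=^|\W)' . $word . '(?=$|\W)/i';

echo preg_replace($substring, '*', 'hello world'), "\n";    // *o world
echo preg_replace($wholeWord, '*', 'hello world'), "\n";    // hello world
echo preg_replace($wholeWord, '*', 'what the hell'), "\n";  // what the *
```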
