Auto-moderation. Bayesian statistical filtering of posts!

Started by Col, November 26, 2005, 10:51:33 AM


Col

OK, I accept that this might be a big feature or mod to implement, and I am unsure of what extra load such a scheme might produce, but since the subject of some kind of approval area for posts gets raised pretty often, I thought this might be a way ahead.

The little reading around I have done threw up a kind of Bayesian filter that had three mail types (looking at e-mail Bayesian filters): spam, non-spam, and unsure; specifically, this was SpamBayes e-mail filtering. I thought that if such an implementation could be created for SMF posts, that would be a great feature!

Having the third option of 'unsure' seems to improve performance, and would make sense for an environment such as a forum. Whitelisting of certain member and post groups would ensure that most members' posts are never affected. If a post is wrongly tagged as spam (offensive), then you will usually hear from the poster, the problem can be corrected, and the intelligent filter will learn. Obviously, posts that you would prefer filtered will occasionally get through; they can be added to the 'offensive' folder, and again the filter learns. Normally, the only folder that a moderator need look at will be the 'unsure' posts, and this should be small enough in number to make it a snap to use.
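A minimal sketch of the three-way split described above, assuming some filter has already produced a spam probability between 0 and 1 for the post. The names `Post`, `triage`, `HAM_CUTOFF` and `SPAM_CUTOFF`, and the cutoff values themselves, are illustrative assumptions and not part of SMF or SpamBayes.

```python
from dataclasses import dataclass

# Hypothetical cutoffs; a real filter would tune these on its own posts.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


@dataclass
class Post:
    author_group: str
    body: str


def triage(post, spam_score, whitelisted_groups):
    """Return 'approved', 'unsure' or 'offensive' for a post.

    spam_score is a probability in [0, 1] supplied by whatever filter is used.
    """
    if post.author_group in whitelisted_groups:
        return "approved"      # whitelisted groups are never filtered
    if spam_score >= SPAM_CUTOFF:
        return "offensive"     # confident enough to hide without review
    if spam_score <= HAM_CUTOFF:
        return "approved"      # confident enough to publish
    return "unsure"            # only these need a moderator's eye
```

With cutoffs like these, the width of the gap between the two values controls how many posts land in the 'unsure' folder: a wider gap means fewer wrong automatic decisions but more manual checks.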

I wonder if there is some way of 'jerry-rigging' an off-the-shelf Bayesian e-mail filter to use the 'posts by e-mail' mod being developed, for basic testing purposes?

From what I've read, the basic theory is not complicated, but obviously there will be complications with integration into forum software. I'd love to hear if such a scheme is indeed possible, even if not probable in the near future.

Thanks.

Grudge

I do quite like the idea, although it is maybe a little too much to ask :)

It's actually one step further than my idea of auto moderation which is to look at people reporting a topic. If a topic gets reported a lot then remove it. You could take that idea a step further by looking at who reports topics and what percentage of their reporting results in a message being deleted, and build up a trust factor on users etc - and use that to determine how likely to be worth deleting a message is.
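A rough sketch of the trust-factor idea, assuming trust is simply the smoothed fraction of a member's past reports that ended in a deletion. `ReportTracker`, its methods and the threshold value are hypothetical names chosen for illustration, not anything that exists in SMF.

```python
from collections import defaultdict


class ReportTracker:
    """Weights each report by how often that reporter has been right before."""

    def __init__(self, removal_threshold=1.5):
        self.upheld = defaultdict(int)   # reports that led to a deletion
        self.total = defaultdict(int)    # all reports filed by the member
        self.removal_threshold = removal_threshold

    def record_outcome(self, reporter, was_deleted):
        self.total[reporter] += 1
        if was_deleted:
            self.upheld[reporter] += 1

    def trust(self, reporter):
        # Add-one smoothing: a member with no history starts at 0.5.
        return (self.upheld[reporter] + 1) / (self.total[reporter] + 2)

    def should_remove(self, reporters):
        # Count each reporter once, however many times they clicked "report".
        score = sum(self.trust(r) for r in set(reporters))
        return score >= self.removal_threshold
```

Under this scheme a couple of reports from members with a good track record can outweigh many reports from members whose reports are usually rejected.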
I'm only a half geek really...

Col

Quote from: Grudge on November 26, 2005, 11:07:49 AM
I do quite like the idea, although it is maybe a little too much to ask :)

You know you want to do this Grudge! ;)

Quote
It's actually one step further than my idea of auto moderation which is to look at people reporting a topic. If a topic gets reported a lot then remove it. You could take that idea a step further by looking at who reports topics and what percentage of their reporting results in a message being deleted, and build up a trust factor on users etc - and use that to determine how likely to be worth deleting a message is.

There is no reason why both techniques cannot be employed! ;)

Maybe the Bayesian technique would not be that difficult? Posts would be checked against the filter, but only those from groups that are not whitelisted. If I understand the technique correctly, an index is built up of words used in your forum, good and bad. Different weights are given to different words, titles, punctuation, the use of caps, number of links, tags, swearwords, etc.; and to good words too. The weighting given to the words is automatic, developed from the good and the bad posts, and only the highest scoring words are used to decide the post's fate. The relatively easy bit, I would think, is the interface. I would imagine that it would be simpler than for the other post moderation ideas proposed. It would consist of a whitelist, bad posts, unsure posts, and who has access to the auto-moderator. Good posts are on the forum boards, and would have a 'bad post' button.
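A minimal sketch of that word-weighting idea, assuming a plain naive-Bayes style combination over the most decisive words. It is not SpamBayes' actual algorithm, it only looks at words (not caps, links or punctuation), and `TinyBayes`, `top_n` and the tokenising rule are assumptions made for illustration.

```python
import math
import re
from collections import Counter


class TinyBayes:
    def __init__(self, top_n=15):
        self.good = Counter()   # word -> number of good posts containing it
        self.bad = Counter()    # word -> number of bad posts containing it
        self.n_good = 0
        self.n_bad = 0
        self.top_n = top_n

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per post.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_bad):
        if is_bad:
            self.bad.update(self.tokens(text))
            self.n_bad += 1
        else:
            self.good.update(self.tokens(text))
            self.n_good += 1

    def word_prob(self, word):
        # Smoothed estimate of how strongly a word points to a bad post.
        b = (self.bad[word] + 1) / (self.n_bad + 2)
        g = (self.good[word] + 1) / (self.n_good + 2)
        return b / (b + g)

    def score(self, text):
        probs = [self.word_prob(w) for w in self.tokens(text)]
        # Keep only the words furthest from the neutral value 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        probs = probs[: self.top_n]
        log_bad = sum(math.log(p) for p in probs)
        log_good = sum(math.log(1 - p) for p in probs)
        return 1 / (1 + math.exp(log_good - log_bad))
```

The weights are never set by hand: each call to `train` shifts the per-word counts, and `score` returns a value near 1 for posts that resemble the bad examples and near 0 for posts that resemble the good ones.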

Yeah, I'm sure it's a thousand times more complicated than I'm trying to make it sound. - It would though, be very, very cool! 8)

Col

Hi,

I was wondering if this idea might be more practical now that there is a search index in place? This is a vital component for the Bayesian approach, and it's just lying there waiting to be used. I would guess that the index would have to carry a value with each word that indicates its offensiveness weight. I know other things need to be considered too, such as links, whitespace, etc.

As I said before, whitelisted groups and members would not be checked against the list, cutting load. False-negatives can, of course, be retrospectively added, and false-positives removed.
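One way the retrospective corrections could work is sketched below, assuming the filter keeps per-word counts for good and bad posts and remembers which label each post was trained under. `TrainingStore` and its methods are hypothetical and say nothing about how the SMF search index is actually stored.

```python
from collections import Counter


class TrainingStore:
    """Per-word counts that can be corrected when a post was mislabelled."""

    def __init__(self):
        self.counts = {"good": Counter(), "bad": Counter()}
        self.posts = {}  # post_id -> (label, words it contributed)

    def learn(self, post_id, text, label):
        if post_id in self.posts:
            self.forget(post_id)       # undo the earlier, wrong label first
        words = set(text.lower().split())
        self.counts[label].update(words)
        self.posts[post_id] = (label, words)

    def forget(self, post_id):
        label, words = self.posts.pop(post_id)
        self.counts[label].subtract(words)
        for w in words:                # drop words no post supports any more
            if self.counts[label][w] <= 0:
                del self.counts[label][w]

    def relabel(self, post_id, text, new_label):
        # A false positive becomes "good"; a false negative becomes "bad".
        self.learn(post_id, text, new_label)
```

Removing the old contribution before adding the new one keeps a corrected post from being counted on both sides.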

Grudge, does this idea seem any more feasible now? I know it would still be a big job, but would be unique to this forum system, I'm sure.

Dannii

Someone is considering making a mod for Akismet, would it be somewhat similar?
"Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise."

Daniel15

Quote
Someone is considering making a mod for Akismet, would it be somewhat similar?
Hey, they stole my idea! :P I was going to get around to doing this, eventually (however, I'm currently getting around to it like 3D Realms is getting around to completing Duke Nukem Forever :P). If no-one attempts this, then I'll try to do it (it may not be for a few weeks though, I have exams coming up soon)

Using Akismet is one of the best ways to block spam... It works better than almost every other method :)
Daniel15, former Customisation team member, resigned due to lack of time. I still love everyone here :D.
Go to smfshop.com for SMFshop support, do NOT email or PM me!

Col

Hi eldʌkaː,

What's happened to your handle? Is that deliberate, or a problem with the charset?

Anyway, yes, it appears to use Bayesian filtering. It doesn't carry the "unsure" tag, which I think would be a great improvement - a moderator would just need to check the posts that ended up there, and not the posts that are obviously spam.

I'm unsure of what happens if the connection to Akismet is lost - are the posts rejected, approved, or held up? Also, I assume this will only work on public forums?! However, this does look as though it would be a useful tool, and would be a very welcome modification to SMF.

Dannii

My nick is the phonetic representation of my nick.. it will work if you have proper fonts ;)

Akismet should work if there are interruptions. And for the few it gets wrong, there would be a "this is spam" button for those that were missed, and the ones that it thinks are spam would be sent to a hidden board that you can check as well.
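The thread does not settle what should happen when the remote service cannot be reached, so the following is only a sketch of one possible policy. `route_post` and `check_remote` are hypothetical: `check_remote` stands in for whatever call the mod would make and is not Akismet's real API.

```python
def route_post(post_text, check_remote, on_failure="hold"):
    """Decide where a post goes: 'board', 'hidden_board' or 'held'.

    check_remote is any callable that returns True for spam and raises
    ConnectionError or TimeoutError when the service is unreachable.
    """
    try:
        is_spam = check_remote(post_text)
    except (ConnectionError, TimeoutError):
        # No verdict available: either let the post through or hold it
        # for a moderator, depending on the forum's chosen policy.
        return "board" if on_failure == "approve" else "held"
    return "hidden_board" if is_spam else "board"
```

Holding posts on failure is the cautious choice, at the cost of delaying legitimate posts during an outage; approving on failure keeps the forum moving but lets spam through until the service returns.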

Col

Hi,

Yes, I just had a look at your site, and saw that you are studying Linguistics. I copied/pasted your name into Word, and now know how to pronounce it - I have just started a linguistics degree!  ;)  I had a phonetics test last week, and it went far better than I ever dreamed.

I understand that false positives and negatives can be corrected.

I have a feeling latency may become a problem, and I'm unsure if the filter uses the words/posts supplied by the individual blog/site, or if it is a pool of words from all submissions. I would imagine that a Bayesian filter will work far more reliably with local data, and not a pool - the whole idea of the Bayesian approach, after all.

Dannii

Just started a linguistics degree? Awesome! I'm doing Engineering now, but I'm going to change to a dual degree next year with Linguistics too. It's fun!

I'm not sure how a Bayesian filter works, but I think Akismet would work well however it does it; it wouldn't be popular if it didn't.
On their site they describe what information you can send. Some is optional.

Col

Well, I am studying a joint honours degree, with Counselling, and some think that a bit odd. Engineering and Linguistics is a combination that I am surprised at!

Are you saying that you will be studying two FULL degrees, or half-and-half?

Dannii

Two full degrees, but it's not the full time of each, because the electives of each are taken by the other degree. It would be 5.5 years if I didn't fail anything, so I think I'll be there for closer to 7 (because I unfortunately fail quite a few).
Yeah I don't know anyone else doing Eng and Linguistics, but one of my friends is doing Eng and Japanese, which is close!

Daniel15

Anyways, back on topic :P. I've started the Akismet MOD, please see http://www.simplemachines.org/community/index.php?topic=96034.msg811898#msg811898 for details on what I've done so far :D

Quote
and the ones that it thinks are spam would be sent to a hidden board that you can check as well.
That's what I should do :P
