Alexander The Great

April 10, 2008

Spam Words

Filed under: Software,spam — alexanderthegreatest @ 8:24 am

Many of my readers know by now that I have an aversion to spam. Add to that a strange sense of humor and a little sadness at the way the internet has been colonized by corporations (even to the point that companies we all pay for internet access, Comcast for one, want to charge extra fees for using particular sites), and you have Alexander the Great.

I’ve been thinking about moving my blog from WordPress to a self-hosted ASP.NET server. There are several reasons to do this, but many of them come down to the fact that I pay the mortgage with my job as a .NET programmer. Version 3.5 is out in production, and I haven’t had much hands-on experience with LINQ yet. My own blog system will take some work, but it will also give me a chance to experiment with a lot of new concepts and make myself more valuable in the market. But I digress. The real problem, in my mind anyway, is creating a spam filter that’s anywhere near as good as Akismet, which comes built into WordPress. I publish any comment that makes it past the spam filter (I believe in free speech), so I need my own, and it has to be good.

I’m thinking the easiest way to do this is a rules-based approach. I can keep a list of prohibited words, give each of them a point value, then add up all the points for violations in a comment. So I’ve been thinking about which words to ban.
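To make the scoring idea concrete, here is a rough sketch of how it might look. It is written in Python only to keep it short; the same shape should port to C# without much trouble. The words, the point values, the threshold, and the names `spam_score` and `is_spam` are all placeholders invented for illustration, not a finished list or design.

```python
import re

# Hypothetical prohibited words and point values, for illustration only.
SPAM_POINTS = {
    "viagra": 5,
    "casino": 3,
    "pills": 2,
    "free": 1,
}

# Assumed cut-off: a comment scoring at or above this is treated as spam.
SPAM_THRESHOLD = 5


def spam_score(comment, points=SPAM_POINTS):
    """Add up the points for every occurrence of a prohibited word."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z0-9']+", comment.lower())
    return sum(points.get(word, 0) for word in words)


def is_spam(comment, threshold=SPAM_THRESHOLD):
    """Flag a comment once its total score reaches the threshold."""
    return spam_score(comment) >= threshold
```

Counting every occurrence means repeated words pile up points quickly. Whether that is desirable, and where the threshold should sit, would need tuning against real comments.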

That’s what I’ve got so far.  This isn’t a terribly well planned post, really more of a brainstorm.  I would love to invite readers to share their thoughts on the matter, and, ultimately, I plan to open source what I come up with.  I firmly believe that ASP.NET is a better platform in most ways than PHP, but, sadly, there’s far less open code available for it.  A problem to be solved!

Here are some links to different spam research, in case anybody else is interested in tackling the problem.  Even if it’s already been done before, I think this is a valuable learning experience.  Like a muscle, the brain works best when it works often.

5 Comments »

  1. Unless you’re dead set on developing your own spam filter, you could just use the Akismet API for .NET. It’s available here: http://akismet.com/development/

    Comment by VirtuosiMedia — April 10, 2008 @ 9:11 am | Reply

  2. Or, rather than simply filter words, you could implement a “naive bayesian filter”.
    Wikipedia has an explanation here: http://en.wikipedia.org/wiki/Naive_Bayes_classifier

    A quick Google search brought me to this: http://www.codeproject.com/KB/cs/BayesClassifier.aspx
    It’s a bit old, but surely you can adapt the code base to be integrated into yours.

    Comment by Tripy — April 10, 2008 @ 12:19 pm | Reply

  3. .

    Comment by Medieval — April 10, 2008 @ 2:13 pm | Reply

  4. I agree with VirtuosiMedia – I’ve seen several different scripts use the Akismet anti-spam API in the past – it’s definitely worth a try. It would be far more advanced than a custom spam filter, which spammers would no doubt find a way around.

    Dan

    Comment by Dan — April 11, 2008 @ 6:08 am | Reply

  5. Hey John, I commented on this post a week or two ago, and it wasn’t published immediately (I had a feeling it wouldn’t be), so I was just wondering whether you check Akismet for those sorts of comments or not?

    Maybe it was just deleted because you’ve been receiving tons of spam and didn’t see it among the hundreds of other spam comments blogs often get.

    Comment by James Lewitzke — April 24, 2008 @ 2:45 pm | Reply

