Akismet For MT: Death to Spam

Akismet

If that headline means anything to you, then rejoice. It’s long been known that Akismet, WordPress’s remarkable anti-comment spam technology, was the best out there. Movable Type users (like me) salivated at the thought of having Akismet-like functionality on our sites. The technology works in an AI-like fashion, learning from the edges – bloggers like us – what is spam, and what is not. It’s elegant, and it scales.

Well, thanks to the folks at Automattic (and a big assist from Scot Hacker, Searchblog’s native web jockey), it’s now possible to run Akismet as a Movable Type plugin. Searchblog was among the first to test the Akismet plugin, and it is working beautifully. Sure, you’ll see spam on this site from time to time. But as soon as I label it “junk” in my MT backend, it’ll never show up again. Yeeehaw!

PS – Akismet tracks the spam it has blocked on its home page. According to those figures, 84 percent of all comments left in the blogosphere are spam. Holy crap.

27 thoughts on “Akismet For MT: Death to Spam”

  1. Thanks for this information, spam on my MT driven site has been driving me nuts. This seems to do what I’d hoped MT-Blacklist was supposed to achieve.

    Secondly, is that the Scot Hacker that wrote the BeOS Bible? That was a great book – it really taught me a lot about BeOS back in the day. I wish that operating system had been more successful, it was fantastic.

  2. Yikes. If we assume that 84% of the comments are spam, 10% are irrelevant, 5% are stupid and 0.9% are marginally psychotic it leaves only one comment in a thousand that I need to be reading.

    But not *here* at search blog where 4 out of 5 comments are so useful it’s painful.

  3. Good to know. We upgraded to MT 3.2 just a week ago, and we already have over 1700 junk comments. For us, the noise % is way above 84.

  4. Akismet is fantastic; I use WordPress and Matt’s Akismet for many web sites and I do not know what I would do without it. Since blogging is an extremely part time thing for me, I was on the verge of blocking anything with a link or requiring registration before Akismet showed up.

  5. John,

    Do you know if Akismet has any effect on server load? My issue with fighting spam has been that no matter what method I’ve used they always end up causing a great load on my server — and then my hosting firm threatens to shut me down. I’ve had to turn off comments b/c of that. I’d love to try Akismet if it’s ‘easy’ on a server. From what I can tell it seems like it should be…

  6. I installed it 10 days ago and all I can tell you is this, from my WordPress management screen:

    Akismet has caught 72 spam for you since you installed it.

    Gotta love it!

  7. Andrew – Yep, that’s me (ex-BeOS guy). And I feel the same way about BeOS, though OS X has in most ways become what BeOS always wanted to be, with a few exceptions (speed and metadata, primarily, but speed diffs have been mostly nullified by fast modern hardware and arbitrary metadata is making its way into OS X very quickly).

    Michael – This was my hope as well — that the comment form action would go FIRST to Akismet servers and only be returned if non-spam, thus greatly reducing server load from MT sites. On further reflection, that’s not easy to do, for one good reason: The author needs the ability to scan for false positives and to mark things as spam/ham, to educate the mothership. If Akismet blocked most comments without giving you the chance to review them, it would prevent you from catching false positives and would break the feedback loop. So to be effective, all comments still must go into the blog installation’s comments system to give you a chance to flag false positives, etc.

    There are two levels of server resource impact:

    1) A comment is submitted and entered into the database, but not published on the site.

    2) A comment is submitted and published.

    With WordPress, those two levels are identical in terms of server impact, since everything is dynamic. With Movable Type, publishing always requires a page rebuild, which is CPU-expensive. So the more you can prevent unnecessary publishing, the lower the impact on the server. What you really want to do is prevent spam submissions from triggering page rebuilds. Akismet is great at preventing unnecessary publishing, but so are a lot of other spam fighting tools. Fortunately, Akismet has a VERY low false positive rate, so virtually all of its estimations about what is/is not spam are correct.

    Nutshell: Akismet/MT doesn’t reduce the overall number of database inserts, but it does reduce the number of CGI-based page rebuilds that are ultimately triggered.

    So… Akismet/MT is an awesome plugin, but it’s not a panacea in terms of reducing server load. It could conceivably become that by introducing a proxy server for comments into the chain. With an external proxy server, you could point the comment form action to the proxy, which would then hold comments for review. The result would be that impact on the main web server would be drastically reduced, but a review cycle / feedback loop would still be possible.

    Hope this makes sense!
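
The two levels of server impact described in comment 7 can be sketched in a few lines of Python. This is a toy model, not the plugin's code: the `Blog` class, the `toy_classifier` function, and the rebuild counter are all made up for illustration, and the classifier merely stands in for a call to Akismet.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Comment:
    author: str
    body: str
    junk: bool = False


@dataclass
class Blog:
    # is_spam stands in for the external spam check (hypothetical hook).
    is_spam: Callable[[Comment], bool]
    stored: List[Comment] = field(default_factory=list)
    rebuilds: int = 0

    def submit(self, comment: Comment) -> None:
        comment.junk = self.is_spam(comment)
        # Level 1: every comment is stored, junk or not, so the author
        # can still review it and correct false positives.
        self.stored.append(comment)
        # Level 2: only comments that pass the check trigger the
        # expensive static page rebuild.
        if not comment.junk:
            self.rebuilds += 1


def toy_classifier(comment: Comment) -> bool:
    # Placeholder rule for the demo; a real filter is far smarter.
    return "cheap pills" in comment.body.lower()
```

Submitting two spam comments and one real one to this model gives three stored rows but only one rebuild, which is the point of the "Nutshell" line: the inserts stay, the rebuilds go.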

  8. Thanks Scot. Funny thing when my server was getting slammed, it wasn’t from MT doing rebuilds, it was just from the spam filter. None of the spam was actually getting through.

    I guess I’ll just have to give Akismet a try and see what the server load is like.

  9. Russell – Akismet catches TrackBack spam too!

    Michael – You don’t mention which spam filter you were using; there are so many different approaches to filtering, and some are high-impact, others low. Some make a CGI call, others don’t. Some make a database insert regardless, others filter before hitting the database. A lot of variables in that equation.

  10. MT-Blacklist: For every comment submitted, extract the entire list of blacklisted strings from the database and compare the current submission to it. Only do the database submission if tests are passed. So there’s a large db request, a CGI action, and probably another db insert.

    SpamLookup: For every comment submitted, compare to a ruleset. But the ruleset is stored in the database. So essentially the same deal, except that you don’t have to pull out the entire blacklist to do the compare.

    MT-Akismet will be less expensive CPU-wise than either of those, since you have a brief call to CGI, some data transmission between you and Akismet, and then a db insert (probably of a comment flagged as junk).
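
For a sense of how small the "data transmission between you and Akismet" is, here is a rough Python sketch of the request. It is not the MT plugin's code. The endpoint and field names follow the public Akismet comment-check API as I understand it, so check the current documentation before relying on them; the function names `build_comment_check` and `parse_verdict` are made up for this example.

```python
from urllib.parse import urlencode


def build_comment_check(api_key, blog_url, user_ip, user_agent, content, author=""):
    """Return (url, body) for a comment-check POST. Nothing is sent here."""
    # Assumed endpoint shape: the API key is part of the host name.
    url = f"https://{api_key}.rest.akismet.com/1.1/comment-check"
    fields = {
        "blog": blog_url,
        "user_ip": user_ip,
        "user_agent": user_agent,
        "comment_type": "comment",
        "comment_author": author,
        "comment_content": content,
    }
    # Form-encoded body, as used by an ordinary HTML form POST.
    body = urlencode(fields).encode("utf-8")
    return url, body


def parse_verdict(response_text):
    """The service answers with the plain text 'true' (spam) or 'false'."""
    return response_text.strip().lower() == "true"
```

To actually send it you could pass the pair to `urllib.request.urlopen(url, data=body)` and hand the decoded response to `parse_verdict`. Either way it is one short HTTP round trip per comment, with no blacklist or ruleset pulled from the local database.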

  11. At the risk of asking about something that’s obvious, the recommended way to use the plugin is as the only anti-spam tool, turning off Spamlookup and such?

  12. billg – Yep, that’s what I would recommend, since scoring between systems can conflict (where it should complement). Try it first with other spam systems off, and maybe moderation enabled, at least until Akismet/MT is catching as much as Akismet/WP. Spam signatures seem to differ between the systems, resulting in slightly lower effectiveness right now; that should improve once Akismet has been in the wild for a while.

    Annoying Old Guy – That autoban plugin appears to be on exactly the right track, and I like that it doesn’t require mod_security, like other similar systems, as mod_security can really drag down Apache 1.x when there are very large rulesets. Out of curiosity, how large (in Kb) are your .htaccess rules, and have you noticed any negative effect on Apache from storing them in memory?

  13. Glad to see all the excitement, and I really do think this is another plugin that shows off how extensible MT’s junk folder system is. Just to respond to a couple of the points here…

    “We upgraded to MT 3.2 just a week ago, and we already have over 1700 junk comments.”
    Junk should automatically be deleted in your MT system, so the number of junk comments that are there shouldn’t be an issue for you. It’s just a good sign the system has caught all those for you.

    “With Movable Type, publishing always requires a page rebuild, which is CPU-expensive.”

    Scot, I’m assuming you’re talking about your particular setup, but I wanted to make clear that MT does *not* require a page rebuild, and hasn’t for years. It’s just a smart setting for high-traffic sites that don’t want to be database-dependent for their page views.

    Also, out of curiosity, how are your sites running Akismet dealing with this in the privacy policy or commenting policy? I’ve seen a lot of what John writes about privacy for any site storing or filtering people’s content on the web, and I’d guess readers here (or on other business sites, where MT is often deployed) are going to be more picky about these kinds of issues than on personal blogs.

  14. Thanks, Scot. I’m preparing to move a new MT site from my desktop to the server, and will likely go with the plugin and moderation.

  15. “…MT does *not* require a page rebuild, and hasn’t for years. It’s just a smart setting for high-traffic sites that don’t want to be database-dependent for their page views.”

    I suspect the implication some folks take from that fact is that using static pages is, in every case, a bad thing for low-traffic sites.

    But, why would that be? I gather some users have had issues with their hosting companies re: server load brought on when they rebuild large archives. Hence, I’d guess, the recommendation to go dynamic with archives and static with everything else. However, if the server isn’t complaining about rebuilds, wouldn’t the advantages of going static accrue to both high-traffic and low-traffic sites?

  16. Good point, Bill, I should have said it can be a benefit for sites of any size, but that the balance tilts even further towards static pages as your traffic gets into some of the more extreme author-to-reader ratios that very large sites have. Does that make more sense?

  17. Looks like my biggest one right now is about 35K (~3000 entries). I haven’t noticed any performance hit, but it’s not my server so my measuring ability is crude at best. I asked the webhost staff about it and they didn’t think several thousand entries was a problem.

    The bigger concern would be the server hit when an object is junked, because it loads all the junk from the DB and regenerates the .htaccess file. My opinion is that overall, that’s much cheaper if you are getting hit by the same source in a big burst because after the first N the server doesn’t load up the MT application. The plugin also does file locking so that a new update doesn’t start if a previous one hasn’t finished. That should limit the extra load induced by the plugin to something reasonable.

    However, that is all just opinion from the crude observations I have made. It seems that the number of IP sources tends to top out around 7000 and I have run with .htaccess files with that many entries with no noticeable effects.
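
The regenerate-and-lock approach described in comment 17 might look roughly like the Python below. This is only a sketch of the idea, not the autoban plugin's code: the function names are invented, the list of junk IPs is assumed to have been loaded from the database already, and the `Order`/`Allow`/`Deny` lines are the older Apache access-control syntax.

```python
import os


def render_htaccess(junk_ips):
    """Build the file contents: allow everyone except the junk sources."""
    lines = ["Order Allow,Deny", "Allow from all"]
    # De-duplicate and sort so the output is stable between runs.
    lines += [f"Deny from {ip}" for ip in sorted(set(junk_ips))]
    return "\n".join(lines) + "\n"


def write_htaccess(path, junk_ips):
    """Regenerate the file unless another update is already in progress."""
    lock_path = path + ".lock"
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # a previous update has not finished; skip this one
    try:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as tmp:
            tmp.write(render_htaccess(junk_ips))
        # Swap the new file in with a single rename so Apache never
        # reads a half-written .htaccess.
        os.replace(tmp_path, path)
        return True
    finally:
        os.close(fd)
        os.remove(lock_path)
```

The lock file keeps a burst of junked comments from starting several overlapping rewrites, and once an IP is in the file Apache turns it away before the MT application is loaded at all.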

  18. “Scot, I’m assuming you’re talking about your particular setup, but I wanted to make clear that MT does *not* require a page rebuild, and hasn’t for years. It’s just a smart setting for high-traffic sites that don’t want to be database-dependent for their page views. Also, out of curiosity, how are your sites running Akismet dealing with this in the privacy policy or commenting policy?”

    Anil, true, MT can render pages dynamically, but the default config is still static, right? And as a result, the vast majority of MT blogs are all static. So most low-traffic sites are static too.

    I’ll let John respond on whether he sees a privacy concern here — I would say that that would be a largely academic concern, but if he wants to add a note that Akismet is external, we can certainly add it to the site.

  19. “I gather some users have had issues with their hosting companies re: server load brought on when they rebuild large archives.”

    billg: People rebuild sites very seldom, but are hit with spam constantly. I would venture that any host that asks an MT user to stop is referring to performance issues stemming from comment spam attacks, not site rebuilds. Thanks for the tech notes on autoban — this definitely looks like a good one for the toolbelt.

  20. My ISP complained about the traffic load, that’s what alerted me to the trackback spam. 10,000 links in the space of 2 days. I turned off the trackbacks & deleted all the links.

    The problem with filtering is that simple changes can defeat filters. The best filter is the human brain and a good white list.

    Brent
    *********************
    Block Spam 100%
    http://www.spam-killa.com
    The Online Solution to email Spam.
