artificial intelligence - Better approach to filtering Wikipedia edits

Keywords: artificial intelligence


When you follow a particular Wikipedia article through its RSS feed, the unfiltered stream is annoying, because most of the edits are spam, vandalism, minor edits, etc.

My approach is to create filters. I decided to remove all edits that are identified only by the contributor's IP address rather than by a nickname, because most such edits are spam (though there are some good contributions). This was easy to do with regular expressions. I also removed edits that contained vulgarisms and other typical spam keywords.
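For reference, the two filters described above could look roughly like the following sketch. It assumes each feed entry has already been parsed into an author string and an edit text; the names `is_anonymous`, `keep_edit` and the `SPAM_WORDS` list are made up for illustration. Note that it uses Python's standard `ipaddress` module instead of a hand-written regular expression to recognize IP addresses, since that also covers IPv6.

```python
import ipaddress
import re

# Hypothetical keyword list; a real filter would use a much longer one.
SPAM_WORDS = re.compile(r"\b(?:viagra|casino|free money)\b", re.IGNORECASE)


def is_anonymous(author):
    """Return True if the author field is a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(author.strip())
        return True
    except ValueError:
        return False


def keep_edit(author, text):
    """Keep an edit only if it is by a named user and has no spam keyword."""
    return not is_anonymous(author) and SPAM_WORDS.search(text) is None
```

A filter like this is static: it throws away the good anonymous contributions too, and it never learns new spam words.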

Do you know a better approach that uses algorithms or heuristics with regular expressions, AI, text-processing techniques, etc.? The approach should be able to detect bad edits (minor edits or vandalism), learn incrementally what a good or bad contribution is, and update its database accordingly.

Thank you.

1 Answer: 

There are many different approaches you can take here, but traditionally spam filters with incremental learning have been implemented using naive Bayes classifiers. Personally, I prefer the Winnow2 algorithm, which is even easier to implement (details can be found in this paper).
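To give an idea of how little code is involved, here is a minimal sketch of the textbook form of Winnow2 over binary features (a feature is either present in an edit or not). It is not necessarily the exact variant from the paper; the class name, the default `alpha` and the default `threshold` are assumptions chosen for illustration and would need tuning.

```python
class Winnow2:
    """Textbook Winnow2: multiplicative weight updates, made only on mistakes."""

    def __init__(self, alpha=2.0, threshold=4.0):
        self.alpha = alpha          # promotion/demotion factor (assumed value)
        self.threshold = threshold  # decision threshold (assumed value)
        self.weights = {}           # feature -> weight, implicitly 1.0 if unseen

    def score(self, features):
        # Sum the weights of the features that are present in this edit.
        return sum(self.weights.get(f, 1.0) for f in set(features))

    def predict(self, features):
        """Return True if the edit is classified as bad."""
        return self.score(features) > self.threshold

    def update(self, features, is_bad):
        """Learn from one labelled edit; weights change only after an error."""
        if self.predict(features) == is_bad:
            return
        # Missed a bad edit: promote its features. False alarm: demote them.
        factor = self.alpha if is_bad else 1.0 / self.alpha
        for f in set(features):
            self.weights[f] = self.weights.get(f, 1.0) * factor
```

Because `update` can be called one example at a time, the classifier keeps learning every time you mark an edit as good or bad.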

First you need to extract features from the text you want to classify. Unfortunately, the Wikipedia RSS feeds don't seem to be particularly machine-readable, so you will probably need to do some preprocessing. Alternatively, you could use the MediaWiki API directly, or see if one of the bot frameworks linked at the bottom of this page is of help to you.
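If you go the API route, a request for the recent revisions of one article can be built along these lines. This is only a sketch: the helper name `revisions_url` is made up, the choice of `rvprop` fields is my guess at what is useful for this task, and the exact shape of the JSON response should be checked against the API documentation before relying on it.

```python
from urllib.parse import urlencode

# English Wikipedia endpoint; other wikis use their own host name.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def revisions_url(title, limit=20):
    """Build a query URL for the latest revisions of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        # Revision metadata to request for each edit.
        "rvprop": "ids|timestamp|user|comment|flags|size",
        "rvlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

Fetching the URL and decoding the JSON is left out here so that the sketch runs without network access.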

Ideally you would end up with a list of words that were added, words that were removed, various statistics you can compute from that, and the metadata of the edit. I imagine the list of features would look something like this:

  • editComment: wordA (wordA appears in edit comment)
  • -wordB (wordB removed from article)
  • +wordC (wordC added to article)
  • numWordsAdded: 17
  • numWordsRemoved: 22
  • editIsMinor: Yes
  • editByAnIP: No
  • editorUsername: Foo
  • etc.

Include anything you think might be helpful in distinguishing good from bad edits.
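A rough sketch of how such a feature list could be produced from one edit is shown below. It assumes you already have the article text before and after the edit plus the edit's metadata; the function name `extract_features` and its parameters are hypothetical. The word diff is a simple bag-of-words comparison, so it ignores word order and would miss an edit that only moves text around.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")


def count_words(text):
    """Count lower-cased words in a piece of text."""
    return Counter(w.lower() for w in WORD.findall(text))


def extract_features(old_text, new_text, comment, username, is_minor, is_ip):
    """Return (set of binary features, dict of numeric statistics)."""
    before, after = count_words(old_text), count_words(new_text)
    added = after - before    # Counter subtraction keeps positive counts only
    removed = before - after

    features = {"editComment:" + w for w in count_words(comment)}
    features |= {"+" + w for w in added}
    features |= {"-" + w for w in removed}
    features.add("editIsMinor:" + ("Yes" if is_minor else "No"))
    features.add("editByAnIP:" + ("Yes" if is_ip else "No"))
    features.add("editorUsername:" + username)

    stats = {
        "numWordsAdded": sum(added.values()),
        "numWordsRemoved": sum(removed.values()),
    }
    return features, stats
```

The numeric statistics are returned separately because classifiers that work on present/absent features need them bucketed first (for example "numWordsAdded:10-50").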

Once you have extracted your features, it is fairly simple to use them to train a Winnow or naive Bayes classifier.
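As an illustration of the Bayesian option, here is a minimal incremental naive Bayes sketch that takes a set of feature strings per edit. The class and method names are made up, and the add-one smoothing is just one common choice, not the only one.

```python
import math
from collections import Counter


class NaiveBayes:
    """Incremental naive Bayes over present/absent string features."""

    def __init__(self):
        self.class_counts = Counter()  # label -> number of edits seen
        self.feature_counts = {True: Counter(), False: Counter()}
        self.vocabulary = set()

    def update(self, features, is_bad):
        """Add one labelled edit to the counts (is_bad: True or False)."""
        self.class_counts[is_bad] += 1
        for f in set(features):
            self.feature_counts[is_bad][f] += 1
            self.vocabulary.add(f)

    def log_score(self, features, label):
        total = sum(self.class_counts.values())
        # Smoothed class prior.
        score = math.log((self.class_counts[label] + 1) / (total + 2))
        # Add-one smoothing; the extra 1 leaves room for unseen features
        # and avoids a zero denominator before any training.
        denom = sum(self.feature_counts[label].values()) + len(self.vocabulary) + 1
        for f in set(features):
            score += math.log((self.feature_counts[label][f] + 1) / denom)
        return score

    def predict(self, features):
        """Return True if 'bad' is the more likely class."""
        return self.log_score(features, True) > self.log_score(features, False)
```

Training is just a call to `update` for every edit you have judged by hand, so the counts (the "database") grow as you keep correcting the filter.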