Stop the wacky wikis!
UI researchers design tool to improve Wikipedia accuracy
Had you checked the Microsoft entry on Wikipedia at certain points in the past, you might have learned that the company’s name is Microshaft, its products are evil, and its logo is a kitten.
Similarly, you might have learned from Abraham Lincoln’s Wikipedia entry that he was married to Brayson Kondracki, that his birth date is March 14, and that Pete loves PANCAKES.
None of these is correct or relevant, but all of them showed up at one time or another in the online encyclopedia’s entries. They are also examples of one of the challenges facing Wikipedia: finding and undoing the malicious editing that introduces content that is incorrect, misleading, editorializing, or just plain bizarre.
But a group of University of Iowa researchers is developing a new tool that can detect potential vandalism and improve the accuracy of Wikipedia entries. The tool is an algorithm that checks new edits to a page, compares them to the words in the rest of the entry, and alerts an editor or page manager if something doesn’t seem right.
Tools that try to weed out potential vandalism already exist and are quite useful in many cases, says Si-Chi Chin, a graduate student in the UI’s Interdisciplinary Graduate Program in Informatics. Those tools rely on rules and screens that spot obscenities or vulgarities, or major edits, such as the deletion of an entire section or sweeping changes throughout a document (changing “Microsoft” to “Apple” in the Microsoft entry, for instance).
But those tools are built manually, with prohibited words and phrases entered by hand, so they are time-consuming to maintain and easy to evade. They also aren’t good at catching smaller kinds of vandalism, a shortcoming that led Chin and her professors to develop the automated tool. They recently tested the algorithm by reviewing all the edits made to the Abraham Lincoln and Microsoft entries, Wikipedia’s two most-vandalized pages, to see how many of the pernicious edits it could find. That meant reviewing more than 4,000 edits in each entry. Some of those edits are still on the pages, but most have been deleted and archived.
As described in their paper, “Detecting Wikipedia vandalism with active learning and statistical language models,” the statistical language model algorithm works by finding words or vocabulary patterns that do not appear anywhere else in the entry at any point since it was first written. For instance, when someone wrote “Pete loves PANCAKES” into Lincoln’s entry, the algorithm recognized the graffiti as potential vandalism after scanning the rest of the entry.
“It determines the probability of each word appearing, and because the word ‘pancakes’ didn’t turn up anywhere else in the history of Lincoln’s entry, the algorithm saw it as something new and possible graffiti,” Chin says.

In all, the statistical language model algorithm caught more of the vandalism in some categories than existing tools.
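The paper’s actual model is more sophisticated, but the core idea Chin describes, assigning each word in a new edit a probability based on the entry’s revision history and flagging words the history has never seen, can be sketched with a simple unigram language model. Everything below (the function names, the sample revisions, and the Laplace-style smoothing) is an illustrative assumption, not the authors’ implementation:

```python
from collections import Counter

def build_language_model(revisions):
    """Count every word seen across an entry's revision history."""
    counts = Counter()
    for text in revisions:
        counts.update(text.lower().split())
    return counts

def flag_novel_words(counts, edit):
    """Flag words in a new edit whose smoothed probability is at the
    floor reserved for unseen words, i.e. words with no history."""
    total = sum(counts.values())
    vocab = len(counts) + 1          # +1 reserves probability mass for unseen words
    floor = 1 / (total + vocab)      # smoothed probability of a never-seen word
    flagged = []
    for word in edit.lower().split():
        p = (counts[word] + 1) / (total + vocab)   # add-one (Laplace) smoothing
        if p <= floor:
            flagged.append(word)
    return flagged

# Hypothetical revision history for a short "Lincoln" entry
history = [
    "Abraham Lincoln was the sixteenth president of the united states",
    "Lincoln was married to Mary Todd and led the nation through the civil war",
]
counts = build_language_model(history)
print(flag_novel_words(counts, "Lincoln loves PANCAKES"))  # ['loves', 'pancakes']
```

Words like “lincoln” score well above the floor because they recur throughout the history, while “loves” and “pancakes” sit exactly at it, which is what makes them stand out as possible graffiti for a human editor to review.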