Wikipedia:Turnitin/Intro

Wikipedia has a copyright problem, and the full scale of that problem remains obscured by a lack of sophisticated, scalable tools for detecting copyright violations. The most advanced existing tool, CorenSearchBot (now MadmanBot), compares new articles against a Google search for similar pages. Turnitin, one of the world's leading plagiarism detection services, could offer us a significantly more comprehensive system. Turnitin processes millions of documents for thousands of institutions. They are experts at finding and flagging overlaps between submitted documents on the one hand, and webpages, magazines, journals, and prior submissions to Turnitin on the other.

Right now the most involved vision for a collaboration (which is still in preliminary discussion) might look something like the following. Turnitin adapts their algorithm specifically to analyze Wikipedia articles for plagiarism. (Plagiarism does not mean copyright violation, but it's the best starting point for an investigation.) Their algorithm would be tweaked to ignore Wikipedia mirrors and other sites whose copyright licenses are compatible with use on the encyclopedia. At off-peak hours for Turnitin, they could run full reports on every single article on English Wikipedia. The reports would detail which parts of Wikipedia articles matched web content, proprietary content, and, if desired, prior submissions to Turnitin. The reports would identify the external source behind each match. A page cataloging instances of plagiarism could be created, ranking articles from highest to lowest match score. Bots could be developed to automatically flag articles with high scores. Copyright forums could be updated to deal with the worst offenders in a prioritized way.
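To make the ranking and flagging idea concrete, here is a rough sketch in Python. Nothing in it reflects Turnitin's actual report format or algorithm, which have not been specified. The `Match` and `Report` structures, the `IGNORED_SOURCES` allowlist, the score definition (share of an article's words that overlap non-ignored sources), and the 0.5 threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

# Hypothetical allowlist: Wikipedia mirrors and compatibly licensed sites
# whose matches should not count toward an article's score.
IGNORED_SOURCES = {"mirror.example.org", "freecontent.example.com"}


@dataclass
class Match:
    source: str         # domain of the external source that overlapped
    matched_words: int  # number of article words overlapping that source


@dataclass
class Report:
    title: str           # article title
    total_words: int     # article length in words
    matches: List[Match]


def match_score(report: Report, ignored: Set[str] = IGNORED_SOURCES) -> float:
    """Fraction of the article's words matching non-ignored sources, capped at 1.0."""
    if report.total_words == 0:
        return 0.0
    matched = sum(m.matched_words for m in report.matches if m.source not in ignored)
    return min(matched / report.total_words, 1.0)


def rank_reports(reports: List[Report]) -> List[Tuple[str, float]]:
    """Return (title, score) pairs sorted from highest to lowest score."""
    scored = [(r.title, match_score(r)) for r in reports]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def flag_for_review(ranked: List[Tuple[str, float]], threshold: float = 0.5) -> List[str]:
    """Titles whose score meets the (assumed) threshold, in ranked order."""
    return [title for title, score in ranked if score >= threshold]


# Made-up sample data, for illustration only.
reports = [
    Report("Article A", 100, [Match("news.example.com", 60)]),
    Report("Article B", 200, [Match("mirror.example.org", 180),
                              Match("journal.example.net", 20)]),
    Report("Article C", 0, []),
]
ranked = rank_reports(reports)
flagged = flag_for_review(ranked)
```

In this sample, Article B overlaps heavily with a mirror, but that match is ignored, so only Article A would be flagged. A high score would still only be a prompt for human investigation, since plagiarism does not by itself mean copyright violation.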

In return for providing this service, Turnitin would like some attribution. Although specifics have not yet been detailed, this could take several different forms, all of which would have to be approved by the community. The current proposal is for a talk page banner that says, "This article was checked for text-matches against other websites and articles. Click here to read the full report." The linked report would be 'branded' as something like WikipediaCheck, and the lower right corner would contain a logo for iThenticate, Turnitin's parent company. Whatever plan is decided upon could also be promoted in different community forums, especially those that focus on copyright investigations and content production/review (AfC, DYK, GA, FA). Lastly, Turnitin could advertise in their own promotional materials that they help 'check Wikipedia'. One external constraint is that whatever system we implement must complement rather than overload their core site operations.

The benefit of a collaboration with Turnitin is access to their experience in detecting plagiarism and to significant technical support for implementing a plan. If every Wikipedia article were checked for plagiarism, it could massively retool how we approach content, how we root out copyright violations, and how we ensure the encyclopedia is indeed free for anyone to use, modify, or sell. That is required not only by copyright law but also by our policies and our core mission. The expertise and level of service Turnitin is considering providing to us would cost tens if not hundreds of thousands of dollars if purchased. In this case, they seem interested in offering the same level of service in exchange for nothing more than our acknowledgment that they are doing so. It's a tempting and intriguing possibility, and it would need careful consideration to design, approve, and implement in a way that maximally benefits our community and minimally disrupts it.

The main concerns about such a partnership involve core issues of non-commerciality and independence. What is at stake in having a for-profit company provide a service to us? Are there free, open-source, or competing alternatives? How much attribution crosses the line into advertising? Are we setting a precedent that could compromise Wikipedia's neutrality or invite future alliances with corporations that would threaten our basic principles? I believe these deep questions can be fairly addressed in this context and, when weighed against the potential benefit of a collaboration with Turnitin, are at the least worth exploring further.

FYI, I have no financial or personal connection with Turnitin. I heard about them through a mainstream news article and wondered if they could help us with a core site management issue. I recently set up a somewhat similar partnership with HighBeam Research, which donated 1000 free 1-year accounts for us to use ($200,000 worth of services on paper). In return, they only wanted us to advertise the account application process widely and to follow our citation policies in providing links back to their articles when they were used as references. It is my belief that these relationships, which are informal, non-contractual, and non-exclusive, can be mutually beneficial and help us do what we do better. Ocaasi (t | c) 22:11, 28 March 2012 (UTC)