How to Detect Plagiarism and Content Theft

Filed as Features, Guides on May 21, 2007 8:46 am

Content theft is on the rise and the problem is spreading to more and more bloggers.

Many blogs, especially those with spam-friendly keywords, are scraped from their very first post. Those who avoid that fate, sadly, seem to follow soon thereafter as their sites receive links and gain the attention of blog search engines and content-hungry spammers.

When it is all said and done, it is not a matter of if, but when, your new blog is plagiarized, either via an automated process or by a human looking to fill the pages of their own site.

However, detecting such plagiarism can be a daunting challenge. With the Internet as vast as it is and growing every second, finding plagiarized copies of your work can seem akin to finding a needle in a haystack.

Fortunately, the very tools that spammers and plagiarists rely on to benefit from your work make it easy to locate them. It is simply a matter of knowing how to use the tools that are available.

Textual Plagiarism

Of all the media formats, text is the hardest to protect from plagiarism. A few JavaScript tricks exist, but they are easily defeated, and other techniques, such as hiding the text in an image, also hide the text from search engines.

Fortunately though, text plagiarism can be easily detected. The easiest way is to simply find a unique sentence in an article you created and then search for that phrase in one of the major search engines.

For example, with this article, a search for “tools that spammers and plagiarists rely on to benefit” works well (at least as of this writing). To eliminate the need for repeated searching, you can set up a Google Alert for the phrase so that Google notifies you via email when a new site containing the phrase appears.
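The exact-phrase search described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the `-site:` exclusion operator and the Google query URL are assumptions about the search engine you use, and `example.com` stands in for your own domain.

```python
from typing import Optional
from urllib.parse import quote_plus


def exact_phrase_query(phrase: str, exclude_site: Optional[str] = None) -> str:
    """Wrap a phrase in double quotes so the engine matches it word for word."""
    # Drop straight and curly quotes, then collapse runs of whitespace.
    for mark in ('"', "\u201c", "\u201d"):
        phrase = phrase.replace(mark, " ")
    query = '"' + " ".join(phrase.split()) + '"'
    if exclude_site:
        # Assumption: the engine supports -site: to leave out your own domain.
        query += " -site:" + exclude_site
    return query


def search_url(query: str, base: str = "https://www.google.com/search?q=") -> str:
    """URL-encode the query and append it to a search engine's query URL."""
    return base + quote_plus(query)


phrase = "tools that spammers and plagiarists rely on to benefit"
print(search_url(exact_phrase_query(phrase, exclude_site="example.com")))
```

The quoted query string is also what you would paste into a Google Alert, so the same helper can serve both the one-off search and the standing alert.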

Another technique, for WordPress bloggers, is to use Maxpower’s Digital Fingerprint Plugin to insert a customized fingerprint into every post. By default the fingerprint is placed in every feed entry, but it can also be added manually to the site itself. The fingerprint is an artificial, unique phrase that you can easily search for and set up a Google Alert for.
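For those not on WordPress, the general idea of a fingerprint can be approximated by hand. The sketch below is not the plugin's code and makes no claim about how the plugin works; it simply assumes that a short token derived from a private string is unlikely to appear anywhere else on the Web. The function names and the `fp` prefix are invented for this example.

```python
import hashlib


def make_fingerprint(site_secret: str, prefix: str = "fp", length: int = 16) -> str:
    """Derive a short, searchable token from a private string you choose."""
    digest = hashlib.sha256(site_secret.encode("utf-8")).hexdigest()
    # A prefix plus the first hex characters is enough to be distinctive.
    return prefix + digest[:length]


def add_to_feed_entry(entry_text: str, fingerprint: str) -> str:
    """Append the token to the end of a feed entry's text."""
    return entry_text.rstrip() + "\n\n" + fingerprint


token = make_fingerprint("replace-with-a-private-string")
print(add_to_feed_entry("Post body goes here.", token))
```

Because the token is the same every time it is generated from the same private string, one search or one alert covers every entry that carries it.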

Finally, there are several Web sites that provide automated searches for Web site content and can eliminate many of the challenges in searching for your own work. Of those, Copyscape is the most refined, but its cap of ten results per free search limits its usefulness. PlagiarismChecker.com provides a basic plagiarism check by using an algorithm to guess unique phrases in a page. Article Checker provides a very thorough line-by-line search of a site, helping the user pick up likely phrases that might lead to plagiarist sites.
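If you would rather pick search phrases yourself, a rough heuristic can narrow the choices. The sketch below is not the algorithm any of these services use; it assumes, purely for illustration, that sentences containing more long words are more distinctive, and it ranks sentences on that basis. The function name and thresholds are hypothetical.

```python
import re
from typing import List


def candidate_phrases(text: str, min_words: int = 6, top: int = 3) -> List[str]:
    """Return up to `top` sentences, ranked by how many distinct long words they hold."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        if len(words) < min_words:
            continue  # Very short sentences are rarely unique.
        long_words = {w.lower() for w in words if len(w) > 6}
        scored.append((len(long_words), sentence.strip()))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top]]


sample = (
    "It is a blog. Spammers and plagiarists frequently republish "
    "syndicated articles without permission. We like it a lot today."
)
print(candidate_phrases(sample, top=1))
```

Each sentence it returns can then be searched as an exact phrase in the search engine of your choice.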

Photo Plagiarism

Unlike text plagiarism, photo plagiarism is relatively easy to guard against. However, as photo sharing sites, such as Flickr, become more popular and make uploading images easier, it is also becoming much more common.

Unlike text, there is no easy way to search for plagiarized photographs. Since search engines only understand text, finding an image, even over a specialized search engine such as Google Image Search, can be difficult.

One technique is to give your images unique file names and search for those names. Also, if they are available, your server logs are worth watching closely, as many plagiarists will not only lay claim to a photo but also hotlink it from your own server, thus stealing bandwidth as well as the work that went into the image.
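Checking server logs for hotlinking can be partly automated. The sketch below assumes your host writes logs in the common "combined" format, where the referring page appears in quotes after the status code and response size; the host names, file name and sample lines are made up for illustration, and your own log layout may differ.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed layout: "GET /path HTTP/1.1" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def hotlinkers(lines, own_hosts):
    """Count image requests whose referring page is on someone else's site."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        path = match.group("path").split("?", 1)[0]
        if not path.lower().endswith(IMAGE_EXTENSIONS):
            continue
        host = urlparse(match.group("referer")).hostname or ""
        # Skip requests with no referer and requests from your own pages.
        if not host or host in own_hosts:
            continue
        counts[(host, path)] += 1
    return counts


sample_log = [
    '203.0.113.5 - - [21/May/2007:08:46:00 +0000] "GET /images/sunset-jb-0417.jpg HTTP/1.1" 200 51234 "http://scraper.example/post.html" "Mozilla/5.0"',
    '203.0.113.9 - - [21/May/2007:08:47:00 +0000] "GET /images/sunset-jb-0417.jpg HTTP/1.1" 200 51234 "http://myblog.example/photos" "Mozilla/5.0"',
]
for (host, path), hits in hotlinkers(sample_log, {"myblog.example"}).items():
    print(host, path, hits)
```

A distinctive file name such as the hypothetical `sunset-jb-0417.jpg` does double duty here: it is easy to spot in the log and easy to search for by name.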

There are also new, experimental search engines designed to detect images that are very similar in nature. Though several such search engines are being worked on, none are indexing the Web as a whole yet and only a few are publicly available for use.

Finally, in addition to image search engines, Digimarc offers a MyPictureMarc service that embeds an invisible watermark in images and then tracks them as they are distributed around the Web. However, the version of the service that offers image tracking starts at nearly $500 per year, which puts it out of the reach of most amateur photographers.

The bottom line is that, at this time, it is far wiser to ensure that your images are marked well and are harder to steal. Most cases of image plagiarism are currently reported by loyal fans and not discovered through technology. Though new tools promise to change that, they are several months off at least.

Audio and Video Plagiarism

At this time, audio and video plagiarism are relatively rare when compared to image and text plagiarism. The tools for editing audio and video are more expensive and harder to use. Also, the tools for hosting such content have, historically, been out of the reach of most Webmasters.

However, with the advent of sites such as YouTube, the concern about audio and video plagiarism has grown. Unfortunately, though, it remains some of the hardest plagiarism to detect.

Though there are many companies in the field of identifying duplicate audio and video content, they are typically targeted at large corporations, not end users.

For example, Gracenote, famous for its work with Myspace, fingerprints audio, and Audible Magic offers tools for both audio and video. Both are very effective at what they do, but they are targeted at corporate users, not individuals.

Despite this, it is possible to detect a decent amount of plagiarism of audio and video content. Since tagging of clips is necessary to make the information easily searchable, following the tags related to your content may be the best way to track such plagiarism at this time.

Other than checking server logs, that is the best tool available right now. However, it likely will not be long until a company steps in to fill this market, especially as podcasting and vlogging continue to grow in popularity.

Conclusions

Detecting plagiarism and content theft, for most kinds of content, is a fairly simple matter. The tools are already available and it is just a matter of knowing how to exploit them in order to find out who is misusing your work.

Even if plagiarism doesn’t interest you, the same techniques can be used to track down legitimate uses of your content, such as people taking advantage of your Creative Commons License, and to learn how your work is spreading across the Web.

But the bottom line remains: if you post to the Web, especially in a blog, it is only a matter of time before your work is taken. It can be a frustrating experience, but the first step to stopping it is learning about it and, in order to learn about it, you just have to know where to look.

Note: I am not a lawyer and nothing in this article is to be taken as legal advice. Though it is based upon extensive research into fair use, it is not to be taken as legal truth. If you have a question about fair use, it would be best to take it up with an attorney.


Jonathan Bailey writes at Plagiarism Today, a site about plagiarism, content theft and copyright issues on the Web.




  1. By Tony Hung posted on May 21, 2007 at 8:56 am

    I will sound like a total homer when I say this — but damn, Jonathan, I am *so* glad you’re on the BlogHerald team.

    What a great article! :)

    tony

  2. By Jonathan Bailey posted on May 21, 2007 at 9:27 am

    Thanks Tony. I’m glad to be here! Looking forward to writing the next one already!

  3. By pelf posted on May 22, 2007 at 10:34 am

    While the subject line is interesting, the article is too long (IMHO). So I guess I gotta come back trrw to continue reading the rest! :)
