Last week, Tony posted an article about a somewhat different kind of spam blogger.
The spammer had taken an article from this site, scraped it and then modified it before republishing. Though the method of modification remains debatable, it is clear that it was through some automated means as the duplicate version was mangled and borderline unintelligible.
However, the unfortunate truth is that this type of scraping is not as uncommon as we might wish and the technology to do it has been around for several years. Worse still, this type of scraping is growing much more popular as search engines clamp down on duplicate content and ad networks get better at detecting traditional content theft.
Modified scraping is a rising threat that bloggers need to be aware of as it presents a whole new set of challenges for content creators.
No Laughing Matter
It is easy to laugh at these automatic scrapers as their results are often quite comical and produce gems such as this out of completely legible text:
“One word, or the demand thereof, denaturized the full intent. Such is the venture digit takes when they indite in a module in which they demand the pertinent fluency.”
However, the broken English belies the full extent of the problem. Spammers create these works by taking posts from legitimate bloggers and then running them through an algorithm. This can involve using a thesaurus to find synonyms for the words in question or an automatic translation program to convert the work into another language, possibly then converting it back to English.
This process of modifying the content before reposting it is often called “spinning”. Spinning a work before republication has several advantages for the spammer, the largest of which is that Google is less likely to detect the work as a duplicate and is thus more likely to rank it higher. However, almost equally important is that it is much harder for victims of plagiarism to detect and follow up on the misuse, making this kind of abuse much harder to stop.
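To make the thesaurus approach concrete, here is a minimal Python sketch of word-for-word synonym substitution. It is only an illustration: the `SYNONYMS` table and the `spin` function are hypothetical, and real spinning tools are not described in any detail here. What it does show is why the output reads so badly, since each word is swapped without regard to context, and why any word missing from the table passes through untouched.

```python
import re

# Hypothetical, hand-made synonym table (an assumption for illustration).
# A real spinner would presumably draw on a much larger thesaurus.
SYNONYMS = {
    "risk": "venture",
    "write": "indite",
    "language": "module",
}


def spin(text, synonyms=SYNONYMS):
    """Swap every word found in the table for its synonym, ignoring context."""

    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            # Words with no entry (names, keywords, made-up strings) survive.
            return word
        # Keep a leading capital so sentence starts still look plausible.
        return replacement.capitalize() if word[0].isupper() else replacement

    # Only runs of letters are treated as words; punctuation is left alone.
    return re.sub(r"[A-Za-z]+", swap, text)


if __name__ == "__main__":
    print(spin("Writers risk errors when they write in a second language."))
```

Note that a string with no dictionary meaning would come out of this function unchanged, which is the property that makes some detection techniques workable.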
The good news in all of this is that, since so little of the content remains the same, the odds of the search engines penalizing the victim are much slimmer than with traditional, verbatim scraping. However, that does not mean these modified scrapers aren’t targeting the same keywords as your site, which they often intentionally leave intact when spinning a work, and they might still usurp the original work through a combination of scraping and spam linking.
Though less of a direct threat to bloggers, these scrapers are still a major thorn in the side of legitimate content creators and remain a threat well worth addressing.
The problem is that, when confronted with this type of scraping, many feel that there is little they can do. They fear that, since the reuse isn’t verbatim, the law does not protect them and there is no action they can take.
Fortunately, the law is very clear on this subject. Copyright is not merely the right to copy one’s own work, but a set of rights that includes the right to create derivative works. This is why only J.K. Rowling can sell Harry Potter books, though she does tolerate non-profit fan fiction, and why spinning a work is almost always still illegal.
This right to create derivative works covers the right to create translations and any other work based on copyrightable portions of the original. Spinning, since it starts with a copyright-protected work and creates a new work based upon it, violates that right.
Fair use arguments fall equally flat in the eyes of the law. Spinning is not transformative, as it is designed to replace the original; it offers no commentary or criticism, it is for commercial use, it can greatly harm the market for the original work and it is usually unattributed. There is almost no fair use argument left for the spammers who modify the posts they scrape, leaving the door wide open for rightsholders to take action.
In short, though I am not a lawyer, I can see little reason to doubt your rights in the event you detect such scraping of your content. Your work is still very much protected and your rights are still very much enforceable.
What to Do
Of course, knowing that your work is protected does little good if you cannot detect the misuse of your content. As we discussed earlier, this can be a challenge as the content has been modified and most search engines can only detect verbatim copying. Even powerful academic tools, such as Turnitin, struggle when faced with non-verbatim copying.
In a recent article on my site, I talked about various techniques for detecting spun versions of your posts. Those tips included the following:
- Digital Fingerprinting: Digital fingerprinting is a process by which you append a unique word or phrase to the end of your posts in your RSS feed. If the feed is scraped, so is the fingerprint and searching for that string of characters tells you which sites have taken your content. Since fingerprints don’t have easy translations or synonyms, they remain intact through the spinning process. Plugins such as the Digital Fingerprint Plugin and Copyfeed can automate the process.
- Trackback Monitoring: As was the case with Tony’s original post, spam blogs often leave links in the scraped post intact, even as they modify the copy. They often send trackbacks to those URLs in a bid to get extra incoming links to the spam blog. If you link to your own articles when writing, you can watch the trackbacks and get an idea of who is using your content, even if it is spun.
- FeedBurner Tracking: FeedBurner offers a very powerful “uncommon uses” feature that tracks where your feed is published. Since FeedBurner does not depend upon the post content to track the feed, spinning the text will not fool the system.
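The fingerprinting idea from the list above can be sketched in a few lines of Python. This is not how the Digital Fingerprint Plugin or Copyfeed work internally, which is not covered here; it is only a minimal sketch under stated assumptions. The function names, the `secret` value and the example URL are all hypothetical. The token is built from a hash so that it contains no dictionary words for a thesaurus or translator to replace.

```python
import hashlib


def make_fingerprint(post_url, secret="change-me"):
    """Derive a short token for one post that has no synonym or translation.

    `secret` is a hypothetical private string so that others cannot
    predict your fingerprints from the post URL alone.
    """
    digest = hashlib.sha256((secret + post_url).encode("utf-8")).hexdigest()
    return "fp" + digest[:16]


def add_fingerprint(feed_body, fingerprint):
    """Append the fingerprint to the end of the post as it appears in the feed."""
    return feed_body.rstrip() + "\n\n" + fingerprint


def contains_fingerprint(page_text, fingerprint):
    """Check whether the text of a suspect page still carries the fingerprint."""
    return fingerprint in page_text


if __name__ == "__main__":
    fp = make_fingerprint("https://example.com/my-post")  # hypothetical URL
    feed_item = add_fingerprint("My original post text.", fp)
    # Simulate a spun copy: the wording changed, the fingerprint did not.
    spun_copy = "My archetypal stake matter.\n\n" + fp
    print(contains_fingerprint(spun_copy, fp))
```

In practice you would search for the fingerprint string in a search engine rather than check pages one by one; the check function here only stands in for that step.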
Once you’ve detected the scraping, you then have all of your typical resolution techniques at your disposal, including contacting advertising networks, such as AdSense, filing a DMCA notice with the host or sending such a notice to the search engines.
In short, detecting spun content is the hard part; dealing with it is relatively easy. Still, if you ever need help with that, please feel free to post in the Performancing Legal Issues Forum and I will be glad to assist you.
In the case that Tony references, we discovered after some research that the blog in question is really just the tip of a much larger spam blog network. So, we are currently contacting and filing notices with the ad networks involved to see if we can sever the revenue stream and, once that is done, we will seek takedown of the infringing work.
The process may be slower and require more work but, since there is little harm being done to the original work in the short run, we feel it is more valuable to try to topple the whole network before seeking removal of the infringing work.
It is a bid to clean up at least one small corner of the Web and, hopefully, we’ll begin to show the fruits of that labor very soon.