Protecting Your Content From the Spinning Spammers

Last week, Tony posted an article about a somewhat different kind of spam blogger.

The spammer had taken an article from this site, scraped it and then modified it before republishing. Though the method of modification remains debatable, it is clear that it was through some automated means as the duplicate version was mangled and borderline unintelligible.

However, the unfortunate truth is that this type of scraping is not as uncommon as we might wish and the technology to do it has been around for several years. Worse still, this type of scraping is growing much more popular as search engines clamp down on duplicate content and ad networks get better at detecting traditional content theft.

Modified scraping is a rising threat that bloggers need to be aware of as it presents a whole new set of challenges for content creators.

No Laughing Matter

It is easy to laugh at these automatic scrapers as their results are often quite comical and produce gems such as this out of completely legible text:

“One word, or the demand thereof, denaturized the full intent. Such is the venture digit takes when they indite in a module in which they demand the pertinent fluency.”

However, the broken English belies the full extent of the problem. Spammers create these works by taking posts from legitimate bloggers and then running them through an algorithm. This can involve using a thesaurus to find synonyms for the words in question or an automatic translation program to convert the work into another language, possibly then converting it back to English.

This process of modifying the content before reposting it is often called “spinning”. Spinning a work before republication has several advantages, the largest of which is that Google is less likely to detect the work as a duplicate and is thus more likely to rank it higher. However, almost equally important is that it is much harder for victims of plagiarism to detect and follow up on the misuse, making this kind of abuse much harder to stop.
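To make the thesaurus approach concrete, here is a minimal Python sketch of how such a substitution might work. The synonym table, the `spin` function and its `keep` parameter are hypothetical and invented purely for illustration; real spinning tools use far larger word lists and likely differ in the details.

```python
import re

# Hypothetical, tiny synonym table; real spinners rely on much larger thesauri.
SYNONYMS = {
    "word": "term",
    "changed": "altered",
    "meaning": "intent",
}

def spin(text, synonyms=SYNONYMS, keep=()):
    """Swap every word found in the synonym table, leaving `keep` words intact."""
    protected = {w.lower() for w in keep}

    def swap(match):
        word = match.group(0)
        lower = word.lower()
        if lower in protected or lower not in synonyms:
            return word  # unknown or protected words pass through unchanged
        replacement = synonyms[lower]
        # Preserve a leading capital letter so sentences still look plausible.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)

original = "One word changed the meaning."
spun = spin(original)
print(spun)
# An exact-phrase search for part of the original no longer matches the copy.
print("changed the meaning" in spun)
```

Even this toy version is enough to defeat an exact-phrase search for the original sentence, while the `keep` argument shows how a spammer could leave chosen keywords untouched.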

The good news in all of this is that, since so little of the content remains the same, the odds of the search engines penalizing the victim are much slimmer than with traditional spamming. However, that does not mean the original is safe. These modified scrapers still target the same keywords as your site, which they often intentionally leave intact when spinning a work, and might usurp the original work through a combination of scraping and spam linking.

Though less of a direct threat to bloggers, these scrapers are still a major thorn in the side of legitimate content creators and remain a threat well worth addressing.

Legal Issues

The problem is that, when confronted with this type of scraping, many feel that there is little that they can do. They fear that, since the reuse isn’t verbatim, the law does not protect them and there is no action they can take.

Fortunately, the law is very clear on this subject. Copyright is not merely the right to copy one’s own work, but a set of rights that includes the right to create derivative works. This is why only J.K. Rowling can sell Harry Potter books, though she does tolerate non-profit fan fiction, and why spinning a work is almost always still illegal.

This right to create derivative works covers the right to create translations and any other work based on copyrightable portions of the original. Spinning, since it starts with a copyright-protected work and creates a new work based upon it, violates that right.

Fair use arguments fall equally flat in the eyes of the law. Spinning is not transformative as it is designed to replace the original, it offers no commentary or criticism, it is for commercial use, it can greatly harm the market for the original work and it is usually unattributed. There is almost no fair use argument left for the spammers who modify the posts they scrape, leaving the door wide open for rightsholders to take action.

In short, though I am not a lawyer, I can see little reason to doubt your rights in the event you detect such scraping of your content. Your work is still very much protected and your rights are still very much enforceable.

What to Do

Of course, knowing that your work is protected does little good if you cannot detect the misuse of your content. As we discussed earlier, this can be a challenge as the content has been modified and most search engines can only detect verbatim copying. Even powerful academic tools, such as Turnitin, struggle when faced with non-verbatim copying.

In a recent article on my site, I talked about various techniques for detecting spun versions of your posts. Those tips included the following:

  1. Digital Fingerprinting: Digital fingerprinting is a process by which you append a unique word or phrase to the end of your posts in your RSS feed. If the feed is scraped, so is the fingerprint and searching for that string of characters tells you which sites have taken your content. Since fingerprints don’t have easy translations or synonyms, they remain intact through the spinning process. Plugins such as the Digital Fingerprint Plugin and Copyfeed can automate the process.
  2. Trackback Monitoring: As was the case with Tony’s original post, spam blogs often leave links in the scraped post intact, even as they modify the copy. They often send trackbacks to those URLs in a bid to get extra incoming links to the spam blog. If you link to your own articles when writing, you can watch the trackbacks and get an idea of who is using your content, even if it is spun.
  3. FeedBurner Tracking: FeedBurner offers a very powerful “uncommon uses” feature that tracks where your feed is published. Since FeedBurner does not depend upon the post content to track the feed, spinning the text will not fool the system.
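
As a rough illustration of the fingerprinting idea in the first item, the Python sketch below generates a token, appends it to a feed item and checks a page for it. The function names, the example URL and the secret are my own assumptions for the example; this is not how the Digital Fingerprint Plugin or Copyfeed are actually implemented.

```python
import hashlib

def make_fingerprint(site_url, secret):
    """Derive a short token that is very unlikely to appear anywhere by chance."""
    digest = hashlib.sha256(f"{site_url}|{secret}".encode("utf-8")).hexdigest()
    # A hex string is not a dictionary word, so a thesaurus has no synonym for it.
    return "fp" + digest[:16]

def add_fingerprint(feed_item_html, fingerprint):
    """Append the fingerprint to the end of a feed item's content."""
    return f"{feed_item_html}\n<p>{fingerprint}</p>"

def contains_fingerprint(page_text, fingerprint):
    """Return True if a fetched page still carries the fingerprint."""
    return fingerprint.lower() in page_text.lower()

# Hypothetical values, chosen only for this example.
fingerprint = make_fingerprint("https://example.com", "my-private-secret")
feed_item = add_fingerprint("<p>My original post.</p>", fingerprint)

# A spun copy changes the words but carries the token along unchanged.
spun_copy = feed_item.replace("My original post.", "Your first entry.")
print(contains_fingerprint(spun_copy, fingerprint))
```

In practice you would search for the token in a search engine rather than fetch pages yourself; the check above only shows why the token survives when the surrounding words do not.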

Once you’ve detected the scraping, you then have all of your typical resolution techniques at your disposal, including contacting advertising networks such as AdSense, filing a DMCA notice with the host or sending such a notice to the search engines.

In short, detecting spun content is the hard part; dealing with it is relatively easy. Still, if ever you need help with that, please feel free to post to the Performancing Legal Issues Forum and I will be glad to assist you.

Conclusions

In the case that Tony references, we discovered after some research that the blog in question is really just the tip of a much larger spam blog network. So, we are currently contacting and filing notices with the ad networks involved to see if we can sever the revenue stream and, once that is done, we will seek takedown of the infringing work.

The process may be slower and require more work but, since there is little harm being done to the original work in the short run, we feel it is more valuable to try and topple the whole network before seeking removal of the infringing work.

It is a bid to clean up at least one small corner of the Web and, hopefully, we’ll begin to show the fruits of that labor very soon.

Comments

  1. eschaton says

    I’ve also noticed that some bloggers scrape their own material, run it through text-modifying software, and post it to splogs (numbering often in the hundreds) linking back to their main blog in order to create massive Technorati authority gains.

  2. says

    I have a big problem with sploggers ripping off my content from my Free Stuff blog. I have contacted a few by commenting on their blog threatening legal action. It actually worked a few times.

    But most don’t care. I added an auto sig to my feed with a WordPress plugin, so at least I am getting backlinks. But the truth is that they are not quality links. I would rather just have my content stay mine!

    some people just have no lives….

  3. says

    Eschaton: There’s special software that does exactly that, working like the scrapers I’ve described here but without the scraping ability. Yet it uses the variations of the theme to spin hundreds of copies of the same work, all with modification.

    This is something Google has had to work very hard to stop, but it seems to be making at least some progress.

    Ross: If you want, either shoot me an email or post about the problem to the Performancing Legal Issues Forum and I’ll see what I can do to help.

    http://performancing.com/forums/performancing-blog-forums/legal-issues

    There are other ways of handling this than just threatening legal action. One doesn’t need a lawyer, just to know the law.

    Let me know if I can help!

  4. says

    Since I’m just getting started with my own domain and blogging site, I found your article very informative. One item that troubled me though is the link that you have for the digital fingerprint plugin. It took me to a Google page that warned me that the site in question could harm my computer. Their page took me to StopBadware.org which had further information about the safety of visiting this plugin site. Is the site for the digital fingerprint legitimate and safe?

  5. says

    Thanks, your article told me a lot!
    I am finding they are taking my blog title and what I have written, but the only way I know is the fact that I have Google Alerts set up for keywords I blog about often. Yet when I visit the spammer, there is no way to contact the person. I am finding I am now getting scraped often due to the subject matter I blog about. I tell Google on them.
    I also have a site that uses my blog as part of their membership ads. Will there be a way to set up the digital footprint to ignore them?

  6. says

    Ron & Lorelle: I’ve spoken with the person who wrote the plugin many times and the site is fine. However, if you don’t feel comfortable, you can use the Copyfeed plugin as it has the same functionality along with many more features.

    Kim: First, in the future, you may want to consider informing their ad networks and their host about what is going on before contacting Google. The reason is that the latter doesn’t remove your work from the Web and other search engines. If you cut off the money and then cut off the hosting, you do much more harm to the spammer.

    As far as the other site you’re talking about, the trick there is to create a second, secret feed that you only give out to sites that need a version without the fingerprint. You can use FeedBurner to do that. It can create two feeds from one and then add the digital fingerprint to one of the feeds using FeedFlare.

    Hope that helps!

  7. Spewb says

    Under the Computer Fraud and Abuse Act (CFAA), which forbids exceeding authorized access to a computer with the intent to defraud, the host of the blogging site can bust the scrapers not for plagiarizing your work but for accessing its servers repeatedly for the purpose of scraping (an act that is against almost everyone’s terms of service).

  8. says

    Spewb: As true as that is, the process is much more complicated. To get almost anything done under the CFAA you have to get an attorney, file an injunction and jump through legal hoops. The DMCA is as simple as a sheet of paper and takes less than 48 hours.

    It is a good alternative though, something to consider.

Trackbacks

  1. […] I find it hard to believe someone would want to steal my blog content. However when I saw a WordPress blog entry on it I had to stop and take a peek. What happens to your innocent blog article? Someone takes it, scrapes it, modifies it, and republishes it claiming it for themselves. Now that’s my content someone else regurgitates on to their website. Called spinning, someone runs your content through an algorithm that can involve using a thesaurus to find synonyms for the words in question or an automatic translation program to convert the work into another language, possibly then converting it back to English. The Blog Herald has a great article about Protecting Your Content From the Spinning Spammers. […]
