It’s a sad fact that pretty much any content posted to a blog or otherwise passed through an RSS feed is going to get scraped at one point or another. On the Web today, there are so many spammers spitting out junk blogs that almost everyone will be a victim of content theft if they blog long enough.
That being said, some content and some sites are more at risk than others. The type of content you post and the way you present it are both major factors in how at risk your site is.
Even though you cannot mitigate or change many of these factors, being aware of them can help you assess your site’s risk in this area and take appropriate action. Meanwhile, ignoring the factors could leave you seriously unprepared for a very real problem.
Types of Content
If you want to know what types of content are most commonly scraped, take a look in your email spam folder. If you take a stroll around the splogosphere, you’ll generally find the same kinds of spam there that you’d expect to find in your inbox.
Gambling, for example, is a popular topic in both worlds. Also popular with all kinds of spammers are advertisements for adult sites, financial services, pharmaceuticals and questionable software.
However, simply not blogging about these topics is not enough. As Rose Desroches found out over a year ago, spam bloggers determine what content is scraped not by the actual content, but by targeted keywords. Desroches learned that the hard way when a post she wrote helping parents protect their children was scraped and used to promote teen pornography.
If spambots find the keywords they want in your post, they will scrape it regardless of context, so steering clear of sex or gambling as topics offers little protection on its own.
It is important to pay close attention not only to the topics you write about but also to the keywords you use. Then, keep a close watch on any posts or sites containing keywords that a bot might misinterpret, so you can deal with any suspicious use of your work.
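One way to keep track of risky keywords is to scan each post against a watchlist before or after publishing. The following is only a minimal sketch: the `WATCHLIST` words, the `flagged_keywords` helper and the sample posts are all made up for illustration, and real spambots may well match on different terms.

```python
import re

# Hypothetical watchlist; swap in the keywords that matter for your own niche.
WATCHLIST = {"casino", "poker", "loan", "pharmacy", "teen"}

def flagged_keywords(post_text, watchlist=WATCHLIST):
    """Return the sorted watchlist words that appear in the post text."""
    words = set(re.findall(r"[a-z']+", post_text.lower()))
    return sorted(words & set(watchlist))

# Made-up posts, keyed by slug, to show how the check might be used.
posts = {
    "protecting-kids-online": "How parents can keep a teen safe online.",
    "garden-update": "The tomatoes are finally ripe.",
}

for slug, text in posts.items():
    hits = flagged_keywords(text)
    if hits:
        print(f"{slug}: watch closely, contains {hits}")
```

A post that trips the watchlist is not a problem in itself. It is simply one worth checking on more often.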
In addition to the content in your feed, the way you present it on your site or shoot it out to the world can also have an impact on how much scraping you have to deal with.
First is the issue of full vs. partial feeds, a popular debate topic among bloggers. Though truncated feeds do reduce content theft, they also annoy legitimate users and have a very vocal opposition. Worse yet, the technology already exists to scrape content from the site itself, so the benefit of truncating a feed may be short-lived. Once spammers adapt to shortened feeds, they will be no obstacle at all.
Second is the issue of autodiscovery, the use of link tags in a page’s header to allow browsers and users to locate the feed easily. Some have asked whether disabling this feature might be a means of preventing scraping and content theft but, sadly, it has no impact on the issue. Since most scrapers find their feeds through search engines or pinging services and not the blog itself, disabling autodiscovery doesn’t offer any protection.
However, that brings us to the third and final element, pinging. Since many spammers have spiders crawling the various pinging services, your choice of which services to notify of new posts can have a drastic impact on how much you are scraped. Pinging the search engines and directories is a wise move, but shooting your posts out blindly to every service that will accept them likely isn’t.
Though it is very convenient to ping one or two sites and know that all of the relevant search engines will pick you up, it is also convenient for spammers. They get a direct way to tap into the blogging world and watch the content that floats by for their desired keywords. Since most major search engines offer a direct way to ping them, it would probably be best to use those instead.
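For readers unsure what autodiscovery looks like in practice, here is a rough sketch of how a browser-like tool might find a feed from a page's markup. It assumes the common convention of a `link` element with `rel="alternate"` and an RSS or Atom type; the `FeedLinkFinder` class, the `find_feeds` helper and the sample page are illustrative only.

```python
from html.parser import HTMLParser

# Feed MIME types commonly advertised through autodiscovery.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES:
            self.feeds.append(attr.get("href", ""))

def find_feeds(html):
    """Return the feed URLs advertised in the given HTML."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds

# Made-up page used only to demonstrate the lookup.
sample = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="https://example.com/feed/"></head></html>'
)
print(find_feeds(sample))
```

Removing that tag hides the feed from this kind of lookup only, which is why it does little against scrapers that find feeds elsewhere.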
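To make the idea of selective pinging concrete, the sketch below pings only an explicit list of services rather than a catch-all relay. It assumes each service accepts the widely used `weblogUpdates.ping` XML-RPC call, which the services you choose may or may not support; the `build_ping` and `send_pings` helpers and the endpoint URL are placeholders, not real addresses.

```python
import xmlrpc.client

# Placeholder endpoint; replace with the ping URLs of services you trust.
TRUSTED_ENDPOINTS = [
    "https://ping.example.com/RPC2",
]

def build_ping(blog_name, blog_url):
    """Build the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url), methodname="weblogUpdates.ping")

def send_pings(blog_name, blog_url, endpoints):
    """Ping each endpoint in turn and record the response or the error."""
    results = {}
    for endpoint in endpoints:
        try:
            server = xmlrpc.client.ServerProxy(endpoint)
            results[endpoint] = server.weblogUpdates.ping(blog_name, blog_url)
        except Exception as exc:  # network or protocol failure
            results[endpoint] = exc
    return results

# Inspect the request body without sending anything over the network.
print(build_ping("My Blog", "https://example.com/"))
```

Keeping the list short and explicit makes it easy to add or drop a service and then watch whether the amount of scraping changes.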
This will be discussed in greater detail at a later date.
Though there are some reasonable steps that you can take to prevent spammers from finding or using your content, for the most part, the best approach still lies in detecting and following up on misuse. To that end, plugins such as Copyfeed or services such as FeedBurner can be valuable assets.
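For a sense of what detection can involve, here is a simple sketch that measures how much of a post reappears in the text of a suspect page by comparing overlapping five-word runs. This is not how Copyfeed or FeedBurner work; it is a generic illustration, and the `shingles` and `overlap_ratio` helpers and the sample text are invented for the example.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping word runs of the given size."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap_ratio(original, suspect, size=5):
    """Fraction of the original's word runs that also appear in the suspect text."""
    own = shingles(original, size)
    if not own:
        return 0.0  # original is too short to compare
    return len(own & shingles(suspect, size)) / len(own)

# Made-up texts: the suspect page wraps a copied sentence in extra words.
post = "Pinging the search engines and directories is a wise move for most blogs."
suspect_page = "Cheap deals here! " + post + " Click now."

print(f"{overlap_ratio(post, suspect_page):.0%} of the post appears on the suspect page")
```

A high ratio suggests wholesale copying, while a low one may just mean a short quotation, so any threshold you pick is a judgment call.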
Simply put, changing your writing style or your RSS strategy to prevent content theft is not a viable solution. Not only will it fail to prevent all content theft, but it will also damage your site in other ways. Even changing your site’s pinging strategy requires a cost/benefit analysis and some experimentation to find the right balance between protecting content and gaining traffic.
What is more important is to be aware of the situation and determine what your risk level is. You should do this evaluation not just for your blog itself, but for each article as some individual works will need special attention. This will let you spend your time and energy focusing on the works and sites that are at the highest risk of theft.
Sadly, dealing with content theft on the Web is still very much a detection and cessation game at the moment. Virtually everyone reading this can and will be scraped; it is just a matter of when and how often.