Content Theft: How At Risk is Your Blog?

Filed as Features on September 10, 2007 11:45 am

It’s a sad fact that pretty much any content posted to a blog or otherwise passed through an RSS feed is going to get scraped at one point or another. On the Web today, there are so many spammers spitting out junk blogs that almost everyone will be a victim of content theft if they blog long enough.

That being said, some content and some sites are more at risk than others. The type of content you post and the way you present it are both major factors in exactly how at risk your site is.

Even though you cannot mitigate or change many of these factors, being aware of them can help you assess your site’s risk in this area and take appropriate action. Meanwhile, ignoring the factors could leave you seriously unprepared for a very real problem.

Types of Content

If you want to know what types of content are most commonly scraped, take a look in your email spam folder. If you take a stroll around the splogosphere, you’ll generally find the same kinds of spam there that you’d expect to find in your inbox.

Gambling, for example, is a popular topic in both worlds. Also popular with all kinds of spammers are advertisements for adult sites, financial services, pharmaceuticals and questionable software.

However, simply not blogging about these topics is not enough. As Rose DesRochers found out over a year ago, spam bloggers determine what content is scraped not by the actual content, but by targeted keywords. DesRochers learned that the hard way when a post she wrote helping parents protect their children was scraped and used to promote teen pornography.

Simply not blogging about sex or gambling is not adequate protection. If spambots find the keywords they want in your post, they will scrape it, no matter what the context.

It is important to pay close attention not only to the topics you write about but also to the keywords you use. Then, keep a close watch on the posts or sites containing keywords that a bot may misinterpret, so you can deal with any suspicious use of your work.
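
As a rough illustration of that kind of keyword review, the sketch below flags posts containing words from a watch list, so you know which ones to monitor first. The watch list and the function names are hypothetical, chosen only for this example; real spambots use their own keyword lists, which you can only guess at.

```python
import re

# Hypothetical watch list: terms a spam blogger might plausibly target.
WATCH_KEYWORDS = {"casino", "poker", "teen", "loan", "pharmacy"}


def risky_keywords(post_text, watch=WATCH_KEYWORDS):
    """Return the watched keywords that appear as whole words in a post."""
    words = set(re.findall(r"[a-z']+", post_text.lower()))
    return sorted(words & set(watch))


def rank_posts_by_risk(posts, watch=WATCH_KEYWORDS):
    """Sort (title, text) pairs so the posts with the most hits come first."""
    scored = [(title, risky_keywords(text, watch)) for title, text in posts]
    return sorted(scored, key=lambda item: len(item[1]), reverse=True)
```

A post about helping parents protect a teen online would be flagged for "teen" even though its context is harmless, which is exactly the kind of post worth following closely.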

Presentation

In addition to the content in your feed, the way you present it on your site or shoot it out to the world can also have an impact on how much scraping you have to deal with.

First is the issue of full vs. partial feeds, a popular debate topic among bloggers. Though truncated feeds do reduce content theft, they also annoy legitimate users and have a very vocal opposition. Worse yet, the technology already exists to scrape content from the site itself, so the benefit of truncating a feed may be short-lived. Once spammers adapt to shortened feeds, truncation will be no obstacle at all.

Second is the issue of autodiscovery, or the use of link tags in the page header to allow browsers and users to locate the feed easily. Some have asked whether disabling this feature might prevent scraping and content theft but, sadly, it has no impact on the issue. Since most scrapers find their feeds through search engines or pinging services and not the blog itself, disabling autodiscovery doesn’t offer any protection.

However, that brings us to the third and final element, pinging. Since many spammers have spiders crawling the various pinging services, choosing which services you notify of new posts can have a drastic impact on how much you are scraped. Pinging the search engines and directories is a wise move, but shooting your posts out blindly to every service that will accept them likely isn’t.

Though it is very convenient to ping one or two sites and know that all of the relevant search engines will pick you up, it is also convenient for spammers. They get a direct way to tap into the blogging world and watch the content that floats by for their desired keywords. Since most major search engines offer a direct way to ping them, it is probably best to ping them directly.
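
To make the idea of pinging a short, hand-picked list concrete, here is a minimal sketch using the common weblogUpdates.ping XML-RPC convention and Python's standard xmlrpc.client module. The endpoint URL is a placeholder and the function names are hypothetical; whether a given search engine accepts this call, and at what address, is an assumption you would need to check against that service's own documentation.

```python
import xmlrpc.client

# Hypothetical list: replace with the ping endpoints you have chosen to trust.
TRUSTED_PING_ENDPOINTS = [
    "http://ping.example-search.test/RPC2",  # placeholder, not a real service
]


def build_ping_request(blog_name, blog_url):
    """Build the XML body of a weblogUpdates.ping call without sending it."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def ping_trusted(blog_name, blog_url, endpoints=TRUSTED_PING_ENDPOINTS):
    """Send the ping to each chosen endpoint and collect replies or errors."""
    results = {}
    for endpoint in endpoints:
        server = xmlrpc.client.ServerProxy(endpoint)
        try:
            results[endpoint] = server.weblogUpdates.ping(blog_name, blog_url)
        except (OSError, xmlrpc.client.Error) as exc:
            # Record the failure and carry on with the remaining endpoints.
            results[endpoint] = exc
    return results
```

The point of the design is simply that the list is explicit: nothing is pinged unless you have added it on purpose.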

This will be discussed in greater detail at a later date.

Conclusions

Though there are some reasonable steps that you can take to prevent spammers from finding or using your content, for the most part, the best approach still lies in detecting and following up on misuse. To that end, plugins such as Copyfeed or services such as FeedBurner can be valuable assets.
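
Detection tools of this kind generally rely on what is known as a digital fingerprint: a unique string added to each feed item that can later be searched for on other sites. The sketch below shows the general idea only; it is not how Copyfeed or any other plugin is actually implemented, and the function names and token format are assumptions made for illustration.

```python
import hashlib
import hmac


def make_fingerprint(secret, post_id):
    """Derive a short token for one post that is unlikely to occur by chance."""
    digest = hmac.new(secret.encode(), post_id.encode(),
                      hashlib.sha256).hexdigest()
    return "fp" + digest[:16]


def add_fingerprint(feed_text, fingerprint):
    """Append the token to the text that goes out in the feed."""
    return f"{feed_text}\n{fingerprint}"


def was_scraped(page_text, fingerprint):
    """Return True if a page fetched from elsewhere contains the token."""
    return fingerprint in page_text
```

Because each post gets its own token, finding one on another site tells you not just that you were scraped, but which post was taken.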

Simply put, changing your writing style or your RSS setup to prevent content theft is not a viable strategy. Not only will it fail to prevent all content theft, but it will also damage your site in other ways. Even changing your site’s pinging strategy requires a cost/benefit analysis and some experimentation to find the right balance between protecting content and gaining traffic.

What is more important is to be aware of the situation and determine what your risk level is. You should do this evaluation not just for your blog itself, but for each article, as some individual works will need special attention. This will let you spend your time and energy focusing on the works and sites that are at the highest risk of theft.

Sadly, dealing with content theft on the Web is still very much a detection and cessation game at the moment. Virtually everyone reading this can and will be scraped; it is just a matter of when and how often.




  1. By Rose DesRochers posted on September 10, 2007 at 7:01 pm

    Informative article, Jonathan.


  2. By Jonathan Bailey posted on September 10, 2007 at 7:08 pm

    Rose: Glad you liked it!


  3. » WordPress Plugins- Spam Fighting by Rose DesRochers - World Outside my Window, September 10, 2007 at 10:35 pm
  4. By Chris posted on September 10, 2007 at 10:45 pm

    My feed was scraped once via an email subscription, which was then presumably automatically forwarded to some cretin’s Blogsend address, which then posted it to his splog. I know this because the offending and heavily AdSense-optimized splog contained scraped feeds, the offender’s own emails, and spam. I eventually had the offending blog removed by flagging it on the Blogger toolbar until it was pulled.

    One way to search for this type of abuse is to highlight a random snippet of text from some of your prime content, and paste it into Google Blog Search. No hits beyond your own blog and legitimate posters and you’re probably in the clear. But you’re right, Jonathan. Write a blog of any quality and the scrapers may eventually come.


  5. By The Internet Cash Flow Guy posted on September 11, 2007 at 1:45 am

    This is something that I must admit I have not thought much about, but I do see it as becoming a problem as more people realize that making a few extra dollars on the internet takes just a bit of time. I agree that we need to be cognizant of it, and just be on the lookout for this. The most important thing is to not just sit by idly while it happens, but go after it proactively.


  6. By Jonathan Bailey posted on September 11, 2007 at 1:25 pm

    Chris: I’ve seen a lot of that going around and I was already planning to write an article about it. However, it appears I’m going to have to redouble my efforts a bit.

    I would recommend, in your case, that you consider using what is known as a digital fingerprint. There are plugins such as the Digital Fingerprint Plugin and Copyfeed that can add them. Once the fingerprint appears on another site, you then know they scraped it.

    However, I think it would be neat if one of these plugins could differentiate between emailed entries and ones scraped over RSS. I’ll have to shop that around to a few of the plugin authors.

    ICFG: Agreed completely. Being aware of the problem and searching for it proactively is the most important part. Prevention, right now, is not practical, so it requires a proactive stance. I could not agree more!


  7. non academic scholarships for high school students, September 12, 2007 at 9:29 am
  8. By Darnell Clayton posted on September 12, 2007 at 5:24 pm

    I often find linking back to my own site is a good way to find spammers.

    If I see a random article linking back with no apparent reason, I can then follow up on the evil spammer.

    With Whois.net and other services, I can contact their host and have them removed.

    Note: Is it me, or do most splogs reside on WordPress and Blogger? I’ve encountered more of the former than the latter, although I have yet to find any on SixApart’s services.


  9. How To Avoid Spambots By Using Pinging Services : The Blog Herald, September 17, 2007 at 4:14 pm
