How to Help Immunize Your Site Against Scraping

Filed as Features on March 31, 2008 10:33 am

Scraping is one of the most annoying things that bloggers have to deal with. It can hurt their search engine ranking, cause confusion among readers and cause them to unwittingly help spammers line their pockets.

Nobody likes being scraped but it seems that some sites are able to survive it relatively unscathed while others are bumped clean out of the search engines, almost instantly replaced by the spammers that take their content.

So how do you ensure that the damage caused by scrapers is kept to an absolute minimum? There is no secret formula, but there are a few tricks that seem to work very well.


Don’t Be Young

The longer your site has been around, the stronger its natural protection, both in the search engines and in the minds of users. Though everyone has to start out new and wade through a period of uncertainty, it is another good reason not to change your brand or move to a new domain without weighing the move heavily.

Build Incoming Links

Building links is an established part of any good SEO practice but it is especially important here. Spammers often have their own link-building system built into their networks and frequently have a decent amount of inbound links before they touch your entries. Building your own inbound links ensures that they are not able to replace you easily.

However, it is important not to be spammy yourself with your inbound links. Don’t simply engage in link exchanges or purchase links. Search engines can often detect those and may penalize you, causing you to lose ground rather than gain it.

Cross-link Your Posts

When writing about something, if you’ve touched on a related topic before, link to it. Make the linking natural, but try to link to at least a few of your own posts within your entries. When spammers scrape the feed, so long as they don’t strip out the HTML, they will also take those links, which will point back to your site.

Search engines use these kinds of clues to determine who the original site is.
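As a rough illustration, a draft could be checked for links back to your own domain before it is published. The sketch below is only one way to do that: the `LinkCollector` class, the `internal_links` helper and the `example.com` domain are all hypothetical names chosen for this example. It counts only absolute links, since a relative link such as `/about/` would resolve against the scraper's domain once the post is copied.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a fragment of post HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(post_html, site_host):
    """Return the absolute links in post_html whose host is site_host."""
    collector = LinkCollector()
    collector.feed(post_html)
    return [h for h in collector.hrefs if urlparse(h).hostname == site_host]


# Hypothetical post body; example.com stands in for your own domain.
post = (
    '<p>As I covered in <a href="http://example.com/2008/03/rss-footers/">'
    'an earlier post</a>, see also <a href="http://other.example.org/">this</a> '
    'and <a href="/about/">the about page</a>.</p>'
)

links = internal_links(post, "example.com")
print(len(links), "absolute link(s) back to the site:", links)
```

If the count comes back as zero, the post carries nothing that would point back to you after being scraped.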

Add RSS Footers or Headers

Adding footers and headers to your RSS feed may not be a perfect solution, especially as spammers focus on ever narrower swaths of content, but it is a great way to reduce the impact of complete RSS scraping and to protect full feeds.

If you are a WordPress user, consider using either RSS Footer or FeedEntryHeader. FeedBurner users can use FeedFlare to achieve much the same effect.
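For those who generate their feed some other way, the idea can be sketched in a few lines. This is not how any of the plugins above work internally; it is a minimal example that assumes a plain RSS 2.0 feed, and the `add_footer` function, the footer wording and the `example.com` addresses are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET


def add_footer(rss_xml, site_name, site_url):
    """Append an attribution line to every item's <description>."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        # Fall back to the site itself if an item has no link or title.
        link = item.findtext("link", default=site_url)
        title = item.findtext("title", default="This post")
        desc = item.find("description")
        if desc is None:
            desc = ET.SubElement(item, "description")
        footer = (
            f'<p><a href="{html.escape(link)}">{html.escape(title)}</a> '
            f'was originally published on '
            f'<a href="{html.escape(site_url)}">{html.escape(site_name)}</a>.</p>'
        )
        # ElementTree escapes the markup again when the feed is serialized,
        # which is the usual way to carry HTML inside an RSS description.
        desc.text = (desc.text or "") + footer
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-item feed.
feed = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item>
<title>First post</title>
<link>http://example.com/first-post/</link>
<description>&lt;p&gt;Hello.&lt;/p&gt;</description>
</item>
</channel></rss>"""

print(add_footer(feed, "Example Blog", "http://example.com/"))
```

Because the footer links to both the original post and the home page, a scraper that republishes the feed unmodified ends up crediting the source in every entry.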

Claim Your Site

Both Google and Technorati allow you to claim your blog on their sites. For Google, bloggers should visit Google Webmaster Tools; for Technorati, users should create a profile on the site. Doing so may not have a large impact on your site, but it makes it clear to the search engines that a human is behind it. On Technorati, it also lets you display an icon next to your blog, clearly distinguishing it from spam for users as well.

Also, consider registering your site on MyBlogLog and similar services, even if you do not plan on participating, just to have further sites vouch for the authenticity of your blog.

Report Spam

Even if you don’t want to take the copyright route and get the spam blogs taken down, report any spammers misusing your content to Google. Be sure to use the form in Google Webmaster Tools because, according to Matt Cutts at Google, it is given more weight.

Even if Google does not remove the sites from its search engine, it is at least aware of the problem and can rank them accordingly.

Provide Content Outside the Feed

Finally, provide good, useful content that exists outside the RSS feed, typically in static pages. Google loves this kind of content too and it is something that the scrapers won’t have. Inevitably, Google and other search engines will show preference to your site for the large amount of unique content, even if you are being scraped heavily.

Conclusions

Being scraped is never a good thing. Though some talk about making the scrapers work for you, the techniques are not fool-proof and have been known to fail. However, they are often great for mitigating the damage and are good practice for any blogger.

That being said, there are still likely going to be some cases of scraping that require a higher level of action. Inevitably, a scraper, either through luck or skill, may still be able to reach a point where they can steal some of your thunder. When that happens, it is important to be aware of the laws and techniques you can use to protect yourself.

But to those who want to avoid that as much as possible, it is a good idea to work on armoring your site against these kinds of attacks in advance.

A little prevention really can help keep the spammers at bay…



  1. By Ryan D. posted on March 31, 2008 at 1:14 pm

    I’m dealing with this now. I get about 2-3 “people” a week scraping my site, and I’m trying to figure out a way to prevent it or at least stop them from hammering my server. The issue is that these people write scripts that just hammer the site to get all the pages, and that puts a huge load on the server. Last week I banned an IP that was hitting my site once a second for hours on end. I’ve resorted to scanning the logs (via script) and banning IPs, but I need to come up with something easier. I have yet to see any of my content (images) on any sites, but what else would these people be using it for?

  2. By infmom posted on March 31, 2008 at 4:36 pm

    I found a really good WordPress plugin called AntiLeech (recommended by Lorelle here). It seems to do the job nicely and I’ve already got a fairly long list of URLs on its no-no list.

  3. By Jonathan Bailey posted on March 31, 2008 at 8:40 pm

    Ryan: AntiLeech, as recommended in the second comment, is a good start. But also look into a plugin called Copyfeed, which makes locating and banning IPs much simpler.

    Finally, if the sites are too annoying, file DMCA notices or abuse complaints with their hosts and get those sites pulled down.

    If you need any specific help, email me at jonathan at plagiarismtoday dot com and I’ll do what I can!

    Infmom: Agreed, AntiLeech is a great place to start. It’s a plugin well worth looking up.

    Be sure to check out Copyfeed too as it is a huge help as well.

    Thanks for the feedback!

  4. By DevTopics posted on April 1, 2008 at 8:30 am

    What do you do if original content from your website or blog is stolen and republished in full on another site? You fight back!

    A splog or “spam blog” is a blog that steals content from other web sites, then aggregates and republishes the content on its own blog. Splogs are created primarily to make money from ads shown on the splog and/or promote affiliated web sites. Splog owners are too dishonest, lazy or stupid to create their own original content and instead thieve yours.

    Splogs are harmful because they effectively steal a portion of your blog’s search engine ranking, traffic and ad revenue.

    When someone steals your original content, the best recourse is to file a DMCA complaint.

    http://www.devtopics.com/how-to-file-a-dmca-complaint/

  5. By martin posted on April 3, 2008 at 5:53 pm

    Natural SEO is helpful against site scraping. Keywords can help you optimize your online advertising strategies by improving your search engine rankings for the terms that will help you most.

  6. By victor louis posted on April 6, 2008 at 1:36 am

    Anti spam webinar-“Spammers Vs Today’s spam filters”

    Today’s spam filters are not accurate and spam volumes are increasing rapidly. This will cost $42 billion for the US alone. Spammers are using more innovative technology to send spam mails, and today’s spam filters are blocking only 80% of spam mails.

    Register for a complimentary Webinar conducted by Abaca and Ferris research to know more about the spammers behind the black market. To register please click the link below:
    http://www.surveymonkey.com/s.aspx?sm=LPFKkdkFwOYltiQZtM_2bttw_3d_3d

  7. By Michael posted on April 6, 2008 at 2:07 pm

    Another great way of stopping scraping is to report the spam blog to AdSense. This cuts off their revenue stream and hurts them a lot more than any of these other options. If they aren’t making any money from scraping, then why would they do it?
