
Overcoming scraper sites

I recently got a few emails asking me about scraper sites and how to beat them. I'm not sure anything is 100% effective, but you can probably turn them to your advantage somewhat. If you're not sure what scraper sites are:

A scraper site is a website that copies all of its content from other websites using web scraping. In essence, no part of a scraper site is original. A search engine is not an example of a scraper site: sites like Yahoo! and Google gather content from other websites and index it so that the index can be searched for keywords. Search engines then display snippets of the original site content that they scraped in response to your search.

In recent years, thanks largely to the advent of the Google AdSense web advertising program, scraper sites have proliferated at a staggering rate for the purpose of spamming search engines. Open-content sources such as Wikipedia are a common source of material for scraper sites.

from the scraper site article on Wikipedia.org

Now, it should be noted that having a wide range of scraper sites hosting your content can lower your ranking on Google, since duplicated content is sometimes perceived as spam. So I recommend doing everything you can to prevent that from happening. You won't be able to stop everyone, but you can at least benefit from the ones that get through.

Things you can do:

Include links to other posts on your site in your posts, so that scraped copies link back to you.

Include your blog's name and a link back to your blog in your posts.

Manually whitelist the good spiders (Google, MSN, Yahoo, etc.) – a sample .htaccess approach is sketched after this list.

Manually blacklist the bad guys (known scrapers).

Automatically log all page requests.

Automatically block visitors who disobey the robots.txt file.

Use a spider trap – you should be able to block access to your site by IP address; this is done via .htaccess (I hope you are using a Linux server…). Create a new page that records the IP address of anyone who visits it (don't ban anyone yet, if you see where this is going…). Then add a Disallow rule for that page to your robots.txt. Then put a link to it on one of your pages, but hidden so that a normal user will never click it – use a style like display: none or something. Now wait a few days, because the good spiders (Google, etc.) will still have a cached copy of your old robots.txt and might accidentally wander into the trap; wait until they have picked up the new one before turning on automatic blocking, and track their progress on the page that collects IP addresses. When you're comfortable (and you've added all the major search spiders to your whitelist for extra protection), switch that page from logging to auto-banning every IP it sees and redirecting them to a dead-end page. That should take care of quite a few of them. (A rough sketch of such a trap page follows below.)
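As a concrete illustration of the whitelist/blacklist items above, here is a minimal .htaccess sketch. It assumes Apache with mod_setenvif and the old 2.2-style Order/Allow/Deny directives; the user-agent patterns and the example IP address are placeholders of my own, not a vetted blocklist.

    # Tag the major search spiders by user agent (whitelist).
    SetEnvIfNoCase User-Agent "googlebot|msnbot|slurp" good_spider
    # Tag known scraper signatures (blacklist) -- placeholder patterns only.
    SetEnvIfNoCase User-Agent "HTTrack|WebCopier|EmailSiphon" bad_bot

    # Deny rules are checked first, Allow rules second, and access is allowed
    # by default, so whitelisted spiders get through even if a pattern misfires.
    Order Deny,Allow
    Deny from env=bad_bot
    Deny from 203.0.113.42
    Allow from env=good_spider

Individual scraper IPs can be added as extra Deny lines as you spot them in your logs, which is exactly the step the spider trap below automates.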
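And here is a rough sketch of the spider-trap page itself, written as a Python CGI script only because the post doesn't name a language. The trap URL, the file paths and the BAN_MODE switch are hypothetical, and the script assumes the web server user is allowed to append to .htaccess; the point is just to show the two phases, log first and ban later.

    #!/usr/bin/env python3
    """Spider-trap sketch: a hidden page that logs visiting IPs and,
    once BAN_MODE is switched on, auto-bans them via .htaccess.

    Assumptions (mine, not the original post's): the trap lives at /trap.py,
    robots.txt already contains
        User-agent: *
        Disallow: /trap.py
    and the hidden link to it is styled with display: none.
    """
    import os
    from datetime import datetime, timezone

    TRAP_LOG = "/var/www/trap_hits.log"   # hypothetical path: where visiting IPs are recorded
    HTACCESS = "/var/www/html/.htaccess"  # hypothetical path: file holding the Deny rules
    BAN_MODE = False                      # flip to True once the good spiders have the new robots.txt

    def main() -> None:
        ip = os.environ.get("REMOTE_ADDR", "unknown")
        agent = os.environ.get("HTTP_USER_AGENT", "-")

        # Phase 1: just record who falls into the trap.
        with open(TRAP_LOG, "a") as log:
            log.write(f"{datetime.now(timezone.utc).isoformat()} {ip} {agent}\n")

        if BAN_MODE:
            # Phase 2: append an Apache 2.2-style Deny rule for this IP ...
            with open(HTACCESS, "a") as ht:
                ht.write(f"Deny from {ip}\n")
            # ... and send the visitor to a dead-end page.
            print("Status: 302 Found")
            print("Location: /dead-end.html")
            print()
            return

        # While only logging, serve an empty-ish page so nothing useful is scraped.
        print("Content-Type: text/plain")
        print()
        print("Nothing to see here.")

    if __name__ == "__main__":
        main()

Watch the log for a few days before flipping BAN_MODE on, and make sure the whitelisted spiders from the .htaccess sketch above never show up in it.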
