Sometimes your competitors will do almost anything to compete with you, including stealing your content.
To do this, they sometimes use automated software, much like a search engine crawler, because it is quicker and easier than copying your site by hand. This can cause many problems for you.
In this article we look at ways to stop this from happening.
Stealing on the web is rampant. I don’t mean stealing people’s user IDs and passwords. I mean the stealing of material from websites.
Webmasters and designers steal images they like, or find a cool piece of JavaScript and steal that as well.
But what really causes problems is when your competitors steal your content.
As we all know, content is king on the web. Whoever has the most content wins. So if a competitor of yours needs to grow quickly, one of the easiest ways to do it is through the use of a website harvester.
A website harvester works much like a search engine crawler. It requests every URL it can find and then downloads all the content associated with those URLs.
So how do you protect yourself from malicious scrapers?
Simple, really. You build a spider trap.
As the name implies, you create a section of your site devoted to luring in the spiders that are not friendly, and then you proceed to either trap them or ban them from accessing your site.
What’s involved in making a spider trap?
Usually a bit of PHP code combined with a database and a URL rewriter.
The first thing you need to do is create the space on the site dedicated to capturing those bad bots. You then use robots.txt to exclude that section from crawling.
You do this because you want to ensure Googlebot, Yahoo! Slurp, MSNbot and the others don’t also get trapped. Since most good spiders follow the robots.txt exclusion protocol, you are politely denying them access to this location.
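To make the exclusion concrete, here is a minimal sketch in Python of how a well-behaved crawler would read such a rule. The directory name /trap/ and the helper may_crawl are assumptions made up for this illustration; use whatever path you actually set aside for the trap.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site; "/trap/" is an assumed directory name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent: str, path: str) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch this path."""
    return parser.can_fetch(user_agent, path)


# A compliant spider stays out of the trap but can still crawl real pages.
print(may_crawl("Googlebot", "/trap/index.html"))
print(may_crawl("Googlebot", "/articles/spider-traps.html"))
```

A spider that honors the file never requests anything under the excluded directory, so any client that does show up there has either ignored robots.txt or never read it.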
From here there are various options. One of my favorites involves logging to a database or text file and then dynamically denying access to the bad bot.
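As a rough sketch of the logging half of that idea, the Python below keeps a ban list in SQLite. This is an illustration under assumptions, not the code used on any real site: the table name banned and the functions open_ban_db, record_ban and is_banned are all invented for this example.

```python
import sqlite3
from datetime import datetime, timezone


def open_ban_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the ban list. The table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS banned ("
        " ip TEXT PRIMARY KEY,"
        " user_agent TEXT,"
        " banned_at TEXT)"
    )
    return conn


def record_ban(conn: sqlite3.Connection, ip: str, user_agent: str) -> None:
    """Log a client that entered the trap; repeat visits are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO banned (ip, user_agent, banned_at)"
        " VALUES (?, ?, ?)",
        (ip, user_agent, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def is_banned(conn: sqlite3.Connection, ip: str) -> bool:
    """Return True if this IP address is already on the ban list."""
    row = conn.execute("SELECT 1 FROM banned WHERE ip = ?", (ip,)).fetchone()
    return row is not None
```

The page that lives in the trap directory would call record_ban, and every other page would call is_banned before serving anything. Keying the table on the IP address keeps the lookup cheap, at the cost of also blocking anyone else who shares that address.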
How does it work?
Let me give you a practical example.
I once had a client who was getting harvested many times per day by many different bad spiders. It was so bad at one point that the bad bots were doubling his bandwidth usage.
So we devised a plan: we would create the trap described above, and as soon as we captured a bot’s user agent and IP address, we would ban it from the site.
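The capture-and-ban step could look something like the following Python sketch, written as a small piece of WSGI middleware. It is not the client's actual setup: the class name SpiderTrap, the /trap/ prefix and the in-memory ban set are all assumptions chosen to keep the example short.

```python
TRAP_PREFIX = "/trap/"  # assumed name of the trap directory


class SpiderTrap:
    """WSGI middleware that bans any client requesting a URL inside the trap."""

    def __init__(self, app, banned_ips=None):
        self.app = app
        # In-memory ban list for illustration; a real site would persist it.
        self.banned_ips = set() if banned_ips is None else banned_ips
        self.captured = []  # (ip, user_agent) pairs seen in the trap

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        user_agent = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "")

        # Entering the trap gets the client logged and banned immediately.
        if path.startswith(TRAP_PREFIX):
            self.banned_ips.add(ip)
            self.captured.append((ip, user_agent))

        # Banned clients are refused on every request, including this one.
        if ip in self.banned_ips:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]

        return self.app(environ, start_response)
```

Wrapping an existing WSGI application as SpiderTrap(app) is enough to put the check in front of every request. The user agent is recorded for the log only; the ban itself is keyed on the IP address, since a user agent string is trivial for a bot to change.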
This is how it worked:
The bad bot would come to the site and find a link on an image. The link would point to the trap directory.
Normally, a regular spider would first check the robots.txt file to make sure it was allowed to index the content in that directory. Since the file excluded this directory, the “good” spiders wouldn’t go in.