Ignorance is Bliss

Every once in a while, for one reason or another, I take a look at the log files for one of my own web sites or one of my customers’ sites. The log files are a technical record of every request for a file that comes in to the server. The requests can be from real users visiting one of my web pages, or they can be from search engine spiders sent by Google, Yahoo, etc. Occasionally I’ll also see visits that fit the pattern of hackers probing the sites for known or unknown vulnerabilities.

Here’s the first log record from today’s log file: - - [17/Jul/2011:00:28:39 -0700] "GET /cgi-bin/color.cgi?backcolor=CCCC33 HTTP/1.1" 200 67156 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" 255 67465
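For anyone who wants to pull a record like that apart, here is a minimal sketch in Python. It assumes the line follows the common Apache "combined" layout with two extra counters on the end. The excerpt above doesn't say what those two trailing numbers are, so the names bytes_in and bytes_out are my guess, and the 192.0.2.1 address is a documentation placeholder standing in for the host field that isn't shown.

```python
import re

# Combined log format, optionally followed by two counters.
# Assumption: the trailing counters are bytes received and bytes sent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    r'(?: (?P<bytes_in>\d+) (?P<bytes_out>\d+))?'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# The record quoted above, with a placeholder host (an assumption).
SAMPLE = (
    '192.0.2.1 - - [17/Jul/2011:00:28:39 -0700] '
    '"GET /cgi-bin/color.cgi?backcolor=CCCC33 HTTP/1.1" 200 67156 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)" 255 67465'
)

record = parse_line(SAMPLE)
print(record["request"])  # what was asked for
print(record["agent"])    # who claims to be asking
```

The user agent string is the part that identifies the visitor as Baiduspider. Keep in mind it is self-reported, so a badly behaved bot can put anything it likes there.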

This happens to be from the largest search engine site in China. The visit came from an automated program called a spider, or bot. Baiduspider is not only a bad bot, but since it’s from China, I don’t need or want it crawling my site. So it gets banned.

How do I know it’s a bad bot? Well, back in the mid-90s some guy decided to make a standard defining a set of rules that spiders are supposed to follow. Like a lot of Internet standards, the robots.txt file was written by a computer geek, and therefore makes no sense. There’s also no good documentation for how it’s supposed to work, and you’ll find plenty of conflicting opinions on the subject.

Google and other search engines have also added to the syntax, so you’re never sure if the search engine you’re targeting is going to follow the rules as you write them.

There are even online syntax checkers that attempt to help you make sure your syntax is correct, but no matter which way you write the rules they will tell you you’ve written them backwards. I guess I need the secret decoder ring.

The reason I was trying to make sense of all this again today is that while reading through the log files for aestheticdesign.com this morning, I noticed that Yahoo’s Slurp bot was accessing the same files over and over, from different IP addresses. It’s only halfway through the day and already Yahoo has downloaded the same image from my web site 39 times!
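Spotting that kind of repetition by eye gets old fast. A rough sketch of how the counting could be automated is below. The sample lines, the /images/logo.png path, and the IP addresses are all invented for illustration; they are not taken from the real log.

```python
import re
from collections import Counter

# Pull the requested path and the user agent out of a combined-format line.
REQUEST_AND_AGENT = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def repeat_fetches(lines, bot_token):
    """Count requests per path for user agents containing bot_token."""
    counts = Counter()
    for line in lines:
        match = REQUEST_AND_AGENT.search(line)
        if match and bot_token.lower() in match.group("agent").lower():
            counts[match.group("path")] += 1
    return counts


SLURP = ("Mozilla/5.0 (compatible; Yahoo! Slurp; "
         "http://help.yahoo.com/help/us/ysearch/slurp)")

# Hypothetical log lines, not real data.
SAMPLE_LINES = [
    '192.0.2.10 - - [17/Jul/2011:01:00:00 -0700] '
    '"GET /images/logo.png HTTP/1.1" 200 1234 "-" "%s"' % SLURP,
    '198.51.100.7 - - [17/Jul/2011:01:04:00 -0700] '
    '"GET /images/logo.png HTTP/1.1" 200 1234 "-" "%s"' % SLURP,
    '203.0.113.5 - - [17/Jul/2011:01:05:00 -0700] '
    '"GET /index.html HTTP/1.1" 200 5000 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

counts = repeat_fetches(SAMPLE_LINES, "Slurp")
for path, hits in counts.most_common():
    print(hits, path)
```

Matching on the user agent rather than the IP address is deliberate here, since the repeat visits arrive from different addresses.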

Now, that image doesn’t ever change, so there’s no need to keep downloading it every few minutes. It’s just another example of why Yahoo is such a non-player in the search engine business these days. In researching the issue I came across many articles, going back as far as 2007, from webmasters having problems with Slurp. My favorite was “Yahoo! Slurp too Stupid to be a Robot,” written back in 2009 on Jeff Starr’s blog. He decided to block Yahoo, not only because of their incompetence, but also because Yahoo wasn’t even sending any traffic to his site.

So I looked at my site’s stats, and sure enough, Yahoo referred 150 out of 43,525 hits to my web site, or in other words 0.2% of my traffic came from Yahoo. Yet the Yahoo Slurp spider itself accounted for 5.67% of the hits on my site. Something’s seriously out of whack when the search engine itself generates 28 times as many hits as it sends you.

Since I already hate Yahoo for their incompetence, I’m pretty close to just banning their bot completely. I’m giving them one more chance: if the recent change I made to the robots.txt file doesn’t slow them down, the ban goes in.
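Given how confusing the syntax is, one way to see how a set of robots.txt rules will be read is to feed them to the parser in Python’s standard library. The rules below are purely hypothetical; they are not the change made on this site. Crawl-delay is a non-standard directive that only some crawlers honor, so treat the result as one parser’s interpretation rather than a guarantee of what any bot will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
RULES = """\
User-agent: Slurp
Crawl-delay: 60
Disallow: /images/

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Ask the parser how it reads the rules for a few bots and paths.
for agent, path in [
    ("Slurp", "/images/logo.png"),
    ("Slurp", "/index.html"),
    ("Baiduspider", "/index.html"),
    ("Googlebot", "/images/logo.png"),
]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(agent, path, verdict)

print("Slurp crawl delay:", parser.crawl_delay("Slurp"))
```

A well-behaved spider should reach the same conclusions. A bad bot simply never reads the file, which is why an outright ban has to happen on the server rather than in robots.txt.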

Once again, looking at the raw data in the log file has taught me that ignorance truly is bliss. I’m pretty sure that’s how the programmers at Yahoo must feel too.