Well, the folks who brought you the most easily infected PC operating system and the buggiest, least standards-compliant web browser have done it again. Their latest invention is a crawler robot that may end up costing you money.
All of the major search engines run robot programs known as crawlers, bots, and spiders. Their job is to visit web sites, follow all the links, index those pages, follow the links on those pages, and go on and on and on. That’s how the internet is indexed these days. All well and good. Until Microsoft designed their bot.
The search engine bots are supposed to have a minimal impact on your web site, coming by for a taste now and then. If they try to devour the whole meal at once it can impact your server, slowing it down for your visitors.
This month, after receiving the bill from my hosting provider, I noticed that I was charged extra for going over my bandwidth limit. That was the first time I had ever experienced this, so I did a bit of investigating.
One account on my server was way above normal usage, and after looking at the logs I noticed that search engine bots were represented far more heavily in the stats than they should have been. Drilling down into those stats, I found that bingbot, Microsoft’s crawler for its Bing search site, accounted for 19% of the site’s traffic. It was hitting my site so often that it totally skewed the Webalizer stats for the month.
The problem seems to have resulted from a voting script that I recently added: bingbot was trying every voting link, which appears to have put it in an endless loop. For a long time, bots ignored URLs with question marks in them, because only dynamically served pages had them and bots could quite often get caught in endless loops trying to follow them. Apparently the engineers at Microsoft thought they were smart enough to venture into the rough waters of dynamic URLs, but they forgot to bring their lifeboats in case they needed to abandon ship.
Their foolhardy venture cost me a small amount this month, and I’ve made some modifications to my robots.txt file to keep them out of that part of the site. Hopefully they will grab the updated robots.txt file soon and pay attention to it.
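For the curious, the rules I mean look something like this. The `/vote` path here is just a stand-in, not the script’s real location, and note that the `*` wildcard and `Crawl-delay` are extensions that Bing honors but that the original robots.txt convention does not guarantee for every crawler:

```
# Keep bingbot out of the voting script entirely (path is a placeholder)
User-agent: bingbot
Disallow: /vote
Crawl-delay: 10

# For good measure, steer all crawlers away from query-string URLs
# (the * wildcard is a Bing/Google extension, not part of the original standard)
User-agent: *
Disallow: /*?
```

Well-behaved bots re-fetch robots.txt periodically, so changes like this take effect on their next visit rather than immediately.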
Just for the record, here are all the IP addresses I found in the logs during a one-hour period that can be traced to bingbot. At some points there were up to four different bingbots hitting my site at the same time! In all, in just this random one-hour time frame that I picked, 34 different bingbots hit my site 94 times, mostly indexing the same 15 pages over and over again.
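If you want to run the same tally on your own logs, here is a rough sketch of the kind of counting I did. It assumes an Apache-style access log where the client IP is the first field and the user-agent string appears later on the line; the sample entries below are made up for illustration, not my real log lines:

```python
from collections import Counter

# Stand-in log lines in Apache combined format (real entries are much longer)
log_lines = [
    '157.55.39.1 - - [10/Oct/2012:13:55:36 +0000] "GET /vote?id=1 HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '157.55.39.2 - - [10/Oct/2012:13:55:41 +0000] "GET /vote?id=2 HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '66.249.66.1 - - [10/Oct/2012:13:56:02 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Count hits per IP, keeping only requests whose user-agent claims to be bingbot
hits = Counter(
    line.split()[0]          # first whitespace-separated field is the client IP
    for line in log_lines
    if "bingbot" in line     # crude but effective user-agent match
)

print(len(hits))             # distinct bingbot IPs
print(sum(hits.values()))    # total bingbot hits
```

In a real session you would read the lines from your access log file instead of a list; the same `Counter` then gives you the per-IP breakdown, which is how the "34 addresses, 94 hits" figure falls out.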