February 17, 2004

Someone has a bad net crawler, and it isn't me. Looking at the logs for my web page (which gets very few hits as it is), I noticed two separate incidents over the past week where some crawler made multiple requests in a row (like, say, 50 and 100 at a time...) for the file hilite.htc. Unfortunately for the crawler, an .htc file is a DHTML behavior file, which Tripod does not serve with the correct MIME type on most of its servers [I think I caught it acting correctly ONCE]. Something in the file (or perhaps in this page itself) causes the crawler to request this link over and over and over. Here's an example from the logfile:

XXX.YYY.ZZ.A - - [12/Feb/2004:20:36:00 -0500] "GET /rgautier/hilite.htc HTTP/1.1" 304 - "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; MSIECrawler)"
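For anyone who wants to tally repeat requests like these in their own logs, here is a rough Python sketch. It assumes the log uses the common "combined" layout that the sample line above appears to follow; the names LOG_PATTERN, parse_line and count_hits are made up for illustration and are not part of any tool Tripod provides.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def count_hits(lines, path_suffix="hilite.htc"):
    """Count requests for paths ending in path_suffix, grouped by (host, user-agent)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["path"].endswith(path_suffix):
            counts[(entry["host"], entry["agent"])] += 1
    return counts
```

Feeding it the lines of a log file (for example, count_hits(open("access.log")), where access.log is a placeholder name) gives a count per visitor and user agent, which makes a burst of 50 or 100 identical requests easy to spot.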

There was another one with a referrer of Mikelist.com/current.htm (although Mike doesn't link to me, as far as I know). So, if you're a robot - sorry, you'll need some more smarts...

It looks like the MSIECrawler hits come from someone who has subscribed to my web page (little old me? Wow! Shocker!). However, the requests that had MikesList as the referrer didn't have MSIECrawler in the user-agent string. (They said 'Rich's XP', which is curious because my name is Rich, but the IP address wasn't from anywhere that I connect from, and I don't recall ever having a machine that I called 'Rich's XP'.)
