What's the best way to block bots from searching your website? I have created a robots.txt file which looks like this:

Code:
User-agent: *
Disallow: /
Disallow: /cgi-bin/

I have included the following in my index.html file:

Code:
<meta name="robots" content="NOINDEX, NOFOLLOW">

And I have also included an .htaccess file in my root which looks like this:

Code:
SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
Order Deny,Allow
Deny from env=bad_bot

Yet I'm still seeing entries in Apache's access.log:

Code:
178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /robots.txt HTTP/1.1" 200 324 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /phpbb/search.php?search_id=active_topics&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3735 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
178.154.164.251 - - [10/Nov/2012:04:33:17 -0500] "GET /phpbb/ucp.php?mode=login&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3513 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
66.249.76.173 - - [10/Nov/2012:06:05:11 -0500] "GET /robots.txt HTTP/1.1" 200 368 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
178.154.164.251 - - [10/Nov/2012:06:32:14 -0500] "GET /phpbb/index.php?sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3908 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
123.125.71.74 - - [10/Nov/2012:06:35:02 -0500] "GET /robots.txt HTTP/1.1" 200 331 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2"

I have even included the IP address 178.154.164.251 in the inbound filter list on my router. (The fact that I still see that address listed in my Apache logs suggests, at least to me, that Yandex isn't coming from that address.) Thoughts, anyone?
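One thing that may be worth checking is the .htaccess pattern itself. The `^` anchors the match to the start of the User-Agent header, but the logged header starts with "Mozilla/5.0", so `^Yandex*` would probably never set bad_bot. Below is a rough Python sketch of that check. It assumes Python's `re` module behaves like Apache's regex engine for a pattern this simple, and the unanchored `Yandex` pattern is a hypothetical alternative, not something taken from the posted files.

```python
import re

# User-Agent string copied from the access.log excerpt in the question.
LOGGED_UA = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

# Pattern from the posted .htaccess; SetEnvIfNoCase matches case-insensitively.
ANCHORED = re.compile(r"^Yandex*", re.IGNORECASE)

# Hypothetical alternative without the leading anchor (an assumption,
# not part of the original configuration).
UNANCHORED = re.compile(r"Yandex", re.IGNORECASE)


def is_flagged(pattern, user_agent):
    """Return True if the pattern matches somewhere in the User-Agent string."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    print("anchored pattern matches:", is_flagged(ANCHORED, LOGGED_UA))
    print("unanchored pattern matches:", is_flagged(UNANCHORED, LOGGED_UA))
```

If the anchored pattern reports no match here, that would be consistent with YandexBot still getting 200 responses in the log despite the Deny rule.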
Try

Code:
User-agent: Yandex
Disallow: /

in your robots.txt. See http://help.yandex.com/webmaster/?id=1113851
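To see how a crawler that honors robots.txt would read these rules, they can be fed to Python's standard `urllib.robotparser`. This is only a sketch: the combined rule set below is an assumption (the Yandex group from this reply plus the rules from the question), the helper name `allowed` is made up for illustration, and the result says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed combined file: the Yandex-specific group suggested in this reply
# plus the catch-all group from the original question.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /

User-agent: *
Disallow: /
Disallow: /cgi-bin/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this agent name may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "YandexBot" is the product token from the logged User-Agent header.
    print(allowed(ROBOTS_TXT, "YandexBot", "/phpbb/index.php"))
    print(allowed(ROBOTS_TXT, "Googlebot", "/phpbb/index.php"))
```

The agent name is passed as a short product token rather than the full header, since that appears to be what this parser compares against the `User-agent:` lines.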
I had tried that, but it didn't seem to make any difference. Adding a series of IP address blocks (a bit overboard) seems to have worked. The:

Code:
User-agent: *
Disallow: /

seems to have stopped the vast majority of the activity I'm not interested in having. I believe I had a more serious problem, though, which I'll address in a separate thread. Sheesh, I'm getting more traffic than a free bordello beside a Naval dock!
Yes, I'm aware that it should stop ALL bots. And it does, if they observe the rules in robots.txt. But if they don't, they'll keep knocking away with the zeal of a vacuum cleaner salesman pounding on my front door.

Quick question you might know the answer to... When I see an entry in my Apache logfile that says "GET /robots.txt", does that mean the robot has tried to do a search and then has received my robots.txt file? I guess what I'm really asking here is: must a robot search at least once from 999.999.999.999 to receive the robots.txt file, after which searches from 999.999.999.999 will stop?
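For what it's worth, a "GET /robots.txt" entry is an ordinary request made by the client for that file. A crawler that follows the rules usually asks for it first and then decides what else to fetch. One way to see who keeps going afterwards is to scan the log for page requests that follow a robots.txt fetch from the same IP. The sketch below assumes the combined log format shown in the first post. The function name `requests_after_robots` and the regex are made up for illustration and have only been checked against those sample lines.

```python
import re
from collections import defaultdict

# Regex for the combined log format seen in the access.log excerpt
# (an assumption: real logs may use a different LogFormat).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def requests_after_robots(lines):
    """Map each client IP to the paths it requested after fetching /robots.txt."""
    seen_robots = set()
    later = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        ip, path = match.group("ip"), match.group("path")
        if path == "/robots.txt":
            seen_robots.add(ip)
        elif ip in seen_robots:
            later[ip].append(path)
    return dict(later)


# Three entries copied from the log excerpt in the first post.
SAMPLE = [
    '178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /robots.txt HTTP/1.1" 200 324 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '178.154.164.251 - - [10/Nov/2012:04:33:17 -0500] "GET /phpbb/ucp.php?mode=login&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3513 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '66.249.76.173 - - [10/Nov/2012:06:05:11 -0500] "GET /robots.txt HTTP/1.1" 200 368 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for ip, paths in requests_after_robots(SAMPLE).items():
        print(ip, paths)
```

An IP that shows up in the result fetched robots.txt and then requested other pages anyway. An IP that only ever requests robots.txt never appears.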