I have to admit I am writing this post because I will soon publish a big article about the cURL extension, showing how to use it to create bots that can even log in to web pages and perform actions there. Nothing illegal, but I still feel the need to explain how to detect “good” bots (search engine crawlers) and “bad” bots (scrapers), and how to protect your site from the latter.
Search engine bots
A script that determines whether a visitor is the Google, Microsoft or Yahoo bot can be written in a few lines of code. These bots are easy to detect because big companies do not try to hide the fact that they are sending a bot to you: their bots usually identify themselves with a quite unique user agent string, which PHP stores in the $_SERVER['HTTP_USER_AGENT'] variable.
The best way to find out which bots visit your website is to examine the server access logs. Here are logs from one of my old websites:
00.00.000.000 - - [07/Sep/2008:02:10:46 +0200] "GET /cool-videos HTTP/1.1" 200 2037 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:03:33:12 +0200] "GET / HTTP/1.1" 200 9969 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:08:14:15 +0200] "GET /robots.txt HTTP/1.0" 404 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
00.00.000.000 - - [07/Sep/2008:08:57:03 +0200] "GET /robots.txt HTTP/1.1" 404 - "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:08:57:04 +0200] "GET /cool-videos HTTP/1.1" 200 1776 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:09:04:44 +0200] "GET /cool-videos HTTP/1.0" 200 1757 "http://search.live.com/results.aspx?q=videos" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
00.00.000.000 - - [07/Sep/2008:13:57:59 +0200] "GET / HTTP/1.1" 200 10487 "-" "WebAlta Crawler/2.0 (http://www.webalta.net/ru/about_webmaster.html) (Windows; U; Windows NT 5.1; ru-RU)"
Do you see which bots visit my website? They are msnbot, the Yahoo bot (Slurp) and WebAlta (whatever that is). The most important information is at the end of each line; in the first line, for example, we are interested in this: “msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)” (again, this information can be found in $_SERVER['HTTP_USER_AGENT']). As you can see, each bot's user agent contains a URL pointing to a site that explains what the bot does. We can take advantage of that and use the URL to identify bots:
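$bots = array(
    'msn'     => 'http://search.msn.com/msnbot.htm',
    'yahoo'   => 'http://help.yahoo.com/help/us/ysearch/slurp',
    'WebAlta' => 'http://www.webalta.net/ru/about_webmaster.html'
);

// The header may be missing, so fall back to an empty string.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($bots as $name => $bot) {
    // stripos() is case-insensitive, so no strtolower() call is needed.
    if (stripos($agent, $bot) !== false) {
        echo $name;
    }
}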
If the user agent string ($_SERVER['HTTP_USER_AGENT']) contains the text from the $bot variable, then we have most probably identified the bot. It is also safe to use URLs to identify bots because they are long and usually unique, but if a bot does not provide a URL in HTTP_USER_AGENT then you must identify it with something else, for example its name.
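The same idea can be wrapped in a small function that tries the URL first and then falls back to the bot's name. This is only a sketch: the function name identify_bot() and the array layout are my own choices for illustration, the fallback names are copied from the log excerpt above, and the type hints assume PHP 7.1 or newer.

```php
<?php
// Sketch: return the label of the first signature found in the user agent,
// or null when nothing matches. identify_bot() is a hypothetical helper name.
function identify_bot(string $agent, array $signatures): ?string
{
    if ($agent === '') {
        return null; // nothing to match against
    }
    foreach ($signatures as $name => $needles) {
        foreach ((array) $needles as $needle) {
            // stripos() is case-insensitive and returns false when not found.
            if (stripos($agent, $needle) !== false) {
                return $name;
            }
        }
    }
    return null;
}

// For each bot: the documentation URL first, then the name as a fallback.
$signatures = [
    'msn'     => ['http://search.msn.com/msnbot.htm', 'msnbot'],
    'yahoo'   => ['http://help.yahoo.com/help/us/ysearch/slurp', 'Yahoo! Slurp'],
    'WebAlta' => ['http://www.webalta.net/ru/about_webmaster.html', 'WebAlta Crawler'],
];

// The header is absent on the command line, so default to an empty string.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
$bot = identify_bot($agent, $signatures);
echo $bot === null ? "no known bot\n" : "bot: $bot\n";
```

Keep in mind that the user agent is sent by the client, so a match only tells you what the visitor claims to be.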
You have probably noticed that I did not show how to detect the Google bot. I did that intentionally; it is a homework assignment for you: examine your server access logs and determine how it identifies itself in the $_SERVER['HTTP_USER_AGENT'] variable.
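If your access log is long, a few lines of PHP can list the user agents for you. The sketch below assumes the log uses the same layout as the excerpt earlier in this post, with the user agent as the last double-quoted field on each line; the function names are made up for this example, and the type hints assume PHP 7.1 or newer.

```php
<?php
// Sketch: extract the last double-quoted field of a log line, which is the
// user agent in the log layout shown earlier (an assumption about your log).
function extract_user_agent(string $line): ?string
{
    if (preg_match('/"([^"]*)"\s*$/', $line, $m) === 1) {
        return $m[1];
    }
    return null; // line does not end with a quoted field
}

// Count how often each user agent appears, most frequent first.
function count_user_agents(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        $agent = extract_user_agent($line);
        if ($agent === null || $agent === '' || $agent === '-') {
            continue; // skip lines without a usable user agent
        }
        $counts[$agent] = ($counts[$agent] ?? 0) + 1;
    }
    arsort($counts); // sort by hit count, descending, keeping the keys
    return $counts;
}

// Three sample lines copied from the log excerpt in this post.
$lines = [
    '00.00.000.000 - - [07/Sep/2008:02:10:46 +0200] "GET /cool-videos HTTP/1.1" 200 2037 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '00.00.000.000 - - [07/Sep/2008:03:33:12 +0200] "GET / HTTP/1.1" 200 9969 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '00.00.000.000 - - [07/Sep/2008:08:14:15 +0200] "GET /robots.txt HTTP/1.0" 404 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
];

foreach (count_user_agents($lines) as $agent => $hits) {
    echo $hits, "\t", $agent, "\n";
}
```

For a real log you could pass the result of file() with the FILE_IGNORE_NEW_LINES flag instead of the sample array. The simple regex does not handle user agents that contain escaped quotes.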
Unwanted bots
Aside from search engine bots, there are tons of bots that you do not want to see on your site, such as bots that post spam comments on blogs. The problem is that you cannot actually do much about them: if they are well written, they usually use the user agent string of some popular browser.
A pretty good solution could be blocking IP addresses, but whose address should you block? Obviously that of someone who has already sent a spam comment. On the other hand, they could be using a proxy server to access your site, so you would block only the proxy and not the actual spammer. And what about bots that scrape content from your site? They behave like normal users; the only difference is that they move quickly from page to page.
The point is that it is really hard to ban some bots, and in many cases it may even be impossible. Still, here are a few pointers on how to block such bots from accessing your site, although these tips apply only to lame bots:
- check $_SERVER['HTTP_USER_AGENT']: if it is empty, chances are it is a bot, because all normal browsers identify themselves
- send them a cookie: lame bots usually can’t save cookies because lame programmers do not know how to do it; on the other hand, a real browser can have cookies turned off, so this one is a bit tricky
- if a visitor requests a large number of pages in a short amount of time, like 100 in one minute, then it is probably a bot looking for email addresses on your website or just stealing your content
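The first and third pointers can be sketched in a few lines. Everything here is an assumption made for illustration: the function names are hypothetical, the limit of 100 requests per 60 seconds is just the figure from the list above, and how you store each visitor's request timestamps (session, file, database) is left open. The cookie check is not covered. The type hints assume PHP 7 or newer.

```php
<?php
// Sketch of the first pointer: an empty or missing user agent is suspicious.
// $server is expected to look like the $_SERVER array.
function has_empty_user_agent(array $server): bool
{
    return trim($server['HTTP_USER_AGENT'] ?? '') === '';
}

// Sketch of the third pointer: count requests inside a sliding time window.
// $history holds the Unix timestamps of one visitor's earlier requests.
function is_too_fast(array $history, int $now, int $limit = 100, int $window = 60): bool
{
    // Keep only the requests made during the last $window seconds.
    $recent = array_filter($history, function ($t) use ($now, $window) {
        return $t > $now - $window;
    });
    // The current request counts too; flag once the visitor reaches the limit.
    return count($recent) + 1 >= $limit;
}

// Usage with made-up data: 120 earlier requests, all five seconds ago.
$now = time();
$history = array_fill(0, 120, $now - 5);

var_dump(has_empty_user_agent(['HTTP_USER_AGENT' => '']));
var_dump(is_too_fast($history, $now));
```

Both checks only raise suspicion. A fast human or a browser behind a privacy tool can trigger them too, so it may be wiser to slow such visitors down or show them a challenge than to ban them outright.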
robots.txt
One last thing I have to mention here: robots.txt files are meant to tell bots what they can and cannot visit, and which bots are allowed to access your site at all. But the fact is that only good bots (coming from search engines) respect this file, and sometimes even they do not, so having a robots.txt file on your website will not protect you in any way from spammers and thieves.



22 Comments on "Detecting search engine bots with PHP"
Nice tutz Bro
Thnx
Arrived at your site through wordpress while looking for “How to create WordPress Plugin from a scratch”. You have some real nice content here.
RSS_Subscriber= RSS_Subscriber+1
Thanks I was looking for this information.
Thanx.. Its really a valueable one..
but is it a good way to detect bot by the address of it or by just comparing the name solves the purpose better way like..
if ((eregi("yahoo",$this->USER_AGENT)) && (eregi("slurp",$this->USER_AGENT)))
{
    $this->Browser = "Yahoo! Slurp";
    $this->Type = "robot";
}
You made some good points there. I did a search on the topic and found most people will agree with your blog.
I don
Very well written post however, I would recommend that you turn the No Follow off in your comment section.
Keep up the good work.
This is a great website, so many people need this information, thanks for providing it. I love your color scheme too!
This has been really interesting, thanks for that. I love this blog theme too!
Many many thanks for I was looking for a long time.
nice tutorial..
thx bro
sometimes bad robots use the same HTTP_USER_AGENT as search engine bots
Hey, thanks for info. Very useful. Here is list of User Agents:
http://www.user-agents.org/
They have the list available for download in XML too. Thanks again.
Thanks for the tutorial. For people that are new to this, it should be mentioned that using server-side detection to serve different content to search engines than you serve to your users, is against search engine policy and your site might get seriously penalized for doing it (it also depends on the scale at which you are doing it).
Hmm, there seems to be some difficulties with the first link, as it returns a 404 error
Woww.. awesome tutorial. thank you
I am searching for a php code which tells me that how many time a page with my adds is visited by browsers not by bots
Very good tutorial, I find what I am looking for here.
Hi yo all,
In my website..I want to change the mobile number based on from which search engine (google,bing,ask) user access my website. I want to change the mobile number based on that……
Thanks in advance..
Thanks for the info. Very good post!
really very nice tutorial but i wanted a sure short way to differentiate b/w bots and users. That’s completely not here. so where i am gonna find it.?