Detecting search engine bots with PHP

Written by Greg Winiarski on September 7, 2008 in PHP - 22 Comments

I have to admit I am writing this post because I will soon publish a big article about the cURL extension, showing how to use it to create bots that can even log in to web pages and perform actions there. Nothing illegal, but I still feel the need to tell you how to detect “good” bots (search engine bots) and “bad” bots (scrapers), and how to protect yourself from them.

Search engine bots

A script that determines whether a visitor is the Google bot, the Microsoft bot or the Yahoo bot can be written in a few lines of code. These bots are easy to detect because big companies do not try to hide the fact that they are sending a bot to you, so their bots usually “use” browsers with quite unique user agent identifiers. The identifier is stored in the $_SERVER['HTTP_USER_AGENT'] variable.

The best way to find out which bots visit your website is to examine the server access logs. Here are logs from one of my old websites:

00.00.000.000 - - [07/Sep/2008:02:10:46 +0200] "GET /cool-videos HTTP/1.1" 200 2037 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:03:33:12 +0200] "GET / HTTP/1.1" 200 9969 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:08:14:15 +0200] "GET /robots.txt HTTP/1.0" 404 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
00.00.000.000 - - [07/Sep/2008:08:57:03 +0200] "GET /robots.txt HTTP/1.1" 404 - "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:08:57:04 +0200] "GET /cool-videos HTTP/1.1" 200 1776 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
00.00.000.000 - - [07/Sep/2008:09:04:44 +0200] "GET /cool-videos HTTP/1.0" 200 1757 "http://search.live.com/results.aspx?q=videos" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
00.00.000.000 - - [07/Sep/2008:13:57:59 +0200] "GET / HTTP/1.1" 200 10487 "-" "WebAlta Crawler/2.0 (http://www.webalta.net/ru/about_webmaster.html) (Windows; U; Windows NT 5.1; ru-RU)"

Do you see which bots visit my website? They are msnbot, the Yahoo bot and WebAlta (whatever that is). The most important information is at the end of each line. For example, in the first line we are interested in this: “msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)” (again, this information can be found in $_SERVER['HTTP_USER_AGENT']). As you can see, there is always a URL pointing to a site that explains what the bot does. We can take advantage of that and use the URL to identify bots:

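// Map a short bot name to the URL that the bot includes in its user agent.
$bots = array(
    'msn' => 'http://search.msn.com/msnbot.htm',
    'yahoo' => 'http://help.yahoo.com/help/us/ysearch/slurp',
    'WebAlta' => 'http://www.webalta.net/ru/about_webmaster.html'
);

// The header may be missing, so fall back to an empty string.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($bots as $name => $bot)
{
    // stripos() is case-insensitive, so the agent needs no strtolower().
    if (stripos($agent, $bot) !== false)
    {
        echo $name;
    }
}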

If the user agent signature ($_SERVER['HTTP_USER_AGENT']) contains the text from the $bot variable, then we have most probably identified the bot. It is also safe to use URLs to identify bots because they are long and usually unique, but if a bot does not provide a URL in HTTP_USER_AGENT, then you must identify it by something else, for example its name.
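When a bot sends no URL, matching on its name works the same way. Here is a minimal sketch that tries the URL first and a name token second. The function name identify_bot() is hypothetical, and the name tokens ('msnbot', 'yahoo! slurp', 'webalta crawler') are simply copied from the log excerpt above, so treat them as assumptions about what these bots send.

```php
<?php
// Each bot has a list of substrings to look for in the User-Agent header:
// the URL first, then a short name as a fallback.
function identify_bot($agent)
{
    $signatures = array(
        'msn'     => array('http://search.msn.com/msnbot.htm', 'msnbot'),
        'yahoo'   => array('http://help.yahoo.com/help/us/ysearch/slurp', 'yahoo! slurp'),
        'WebAlta' => array('http://www.webalta.net/ru/about_webmaster.html', 'webalta crawler'),
    );

    foreach ($signatures as $name => $needles) {
        foreach ($needles as $needle) {
            // stripos() compares case-insensitively.
            if (stripos($agent, $needle) !== false) {
                return $name;
            }
        }
    }

    return null; // no known signature matched
}

// The header can be absent, so default to an empty string.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$bot = identify_bot($agent);
echo $bot === null ? "not a known bot\n" : "bot: $bot\n";
```

Short names are more likely than URLs to match something by accident, which is why the sketch checks the URL first.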

You probably noticed that I did not show how to detect the Google bot. I did that intentionally: it is a homework assignment for you. Examine your server access logs and determine how it identifies itself in the $_SERVER['HTTP_USER_AGENT'] variable.
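To make the homework easier, here is a rough sketch that lists every user agent found in an access log together with how often it appears. It assumes the log uses the same layout as the excerpt above, where the user agent is the last quoted field on the line. The function names and the 'access.log' path are placeholders of mine.

```php
<?php
// Returns the last double-quoted field of a log line (the User-Agent in the
// log layout shown earlier), or null if the line has no such field.
function extract_user_agent($line)
{
    if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
        return $m[1];
    }
    return null;
}

// Counts how often each User-Agent appears, most frequent first.
function count_user_agents(array $lines)
{
    $counts = array();
    foreach ($lines as $line) {
        $agent = extract_user_agent($line);
        if ($agent === null) {
            continue; // skip lines that do not look like log entries
        }
        $counts[$agent] = isset($counts[$agent]) ? $counts[$agent] + 1 : 1;
    }
    arsort($counts);
    return $counts;
}

// 'access.log' is a placeholder; point it at your own server log.
$path = 'access.log';
if (is_readable($path)) {
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach (count_user_agents($lines) as $agent => $hits) {
        echo $hits, "\t", $agent, "\n";
    }
}
```

Scanning the output for agents you do not recognize as browsers is usually enough to spot the crawlers.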

Unwanted bots

Aside from search engine bots, there are a ton of bots that you do not want to see on your site, like bots that post spam comments on blogs. The problem is that you cannot actually do much about them. If they are well written, they usually use the signature of some popular browser as their user agent.

A pretty good solution could be blocking IP addresses, but whose address should you block? Obviously someone who has already sent a spam comment. On the other hand, they could be using a proxy server to access your site, so you would block only the proxy, not the actual spammer. And what about bots that are scraping content from your site? They behave like normal users; the only difference is that they move quickly from page to page.

The point is, it’s really hard to ban some bots, and in many cases it may even be impossible. Still, here are a few pointers on how to block such bots from accessing your site, although these tips apply only to lame bots:
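If you do decide to block by address, the check itself is simple. This is only a sketch: the $blocked list and the is_blocked_ip() helper are hypothetical, and the addresses come from ranges reserved for documentation, not from real spammers.

```php
<?php
// Hypothetical list of addresses that already posted spam.
$blocked = array('192.0.2.10', '198.51.100.7');

// Returns true when the visitor's address is on the blocklist.
function is_blocked_ip($ip, array $blocked)
{
    return in_array($ip, $blocked, true);
}

$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
if (is_blocked_ip($ip, $blocked)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
```

Keep in mind the caveat above: if the address belongs to a proxy, every visitor behind that proxy is locked out too.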

  • check $_SERVER['HTTP_USER_AGENT']: if it is empty, chances are it is a bot, because all normal browsers identify themselves
  • send them a cookie: lame bots usually can’t save cookies because lame programmers do not know how to do it; on the other hand, a real browser can have cookies turned off, so this one is a bit tricky
  • if a visitor requests a large number of pages in a short amount of time, like 100 in one minute, then it is probably a bot looking for email addresses on your website or just stealing your content
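The three checks above can be sketched as small functions. All names here are made up for illustration, including the 'seen' cookie. The request history passed to is_too_fast() has to be stored somewhere per visitor (for example by IP address in a file or database), and that part is left out on purpose.

```php
<?php
// 1. An empty or missing User-Agent header is suspicious.
function has_empty_user_agent(array $server)
{
    return !isset($server['HTTP_USER_AGENT'])
        || trim($server['HTTP_USER_AGENT']) === '';
}

// 2. True if the visitor sent back the marker cookie from an earlier request.
//    'seen' is a hypothetical cookie name.
function returned_cookie(array $cookies)
{
    return isset($cookies['seen']);
}

// 3. True if at least $limit requests fall within the last $window seconds.
//    $timestamps holds the Unix times of this visitor's earlier requests.
function is_too_fast(array $timestamps, $now, $limit = 100, $window = 60)
{
    $recent = 0;
    foreach ($timestamps as $t) {
        if ($t > $now - $window && $t <= $now) {
            $recent++;
        }
    }
    return $recent >= $limit;
}

// Example wiring for the first two checks.
$suspicious = has_empty_user_agent($_SERVER);
if (!returned_cookie($_COOKIE) && !headers_sent()) {
    // The cookie can only be verified on the visitor's next request.
    setcookie('seen', '1');
}
```

None of these signals is proof on its own, so it is probably wiser to combine them before banning anyone.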

robots.txt

One last thing I have to mention here: robots.txt files are meant to tell bots what they can and cannot visit, and which bots are allowed to access your site at all. But the fact is that only good bots (coming from search engines) respect these files, and sometimes even they do not, so having a robots.txt file on your website will not protect you in any way from spammers and thieves.
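For reference, a small robots.txt could look like the example below. The /private/ path is made up, and "msnbot" is just the name seen in the logs earlier. Remember that these rules are only a polite request.

```text
# Keep every crawler out of /private/ (a hypothetical path)
User-agent: *
Disallow: /private/

# Ask one crawler to stay away from the whole site
User-agent: msnbot
Disallow: /
```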

About the Author

Greg Winiarski is a freelance PHP and JavaScript programmer. He specializes in web applications and WordPress development.

22 Comments on "Detecting search engine bots with PHP"

  1. ghprod October 4, 2008 at 9:55 pm ·

    Nice tutz Bro :)

    Thnx

  2. ROW October 10, 2008 at 4:57 pm ·

    Arrived at your site through wordpress while looking for “How to create WordPress Plugin from a scratch”. You have some real nice content here.

    RSS_Subscriber= RSS_Subscriber+1 :)

  3. zichzach February 22, 2009 at 2:51 pm ·

    Thanks I was looking for this information.

  4. terrific March 24, 2009 at 10:50 am ·

    Thanx.. Its really a valueable one..

  5. terrific March 24, 2009 at 1:50 pm ·

    but is it a good way to detect bot by the address of it or by just comparing the name solves the purpose better way like..

    if ((eregi("yahoo", $this->USER_AGENT)) && (eregi("slurp", $this->USER_AGENT)))
    {
        $this->Browser = "Yahoo! Slurp";
        $this->Type = "robot";
    }

  6. Sophia April 8, 2009 at 10:52 pm ·

    You made some good points there. I did a search on the topic and found most people will agree with your blog.

  7. David April 20, 2009 at 4:30 am ·

    I don

  8. Pagerank Checker June 8, 2009 at 8:06 pm ·

    Very well written post however, I would recommend that you turn the No Follow off in your comment section.

    Keep up the good work.

  9. Acne No More November 22, 2009 at 12:49 pm ·

    This is a great website, so many people need this information, thanks for providing it. I love your color scheme too!

  10. Zygor Guides December 2, 2009 at 3:19 pm ·

    This has been really interesting, thanks for that. I love this blog theme too!

  11. Link Wheel December 13, 2009 at 9:57 am ·

    Many many thanks for I was looking for a long time.

  12. ganool February 3, 2010 at 4:43 am ·

    nice tutorial..
    thx bro

  13. ganool February 4, 2010 at 3:37 am ·

    some times bat robots use HTTP_USER_AGENT same like search engine bots

  14. Bob March 26, 2010 at 2:05 pm ·

    Hey, thanks for info. Very useful. Here is list of User Agents:

    http://www.user-agents.org/

    They have the list available for download in XML too. Thanks again.

  15. Criss Leonte April 8, 2010 at 11:49 am ·

    Thanks for the tutorial. For people that are new to this, it should be mentioned that using server-side detection to serve different content to search engines than you serve to your users, is against search engine policy and your site might get seriously penalized for doing it (it also depends on the scale at which you are doing it).

  16. Gino Iacuzio September 5, 2010 at 9:27 am ·

    Hmm, there seems to be some difficulties with the first link, as it returns a 404 error

  17. Nusaweb September 7, 2010 at 8:17 am ·

    Woww.. awesome tutorial. thank you

  18. Ahmad Ali November 13, 2010 at 7:04 pm ·

    I am searching for a php code which tells me that how many time a page with my adds is visited by browsers not by bots

  19. topcontractdeals January 16, 2011 at 9:16 am ·

    Very good tutorial, I find what I am looking for here.

  20. srihari March 14, 2011 at 7:26 am ·

    Hi yo all,

    In my website..I want to change the mobile number based on from which search engine (google,bing,ask) user access my website. I want to change the mobile number based on that……

    Thanks in advance..

  21. BillyFox April 1, 2011 at 11:04 am ·

    Thanks for the info. Very good post!

  22. Khan August 27, 2011 at 1:34 am ·

    really very nice tutorial but i wanted a sure short way to differentiate b/w bots and users. That’s completely not here. so where i am gonna find it.?
