The Web Hosting Forum > Programming & Technical > PHP spider detection

Euler



Joined: 02 Sep 2004
Posts: 109
PHP spider detection

Maybe this belongs in the programming section. I don't know. I assume discriminating between spiders and humans is important to SEO-aware webmasters.

What's the fastest way to check $_SERVER['HTTP_USER_AGENT'] to detect humans? More generally, what is the current best practice for managing this?


Post Thu Feb 24, 2005 7:11 pm
Thermit
Site Admin


Joined: 11 Aug 2004
Posts: 272

Uh oh, not putting on the black SEO hat, are we? ;)

This thread had some ideas that might be of use...

http://www.webmasterworld.com/forum24/384.htm


Let me guess, you aren't cloaking for SEO. Whatcha up to?


Post Thu Feb 24, 2005 8:53 pm
Euler



Joined: 02 Sep 2004
Posts: 109

Naw.

I simply want to log whether each visitor is a human or a spider while I'm logging the browser type.

And I just finished a solution:

Here's an excerpt from my log file (not the Apache log). Notice the lines beginning with Human or Robot.

quote:
[pre]
/==B=e=g=i=n====E=x=e=c=u=t=i=o=n====ThuFeb24_15:23:25CST2005===\
Robot (Yahoo! Slurp): 68.142.249.107 GET /
8 queries built in 0.0711851119995 seconds.
Decl: 0.00260710716248 seconds.
DB : 0.0118429660797 seconds.
Auth: 0.0150530338287 seconds.
Acls: 0.0275280475616 seconds.
Cont: 0.0114028453827 seconds.
Fini: 0.00275111198425 seconds.
\===E=n=d===T=r=a=n=s=a=c=t=i=o=n===ThuFeb24_15:23:26CST2005===/

/==B=e=g=i=n====E=x=e=c=u=t=i=o=n====ThuFeb24_15:23:26CST2005===\
Human (FIREFOX ver.1.0): 192.168.XXX.XXX GET do=sect&op=see&sct=2
8 queries built in 0.0713069438934 seconds.
Decl: 0.0024950504303 seconds.
DB : 0.0117490291595 seconds.
Auth: 0.0151648521423 seconds.
Acls: 0.0275340080261 seconds.
Cont: 0.0115809440613 seconds.
Fini: 0.00278306007385 seconds.
\===E=n=d===T=r=a=n=s=a=c=t=i=o=n===ThuFeb24_15:23:26CST2005===/
[/pre]



I found many solutions on the net, but all of them seemed to be optimized for programmer convenience. I needed one optimized for speed, so here is my current detection code.

quote:
[pre]
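$browser = array (
    "MSIE", // parent
    "OPERA",
    "MOZILLA", // parent
    "NETSCAPE",
    "FIREFOX",
    "SAFARI"
);

// Read the header once; some clients send no User-Agent at all.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$upper = strtoupper($agent); // uppercase once instead of on every pass
$request = $_SERVER['REMOTE_ADDR'] ." ". $_SERVER['REQUEST_METHOD'] ." ". $_SERVER['QUERY_STRING'] ."\n";

// Anything that matches no browser token is logged as a robot.
$whowhat = "Robot (". $agent ."): ". $request;
foreach ($browser as $parent) {
    $s = strpos($upper, $parent);
    // strpos() returns false for "not found" and 0 for a hit at the very start.
    // The plain if (strpos(...)) test skips both cases; that is kept here, spelled out,
    // because counting offset 0 would turn every "Mozilla/..." crawler into a human.
    // Side effect: agents that begin with "Opera/" are still logged as robots.
    if ($s === false || $s === 0) {
        continue;
    }
    // Take up to 5 characters after the token and keep only digits, dots and commas.
    $version = substr($agent, $s + strlen($parent), 5);
    $version = preg_replace('/[^0-9,.]/', '', $version);
    // No break: a later, more specific token (FIREFOX) overrides an earlier one.
    $whowhat = "Human (". $parent ." ver.". $version ."): ". $request;
}
fputs($log, $whowhat); // $log is the already-open handle of the log file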

[/pre]


I modified it from code I found at php.net from a gentleman by the name of Steve. So thanks, Steve, wherever you are.
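
For anyone who wants to try the matching without a live request, here is a minimal sketch that wraps the same loop in a function. The name describe_agent() and the three sample user-agent strings are illustrative assumptions, not part of Steve's code or of the script above. It mirrors the posted check as written, including the detail that a token found at offset 0 does not count as a match.

```php
<?php
// Hypothetical helper: returns the "Human (...)" or "Robot (...)" label
// for one user-agent string, using the same token list as the script above.
function describe_agent($agent)
{
    $tokens = array("MSIE", "OPERA", "MOZILLA", "NETSCAPE", "FIREFOX", "SAFARI");
    $upper  = strtoupper($agent);        // uppercase once, outside the loop
    $label  = "Robot (" . $agent . ")";  // default when no token matches

    foreach ($tokens as $token) {
        $pos = strpos($upper, $token);
        // Mirrors the posted truthiness test: "not found" (false) and
        // "found at offset 0" are both treated as no match.
        if ($pos === false || $pos === 0) {
            continue;
        }
        // Up to 5 characters after the token, reduced to digits, dots, commas.
        $version = substr($agent, $pos + strlen($token), 5);
        $version = preg_replace('/[^0-9,.]/', '', $version);
        // No break: a later token (FIREFOX) overrides an earlier one.
        $label = "Human (" . $token . " ver." . $version . ")";
    }
    return $label;
}

// Sample strings, assumed to be typical of early-2005 clients.
$samples = array(
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
);
foreach ($samples as $ua) {
    echo describe_agent($ua), "\n";
}
```

With these samples the first two lines come out as Human (FIREFOX and MSIE) and the Slurp line as Robot. The Slurp string starts with "Mozilla", so it is only the offset-0 rule that keeps it out of the Human group.
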

The result of all of this is that any time I wish to see the humans, I can "grep Human logfile" or, for the bots, "grep Robot logfile". Extremely convenient, efficient and stable.
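
If totals are more useful than the matching lines, the same split can be counted from PHP. This is only a sketch under assumptions: count_visitors() is a made-up name, and the demo writes two lines in the log format shown earlier to a temporary file rather than reading the real log. Unlike grep, it only counts lines that begin with the label.

```php
<?php
// Hypothetical helper: counts log lines that start with "Human (" or "Robot (".
function count_visitors($path)
{
    $counts = array("Human" => 0, "Robot" => 0);
    if (!is_readable($path)) {
        return $counts;               // missing or unreadable file: zero of each
    }
    $fh = fopen($path, "r");
    while (($line = fgets($fh)) !== false) {
        if (strncmp($line, "Human (", 7) === 0) {
            $counts["Human"]++;
        } elseif (strncmp($line, "Robot (", 7) === 0) {
            $counts["Robot"]++;
        }
    }
    fclose($fh);
    return $counts;
}

// Demo on a throwaway file holding one robot line and one human line.
$demo = tempnam(sys_get_temp_dir(), "log");
file_put_contents(
    $demo,
    "Robot (Yahoo! Slurp): 68.142.249.107 GET /\n" .
    "8 queries built in 0.0711851119995 seconds.\n" .
    "Human (FIREFOX ver.1.0): 192.168.XXX.XXX GET do=sect&op=see&sct=2\n"
);
print_r(count_visitors($demo));
unlink($demo);
```

Pointing count_visitors() at the real log path gives the same numbers as "grep -c '^Human' logfile" and "grep -c '^Robot' logfile".
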


Post Thu Feb 24, 2005 9:33 pm
All times are GMT.