php - How to detect social media giants' bots and refine the user agent in PHP?
Problem description
I am trying to build a script that captures the user agent of each visitor, which can easily be done with $_SERVER['HTTP_USER_AGENT'].
For example, I simply posted the link to my PHP script on Twitter, and the script recorded the bots that requested it.
Here are the user agents captured via HTTP_USER_AGENT from the Twitter network:
1. Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.2) Gecko/20090729 Firefox/52.0
2. Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
3. Mozilla/5.0 (compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
4. Mozilla/5.0 (compatible; TrendsmapResolver/0.1)
5. (Not sure whether this is a bot or a normal user agent) Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
6. Twitterbot/1.0
7. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
Now I want to extract (filter out) the bot name from each detected HTTP_USER_AGENT value.
For example:
rv:1.9.1.2
Trident/4.0
(compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
(compatible; TrendsmapResolver/0.1)
Twitterbot/1.0
(Applebot/0.1; +http://www.apple.com/go/applebot)
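The filtering described above could be sketched with a regular expression that pulls bot-like product tokens out of the user agent string. This is only a minimal sketch: the function name extractBotTokens and the list of name suffixes (bot, crawler, spider, resolver) are assumptions made for illustration, not an established list. User agents such as entries 1, 2 and 5 carry no such token, so this approach returns nothing for them.

```php
<?php
// Hypothetical helper: return bot-like "Name/version" tokens found in a UA string.
function extractBotTokens(string $userAgent): array
{
    // Assumed heuristic: a product name ending in bot, crawler, spider or resolver,
    // optionally followed by "/version". The lookbehind skips names that sit inside
    // a URL path or host (e.g. ".../go/applebot" or ".../robot/").
    $pattern = '~(?<![/.\w-])([a-z0-9_-]*(?:bot|crawler|spider|resolver))\b(?:/([\w.]+))?~i';
    preg_match_all($pattern, $userAgent, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        // $m[2] is only set when a version followed the name.
        $tokens[] = isset($m[2]) ? $m[1] . '/' . $m[2] : $m[1];
    }
    return $tokens;
}
```

Calling extractBotTokens($_SERVER['HTTP_USER_AGENT'] ?? '') returns an array, so an empty result can be logged as "no bot token found" rather than treated as proof of a human visitor.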
What I have tried so far:
// Fall back to an empty string when the header is missing.
$userAgent = $_SERVER["HTTP_USER_AGENT"] ?? "";
$file = fopen("crawl.txt", "a");
if (
    strpos($userAgent, "Twitterbot/1.0") !== false ||
    strpos($userAgent, "Applebot/0.1") !== false
) {
    fwrite($file, "TW-bot detected.\n");
    echo "TW-bot detected.";
} else {
    fwrite($file, "Nothing found.\n");
    echo "Nothing";
}
fclose($file);
But somehow the above code is not working: crawl.txt always shows "Nothing found." Please let me know where I am going wrong and what the proper way to detect bots is. Any direction or guidance is appreciated.
Solution
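You may find it easy to spot bots that fetch simple site previews, but identifying the user agents of bots that scrape restricted content is much harder.
You need to do more than parse the UA. You also need to interrogate REMOTE_ADDR. You would run each request through something like http://ip-api.com to determine whether it comes from a data center. Watch out for users behind proxies, since they will trigger false positives. You can go further and probe browser capabilities with JavaScript, but be aware that this is a hard problem and an ongoing arms race between the providers of detection tools and (usually) black-hat advertisers.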
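As a rough illustration of the REMOTE_ADDR check, the sketch below asks an IP lookup service whether an address belongs to a hosting provider. The function name isDatacenterIp, the injectable $fetch parameter, and the exact endpoint path and field names (status, hosting) are assumptions here and should be checked against the provider's own documentation before use.

```php
<?php
// Hypothetical helper: true if the IP looks like a data-center address,
// false if not, null if the lookup could not be completed.
function isDatacenterIp(string $ip, ?callable $fetch = null): ?bool
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return null; // not a usable address
    }

    // Assumed endpoint and field names; verify against the provider's docs.
    $url = 'http://ip-api.com/json/' . $ip . '?fields=status,hosting';

    // $fetch lets callers inject a cache or a stub; the default is a plain HTTP GET.
    $fetch = $fetch ?? static function (string $u) {
        $ctx = stream_context_create(['http' => ['timeout' => 2]]);
        return @file_get_contents($u, false, $ctx);
    };

    $body = $fetch($url);
    $data = is_string($body) ? json_decode($body, true) : null;
    if (!is_array($data) || ($data['status'] ?? '') !== 'success') {
        return null; // lookup failed: treat as unknown rather than as a bot
    }
    return !empty($data['hosting']);
}
```

A call such as isDatacenterIp($_SERVER['REMOTE_ADDR'] ?? '') returns null when the lookup fails, so a service outage does not get mistaken for a bot. Because proxy users can cause false positives, the result is probably best treated as one signal among several rather than as a verdict, and cached per IP to avoid one lookup per request.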