python - Getting no data from the server when scraping a site
Problem description
I have extracted the items from a particular website and now want to write them to an .xls file.
I expected a full Excel sheet with the headings and rows of information, but I get a sheet with only the headings.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact')
soup = bs(res.content, 'lxml')
names=[]
positions=[]
phone=[]
emails=[]
links=[]
nlist = soup.find_all('li', class_='agent-name')
plist= soup.find_all('li',class_='agent-role')
phlist = soup.find_all('li', class_='agent-officenum')
elist = soup.find_all('a',class_='val withicon')
for n1 in nlist:
    names.append(n1.text)
    links.append(n1.get('href'))
for p1 in plist:
    positions.append(p1.text)
for ph1 in phlist:
    phone.append(ph1.text)
for e1 in elist:
    emails.append(e1.get('href'))
df = pd.DataFrame(list(zip(names,positions,phone,emails,links)),columns=['Names','Position','Phone','Email','Link'])
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2', index = False, header=True)
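Solution
For example, I tried printing the result of one of your soup calls, nlist = soup.find_all('li', class_='agent-name'),
and it returns an empty list. The soup call did not find any data.
Looking further, the soup itself contains none of the expected page content: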
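The empty result also explains the headings-only sheet: zip() stops at its shortest input, so a single empty list leaves the DataFrame with column names and no rows. A minimal sketch of that behaviour follows; build_rows and the sample values are hypothetical and not part of the original code.

```python
def build_rows(*columns):
    """Combine parallel lists into rows the way the question's code does."""
    # zip() stops at the shortest input, so one empty list means zero rows.
    return list(zip(*columns))


# Hypothetical sample values standing in for scraped data.
names = ["Agent A", "Agent B"]
positions = ["Sales", "Principal"]
emails = []  # e.g. a selector that matched nothing

print(build_rows(names, positions))          # two rows
print(build_rows(names, positions, emails))  # no rows at all
```

Checking the length of each list before building the DataFrame is a cheap way to notice this kind of silent truncation.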
soup = bs(res.content, 'lxml')
print(soup)
This gives:
<html>
<head><title>429 Too Many Requests</title></head>
<body bgcolor="white">
<center><h1>429 Too Many Requests</h1></center>
<hr/><center>nginx</center>
</body>
</html>
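It looks like the site detects you as a bot and is not letting you scrape it. You can pretend to be a web browser by following the answer here: Web scraping with Python using BeautifulSoup 429 error
Update:
Adding a User-Agent to the request does the trick: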
res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers = {'User-agent': 'Super Bot 9000'})
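You can now get the desired output.
Some websites reject requests whose User-Agent is missing or looks like a script (requests sends python-requests/<version> by default), and this site appears to be one of them. Adding a User-Agent makes your request look more normal, so the site lets it through. There isn't really any standard for this; it varies from site to site.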
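One way to catch this kind of block early is to check the HTTP status before parsing, so a 429 page raises an error instead of quietly producing empty lists. The sketch below assumes a hypothetical helper fetch_html and a made-up User-Agent string; whether a given site accepts that value is not guaranteed.

```python
import requests

# Hypothetical User-Agent value; what a site accepts varies from site to site.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) example-scraper/0.1"}


def fetch_html(url, session=None, timeout=10):
    """Return the page body, raising instead of silently parsing an error page."""
    http = session if session is not None else requests.Session()
    res = http.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx codes such as 429.
    res.raise_for_status()
    return res.text
```

The returned text can be passed to bs(..., 'lxml') exactly as before; the only change is that a rejected request now fails loudly.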