python - Yellow Pages scraper in Python stopped working
Problem description
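I am trying to scrape data from Yellow Pages. I have used this scraper successfully several times, but it recently stopped working. I noticed that the Yellow Pages website changed recently: they added a sponsored-links table containing three results. Since that change, the only thing my scraper picks up is the advertisement below the sponsored-links table. It does not retrieve any of the actual results.
Where am I going wrong?
I have included my code below. As an example, it shows a search for 7-Eleven locations in Wisconsin.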
import requests
from bs4 import BeautifulSoup
import csv

my_url = "https://www.yellowpages.com/search?search_terms=7-eleven&geo_location_terms=WI&page={}"
for link in [my_url.format(page) for page in range(1,20)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    placeHolder = []
    for item in soup.select(".info"):
        try:
            name = item.select("[itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            streetAddress = item.select("[itemprop='streetAddress']")[0].text
        except Exception:
            streetAddress = ""
        try:
            addressLocality = item.select("[itemprop='addressLocality']")[0].text
        except Exception:
            addressLocality = ""
        try:
            addressRegion = item.select("[itemprop='addressRegion']")[0].text
        except Exception:
            addressRegion = ""
        try:
            postalCode = item.select("[itemprop='postalCode']")[0].text
        except Exception:
            postalCode = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""
        with open('yp-7-eleven-wi.csv', 'a') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow([name, streetAddress, addressLocality, addressRegion, postalCode, phone])
Solution
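There are several issues in your existing script. You created a for loop that is supposed to iterate over 19 different pages, whereas the content is limited to a single page. The selectors you defined no longer match the elements on the page. You also repeat the try:except block many times, which makes the scraper look very messy. You can define a custom function instead, so that a missing element no longer raises IndexError or AttributeError. Finally, you can use csv.DictWriter() to write the scraped items to a csv file.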
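As a minimal sketch of that last point, using only the standard library: csv.DictWriter takes a list of field names and writes each dictionary as one row, so a value that was not found can simply be an empty string. The rows and the in-memory buffer below are made up for illustration; they are not scraped data.

```python
import csv
import io

# Hypothetical rows standing in for scraped listings (not real data).
rows = [
    {"name": "Example Cafe", "phone": "(555) 010-0000"},
    {"name": "Another Shop", "phone": ""},  # a missing phone stays an empty string
]

# An in-memory buffer replaces a real file so the sketch has no side effects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "phone"])
writer.writeheader()      # first line: the field names
writer.writerows(rows)    # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead, opening it with newline="" avoids blank lines between rows on Windows.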
Give this a try:
import requests
import csv
from bs4 import BeautifulSoup

def get_text(item, path):
    # Return the text of the first element matching the CSS selector, or "" if none matches.
    node = item.select_one(path)
    return node.text if node is not None else ""

placeHolder = []
urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=WI&page={}".format(page) for page in range(1,5)]
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect one dictionary per listing on the current page.
    for item in soup.select(".info"):
        d = {}
        d['name'] = get_text(item,"a.business-name span")
        d['streetAddress'] = get_text(item,".street-address")
        d['addressLocality'] = get_text(item,".locality")
        d['addressRegion'] = get_text(item,".locality + span")
        d['postalCode'] = get_text(item,".locality + span + span")
        d['phone'] = get_text(item,".phones")
        placeHolder.append(d)

# Write all rows once, after every page has been scraped.
with open("yellowpageInfo.csv","w",newline="") as infile:
    writer = csv.DictWriter(infile,['name','streetAddress','addressLocality','addressRegion','postalCode','phone'])
    writer.writeheader()
    for elem in placeHolder:
        writer.writerow(elem)
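The helper-function idea can be checked offline, without hitting the site. The sketch below runs the same kind of helper against a small hand-written HTML string. That markup is an assumption made up for illustration and may not match the live page. It also uses the built-in "html.parser" rather than "lxml", only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may be structured differently.
SAMPLE_HTML = """
<div class="info">
  <a class="business-name"><span>Example Cafe</span></a>
  <div class="phones">(555) 010-0000</div>
</div>
<div class="info">
  <a class="business-name"><span>No Phone Shop</span></a>
</div>
"""

def get_text(item, path):
    """Return the text of the first match for the CSS selector, or "" if nothing matches."""
    node = item.select_one(path)
    return node.text if node is not None else ""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = [
    (get_text(item, "a.business-name span"), get_text(item, ".phones"))
    for item in soup.select(".info")
]
print(results)
```

The second listing has no phone element, so the helper returns an empty string for it instead of raising, which is what the repeated try:except blocks were guarding against.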