Yellow Pages scraper in Python stopped working

Problem description

I'm trying to scrape data from Yellow Pages. I've used this scraper successfully several times, but it recently stopped working. I noticed the Yellow Pages site changed recently: they added a table of three sponsored-link results. Since that change, the only thing my scraper finds is the ad below that sponsored-links table. It doesn't retrieve any of the actual results.

Where am I going wrong?

I've included my code below. As an example, it runs a search for 7-Eleven locations in Wisconsin.

import requests
from bs4 import BeautifulSoup
import csv

my_url = "https://www.yellowpages.com/search?search_terms=7-eleven&geo_location_terms=WI&page={}"
for link in [my_url.format(page) for page in range(1,20)]:
  res = requests.get(link)
  soup = BeautifulSoup(res.text, "lxml")

placeHolder = []
for item in soup.select(".info"):
  try:
    name = item.select("[itemprop='name']")[0].text
  except Exception:
    name = ""
  try:
    streetAddress = item.select("[itemprop='streetAddress']")[0].text
  except Exception:
    streetAddress = ""
  try:
    addressLocality = item.select("[itemprop='addressLocality']")[0].text
  except Exception:
    addressLocality = ""
  try:
    addressRegion = item.select("[itemprop='addressRegion']")[0].text
  except Exception:
    addressRegion = ""
  try:
    postalCode = item.select("[itemprop='postalCode']")[0].text
  except Exception:
    postalCode = ""
  try:
    phone = item.select("[itemprop='telephone']")[0].text
  except Exception:
    phone = ""

  with open('yp-7-eleven-wi.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, streetAddress, addressLocality, addressRegion, postalCode, phone])

Tags: python, web-scraping, yellow-pages

Solution

There are several problems in your existing script. You wrote a for loop that is supposed to visit 19 different pages, but the parsing happens outside that loop, so in effect only the last page is ever processed. The selectors you defined no longer match the elements on the page. On top of that, you repeat the try/except block over and over, which makes the scraper very messy; you can define a small helper function to get rid of the IndexError/AttributeError handling. Finally, you can use csv.DictWriter() to write the scraped items to a csv file.
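To see the pagination bug in miniature: when the parsing step sits after the request loop instead of inside it, only the variables left over from the final iteration are ever parsed. A stdlib-only sketch of the wrong and the corrected control flow (the page names here are placeholders, standing in for real URLs and responses):

```python
pages = ["page-1", "page-2", "page-3"]   # placeholders for the real URLs

# Buggy shape: the loop finishes first, then parsing runs once and
# sees only what the final iteration left behind.
parsed_wrong = []
for page in pages:
    body = page.upper()        # stands in for requests.get(...) + BeautifulSoup(...)
parsed_wrong.append(body)      # runs once, after the loop: only "PAGE-3" survives

# Fixed shape: parse each response inside the same loop iteration.
parsed_right = []
for page in pages:
    body = page.upper()
    parsed_right.append(body)  # every page is kept
```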

Give this a try:

import requests
import csv
from bs4 import BeautifulSoup

def get_text(item, path):
    # Return the matched element's text, or "" when the selector finds nothing.
    node = item.select_one(path)
    return node.text if node else ""

placeHolder = []

urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=WI&page={}".format(page) for page in range(1, 5)]
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.select(".info"):
        d = {}
        d['name'] = get_text(item, "a.business-name span")
        d['streetAddress'] = get_text(item, ".street-address")
        d['addressLocality'] = get_text(item, ".locality")
        d['addressRegion'] = get_text(item, ".locality + span")
        d['postalCode'] = get_text(item, ".locality + span + span")
        d['phone'] = get_text(item, ".phones")
        placeHolder.append(d)

with open("yellowpageInfo.csv", "w", newline="") as infile:
    writer = csv.DictWriter(infile, ['name', 'streetAddress', 'addressLocality', 'addressRegion', 'postalCode', 'phone'])
    writer.writeheader()
    for elem in placeHolder:
        writer.writerow(elem)
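The helper-function and csv.DictWriter pattern above can be exercised offline against a static snippet, which is handy when the live site keeps changing. The HTML below is made up for illustration (it only mimics the shape of a result card), and it uses the built-in html.parser so no network access is required:

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a single Yellow Pages result card.
html = """
<div class="info">
  <a class="business-name"><span>Acme Coffee</span></a>
  <div class="street-address">123 Main St</div>
  <div class="phones">(555) 555-0100</div>
</div>
"""

def get_text(item, path):
    # select_one returns None on no match, so one truthiness check
    # replaces every try/except around indexed lookups.
    node = item.select_one(path)
    return node.text if node else ""

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".info")
row = {
    "name": get_text(item, "a.business-name span"),
    "phone": get_text(item, ".phones"),
    "fax": get_text(item, ".fax"),  # selector matches nothing -> ""
}

# DictWriter maps the dict straight onto the declared columns.
buf = io.StringIO()
writer = csv.DictWriter(buf, ["name", "phone", "fax"])
writer.writeheader()
writer.writerow(row)
```

Missing fields simply come out as empty strings, so one malformed card can no longer crash the whole run.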
