KeyError when scraping websites for keywords and status

Problem Description

While checking multiple websites from my input CSV file, I am currently trying to combine two things:

  1. Check the HTTP status
  2. Check whether the website displays a specific keyword

Then save the results to a new CSV file.

My input.csv:

id    url
1     https://example123.com
2     https://envato.com/blog/30-outstanding-coming-soon-and-under-construction-website-templates/
3     https://mundoshoponline.com

My code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import asyncio
import re

from concurrent.futures import ProcessPoolExecutor, as_completed

df = pd.read_csv('path/to/my/input.csv')

#my csv has urls in the 2nd column (index 1)
urls = df.T.values.tolist()[1]
results = {}
status = []

async def scrape(url):
 try:
    r = requests.get(url, timeout=(3, 6))
    r.raise_for_status()
    soup = BeautifulSoup(r.content, 'html.parser')

    #all keywords to check on the website
    data = {
    "coming soon": soup.body.findAll(text = re.compile("coming soon", re.I)),
    "Opening Soon": soup.body.findAll(text = re.compile("Opening Soon", re.I)),
    "Forbidden": soup.body.findAll(text = re.compile("Forbidden", re.I)),
    "Page not found": soup.body.findAll(text = re.compile("Page not found", re.I)),
    "Under Construction": soup.body.findAll(text = re.compile("Under Construction", re.I)),
    "Currently Unavailable": soup.body.findAll(text = re.compile("Currently Unavailable", re.I))}
    results[url] = data
 #check for http status and save to status list 
 except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        status.append("Down")
 except requests.exceptions.HTTPError:
        status.append("Other")
 else:
        status.append("OK")


async def main():
    await asyncio.wait([scrape(url) for url in urls])

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()

comingList= []
openingList = []
forbiddenList= []
notfoundList = []
underList = []
currentlyList = []
#mark x if there are any hits for specific keyword
for url in results:
    comingList.append("x" if len(results[url]["coming soon"]) > 0 else "")
    openingList.append("x" if len(results[url]["Opening Soon"]) > 0 else "")
    forbiddenList.append("x" if len(results[url]["Forbidden"]) > 0 else "")
    notfoundList.append("x" if len(results[url]["Page not found"]) > 0 else "")           
    underList.append("x" if len(results[url]["Under Construction"]) > 0 else "")
    currentlyList.append("x" if len(results[url]["Currently Unavailable"]) > 0 else "")


df["comingSoon"] = pd.DataFrame(comingList, columns=['comingSoon'])
df["openingSoon"] = pd.DataFrame(openingList, columns=['openingSoon'])
df["forbidden"] = pd.DataFrame(forbiddenList, columns=['forbidden'])
df["notfound2"] = pd.DataFrame(notfoundList, columns=['notfound2'])
df["underConstruction"] = pd.DataFrame(underList, columns=['underConstruction'])
df["currentlyUnavailable"] = pd.DataFrame(currentlyList, columns=['currentlyUnavailable'])
df['status'] = status

print(df)

df.to_csv('path/to/my/output.csv', index=False)

However, whenever I run the above script with for url in urls: for some of my urls, it raises this error, the script breaks, and no output.csv is produced:

Traceback (most recent call last):
  File "path/to/myscan.py", line 51, in <module>
    comingList.append("x" if len(results[url]["coming soon"]) > 0 else "")
KeyError: 'http://example123.com'

When run with for url in results: instead, output.csv looks like this:

[screenshot of output.csv]

This seems wrong, because the first row has keywords marked as present (the comingSoon and underConstruction columns) while the status column says Down, yet the website contains neither the string "coming soon" nor "under construction".

Could somebody help me fix this? I believe there may be a problem in my loop or in the try/except part of the code. I'm happy to provide more information if the above isn't enough. Thank you in advance.

Tags: python, beautifulsoup, python-requests

Solution


I think your main issue is that you are iterating over the whole list of urls, some of which may have failed and therefore do not exist as keys in results.

A safer approach is to iterate over the subset of urls that you know succeeded and have a key in results. So instead of

for url in urls:

you can do

for url in results:
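As a minimal illustration (hypothetical urls and a hand-built results dict, not the real scrape output): iterating the full list hits the failed url and raises KeyError, while iterating the dict only visits urls that actually succeeded.

```python
# Hypothetical data: pretend the second url failed, so it never
# became a key in results.
urls = ["https://a.example", "https://b.example", "https://c.example"]
results = {"https://a.example": {}, "https://c.example": {}}

# Iterating over the full list raises KeyError on the failed url:
failed = None
try:
    for url in urls:
        _ = results[url]
except KeyError as exc:
    failed = str(exc)
print(failed)  # the url that was never added to results

# Iterating over the dict itself only yields keys that exist:
visited = [url for url in results]
print(visited)  # only the urls that succeeded
```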

To keep the final results aligned with the input order of your urls:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import asyncio
import re

from concurrent.futures import ProcessPoolExecutor, as_completed
df = pd.read_csv('./input.csv')

#urls hardcoded here for testing; normally they come from the csv
urls = [ 'example123.com', 'https://envato.com/blog/30-outstanding-coming-soon-and-under-construction-website-templates/', 'http://alotechgear.com'] 
results = {}
status = {}
async def scrape(url):
 try:
    r = requests.get(url, timeout=(3, 6))
    r.raise_for_status()
    soup = BeautifulSoup(r.content, 'html.parser')

    #all keywords to check on the website
    data = {
    "coming soon": soup.body.findAll(text = re.compile("coming soon", re.I)),
    "Opening Soon": soup.body.findAll(text = re.compile("Opening Soon", re.I)),
    "Forbidden": soup.body.findAll(text = re.compile("Forbidden", re.I)),
    "Page not found": soup.body.findAll(text = re.compile("Page not found", re.I)),
    "Under Construction": soup.body.findAll(text = re.compile("Under Construction", re.I)),
    "Currently Unavailable": soup.body.findAll(text = re.compile("Currently Unavailable", re.I))}
    results[url] = data
 #check for http status and save to status list 
 except (requests.exceptions.ConnectionError, requests.exceptions.Timeout, requests.exceptions.MissingSchema):
     status[url] = "Down"
 except requests.exceptions.HTTPError:
     status[url] = "Other"
 else:
     status[url] = "OK"


async def main():
    await asyncio.wait([scrape(url) for url in urls])

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()

comingList= []
openingList = []
forbiddenList= []
notfoundList = []
underList = []
currentlyList = []
statusList = []
#mark x if there are any hits for specific keyword

for url in urls:
    if(not results.get(url)):
        statusList.append(status.get(url))
        notfoundList.append("x")
        comingList.append("-")
        openingList.append("-")
        forbiddenList.append("-")
        underList.append("-")
        currentlyList.append("-")
    else:
        statusList.append(status.get(url))
        comingList.append("x" if len(results[url].get("coming soon")) > 0 else "-")
        openingList.append("x" if len(results[url].get("Opening Soon")) > 0 else "-")
        forbiddenList.append("x" if len(results[url].get("Forbidden")) > 0 else "-")
        notfoundList.append("x" if len(results[url].get("Page not found")) > 0 else "-")           
        underList.append("x" if len(results[url].get("Under Construction")) > 0 else "-")
        currentlyList.append("x" if len(results[url].get("Currently Unavailable")) > 0 else "-")

df["comingSoon"] = pd.DataFrame(comingList, columns=['comingSoon'])
df["openingSoon"] = pd.DataFrame(openingList, columns=['openingSoon'])
df["forbidden"] = pd.DataFrame(forbiddenList, columns=['forbidden'])
df["notfound2"] = pd.DataFrame(notfoundList, columns=['notfound2'])
df["underConstruction"] = pd.DataFrame(underList, columns=['underConstruction'])
df["currentlyUnavailable"] = pd.DataFrame(currentlyList, columns=['currentlyUnavailable'])
df['status'] = pd.DataFrame(statusList, columns=['Status'])

print(df)
df.to_csv('./output.csv', index=False)

Sample result:

                                           id    url comingSoon openingSoon forbidden notfound2 underConstruction currentlyUnavailable status
0                       1     https://example123.com          -           -         -         x                 -                    -   Down
1  2     https://envato.com/blog/30-outstanding-c...          x           -         -         -                 x                    -     OK
2                  3     https://mundoshoponline.com          -           -         -         x                 -                    -   Down
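One side note beyond the fix above: requests.get is a blocking call, so wrapping it in an async def does not actually run the requests concurrently; asyncio just runs the coroutines one after another. A sketch of real concurrency using a thread pool instead (the fetch_status helper and the urls below are illustrative, not part of the original script):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_status(url):
    """Return (url, status_label) for a single site; never raises."""
    try:
        r = requests.get(url, timeout=(3, 6))
        r.raise_for_status()
    except (requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            requests.exceptions.MissingSchema):
        return url, "Down"
    except requests.exceptions.HTTPError:
        return url, "Other"
    return url, "OK"

urls = ["https://example.com", "https://example.org"]
with ThreadPoolExecutor(max_workers=8) as pool:
    # pool.map preserves input order, so the dict stays aligned with
    # the urls regardless of which request finishes first.
    status = dict(pool.map(fetch_status, urls))
```

Because status is keyed by url, it can be looked up per row exactly like the status dict in the answer's code.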
