Issue with BeautifulSoup when reading target URLs from a CSV

Problem Description

When I hard-code a single URL into the url variable, the scraper works as expected, but when I try to read the links from a CSV I get no results at all. Any help is appreciated.

Information about the CSV: it has a header row with a Links column holding the target URLs (this is what link['Links'] reads in the code below).
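A placeholder sketch of what urls.csv presumably looks like (the URLs here are hypothetical, inferred only from the link['Links'] lookup):

    Links
    https://example.com/app-one
    https://example.com/app-two

And here is the code: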

    import requests  # required to make request
    from bs4 import BeautifulSoup  # required to parse html
    import pandas as pd
    import csv
    
    with open("urls.csv") as infile:
        reader = csv.DictReader(infile)
        for link in reader:
            res = requests.get(link['Links'])
            #print(res.url)
    url = res
    
    page = requests.get(url)
    
    soup = BeautifulSoup(page.text, 'html.parser')
    
    email_elm0 = soup.find_all(class_= "app-support-list__item")[0].text.strip()
    email_elm1 = soup.find_all(class_= "app-support-list__item")[1].text.strip()
    email_elm2 = soup.find_all(class_= "app-support-list__item")[2].text.strip()
    email_elm3 = soup.find_all(class_= "app-support-list__item")[3].text.strip()
    
    final_email_elm = (email_elm0,email_elm1,email_elm2,email_elm3)
    
    
    print(final_email_elm)
    
    df = pd.DataFrame(final_email_elm)
    
    #getting an output in csv format for the dataframe we created
    #df.to_csv('draft_part2_scrape.csv')

Tags: python, pandas, beautifulsoup

Solution

The problem is in this part of the code:

    with open("urls.csv") as infile:
        reader = csv.DictReader(infile)
        for link in reader:
            res = requests.get(link['Links'])
    ...

After the loop finishes, res holds only the response for the last link, so the program ends up scraping just that one page.
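A minimal sketch of the same overwrite pattern (hypothetical values, not from the question):

    res = None
    for value in ["first", "second", "last"]:
        res = value   # res is reassigned on every pass
    print(res)        # prints 'last' -- only the final value survives the loop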

To fix this, store all of the links in a list and iterate over that list, scraping each link in turn. You can collect each page's results in its own DataFrame and concatenate them at the end, so everything is written to a single file:

    import requests                # required to make requests
    from bs4 import BeautifulSoup  # required to parse HTML
    import pandas as pd
    import csv

    # Collect every link from the CSV instead of overwriting a single variable
    links = []
    with open("urls.csv") as infile:
        reader = csv.DictReader(infile)
        for link in reader:
            links.append(link['Links'])

    # Scrape each link into its own DataFrame
    dfs = []
    for url in links:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')

        # Query the matching elements once, then keep the first four;
        # slicing also tolerates pages with fewer than four items
        items = soup.find_all(class_="app-support-list__item")
        final_email_elm = tuple(item.text.strip() for item in items[:4])
        print(final_email_elm)

        dfs.append(pd.DataFrame(final_email_elm))

    # Concatenate the per-page DataFrames and write a single CSV output
    df = pd.concat(dfs)
    df.to_csv('draft_part2_scrape.csv')
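As a side note (a sketch, not part of the original answer): since pandas is already imported, the csv module isn't strictly needed; the same Links column can be read straight into a list:

    import pandas as pd

    # Read the Links column of urls.csv directly into a Python list
    links = pd.read_csv("urls.csv")["Links"].tolist()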
