BeautifulSoup4 and Python - multiple pages into a DataFrame

Problem description

I have some code that collects the description, price, and old price (if the item is discounted) from an online retailer across multiple pages. I want to export this to a DataFrame; I've tried, but I run into the following error:

ValueError: Shape of passed values is (1, 3210), indices imply (3, 3210)

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

# Start Timer
then = time.time()

# Headers
headers = {"User-Agent": "Mozilla/5.0"}

# Set HTTPCode = 200 and Counter = 1
Code = 200
i = 1

scraped_data = []
while Code == 200:

    # Put url together
    url = "https://www.asos.com/women/jumpers-cardigans/cat/?cid=2637&page="
    url = url + str(i)

    # Request URL
    r = requests.get(url, allow_redirects=False, headers=headers)  # No redirects to allow infinite page count
    data = r.text
    Code = r.status_code

    # Soup
    soup = BeautifulSoup(data, 'lxml')

    # For loop each product then scroll through title price, old price and description
    divs = soup.find_all('article', attrs={'class': '_2qG85dG'}) # want to cycle through each of these

    for div in divs:

        # Get Description
        Description = div.find('div', attrs={'class': '_3J74XsK'})
        Description = Description.text.strip()
        scraped_data.append(Description)

        # Fetch TitlePrice
        NewPrice = div.find('span', attrs={'data-auto-id':'productTilePrice'})
        NewPrice = NewPrice.text.strip("£")
        scraped_data.append(NewPrice)

        # Fetch OldPrice
        try:
            OldPrice = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
            OldPrice = OldPrice.text.strip("£")
            scraped_data.append(OldPrice)
        except AttributeError:
            OldPrice = ""
            scraped_data.append(OldPrice)

    print('page', i, 'scraped')
    # Print Array
    #array = {"Description": str(Description), "CurrentPrice": str(NewPrice), "Old Price": str(OldPrice)}
    #print(array)
    i = i + 1
else:
    i = i - 2
    now = time.time()
    pd.DataFrame(scraped_data, columns=["A", "B", "C"])
    print('Parse complete with', i, 'pages' + ' in', now-then, 'seconds')

Tags: html, pandas, dataframe, parsing, beautifulsoup

Solution


Right now your data is appended to the list by an algorithm I would describe as:

  1. Load a web page
  2. Append value A to the list
  3. Append value B to the list
  4. Append value C to the list

For each pass over the dataset, this creates:

[A1, B1, C1, A2, B2, C2]
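
In isolation, that flat shape already reproduces the error. A minimal sketch (the exact shape quoted in the message depends on your pandas version):

import pandas as pd

# Six values in one flat list: pandas sees a single column of length 6,
# while the three column labels imply three columns -> ValueError.
flat = ["A1", "B1", "C1", "A2", "B2", "C2"]
try:
    pd.DataFrame(flat, columns=["A", "B", "C"])
except ValueError as exc:
    print(exc)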

Only one column contains the data, and that is exactly what pandas is telling you. To build the dataframe correctly, you need to swap this for tuples (heh) of three values per row entry, e.g.:

[
    (A1, B1, C1),
    (A2, B2, C2)
]
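
With the rows grouped like that, the three-column constructor lines up. A small sketch with made-up values (not real scraped data):

import pandas as pd

rows = [
    ("Chunky knit jumper", "30.00", "45.00"),   # discounted item
    ("Ribbed cardigan", "25.00", ""),           # no old price
]
df = pd.DataFrame(rows, columns=["Description", "CurrentPrice", "OldPrice"])
print(df)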

Or, the way I prefer, because it is more robust against coding mistakes and inconsistent data lengths: build each row as a dictionary of columns. So,

rowdict_list = []
for row in data_source:
    a = extract_a()
    b = extract_b()
    c = extract_c()
    rowdict_list.append({'column_a': a, 'column_b': b, 'column_c': c})

and the dataframe is then easy to construct without explicitly naming the columns in the constructor: df = pd.DataFrame(rowdict_list)
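
Applied to the loop in the question, that approach could look roughly like the sketch below (the selectors and class names are copied from the original code and not re-verified against the live site; the optional OldPrice is handled with a None check instead of try/except):

scraped_data = []
for div in divs:
    # Description
    description = div.find('div', attrs={'class': '_3J74XsK'}).text.strip()

    # Current price
    new_price = div.find('span', attrs={'data-auto-id': 'productTilePrice'}).text.strip("£")

    # Old price only exists for discounted items
    old_price_tag = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
    old_price = old_price_tag.text.strip("£") if old_price_tag is not None else ""

    scraped_data.append({
        "Description": description,
        "CurrentPrice": new_price,
        "OldPrice": old_price,
    })

df = pd.DataFrame(scraped_data)  # column names come straight from the dict keys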

