Python web scraping and saving to a Pandas DataFrame

Problem Description

I'm trying to scrape a house listing from a remax page and save that information to a Pandas DataFrame, but for some reason it keeps giving me a KeyError. Here is my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')
details_t = pd.DataFrame(detail_title)

This is the error I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-3be49b8e4cfc> in <module>
      6 soup = BeautifulSoup(response.text, 'html.parser')
      7 detail_title = soup.find_all(class_='detail-title')
----> 8 details_t = pd.DataFrame(detail_title)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    449                 else:
    450                     mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451                                        copy=copy)
    452             else:
    453                 mgr = init_dict({}, index, columns, dtype=dtype)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    144     # by definition an array here
    145     # the dtypes will be coerced to a single dtype
--> 146     values = prep_ndarray(values, copy=copy)
    147 
    148     if dtype is not None:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    228         try:
    229             if is_list_like(values[0]) or hasattr(values[0], 'len'):
--> 230                 values = np.array([convert(v) for v in values])
    231             elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
    232                 # GH#21861

~/anaconda3/lib/python3.7/site-packages/bs4/element.py in __getitem__(self, key)
   1014         """tag[key] returns the value of the 'key' attribute for the tag,
   1015         and throws an exception if it's not there."""
-> 1016         return self.attrs[key]
   1017 
   1018     def __iter__(self):

KeyError: 0

Any help would be greatly appreciated!

Tags: python, pandas, dataframe, web-scraping, beautifulsoup

Solution

You can try this. I'm assuming you only need the text inside the <span> tags, but feel free to adapt my working example (a variant that also collects the matching values is sketched after the output below).

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')

# find_all() returns a ResultSet of Tag objects; extract each tag's text
# into a plain Python list, which pandas accepts as an Iterable.
titles = [tag.text for tag in detail_title]

df = pd.DataFrame(data=titles)

print(df)

Output

                           0
0            Property Type:
1             Property Tax:
2             Last Updated:
3        Property Sub Type:
4                  MLS® #:
5           Ownership-Type:
6               Year Built:
7                     sqft:
8              Date Listed:
9                 Lot Size:
10               Occupancy:
11             Subdivision:
12                 Heating:
13          Heating Source:
14          Full Bathrooms:
15          Half Bathrooms:
16                   Rooms:
17                Basement:
18    Basement Development:
19                Flooring:
20          Parking Spaces:
21                 Parking:
22                    Area:
23                Exterior:
24              Foundation:
25                    Roof:
26                   Faces:
27  Miscellaneous Features:
28         Lot Description:
29                   Condo:
30                Board ID:
31                   Suite:
32                Features:
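If you also want the value that goes with each label, you could collect a second list and build a two-column frame. This is only a sketch: 'detail-value' is a hypothetical class name (I haven't verified what the remax page actually uses for the value elements), so inspect the page source and substitute the real one.

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

labels = [tag.text for tag in soup.find_all(class_='detail-title')]
# NOTE: 'detail-value' is a guess -- check the page source and replace it
# with the class actually applied to the value elements.
values = [tag.text for tag in soup.find_all(class_='detail-value')]

# DataFrame-from-dict requires equal-length columns; trim to the shorter
# list in case the two selectors don't match one-to-one.
n = min(len(labels), len(values))
df = pd.DataFrame({'label': labels[:n], 'value': values[:n]})
print(df)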

Edit: print(type(detail_title)) gives <class 'bs4.element.ResultSet'>, which is not an accepted data type. From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html:

data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
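To see where the KeyError: 0 in the traceback comes from: while converting the input, pandas/NumPy end up indexing an individual Tag with an integer, and Tag.__getitem__ treats any key as an attribute name, so 0 is looked up in tag.attrs and raises. A minimal self-contained reproduction (no network access needed):

from bs4 import BeautifulSoup

# A tiny stand-in for one element of the scraped ResultSet.
soup = BeautifulSoup('<span class="detail-title">Property Type:</span>',
                     'html.parser')
tag = soup.find(class_='detail-title')

print(tag['class'])  # ['detail-title'] -- string keys look up attributes

# Integer indexing goes through the same attribute lookup, which is
# the failure pandas triggers internally in prep_ndarray.
try:
    tag[0]
except KeyError as err:
    print('KeyError:', err)  # KeyError: 0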

