Scraping real estate agent data with BeautifulSoup

Problem description

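I'm trying to help some realtor friends by scraping some data from realtor.com with BeautifulSoup.

I want a list of the realtors' names and phone numbers, but each value comes back as a separate list item, and every realtor on the page appears twice.

Here is what I have so far: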

from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

allRealtors = []
pages = np.arange(1, 2, 1)
for page in pages:
    page = requests.get("https://www.realtor.com/realestateagents/New-Orleans_LA/pg-" + str(page))
    soup = BeautifulSoup(page.text, 'html.parser')
    realtors = soup.find_all('div', {"class", ['jsx-1448471805 agent-name text-bold', 'jsx-1448471805 agent-phone hidden-xs hidden-xxs']})
    for item in realtors:
        allRealtors += item
print(allRealtors)

Here is the current content of the allRealtors list:

['Lisa Shedlock', '(504) 330-8233', 'Lisa Shedlock', '(504) 330-8233', 'Heather Laughlin', '(504) 256-6180', 'Heather Laughlin', '(504) 256-6180', 'LIZ ASHE', '(504) 401-4285', 'LIZ ASHE', '(504) 401-4285', 'Richard Haffner', '(504) 456-2961', 'Richard Haffner', '(504) 456-2961', 'Shelly Vallee', '(504) 975-6014', 'Shelly Vallee', '(504) 975-6014', 'Britt Galloway, Agent', '(504) 455-0100', 'Britt Galloway, Agent', '(504) 455-0100', 'Catherine Goens Gerrets, Agent', '(504) 439-8464', 'Catherine Goens Gerrets, Agent', '(504) 439-8464', 'Suzy Lamore', '(504) 729-8818', 'Suzy Lamore', '(504) 729-8818', 'Patti Faulder', '(504) 799-1702', 'Patti Faulder', '(504) 799-1702']

It creates a duplicate of each realtor's name and phone number. Ideally, I would store the two values together as a dictionary, like this:

{name:'Lisa Shedlock', number:'(504) 330-8233'; name:'Heather Laughlin', number:'(504) 256-6180'}

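I would then convert that dictionary into a pandas DataFrame with name and phone number columns.

However, this is my first time using BeautifulSoup and I don't know how to do this. Any suggestions?

Is there a simpler way to achieve this?

Thanks!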

Tags: python, beautifulsoup

Solution

You can use CSS selectors like this:

from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

# Maps each agent's name to their phone number
realtors_data = {}
pages = np.arange(1, 2, 1)  # only page 1; raise the stop value to fetch more pages
print("PAGES: ", pages)
# CSS selectors for the agent name and agent phone elements
names_selector = "ul > div > div > div > div > div > a > div"
phone_selector = "ul > div > div > div > div > div > div.jsx-1448471805.agent-phone.hidden-xs.hidden-xxs"
for page in pages:
    # Keep the response in its own variable instead of overwriting the loop variable
    response = requests.get("https://www.realtor.com/realestateagents/New-Orleans_LA/pg-" + str(page))
    soup = BeautifulSoup(response.text, 'html.parser')
    names = soup.select(names_selector)
    phones = soup.select(phone_selector)

    # Pair names and phones by position (assumes both lists are in the same order)
    for name, phone in zip(names, phones):
        realtors_data[name.get_text()] = phone.get_text()

# Printing data
print(realtors_data)

Output:

{'Lisa Shedlock': '(504) 330-8233', 'Heather Laughlin': '(504) 256-6180', 'LIZ ASHE': '(504) 401-4285', 'Richard Haffner': '(504) 456-2961', 'Shelly Vallee': '(504) 975-6014', 'Britt Galloway, Agent': '(504) 455-0100', 'Catherine Goens Gerrets, Agent': '(504) 439-8464', 'Suzy Lamore': '(504) 729-8818', 'Patti Faulder': '(504) 799-1702', 'Susan Ann Bourgeois': '(504) 236-7836', 'Lane Washburn': '(504) 909-0824', 'Brandy Dufrene': '(504) 330-2963', 'Claire E Hohensee': '(504) 654-9353', 'Aaron DareTeam': '(504) 899-8666', 'Kara Breithaupt': '(504) 444-6400', 'Joli Tolbert-Burrell': '(504) 982-5654', 'AMANDA MILLER': '(504) 250-0059', 'Carla Lawson': '(504) 329-5164', 'Michael D. Lester': '(504) 559-4652', 'Michael A. Newcomer': '(504) 321-1654'}
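
The question also asked for name/number records and a pandas DataFrame, so here is a minimal sketch of that last step. It runs against a small inline HTML sample rather than the live site, and the `agent-card`, `agent-name` and `agent-phone` class names, the `extract_realtors` helper and the third agent without a phone number are all assumptions made for illustration; the real page markup may well differ. Looking up the name and phone inside each card, instead of zipping two page-wide lists, is one way to keep the pairs aligned if an agent has no phone listed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the live page; class names are assumptions.
SAMPLE_HTML = """
<div class="agent-list">
  <div class="agent-card">
    <div class="agent-name">Lisa Shedlock</div>
    <div class="agent-phone">(504) 330-8233</div>
  </div>
  <div class="agent-card">
    <div class="agent-name">Heather Laughlin</div>
    <div class="agent-phone">(504) 256-6180</div>
  </div>
  <div class="agent-card">
    <div class="agent-name">Agent Without Phone</div>
  </div>
</div>
"""


def extract_realtors(html):
    """Return one {'name': ..., 'number': ...} dict per agent card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.agent-card"):
        # Search inside the card so a name is never paired with another agent's phone
        name = card.select_one("div.agent-name")
        phone = card.select_one("div.agent-phone")
        if name is None:
            continue  # skip cards that carry no agent name
        rows.append({
            "name": name.get_text(strip=True),
            "number": phone.get_text(strip=True) if phone else None,
        })
    return rows


rows = extract_realtors(SAMPLE_HTML)
# drop_duplicates removes repeated name/number rows like those in the question
df = pd.DataFrame(rows, columns=["name", "number"]).drop_duplicates()
print(df)
```

A list of dicts is used here instead of a dict keyed by name, so two agents who share a name would both be kept, and the list goes straight into `pd.DataFrame`.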
