Accessing a web page within a web page using BeautifulSoup?

Problem description

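I wrote a Python script that uses BeautifulSoup to parse data from a web page. As a next step, I want to click each person's name on the page to open their profile, then click the website link on that profile and scrape the email ID (if any) from that website. Can anyone help me with this? I'm new to BeautifulSoup and Python, so I'm stuck. Any help is appreciated. Thanks! The kind of link I'm working with is: https://www.realtor.com/realestateagents/agentname-john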

Here is my code:

from bs4 import BeautifulSoup
import requests
import csv

# Website URL (the user types everything after "https://www.")
w_url = 'https://www.' + input('Please Enter Website URL :')

# Number of pages to fetch
pages = int(input(' Please specify number of pages: '))
page_range = range(pages)

# Website name (in case of multiple websites)
#site_name = int(input('Enter the website name ( IN CAPITALS ) :'))

# Collected name and phone strings
agent_info = []

# Class attribute values of the name and phone <div> elements
class1 = 'jsx-1448471805 agent-name text-bold'
class2 = 'jsx-1448471805 agent-phone hidden-xs hidden-xxs'

for k in page_range:
    website = requests.get(w_url + '/pg-{}'.format(k)).text
    soup = BeautifulSoup(website, 'lxml')

    # class_ takes a flat list of strings; a <div> matching either one is returned
    for i in soup.find_all('div', class_=[class1, class2]):
        agent_info.append(i.text)

# Remove duplicates while keeping the original order
updated_info = list(dict.fromkeys(agent_info))

# Write the data to CSV; newline='' avoids blank rows on Windows
with open(r'D:\Webscraping\real_estate_agents.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Name and Number'])
    for t in updated_info:
        print(t)
        csv_writer.writerow([t])
        print('\n')

Tags: python, web-scraping, beautifulsoup

Solution


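It would be more efficient (and take fewer lines of code) to get the data from the API. The agent's website and email also appear to be in the API response, so there is no need to visit each of the 30,000+ websites to get the email, and you can collect all the information in a fraction of the time.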
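If a record turns out to have no email and you do have to visit the agent's website as the question describes, the extraction step might look roughly like the sketch below. It is only an illustration: `find_emails` is a hypothetical helper, the regular expression is deliberately simple and will miss some valid addresses, and `sample_html` stands in for text you would fetch yourself (for example with `requests.get(website).text`).

```python
import re

# Simple pattern for illustration only; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html):
    """Return the unique email-like strings found in html, in order of appearance."""
    # dict.fromkeys drops duplicates while preserving order
    return list(dict.fromkeys(EMAIL_RE.findall(html)))


# Stand-in for the HTML of an agent's website (made-up content)
sample_html = (
    '<a href="mailto:jane@example.com">Email Jane</a>'
    '<p>Office: office@example.com, jane@example.com</p>'
)

print(find_emails(sample_html))
```

Running a regex over the raw HTML also picks up addresses inside `mailto:` links, which is why no separate link parsing is done here.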

The API also contains all the data you want or need. For example, here is the record for just one agent:

{'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'advertiser_id': 2121274, 'agent_rating': 5, 'background_photo': {'href': 'https://ap.rdcpix.com/1223152681/cc48579b6a0fe6ccbbf44d83e8f82145g-c0o.jpg'}, 'broker': {'fulfillment_id': 3860509, 'designations': [], 'name': 'BRIDGE REALTY, LLC.', 'accent_color': '', 'photo': {'href': ''}, 'video': ''}, 'description': 'As a professional real estate agent licensed in the State of Minnesota, I am committed to providing only the highest standard of care as I assist you in navigating the twists and turns of home ownership. Whether you are buying or selling your home, I will do everything it takes to turn your real estate goals and desires into a reality. If you are looking for a real estate Agent who will put your needs first and go above and beyond to help you reach your goals, I am the agent for you.', 'designations': [], 'first_month': 0, 'first_name': 'John', 'first_year': 2010, 'has_photo': True, 'href': 'http://www.twincityhomes4sale.com', 'id': '56b63efd7e54f7010021459d', 'is_realtor': True, 'languages': [], 'last_name': 'Palomino', 'last_updated': 'Mon, 04 Jan 2021 18:46:12 GMT', 'marketing_area_cities': [{'city_state': 'Columbus_MN', 'name': 'Columbus', 'state_code': 'MN'}, {'city_state': 'Blaine_MN', 'name': 'Blaine', 'state_code': 'MN'}, {'city_state': 'Circle Pines_MN', 'name': 'Circle Pines', 'state_code': 'MN'}, {'city_state': 'Lino Lakes_MN', 'name': 'Lino Lakes', 'state_code': 'MN'}, {'city_state': 'Lexington_MN', 'name': 'Lexington', 'state_code': 'MN'}, {'city_state': 'Forest Lake_MN', 'name': 'Forest Lake', 'state_code': 'MN'}, {'city_state': 'Chisago City_MN', 'name': 'Chisago City', 'state_code': 'MN'}, {'city_state': 'Wyoming_MN', 'name': 'Wyoming', 'state_code': 'MN'}, {'city_state': 'Centerville_MN', 'name': 'Centerville', 'state_code': 'MN'}, {'city_state': 'Hugo_MN', 'name': 'Hugo', 
'state_code': 'MN'}, {'city_state': 'Grant_MN', 'name': 'Grant', 'state_code': 'MN'}, {'city_state': 'St. Anthony_MN', 'name': 'St. Anthony', 'state_code': 'MN'}, {'city_state': 'Arden Hills_MN', 'name': 'Arden Hills', 'state_code': 'MN'}, {'city_state': 'New Brighton_MN', 'name': 'New Brighton', 'state_code': 'MN'}, {'city_state': 'Mounds View_MN', 'name': 'Mounds View', 'state_code': 'MN'}, {'city_state': 'White Bear Township_MN', 'name': 'White Bear Township', 'state_code': 'MN'}, {'city_state': 'Vadnais Heights_MN', 'name': 'Vadnais Heights', 'state_code': 'MN'}, {'city_state': 'Shoreview_MN', 'name': 'Shoreview', 'state_code': 'MN'}, {'city_state': 'Little Canada_MN', 'name': 'Little Canada', 'state_code': 'MN'}, {'city_state': 'Columbia Heights_MN', 'name': 'Columbia Heights', 'state_code': 'MN'}, {'city_state': 'Hilltop_MN', 'name': 'Hilltop', 'state_code': 'MN'}, {'city_state': 'Fridley_MN', 'name': 'Fridley', 'state_code': 'MN'}, {'city_state': 'Linwood_MN', 'name': 'Linwood', 'state_code': 'MN'}, {'city_state': 'East Bethel_MN', 'name': 'East Bethel', 'state_code': 'MN'}, {'city_state': 'Spring Lake Park_MN', 'name': 'Spring Lake Park', 'state_code': 'MN'}, {'city_state': 'North St. Paul_MN', 'name': 'North St. Paul', 'state_code': 'MN'}, {'city_state': 'Maplewood_MN', 'name': 'Maplewood', 'state_code': 'MN'}, {'city_state': 'St. Paul_MN', 'name': 'St. 
Paul', 'state_code': 'MN'}], 'mls': [{'member': {'id': '506004321'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'A', 'primary': True}], 'nar_only': 1, 'nick_name': '', 'nrds_id': '506004321', 'office': {'name': 'Bridge Realty, Llc', 'mls': [{'member': {'id': '10982'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'O', 'primary': True}], 'phones': [{'ext': '', 'number': '(952) 368-0021', 'type': 'Home'}], 'phone_list': {'phone_1': {'type': 'Home', 'number': '(952) 368-0021', 'ext': ''}}, 'photo': {'href': ''}, 'slogan': '', 'website': None, 'video': None, 'fulfillment_id': 3027311, 'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'email': 'tony@thebridgerealty.com', 'nrds_id': None}, 'party_id': 23115328, 'person_name': 'John Palomino', 'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}], 'photo': {'href': 'https://ap.rdcpix.com/900899898/cc48579b6a0fe6ccbbf44d83e8f82145a-c0o.jpg'}, 'recommendations_count': 2, 'review_count': 7, 'role': 'agent', 'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'}, {'name': 'Forest Lake', 'state_code': 'MN'}, {'name': 'Hugo', 'state_code': 'MN'}, {'name': 'St. 
Paul', 'state_code': 'MN'}, {'name': 'Minneapolis', 'state_code': 'MN'}, {'name': 'Wyoming', 'state_code': 'MN'}], 'settings': {'share_contacts': False, 'full_access': False, 'recommendations': {'realsatisfied': {'user': 'John-Palomino', 'id': '1073IJk', 'linked': '3d91C', 'updated': '1529551719'}}, 'display_listings': True, 'far_override': True, 'show_stream': True, 'terms_of_use': True, 'has_dotrealtor': False, 'display_sold_listings': True, 'display_price_range': True, 'display_ratings': True, 'loaded_from_sb': True, 'broker_data_feed_opt_out': False, 'unsubscribe': {'autorecs': False, 'recapprove': False, 'account_notify': False}, 'new_feature_popup_closed': {'agent_left_nav_avatar_to_profile': False}}, 'slogan': 'Bridging the gap between buyers & sellers', 'specializations': [{'name': '1st time home buyers'}, {'name': 'Residential Listings'}, {'name': 'Rental/Investment Properties'}, {'name': 'Move Up Buyers'}], 'title': 'Agent', 'types': 'agent', 'user_languages': [], 'web_url': 'https://www.realtor.com/realestateagents/John-Palomino_BLOOMINGTON_MN_2121274_876599394', 'zips': ['55014', '55025', '55038', '55112', '55126', '55421', '55449', '55092', '55434', '55109'], 'email': 'johnpalomino@live.com', 'full_name': 'John Palomino', 'name': 'John Palomino, Agent', 'social_media': {'facebook': {'type': 'facebook', 'href': 'https://www.facebook.com/Johnpalominorealestate'}}, 'for_sale_price': {'count': 1, 'min': 299900, 'max': 299900, 'last_listing_date': '2021-01-29T11:10:24Z'}, 'recently_sold': {'count': 35, 'min': 115000, 'max': 460000, 'last_sold_date': '2020-12-18'}, 'agent_team_details': {'is_team_member': False}}
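To show how the nested parts of such a record can be read, here is a minimal sketch. `contact_card` is a hypothetical helper, and `sample_agent` is an abridged copy of the record above (only a few keys, and only two of the served areas); it assumes other records share this layout, which the API does not guarantee.

```python
def contact_card(agent):
    """Collect a few contact fields from one agent record, tolerating missing keys."""
    office = agent.get('office') or {}          # 'office' may be absent or None
    phones = agent.get('phones') or []          # list of {'number': ..., 'type': ...}
    areas = agent.get('served_areas') or []     # list of {'name': ..., 'state_code': ...}
    return {
        'name': agent.get('person_name', 'N/A'),
        'email': agent.get('email', 'N/A'),
        'office_email': office.get('email', 'N/A'),
        'phone': phones[0].get('number', 'N/A') if phones else 'N/A',
        'served_areas': [area['name'] for area in areas],
    }


# Abridged copy of the record shown above
sample_agent = {
    'person_name': 'John Palomino',
    'email': 'johnpalomino@live.com',
    'office': {'name': 'Bridge Realty, Llc', 'email': 'tony@thebridgerealty.com'},
    'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}],
    'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'},
                     {'name': 'Forest Lake', 'state_code': 'MN'}],
}

print(contact_card(sample_agent))
```

The phone number is included because the script in the question collected names and numbers.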

Code:

import requests
import pandas as pd
import math

PAGE_SIZE = 300  # number of agents requested per API call

# Pull name, email, website and office email out of one page of API results
def get_agent_info(jsonData, rows):
    for agent in jsonData['agents']:
        # 'office' may be missing or None, so fall back to an empty dict
        office_data = agent.get('office') or {}
        row = {'name': agent['person_name'],
               'email': agent.get('email', 'N/A'),
               'website': agent.get('href', 'N/A'),
               'office_email': office_data.get('email', 'N/A')}
        rows.append(row)
    return rows

rows = []
url = 'https://www.realtor.com/realestateagents/api/v3/search'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
payload = {'nar_only': '1','offset': '','limit': str(PAGE_SIZE),'marketing_area_cities':  '_',
           'postal_code': '','is_postal_search': 'true','name': 'john','types': 'agent',
           'sort': 'recent_activity_high','far_opt_out': 'false','client_id': 'FAR2.0',
           'recommendations_count_min': '','agent_rating_min': '','languages': '',
           'agent_type': '','price_min': '','price_max': '','designations': '',
           'photo': 'true'}

# Get the 1st page, work out how many pages you'll need to go through, and parse the data
jsonData = requests.get(url, headers=headers, params=payload).json()
total_matches = jsonData['matching_rows']
total_pages = math.ceil(total_matches / PAGE_SIZE)
rows = get_agent_info(jsonData, rows)
print('Completed: %s of %s' % (1, total_pages))

# Iterate through the remaining pages by advancing the offset
for page in range(1, total_pages):
    payload.update({'offset': page * PAGE_SIZE})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = get_agent_info(jsonData, rows)
    print('Completed: %s of %s' % (page + 1, total_pages))

df = pd.DataFrame(rows)

Output: only the first 10 of 30,600 rows

print(df.head(10).to_string())
                name                       email                                 website                   office_email
0       John Croteau           jcrot45@gmail.com  https://www.facebook.com/JCtherealtor/      1worcesterhomes@gmail.com
1  Stephanie St John       sstjohn@shorewest.com   https://stephaniestjohn.shorewest.com     customercare@shorewest.com
2     Johnine Larsen     info@realestategals.com               http://realestategals.com  seattle@northwestrealtors.com
3    Leonard Johnson  americandreams@comcast.net                 http://www.adrhomes.net     americandreams@comcast.net
4  John C Fitzgerald           john@jcfhomes.com                 http://www.JCFHomes.com                               
5  John Vrsansky  Jr     John@OnTargetRealty.com           http://www.OnTargetRealty.com        john@ontargetrealty.com
6      John Williams    jwilliamsidaho@gmail.com        http://www.johnwilliamsidaho.com               mpickford@kw.com
7        John Zeiter          j.zeiter@ggsir.com                                                         info@ggsir.com
8      Mitch Johnson  mitchjohnson1316@gmail.com                                            miaroberson@creedrealty.com
9          John Lowe           jplowe4@gmail.com                http://johnlowegroup.com  thedavisgrouponline@gmail.com
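Since the script in the question wrote its results to a CSV file, the collected rows can be saved the same way. The sketch below uses only the standard library; `write_rows` is a hypothetical helper, the two sample rows are copied from the output above, and an in-memory buffer stands in for a real file.

```python
import csv
import io

FIELDS = ['name', 'email', 'website', 'office_email']


def write_rows(rows, fileobj):
    """Write a list of row dicts (as built by get_agent_info) to fileobj as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Two rows taken from the output table above
sample_rows = [
    {'name': 'John Croteau', 'email': 'jcrot45@gmail.com',
     'website': 'https://www.facebook.com/JCtherealtor/',
     'office_email': '1worcesterhomes@gmail.com'},
    {'name': 'John Zeiter', 'email': 'j.zeiter@ggsir.com',
     'website': '', 'office_email': 'info@ggsir.com'},
]

# In-memory buffer for the demonstration; for a real file use
# open(path, 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(buffer.getvalue())
```

With the DataFrame from the answer already in hand, `df.to_csv('agents.csv', index=False)` does the same job in one line (the file name here is just an example).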
