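python - Accessing web pages linked from a web page using BeautifulSoup?
Question
I wrote a Python script that uses BeautifulSoup to parse data from a web page. What I want to do next is click each person's name on the page, visit their profile, then click the website link on that profile and scrape the email address from that website (if there is one). Can anyone help me with this? I am new to BeautifulSoup and Python, so I am stuck. Any help is appreciated. Thanks! The type of link I am working with is: https://www.realtor.com/realestateagents/agentname-john
Here is my code: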
from bs4 import BeautifulSoup
import requests
import csv
##################### Website
##################### URL
w_url = str('https://www.')+str(input('Please Enter Website URL :'))
####################### Number of
####################### Pages
pages = int(input(' Please specify number of pages: '))
####################### Range
####################### Specified
page_range = list(range(0,pages))
####################### WebSite
####################### Name ( in case of multiple websites )
#site_name = int(input('Enter the website name ( IN CAPITALS ) :'))
####################### Empty
####################### List
agent_info= []
####################### Creating
####################### CSV File
# newline='' stops the csv module from writing blank lines between rows on Windows
csv_file = open(r'D:\Webscraping\real_estate_agents.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Name and Number'])
####################### FOR
####################### LOOP
for k in page_range:
    website = requests.get(w_url+'/pg-'+'{}'.format(k)).text
    soup = BeautifulSoup(website,'lxml')
    class1 = 'jsx-1448471805 agent-name text-bold'
    class2 = 'jsx-1448471805 agent-phone hidden-xs hidden-xxs'
    for i in soup.find_all('div',class_=[[class1],[class2]]):
        w = i.text
        agent_info.append(w)
##################### Removing
##################### Duplicates
updated_info= list(dict.fromkeys(agent_info))
##################### Writing Data
##################### to CSV
for t in updated_info:
    print(t)
    csv_writer.writerow([t])
    print('\n')
csv_file.close()
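Solution
It is more efficient (and takes fewer lines of code) to get the data from the API. The website emails appear to be in there as well, so there is no need to visit each of the 30,000+ websites to get the email, and you can collect all the information in a fraction of the time.
The API also contains all the data you want or need. For example, here is the record for just one agent: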
{'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'advertiser_id': 2121274, 'agent_rating': 5, 'background_photo': {'href': 'https://ap.rdcpix.com/1223152681/cc48579b6a0fe6ccbbf44d83e8f82145g-c0o.jpg'}, 'broker': {'fulfillment_id': 3860509, 'designations': [], 'name': 'BRIDGE REALTY, LLC.', 'accent_color': '', 'photo': {'href': ''}, 'video': ''}, 'description': 'As a professional real estate agent licensed in the State of Minnesota, I am committed to providing only the highest standard of care as I assist you in navigating the twists and turns of home ownership. Whether you are buying or selling your home, I will do everything it takes to turn your real estate goals and desires into a reality. If you are looking for a real estate Agent who will put your needs first and go above and beyond to help you reach your goals, I am the agent for you.', 'designations': [], 'first_month': 0, 'first_name': 'John', 'first_year': 2010, 'has_photo': True, 'href': 'http://www.twincityhomes4sale.com', 'id': '56b63efd7e54f7010021459d', 'is_realtor': True, 'languages': [], 'last_name': 'Palomino', 'last_updated': 'Mon, 04 Jan 2021 18:46:12 GMT', 'marketing_area_cities': [{'city_state': 'Columbus_MN', 'name': 'Columbus', 'state_code': 'MN'}, {'city_state': 'Blaine_MN', 'name': 'Blaine', 'state_code': 'MN'}, {'city_state': 'Circle Pines_MN', 'name': 'Circle Pines', 'state_code': 'MN'}, {'city_state': 'Lino Lakes_MN', 'name': 'Lino Lakes', 'state_code': 'MN'}, {'city_state': 'Lexington_MN', 'name': 'Lexington', 'state_code': 'MN'}, {'city_state': 'Forest Lake_MN', 'name': 'Forest Lake', 'state_code': 'MN'}, {'city_state': 'Chisago City_MN', 'name': 'Chisago City', 'state_code': 'MN'}, {'city_state': 'Wyoming_MN', 'name': 'Wyoming', 'state_code': 'MN'}, {'city_state': 'Centerville_MN', 'name': 'Centerville', 'state_code': 'MN'}, {'city_state': 'Hugo_MN', 'name': 'Hugo', 
'state_code': 'MN'}, {'city_state': 'Grant_MN', 'name': 'Grant', 'state_code': 'MN'}, {'city_state': 'St. Anthony_MN', 'name': 'St. Anthony', 'state_code': 'MN'}, {'city_state': 'Arden Hills_MN', 'name': 'Arden Hills', 'state_code': 'MN'}, {'city_state': 'New Brighton_MN', 'name': 'New Brighton', 'state_code': 'MN'}, {'city_state': 'Mounds View_MN', 'name': 'Mounds View', 'state_code': 'MN'}, {'city_state': 'White Bear Township_MN', 'name': 'White Bear Township', 'state_code': 'MN'}, {'city_state': 'Vadnais Heights_MN', 'name': 'Vadnais Heights', 'state_code': 'MN'}, {'city_state': 'Shoreview_MN', 'name': 'Shoreview', 'state_code': 'MN'}, {'city_state': 'Little Canada_MN', 'name': 'Little Canada', 'state_code': 'MN'}, {'city_state': 'Columbia Heights_MN', 'name': 'Columbia Heights', 'state_code': 'MN'}, {'city_state': 'Hilltop_MN', 'name': 'Hilltop', 'state_code': 'MN'}, {'city_state': 'Fridley_MN', 'name': 'Fridley', 'state_code': 'MN'}, {'city_state': 'Linwood_MN', 'name': 'Linwood', 'state_code': 'MN'}, {'city_state': 'East Bethel_MN', 'name': 'East Bethel', 'state_code': 'MN'}, {'city_state': 'Spring Lake Park_MN', 'name': 'Spring Lake Park', 'state_code': 'MN'}, {'city_state': 'North St. Paul_MN', 'name': 'North St. Paul', 'state_code': 'MN'}, {'city_state': 'Maplewood_MN', 'name': 'Maplewood', 'state_code': 'MN'}, {'city_state': 'St. Paul_MN', 'name': 'St. 
Paul', 'state_code': 'MN'}], 'mls': [{'member': {'id': '506004321'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'A', 'primary': True}], 'nar_only': 1, 'nick_name': '', 'nrds_id': '506004321', 'office': {'name': 'Bridge Realty, Llc', 'mls': [{'member': {'id': '10982'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'O', 'primary': True}], 'phones': [{'ext': '', 'number': '(952) 368-0021', 'type': 'Home'}], 'phone_list': {'phone_1': {'type': 'Home', 'number': '(952) 368-0021', 'ext': ''}}, 'photo': {'href': ''}, 'slogan': '', 'website': None, 'video': None, 'fulfillment_id': 3027311, 'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'email': 'tony@thebridgerealty.com', 'nrds_id': None}, 'party_id': 23115328, 'person_name': 'John Palomino', 'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}], 'photo': {'href': 'https://ap.rdcpix.com/900899898/cc48579b6a0fe6ccbbf44d83e8f82145a-c0o.jpg'}, 'recommendations_count': 2, 'review_count': 7, 'role': 'agent', 'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'}, {'name': 'Forest Lake', 'state_code': 'MN'}, {'name': 'Hugo', 'state_code': 'MN'}, {'name': 'St. 
Paul', 'state_code': 'MN'}, {'name': 'Minneapolis', 'state_code': 'MN'}, {'name': 'Wyoming', 'state_code': 'MN'}], 'settings': {'share_contacts': False, 'full_access': False, 'recommendations': {'realsatisfied': {'user': 'John-Palomino', 'id': '1073IJk', 'linked': '3d91C', 'updated': '1529551719'}}, 'display_listings': True, 'far_override': True, 'show_stream': True, 'terms_of_use': True, 'has_dotrealtor': False, 'display_sold_listings': True, 'display_price_range': True, 'display_ratings': True, 'loaded_from_sb': True, 'broker_data_feed_opt_out': False, 'unsubscribe': {'autorecs': False, 'recapprove': False, 'account_notify': False}, 'new_feature_popup_closed': {'agent_left_nav_avatar_to_profile': False}}, 'slogan': 'Bridging the gap between buyers & sellers', 'specializations': [{'name': '1st time home buyers'}, {'name': 'Residential Listings'}, {'name': 'Rental/Investment Properties'}, {'name': 'Move Up Buyers'}], 'title': 'Agent', 'types': 'agent', 'user_languages': [], 'web_url': 'https://www.realtor.com/realestateagents/John-Palomino_BLOOMINGTON_MN_2121274_876599394', 'zips': ['55014', '55025', '55038', '55112', '55126', '55421', '55449', '55092', '55434', '55109'], 'email': 'johnpalomino@live.com', 'full_name': 'John Palomino', 'name': 'John Palomino, Agent', 'social_media': {'facebook': {'type': 'facebook', 'href': 'https://www.facebook.com/Johnpalominorealestate'}}, 'for_sale_price': {'count': 1, 'min': 299900, 'max': 299900, 'last_listing_date': '2021-01-29T11:10:24Z'}, 'recently_sold': {'count': 35, 'min': 115000, 'max': 460000, 'last_sold_date': '2020-12-18'}, 'agent_team_details': {'is_team_member': False}}
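To make the structure easier to see, here is a minimal sketch that pulls the contact fields out of a record shaped like the one above. The `agent` dict is a hand-trimmed copy of that record, and `contact_fields` is a hypothetical helper name, not something the API provides. It assumes the `office` entry may be missing or `None`.

```python
# Trimmed copy of the sample record above; only the contact-related keys are kept.
agent = {
    'person_name': 'John Palomino',
    'email': 'johnpalomino@live.com',
    'href': 'http://www.twincityhomes4sale.com',
    'office': {'name': 'Bridge Realty, Llc', 'email': 'tony@thebridgerealty.com'},
}


def contact_fields(agent):
    """Return name, email, website and office email, using 'N/A' for missing keys."""
    office = agent.get('office') or {}  # assumption: 'office' may be absent or None
    return {
        'name': agent.get('person_name', 'N/A'),
        'email': agent.get('email', 'N/A'),
        'website': agent.get('href', 'N/A'),
        'office_email': office.get('email', 'N/A'),
    }


print(contact_fields(agent))
```

Using `dict.get` with a default keeps the lookup to one line per field and avoids a `try`/`except` around the nested `office` entry.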
Code:
import requests
import pandas as pd
import math
# Function to pull the data
def get_agent_info(jsonData, rows):
    agents = jsonData['agents']
    for agent in agents:
        name = agent['person_name']
        # Agent email and website are optional keys
        if 'email' in agent.keys():
            email = agent['email']
        else:
            email = 'N/A'
        if 'href' in agent.keys():
            website = agent['href']
        else:
            website = 'N/A'
        # The office entry may be missing, None, or lack an email
        try:
            office_data = agent['office']
            office_email = office_data['email']
        except (KeyError, TypeError):
            office_email = 'N/A'
        row = {'name':name, 'email':email, 'website':website, 'office_email':office_email}
        rows.append(row)
    return rows
rows = []
url = 'https://www.realtor.com/realestateagents/api/v3/search'
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
payload = {'nar_only': '1','offset': '','limit': '300','marketing_area_cities': '_',
'postal_code': '','is_postal_search': 'true','name': 'john','types': 'agent',
'sort': 'recent_activity_high','far_opt_out': 'false','client_id': 'FAR2.0',
'recommendations_count_min': '','agent_rating_min': '','languages': '',
'agent_type': '','price_min': '','price_max': '','designations': '',
'photo': 'true'}
# Gets the 1st page, finds how many pages you'll need to go through, and parses the data
jsonData = requests.get(url, headers=headers, params=payload).json()
total_matchs = jsonData['matching_rows']
total_pages = math.ceil(total_matchs/300)
rows = get_agent_info(jsonData, rows)
print ('Completed: %s of %s' %(1,total_pages))
# Iterate through next pages
for page in range(1,total_pages):
    payload.update({'offset':page*300})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = get_agent_info(jsonData, rows)
    print ('Completed: %s of %s' %(page+1,total_pages))
df = pd.DataFrame(rows)
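The paging arithmetic in the code above can be checked on its own, without hitting the API. The sketch below is illustrative only: `page_offsets` is a hypothetical helper, and it treats the first request as offset 0, whereas the code above sends an empty `offset` for the first page (assumed here to be equivalent).

```python
import math


def page_offsets(total_matches, limit=300):
    """Return the offset of every page needed to cover total_matches rows."""
    total_pages = math.ceil(total_matches / limit)
    return [page * limit for page in range(total_pages)]


print(page_offsets(650))         # [0, 300, 600]
print(len(page_offsets(30600)))  # 102
```

With a limit of 300 rows per request, 30,600 matching rows need 102 requests in total.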
Output (only the first 10 of the 30,600 rows in df):
print(df.head(10).to_string())
   name               email                       website                                 office_email
0  John Croteau       jcrot45@gmail.com           https://www.facebook.com/JCtherealtor/  1worcesterhomes@gmail.com
1  Stephanie St John  sstjohn@shorewest.com       https://stephaniestjohn.shorewest.com   customercare@shorewest.com
2  Johnine Larsen     info@realestategals.com     http://realestategals.com               seattle@northwestrealtors.com
3  Leonard Johnson    americandreams@comcast.net  http://www.adrhomes.net                 americandreams@comcast.net
4  John C Fitzgerald  john@jcfhomes.com           http://www.JCFHomes.com
5  John Vrsansky Jr   John@OnTargetRealty.com     http://www.OnTargetRealty.com           john@ontargetrealty.com
6  John Williams      jwilliamsidaho@gmail.com    http://www.johnwilliamsidaho.com        mpickford@kw.com
7  John Zeiter        j.zeiter@ggsir.com                                                  info@ggsir.com
8  Mitch Johnson      mitchjohnson1316@gmail.com                                          miaroberson@creedrealty.com
9  John Lowe          jplowe4@gmail.com           http://johnlowegroup.com                thedavisgrouponline@gmail.com
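Since the question writes its results to a CSV file, here is a hedged sketch of saving rows like these with the standard `csv` module, dropping exact duplicates along the way. `write_rows` is a hypothetical helper, and the sample rows are copied from the output above; an in-memory buffer stands in for a real file.

```python
import csv
import io

FIELDNAMES = ['name', 'email', 'website', 'office_email']


def write_rows(rows, handle):
    """Write row dicts as CSV to an open text handle, skipping exact duplicates."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    seen = set()
    for row in rows:
        key = tuple(row[field] for field in FIELDNAMES)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)


# Rows copied from the output above; the first one is repeated to show de-duplication.
sample = [
    {'name': 'John Lowe', 'email': 'jplowe4@gmail.com',
     'website': 'http://johnlowegroup.com', 'office_email': 'thedavisgrouponline@gmail.com'},
    {'name': 'John Lowe', 'email': 'jplowe4@gmail.com',
     'website': 'http://johnlowegroup.com', 'office_email': 'thedavisgrouponline@gmail.com'},
    {'name': 'John Zeiter', 'email': 'j.zeiter@ggsir.com',
     'website': '', 'office_email': 'info@ggsir.com'},
]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a handle opened with `open(path, 'w', newline='', encoding='utf-8')`, where `path` is whatever output file you choose.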