python - How to extend a working scraper beyond the first page using Python
Question
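Hi, I am working through a Python script (pasted below). The code works for scraping the first page of results (25 listings per page). However, I would like to extend it to scrape results from at least 10 pages.
For example, I want to get the results for zip code 98021, which has 80 listings in total (up to page 4). However, when I run the code below with python zillow.py 98021 newest, it only shows 25 listings.
Since I am new to Python, I would appreciate your help in achieving this.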
from lxml import html
import requests
import unicodecsv as csv
import argparse

def parse(zipcode,filter=None):
    if filter=="newest":
        url = "https://www.zillow.com/homes/for_sale/{0}/0_singlestory/days_sort".format(zipcode)
    elif filter == "cheapest":
        url = "https://www.zillow.com/homes/for_sale/{0}/0_singlestory/pricea_sort/".format(zipcode)
    else:
        url = "https://www.zillow.com/homes/for_sale/{0}_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy".format(zipcode)
    for i in range(10):
        # try:
        headers= {
            'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'accept-encoding':'gzip, deflate, sdch, br',
            'accept-language':'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
            'cache-control':'max-age=0',
            'upgrade-insecure-requests':'1',
            'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
        }
        response = requests.get(url,headers=headers)
        print(response.status_code)
        parser = html.fromstring(response.text)
        search_results = parser.xpath("//div[@id='search-results']//article")
        properties_list = []
        for properties in search_results:
            raw_address = properties.xpath(".//span[@itemprop='address']//span[@itemprop='streetAddress']//text()")
            raw_city = properties.xpath(".//span[@itemprop='address']//span[@itemprop='addressLocality']//text()")
            raw_state= properties.xpath(".//span[@itemprop='address']//span[@itemprop='addressRegion']//text()")
            raw_postal_code= properties.xpath(".//span[@itemprop='address']//span[@itemprop='postalCode']//text()")
            raw_price = properties.xpath(".//span[@class='zsg-photo-card-price']//text()")
            raw_info = properties.xpath(".//span[@class='zsg-photo-card-info']//text()")
            raw_broker_name = properties.xpath(".//span[@class='zsg-photo-card-broker-name']//text()")
            url = properties.xpath(".//a[contains(@class,'overlay-link')]/@href")
            raw_title = properties.xpath(".//h4//text()")
            address = ' '.join(' '.join(raw_address).split()) if raw_address else None
            city = ''.join(raw_city).strip() if raw_city else None
            state = ''.join(raw_state).strip() if raw_state else None
            postal_code = ''.join(raw_postal_code).strip() if raw_postal_code else None
            price = ''.join(raw_price).strip() if raw_price else None
            info = ' '.join(' '.join(raw_info).split()).replace(u"\xb7",',')
            broker = ''.join(raw_broker_name).strip() if raw_broker_name else None
            title = ''.join(raw_title) if raw_title else None
            property_url = "https://www.zillow.com"+url[0] if url else None
            is_forsale = properties.xpath('.//span[@class="zsg-icon-for-sale"]')
            properties = {
                'address':address,
                'city':city,
                'state':state,
                'postal_code':postal_code,
                'price':price,
                'facts and features':info,
                'real estate provider':broker,
                'url':property_url,
                'title':title
            }
            if is_forsale:
                properties_list.append(properties)
        return properties_list
        # except:
        #     print ("Failed to process the page",url)

if __name__=="__main__":
    argparser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
    argparser.add_argument('zipcode',help = '')
    sortorder_help = """
    available sort orders are :
    newest : Latest property details,
    cheapest : Properties with cheapest price
    """
    argparser.add_argument('sort',nargs='?',help = sortorder_help,default ='Homes For You')
    args = argparser.parse_args()
    zipcode = args.zipcode
    sort = args.sort
    print ("Fetching data for %s"%(zipcode))
    scraped_data = parse(zipcode,sort)
    print ("Writing data to output file")
    with open("properties-%s.csv"%(zipcode),'wb') as csvfile:
        fieldnames = ['title','address','city','state','postal_code','price','facts and features','real estate provider','url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in scraped_data:
            writer.writerow(row)
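Answer
You need to scrape the link to the next page from the current page, then update the URL you are scraping.
Here is a rough example of how that could work: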
def parse(zipcode, url, filter=None):
    # get results the way you do now
    # get the url from the next-page button
    return results, next_page_url

full_results = []
results, next_page_url = parse(zipcode, initial_page_url, filter=filter)
full_results += results
while (len(results) >= 25 and next_page_url):
    results, next_page_url = parse(zipcode, next_page_url, filter=filter)
    full_results += results
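So in this example, parse takes the URL to scrape as its second positional argument and returns both the results and the URL of the next page to scrape.
The loop keeps scraping as long as a page holds the maximum number of results (25) and a next-page URL was returned.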
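The loop above can be tried out without touching the network. The following is a minimal sketch under some assumptions, not the answerer's exact code: scrape_all_pages, fake_parse_page, the example.com URL layout and the max_pages cap are all hypothetical names introduced here for illustration. In a real scraper, the function passed in would fetch the page with requests and read the next-page link with an XPath that matches the site's actual markup, which the answer does not show.

```python
from urllib.parse import urljoin

PAGE_SIZE = 25   # listings on a full page, as stated in the question
MAX_PAGES = 10   # assumed safety cap so a bad "next" link cannot loop forever


def scrape_all_pages(first_url, parse_page, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Collect results from first_url and every page that follows it.

    parse_page(url) must return (results, next_href). next_href is None
    on the last page and may be a relative link.
    """
    all_results = []
    url = first_url
    seen = set()
    for _ in range(max_pages):
        if url in seen:  # guard against a page that links back to itself
            break
        seen.add(url)
        results, next_href = parse_page(url)
        all_results.extend(results)
        # Stop on a short page or when there is no next-page link.
        if len(results) < page_size or not next_href:
            break
        # Resolve relative links against the page they came from.
        url = urljoin(url, next_href)
    return all_results


# Hypothetical stand-in for the real site: 80 listings spread over 4 pages.
BASE = "https://example.com/homes/98021/"
LISTINGS = ["listing-%d" % n for n in range(1, 81)]


def fake_parse_page(url):
    """Pretend to fetch and parse one page; the URL layout is made up."""
    # The page number is the last path segment, e.g. ".../2_p/"; page 1 has none.
    tail = url[len(BASE):].strip("/")
    page = int(tail.split("_")[0]) if tail else 1
    start = (page - 1) * PAGE_SIZE
    results = LISTINGS[start:start + PAGE_SIZE]
    has_more = start + PAGE_SIZE < len(LISTINGS)
    next_href = "/homes/98021/%d_p/" % (page + 1) if has_more else None
    return results, next_href


if __name__ == "__main__":
    listings = scrape_all_pages(BASE, fake_parse_page)
    print(len(listings))
```

With the fake data, the loop visits four pages and stops on the last one because it holds only 5 listings. Passing the page parser in as an argument keeps the stopping rules (short page, missing link, page cap) separate from the site-specific XPath, so either part can change without touching the other.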