python-3.x - How to scrape multiple pages with Selenium (Python)
Problem Description
I have seen several solutions for scraping multiple pages from a website, but I can't get any of them to work with my code.
Right now I have the code below, which scrapes the first page. I would like to create a loop that scrapes all pages of the site (from page 1 to page 5).
import time

import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("window-size=1400,600")

# Use a random user agent for each run
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome('/Users/raduulea/Documents/chromedriver', options=options)
driver.get('https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre/liege/4000?page=1')
time.sleep(10)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all("div", {"class": "result-xl"})

title = []
address = []
price = []
surface = []
desc = []

for result in results:
    title.append(result.find("div", {"class": "title-bar-left"}).get_text().strip())
    address.append(result.find("span", {"class": "result-adress"}).get_text().strip())
    price.append(result.find("div", {"class": "xl-price rangePrice"}).get_text().strip())
    surface.append(result.find("div", {"class": "xl-surface-ch"}).get_text().strip())
    desc.append(result.find("div", {"class": "xl-desc"}).get_text().strip())

df = pd.DataFrame({"Title": title, "Address": address, "Price": price, "Surface": surface, "Description": desc})
df.to_csv("output.csv")
Solution
Try the code below. It can traverse all the pages, not just the first 5: on each iteration it checks whether a "next" button is available and, if so, loads the next page; otherwise it breaks out of the while loop.
import time

import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("window-size=1400,600")

# Use a random user agent for each run
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome('/Users/raduulea/Documents/chromedriver', options=options)
driver.get('https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre')
time.sleep(10)

title = []
address = []
price = []
surface = []
desc = []

page = 2
while True:
    time.sleep(10)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    results = soup.find_all("div", {"class": "result-xl"})
    for result in results:
        title.append(result.find("div", {"class": "title-bar-left"}).get_text().strip())
        address.append(result.find("span", {"class": "result-adress"}).get_text().strip())
        price.append(result.find("div", {"class": "xl-price rangePrice"}).get_text().strip())
        surface.append(result.find("div", {"class": "xl-surface-ch"}).get_text().strip())
        desc.append(result.find("div", {"class": "xl-desc"}).get_text().strip())
    # If a "next" link exists, load the next results page; otherwise we are done.
    if len(driver.find_elements_by_css_selector("a.next")) > 0:
        url = "https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre/?page={}".format(page)
        driver.get(url)
        page += 1
        # This stops after 5 pages, as asked; comment out the if block below to scrape them all.
        if page > 5:
            break
    else:
        break

df = pd.DataFrame({"Title": title, "Address": address, "Price": price, "Surface": surface, "Description": desc})
df.to_csv("output.csv")
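Since the question only asks for pages 1 through 5, a simpler variant is to build the page URLs up front and replace the while loop with a plain for loop. This is a minimal sketch assuming the site keeps the same `?page=` query parameter; `scrape_page` is a hypothetical stand-in for the BeautifulSoup parsing shown above.

```python
BASE_URL = "https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre"

def build_page_urls(first, last):
    """Build the search-result URLs for pages first..last (inclusive)."""
    return [f"{BASE_URL}/?page={n}" for n in range(first, last + 1)]

# With the URLs known in advance, the next-button check is no longer needed:
# for url in build_page_urls(1, 5):
#     driver.get(url)
#     time.sleep(10)
#     scrape_page(driver.page_source)  # hypothetical: the per-page parsing above
print(build_page_urls(1, 2))
```

The trade-off: this fetches a fixed range regardless of how many result pages actually exist, whereas the next-button check in the answer stops exactly when the listing runs out.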