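Problem description

I'm trying to scrape some basic information from an Amazon search page. The XPath I'm using seems correct, but the code below gives me only the first result on every iteration of the for loop; in other words, the first book's title is printed once for each search result on page 1. Am I doing something wrong?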

from selenium import webdriver
from time import sleep

PATH = 'ChromeDriver/chromedriver'

driver = webdriver.Chrome(PATH)
driver.get('https://www.amazon.in/s?k=python+books&ref=nb_sb_noss')

sleep(2)

entries = driver.find_elements_by_xpath('//div[contains(@data-cel-widget, "search_result_")]')

for entry in entries:
    title = entry.find_element_by_xpath('//span[@class = "a-size-medium a-color-base a-text-normal"]')
    
    print(title.text)

Tags: python, selenium, web-scraping

Solution

You don't need a separate locator for entries. Just loop over the results directly:

for entry in driver.find_elements_by_xpath("//span[@class = 'a-size-medium a-color-base a-text-normal']"):
    print(entry.text)

Prints:

Learning with Python
Machine Learning using Python
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Python: This Book Includes: Learn Python Programming + Python Coding and Programming + Python Coding. Everything you need to know to Learn Coding ... Machine Learning, Data Science and more ....
Python Programming: Using Problem Solving Approach
Automate the Boring Stuff with Python, 2nd Edition: Practical Programming for Total Beginners
...

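Updated solution

Here is a way to parse each search result and assign variable names to its different parts. Note that the author and the date are actually in the same element, so both are shown together.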

for entry in driver.find_elements_by_xpath("//div[@data-component-type='s-search-result']"):
    title = entry.find_element_by_xpath(".//span[@class = 'a-size-medium a-color-base a-text-normal']").text
    authors = entry.find_element_by_xpath(".//div[@class='a-row a-size-base a-color-secondary']").get_attribute("innerText")
    print(title)
    print(authors)

Prints:

Learning with Python
by Allen Downey , Jeffrey Elkner, et al. | 1 January 2015
Machine Learning using Python
by U Dinesh Kumar Manaranjan Pradhan | 1 January 2019
...

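Also note that every child lookup inside the loop starts with .// rather than //. The leading dot is required: without it, the XPath search goes back to the document root on every iteration instead of staying within the current entry, and I think that is the problem you originally ran into.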
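
To see the effect of scoping without launching a browser, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree in place of Selenium. The markup and the class names result and title are made up for illustration and are not Amazon's real page structure. ElementTree refuses absolute // paths on an element, so the "back to the root" behaviour is imitated by calling find on root inside the loop; treat this as an analogy for what Selenium does with //, not as Selenium's own API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a search results page.
PAGE = """
<body>
  <div class="result"><span class="title">Learning with Python</span></div>
  <div class="result"><span class="title">Machine Learning using Python</span></div>
  <div class="result"><span class="title">Python Crash Course</span></div>
</body>
""".strip()

root = ET.fromstring(PAGE)
entries = root.findall(".//div[@class='result']")

# Searching from the document root on every iteration:
# the first matching span is returned each time.
from_root = [root.find(".//span[@class='title']").text for _ in entries]

# Searching relative to the current entry:
# each iteration only sees the span inside its own div.
scoped = [entry.find(".//span[@class='title']").text for entry in entries]

print(from_root)
print(scoped)
```

The first list repeats the first title once per entry, which matches the symptom in the question; the second list contains one distinct title per entry.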

