selenium - Extracting information from a same-page popup when web scraping with Python Selenium
Problem description
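Note: I am experienced in Python but just getting started with Selenium and web scraping. Please forgive me if this is a bad question or if my Selenium fundamentals seem shaky. I could not find an answer after several hours of searching, so I am asking here.
Goal: extract the "About the Business" information from a business's Yelp page. Some pages show the business information in a popup opened by a "Read more" button (e.g. https://www.yelp.com/biz/and-pizza-bethesda-bethesda). Other pages do not put their business information in a "Read more" popup (e.g. https://www.yelp.com/biz/pneuma-fashions-upper-marlboro-3).
Problem: I cannot navigate to the "About the Business" popup that appears after clicking the "Read more" button, so I cannot extract the text inside it.
Attempts so far: by googling, I found explanations of how to handle alert popups and window popups, but that code does not work. The popup that appears when the "Read more" button is clicked does not cause any change in window_handles.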
import re
# getting all sections of the page
result=driver.find_elements_by_tag_name("section")
About = None
for sec in result:
    if sec.text.startswith("About the Business"):
        # this pertains only to the About the business section
        main_page=driver.current_window_handle
        print(main_page) # Returns the current handle
        sec.find_element_by_tag_name("button").click()
        popup=None
        for handle in driver.window_handles: # is an iterable with only one handle
            # The only handle present is the main_page handle
            print(handle)
            if handle!=main_page:
                popup = handle
        print(popup) # returns None
        driver.switch_to.window(popup) # Throws error because popup=None
        # THE FOLLOWING SECTION IS NOT EXECUTED BECAUSE OF THE ERROR ABOVE
        #////////////////////////////////////////////////////
        button_contents=driver.find_elements_by_tag_name("p")
        for b in button_contents:
            print(b.text) # intended to print text contents
        close=driver.find_element_by_tag_name("button")
        close.click()
        driver.switch_to.window(main_page)
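Please help.
Thanks to everyone who reads this question and offers suggestions and answers.
Solution
One thing you should know is that the popup is not displayed in a new window. Instead, it is rendered inside the same page, so there is no window handle to switch to. Here is the complete code to extract the text from the popup: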
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('https://www.yelp.com/biz/and-pizza-bethesda-bethesda')
try:
    # Click "Read more"; the popup opens inside the same page
    driver.find_element_by_xpath('//*[@id="wrap"]/div[3]/div/div[4]/div/div/div[2]/div/div/div[1]/div/div[1]/section[5]/div[2]/button').click()
    # Read the two paragraphs from the modal container
    p1 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[1]/p').text
    p2 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[2]/p[2]').text
    print("Specialties --",p1)
    print("History --",p2)
except NoSuchElementException:
    # Raised when one of the lookups above finds nothing, e.g. the page has no "Read more" button
    print('Read more button not found')
Output:
Specialties -- Award-winning pizza: Named one of Fast Company's "World's Most Innovative Companies" in 2018, third-place in the Washington Post Express's of "Best Fast Casual" in 2018, third place in the Washington City Paper's "Best Gluten-Free Menu" in 2018 and won its "Best Pizza in D.C." in 2017, 11th on TripAdvisor's "Best Fast Casual Restaurants -- United States" in 2018.
History -- Since 2012, we've built pizza shops with an edge to their craft pies, beverages and shop design, created an environment where ALL of our Tribe can thrive, supported our local communities and now we'll text you back, if you want. Started with a pizza shop. Became a culture. That's &pizza.
Edit:
Since this does not work for this website, replace the first find_element_by_xpath call with:
driver.find_element_by_xpath("//div[@class='lemon--div__373c0__1mboc border-color--default__373c0__3-ifU']/button[.='Read more']").click()
This works for both websites.
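For pages of both kinds, the same idea can be sketched without hard-coded positional XPaths. The helper below is only an illustration, not the code from the answer: `read_about_popup`, `XPATH`, and `MODAL_PARAGRAPHS` are names made up here, and the `modal-portal-container` id and the "About the Business" section text are taken from the snippets above and assumed to still match Yelp's markup. It uses the `find_elements(by, value)` form because the `find_element_by_*` helpers were removed in Selenium 4.3, and it polls with a small hand-written loop to keep the logic visible; Selenium's `WebDriverWait` is the usual tool for that.

```python
import time

# Same string value as selenium.webdriver.common.by.By.XPATH
XPATH = "xpath"

# Assumption: the popup is rendered under this id, as in the answer's XPaths
MODAL_PARAGRAPHS = "//*[@id='modal-portal-container']//p"


def read_about_popup(driver, timeout=10.0, poll=0.5):
    """Return the non-empty paragraph texts of the "About the Business" popup.

    Returns an empty list when the page has no such section, no button
    inside it, or no popup text appears before the timeout.
    """
    for section in driver.find_elements(XPATH, "//section"):
        if not section.text.startswith("About the Business"):
            continue
        # find_elements returns [] instead of raising when nothing matches
        buttons = section.find_elements(XPATH, ".//button")
        if not buttons:
            return []
        buttons[0].click()
        # The popup lives in the same DOM, so query the page itself
        # (no window switch) until its text shows up or time runs out.
        deadline = time.monotonic() + timeout
        while True:
            paragraphs = driver.find_elements(XPATH, MODAL_PARAGRAPHS)
            texts = [p.text for p in paragraphs if p.text.strip()]
            if texts or time.monotonic() >= deadline:
                return texts
            time.sleep(poll)
    return []
```

With a real browser this would be called as `texts = read_about_popup(driver)` after `driver.get(...)`; an empty list means the page offers no "Read more" popup.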