python - How to use web scraping to get visible text on the webpage?
Problem description
This is the link of the webpage I want to scrape: https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied an additional filter by clicking on the encircled heading.
This is what the webpage looks like after clicking on the heading.
I want to get the names of all the places displayed on the webpage, but I am having trouble because the URL doesn't change when the filter is applied. I am using Python's urllib for this. Here is my code:
from urllib.request import urlopen

url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)  # request the page
html_bytes = page.read()  # raw response body
html = html_bytes.decode("utf-8")
print(html)
Solution
You can use bs4. bs4 (Beautiful Soup) is a Python module that lets you extract specific content from a web page. This gets the text from the site:
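from bs4 import BeautifulSoup as bs
# html is the decoded page source from the question; the html5lib parser must be installed separately
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)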
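Note that get_text() returns every text node, which can include the contents of script and style tags. A minimal sketch of keeping only the text a reader would see follows. It assumes the built-in html.parser instead of html5lib, and both sample_html and the visible_text helper are made up for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup standing in for the downloaded page source.
sample_html = """
<html><head><title>Demo</title><style>p {color: red;}</style>
<script>var x = 1;</script></head>
<body><h1>Places</h1><!-- a comment --><p>First place</p><p>Second place</p></body></html>
"""

# Parents whose text is not rendered in the page body.
HIDDEN_PARENTS = {"script", "style", "noscript", "template", "head", "title", "meta", "[document]"}

def visible_text(html):
    """Return the rendered-looking text of html, one fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for string in soup.find_all(string=True):
        if isinstance(string, Comment):
            continue  # skip HTML comments
        if string.parent.name in HIDDEN_PARENTS:
            continue  # skip script/style/head content
        fragment = string.strip()
        if fragment:
            parts.append(fragment)
    return "\n".join(parts)

print(visible_text(sample_html))
```

This only filters by tag name; text hidden through CSS would still be included.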
If you want something other than the plain text, such as elements with a specific tag, you can also use bs4:
soup.find_all('p')  # get all p tags
soup.find_all('p', class_='Title')  # get all p tags with a class of Title
Find out which tag and class the place names use, then apply the method above to get all of the place names.
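As a rough sketch of that last step, the example below collects names from a small piece of sample markup. The tag and the class name listing-title are invented for the demo and are not TripAdvisor's actual markup; inspect the real page in your browser's developer tools to find the right ones.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag and class names are assumptions for this demo.
sample_html = """
<div class="listing"><a class="listing-title" href="/r1"> Cafe Alpha </a></div>
<div class="listing"><a class="listing-title" href="/r2">Bistro Beta</a></div>
<div class="ad"><a class="promo" href="/ad">Sponsored</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the anchors that carry the (assumed) class used for place names.
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="listing-title")]
print(names)
```

This only finds names that are present in the HTML string you pass in, so check that the page source you downloaded actually contains the entries you see in the browser.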