How to use web scraping to get visible text on the webpage?

Problem description

This is the link to the webpage I want to scrape: https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html

I have also applied additional filters by clicking on the encircled heading.

This is what the webpage looks like after clicking on the heading.

I want to get the names of all the places displayed on the webpage, but I am having trouble because the URL doesn't change when the filter is applied. I am using Python's urllib for this. Here is my code:

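from urllib.request import urlopen

url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)                # fetch the page
html_bytes = page.read()           # raw response body as bytes
html = html_bytes.decode("utf-8")  # decode the bytes into a str
print(html)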

Tags: python, html, python-3.x, web-scraping, urllib

Solution


You can use bs4 (Beautiful Soup 4), a Python module that lets you extract specific content from a web page. The following gets the text from the site:

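from bs4 import BeautifulSoup as bs

# `html` is the decoded page source from the question's code.
# The html5lib parser must be installed separately (pip install html5lib).
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)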
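
The question asks for visible text specifically. As a rough illustration of what that filtering involves, here is a minimal sketch that uses only the standard library (html.parser.HTMLParser, not bs4) and drops text inside elements a browser does not render as content. The names VisibleTextParser and visible_text, and the sample markup, are hypothetical and only serve the example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of non-rendered elements."""

    # Elements whose text is not shown as page content (an assumed, short list).
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside every skipped element.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the kept text chunks of `html`, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, not real TripAdvisor markup.
sample_html = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <h1>Restaurants in Indore</h1>
    <script>var tracking = 1;</script>
    <p>Example Cafe</p>
  </body>
</html>
"""

print(visible_text(sample_html))
```

This sketch only filters by tag name. It does not account for elements hidden through CSS or for content that the page loads later with JavaScript.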

If you want something other than plain text, such as elements with a particular tag, you can also use bs4:

soup.find_all('p')  # get all p tags
soup.find_all('p', class_='Title')  # get all p tags with a class of Title

Find out which tags and classes the place names use, then apply the methods above to get all the place names.
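
The approach described above can be sketched on a small inline page. The tag and class names here (a, place-name) are invented for illustration and are not TripAdvisor's actual markup; inspect the real page in the browser's developer tools to find the right ones. The example passes "html.parser" as the parser so that html5lib is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tags, classes, and place names are made up.
sample_html = """
<div class="listing"><a class="place-name" href="/r/1">Example Cafe</a></div>
<div class="listing"><a class="place-name" href="/r/2">Sample Diner</a></div>
<div class="ad"><a class="promo" href="/ad">Sponsored link</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select every <a> tag whose class is "place-name" and keep its stripped text.
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="place-name")]
print(names)
```

Matching on both tag and class keeps unrelated links, such as the promo link in the sample, out of the result.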

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

