Scraping text from an HTML element by class inside a list element

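Question

I am trying to scrape the headline from the first h4 element inside the list item with class "regularitem". The output should look like "It took months to hit 3 million reported cases...", but I keep getting "list index out of range".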

import requests
from bs4 import BeautifulSoup

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'

req = requests.get(URL, headers)
soup = BeautifulSoup(req.content, 'html.parser')
headline = soup.findAll('h4', attrs = {'class' : 'itemtitle'})[0]
print(headline.get_text)

The page HTML looks like this:

<li xmlns:dc="http://purl.org/dc/elements/1.1/" class="regularitem">
<h4 class="itemtitle"><a href="http://rss.cnn.com/~r/rss/cnn_topstories/~3/3M8R-V8mvn8/index.html">It took months to hit 3 million reported cases. Now nearly two weeks later, US is on the verge of 4 million.</a></h4>
<h5 class="itemposttime">
<span>Posted:</span>Wed, 22 Jul 2020 14:52:09 GMT</h5>
<div class="itemcontent" name="decodeable">Tracking US cases | Podcast | Those you've lost<div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=3M8R-V8mvn8:YP_46RpyuXw:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=3M8R-V8mvn8:YP_46RpyuXw:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=3M8R-V8mvn8:YP_46RpyuXw:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=3M8R-V8mvn8:YP_46RpyuXw:V_sGLiPBpWU" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=3M8R-V8mvn8:YP_46RpyuXw:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=3M8R-V8mvn8:YP_46RpyuXw:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=3M8R-V8mvn8:YP_46RpyuXw:gIN9vFwOqvQ" border="0"></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/3M8R-V8mvn8" height="1" width="1" alt=""></div>
</li>
<li xmlns:dc="http://purl.org/dc/elements/1.1/" class="regularitem">
<h4 class="itemtitle"><a href="http://rss.cnn.com/~r/rss/cnn_topstories/~3/e93U4xIXAew/h_b91e6cf9dac712028ccce8edf19ff634">'We are going as quickly as we possibly can' on vaccine development, Fauci says</a></h4>
<h5 class="itemposttime"></h5>
<div class="itemcontent" name="decodeable"><div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=e93U4xIXAew:NWKJndB2i28:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=e93U4xIXAew:NWKJndB2i28:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=e93U4xIXAew:NWKJndB2i28:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=e93U4xIXAew:NWKJndB2i28:V_sGLiPBpWU" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=e93U4xIXAew:NWKJndB2i28:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=e93U4xIXAew:NWKJndB2i28:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=e93U4xIXAew:NWKJndB2i28:gIN9vFwOqvQ" border="0"></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/e93U4xIXAew" height="1" width="1" alt=""></div>
</li>

I have tried removing the list index and switching to soup.find (see the example below), but when I do I get: 'NoneType' object has no attribute 'get_text'

import requests
from bs4 import BeautifulSoup

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'

req = requests.get(URL, headers)
soup = BeautifulSoup(req.content, 'html.parser')
headline = soup.find('h4', attrs = {'class' : 'itemtitle'})
print(headline.get_text)

Tags: beautifulsoup

Answer


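The request you are making does not return that HTML, because the HTML on the page is loaded dynamically. The solution is to use Selenium, which drives a real browser and therefore loads the dynamic content.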

# These settings may differ based on where you install your Chrome webdriver.
# Alternatively you can use the Firefox webdriver; there are plenty of tutorials
# online that show you how to install either one.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # or '--start-maximized' if you want to see the window open
driver = webdriver.Chrome(options=chrome_options)

URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'

# The headers dict from the question is dropped here: driver.get() takes only a URL.
try:
    driver.get(URL)
    # Parse the HTML as rendered by the browser, not the raw response body.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # always close the browser, even if loading fails

headlines = soup.find_all('h4', {'class': 'itemtitle'})
for headline in headlines:
    # Each headline's text sits inside the <a> element of the <h4>.
    headline_text = headline.find('a').text
    print(headline_text)
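
To illustrate the diagnosis above without touching the network, here is a minimal sketch that is not part of the original answer. `first_headline` is a hypothetical helper, and `RAW_FEED` and `RENDERED_PAGE` are made-up stand-ins: the first assumes the raw response is plain RSS XML, the second is abbreviated from the HTML shown in the question. It shows why `find(...)` returns `None` (and `findAll(...)[0]` raises an index error) when the markup is absent, and that `get_text` has to be called with parentheses.

```python
from bs4 import BeautifulSoup

# Assumed shape of the raw response: plain RSS XML with no <h4 class="itemtitle">.
RAW_FEED = """<rss version="2.0"><channel><item>
<title>It took months to hit 3 million reported cases.</title>
</item></channel></rss>"""

# Abbreviated from the rendered HTML shown in the question.
RENDERED_PAGE = """<li class="regularitem">
<h4 class="itemtitle"><a href="#">It took months to hit 3 million reported cases.</a></h4>
</li>"""


def first_headline(markup):
    """Return the text of the first h4.itemtitle, or None if there is none."""
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.find('h4', class_='itemtitle')
    if tag is None:
        # Nothing matched, so calling tag.get_text() here would raise AttributeError.
        return None
    return tag.get_text(strip=True)  # note the parentheses: get_text is a method


print(first_headline(RAW_FEED))
print(first_headline(RENDERED_PAGE))
```

Checking for `None` (or an empty result list) before indexing turns the two errors from the question into an explicit "markup not present" result, which makes it easier to tell a wrong selector from a response that never contained the element.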

Let me know if this helps.

