Is there a way to render an HTML page without using a browser and then scrape its contents?

Question

I need to extract some text from a web page, but the page is built dynamically (by a plugin). That is, I need to include a JavaScript SDK:

<div id="fb-root"></div>
<script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v11.0" nonce="4HbUqy4w"></script>

and then place this code wherever I want the plugin to appear on my page:

<div class="fb-comments" data-href="https://developers.facebook.com/docs/plugins/comments#configurator" data-width="1" data-numposts="1"></div>

So overall, I have something like this:

<html>
    <body>
        <div id="fb-root"></div>
        <script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v11.0" nonce="4HbUqy4w"></script>
        <div class="fb-comments" data-href="https://developers.facebook.com/docs/plugins/comments#configurator" data-width="1" data-numposts="1"></div>
    </body>
</html>

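Rendering this page in a browser should automatically load some data, which I now want to scrape. Is there a way to render this HTML in Python? I have tried using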

from requests_html import HTML

doc = "..."  # placeholder for the HTML content shown above
html = HTML(html=doc)
page = html.render(keep_page=True, sleep=120)

but page is always None.

Ideally, I would like something like this:

html_code = "..."  # placeholder for the HTML content shown above
loaded_html_code = a_package.render(html_code)  # should render my HTML, which in turn causes an iframe to be loaded

Tags: web-scraping, beautifulsoup, python-requests, python-requests-html

Answer


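You can use Beautiful Soup together with Selenium WebDriver to achieve this. Note that Selenium still drives a real browser, although that browser can run headless. Here is some sample code: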

import time

from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://example.com/"

driver = webdriver.Firefox()
try:
    driver.get(URL)
    # Give content loaded via APIs, JS, AJAX, etc. time to appear.
    # 15 seconds is a guess; adjust it for the page being scraped.
    time.sleep(15)
    html = driver.page_source
finally:
    driver.quit()  # always close the browser, even if loading fails

# Name the parser explicitly; omitting it makes bs4 emit a warning.
soup = BeautifulSoup(html, "html.parser")

# Find an element by its ID
results = soup.find(id="target_id")
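
The sample above loads a URL and sleeps for a fixed time, whereas the question starts from an HTML string. The following is a minimal sketch of one way to adapt it, not a definitive implementation. It assumes Selenium 4 with Firefox and geckodriver installed. The helper names html_to_file_url and render_html, and the wait_css parameter, are made up for illustration. The string is written to a temporary file, opened through a file:// URL in headless Firefox, and the code waits for a chosen element instead of sleeping. Whether the Facebook plugin actually renders from a file:// page is untested here; serving the file from a local web server may be needed.

```python
import tempfile
from pathlib import Path


def html_to_file_url(html_code: str) -> str:
    """Write an HTML string to a temporary file and return its file:// URL.

    The file is not deleted automatically; remove it when no longer needed.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(html_code)
    return Path(handle.name).resolve().as_uri()


def render_html(html_code: str, wait_css: str, timeout: float = 15.0) -> str:
    """Render html_code in headless Firefox and return the resulting page source."""
    # Imported lazily so html_to_file_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(html_to_file_url(html_code))
        # Wait until an element matching wait_css exists;
        # raises TimeoutException if it never appears.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser
```

With the page from the question, a call such as render_html(html_code, wait_css="iframe") would wait for the plugin's iframe to be inserted. page_source only covers the top-level document, so reading the iframe's own content would additionally require driver.switch_to.frame(...) before the driver is closed. As for the None in the question: in requests_html, HTML.render() updates the HTML object in place and returns the result of the optional script argument, so it returns None when no script is passed.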
