BeautifulSoup not working

Problem description

I am new to web scraping. BeautifulSoup doesn't give me anything, which is strange. PS: I replaced "lxml" with "html.parser", and that doesn't work either.

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.pythonscraping.com/pages/page1.html")
>>> bsObj = BeautifulSoup(html.read())

Warning (from warnings module):

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 181
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best 
available HTML parser for this system ("lxml"). This usually isn't a 
problem, but if you run this code on another system, or in a different 
virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <string>. To get 
rid of this warning, change code that looks like this:

BeautifulSoup(YOUR_MARKUP})

to this:

BeautifulSoup(YOUR_MARKUP, "lxml")

>>> bsObj = BeautifulSoup(html.read(),"lxml")
>>> print(bsObj.h1)
None
>>> bsObj = BeautifulSoup(html.read())
>>> print(bsObj.h1)
None

Tags: web-scraping, beautifulsoup

Solution


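The problem is that read() is called repeatedly on the same response object. The first call returns the expected content and consumes the stream, so every later call just returns an empty bytes object. In the session above, the first read() was used up by the BeautifulSoup call that triggered the warning, so the two later calls parsed empty input and bsObj.h1 was None.

You can simply call read() once and store the return value in a variable, then reuse it as often as you like, for example to create multiple soup objects.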
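The read-once behaviour can be reproduced offline with a minimal sketch. It assumes io.BytesIO as a stand-in for the HTTP response (both are file-like objects that read from a cursor), and the markup is a made-up sample, not the real page:

```python
import io

# Assumption: io.BytesIO stands in for the object returned by urlopen(),
# so the example runs without network access. The markup is a made-up sample.
response = io.BytesIO(b"<html><body><h1>An Interesting Title</h1></body></html>")

first = response.read()   # consumes the whole stream
second = response.read()  # nothing is left, so this is an empty bytes object

print(len(first) > 0)  # True
print(second)          # b''
```

Parsing the second, empty result produces a soup with no tags, which is why looking up h1 gives None.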

>>> html = urlopen("http://www.pythonscraping.com/pages/page1.html").read()
>>> bsObj = BeautifulSoup(html, "lxml")
>>> bsObj.h1
<h1>An Interesting Title</h1>

If you don't want to install an additional parser, the code above also works with "html.parser".
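As a rough sketch of both points together, the snippet below parses the same stored bytes twice with the built-in "html.parser". The html value is a hypothetical stand-in for the result of a single urlopen(...).read() call, used so the example runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by one urlopen(...).read() call.
html = b"<html><body><h1>An Interesting Title</h1><div>Some text</div></body></html>"

# The stored bytes can be parsed any number of times; nothing is consumed.
soup_a = BeautifulSoup(html, "html.parser")
soup_b = BeautifulSoup(html, "html.parser")

print(soup_a.h1)             # <h1>An Interesting Title</h1>
print(soup_b.h1.get_text())  # An Interesting Title
```

Passing the parser name explicitly also avoids the UserWarning shown in the question.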

