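python - Does Beautiful Soup 4 drop all content after parsing with "html" or "lxml"?
Problem description
I'm using the bs4 and requests packages on Anaconda's Python 3.8. I'm trying to get all the .tgz file names from voxforge.com. However, after I fetch the page with requests and turn it into a soup, everything past a certain point is gone.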
import requests
import bs4
r = requests.get('http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/')
r.text
This returns everything I need (and it goes on for a while):
'<title>VoxForge Repository</title>\n\n\t<style type="text/css">\n\t.siteFunctions {\n\t\ttext-align: right;\n\t}\n\t.copyright {\n\t\ttest-align: left;\n\t\tcolor: #2E3436;\n\t\tfont-family: sans-serif;\n font-size: small;\n\t}\n\n\tbody {\n\t\tfont-family: "DejaVu Sans", "Lucida Sans Unicode", sans-serif;\n\t\tfont-weight:\tnormal;\n\t\tword-spacing:\tnormal;\n\t\tletter-spacing:\tnormal; \n\t\ttext-transform:\tnone;\n\t\tfont-size: medium;\n text-align: justify;\n\t}\n\th2 {\n\t\tfont-size:\t1.5em;\n\t\tfont-weight:\t700;\n\t\tmargin-top:1em;\n\t\tmargin-bottom:0.8em;\n\t}\n\th3 {\n\t\tfont-size:\t1.1em;\n\t\tfont-weight:\t600;\n\t\tmargin-top:1em;\n\t\tmargin-bottom:0.4em;\n\t}\n\tp, ol, ul {\n\t\tfont-size:\t1em;\n\t\tmargin-top:0.4em;\n\t\tmargin-bottom:0.4em;\n\t}\t\n\t.heading {\n\t\tbackground-color: #555753;\n color: #D3D7CF;\n\t\tfont-size: 40px;\n\t\tvertical-align: bottom;\n\t}\n\t.logo {\n\t\twidth: 100px; \n\t\tfloat: left;\n\t\ttext-align: left;\n\t}\n\t.logo img {\n\t\tborder: 0px;\n\t}\n\timg {\n\t\tborder: 0px;\n\t}\n\t.clickableicons {\n\t}\n\t.endFloat {\n\t\tclear: both;\n\t\n\t}\n\t.padding {\n\t\tpadding: 10px;\n\t}\n\t.bodyContent {\n\t\tbackground-color: #ffffff;\n\t\tcolor: #2E3436;\n text-align: justify;\n\t}\n\t.menu {\n color: #D3D7CF;\n\t\tbackground-color: #555753;\n\t\ttext-align: left;\n\t}\n\n\t.menu2 {\n color: #D3D7CF;\n\t\tbackground-color: #555753;\n\t\ttext-align: center;\n\t\t\n\t}\n\ta {\n\t\tcolor: #f57900;\n\t\ttext-decoration:none;\n\t}\n\ta:visited {\n\t\tcolor: #ce5c00;\n\t}\n\ta:hover {\n text-decoration:underline;\n\t}\n\t.menu a {\n\t\tcolor: #D3D7CF;\n\t\tfont-weight: bold; \n\t}\n\t.menu a:hover {\n\t\tcolor: #eeeeec;\n\t\ttext-decoration:none;\n\t}\n\n\t</style>\n</head><body>\n\n\n\n<div class="heading">\n<div class="padding">\n<div class="logo"><a href="http://www.voxforge.org"><img src="http://www.voxforge.org/uploads/8k/N8/8kN884Cd96cmBZxRlzmbzQ/voxforge-logo.jpg" alt="VoxForge Repository"> </a></div> 
\n\n<div class="endFloat"></div>\n\n</div>\n</div>\n\n<div class="menu">\n\t<div class="padding">\t\t\n\t\t\n\t\t\n<span class="horizontalMenu">\n\n<a class="horizontalMenu" href="http://www.voxforge.org/home">Home</a>\n · \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/read">Read</a>\n · \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/listen">Listen</a>\n · \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/forums">Forums</a>\n · \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/dev">Dev</a>\n\n · \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/downloads">Downloads</a>\n · \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/about">About</a>\n \n\n \n\n</span></div>\n\n</div>\n\n\n\n</div>\n\n</body></html>\n<pre><img src="/spicons/blank.gif" alt="Icon "> <a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a> <a href="?C=S;O=A">Size</a> <hr><img src="/spicons/back.gif" alt="[PARENTDIR]"> <a href="/downloads/fr/Trunk/Audio/Main/">Parent Directory</a> - \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="4h-20100505-vgm.tgz">4h-20100505-vgm.tgz</a> 2010-05-13 11:34 1.6M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-bfg.tgz">Agoniste-20130928-bfg.tgz</a> 2014-02-17 05:02 1.8M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-fnn.tgz">Agoniste-20130928-fnn.tgz</a> 2014-02-18 04:32 1.9M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-gaf.tgz">Agoniste-20130928-gaf.tgz</a> 2014-02-18 04:32 2.0M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-izd.tgz">Agoniste-20130928-izd.tgz</a> 2014-02-18 04:32 1.8M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-ndz.tgz">Agoniste-20130928-ndz.tgz</a> 2014-02-18 04:32 1.8M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-pzq.tgz">Agoniste-20130928-pzq.tgz</a> 2014-02-18 
04:32 2.0M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-qyu.tgz">Agoniste-20130928-qyu.tgz</a> 2014-02-18 04:32 2.1M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-rva.tgz">Agoniste-20130928-rva.tgz</a> 2014-02-18 04:32 1.8M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Agoniste-20130928-vio.tgz">Agoniste-20130928-vio.tgz</a> 2014-06-10 04:44 1.7M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-cyf.tgz">Alliage-20151109-cyf.tgz</a> 2015-11-13 04:08 1.1M \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-dqh.tgz">Alliage-20151109-dqh.tgz</a> 2015-11-13 04:08 960K \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-ewg.tgz">Alliage-20151109-ewg.tgz</a> 2015-11-13 04:08 963K \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-imx.tgz">Alliage-20151109-imx.tgz</a> 2015-11-13 04:08 855K \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-kny.tgz">Alliage-20151109-kny.tgz</a> 2015-11-13 04:08 924K \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-lcn.tgz">Alliage-20151109-lcn.tgz</a> 2015-11-13 04:08 910K \n<img src="/spicons/compressed.gif" alt="[ ]"> <a href="Alliage-20151109-rxi.tgz">Alliage-20151109-rxi.tgz</a>
But when I parse it with bs4 using html or lxml:
soup = bs4.BeautifulSoup(r.text, 'html')
soup
I only get the first part back; all the other information after </body> is gone:
<html><head><title>VoxForge Repository</title>
<style type="text/css">
.siteFunctions {
text-align: right;
}
.copyright {
test-align: left;
color: #2E3436;
font-family: sans-serif;
font-size: small;
}
body {
font-family: "DejaVu Sans", "Lucida Sans Unicode", sans-serif;
font-weight: normal;
word-spacing: normal;
letter-spacing: normal;
text-transform: none;
font-size: medium;
text-align: justify;
}
h2 {
font-size: 1.5em;
font-weight: 700;
margin-top:1em;
margin-bottom:0.8em;
}
h3 {
font-size: 1.1em;
font-weight: 600;
margin-top:1em;
margin-bottom:0.4em;
}
p, ol, ul {
font-size: 1em;
margin-top:0.4em;
margin-bottom:0.4em;
}
.heading {
background-color: #555753;
color: #D3D7CF;
font-size: 40px;
vertical-align: bottom;
}
.logo {
width: 100px;
float: left;
text-align: left;
}
.logo img {
border: 0px;
}
img {
border: 0px;
}
.clickableicons {
}
.endFloat {
clear: both;
}
.padding {
padding: 10px;
}
.bodyContent {
background-color: #ffffff;
color: #2E3436;
text-align: justify;
}
.menu {
color: #D3D7CF;
background-color: #555753;
text-align: left;
}
.menu2 {
color: #D3D7CF;
background-color: #555753;
text-align: center;
}
a {
color: #f57900;
text-decoration:none;
}
a:visited {
color: #ce5c00;
}
a:hover {
text-decoration:underline;
}
.menu a {
color: #D3D7CF;
font-weight: bold;
}
.menu a:hover {
color: #eeeeec;
text-decoration:none;
}
</style>
</head><body>
<div class="heading">
<div class="padding">
<div class="logo"><a href="http://www.voxforge.org"><img alt="VoxForge Repository" src="http://www.voxforge.org/uploads/8k/N8/8kN884Cd96cmBZxRlzmbzQ/voxforge-logo.jpg"/> </a></div>
<div class="endFloat"></div>
</div>
</div>
<div class="menu">
<div class="padding">
<span class="horizontalMenu">
<a class="horizontalMenu" href="http://www.voxforge.org/home">Home</a>
·
<a class="horizontalMenu" href="http://www.voxforge.org/home/read">Read</a>
·
<a class="horizontalMenu" href="http://www.voxforge.org/home/listen">Listen</a>
·
<a class="horizontalMenu" href="http://www.voxforge.org/home/forums">Forums</a>
·
<a class="horizontalMenu" href="http://www.voxforge.org/home/dev">Dev</a>
·
<a class="horizontalMenu" href="http://www.voxforge.org/home/downloads">Downloads</a>
·
<a class="horizontalMenu" href="http://www.voxforge.org/home/about">About</a>
</span></div>
</div>
</body></html>
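I'm trying to get all the links after </body>, so I need a way to extract them, but bs4 seems to be deleting them. Can anyone help?
Solution
Another solution, using the 'html.parser' parser and reading r.content: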
import bs4
import requests
r = requests.get('http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/')
soup = bs4.BeautifulSoup(r.content, 'html.parser')
# Select every <a> whose href contains ".tgz" and print the file name
for a in soup.select('a[href*=".tgz"]'):
    print(a['href'])
Prints:
4h-20100505-vgm.tgz
Agoniste-20130928-bfg.tgz
Agoniste-20130928-fnn.tgz
Agoniste-20130928-gaf.tgz
Agoniste-20130928-izd.tgz
Agoniste-20130928-ndz.tgz
Agoniste-20130928-pzq.tgz
Agoniste-20130928-qyu.tgz
Agoniste-20130928-rva.tgz
...and so on.
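As a cross-check that does not depend on bs4 at all, the same links can be collected with Python's standard-library html.parser module, which reports each tag as it is encountered and does not build a tree. The following is only a minimal sketch: the class name TgzLinkCollector and the SAMPLE string are hypothetical, and SAMPLE is a shortened stand-in shaped like the page in the question (file listing placed after the closing </body></html>), not the real response. It also matches hrefs that end in ".tgz", which is slightly stricter than the "contains" selector used above.

```python
from html.parser import HTMLParser


class TgzLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values ending in .tgz from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value).
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".tgz"):
                self.links.append(value)


# Assumed sample, shortened from the output shown in the question:
# the directory listing sits after </body></html>.
SAMPLE = (
    '<html><head><title>VoxForge Repository</title></head>'
    '<body><a href="http://www.voxforge.org/home">Home</a></body></html>\n'
    '<pre><a href="?C=N;O=D">Name</a><hr>'
    '<a href="/downloads/fr/Trunk/Audio/Main/">Parent Directory</a>\n'
    '<a href="4h-20100505-vgm.tgz">4h-20100505-vgm.tgz</a>\n'
    '<a href="Agoniste-20130928-bfg.tgz">Agoniste-20130928-bfg.tgz</a>\n'
    '</pre>'
)

collector = TgzLinkCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.links)
```

To try it on the live page, SAMPLE could be replaced with r.text from the request above. Because the parser is event-based, links that appear after the closing tags are still seen.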