python - Get the href of elements inside an HTML table
Problem description
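I have an HTML table, and from it I only want the <tr> elements that have class="". I want to download the files later, so I only need the third <td> of each row and, inside it, the href of the <a>.
How can I read these out directly as strings?
For example: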
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
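This <tr> element contains several <td> elements. I want the href of the <a> element inside the third <td>. So what I want is the URL http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz
(and not only this one :D, I want all of the URLs from the rows with class="")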
Code
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
print(antwerp_table)
# antwerp_table is my html table
Example HTML (the full page is at http://insideairbnb.com/get-the-data.html)
<table class="table table-hover table-striped antwerp">
<thead>
<tr>
<th class="col-md-3" data-field="host_id">Date Compiled</th>
<th class="col-md-3" data-field="host_id">Country/City</th>
<th class="col-md-3" data-field="host_id">File Name</th>
<th class="col-md-3" data-align="right" data-field="count">
Description
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
...
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
Solution
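There are several ways to get the href values of the rows that are not archived. Given the table structure, I would suggest a CSS selector in bs4 that matches every <tr> with an empty class attribute and then the <a> elements inside it: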
soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')
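To see what the selector does without hitting the live site, here is a minimal offline sketch. It assumes a trimmed-down copy of the table from the question (onclick handlers removed), and the href on the archived row is a hypothetical placeholder rather than a real file. It also shows a stricter variant that only looks at the third <td> of each row, which is what the question literally asks for.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question (onclick handlers omitted).
# The archived row's href is a hypothetical placeholder, not a real file.
HTML = """
<table class="table table-hover table-striped antwerp">
<thead>
<tr><th>Date Compiled</th><th>Country/City</th><th>File Name</th><th>Description</th></tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://example.com/archived/calendar.csv.gz">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
city = "antwerp"

# Selector from the answer: any <a> inside a <tr class=""> of the city table.
current_urls = [a["href"] for a in soup.select(f'.{city} tr[class=""] a')]

# Stricter variant: only an <a> that sits directly in the third <td> of such a row.
third_cell_urls = [
    a["href"]
    for a in soup.select(f'.{city} tr[class=""] > td:nth-of-type(3) > a')
]

# For comparison: every link in the table, archived rows included.
all_urls = [a["href"] for a in soup.select(f".{city} a")]

print(current_urls)
```

On this sample both variants return the same two URLs, and the archived row is skipped. The stricter variant would only matter if other cells of a row also contained links.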
Example
import requests
from bs4 import BeautifulSoup
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
# Collect the href of every <a> inside a <tr class=""> of the city's table
antwerp_urls = [url['href'] for url in soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')]
print(antwerp_urls)
Output
['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv',
'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson']
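Since the question mentions downloading the files later, the following is a rough sketch of one way to do that once the URLs are in a list. The helper names `local_path_for` and `download` and the `downloads` directory are hypothetical, not part of the original answer. The sketch uses `urllib.request` from the standard library instead of `requests`, and it assumes the URLs are still reachable. The local path mirrors the URL path so that files with the same name from different dates do not overwrite each other.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def local_path_for(url, dest_dir="downloads"):
    """Map a URL to a local path that mirrors the URL's path (hypothetical helper)."""
    parts = urlparse(url).path.strip("/").split("/")
    return Path(dest_dir).joinpath(*parts)


def download(url, dest_dir="downloads"):
    """Stream one file to disk and return the path it was saved to (hypothetical helper)."""
    target = local_path_for(url, dest_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response, open(target, "wb") as out:
        # Copy in chunks so large .csv.gz files are not held in memory at once.
        shutil.copyfileobj(response, out)
    return target


# Two of the URLs from the output above; only the paths are computed here,
# nothing is fetched until download() is called.
urls = [
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz",
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv",
]
paths = [local_path_for(u) for u in urls]
```

Calling `download(u)` for each `u` in the list would then fetch the files one by one.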