首页 > 解决方案 > 获取html表内元素的href

问题描述

HTML网站

我有一个 HTML 列表,从这个列表中我只想要<tr>具有class="". 我想稍后下载文件,所以我以后只需要第三个<td>和这个元素的内部,href<a>怎样才能将这些直接作为字符串读出?

我想要所有<tr>带有class = "".

例如:

<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>

在这个<tr>元素里面有一个<td>元素。我想在第三个<td>元素中包含元素的href <a>。所以我想要的是网址http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz(不仅是这个:D,我想要所有的网址)

代码

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
        
print(antwerp_table)
# antwerp_table is my html table 

html 示例(更多信息请访问http://insideairbnb.com/get-the-data.html

<table class="table table-hover table-striped antwerp">
<thead>
<tr>
<th class="col-md-3" data-field="host_id">Date Compiled</th>
<th class="col-md-3" data-field="host_id">Country/City</th>
<th class="col-md-3" data-field="host_id">File Name</th>
<th class="col-md-3" data-align="right" data-field="count">
                        Description
                    </th>
</tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
...
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>

标签: pythonweb-scraping

解决方案


有不同的方法来获取我建议的由表结构引起的未归档 ,以使用css 选择器,该选择器全部包含空和包含:hrefbs4<tr>class<a>

soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')

例子

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = [url['href'] for url in soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')]

输出

['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson']

推荐阅读