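python - Web scraping a tab pane with Python
Problem description
Hi, I am trying to extract some information from a website: a player's age, height, and weight, which appear in a table on the page. The link to the site is http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742
The page source looks like this: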
</ul>
<div class="tab-content">
<div id="profile" class="tab-pane fade"><div class="loading"></div></div>
<div id="season" class="tab-pane fade"><div class="loading"></div></div>
<div id="events" class="tab-pane fade"><div class="loading"></div></div>
<div id="matches" class="tab-pane fade"><div class="loading"></div></div>
<div id="timeline" class="tab-pane fade"><div class="loading"></div></div>
<div id="rivalries" class="tab-pane fade"><div class="loading"></div></div>
<div id="ranking" class="tab-pane fade"><div class="loading"></div></div>
<div id="performance" class="tab-pane fade"><div class="loading"></div></div>
<div id="performanceChart" class="tab-pane fade"><div class="loading"></div></div>
<div id="statistics" class="tab-pane fade"><div class="loading"></div></div>
<div id="statisticsChart" class="tab-pane fade"><div class="loading"></div></div>
<div id="tournaments" class="tab-pane fade"><div class="loading"></div></div>
<div id="goatPoints" class="tab-pane fade"><div class="loading"></div></div>
<div id="records" class="tab-pane fade"><div class="loading"></div></div>
</div>
The information I am trying to extract is in <div id="profile" class="tab-pane fade"
but when I inspect the page it looks different from what is in the page source:
<div class="tab-content">
<div id="profile" class="tab-pane fade active in">
<div class="row">
<div class="col-md-4 col-lg-3">
<table class="table table-condensed text-nowrap">
<tbody><tr>
<th>Age</th>
<td>32 (03-06-1986)</td>
</tr>
<tr>
<th>Country</th>
<td><img src="/images/flags/es.png" title="ESP" width="24" height="20"> <span>Spain</span></td>
</tr>
<tr>
<th>Birthplace</th>
<td>Manacor, Mallorca, Spain</td>
</tr>
<tr>
<th>Residence</th>
<td>Manacor, Mallorca, Spain</td>
</tr>
<tr>
<th>Height</th>
<td>185 cm</td>
</tr>
<tr>
<th>Weight</th>
<td>85 kg</td>
</tr>
<tr>
<th>Plays</th>
<td>Left-handed</td>
</tr>
<tr>
<th>Backhand</th>
<td>Two-handed</td>
</tr>
<tr>
<th>Favorite Surface</th>
<td><span id="favoriteSurface" class="label label-danger" data-surface="C">Clay</span></td>
</tr>
<tr>
<th>Coach</th>
<td>Carlos Moya</td>
</tr>
<tr>
<th>Turned Pro</th>
<td>2001</td>
</tr>
<tr>
<th>Seasons</th>
<td><a href="/playerProfile?playerId=4742&tab=timeline" title="Show timeline">17</a></td>
</tr>
<tr>
<th>Active</th>
<td>Yes <img src="/images/active.png" title="Active" width="12" height="12" style="vertical-align: 0"></td>
</tr>
<tr>
<th>Prize Money</th>
<td>US$100,564,598 3rd all-time leader in earnings</td>
</tr>
<tr>
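This is what shows up when I inspect the page, but it is not in the page source. Can someone help me figure out how to extract the information in the table? This is my code so far: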
import urllib
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
link = soup.find('body', attrs={'class': 'container'})
mk = link.find('div', attrs={'class':'tab-content'})
print(mk)
new = mk.find('div' , {'class': 'tab-pane fade'})
print(new)
It outputs:
<div class="tab-content">
<div class="tab-pane fade" id="profile"><div class="loading"></div></div>
<div class="tab-pane fade" id="season"><div class="loading"></div></div>
<div class="tab-pane fade" id="events"><div class="loading"></div></div>
<div class="tab-pane fade" id="matches"><div class="loading"></div></div>
<div class="tab-pane fade" id="timeline"><div class="loading"></div></div>
<div class="tab-pane fade" id="rivalries"><div class="loading"></div></div>
<div class="tab-pane fade" id="ranking"><div class="loading"></div></div>
<div class="tab-pane fade" id="performance"><div class="loading"></div></div>
<div class="tab-pane fade" id="performanceChart"><div class="loading"></div></div>
<div class="tab-pane fade" id="statistics"><div class="loading"></div></div>
<div class="tab-pane fade" id="statisticsChart"><div class="loading"></div></div>
<div class="tab-pane fade" id="tournaments"><div class="loading"></div></div>
<div class="tab-pane fade" id="goatPoints"><div class="loading"></div></div>
<div class="tab-pane fade" id="records"><div class="loading"></div></div>
</div>
<div class="tab-pane fade" id="profile"><div class="loading"></div></div>
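Solution
The content you are interested in is loaded via JavaScript. It is generated dynamically, so a plain HTTP request returns only the "loading" placeholders and not the table. Try a browser automation tool such as selenium, which runs the page's scripts before you read the HTML.
Here is one way to get it: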
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
soup = BeautifulSoup(driver.page_source,"lxml")
for items in soup.select('#profile table.table tr'):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)
driver.quit()
Output it produces:
['Age', '32 (03-06-1986)']
['Country', 'Spain']
['Birthplace', 'Manacor, Mallorca, Spain']
['Residence', 'Manacor, Mallorca, Spain']
['Height', '185 cm']
['Weight', '85 kg']
['Plays', 'Left-handed']
['Backhand', 'Two-handed']
['Favorite Surface', 'Clay']
['Coach', 'Carlos Moya']
['Turned Pro', '2001']
['Seasons', '17']
['Active', 'Yes']
['Prize Money', 'US$100,564,598 3rd all-time leader in earnings']
['Wikipedia', 'Wikipedia']
['Web Site', 'rafaelnadal.com']
['Facebook', 'Nadal']
['Twitter', '@RafaelNadal']
['Nicknames', 'Rafa, Bull']
['Titles', '79']
['Grand Slams', '17']
['Masters', '32']
['Olympics', '1']
['']
['Current Rank', '1 (9310)']
['Best Rank', '1 (18-08-2008)']
['Current Elo Rank', '2 (2393)']
['Best Elo Rank', '1 (16-06-2008)']
['Peak Elo Rating', '2544 (09-09-2013)']
['GOAT Rank', '2 (720)']
['Weeks at No. 1', '184']
['']
['Best Season', '2013']
['Last Appearance', '02-07-2018']
['WimbledonGrassSF']
['Overall', '82.8% (906-188)', '']
['Hard', '77.0% (425-127)', '18']
['Clay', '92.0% (413-36)', '57']
['Grass', '77.6% (66-19)', '4']
['Carpet', '25.0% (2-6)']
['H2H', '11523']
['H2H %', '96.7%']
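Because the table is filled in by a script after the page loads, reading page_source immediately may occasionally catch the "loading" placeholder instead of the table. The sketch below is one possible variation, not part of the original answer: it waits explicitly for the first profile row with selenium's WebDriverWait, and then turns the printed rows into a dictionary so that Age, Height, and Weight can be looked up by name. The function names fetch_profile_rows and rows_to_dict and the 10-second timeout are assumptions made for illustration, and html.parser is used in place of lxml only to avoid an extra dependency.

```python
def fetch_profile_rows(url, timeout=10):
    """Return the profile table as a list of rows, each a list of cell texts.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # Third-party imports live inside the function so that the pure helper
    # below stays usable on a machine without Selenium or a browser.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one row of the profile table is in the DOM,
        # i.e. until the "loading" placeholder has been replaced.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#profile table.table tr")
            )
        )
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()  # always close the browser, even if the wait times out

    return [
        [cell.get_text(strip=True) for cell in row.select("th,td")]
        for row in soup.select("#profile table.table tr")
    ]


def rows_to_dict(rows):
    """Map label -> value, keeping only two-cell rows with a non-empty label.

    Rows such as [''] or ['Overall', '82.8% (906-188)', ''] are skipped.
    If a label occurs twice, the later row wins.
    """
    return {row[0]: row[1] for row in rows if len(row) == 2 and row[0]}
```

With a working Chrome driver, rows_to_dict(fetch_profile_rows(url)) should give a dictionary in which the keys 'Age', 'Height', and 'Weight' hold the values shown in the output above. Only single-value rows are kept; the per-surface rows with three cells would need separate handling.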