Python web scraping: tab pane

Question

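Hey, I'm trying to extract some information from a website: a player's age, height, and weight. The information is in the table shown below. The link to the site is http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742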

The page source looks like this:

</ul>
<div class="tab-content">
    <div id="profile" class="tab-pane fade"><div class="loading"></div></div>
    <div id="season" class="tab-pane fade"><div class="loading"></div></div>
    <div id="events" class="tab-pane fade"><div class="loading"></div></div>
    <div id="matches" class="tab-pane fade"><div class="loading"></div></div>
    <div id="timeline" class="tab-pane fade"><div class="loading"></div></div>
    <div id="rivalries" class="tab-pane fade"><div class="loading"></div></div>
    <div id="ranking" class="tab-pane fade"><div class="loading"></div></div>
    <div id="performance" class="tab-pane fade"><div class="loading"></div></div>
    <div id="performanceChart" class="tab-pane fade"><div class="loading"></div></div>
    <div id="statistics" class="tab-pane fade"><div class="loading"></div></div>
    <div id="statisticsChart" class="tab-pane fade"><div class="loading"></div></div>
    <div id="tournaments" class="tab-pane fade"><div class="loading"></div></div>
    <div id="goatPoints" class="tab-pane fade"><div class="loading"></div></div>
    <div id="records" class="tab-pane fade"><div class="loading"></div></div>
</div>

The information I'm trying to extract is in <div id="profile" class="tab-pane fade">, but when I inspect the page it looks different from what is in the page source:

<div class="tab-content">
    <div id="profile" class="tab-pane fade active in">

<div class="row">
  <div class="col-md-4 col-lg-3">
    <table class="table table-condensed text-nowrap">

        <tbody><tr>
            <th>Age</th>
            <td>32 (03-06-1986)</td>
        </tr>
        <tr>
            <th>Country</th>
            <td><img src="/images/flags/es.png" title="ESP" width="24" height="20"> <span>Spain</span></td>
        </tr>
        <tr>
            <th>Birthplace</th>
            <td>Manacor, Mallorca, Spain</td>
        </tr>
        <tr>
            <th>Residence</th>
            <td>Manacor, Mallorca, Spain</td>
        </tr>
        <tr>
            <th>Height</th>
            <td>185 cm</td>
        </tr>
        <tr>
            <th>Weight</th>
            <td>85 kg</td>
        </tr>

        <tr>
            <th>Plays</th>
            <td>Left-handed</td>
        </tr>
        <tr>
            <th>Backhand</th>
            <td>Two-handed</td>
        </tr>
        <tr>
            <th>Favorite Surface</th>
            <td><span id="favoriteSurface" class="label label-danger" data-surface="C">Clay</span></td>
        </tr>
        <tr>
            <th>Coach</th>
            <td>Carlos Moya</td>
        </tr>
        <tr>
            <th>Turned Pro</th>
            <td>2001</td>
        </tr>
        <tr>
            <th>Seasons</th>
            <td><a href="/playerProfile?playerId=4742&amp;tab=timeline" title="Show timeline">17</a></td>
        </tr>
        <tr>
            <th>Active</th>
            <td>Yes <img src="/images/active.png" title="Active" width="12" height="12" style="vertical-align: 0"></td>
        </tr>

        <tr>
            <th>Prize Money</th>
            <td>US$100,564,598 3rd all-time leader in earnings</td>
        </tr>

        <tr>

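This is what shows up when I inspect the page, but it is not in the page source. Can someone help me figure out how to extract the information in the table? Here is my code so far: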

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

# Start from the page body
link = soup.find('body', attrs={'class': 'container'})

# The container that holds all tab panes
mk = link.find('div', attrs={'class': 'tab-content'})
print(mk)
# The first tab pane, which is the "profile" tab
new = mk.find('div', {'class': 'tab-pane fade'})
print(new)

It outputs:

<div class="tab-content">
<div class="tab-pane fade" id="profile"><div class="loading"></div></div>
<div class="tab-pane fade" id="season"><div class="loading"></div></div>
<div class="tab-pane fade" id="events"><div class="loading"></div></div>
<div class="tab-pane fade" id="matches"><div class="loading"></div></div>
<div class="tab-pane fade" id="timeline"><div class="loading"></div></div>
<div class="tab-pane fade" id="rivalries"><div class="loading"></div></div>
<div class="tab-pane fade" id="ranking"><div class="loading"></div></div>
<div class="tab-pane fade" id="performance"><div class="loading"></div></div>
<div class="tab-pane fade" id="performanceChart"><div class="loading"></div></div>
<div class="tab-pane fade" id="statistics"><div class="loading"></div></div>
<div class="tab-pane fade" id="statisticsChart"><div class="loading"></div></div>
<div class="tab-pane fade" id="tournaments"><div class="loading"></div></div>
<div class="tab-pane fade" id="goatPoints"><div class="loading"></div></div>
<div class="tab-pane fade" id="records"><div class="loading"></div></div>
</div>
<div class="tab-pane fade" id="profile"><div class="loading"></div></div>

Tags: python, web-scraping

Answer


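The content you are interested in is loaded via JavaScript. It is dynamic: the HTML returned by the server contains only the empty "loading" placeholders, so an ordinary HTTP request will not give you access to the table. Try a browser automation tool such as Selenium, which runs the page's scripts before you read the HTML.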
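
To make the difference concrete, here is a minimal sketch that checks whether a tab pane contains a table. It uses only the standard library's `html.parser` (a swap for BeautifulSoup, so that it runs without extra packages). `PaneInspector` and `panes_with_tables` are hypothetical names made up for this illustration, and the HTML string is a shortened copy of the page source quoted in the question.

```python
from html.parser import HTMLParser

# Static markup as served by the site (copied from the question, shortened).
STATIC_HTML = """
<div class="tab-content">
    <div id="profile" class="tab-pane fade"><div class="loading"></div></div>
    <div id="season" class="tab-pane fade"><div class="loading"></div></div>
</div>
"""


class PaneInspector(HTMLParser):
    """Hypothetical helper: records, per tab pane, whether it holds a <table>."""

    def __init__(self):
        super().__init__()
        self.current_pane = None  # id of the tab pane currently being parsed
        self.depth = 0            # <div> nesting depth inside that pane
        self.has_table = {}       # pane id -> True if a <table> was seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.current_pane is None:
            # Entering a new tab pane
            if tag == "div" and "tab-pane" in classes:
                self.current_pane = attrs.get("id")
                self.depth = 1
                self.has_table[self.current_pane] = False
        elif tag == "div":
            self.depth += 1
        elif tag == "table":
            self.has_table[self.current_pane] = True

    def handle_endtag(self, tag):
        if self.current_pane is not None and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # The pane's own closing </div> was reached
                self.current_pane = None


def panes_with_tables(html):
    """Return a dict mapping each tab pane id to whether it contains a table."""
    inspector = PaneInspector()
    inspector.feed(html)
    return inspector.has_table


print(panes_with_tables(STATIC_HTML))
# {'profile': False, 'season': False}
```

Running the same check on HTML taken from a real browser session should report `True` for `profile`, since by then the script has replaced the placeholder with the table.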

Here is one way to get it:

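from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
    # Wait until JavaScript has filled the profile tab with its table
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#profile table.table"))
    )
    soup = BeautifulSoup(driver.page_source, "lxml")
    for items in soup.select('#profile table.table tr'):
        # One list per table row: header cell text followed by data cell text
        data = [item.get_text(strip=True) for item in items.select("th,td")]
        print(data)
finally:
    # Close the browser even if something above fails
    driver.quit()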

It produces this output:

['Age', '32 (03-06-1986)']
['Country', 'Spain']
['Birthplace', 'Manacor, Mallorca, Spain']
['Residence', 'Manacor, Mallorca, Spain']
['Height', '185 cm']
['Weight', '85 kg']
['Plays', 'Left-handed']
['Backhand', 'Two-handed']
['Favorite Surface', 'Clay']
['Coach', 'Carlos Moya']
['Turned Pro', '2001']
['Seasons', '17']
['Active', 'Yes']
['Prize Money', 'US$100,564,598 3rd all-time leader in earnings']
['Wikipedia', 'Wikipedia']
['Web Site', 'rafaelnadal.com']
['Facebook', 'Nadal']
['Twitter', '@RafaelNadal']
['Nicknames', 'Rafa, Bull']
['Titles', '79']
['Grand Slams', '17']
['Masters', '32']
['Olympics', '1']
['']
['Current Rank', '1 (9310)']
['Best Rank', '1 (18-08-2008)']
['Current Elo Rank', '2 (2393)']
['Best Elo Rank', '1 (16-06-2008)']
['Peak Elo Rating', '2544 (09-09-2013)']
['GOAT Rank', '2 (720)']
['Weeks at No. 1', '184']
['']
['Best Season', '2013']
['Last Appearance', '02-07-2018']
['WimbledonGrassSF']
['Overall', '82.8% (906-188)', '']
['Hard', '77.0% (425-127)', '18']
['Clay', '92.0% (413-36)', '57']
['Grass', '77.6% (66-19)', '4']
['Carpet', '25.0% (2-6)']
['H2H', '11523']
['H2H %', '96.7%']
