BeautifulSoup does not retrieve all of the HTML

Problem description

I am trying to scrape the player statistics for this match: "https://siege.gg/matches/5694-invitational-intl-faze-clan-vs-team-liquid", but my code does not seem to retrieve all of the HTML. Can someone help me?

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = "https://siege.gg/matches/5694-invitational-intl-faze-clan-vs-team-liquid"
match_page = requests.get(url, headers=headers)

match_soup = BeautifulSoup(match_page.content, features="lxml")

# Look up the stats table wrapper by its id
all_stats_soup = match_soup.find(id="DataTables_Table_0_wrapper")

This part of the HTML does not appear in match_soup, so the find call returns None.

Tags: python, web-scraping, beautifulsoup

Solution


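The data is stored in a JavaScript variable in the page source. The element with id DataTables_Table_0_wrapper is not part of the HTML that requests downloads, which is why find returns None; wrappers with that kind of id are typically created by the DataTables JavaScript library when it runs in a browser. You can use the re module to pull the table markup out of the variable.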

This example parses the table data into a pandas DataFrame:

import re
import requests
import pandas as pd
from io import StringIO

url = "https://siege.gg/matches/5694-invitational-intl-faze-clan-vs-team-liquid"

html_doc = requests.get(url).text
df = pd.read_html(StringIO(re.search(r"var a = `(.*)`", html_doc).group(1)))[0]

print(df)

Prints:

  Unnamed: 0  Rating    K-D (+/-) Entry (+/-) KOST   KPR  SRV  1vX  Plant  HS%       Atk     Def  Team
0  cameram4n    0.74  16-27 (-11)    1-4 (-3)  56%  0.44  25%    1      0  47%      Iana    Mute    50
1    muringa    0.83   15-20 (-5)    1-3 (-2)  58%  0.42  44%    0      1  67%  Thatcher   Smoke    19
2      Astro    1.03   24-23 (+1)    2-3 (-1)  56%  0.67  36%    2      3  50%       Ace    Kaid    50
3    NESKWGA    1.20  35-25 (+10)    5-5 (+0)  58%  0.97  31%    0      1  56%    Hibana   Jager    19
4    Bullet1    0.84   22-29 (-7)    5-7 (-2)  53%  0.61  19%    0      1  32%       Ash   Jager    50
5       psk1    0.83   16-23 (-7)    2-6 (-4)  61%  0.44  36%    0      1  31%     Nomad    Mute    19
6  xS3xyCake    1.13   27-23 (+4)    5-1 (+4)  78%  0.75  36%    0      3  50%  Maverick    Echo    19
7      Cyber    0.90   25-28 (-3)    4-4 (+0)  56%  0.69  22%    0      0  36%    Sledge   Smoke    50
8      Paluh    1.47  42-21 (+21)    6-2 (+4)  72%  1.17  42%    3      0  72%    Sledge  Melusi    19
9     soulz1    0.88   24-29 (-5)    5-1 (+4)  58%  0.67  19%    0      1  52%  Maverick    Echo    50
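
The one-liner above raises an AttributeError when the pattern is not found, because re.search returns None in that case. A minimal sketch of a guarded version follows, assuming the table markup is still assigned to a template literal named a, as in the answer. The helper name extract_table_html and the SAMPLE_HTML string are hypothetical stand-ins, and the non-greedy pattern with re.DOTALL is an assumption that differs slightly from the original regex so that a literal spanning several lines is also captured.

```python
import re

# Hypothetical stand-in for the page source: the table markup sits inside a
# JavaScript template literal, not in the static HTML body.
SAMPLE_HTML = """<html><body>
<div id="stats"></div>
<script>
var a = `<table><tr><th></th><th>Rating</th></tr>
<tr><td>Paluh</td><td>1.47</td></tr></table>`;
</script>
</body></html>"""

# Non-greedy match with DOTALL so a literal that spans several lines is
# captured up to its first closing backtick.
TABLE_VAR = re.compile(r"var a = `(.*?)`", re.DOTALL)


def extract_table_html(html_doc):
    """Return the markup stored in `var a`, or None if it is not found."""
    match = TABLE_VAR.search(html_doc)
    return match.group(1) if match else None


table_html = extract_table_html(SAMPLE_HTML)
if table_html is None:
    print("table variable not found; the page layout may have changed")
else:
    print(table_html)
```

On the real page, the same helper could be given requests.get(url).text, and its result passed to pd.read_html(StringIO(...)) as in the answer.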

Or with bs4:

# Reuses re and html_doc from the previous example
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    re.search(r"var a = `(.*)`", html_doc).group(1), "html.parser"
)

for tr in soup.select("tr"):
    print(*tr.get_text(strip=True, separator="|").split("|"), sep="\t")

Prints:

Rating  K-D (+/-)       Entry (+/-)     KOST    KPR     SRV     1vX     Plant   HS%     Atk     Def     Team
cameram4n       0.74    16-27 (-11)     1-4 (-3)        56%     0.44    25%     1       0       47%     Iana    Mute    50
muringa 0.83    15-20 (-5)      1-3 (-2)        58%     0.42    44%     0       1       67%     Thatcher        Smoke   19
Astro   1.03    24-23 (+1)      2-3 (-1)        56%     0.67    36%     2       3       50%     Ace     Kaid    50
NESKWGA 1.20    35-25 (+10)     5-5 (+0)        58%     0.97    31%     0       1       56%     Hibana  Jager   19
Bullet1 0.84    22-29 (-7)      5-7 (-2)        53%     0.61    19%     0       1       32%     Ash     Jager   50
psk1    0.83    16-23 (-7)      2-6 (-4)        61%     0.44    36%     0       1       31%     Nomad   Mute    19
xS3xyCake       1.13    27-23 (+4)      5-1 (+4)        78%     0.75    36%     0       3       50%     Maverick        Echo    19
Cyber   0.90    25-28 (-3)      4-4 (+0)        56%     0.69    22%     0       0       36%     Sledge  Smoke   50
Paluh   1.47    42-21 (+21)     6-2 (+4)        72%     1.17    42%     3       0       72%     Sledge  Melusi  19
soulz1  0.88    24-29 (-5)      5-1 (+4)        58%     0.67    19%     0       1       52%     Maverick        Echo    50
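
In the output above, the header row has one field fewer than the data rows, because the empty first header cell contributes no text and the columns shift left. Reading the table cell by cell keeps the columns aligned. The following is a rough sketch that uses only the standard library's html.parser rather than bs4, so it runs without extra packages. The TableRows class, the SAMPLE_TABLE string (a cut-down table with values taken from the output above) and the "Player" label for the unnamed column are all hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td>, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # An empty cell is kept as "" so the columns stay aligned.
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


# Hypothetical, cut-down version of the table markup.
SAMPLE_TABLE = (
    "<table>"
    "<tr><th></th><th>Rating</th><th>KOST</th></tr>"
    "<tr><td>Paluh</td><td>1.47</td><td>72%</td></tr>"
    "<tr><td>Astro</td><td>1.03</td><td>56%</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(SAMPLE_TABLE)
parser.close()

header, *body = parser.rows
header[0] = header[0] or "Player"  # hypothetical name for the unnamed column
records = [dict(zip(header, row)) for row in body]
for record in records:
    print(record)
```

With bs4, the equivalent per-cell approach would be to loop over tr.select("th, td") and call get_text(strip=True) on each cell instead of splitting the joined row text.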
