Web Scraping with XPath - Element Not Found After Copying Its XPath

Problem Description

I'm trying to get the text of a specific section from this webpage... I tried code I found in a similar post:

# Import required modules
from lxml import html
import requests
  
# Request the page
page = requests.get('https://www.baseball-reference.com/players/k/kershcl01.shtml')
  
# Parsing the page
tree = html.fromstring(page.content)
  
# Get element using XPath
share = tree.xpath(
    '//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)

The output is just an empty list: []

Tags: python, html, xpath, python-requests, lxml

Solution


As mentioned, this is a rendered/dynamic part of the site. The table sits inside an HTML comment, so you need to pull the comment out of the HTML and then parse it. The other problem is that inside the comment there is no <tbody> tag, so your XPath finds nothing; you need to drop it from the expression. I'm not sure exactly what you want to extract, though (the link, or the text?). I've adjusted your code below to show how to do this with lxml, but I don't particularly like that approach; I'd rather just use BeautifulSoup. BeautifulSoup doesn't support XPath, however, so a CSS selector is used instead.

Your code, modified:

import requests
from lxml import html
from bs4 import BeautifulSoup, Comment

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}

url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# Collect every HTML comment node; the leaderboard table is delivered inside one of these comments
comments = soup.find_all(string=lambda text: isinstance(text, Comment))


# Find the comment that contains the leaderboard div, then parse its contents as HTML
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        htmlStr = str(each)

        # Parsing the page
        tree = html.fromstring(htmlStr)
          
        # Get element using XPath (note: no tbody in the path)
        share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tr[11]/td/a')
        print(share)
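
If what you actually want is the text or the link rather than the element object, both can be read off the matched lxml element. A minimal sketch, assuming the XPath above found the <a> node:

if share:
    link = share[0]                    # first matched <a> element
    print(link.text_content())         # e.g. 4.58 Career Shares
    print(link.get('href'))            # e.g. /leaders/mvp_cya.shtml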

How I would do it:

import requests
from bs4 import BeautifulSoup, Comment

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}

url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        # Re-parse the comment's contents and select with a CSS selector instead of XPath
        soup = BeautifulSoup(str(each), 'html.parser')
        share = soup.select('div#leaderboard_cyyoung > table > tr:nth-child(12) > td > a')
        print(share)
        break

Output:

[<a href="/leaders/mvp_cya.shtml">4.58 Career Shares</a>]
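
If you only need the text or the href, both can be read directly from the matched Tag. A minimal sketch, assuming share is non-empty:

if share:
    link = share[0]               # first matched Tag
    print(link.get_text())        # 4.58 Career Shares
    print(link['href'])           # /leaders/mvp_cya.shtml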
