Parsing HTML with Python, requests and LXML

Problem description

So I'm trying to parse an HTML page to extract two pieces of data from an unordered list.

There are thousands of <li> elements in the page which have the following structure ...

    <li>
        <a href="/lesson/check/119" target="_blank">
            Check lesson <b>#119</b> "structure-of-the-blood-vessels"
        </a>
    </li>

This is the Python code I have got so far ...

import requests
from lxml import html

auth = {
  'user_login_form[_username]'      : 'USERNAME',
  'user_login_form[_plainPassword]' : 'PASSWORD',
  'user_login_form[csrf_token]'     : 'TOKEN'
  }

login_url = 'https://example.com/login'
page_url = 'https://example.com/lesson/list'

session = requests.Session()

p = session.post(
  login_url,
  data=auth
  )

print('Connecting to site ...',p.ok)

r = session.get(
  page_url
  )

print('Connecting to page ...',r.ok)

# Parsing text of the webpage into a DOM tree
tree = html.fromstring(r.text)
collection = tree.xpath('//li/a/descendant::text()')

for element in collection:
  print(element)

... and the output I get from this is ...


                Check lesson 
#106
 "functions-of-the-skeleton-4"


                Check lesson 
#107
 "classification-of-bones-1"


... etc.

The output I want from the script is ...

106,functions-of-the-skeleton-4

I then want to follow the URL from each <li><a> tag to grab a single piece of information from that page ...

    <h1 class="head-h1" style="padding: 1%;">Lesson #106 - Functions of the Skeleton</h1>

... so the final line of data generated by the script is ...

106,functions-of-the-skeleton-4,Functions of the Skeleton

Basically, I'm trying to make sure that the 'slug' for the lesson on the first page is the same as the lesson title on the child page.

Please can you help with the XPATH / Python?

Tags: python, html, xpath, python-requests

Solution


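Regarding the XPath part. For each a element, taken as the context node:

Use the following expression to generate the first part of the final line. The paths are relative (b, text()[2]) so that each a element yields its own values, and substring-after drops the leading "#":

concat(substring-after(b,"#"),",",translate(normalize-space(text()[2]),'"',""),",")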

(Output: 106,functions-of-the-skeleton-4,).
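A minimal sketch of this step with lxml, assuming the list markup shown in the question. The sample HTML, the ROW_XPATH name and the first_part helper are made up for illustration. The expression is evaluated once per a element, using relative paths so each link produces its own row:

```python
from lxml import html

# Sample markup modelled on the question's list items (illustrative data).
SAMPLE = """
<ul>
  <li>
    <a href="/lesson/check/106" target="_blank">
        Check lesson <b>#106</b> "functions-of-the-skeleton-4"
    </a>
  </li>
  <li>
    <a href="/lesson/check/107" target="_blank">
        Check lesson <b>#107</b> "classification-of-bones-1"
    </a>
  </li>
</ul>
"""

# Relative paths: b and text()[2] are resolved against the current <a>.
# substring-after strips the leading "#", translate removes the double quotes.
ROW_XPATH = """concat(substring-after(b, "#"), ",", translate(normalize-space(text()[2]), '"', ''), ",")"""


def first_part(tree):
    """Return (href, 'number,slug,') pairs for every lesson link in the tree."""
    return [(a.get("href"), a.xpath(ROW_XPATH)) for a in tree.xpath("//li/a")]


rows = first_part(html.fromstring(SAMPLE))
for href, row in rows:
    print(href, row)
```

In the question's script, the tree would come from html.fromstring(r.text) instead of the sample string, and each href can later be used to fetch the lesson page.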

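Store the text of the b element (#106) in a variable (for example, "foo"). Then, on the second page, get the title with:

normalize-space(substring-after(//h1[contains(.,"{foo}")],"-"))

(Output: Functions of the Skeleton). Here {foo} stands for the stored value, which has to be inserted as a quoted string. Concatenate the two results to get the final line of data.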
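The second step might look like the sketch below, assuming the h1 markup shown in the question. Rather than pasting the stored value into the expression text, it passes it as an XPath variable ($foo), which lxml accepts as a keyword argument to xpath(). The LESSON_PAGE sample, the lesson_title helper and the hard-coded first_part value are illustrative only:

```python
from lxml import html

# Heading copied from the question's lesson page, wrapped in a minimal document.
LESSON_PAGE = (
    "<html><body>"
    '<h1 class="head-h1" style="padding: 1%;">'
    "Lesson #106 - Functions of the Skeleton</h1>"
    "</body></html>"
)

# $foo is an XPath variable; lxml fills it from the keyword argument below.
TITLE_XPATH = 'normalize-space(substring-after(//h1[contains(., $foo)], "-"))'


def lesson_title(page_text, foo):
    """Return the title that follows the first '-' in the h1 containing foo."""
    tree = html.fromstring(page_text)
    return tree.xpath(TITLE_XPATH, foo=foo)


first_part = "106,functions-of-the-skeleton-4,"  # result of the first step
foo = "#106"                                     # text of the <b> element
final_line = first_part + lesson_title(LESSON_PAGE, foo)
print(final_line)
```

With the question's session, page_text would presumably be session.get(...).text for the URL built from each link's href, for example with urllib.parse.urljoin(page_url, href).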

