python - Parsing HTML with Python, requests and lxml
Problem description
So I'm trying to parse an HTML page to extract two pieces of data from an unordered list.
There are thousands of <li> elements in the page which have the following structure ...
<li>
<a href="/lesson/check/119" target="_blank">
Check lesson <b>#119</b> "structure-of-the-blood-vessels"
</a>
</li>
This is the Python code I have got so far ...
import requests
from lxml import html

# Login form fields (placeholder values)
auth = {
    'user_login_form[_username]': 'USERNAME',
    'user_login_form[_plainPassword]': 'PASSWORD',
    'user_login_form[csrf_token]': 'TOKEN',
}

login_url = 'https://example.com/login'
page_url = 'https://example.com/lesson/list'

# Use one session so the login cookies are sent with later requests
session = requests.Session()

p = session.post(login_url, data=auth)
print('Connecting to site ...', p.ok)

r = session.get(page_url)
print('Connecting to page ...', r.ok)

# Parsing text of the webpage into a DOM tree
tree = html.fromstring(r.text)

# Every text node below each <li><a>, including the text inside <b>
collection = tree.xpath('//li/a/descendant::text()')
for element in collection:
    print(element)
... and the output I get from this is ...
Check lesson
#106
"functions-of-the-skeleton-4"
Check lesson
#107
"classification-of-bones-1"
... etc.
The output I want from the script is ...
106,functions-of-the-skeleton-4
I then want to follow the URL from each <li><a> tag to grab a single piece of information from that page ...
<h1 class="head-h1" style="padding: 1%;">Lesson #106 - Functions of the Skeleton</h1>
... so the final line of data generated by the script is ...
106,functions-of-the-skeleton-4,Functions of the Skeleton
Basically, I'm trying to make sure that the 'slug' for the lesson on the first page is the same as the lesson title on the child page.
Please can you help with the XPATH / Python?
Solution
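Regarding the XPath part. For each a element:
Use the following expression, evaluated relative to that a element, to generate the first part of the final line. The paths are relative so that each a element yields its own values, and substring-after drops the leading "#" from the lesson number:
concat(substring-after(b,"#"),",",translate(normalize-space(text()[2]),'"',""),",")
(Output: 106,functions-of-the-skeleton-4,).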
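A minimal sketch of this first step with lxml, assuming the list markup shown in the question. The sample HTML and the helper name list_rows are made up for illustration, and the expression is evaluated relative to each a element. This variant leaves off the trailing comma so the title can be joined on later.

```python
from lxml import html

# Hypothetical sample mirroring the <li> structure from the question.
SAMPLE = """
<ul>
<li>
<a href="/lesson/check/106" target="_blank">
Check lesson <b>#106</b> "functions-of-the-skeleton-4"
</a>
</li>
<li>
<a href="/lesson/check/107" target="_blank">
Check lesson <b>#107</b> "classification-of-bones-1"
</a>
</li>
</ul>
"""

# Relative XPath: lesson number without '#', then the slug without quotes.
ROW_XPATH = (
    'concat(substring-after(b, "#"), ",", '
    'translate(normalize-space(text()[2]), \'"\', ""))'
)


def list_rows(page_source):
    """Return (href, 'number,slug') pairs for every <li><a> in the page."""
    tree = html.fromstring(page_source)
    return [(a.get("href"), a.xpath(ROW_XPATH)) for a in tree.xpath("//li/a")]


for href, row in list_rows(SAMPLE):
    print(href, row)
```

With the real site, the page source would be r.text from the session in the question, and each href gives the child page to fetch next.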
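Store the text of the b element (#106) in a variable (for example, foo). Then, on the second page, get what you need with:
normalize-space(substring-after(//h1[contains(.,{foo})],"-"))
(Output: Functions of the Skeleton). Here {foo} is a placeholder for the stored value, which has to be inserted as a quoted string. Concatenate the two results to get the final line of data.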
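The second step could look roughly like the sketch below. The sample <h1> comes from the question; title_for, slugify and slug_matches are hypothetical helpers. The slug comparison assumes a slug is the lowercase, hyphenated title plus an optional numeric suffix, which the question's examples suggest but do not confirm. The lesson number is passed as an XPath variable ($foo) instead of being pasted into the expression.

```python
import re
from lxml import html

# Sample child page, using the <h1> from the question.
CHILD = (
    '<html><body><h1 class="head-h1" style="padding: 1%;">'
    'Lesson #106 - Functions of the Skeleton</h1></body></html>'
)


def title_for(page_source, number):
    """Return the text after the first '-' in the <h1> mentioning '#<number>'."""
    tree = html.fromstring(page_source)
    return tree.xpath(
        'normalize-space(substring-after(//h1[contains(., $foo)], "-"))',
        foo="#" + number,
    )


def slugify(title):
    """Assumed slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slug_matches(slug, title):
    """Compare after dropping an assumed trailing '-<digits>' from the slug."""
    return re.sub(r"-\d+$", "", slug) == slugify(title)


row = "106,functions-of-the-skeleton-4"
number, slug = row.split(",", 1)
title = title_for(CHILD, number)
print(f"{row},{title}")
print(slug_matches(slug, title))
```

For the sample above this prints the final line from the question, followed by True. In practice page_source would be session.get(child_url).text, with child_url built from each a element's href.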