python - Using lxml with html, requests, and etree returns the links, but I cannot search the links for specific text
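Problem description
I am trying to extract specific data from the link provided below. When I run the code, it gives me all the href links as expected, but when I test the same string further using the contains syntax, it comes back empty.
I have read the documentation as well as DevHints, and everywhere I look the contains syntax is the recommended way to capture what I am looking for when all I know is that the string will be present, but not where or how.
I am trying to build a scraper to help a lot of recently laid-off people find new jobs, so any help is greatly appreciated.
Code: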
from lxml import html, etree
import requests
page = requests.get('https://ea.gr8people.com/index.gp?method=cappportal.showPortalSearch&sysLayoutID=123')
# print(page.content)
tree = html.fromstring(page.content)
print(tree)
# Select All Nodes
AllNodes = tree.xpath("//*")
# Select Only hyperlink nodes
AllHyperLinkNodes = tree.xpath("//*/a")
# Iterate through all Node Links
for node in AllHyperLinkNodes:
    print(node.values())
print("======================================================================================================================")
# select using a condition 'contains'
# NodeThatContains = tree.xpath('//td[@class="search-results-column-left"]/text()')
NodeThatContains = tree.xpath('//*/a[contains(text(),"opportunityid")]')
for node in NodeThatContains:
    print(node.values())
# Print the link that 'contains' the text
# print(NodeThatContains[0].values())
Solution
A solution based on BeautifulSoup
from bs4 import BeautifulSoup
import requests
page = requests.get('https://ea.gr8people.com/index.gp?method=cappportal.showPortalSearch&sysLayoutID=123').content
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('a')
links = [a for a in links if a.attrs.get('href') and 'opportunityid' in a.attrs.get('href')]
print('-- opportunities --')
for idx, link in enumerate(links):
    print('{}) {}'.format(idx, link))
Output
-- opportunities --
0) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=154761&opportunityid=154761">
2D Capture Artist - 6 month contract
</a>
1) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=154426&opportunityid=154426">
Accounting Supervisor
</a>
2) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=152147&opportunityid=152147">
Advanced Analyst
</a>
3) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=153395&opportunityid=153395">
Advanced UX Researcher
</a>
4) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=151309&opportunityid=151309">
AI Engineer
</a>
5) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=150468&opportunityid=150468">
AI Scientist
</a>
6) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=151310&opportunityid=151310">
AI Scientist - NLP Focus
</a>
7) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=153351&opportunityid=153351">
AI Software Engineer (Apex Legends)
</a>
8) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=152737&opportunityid=152737">
AI Software Engineer (Frostbite)
</a>
9) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=154764&opportunityid=154764">
Analyste Qualité Sénior / Senior Quality Analyst
</a>
10) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=153948&opportunityid=153948">
Animator 1
</a>
11) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=151353&opportunityid=151353">
Applications Agreement Analyst
</a>
12) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=154668&opportunityid=154668">
AR Analyst I
</a>
13) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=153609&opportunityid=153609">
AR Specialist
</a>
14) <a href="index.gp?method=cappportal.showJob&layoutid=2092&inp1541=&inp1375=154773&opportunityid=154773">
Artiste Audio / Audio Artist
</a>
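For anyone who would rather stay with lxml: the original XPath most likely comes back empty because contains(text(), "opportunityid") tests the visible link text (the job title), while the string "opportunityid" only appears in the href attribute, as the output above shows. Below is a minimal sketch of the same filter in lxml, testing @href instead. SAMPLE_HTML, BASE_URL, and find_opportunity_links are hypothetical names introduced here for illustration; the sample markup is a hand-written stand-in for the live page, so the sketch runs without a network request.

```python
from urllib.parse import urljoin

from lxml import html

# Hypothetical sample markup standing in for the live search results page.
SAMPLE_HTML = """
<html><body>
  <a href="index.gp?method=cappportal.showPortalSearch">Search</a>
  <a href="index.gp?method=cappportal.showJob&amp;opportunityid=154761">
    2D Capture Artist - 6 month contract
  </a>
  <a href="index.gp?method=cappportal.showJob&amp;opportunityid=154426">
    Accounting Supervisor
  </a>
  <a>Anchor without an href</a>
</body></html>
"""

# Assumed base for resolving the relative hrefs (taken from the question's URL).
BASE_URL = "https://ea.gr8people.com/index.gp"


def find_opportunity_links(markup, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for links whose href contains 'opportunityid'."""
    tree = html.fromstring(markup)
    # Test the href attribute, not text(): the link text is only the job title.
    nodes = tree.xpath('//a[contains(@href, "opportunityid")]')
    return [
        (node.text_content().strip(), urljoin(base_url, node.get("href")))
        for node in nodes
    ]


if __name__ == "__main__":
    for title, url in find_opportunity_links(SAMPLE_HTML):
        print(title, "->", url)
```

To try it against the real page, pass requests.get(...).content in place of SAMPLE_HTML. Whether the live markup still matches this pattern is an assumption and worth checking by printing a few hrefs first.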