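python - Python requests.get does not return the text of one of the tags in the HTML document
Question
I am trying to parse job descriptions from Djinni for a personal project. I am using Python 3.6, BeautifulSoup4, and the requests library. When I fetch the HTML of a vacancy page with requests.get, the returned HTML lacks the most important part: the description text. For example, take the URL of this page (example) and the following code I wrote: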
def scrape_job_desc(self, url):
    job_desc_html = self._get_search_page_html(url)
    soup = BeautifulSoup(job_desc_html, features='html.parser')
    try:
        short_desc = str(soup.find('p', {'class': 'job-teaser svelte-a3rpl2'}).getText())
        full_desc = soup.find('div', {'class': 'job-description-wrapper svelte-a3rpl2'}).find('p').getText()
    except AttributeError:
        short_desc = None
        full_desc = None
    return short_desc, full_desc

def _get_search_page_html(self, url):
    html = requests.get(url=url, headers={'User-Agent': 'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
    return html.text
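It returns short_desc but not full_desc. Moreover, the text of the required <p> tag is not present in the HTML at all, yet it is there when I download the page with a browser. What causes this?
Answer
The full job description is stored in the page as a JavaScript variable, which is why it is missing from the HTML tags that BeautifulSoup parses. You can extract it with selenium or with the re module: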
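import re
import requests
from bs4 import BeautifulSoup

url = 'https://djinni.co/jobs2/144172-data-scientist'
html_data = requests.get(url).text

# The full description sits inside a script as fullDescription:"...",
# so pull it out of the raw page text with a regular expression.
full_desc = re.search(r'fullDescription:"(.*?)",', html_data).group(1).replace(r'\r\n', '\n')

# The short description is ordinary HTML, so BeautifulSoup can find it.
short_desc = BeautifulSoup(html_data, 'html.parser').select_one('.job-teaser').get_text()

print(short_desc)
print('-' * 80)
print(full_desc)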
Prints:
Together Networks is looking for an experienced Data Scientist to join our Agile team. Together Networks is a worldwide leader in the online dating niche with millions of users across more than 45 countries. Our brands are BeNaughty, CheekyLovers, Flirt, Click&Flirt, Flirt Spielchen.
--------------------------------------------------------------------------------
What you get to deal with:
- Active collaboration with stakeholders throughout the organization;
- User experience modelling;
- Advanced segmentation;
- User behavior analytics;
- Anomaly detection, fraud detection;
- Looking for bottlenecks;
- Churn prediction.
You need to have (required):
- Master's or PHD in Statistics, Mathematics, Computer Science or another quantitative field;
- 2+ years of experience manipulating data sets and building statistical models;
- Strong knowledge in a wide range of machine learning methods and algorithms for classification, regression, clustering, and others;
- Knowledge and experience in statistical and data mining techniques;
- Experience using statistical computer languages (Python, SLQ, etc.) to manipulate data and draw insights from large data sets.
- Knowledge of a variety of machine learning techniques and their real-world advantages\u002Fdrawbacks;
- Experience visualizing\u002Fpresenting insights for stakeholders;
- Independent, creative thinking, and ability to learn fast.
Would be a great plus:
- Experience dealing with end to end machine learning projects: data exploration, feature engineering\u002Fdefinition, model building, production, maintenance;
- Experience in data visualization with Tableau;
- Experience in dating, game dev, social projects.
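The output above still contains raw JavaScript escapes such as \u002F, because the regular expression copies the string literal as it appears in the page source. A minimal sketch of one way to decode them, assuming the literal uses only escapes that are also valid in JSON (so json.loads can parse it); the extract_full_description helper and the sample_html fragment are hypothetical and not taken from the Djinni page:

```python
import json
import re

# Matches the contents of a double-quoted JavaScript string literal,
# allowing backslash-escaped characters (including \") inside it.
FULL_DESC_RE = re.compile(r'fullDescription:"((?:[^"\\]|\\.)*)"')


def extract_full_description(html_data):
    """Return the decoded fullDescription string, or None if it is absent."""
    match = FULL_DESC_RE.search(html_data)
    if match is None:
        return None
    # Re-wrap the captured text in quotes and let the JSON parser resolve
    # escapes such as \r\n, \u002F and \". This assumes the literal contains
    # no JavaScript-only escapes (for example \x41 or \').
    text = json.loads('"' + match.group(1) + '"')
    return text.replace('\r\n', '\n')


# Hypothetical page fragment, made up for illustration only.
sample_html = r'<script>var job = {title:"Data Scientist",fullDescription:"Line one\r\n- advantages\u002Fdrawbacks",id:1}</script>'

print(extract_full_description(sample_html))
```

Returning None when the variable is missing mirrors the question's own fallback, so the caller can treat a changed page layout the same way as a missing description.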