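python - I want to extract the name of each topic and the name of each video
Problem description
I have the following code, which does not seem to work the way I want: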
import pathlib
import requests
from bs4 import BeautifulSoup as bs
import re
import sys
import os
import lxml.html
url = sys.argv[1]
page = requests.get(url)
tree = lxml.html.fromstring(page.content)
names = tree.xpath('//div[@class="cd-timeline-block"]/text()')
names = filter(lambda n: n.strip(), names)
table = str.maketrans(dict.fromkeys('?:/'))
for index, name in enumerate(names, start=1):
    print('/{}.{}'.format(index, name.strip().translate(table)))
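So I want to extract the name of each topic and the name of every video in that topic, and get this output from the print command. The format should look like this: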
/0.Project Tools & Documentation/1.Organizational Change
/0.Project Tools & Documentation/2.Project Management Tools
/0.Project Tools & Documentation/3.Project Documentation
/0.Project Tools & Documentation/4.Vendor Documentation
Once the first topic is finished, move on to the next topic and print another set of lines for that topic and its videos:
/1.Glossary/1.Review of Terms & Acroynms
/1.Glossary/2.Review of Formulas
and print something like this:
https://streaming.ine.com/play/dfdf64b8-30a5-4bce-8ade-a09ec56bcd6d/vendor-documentation
The page I want to extract this information from is:
https://streaming.ine.com/c/ine-comptia-pk0-004-project-plus
Thanks!
Solution
import re

import requests
from bs4 import BeautifulSoup

url = "https://streaming.ine.com/c/ine-comptia-pk0-004-project-plus"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Each topic sits in its own "cd-timeline-block" div.
listed = soup.find_all('div', class_="cd-timeline-block")
for i, sth in enumerate(listed):
    # A Tag can be searched directly, so the block does not need to be re-parsed.
    main_title = sth.find_all('div', class_="cd-timeline-topic")[0].contents[0]
    sub_list = sth.find_all('div', class_="cd-timeline-level")
    for j, elem in enumerate(sub_list):
        # The video title is taken from the third child node of the level div.
        temp = elem.contents[2].rstrip().strip('\n')
        temp = re.sub(' +', ' ', temp)  # collapse runs of spaces into one
        # temp[1:] drops the first character that remains after collapsing.
        print("/%s.%s/%s.%s" % (i, main_title, j, temp[1:]))
It is important to understand how BeautifulSoup structures your HTML page, and then to work with standard data structures such as lists.
The output is:
/0.Overview/0.Course Introduction
/1.Project Basics/0.What is a Project?
/1.Project Basics/1.Project Roles & Responsibilities
/1.Project Basics/2.Project Phases
/1.Project Basics/3.Cost Control
/1.Project Basics/4.Organizational Structures
/1.Project Basics/5.Project Schedules
/1.Project Basics/6.Agile
/1.Project Basics/7.Project Resources
/2.Project Constraints/0.Contraints
/2.Project Constraints/1.Risk Management
/3.Communication & Change Management/0.Communication Methods :: Overview
/3.Communication & Change Management/1.Use of Communication Methods
/3.Communication & Change Management/2.Communication Triggers
/3.Communication & Change Management/3.Change Control Processes
/4.Project Tools & Documentation/0.Organizational Change
/4.Project Tools & Documentation/1.Project Management Tools
/4.Project Tools & Documentation/2.Project Documentation
/4.Project Tools & Documentation/3.Vendor Documentation
/5.Glossary/0.Review of Terms & Acroynms
/5.Glossary/1.Review of Formulas
/6.Exam/0.Exam Preparation
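Note that the output above numbers the videos from 0, while the question asked for video numbers starting at 1 and for the characters ?, : and / to be removed from names. The following is a minimal sketch of one way to get that format. It first collects the topics and videos into a plain list and only then formats them. It runs on a small inline HTML sample rather than the live page: only the three class names come from the solution above, while the nesting of the sample markup and the names SAMPLE_HTML, STRIP_TABLE, collect_topics and format_paths are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the solution above;
# the nesting and whitespace are assumptions made for illustration.
SAMPLE_HTML = """
<div class="cd-timeline-block">
  <div class="cd-timeline-topic">Overview</div>
  <div class="cd-timeline-level">Course Introduction</div>
</div>
<div class="cd-timeline-block">
  <div class="cd-timeline-topic">Project Basics</div>
  <div class="cd-timeline-level">What is a Project?</div>
  <div class="cd-timeline-level">Project Phases</div>
</div>
"""

# Translation table that deletes '?', ':' and '/', as in the question's code.
STRIP_TABLE = str.maketrans(dict.fromkeys('?:/'))


def collect_topics(html):
    """Return a list of (topic, [video, ...]) tuples found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    topics = []
    for block in soup.find_all('div', class_='cd-timeline-block'):
        # Search inside the current block only, not the whole page.
        topic = block.find('div', class_='cd-timeline-topic').get_text(strip=True)
        videos = [level.get_text(' ', strip=True)
                  for level in block.find_all('div', class_='cd-timeline-level')]
        topics.append((topic, videos))
    return topics


def format_paths(topics):
    """Number topics from 0 and videos from 1, removing '?', ':' and '/'."""
    lines = []
    for i, (topic, videos) in enumerate(topics):
        for j, video in enumerate(videos, start=1):
            lines.append('/{}.{}/{}.{}'.format(
                i, topic.translate(STRIP_TABLE), j, video.translate(STRIP_TABLE)))
    return lines


if __name__ == '__main__':
    for line in format_paths(collect_topics(SAMPLE_HTML)):
        print(line)
```

Keeping the parsing step and the formatting step separate means the numbering or the set of removed characters can be changed without touching the scraping code. On the real page the video title may not be the only text inside each cd-timeline-level div, so the get_text call would probably need adjusting there.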