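python-3.x - Replace the tables in an HTML file with text (e.g. "@@## There was a table here @@##")
Problem description
I am using BeautifulSoup to extract text from an HTML file in Python. I want to extract all the text data and discard the tables. Is there a way to replace each table in the HTML with a piece of text instead (e.g. "@@## There was a table here @@##")?
I am able to read the HTML file with BeautifulSoup and remove the tables using strip_tables(html). However, I am not sure how to remove each table and leave text in its place marking where the table was.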
from bs4 import BeautifulSoup

def strip_tables(soup):
    """Removes all tables from the soup object."""
    for table in soup(["table"]):
        table.extract()
    return soup

sample_html_file = "/Path/file.html"
# read_from_file is the asker's own helper (not shown here); it reads the
# file and returns a file handle for BeautifulSoup
html = read_from_file(sample_html_file)
soup = BeautifulSoup(html, "lxml")
my_text = strip_tables(soup).text
This is the text of the HTML file, including the table:
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
Table of Contents TABLE OF CONTENTS Page QUESTIONS AND ANSWERS REGARDING THIS SOLICITATION AND VOTING AT THE ANNUAL MEETING 1 PROPOSAL ONEELECTION OF DIRECTORS 7 Classes of our Board 7 Director NomineesClass III Directors 7 Continuing DirectorsClass I and Class II Directors 8 Board of Directors Recommendation 11 PROPOSAL TWOTO APPROVE AN AMENDMENT TO OUR 2016 EQUITY INCENTIVE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN 12 Summary of the Amended 2016 Plan 13 Summary of U.S. Federal Income Tax Consequences 20 New Plan Benefits 22 Existing Plan Benefits to Employees and Directors 23 Board of Directors Recommendation 23 PROPOSAL THREETO APPROVE AN AMENDMENT TO OUR 2007 EMPLOYEE STOCK PURCHASE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN A-1 APPENDIX B AMENDED AND RESTATED 2007 EMPLOYEE STOCK PURCHASE PLAN B-1 ii Table of Contents PROXY STATEMENT FOR ACCURAY INCORPORATED 2018 ANNUAL MEETING OF STOCKHOLDERS TO BE HELD ON NOVEMBER 16, 2018
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
This is the data after strip_tables:
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
Expected result
By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine President and Chief Executive OfficerSunnyvale, California October 4, 2018
" @@## There was a table here @@## "
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)
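Solution
Try using replaceWith() instead of extract() in the strip_tables function. In current BeautifulSoup 4 releases the method is spelled replace_with(); replaceWith() is the older alias for it.
Hope this helps.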
def strip_tables(soup):
    """Replaces every table in the soup object with a placeholder string."""
    for table in soup(["table"]):
        # replace_with() is the current name of replaceWith()
        table.replace_with(" @@## There was a table here @@## ")
    return soup
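
To see the replacement end to end, here is a minimal self-contained sketch. It makes a few assumptions that are not in the original post: the sample markup is made up, the function is renamed replace_tables, the built-in "html.parser" is used so that lxml is not required, and tables nested inside another table are skipped because replacing the outer table already removes them.

```python
from bs4 import BeautifulSoup

# Marker text taken from the question.
PLACEHOLDER = " @@## There was a table here @@## "


def replace_tables(soup, placeholder=PLACEHOLDER):
    """Replace every top-level <table> in the soup with a text placeholder."""
    for table in soup.find_all("table"):
        # Skip tables nested inside another table: replacing the outer
        # table already takes them out of the document.
        if table.find_parent("table") is not None:
            continue
        table.replace_with(placeholder)
    return soup


# Hypothetical sample markup, made up for this sketch.
sample_html = (
    "<html><body>"
    "<p>By order of the Board of Directors</p>"
    "<table><tr><td>TABLE OF CONTENTS</td><td>1</td></tr></table>"
    "<p>This proxy statement is furnished to our stockholders</p>"
    "</body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")
my_text = replace_tables(soup).get_text()
print(my_text)
```

The printed text keeps both paragraphs and shows the placeholder where the table used to be. If one marker per table is wanted even for nested tables, the find_parent check can be dropped.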