Replace tables in an HTML file with text (e.g. "@@## There was a table here @@##")

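Problem description

I am using BeautifulSoup to extract text from an HTML file in Python. I want to extract all the text data and discard the tables. Is there a way to replace each table in the HTML with some text instead (e.g. "@@## There was a table here @@##")?

I am able to read the HTML file with BeautifulSoup and remove the tables using the strip_tables() function below. However, I am not sure how to remove each table and put a piece of text in its place to mark where the table was.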

from bs4 import BeautifulSoup

def strip_tables(soup):
    """Removes all tables from the soup object."""
    for table in soup(["table"]):
        table.extract()
    return soup

sample_html_file = "/Path/file.html"
html = read_from_file(sample_html_file)
# read_from_file reads the file and returns a file handle for BeautifulSoup
soup = BeautifulSoup(html, "lxml")
my_text = strip_tables(soup).text

This is the text of the HTML file, including the table:

By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine  President and Chief Executive OfficerSunnyvale, California  October 4, 2018

Table of Contents  TABLE OF CONTENTS             Page   QUESTIONS AND ANSWERS REGARDING  THIS SOLICITATION AND VOTING AT THE ANNUAL MEETING      1   PROPOSAL ONEELECTION OF  DIRECTORS      7   Classes of our Board      7   Director NomineesClass III Directors      7   Continuing DirectorsClass I and Class II Directors      8   Board of Directors Recommendation      11   PROPOSAL TWOTO APPROVE  AN AMENDMENT TO OUR 2016 EQUITY INCENTIVE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN      12   Summary of the Amended 2016 Plan      13   Summary of U.S. Federal Income Tax Consequences      20   New Plan Benefits      22   Existing Plan Benefits to Employees and Directors      23   Board of Directors Recommendation      23   PROPOSAL THREETO APPROVE  AN AMENDMENT TO OUR 2007 EMPLOYEE STOCK PURCHASE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN        A-1   APPENDIX B     AMENDED AND RESTATED 2007 EMPLOYEE STOCK PURCHASE PLAN      B-1    ii    Table of Contents    PROXY STATEMENT FOR  ACCURAY INCORPORATED  2018 ANNUAL MEETING OF STOCKHOLDERS  TO BE HELD ON NOVEMBER 16, 2018      

This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)

This is the data after strip_tables:

By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine  President and Chief Executive OfficerSunnyvale, California  October 4, 2018
     This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)

Expected result

By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine  President and Chief Executive OfficerSunnyvale, California  October 4, 2018 
" @@## There was a table here @@## "
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)

Tags: python-3.x, web-scraping, beautifulsoup

Solution


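Try using replaceWith() instead of extract() in your strip_tables function (in current Beautiful Soup 4 the method is spelled replace_with(); replaceWith() is the older alias). It puts the given string where the table used to be instead of just removing the table. Hope this helps.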

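def strip_tables(soup):
    """Replaces every table in the soup object with a placeholder string."""
    for table in soup(["table"]):
        # replace_with() is the current Beautiful Soup 4 spelling of replaceWith()
        table.replace_with(" @@## There was a table here @@## ")
    return soup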
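To check the behaviour without the original file, here is a minimal self-contained sketch. The inline sample_html string, the replace_tables name, and the PLACEHOLDER constant are made up for illustration, and the built-in "html.parser" is assumed in place of "lxml" only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical placeholder text, taken from the wording used in the question.
PLACEHOLDER = " @@## There was a table here @@## "


def replace_tables(soup, placeholder=PLACEHOLDER):
    """Swap every <table> in the soup for a plain-text placeholder."""
    for table in soup.find_all("table"):
        # replace_with() detaches the tag and inserts the string in its place.
        table.replace_with(placeholder)
    return soup


# Made-up stand-in for the real file: text, a table, then more text.
sample_html = """
<html><body>
<p>By order of the Board of Directors</p>
<table><tr><td>TABLE OF CONTENTS</td><td>1</td></tr></table>
<p>This proxy statement is furnished to our stockholders</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
my_text = replace_tables(soup).get_text()
print(my_text)
```

The printed text should keep both paragraphs in their original order, with the placeholder between them and none of the table's cell text.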
