首页 > 解决方案 > 如何在忽略特定标签的情况下从 Python 中的 html 中提取文本

问题描述

我有一个如下所示的输入文件,我正在尝试提取其文本并删除 html 标签。请注意,我希望每个 p 都在换行符中,但如果它是 br 我想将它保留在同一行中,但无论如何都删除 br 标签。

<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en" xmlns:tts="http://www.w3.org/ns/ttml#parameter"><head><styling><style id="b1"/></sty    ling></head><body><div xml:lang="en" style="b1"><p begin="" end="0.143">HISTORY</p><p begin="0.143" end="0.286">HISTORY TV"</p><p begin=    "0.286" end="0.714">HISTORY TV" THIS</p><p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p><p begin="0.857" end="3">HISTORY TV    " THIS<br/>WEEKEND ON</p><p begin="3" end="3.333">HISTORY TV" THIS<br/>WEEKEND ON C-SPAN3.</p><p begin="3.333" end="3.667">WEEKEND ON C-    SPAN3.<br/>&gt;&gt;&gt;</p><p begin="3.667" end="4">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE</p><p begin="4" end="4.5">WEEKEND ON C-SPA    N3.<br/>&gt;&gt;&gt; "THE MARCH</p><p begin="4.5" end="5">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH ON</p><p begin="5" end="5.5">W    EEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH ON WASHINGTON"</p><p begin="5.5" end="5.667">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>F    OR</p><p begin="5.667" end="5.833">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS</p><p begin="5.833" end="6">&gt;&gt;&gt; "THE MAR    CH ON WASHINGTON"<br/>FOR JOBS AND</p><p begin="6" end="6.2">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM</p><p begin    ="6.2" end="6.4">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS</p><p begin="6.4" end="7">&gt;&gt;&gt; "THE MARCH O    N WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS 49</p><p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p><p begin="8" end="8.5">FO    R JOBS AND FREEDOM WAS 49<br/>YEARS AGO.</p><p begin="8.5" end="8.75">YEARS AGO.<br/>ON</p><p begin="8.75" end="9">YEARS AGO.<br/>ON AUG    UST</p><p begin="9" end="13">YEARS AGO.<br/>ON AUGUST 28th,</p><p begin="13" end="13.333">YEARS AGO.<br/>ON AUGUST 28th, 1963.</p><p beg    in="13.333" end="13.5">ON AUGUST 28th, 1963.<br/>THE</p><p begin="13.5" end="13.667">ON AUGUST 28th, 1963.<br/>THE MARCH</p><p begin="13    .667" end="13.833">ON AUGUST 28th, 1963.<br/>THE MARCH WAS</p><p begin="13.833" end="14">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANI    ZED</p><p begin="14" end="14.167">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANIZED TO</p><p begin="14.167" end="14.667">ON AUGUST 28th    , 1963.<br/>THE MARCH WAS ORKGANIZED TO PUSH</p><p begin="14.667" end="14.833">THE MARCH WAS ORKGANIZED TO PUSH<br/>FOR</p>

所以最后我想拥有

HISTORY
HISTORY TV"
HISTORY TV" THIS
HISTORY TV" THIS WEEKEND
HISTORY TV" THIS WEEKEND ON
HISTORY TV" THIS WEEKEND ON C-SPAN3.
...etc

我如何完成这项任务?

我用了这段代码

import re
import os

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub(' ', str(data)).strip()

directory = './reprocess'
for filename in os.listdir(directory):
    if filename.endswith(".dfxp"):
        print("Processing: {}".format(filename))
        with open("./reprocess/"+filename, "r") as inputFile:
            data = inputFile.read().splitlines()
            new_data = ""
            for line in data:
                new_data = new_data + remove_html_tags(line) + "\n"
        with open("./rmout/"+filename, "w") as text_file:
            text_file.write(new_data)

但它给了我一个可怕的输出

HISTORY TV"
WEEKEND ON
HISTORY TV" THIS
 "THE MARCH
 "THE MARCH ON
WEEKEND ON C-SPAN3.
FOR JOBS AND FREEDOM WAS
 "THE MARCH ON WASHINGTON"
YEARS
FOR JOBS AND FREEDOM WAS 49
ON AUGUST 28th,
YEARS AGO.
THE MARCH WAS ORKGANIZED TO
ON AUGUST 28th, 1963.
FOR COMPREHENSIVE CIVIL
THE MARCH WAS ORKGANIZED TO PUSH
INCLUDING PUBLIC
FOR COMPREHENSIVE CIVIL RIGHTS
DESEGREGATION,
DESEGREGATION, VOTING

标签: pythonpython-3.xregex

解决方案


消化

该代码使用bs4(BeautifulSoup4,官方文档)并包含 3 个主要步骤:

  1. 汤前数据清理:有时清理原始文本中的一些数据比清理汤中的数据更方便。如果是这种情况,请不要犹豫。
  2. 构建汤(DOM)
  3. 汤元素提取和后处理提取的文本。

代码

免责声明:广泛测试并始终期待异常。问题解决者无法预见样本数据中没有出现的问题。

import bs4
import re
from pprint import pprint

# raw data
html = "(as provided)"

# 1. cleansing

# (1) remove known unwanted patterns
html = html.replace("    ", "")
html = html.replace("&gt;&gt;&gt;", "")
# remove <br> tags (can also remove after the soup is built)
html = re.sub(r"<br\s*/?>", " ", html)  # careful! error-prone!

# (2) regularize multiple spaces
html = re.sub(r"\s{2,}", " ", html)

# 2. construct soup (DOM)
soup = bs4.BeautifulSoup(html, 'html.parser')

# 3. extract text in target elements    
ls_lines = []
for el in soup.find_all("p"):
    ls_lines.append(el.get_text().strip())

# check
for line in ls_lines:
    print(line)

输出

现在的输出看起来非常不错。然而,这是因为在小样本数据集中找不到太多问题。在实际情况下,可能需要更多的预处理和元素选择规则。这部分超出了这个问题的范围。

HISTORY
HISTORY TV"
HISTORY TV" THIS
HISTORY TV" THIS WEEKEND
HISTORY TV" THIS WEEKEND ON
HISTORY TV" THIS WEEKEND ON C-SPAN3.
WEEKEND ON C-SPAN3.
WEEKEND ON C-SPAN3. "THE
WEEKEND ON C-SPAN3. "THE MARCH
WEEKEND ON C-SPAN3. "THE MARCH ON
WEEKEND ON C-SPAN3. "THE MARCH ON WASHINGTON"
"THE MARCH ON WASHINGTON" FOR
"THE MARCH ON WASHINGTON" FOR JOBS
"THE MARCH ON WASHINGTON" FOR JOBS AND
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS 49
FOR JOBS AND FREEDOM WAS 49 YEARS
FOR JOBS AND FREEDOM WAS 49 YEARS AGO.
YEARS AGO. ON
YEARS AGO. ON AUGUST
YEARS AGO. ON AUGUST 28th,
YEARS AGO. ON AUGUST 28th, 1963.
ON AUGUST 28th, 1963. THE
ON AUGUST 28th, 1963. THE MARCH
ON AUGUST 28th, 1963. THE MARCH WAS
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO PUSH
THE MARCH WAS ORKGANIZED TO PUSH FOR

参考


推荐阅读