How to add punctuation to data in Python

Problem description

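I'm writing code for Tacotron 2 that fetches a transcript from YouTube and formats it into a file. Unfortunately, the data it receives from YouTube doesn't indicate where sentences end. I tried adding a period at the end of every line, but most lines aren't complete sentences. How can I make it add a period only at the end of a sentence? The only other data it receives is timestamps.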

# Batch file for Tacotron 2

from youtube_transcript_api import YouTubeTranscriptApi
transcript_txt = YouTubeTranscriptApi.get_transcript('DY0ekRZKtm4')


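def write_transcript():
    # 'a+' opens the file for appending and reading; seek(0) lets us
    # check whether it already has content.
    with open('transcript.txt', 'a+') as transcript_object:
        transcript_object.seek(0)
        subtitles = transcript_object.read(100)
        if len(subtitles) > 0:
            transcript_object.write('\n')
        for entry in transcript_txt:
            line = entry['text']
            # Appends a period to every caption line, complete sentence or not.
            if not line.endswith('.'):
                line += '.'
            print(line)
            transcript_object.write(line + '\n')
    # No explicit close() is needed: the with block closes the file.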


write_transcript()

Here is an example:

 What it saves:
    sometimes it was possible to completely.
    fall.
    out of the world if the lag was bad.
    enough.
 What I want:
    sometimes it was possible to completely
    fall
    out of the world if the lag was bad
    enough.

Tags: python, python-3.x

Solution


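There is no easy solution. The lowest-effort approach I can think of is to run spaCy's nlp pipeline over the whole transcript and hope for the best. It wasn't trained on unpunctuated data, so don't expect perfect results, but it will detect some sentence boundaries (mostly based on syntax).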

import spacy

# The model must be installed first: python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')

text = """sometimes it was possible to completely
    fall
    out of the world if the lag was bad
    enough
    we solved that by
    adding more test data"""

doc = nlp(text)

for s in doc.sents:
    print(f"'{s}'")

Output:

'sometimes it was possible to completely
    fall
    out of the world if the lag was bad
    enough
    '
'we solved that by
    adding more test data'

So in this case it worked. Once you have the sentences, you can do some extra processing on them, such as adding the punctuation yourself.
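That extra processing could look roughly like the sketch below. It is not part of the original answer: `add_periods` is a hypothetical helper name, and it assumes each detected sentence arrives as a plain string that still contains the caption line breaks (for example `s.text` for each `s` in `doc.sents`). It keeps the caption lines as they are and puts a period only on the last line of each sentence, which matches the "What I want" layout from the question.

```python
def add_periods(sentences):
    """Append a period to the last caption line of each sentence.

    `sentences` is any iterable of strings (hypothetical input, e.g.
    the text of each sentence found by a sentence splitter).
    Returns a flat list of caption lines.
    """
    out_lines = []
    for sentence in sentences:
        # Split the sentence back into caption lines, dropping blank ones.
        lines = [ln.strip() for ln in sentence.splitlines() if ln.strip()]
        if not lines:
            continue
        # Only the final line of a sentence gets terminal punctuation.
        if not lines[-1].endswith(('.', '!', '?')):
            lines[-1] += '.'
        out_lines.extend(lines)
    return out_lines


# The two sentences from the output above, written out as plain strings.
sentences = [
    "sometimes it was possible to completely\n    fall\n"
    "    out of the world if the lag was bad\n    enough\n    ",
    "we solved that by\n    adding more test data",
]

result = add_periods(sentences)
print("\n".join(result))
```

With spaCy, the input could be `(s.text for s in doc.sents)` instead of the hard-coded list. The result is only as good as the sentence boundaries it is given, so a wrong split from the model ends up as a misplaced period.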

