Creating custom whitespace tokens based on the number of spaces, for NLP preprocessing

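Problem description

At the risk of being flagged as a duplicate (though I would be glad to be proven wrong if my Google searches missed something), I did some research of my own and found the following about handling whitespace:

Most of what I could find online is aimed at either (1) finding whitespace and replacing it with something static, or (2) quantifying the whitespace in a given string as a whole rather than run by run.

What is hard to find is how to slide along a string, stop when a run of whitespace is reached, and replace that part of the string with a value that depends on the length of the run.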
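The scan described above can be sketched with `re.finditer`, which walks the string and yields one match object per run of spaces. This is only an illustration: the helper name `find_space_runs` and its `min_len` parameter are assumptions made for this sketch, not part of the original post.

```python
import re

def find_space_runs(text: str, min_len: int = 2):
    """Return (start_index, length) for every run of at least min_len spaces."""
    # Builds a pattern such as ' {2,}' that matches min_len or more literal spaces.
    pattern = re.compile(r" {%d,}" % min_len)
    return [(m.start(), len(m.group())) for m in pattern.finditer(text)]

# Hypothetical sample: a run of 3 spaces, a single space, then a run of 2 spaces.
sample = "a" + " " * 3 + "b c" + " " * 2 + "d"
print(find_space_runs(sample))  # [(1, 3), (7, 2)]
```

Each tuple gives where a run starts and how long it is, which is the information a length-dependent replacement needs.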

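My question:

I am doing some NLP work, and my data often has discrete amounts of whitespace between values (and sometimes at the start of a line).

For example: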

field_header          field_value    field_code\n

.      Sometimes   there are gaps      at the beginning too.

The data also contains some standard text with single spaces between the words:

There are standard sentences which are embedded in the documents as well.\n

I want to replace every run of whitespace longer than a single space, so that my documents look like this:

field_header WS_10 field_value WS_4 field_code\n

. WS_6 Sometimes WS_3 there are gaps WS_6 at the beginning too.

There are standard sentences which are embedded in the documents as well.\n

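where WS_n is a token giving the number of spaces between two words (n >= 2), padded with a single space on each side.

I tried using a regular expression to find the whitespace and counting the spaces separately with .count(), but that obviously does not work. I know how to use re.sub, but I could not see how to make the replacement depend on what the regular expression matched.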

import re

s = 'Some part      of my     text file   with irregular     spacing.\n'
pattern = r' {2,}'

# ??? marks the missing piece: the text matched by the pattern
substitution = ' WS_'+str(???.count(' '))+' '

re.sub(pattern, substitution, s)

If the example above did what it is supposed to do, I would get back:

'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

Tags: python, regex, replace

Solution

Without regular expressions:

s1 = 'Some part      of my     text file   with irregular     spacing.\n'
s2 = '          Some part      of my     text file   with irregular     spacing.\n'

def fix_sentence(sentence: str) -> str:
    # True until the first word is seen. A leading run of n spaces yields n
    # empty strings from split(' '), while an inner run yields only n - 1.
    ws_1st_char = True
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                z = count if ws_1st_char else count + 1
                new_sentence += f'WS_{z} '
            new_sentence += f'{x} '
            count = 0
            ws_1st_char = False
        else:
            count += 1  # one more empty string, i.e. one more space in the run
    return new_sentence.rstrip(' ')

fixed1 = fix_sentence(s1)
fixed2 = fix_sentence(s2)

print(repr(fixed1))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

print(repr(fixed2))
# 'WS_10 Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

If the sentence never starts with whitespace:

def fix_sentence(sentence: str) -> str:
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                # an inner run of n spaces produces n - 1 empty strings
                new_sentence += f'WS_{count + 1} '
            new_sentence += f'{x} '
            count = 0
        else:
            count += 1
    return new_sentence.rstrip(' ')
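For comparison, a regex-based variant is also possible, because `re.sub` accepts a function as its replacement argument and calls it with each match object. The sketch below is not part of the answer above; the names `tag_whitespace` and `replace_run` are hypothetical, and it assumes that only literal space characters (not tabs or newlines) count as whitespace.

```python
import re

def tag_whitespace(text: str) -> str:
    """Replace each run of two or more spaces with ' WS_n ', n being the run length."""
    def replace_run(match: "re.Match[str]") -> str:
        # match.group() is the matched run of spaces, so its length is n
        return f" WS_{len(match.group())} "
    return re.sub(r" {2,}", replace_run, text)

s = 'Some part      of my     text file   with irregular     spacing.\n'
print(repr(tag_whitespace(s)))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'
```

Unlike `fix_sentence`, this version keeps the padding space in front of a token that replaces a leading run (for example `' WS_10 Some'`), so strip the result if that is not wanted.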
