Unable to replace digits with spaces during text preprocessing

Problem description

I am trying to preprocess text as part of an NLP task. I am new to this, and I don't understand why I can't replace the digits.

para = ("support leaders around the world who do not speak for the big "
        "polluters, but who speak for all of humanity, for the indigenous people of "
        "the world, for the first 100 people.In 90's it seems true.")

import re
import nltk

sentences = nltk.sent_tokenize(para)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [re.sub(r'\d','',words)]
    sentences[i] = ' '.join(words)

When I run this, I get the following error:


TypeError                                 Traceback (most recent call last)
<ipython-input-28-000671b45ee1> in <module>()
       2 for i in range(len(sentences)):
       3     words = nltk.word_tokenize(sentences[i])
 ----> 4     words = [re.sub(r'\d','',words)].encode('utf8')
       5     sentences[i] = ' '.join(words)

~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
  189     a callable, it's passed the match object and must return
  190     a replacement string to be used."""
  --> 191     return _compile(pattern, flags).sub(repl, string, count)
  192 
  193  def subn(pattern, repl, string, count=0, flags=0):

  TypeError: expected string or bytes-like object

How do I convert this to a bytes-like object? I'm confused because I'm new to this.

Tags: python, nlp

Solution


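The error does not mean you need to convert anything to bytes. re.sub expects a single string (or bytes-like object) as its third argument, but words is the list of tokens returned by nltk.word_tokenize, so the call raises a TypeError. To replace all digits, use the re module to match and replace the regex pattern in each token individually. From your last example: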

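import re

# re.sub works on one string at a time, so apply it to each token in turn.
# `tokenized` is the list of tokens, e.g. the result of nltk.word_tokenize(sentence).
processed_words = [re.sub(r'\d', ' ', word) for word in tokenized]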
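
To make the per-token substitution easier to try out, here is a minimal, self-contained sketch. The helper name replace_digits and the hand-written token list are assumptions made for illustration; in the question the tokens would come from nltk.word_tokenize(sentence), which is left out here so the snippet runs without NLTK or its data files.

```python
import re

# Precompiled pattern that matches a single decimal digit.
DIGIT = re.compile(r"\d")


def replace_digits(tokens, repl=" "):
    """Return a new list in which every digit of each token is replaced by `repl`.

    Hypothetical helper for illustration; it is not part of re or NLTK.
    """
    return [DIGIT.sub(repl, token) for token in tokens]


# Tokens written out by hand (an assumption); normally they would be the
# output of nltk.word_tokenize(sentence).
tokens = ["for", "the", "first", "100", "people", "."]

cleaned = replace_digits(tokens)
print(cleaned)            # each digit of "100" becomes one space
print(" ".join(cleaned))  # rebuild the sentence from the cleaned tokens
```

Passing repl="" instead removes the digits entirely, which is what the original loop in the question attempted with re.sub(r'\d', '', ...).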
