Is there a Python library for converting numbers in text to words?

Problem description

I have a DataFrame with a column named Description. Here are my sample observations:

[Image: sample observations]

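The descriptions contain numbers. I need to convert those numbers to text, so that the output looks like this:

[Image: expected output]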

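In other words, I want to convert the numbers to words for my NLP pipeline. Is there a library that converts numbers to words? I have 50,000 observations.

Please advise.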

Tags: python-3.x, nlp

Solution


I'll preface this answer by saying that I suspect there is a more robust way to do this within Pandas itself.

That said, here is a solution using the num2words package:

import num2words
import random
import re
import pandas as pd

def randomSentence(wordList):
    """Uses wordList to create a sentence with random numbers strewn in."""
    words = [random.choice(wordList) for i in range(3)]
    for i in range(random.randint(1,4)):
        words.append(round(random.uniform(0, 10), 2))
    random.shuffle(words)
    return " ".join(str(i) for i in words)

def transInt(string):
    """checks if there is a '.' in the given number, and returns the translation."""
    if "." in string:
        return num2words.num2words(float(string))
    return num2words.num2words(int(string))

def replaceInt(string):
    """Replaces integers and decimals with their word form using transInt."""
    # Require digits after the decimal point so a trailing "." (as in "7.") stays punctuation.
    return re.sub(r"\d+(?:\.\d+)?", lambda x: transInt(x.group()), string)

# Lorem ipsum that is used as a wordlist to create sentences.
x = """Lorem ipsum dolor sit amet, consectetur adipiscing elit.
       Nam sit amet nunc sollicitudin, viverra dolor ut, feugiat tellus.
       Curabitur erat arcu, viverra vitae augue sed, maximus vestibulum ante."""
x = [i.strip(",.") for i in x.split()]

# Creating a list of random sentences, with numbers strewn in
sentences = [randomSentence(x) for i in range(2)]

# Creating a df with each of the sentences.
df = pd.DataFrame(sentences, columns=["Sentence"])

# Adds a new column 'Translated' to the dataframe with the numbers translated.
df["Translated"] = df.Sentence.apply(replaceInt)
for _, data in df.iterrows():
    print(f'Original: {data.Sentence}')
    print(f'Translated: {data.Translated}')
    print("-"*20)

Because you did not provide an easy copy/paste version of your DataFrame, I wrote a function that returns random sentences to work with.

Sample output from the random-sentence script:

Original: arcu 7.48 ut 1.53 8.72 sit 7.13
Translated: arcu seven point four eight ut one point five three eight point seven two sit seven point one three
--------------------
Original: elit 3.55 amet 7.88 tellus
Translated: elit three point five five amet seven point eight eight tellus
--------------------
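The substitution step can also be looked at on its own. The sketch below is an illustration rather than part of the original answer: `spell` is a hypothetical digit-by-digit stand-in for num2words so that the snippet runs with the standard library only, and the pattern treats a decimal point as part of a number only when digits follow it.

```python
import re

# A decimal point only counts as part of the number when digits follow it.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def spell(match):
    """Hypothetical stand-in for num2words: reads a number digit by digit."""
    return " ".join(
        "point" if ch == "." else DIGITS[int(ch)] for ch in match.group()
    )

def replace_numbers(text):
    """Replace every number in text with the result of spell()."""
    return NUMBER.sub(spell, text)

print(replace_numbers("elit 3.55 amet 7."))
# elit three point five five amet seven.
```

Note that the sentence-final period after "7" is left alone instead of being read as a decimal point.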
