How to optimize the runtime of text feature extraction for an NLP project

Problem description

I am working on an NLP project in Google Colab (Python) with a text dataset of about 100,000 instances. For each instance I extract roughly 5-10 features, and every attempt to run the code takes about 5-10 minutes. Since I am experimenting with different kinds of features, I run the feature-extraction process many times, and after a while the total runtime really adds up.

I suspect this is because my code is inefficient: it currently relies on list comprehensions, map, and plain iteration. The code also uses a lot of memory, both because of the size of the data and because of how it stores multiple copies of the text.

So I would like to know whether there is a better way to do the feature extraction that speeds up the process (and saves space). I have heard that numpy has vectorized operations, but I do not know how to apply them here.

Here is a skeleton version of my code:

import nltk
# nltk.download('punkt')  # word_tokenize needs the punkt tokenizer data (one-time download)
import numpy as np
import pandas as pd

df = pd.DataFrame([["The quick brown fox jumps over the lazy dog.",
                    "Energy is sustainable if it meets the needs of the present without compromising the ability of future generations to meet their needs."],
                   ["The scientific literature on limiting global warming describes pathways in which the world rapidly phases out coal-fired power plants, produces more electricity from clean sources such as wind and solar, shifts towards using electricity instead of fuels in sectors such as transport and heating buildings, and takes measures to conserve energy.",
                    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s"]], columns=['text1', 'text2'])


def process(text):
    tokens = nltk.word_tokenize(text)

    # Other techniques like stemming and lemmatization

    return tokens

def get_features(text1, text2):
    features = []

    feature1 = len(text1) + len(text2)
    features.append(feature1)
    feature2 = len([word1 for word1 in text1 if word1 in text2])
    features.append(feature2)

    # Continued for about 5-10 features. Some features involve multiple steps like doing named entity recognition and creating features from there

    return features

df.loc[:, 'text1_tokens'] = df.loc[:, 'text1'].apply(process)
df.loc[:, 'text2_tokens'] = df.loc[:, 'text2'].apply(process)

features = df.apply(lambda x: get_features(x['text1_tokens'], x['text2_tokens']), axis='columns')

df.loc[:, 'feature1'] = list(map(lambda x: x[0], features))
df.loc[:, 'feature2'] = list(map(lambda x: x[1], features))

Tags: python, dataframe, numpy

Solution


feature2 = len([word1 for word1 in text1 if word1 in text2])

The runtime complexity of that line is words_in_text1 * words_in_text2, because every token of text1 is checked against text2 with a linear scan. Depending on the size of these texts, you can likely get a considerable speedup just by building a set from text2 first: membership tests on a set are O(1) on average.
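
A minimal sketch of that fix, assuming text1 and text2 are the token lists produced by process:

text2_set = set(text2)  # built once, O(len(text2))
# each membership test against the set is O(1) on average
feature2 = len([word for word in text1 if word in text2_set])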

You are also creating a list on that line only to throw it away immediately; summing over a generator expression avoids that. And if the order of the words never matters, using collections.Counter or a similar object may improve speed further.

For example:

from collections import Counter

text1_counts = Counter(text1)
text2_counts = Counter(text2)
# count the tokens of text1 (with multiplicity) that also occur in text2
feature2 = sum(count for word, count in text1_counts.items()
               if word in text2_counts)
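
If several features need these counts, the Counters can be computed once per row and stored next to the token columns instead of being rebuilt inside every feature function. A sketch, reusing the *_tokens columns from the question:

df['text1_counts'] = df['text1_tokens'].apply(Counter)
df['text2_counts'] = df['text2_tokens'].apply(Counter)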

If you have more features with similar problems, fixing them in the same way should speed up your feature extraction as well.
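
As for the vectorized operations mentioned in the question: simple per-row features can often be computed on the whole column at once rather than row by row. A sketch for feature1, assuming the token columns from the question (pandas' .str.len() also works on columns that hold lists):

df['feature1'] = df['text1_tokens'].str.len() + df['text2_tokens'].str.len()

Features that need real per-row Python logic (such as named entity recognition) will not vectorize this way, but the cheap ones can be pulled out of get_features.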

