How to efficiently append a list to a list?

Problem description

I am trying to append filtered_sentence to the list wiki_train_lst. I found that the step removing stop_words is fast, but removing common_name is slow (probably because common_name contains a large number of words). How can I quickly filter out both stop_words and common_name? Also, the total content to append to wiki_train_lst is about 416,000 items, which makes the appending process very slow: how can I optimize it?

from nltk.tokenize import RegexpTokenizer

wiki_train_lst = []

for text in wiki_train_df.original_text:

    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)

    #print(tokens)

    filtered_sentence = [w.lower() for w in tokens if w.lower() not in stop_words] #remove stop words

    #filtered_sentence = [w for w in filtered_sentence if w not in common_surname_lst and w not in common_name_lst]

    filtered_sentence = [w for w in filtered_sentence if w not in common_name_lst] #remove common names

    filtered_sentence = [w for w in filtered_sentence if w.isalpha()] #remove non-alphabetic words
    
    wiki_train_lst.append(filtered_sentence)

    #print(filtered_sentence)

wiki_train_lst

Tags: python-3.x, list, dataframe, append, nltk

Solution


One way to make this faster is to merge all the list comprehensions into a single pass:

def my_filter(w):
    w_lower = w.lower()
    if w_lower in stop_words:
        return False
    if w_lower in common_surname_lst:
        return False
    if w_lower in common_name_lst:
        return False
    if not w.isalpha():
        return False
    return True

filtered_sentence = [w.lower() for w in tokens if my_filter(w)]

wiki_train_lst.append(filtered_sentence)
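Put together with the question's setup (wiki_train_df, stop_words, and the two name lists are assumed to exist as above), the whole loop might look like this sketch, which lowercases each token once up front so my_filter only ever sees already-lowercased words:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # build the tokenizer once, outside the loop

wiki_train_lst = []
for text in wiki_train_df.original_text:
    # Lowercase every token once; my_filter then operates on lowercased words.
    tokens = [w.lower() for w in tokenizer.tokenize(text)]
    wiki_train_lst.append([w for w in tokens if my_filter(w)])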

Note that this adds the overhead of a function call per word; you can avoid it by rewriting the filter as a chain of and conditions:

filtered_sentence = [w.lower() for w in tokens if w.lower() not in stop_words
                                                  and w.lower() not in common_surname_lst
                                                  and w.lower() not in common_name_lst
                                                  and w.isalpha()]

Now we have a bunch of w.lower() calls; let's do something about that. We can use a generator expression, which is like a list comprehension but lazy:

filtered_sentence = [w for w in (w.lower() for w in tokens) if w not in stop_words
                                                                and w not in common_surname_lst
                                                                and w not in common_name_lst
                                                                and w.isalpha()]

Even better may be to use filter:

filtered_sentence = filter(my_filter, tokens)
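One caveat worth adding to the answer: in Python 3, filter returns a lazy iterator, so if wiki_train_lst needs to hold actual lists, materialize the result first. A minimal sketch:

# filter(function, iterable) keeps the items for which my_filter returns True.
# It is lazy in Python 3, so wrap it in list() before appending.
# Note: unlike the comprehensions above, this keeps the original casing
# unless the tokens were lowercased beforehand.
filtered_sentence = list(filter(my_filter, tokens))
wiki_train_lst.append(filtered_sentence)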

To speed up the common_name lookups, first convert the lists to sets before applying any of the approaches above:

common_name_lst = set(common_name_lst)
common_surname_lst = set(common_surname_lst)  # same trick for the surname list

The rest of the code can stay the same, unless you want to rename the variables so the types are clearer.
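To get a feel for why the set conversion matters, here is a rough, self-contained micro-benchmark (the data is hypothetical and the timings depend on the machine):

import timeit

# Hypothetical data: 50,000 distinct name strings.
names_list = [f"name{i}" for i in range(50_000)]
names_set = set(names_list)

# Look up a word that is absent (the worst case for the list version).
print(timeit.timeit("'zzz' in names_list", globals=globals(), number=1_000))  # O(n) scan per lookup
print(timeit.timeit("'zzz' in names_set", globals=globals(), number=1_000))   # O(1) hash lookup on average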

Ultimately, if you need raw performance, CPython is usually a suboptimal choice. There are ways to make it faster (see PyPy and Cython), but you may be better off rewriting your code in a language that is easier to optimize.

