Stop words are not being removed using Python

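Problem description

I am trying to remove stop words from a list of tokens that I have. However, the words do not seem to be removed. What could be the problem? Thanks.

What I tried: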

    from nltk.corpus import stopwords
    from stop_words import get_stop_words

    Trans = []
    with open('data.txt', 'r') as myfile:
        file = myfile.read()
        # rewind to the start of the file before reading it line by line
        myfile.seek(0)
        for row in myfile:
            split = row.split()
            Trans.append(split)
        myfile.close()


    stop_words = list(get_stop_words('en'))
    nltk_words = list(stopwords.words('english'))
    stop_words.extend(nltk_words)

    output = [w for w in Trans if not w in stop_words]


    Input: 

    [['Apparent',
      'magnitude',
      'is',
      'a',
      'measure',
      'of',
      'the',
      'brightness',
      'of',
      'a',
      'star',
      'or',
      'other']]

    output:

    It returns the same words as input.

Tags: python, nlp, stop-words

Solution


For readability, create a function. For example:

import string
from nltk.corpus import stopwords

def drop_stopwords(row):
    stop_words = set(stopwords.words('english'))  # the NLTK corpus is named 'english', not 'en'
    return [word for word in row if word not in stop_words and word not in set(string.punctuation)]
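To see why the comprehension in the question returns its input unchanged, here is a minimal sketch. It assumes a small hard-coded stop-word list (STOP_WORD_LIST, a hypothetical stand-in for the NLTK list) so that it runs without any downloads. It also lowercases each word before the lookup, which is an addition of mine, since stop-word lists are usually lowercase.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORD_LIST = ["is", "a", "of", "the", "or", "other"]
STOP_WORDS = set(STOP_WORD_LIST)
PUNCTUATION = set(string.punctuation)


def drop_stopwords(row):
    """Return the tokens of one row that are neither stop words nor punctuation."""
    return [
        word for word in row
        if word.lower() not in STOP_WORDS and word not in PUNCTUATION
    ]


# Same shape as the input in the question: a list holding one list of tokens.
trans = [["Apparent", "magnitude", "is", "a", "measure", "of", "the",
          "brightness", "of", "a", "star", "or", "other"]]

# The comprehension from the question compares each inner list (not each word)
# against the stop words, so no element ever matches and nothing is removed.
unchanged = [w for w in trans if w not in STOP_WORD_LIST]

# Filtering row by row compares individual words instead.
filtered = [drop_stopwords(row) for row in trans]
```

Here unchanged is still equal to trans, while filtered contains only the words that are not in the stand-in list.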

Also, with open() does not need a close() call. Create a list of strings (sentences) and apply the function. For example:

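# Assumes Trans is a pandas Series of sentence strings; str.split turns each sentence into tokens
Trans = Trans.map(str.split).apply(drop_stopwords)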
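If you would rather not use pandas, the same idea can be sketched in plain Python. The helper read_rows, the temporary sample file, and the short STOP_WORDS set are all assumptions made so the example runs on its own; they stand in for data.txt and the NLTK list.

```python
import os
import tempfile

# Hypothetical stand-in for the NLTK English stop-word list.
STOP_WORDS = {"is", "a", "of", "the", "or", "other"}


def read_rows(path):
    """Read a text file and return one list of whitespace-separated tokens per line."""
    with open(path, "r", encoding="utf-8") as handle:
        # Iterating the handle yields lines; the with block closes the file on exit,
        # so no explicit close() is needed.
        return [line.split() for line in handle]


def drop_stopwords(row):
    """Keep only the tokens of one row that are not stop words."""
    return [word for word in row if word.lower() not in STOP_WORDS]


# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Apparent magnitude is a measure\nof the brightness of a star\n")
    sample_path = tmp.name

rows = read_rows(sample_path)
cleaned = [drop_stopwords(row) for row in rows]
os.remove(sample_path)
```

Each line of the file becomes one row of tokens, and the filter is applied to each row separately.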

This is applied to every sentence. You can add other functions for lemmatization and so on. There is a very clear example (code) here: https://github.com/SamLevinSE/job_recommender_with_NLP/blob/master/job_recommender_data_mining_JOBS.ipynb
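One way to add further steps is to chain small functions that each take and return a list of tokens. This is only a sketch: run_pipeline, lowercase, and naive_singular are hypothetical names, and naive_singular is a toy placeholder, not a real lemmatizer (a real one, such as the NLTK WordNet lemmatizer, could take its place).

```python
# Hypothetical stand-in for the NLTK English stop-word list.
STOP_WORDS = {"is", "a", "of", "the", "or", "other"}


def drop_stopwords(row):
    """Keep only the tokens that are not stop words."""
    return [word for word in row if word.lower() not in STOP_WORDS]


def lowercase(row):
    """Lowercase every token."""
    return [word.lower() for word in row]


def naive_singular(row):
    """Toy placeholder for lemmatization: strip one trailing 's' from longer words."""
    return [
        word[:-1]
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss")
        else word
        for word in row
    ]


def run_pipeline(row, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        row = step(row)
    return row


tokens = ["The", "brightness", "of", "stars", "is", "measured"]
result = run_pipeline(tokens, [drop_stopwords, lowercase, naive_singular])
```

Keeping each step as its own function makes it easy to reorder the steps or swap one out.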

