Moving string data into a new column when the number of values is arbitrary

Problem description

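I am extracting proper nouns from a column that contains string data. I want to move the extracted nouns into a new column as a list (or, alternatively, as one noun per additional column). Each entry I extract from has an arbitrary (and sometimes large) number of nouns.

I have done the extraction and collected the values I care about in a list, but I don't know how to add them as a column next to the rows they came from: the extracted list has a different length from the data, and each group of nouns needs to correspond to a single row.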

    import nltk

    data = []
    norm_data['words'] = []
    for sent in norm_data['gtd_summary']:
        sentences = nltk.sent_tokenize(sent)  # computed but never used below
        # data is never reset, so tags from earlier rows are scanned again
        data = data + nltk.pos_tag(nltk.word_tokenize(sent))
        for word in data:
            if 'NNP' in word[1]:  # NNP / NNPS are the proper-noun tags
                nouns = list(word)[0]
                norm_data['words'].append(nouns)

Current data

X   Y
1   Joe Montana walks over to the yard
2   Steve Smith joins the Navy
3   Anne Johnson wants to go to a club
4   Billy is interested in Sally

What I want

X   Y                                       Z
1   Joe Montana walks over to the yard      [Joe, Montana]
2   Steve Smith joins the Navy              [Steve, Smith, Navy]
3   Anne Johnson wants to go to a club      [Anne, Johnson]
4   Billy is interested in Sally            [Billy, Sally]

Or this would also work

    X   Y                                       Z      L            M
    1   Joe Montana walks over to the yard      Joe    Montana      NA
    2   Steve Smith joins the Navy              Steve  Smith        Navy
    3   Anne Johnson wants to go to a club      Anne   Johnson      NA
    4   Billy is interested in Sally            Billy  Sally        NA

Tags: python, pandas, data-structures, nlp

Solution

You can build a Series that contains the lists, then add column Z to the DataFrame after the loop (I assume your data is in a DataFrame?).

    # Init before the loop
    noun_series = pd.Series(dtype=object)
    index = 0
        ...
        # Build up series
        nouns = list(word)[0]
        noun_series.at[index] = nouns
        index += 1
        ...
    # After the loop - add the Z column
    df['Z'] = noun_series

However, you need to set the index correctly so that it matches the right rows.
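As a minimal sketch of that index alignment, the snippet below rebuilds the sample data with a deliberately non-default index and creates the Series with `index=df.index`. The `extract_proper_nouns` helper is a hypothetical stand-in (it only keeps capitalized tokens) so the example runs without NLTK data; in real code it would be replaced by the `nltk.pos_tag` step from the question.

```python
import pandas as pd

# Sample data from the question, with a non-default index on purpose
df = pd.DataFrame(
    {
        "X": [1, 2, 3, 4],
        "Y": [
            "Joe Montana walks over to the yard",
            "Steve Smith joins the Navy",
            "Anne Johnson wants to go to a club",
            "Billy is interested in Sally",
        ],
    },
    index=[10, 20, 30, 40],
)


def extract_proper_nouns(text):
    # Hypothetical stand-in for the NLTK tagging step: keep capitalized tokens
    return [tok for tok in text.split() if tok[:1].isupper()]


# One list per row, keyed by the DataFrame's own index so the rows line up
noun_series = pd.Series(
    [extract_proper_nouns(text) for text in df["Y"]],
    index=df.index,
)
df["Z"] = noun_series
```

`df["Z"] = df["Y"].apply(extract_proper_nouns)` should give the same column, since `apply` returns a Series that already carries the index of `df`.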

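For the wide layout from the question (one noun per column), one possible sketch is to expand a list column into its own DataFrame and join it back on the index. This assumes a list column `Z` already exists; the `noun_1`, `noun_2`, ... column names are made up for illustration and are not from the original post.

```python
import pandas as pd

# Assumed starting point: a list column Z with a varying number of nouns per row
df = pd.DataFrame(
    {
        "X": [1, 2, 3, 4],
        "Z": [
            ["Joe", "Montana"],
            ["Steve", "Smith", "Navy"],
            ["Anne", "Johnson"],
            ["Billy", "Sally"],
        ],
    },
    index=[10, 20, 30, 40],
)

# Each list becomes one row of a new frame; shorter lists are padded with missing values
wide = pd.DataFrame(df["Z"].tolist(), index=df.index)
wide.columns = [f"noun_{i + 1}" for i in range(wide.shape[1])]

# join aligns on the index, so every noun stays with its source row
result = df.join(wide)
```

The number of new columns equals the length of the longest list, so a row with many nouns makes the frame wide for every row; keeping the single list column may be easier to work with in that case.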
