Tagging custom NER entities in a pandas DataFrame

Question

I have a DataFrame with 3 columns, 'text', 'in', and 'tar', of types str, list, and list respectively:

                   text                                       in       tar
0  This is an example text that I use in order to  ...       [2]       [6]
1  Discussion: We are examining the possibility of ...       [3]     [6, 7]

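'in' and 'tar' denote the specific entities I want to tag in the text. Each list holds the word positions in the text at which an entity term was found.

For example, in the second row of the DataFrame, where in = [3], I want to take the third word of the text column (i.e. "are") and tag it as <IN>are</IN>.

Likewise, for the same row, since tar = [6, 7], I also want to take the 6th and 7th words of the text column (i.e. "possibility" and "of") and tag them as <TAR>possibility</TAR>, <TAR>of</TAR>.

Can someone help me with how to do this?

Tags: python, pandas, nlp, spacy, named-entity-recognition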

Answer

This is not the most optimized implementation, but it should give you the idea.

import pandas as pd

data = {'text': ['This is an example text that I use in order to',
                 'Discussion: We are examining the possibility of the'],
        'in': [[2], [3]],
        'tar': [[6], [6, 7]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]  # label columns: every column after 'text'
new_text = []
for _, row in df.iterrows():
    temp = row['text'].split()  # whitespace-separated words
    for pos, word in enumerate(temp):  # pos is 0-based
        for col in cols:
            if pos in row[col]:
                # wrap the word in a tag named after the column, e.g. <IN>...</IN>
                temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))
df['text'] = new_text
print(df.text.to_list())

Output:

['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to', 
 'Discussion: We are <IN>examining</IN> the possibility <TAR>of</TAR> <TAR>the</TAR>']
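The question counts words from 1 ("are" is the third word), while the loop above compares positions against `enumerate` indices, which start at 0, so every tag lands one word later than the question asks for. A minimal sketch of a 1-based variant follows; the helper name `tag_words` and its `one_based` flag are made up for illustration, and it assumes words are separated by whitespace and that no position appears under two labels.

```python
def tag_words(text, labelled_positions, one_based=True):
    """Wrap the words at the given positions in <LABEL>...</LABEL> tags.

    labelled_positions maps a label (e.g. 'in') to a list of word positions.
    Hypothetical helper, not part of pandas or spaCy.
    """
    words = text.split()
    offset = 1 if one_based else 0
    for label, positions in labelled_positions.items():
        tag = label.upper()
        for pos in positions:
            idx = pos - offset
            if 0 <= idx < len(words):  # out-of-range positions are skipped
                words[idx] = f"<{tag}>{words[idx]}</{tag}>"
    return " ".join(words)


text = "Discussion: We are examining the possibility of the"
tagged = tag_words(text, {"in": [3], "tar": [6, 7]})
print(tagged)
# Discussion: We <IN>are</IN> examining the <TAR>possibility</TAR> <TAR>of</TAR> the
```

On a DataFrame this could be applied row by row, for example with `df.apply(lambda r: tag_words(r['text'], {'in': r['in'], 'tar': r['tar']}), axis=1)`.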

Update 1

Consecutive tags of the same type can be merged as follows:

import pandas as pd

data = {'text': ['This is an example text that I use in order to',
                 'Discussion: We are examining the possibility of the'],
        'in': [[2], [3, 4, 5]],
        'tar': [[6], [6, 7]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]  # label columns: every column after 'text'
new_text = []
for _, row in df.iterrows():
    temp = row['text'].split()  # whitespace-separated words
    for pos, word in enumerate(temp):  # pos is 0-based
        for col in cols:
            if pos in row[col]:
                temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))

df['text'] = new_text
for col in cols:
    tag = col.upper()
    # a closing tag directly followed by the same opening tag marks two adjacent
    # entities of one type; replacing the pair with a space merges them
    df['text'] = df['text'].str.replace(f'</{tag}> <{tag}>', ' ', regex=False)
print(df.text.to_list())

Output:

['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to', 'Discussion: We are <IN>examining the possibility</IN> <TAR>of the</TAR>']
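An alternative to merging with string replacement is to group consecutive positions into runs first and emit one tag pair per run. The sketch below does that with 0-based positions, matching the answer's code; `consecutive_runs` and `tag_runs` are hypothetical helper names, and it assumes that runs belonging to different labels do not overlap.

```python
def consecutive_runs(positions):
    """Group integer positions into (start, end) runs of consecutive values."""
    runs = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos  # extend the current run
        else:
            runs.append([pos, pos])  # start a new run
    return [tuple(run) for run in runs]


def tag_runs(text, labelled_positions):
    """Tag each run of consecutive 0-based word positions with one tag pair."""
    words = text.split()
    for label, positions in labelled_positions.items():
        tag = label.upper()
        for start, end in consecutive_runs(positions):
            if 0 <= start and end < len(words):  # runs outside the text are skipped
                words[start] = f"<{tag}>{words[start]}"
                words[end] = f"{words[end]}</{tag}>"
    return " ".join(words)


text = "Discussion: We are examining the possibility of the"
merged = tag_runs(text, {"in": [3, 4, 5], "tar": [6, 7]})
print(merged)
# Discussion: We are <IN>examining the possibility</IN> <TAR>of the</TAR>
```

Building the runs up front means the text is never scanned for tag strings, so a literal "</IN> <IN>" that happens to occur in the input is left alone.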
