Find the word after a keyword and create a new column

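Problem description

Goal:

1) Locate the word next to a keyword (e.g. brca)

2) Create a new column from that word

Background:

1) I have a list l, from which I built a DataFrame df and extracted the word brca using the following code: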

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])
# Capture the literal keyword; rows without it get NaN
df['Gene'] = df['Text'].str.extract(r"(brca)")

Output:

                                                Text    Gene
0   breast invasive lobular carcinoma brca positiv...   brca
1   clinical history brca gene mutation . gross de...   brca
2   left breast invasive ductal carcinoma brca pos...   brca

Problem:

However, I am now trying to find the word that follows brca in each row and put it in a new column.

Desired output:

                                                Text    Gene  NextWord
0   breast invasive lobular carcinoma brca positiv...   brca  positive
1   clinical history brca gene mutation . gross de...   brca  gene
2   left breast invasive ductal carcinoma brca pos...   brca  positive

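I have looked at "python pandas dataframe words in context: get 3 words before and after" and "PANDAS finding exact word and before word in a column of string and append that new column in python (pandas) column", but they did not quite work for me.

Question:

How do I achieve my goal?

Tags: regex, pandas, text, nlp, keyword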

Solution


Use:

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])

# Match "brca" as a whole word, skip the whitespace after it,
# and capture only the word that follows (no leading space).
df['NextWord'] = df['Text'].str.extract(r"\bbrca\s+(\w+)", expand=False)
print(df)

Output:

                                            Text  NextWord
0  carcinoma brca positive completion mastectomy  positive
1                    clinical brca gene mutation      gene
2           carcinoma brca positive chemotherapy  positive
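
The same approach can be wrapped in a small helper so that it works for any keyword. The following is only a sketch: the function name add_next_word and its parameters are hypothetical and not part of pandas. It assumes the keyword is a plain word and that "next word" means the following run of \w characters. Rows that lack the keyword, or where the keyword is the last word, end up as missing values.

```python
import re

import pandas as pd


def add_next_word(df, keyword, text_col="Text", new_col="NextWord"):
    """Return a copy of df with the word following `keyword` in a new column.

    Hypothetical helper for illustration only.
    """
    # re.escape guards against keywords that contain regex metacharacters;
    # \b stops "brca" from matching inside a longer token.
    pattern = rf"\b{re.escape(keyword)}\s+(\w+)"
    out = df.copy()
    # expand=False returns a Series; rows without a match become missing values.
    out[new_col] = out[text_col].str.extract(pattern, expand=False)
    return out


texts = ['carcinoma brca positive completion mastectomy',
         'clinical brca gene mutation',
         'carcinoma brca positive chemotherapy',
         'no keyword in this row',
         'ends with brca']
result = add_next_word(pd.DataFrame(texts, columns=['Text']), 'brca')
print(result)
```

The first three rows yield positive, gene and positive. The last two rows have no word after the keyword, so NextWord is missing there and can be filtered with result['NextWord'].notna().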
