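python - Use regex to search for patterns in lists in a DataFrame column and put the matches into new pandas columns
Problem description
I have a csv file with a text column. Sample data is shown below: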
text
['Hello world', 'Welcome to the universe.']
['Hey Hello world', 'I am learning Pandas Welcome to the universe.']
['Hello world how are you', 'Good Morning', 'I am learning Pandas.']
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']
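I want to iterate over each row and check for patterns.
If a sentence contains the pattern Hello or world, I want to put that sentence in a new column, text1.
If a sentence contains the pattern Welcome or Universe, I want to put that sentence in a new column, text2.
After searching for the patterns and putting the matches into new columns, my output should look like this: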
text,text1,text2
['Hello world', 'Welcome to the universe.'],Hello world,Welcome to the universe.
['Hey Hello world', 'I am learning Pandas Welcome to the universe.'],Hey Hello world,I am learning Pandas Welcome to the universe.
['Hello world how are you', 'Good Morning', 'I am learning Pandas.'],Hello world how are you,None
['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.'],None,Iam version 3.6 Welcome
Can anyone guide me on how to do this?
Solution
Starting from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'text': ["['Hello world', 'Welcome to the universe.']",
...     "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
...     "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
...     "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']"]},
...     index=[0, 1, 2, 3])
>>> df
text
0 ['Hello world', 'Welcome to the universe.']
1 ['Hey Hello world', 'I am learning Pandas Welc...
2 ['Hello world how are you', 'Good Morning', 'I...
3 ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...
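Since the question mentions a csv file, here is a minimal sketch of parsing the list strings while loading, under the assumption that each list cell is wrapped in double quotes in the file (it contains commas). The inline `csv_data` text and the name `df_csv` are hypothetical stand-ins for the real file. `ast.literal_eval` is used because it only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.

```python
import ast
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
# Each list cell is double-quoted because it contains commas.
csv_data = io.StringIO('''text
"['Hello world', 'Welcome to the universe.']"
"['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']"
''')

# The converter turns each cell of the 'text' column into a real Python list.
df_csv = pd.read_csv(csv_data, converters={'text': ast.literal_eval})

print(df_csv['text'].iloc[0])
```

With the column parsed at load time, the cells are already lists, so a later string-to-list conversion step would not be needed for such a frame.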
We can apply two functions, find_substring_text1 and find_substring_text2, to the text column, after using eval to turn each string into a list:
>>> def find_substring_text1(row):
...     return [s for s in row if any(k in s for k in ['Hello', 'world'])]
...
>>> def find_substring_text2(row):
...     return [s for s in row if any(k in s for k in ['Welcome', 'universe'])]
...
>>> df['text1'] = df['text'].apply(eval).apply(find_substring_text1)
>>> df['text2'] = df['text'].apply(eval).apply(find_substring_text2)
We then get the expected result:
>>> df
   text                                               text1                      text2
0  ['Hello world', 'Welcome to the universe.']        [Hello world]              [Welcome to the universe.]
1  ['Hey Hello world', 'I am learning Pandas Welc...  [Hey Hello world]          [I am learning Pandas Welcome to the universe.]
2  ['Hello world how are you', 'Good Morning', 'I...  [Hello world how are you]  []
3  ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...  []                         [Iam version 3.6 Welcome]
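The title asks about regular expressions, while the functions above use plain substring checks. As a hedged alternative, the same filtering can be sketched with compiled `re` patterns. The names `PATTERN_TEXT1`, `PATTERN_TEXT2` and `find_matches` are made up for this illustration, and `re.IGNORECASE` is an assumption added so that the question's "Universe" also matches "universe" in the data. The strings are parsed with `ast.literal_eval` rather than `eval`, since it only evaluates Python literals.

```python
import ast
import re

import pandas as pd

df = pd.DataFrame({'text': [
    "['Hello world', 'Welcome to the universe.']",
    "['Hey Hello world', 'I am learning Pandas Welcome to the universe.']",
    "['Hello world how are you', 'Good Morning', 'I am learning Pandas.']",
    "['Hi', 'Iam version 3.6 Welcome', 'Nice to meet you.']",
]})

# Hypothetical pattern names; IGNORECASE is an assumption, not part of the original answer.
PATTERN_TEXT1 = re.compile(r'Hello|world', re.IGNORECASE)
PATTERN_TEXT2 = re.compile(r'Welcome|universe', re.IGNORECASE)


def find_matches(sentences, pattern):
    """Return the sentences in which the compiled pattern occurs anywhere."""
    return [s for s in sentences if pattern.search(s)]


# Parse each string into a list once and reuse the result for both columns.
parsed = df['text'].apply(ast.literal_eval)

df['text1'] = parsed.apply(lambda sentences: find_matches(sentences, PATTERN_TEXT1))
df['text2'] = parsed.apply(lambda sentences: find_matches(sentences, PATTERN_TEXT2))

print(df[['text1', 'text2']])
```

Parsing once avoids evaluating every cell twice, and keeping the patterns as separate compiled objects means further columns only need another pattern and one more `apply` call.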
If needed, we can even convert the list format to a string:
>>> df['text1'] = [','.join(map(str, l)) for l in df['text1']]
>>> df['text2'] = [','.join(map(str, l)) for l in df['text2']]
>>> df
   text                                               text1                    text2
0  ['Hello world', 'Welcome to the universe.']        Hello world              Welcome to the universe.
1  ['Hey Hello world', 'I am learning Pandas Welc...  Hey Hello world          I am learning Pandas Welcome to the universe.
2  ['Hello world how are you', 'Good Morning', 'I...  Hello world how are you
3  ['Hi', 'Iam version 3.6 Welcome', 'Nice to mee...                           Iam version 3.6 Welcome
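The output in the question shows None where no sentence matched, whereas joining an empty list gives an empty string. A small sketch of one way to get a missing value instead is shown here. The helper name `join_or_none` is hypothetical, and the frame below simply recreates list-valued text1 and text2 columns so the snippet runs on its own.

```python
import pandas as pd

# Assumed starting point: list-valued columns, as produced before the join step.
df = pd.DataFrame({
    'text1': [['Hello world'], ['Hey Hello world'], ['Hello world how are you'], []],
    'text2': [['Welcome to the universe.'],
              ['I am learning Pandas Welcome to the universe.'],
              [],
              ['Iam version 3.6 Welcome']],
})


def join_or_none(matches, sep=','):
    """Join the matched sentences; return None when the list is empty."""
    return sep.join(matches) if matches else None


for col in ['text1', 'text2']:
    df[col] = df[col].apply(join_or_none)

# Rows without a match are now reported as missing.
print(df.isna())
```

pandas treats the returned None as a missing value, so `isna()`, `fillna()` and `dropna()` can be used on these columns afterwards.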