Text manipulation in a data frame: extracting words

Problem description

I want to check the word next to each number. For example, my data frame has this column, Recipes:

Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.
Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.
2 heaped teaspoons Chinese five-spice 
100 ml Marsala
1 litre organic chicken stock

I want to extract those number-word pairs into a new column:

New Column
[1 hour, 20 minutes]
15 minutes
2 heaped
100 ml
1 litre

because I need to compare them against a list of values:

to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

and see how many elements each row has in common with it. Thanks for your help.

Tags: python, regex, pandas

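Solution


We use Series.str.extractall with the pattern digits, one whitespace character, word characters. We then check which of the extracted matches appear in to_compare using isin, and finally use GroupBy.sum to count the matches per row: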

matches = df['Col'].str.extractall(r'(\d+\s\w+)')
df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()

                                                 Col  matches
0  Halve the clementine and place into the cavity...      2.0
1  Add the stock, then bring to the boil and redu...      1.0
2              2 heaped teaspoons Chinese five-spice      0.0
3                                     100 ml Marsala      1.0
4                      1 litre organic chicken stock      0.0
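The two lines above can be tried end to end with a minimal sketch like the following. The frame is rebuilt from the rows in the question; the last row ("Season to taste") is a hypothetical addition used only to show what happens when a row contains no number at all. The `reindex(..., fill_value=0)` step is likewise an addition that is not part of the original answer: without it, such a row would get NaN rather than 0.

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "Halve the clementine and place into the cavity along with the bay leaves. "
    "Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
    "Season to taste",  # hypothetical row without any number
]})
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# One row per regex match; index level 0 is the original row label.
matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Count, per original row, how many extracted strings are in to_compare.
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Rows with no match are absent from counts, so fill them with 0.
df["matches"] = counts.reindex(df.index, fill_value=0).astype(int)

print(df["matches"].tolist())  # [2, 1, 0, 1, 0, 0]
```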

For reference, matches itself looks like this:

                  0
  match            
0 0          1 hour
  1      20 minutes
1 0      15 minutes
2 0        2 heaped
3 0          100 ml
4 0         1 litre
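To see what the pattern does and does not capture, it can help to run it on plain strings first. The sketch below uses the standard `re` module rather than pandas (a swap made here only for illustration) and shortened versions of the sample texts. It also checks which entries of to_compare could ever be produced by the pattern: the bare "2" cannot, because the pattern always requires a word after the number.

```python
import re

# Same pattern as in the answer, without the capture group.
PATTERN = re.compile(r"\d+\s\w+")

texts = [
    "roast for around 1 hour 20 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
]
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# All non-overlapping number-word pairs in each text.
found = [PATTERN.findall(text) for text in texts]
print(found)  # [['1 hour', '20 minutes'], ['2 heaped'], ['100 ml']]

# Entries of to_compare that the pattern can never yield.
unreachable = [item for item in to_compare if PATTERN.fullmatch(item) is None]
print(unreachable)  # ['2']
```

If a bare number such as "2" should also count as a match, the pattern or the comparison list would need to be adjusted.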

To collect these into one list per row, use:

matches.groupby(level=0).agg(list)

                      0
0  [1 hour, 20 minutes]
1          [15 minutes]
2            [2 heaped]
3              [100 ml]
4             [1 litre]
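One possible way to attach these lists to the original frame as the requested new column, and to count the common elements directly from the lists, is sketched below. The column names "New Column" and "common" and the row "Season to taste" are assumptions made for illustration. Note that the set intersection counts distinct values, so a row that repeats the same pair twice would count it once, unlike the isin and sum approach.

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "roast for around 1 hour 20 minutes.",
    "100 ml Marsala",
    "2 heaped teaspoons Chinese five-spice",
    "Season to taste",  # hypothetical row without any number
]})
to_compare = {"1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"}

matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Series of lists indexed by original row label; rows without matches become NaN.
df["New Column"] = matches[0].groupby(level=0).agg(list)

def count_common(items):
    """Number of distinct extracted strings that also appear in to_compare."""
    if not isinstance(items, list):  # NaN for rows with no match
        return 0
    return len(to_compare.intersection(items))

df["common"] = df["New Column"].apply(count_common)
print(df["common"].tolist())  # [2, 1, 0, 0]
```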
