Replace substrings with substrings from another column in pandas

Question

I have a dataframe that contains template strings and, for each one, a string of variables to substitute into it. For example, given:

template,variable
"{color} shirt in {size}", "blue,medium"
"{capacity} bottle in {color}", "24oz,teal"
"{megapixel}mp camera", "24.1"

I would like to produce the following:

"blue shirt in medium"
"24oz bottle in teal"
"24.1mp camera"

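The number of template placeholders in the first column is guaranteed to equal the number of variables in the string in the second column. The string formats are consistent with the example above.

My first idea was to use extractall and then join to create a multi-index dataframe:

# Raw strings avoid invalid-escape warnings; named groups give the two
# frames distinct column names so that join() does not fail on overlap.
templates = df['template'].str.extractall(r'(?P<field>\{\w+\})')
# [^,]+ keeps "24.1" whole, whereas \w+ would split it into "24" and "1"
variables = df['variable'].str.extractall(r'(?P<value>[^,]+)')
multi_df = templates.join(variables, how='inner')

But I don't know where to go from there. Or is there a simpler way?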

Tags: python, regex, pandas

Solution


Use string.Formatter to extract the variable names from the template column, then build a dictionary for the substitution.
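To see what the parser yields, here is a minimal standard-library sketch. Formatter().parse returns tuples of (literal_text, field_name, format_spec, conversion), and field_name is None for trailing literal text, which is why the code in this answer filters on it. The sample template is taken from the question.

```python
from string import Formatter

template = "{color} shirt in {size}"

# Each parsed item is (literal_text, field_name, format_spec, conversion).
parsed = list(Formatter().parse(template))

# Keep only the replacement-field names, in order of appearance.
field_names = [name for _, name, _, _ in parsed if name is not None]
print(field_names)
```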

>>> df
                       template        value  # I modified your column name
0       {color} shirt in {size}  blue,medium
1  {capacity} bottle in {color}    24oz,teal
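2          {megapixel}mp camera         24.1

from string import Formatter

def extract_vars(s):
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so skip those entries.
    return tuple(fn for _, fn, _, _ in Formatter().parse(s) if fn is not None)

# Field names of each template, in order of appearance
df['variable'] = df['template'].apply(extract_vars)
# Split the comma-separated values into a list
df['value'] = df['value'].str.split(',')
# Pair names with values by position
df['combined'] = df.apply(lambda x: dict(zip(x['variable'], x['value'])), axis=1)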

At this point, your dataframe looks like this:

                       template           value           variable                               combined
0       {color} shirt in {size}  [blue, medium]      (color, size)    {'color': 'blue', 'size': 'medium'}
1  {capacity} bottle in {color}    [24oz, teal]  (capacity, color)  {'capacity': '24oz', 'color': 'teal'}
2          {megapixel}mp camera          [24.1]       (megapixel,)                  {'megapixel': '24.1'}

Finally, format each template with its dictionary:

>>> df.apply(lambda x: x['template'].format(**x['combined']), axis=1)
0    blue shirt in medium
1     24oz bottle in teal
2           24.1mp camera
dtype: object
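The same steps can also be folded into one helper that works on a single row, which avoids the intermediate columns. This is only a sketch: the name render and the explicit count check are illustrative additions, not part of the answer, and it assumes each placeholder name appears at most once per template.

```python
from string import Formatter

def render(template, values):
    """Fill the named fields in `template` with comma-separated `values`, by position."""
    # Field names in order of appearance; None marks trailing literal text.
    names = [name for _, name, _, _ in Formatter().parse(template) if name is not None]
    parts = [v.strip() for v in values.split(',')]
    # The question guarantees equal counts; fail loudly if that ever breaks.
    if len(names) != len(parts):
        raise ValueError(f"{len(names)} fields but {len(parts)} values in {template!r}")
    return template.format(**dict(zip(names, parts)))

rows = [
    ("{color} shirt in {size}", "blue,medium"),
    ("{capacity} bottle in {color}", "24oz,teal"),
    ("{megapixel}mp camera", "24.1"),
]
results = [render(t, v) for t, v in rows]
print(results)
```

Assuming the value column still holds the raw comma-separated string (that is, before the str.split step), this could be applied row-wise with df.apply(lambda r: render(r['template'], r['value']), axis=1).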
