Creating a Cartesian product of strings by replacing substrings from lists

Question

I have a dictionary that maps placeholders to lists of their possible values, as shown below:

{
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

I want to create every possible combination of strings by replacing the placeholders (i.e. ~GPE~ and ~PERSON~) in a template:

"My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

The expected output is:

"My name is John Davies. I travel to UK with Tom Banton every year."
"My name is John Davies. I travel to UK with Joe Morgan every year."
"My name is John Davies. I travel to USA with Tom Banton every year."
"My name is John Davies. I travel to USA with Joe Morgan every year."
"My name is Tom Banton. I travel to UK with John Davies every year."
"My name is Tom Banton. I travel to UK with Joe Morgan every year."
"My name is Tom Banton. I travel to USA with John Davies every year."
"My name is Tom Banton. I travel to USA with Joe Morgan every year."
"My name is Joe Morgan. I travel to UK with Tom Banton every year."
"My name is Joe Morgan. I travel to UK with John Davies every year."
"My name is Joe Morgan. I travel to USA with Tom Banton every year."
"My name is Joe Morgan. I travel to USA with John Davies every year."

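Also note that the values belonging to a given dictionary key are never repeated within the same sentence. For example, I do not want: "My name is Joe Morgan. I travel to USA with Joe Morgan every year." (So it is not exactly a Cartesian product, but close enough.)

I am new to Python and have been experimenting with the re module, but I could not find a way to solve this.

Edit

The main problem I am facing is that replacing a placeholder changes the length of the string, which makes subsequent modifications to it difficult. This is mostly because the same placeholder can occur several times in the string. Here is a snippet to illustrate: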

import re

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}


template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

for label in label_dict.keys():
    modified_string = template
    offset = 0
    for match in re.finditer(r'{}'.format(label), template):
        for label_text in label_dict.get(label, []):
            start, end = match.start() + offset, match.end() + offset
            offset += (len(label_text) - (end - start))
#             print ("Match was found at {start}-{end}: {match}".format(start = start, end = end, match = match.group()))
            modified_string = modified_string[: start] + label_text + modified_string[end: ]
            print(modified_string)

This gives the following incorrect output:

My name is ~PERSON~. I travel to UK with ~PERSON~ every year.
My name is ~PERSON~. I travel USA with ~PERSON~ every year.
My name is John Davies. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohTom Banton. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with John Davies every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohTom Banton every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohToJoe Morgan every year.

Tags: python-3.x, string, algorithm, combinatorics, re

Answer

Here are two ways to do this, or three if you count the new code I just added. All of them produce the desired set of sentences.

Nested loops

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = []
for gpe in data_in['~GPE~']:
    for person1 in data_in['~PERSON~']:
        for person2 in data_in['~PERSON~']:
            if person1 != person2: 
                data_out.append(f'My name is {person1}. I travel to {gpe} with {person2} every year.')

print('\n'.join(data_out))

List comprehension

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe in data_in['~GPE~']
    for person1 in data_in['~PERSON~']
    for person2 in data_in['~PERSON~']
    if person1 != person2
]

print('\n'.join(data_out))
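The `person1 != person2` filter can also be expressed with `itertools.permutations`, which yields ordered pairs of items taken from distinct positions in the list. The following is only a minimal sketch, assuming the same `data_in` dictionary as above and assuming the list of people contains no duplicate names (a duplicated name could still be paired with its copy).

```python
from itertools import permutations, product

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

# permutations(..., 2) yields every ordered pair of two different list entries,
# so no explicit person1 != person2 check is needed.
person_pairs = permutations(data_in['~PERSON~'], 2)

data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe, (person1, person2) in product(data_in['~GPE~'], person_pairs)
]

print('\n'.join(data_out))
```

This still hard-codes the template, just like the two versions above; it only replaces the inner pair of loops.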

Using a Pandas merge

Note that this code requires Pandas 1.2 or later.

import pandas as pd

data = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

# One single-column frame per placeholder.
country = pd.DataFrame({'country': data['~GPE~']})
person = pd.DataFrame({'person': data['~PERSON~']})

# Cross-join country with person twice: once for each ~PERSON~ slot.
cart = country.merge(person, how='cross').merge(person, how='cross')

cart.columns = ['country', 'person1', 'person2']

# Drop the rows in which the same person fills both slots.
cart = cart.query('person1 != person2').reset_index(drop=True)

# Build one sentence per remaining row.
cart['sentence'] = cart.apply(lambda row: f"My name is {row['person1']}. I travel to {row['country']} with {row['person2']} every year.", axis=1)

sentences = cart['sentence'].to_list()

print('\n'.join(sentences))
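All of the approaches above hard-code the template. If the template and the dictionary are only known at run time, one possible generalization is to count how often each placeholder occurs, take `itertools.permutations` of that length for each placeholder, combine them with `itertools.product`, and fill the template with `re.sub` and a callable replacement. Because `re.sub` builds the result in a single pass, the offset bookkeeping from the question is not needed. This is only a sketch: the helper name `expand` is made up for illustration, and it assumes a non-empty dictionary whose placeholders do not overlap one another.

```python
import re
from itertools import permutations, product

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
}
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."


def expand(template, label_dict):
    """Hypothetical helper: yield each filled-in template, never reusing
    a value for two occurrences of the same placeholder."""
    # One regex that matches any of the placeholders literally.
    pattern = re.compile('|'.join(re.escape(label) for label in label_dict))

    # Count how many slots each placeholder has in the template.
    counts = {}
    for match in pattern.finditer(template):
        counts[match.group()] = counts.get(match.group(), 0) + 1
    labels = list(counts)

    # For each placeholder: every ordered choice of distinct values for its slots.
    choices = [permutations(label_dict[label], counts[label]) for label in labels]

    for combo in product(*choices):
        # A fresh iterator per placeholder hands out its values slot by slot.
        fillers = {label: iter(values) for label, values in zip(labels, combo)}
        yield pattern.sub(lambda m: next(fillers[m.group()]), template)


sentences = list(expand(template, label_dict))
print('\n'.join(sentences))
```

For the example data this yields the same twelve sentences as the other approaches, though not necessarily in the same order. If a placeholder occurs more often than it has values, no sentence is produced at all.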
