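python-3.x - Creating a Cartesian product of strings by replacing substrings with values from lists

Problem description

I have a dictionary of placeholders and their lists of possible values, as shown below: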

{
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

I want to create all possible string combinations by replacing the placeholders (i.e. ~GPE~ and ~PERSON~) in a template:

"My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year".

The expected output is:

"My name is John Davies. I travel to UK with Tom Banton every year."
"My name is John Davies. I travel to UK with Joe Morgan every year."
"My name is John Davies. I travel to USA with Tom Banton every year."
"My name is John Davies. I travel to USA with Joe Morgan every year."
"My name is Tom Banton. I travel to UK with John Davies every year."
"My name is Tom Banton. I travel to UK with Joe Morgan every year."
"My name is Tom Banton. I travel to USA with John Davies every year."
"My name is Tom Banton. I travel to USA with Joe Morgan every year."
"My name is Joe Morgan. I travel to UK with Tom Banton every year."
"My name is Joe Morgan. I travel to UK with John Davies every year."
"My name is Joe Morgan. I travel to USA with Tom Banton every year."
"My name is Joe Morgan. I travel to USA with John Davies every year."

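Also note that the values belonging to one key in the dictionary are never repeated within the same sentence. For example, I don't want: "My name is Joe Morgan. I travel to USA with Joe Morgan every year." (So it is not exactly a Cartesian product, but close enough.)

I'm new to Python and have been experimenting with the re module, but couldn't find a solution to this problem.

EDIT

The main problem I'm facing is that replacing a substring changes the length of the string, which makes subsequent modifications to it difficult. This is made worse by the possibility of multiple instances of the same placeholder in the string. Below is a snippet to elaborate: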

import re

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}


template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

for label in label_dict.keys():
    modified_string = template
    offset = 0
    for match in re.finditer(r'{}'.format(label), template):
        for label_text in label_dict.get(label, []):
            start, end = match.start() + offset, match.end() + offset
            offset += (len(label_text) - (end - start))
#             print ("Match was found at {start}-{end}: {match}".format(start = start, end = end, match = match.group()))
            modified_string = modified_string[: start] + label_text + modified_string[end: ]
            print(modified_string)

This gives the following incorrect output:

My name is ~PERSON~. I travel to UK with ~PERSON~ every year.
My name is ~PERSON~. I travel USA with ~PERSON~ every year.
My name is John Davies. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohTom Banton. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with John Davies every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohTom Banton every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohToJoe Morgan every year.

Tags: python-3.x, string, algorithm, combinatorics, re

Solution

Here are two ways to do it (three if you count the new code I just added), and all of them produce the desired output.

Nested loops

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = []
for gpe in data_in['~GPE~']:
    for person1 in data_in['~PERSON~']:
        for person2 in data_in['~PERSON~']:
            # Skip pairs in which the same person would appear twice.
            if person1 != person2:
                data_out.append(f'My name is {person1}. I travel to {gpe} with {person2} every year.')

print('\n'.join(data_out))
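
As a variation on the loops, the two inner loops and the inequality check can probably be folded into itertools.permutations, which yields ordered pairs taken from different positions in the list. This is only a sketch: it assumes the names in the list are unique (a duplicated name would still be paired with itself), and the name data_out_alt is made up here for illustration.

```python
from itertools import permutations

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

# permutations(..., 2) gives every ordered pair of two different list entries,
# so no explicit person1 != person2 test is needed (assuming unique names).
data_out_alt = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe in data_in['~GPE~']
    for person1, person2 in permutations(data_in['~PERSON~'], 2)
]

print('\n'.join(data_out_alt))
```

With 2 places and 3 people this gives 2 * 3 * 2 = 12 sentences.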

List comprehension

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe in data_in['~GPE~']
    for person1 in data_in['~PERSON~']
    for person2 in data_in['~PERSON~']
    if person1 != person2
]

print('\n'.join(data_out))
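
Both versions above hard-code the sentence. If the template has to stay a plain string with ~LABEL~ placeholders, as in the question, something along the following lines might work. It is a rough sketch rather than a definitive implementation: the function name fill_template is invented here, and it assumes the dictionary is non-empty and that no placeholder is a prefix of another. Letting re.sub call a function for each match avoids the offset bookkeeping from the question, because the whole string is rebuilt in a single pass.

```python
import re
from itertools import permutations, product


def fill_template(template, values):
    """Yield every filled-in string; a value is never used twice for the same placeholder."""
    # Escape the keys so that characters such as '~' are matched literally.
    pattern = re.compile('|'.join(re.escape(key) for key in values))
    # Placeholders in order of appearance, e.g. ['~PERSON~', '~GPE~', '~PERSON~'].
    slots = pattern.findall(template)
    # Unique placeholders, in first-seen order.
    labels = list(dict.fromkeys(slots))
    # Per placeholder: all ordered picks of distinct values, one value per occurrence.
    choices = [permutations(values[label], slots.count(label)) for label in labels]
    for combo in product(*choices):
        # One iterator per placeholder, consumed left to right while substituting.
        picks = {label: iter(chosen) for label, chosen in zip(labels, combo)}
        yield pattern.sub(lambda m: next(picks[m.group()]), template)


label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

result = list(fill_template(template, label_dict))
print('\n'.join(result))
```

This produces the same 12 sentences as the hard-coded versions, though not necessarily in the same order. A placeholder that occurs more often than it has values simply yields no sentences.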

Using Pandas merge

Note that this code requires Pandas 1.2 or later.