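python - Combining values in a list of tuples in Python while preserving the expected output format
Problem description
I have a list of tuples. Each tuple consists of a string and a dictionary, and each dictionary in turn contains a list of tuples. The list has about 8K entries.
Sample data: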
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
The expected output from this is:
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
I wrote code that removes all overlapping values from a list of tuples. Example:
newinput = [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT'),(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]
# using set
visited = set()
# Output list initialization
Outputs = []
# Iteration
for a, b, c in newinput:
    if a not in visited:
        # print(a)
        visited.add(a)
        # print(visited)
        Outputs.append((a, b, c))
        # print(Outputs)
    # elif not b in visited:
    #     visited.add(b)
    #     Output.append((a, b,c))
    # else:
    #     pass
agn = []
newv = set()
for a, b, c in Outputs:
    # print(b)
    if b not in newv:
        newv.add(b)
        # print(newv)
        agn.append((a, b, c))
print(agn)
#Output:
#[(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]
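This code works fine, and I am able to keep only the tuples with unique numbers in the list. What I want now is to keep the sentence associated with each unique tuple (as shown in the expected output format). Also, my real dataset is a huge list, so I would like to do this in place and keep each sentence (for example, 'made of iron oxide') together with its entities rather than separating them. How can I do this efficiently, without using multiple lists, and still get the result in the expected format?
Solution
I rewrote the code to look for duplicate values and then combine the results into a new tuple.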
# dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
# NEW DATA SET BASED ON COMMENT
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (17, 20, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
clean_data = []
# loop through each sentence and dict of values
for sentence, values in dataset:
    # ADDED TO ADDRESS REQUEST IN THE COMMENTS:
    # start values are tracked per sentence, so the set is reset here
    seen_values = set()
    kept = []
    for value in values['entities']:
        # keep the entity only if its start value has not been seen before
        if value[0] not in seen_values:
            seen_values.add(value[0])
            kept.append(value)
    # replace the list contents in place; calling remove() on a list
    # while iterating over it can skip elements
    values['entities'][:] = kept
    clean_data.append((sentence, values))
print(clean_data)
Output:
# clean_data = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
# NEW DATA SET OUTPUT
clean_data = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (17, 20, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
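The answer above only checks the first number of each entity. The code in the question also filters on the second number, and the expected output for the original sample data depends on that (it drops (12, 15, 'PRODUCT') because 15 already appears as an end value in the same sentence). A minimal sketch of that two-pass filter applied per sentence, updating each 'entities' list in place, might look like the following. The helper name dedupe_entities is hypothetical, and resetting the seen values for every sentence is an assumption based on the comment in the answer.

```python
def dedupe_entities(entities):
    """Keep the first entity per start value, then the first per end value.

    Hypothetical helper: mirrors the two passes from the question,
    but works on the entities of a single sentence.
    """
    seen_starts = set()
    by_start = []
    for start, end, label in entities:
        if start not in seen_starts:
            seen_starts.add(start)
            by_start.append((start, end, label))

    seen_ends = set()
    result = []
    for start, end, label in by_start:
        if end not in seen_ends:
            seen_ends.add(end)
            result.append((start, end, label))
    return result


# Original sample data from the question.
dataset = [
    ('made of iron oxide', {'entities': [
        (12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'),
        (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),
    ('made of ferric oxide', {'entities': [
        (10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'),
        (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]}),
]

for _sentence, annotations in dataset:
    # Slice assignment rewrites the existing list object, so the sentence
    # and its entities stay together and no second dataset is built.
    annotations['entities'][:] = dedupe_entities(annotations['entities'])

print(dataset)
```

Each entity is visited a constant number of times and membership tests on a set are constant time on average, so this should scale to a list of roughly 8K entries without trouble. Note that it only removes repeated start or end values, as the question's code does; it does not detect overlapping ranges in general.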