How can I group strings by one of their substrings?

Problem description

I have the following list, jargs:

jargs = ['10192393\t15\t26\tskin tumour\tDiseaseClass\tD012878', 
         '10192393\t443\t449\tcancer\tDiseaseClass\tD009369',
         '10192393\t483\t496\tcolon cancers\tDiseaseClass\tD003110',
         '10194428\t30\t45\themochromatosis\tModifier\tD016399',
         '10194428\t102\t117\themochromatosis\tSpecificDisease\tD006432',
         '10194428\t119\t145\tHereditary hemochromatosis\tSpecificDisease\tD006432',
         '10194428\t147\t149\tHH\tDiseaseClass\tD006432']

I want to write a program that outputs the following:

ents = 
[
'10192393', {"entities":[(15, 26,"DiseaseClass"), (443, 449, "DiseaseClass"), (483, 496, "DiseaseClass")]}, 
'10194428', {"entities": [(30, 45, "Modifier"), (102, 117, "SpecificDisease"), (119, 145, "SpecificDisease"), (147, 149, "DiseaseClass")]}
]

I tried the following:

ents = [list(set([jargs[i].split('\t')[0] for i in range(len(jargs))]))[0],\
       {"entities": [(int(jargs[i].split('\t')[1]), int(jargs[i].split('\t')[2]),\
       jargs[i].split('\t')[-2]) for i in range(len(jargs))]}]

Unfortunately, this code outputs the following:

['10194428',
 {'entities': [(15, 26, 'DiseaseClass'),
   (443, 449, 'DiseaseClass'),
   (483, 496, 'DiseaseClass'),
   (30, 45, 'Modifier'),
   (102, 117, 'SpecificDisease'),
   (119, 145, 'SpecificDisease'),
   (147, 149, 'DiseaseClass')]}]

This is not the expected output: only one ID appears, and the entities of both IDs end up in a single list.

Tags: python, list, dictionary, text, nlp

Solution


from pprint import pprint

# Group the (start, end, label) tuples by document ID.
tmp = {}
for item in jargs:
    id_, v1, v2, _, v3, *_ = item.split("\t")
    # Convert the offsets to int so they match the expected output.
    tmp.setdefault(id_, []).append((int(v1), int(v2), v3))

# Flatten into [id, {"entities": [...]}, id, {"entities": [...]}, ...].
ents = []
for k, v in tmp.items():
    ents.append(k)
    ents.append({"entities": v})

pprint(ents)

Prints:

['10192393',
 {'entities': [(15, 26, 'DiseaseClass'),
               (443, 449, 'DiseaseClass'),
               (483, 496, 'DiseaseClass')]},
 '10194428',
 {'entities': [(30, 45, 'Modifier'),
               (102, 117, 'SpecificDisease'),
               (119, 145, 'SpecificDisease'),
               (147, 149, 'DiseaseClass')]}]
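
The same grouping could also be wrapped in a reusable function. The following is only a minimal sketch, assuming every line has at least five tab-separated fields in the order ID, start, end, text, label. The function name group_entities and the sample list are hypothetical and not part of the original answer. It uses collections.defaultdict in place of dict.setdefault and converts the offsets to int.

```python
from collections import defaultdict


def group_entities(lines):
    """Group tab-separated annotation lines by their first field (the ID).

    Hypothetical helper: assumes fields are ID, start, end, text, label, ...
    """
    grouped = defaultdict(list)  # dicts preserve insertion order (Python 3.7+)
    for line in lines:
        doc_id, start, end, _text, label, *_rest = line.split("\t")
        grouped[doc_id].append((int(start), int(end), label))

    # Flatten into [id, {"entities": [...]}, id, {"entities": [...]}, ...]
    result = []
    for doc_id, entities in grouped.items():
        result.append(doc_id)
        result.append({"entities": entities})
    return result


# Sample input where lines with the same ID are not adjacent.
sample = [
    "10192393\t15\t26\tskin tumour\tDiseaseClass\tD012878",
    "10194428\t30\t45\themochromatosis\tModifier\tD016399",
    "10192393\t443\t449\tcancer\tDiseaseClass\tD009369",
]
print(group_entities(sample))
```

Because the grouping is keyed on the ID, lines that share an ID do not need to be adjacent in the input; the IDs appear in the result in the order they are first seen.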
