How to efficiently get elements from a list of tuples in Python

Question

I have a list of tuples, as follows.

mydata = [
    (5274919, ['report', 'porcelain', 'record', 'technic'], "[b'Dental Porcelain', b'Dentistry']"),
    (5274920, ['implantology', 'dentistry'], "[b'Dental Implantation', b'Dentistry']"),
    (5274921, ['record', 'recognition', 'long', 'standing', 'root', 'perforation', 'molar'], "[b'Dentistry', b'Molar', b'Root Canal Therapy', b'adverse effects']"),
    (5274923, ['exogenic', 'endogenic', 'cause', 'tooth', 'jaw', 'anomaly'], "[b'Dentistry', b'Jaw Abnormalities', b'etiology', b'Tooth Abnormalities', b'etiology']"),
    (5274922, ['obscure', 'facial pain', 'unnatural', 'occlusal', 'height'], "[b'Dental Occlusion, Traumatic', b'complications', b'Dentistry', b'Facial Neuralgia', b'etiology']"),
    (11636455, ['presenting', 'development', 'denmark'], "[b'Demography', b'Denmark']"),
    (12255310, ['study', 'human lactation'], "[b'Biology', b'Health', b'Lactation', b'Nutritional Physiological Phenomena', b'Physiology', b'Pregnancy', b'Research']"),
    (12255446, ['the effect', 'testosterone propionate', 'estradiol', 'given', 'combination', 'reproductive organ', 'gonadotrophin', 'presenting', 'pituitary', 'the rat'], "[b'Androgens', b'Animals, Laboratory']"),
    (12259009, ['carcinoma of the cervix', 'epidemiologic', 'study'], "[b'Age Factors', b'Behavior', b'Birth Rate', b'Coitus', b'Contraception', b'Contraception Behavior', b'Contraceptives, Postcoital', b'Demography', b'Disease', b'Education', b'Epidemiologic Methods', b'Family Planning Services', b'Fertility', b'Infection', b'Marital Status', b'Marriage', b'Neoplasms', b'Parity', b'Population', b'Population Characteristics', b'Population Dynamics', b'Religion', b'Reproduction', b'Research', b'Sexual Behavior', b'Sexually Transmitted Diseases', b'Social Class', b'Uterine Cervical Neoplasms']"),
    (12278329, ['clitoridectomy', 'downfall', 'isaac baker brown', 'f r c s'], "[b'Attitude', b'Behavior', b'Delivery of Health Care', b'Developed Countries', b'England', b'Europe', b'Health', b'Health Personnel', b'Physicians', b'Psychology', b'United Kingdom']"),
]

I also have a list of words, as follows.

mywords = ["presenting", "record"]

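For each word in mywords, I want to check whether it appears in the second element of a tuple. If it does, I want to collect that tuple's third element into a combined list for that word.

So the output should be: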

presenting = [b'Demography', b'Denmark', b'Androgens', b'Animals, Laboratory']
record = [b'Dental Porcelain', b'Dentistry', b'Dentistry', b'Molar', b'Root Canal Therapy', b'adverse effects']

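My current code is as follows.

import ast

for word in mywords:
    my_keywords = []
    for item in mydata:
        # item[1] is the word list, item[2] is the stringified list of byte strings
        if word in item[1]:
            my_keywords.append(ast.literal_eval(item[2]))
    print(my_keywords)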

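However, since mydata is very large (about 1.8 million items), processing a single word from mywords takes about 1.5 minutes, which is very slow. Please let me know an efficient way to do this in Python.

I am happy to provide more details if needed.

Tags: python

Solution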


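I assume that when you say mydata is "very, very large", you mean it is far larger than the sample data in your question. Otherwise it is hard to imagine this taking minutes to run.

The algorithm can be improved by iterating over the list once in total, rather than once per keyword. The trick is to use a set intersection to test whether an item matches one or more keywords. To build a separate result list for each keyword, we can store the lists in a dictionary. literal_eval also only needs to be called once per item, even when the item matches several keywords.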

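import ast

my_words = {"presenting", "record"}

# One result list per keyword
results = { k: [] for k in my_words }

for item in mydata:
    # Keywords that occur in this item's word list (item[1])
    matching_words = my_words & set(item[1])
    if matching_words:
        # Parse the third element only once per matching item
        item_result = ast.literal_eval(item[2])
        for w in matching_words:
            results[w].append(item_result)

for word in my_words:
    print(results[word])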

Note that my_words is now a set rather than a list.
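The single-pass idea above can also be wrapped in a reusable function. The following is only a minimal sketch: the function name collect_keywords and the shortened sample rows are assumptions made for illustration, not part of the original answer. It differs from the code above in one respect: it uses extend instead of append, so each keyword ends up with one flat list, which is the shape shown in the question's expected output.

```python
import ast


def collect_keywords(data, words):
    """Scan data once and return {word: flat list of parsed third elements}.

    collect_keywords is a hypothetical helper name used for this sketch.
    """
    wanted = set(words)
    results = {w: [] for w in wanted}
    for _item_id, tokens, raw in data:
        # set.intersection accepts any iterable, so tokens can stay a list
        matches = wanted.intersection(tokens)
        if matches:
            # Parse the stringified list only once per matching item
            parsed = ast.literal_eval(raw)
            for w in matches:
                # extend (not append) keeps each result list flat
                results[w].extend(parsed)
    return results


# Rows abbreviated from the question's sample data (assumption: same layout)
sample = [
    (5274919, ['report', 'porcelain', 'record', 'technic'],
     "[b'Dental Porcelain', b'Dentistry']"),
    (11636455, ['presenting', 'development', 'denmark'],
     "[b'Demography', b'Denmark']"),
    (12255446, ['presenting', 'pituitary'],
     "[b'Androgens', b'Animals, Laboratory']"),
]

results = collect_keywords(sample, ["presenting", "record"])
print(results["presenting"])
print(results["record"])
```

If the nested shape produced by the earlier code is preferred (one inner list per matching item), replacing extend with append restores it.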

