Getting unique pairs of items from an XML file using Python

Problem description

I have an XML dataset structured as follows:

<DataSet>
    <Record><!-- each DataSet can have zero to many Record tags -->
        <Identifier><!-- each Record will definitely have exactly one Identifier tag -->
            <MRN value="MRN"></MRN><!-- Each Identifier will have zero or at the most one MRN tag, with alphanumeric character as the patient's MRN in value attribute -->
        </Identifier>
        <Medication><!-- each Record will definitely have exactly one Medication tag -->
            <Item value="CUI"></Item><!-- Each Medication will have zero to many Item tags, with alphanumeric character as the Medication CUI in the value attribute -->
        </Medication>
    </Record>
</DataSet>

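I want to export a list of unique MRN value / CUI value pairs to a CSV file. The final CSV file should have two columns: the MRN value in the first and the CUI value in the second.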

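If an MRN has several CUIs, I want the MRN value repeated in the first column for each CUI. I also don't want any empty values: an MRN without any CUI should not be extracted, and neither should a CUI without an MRN.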

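I have tried using lists and dictionaries, but I can't get the final output into the shape I want, with the MRN value repeated for each CUI. I even built a DataFrame to see which CUI belongs to which MRN, but that is still not the output I am after. This is the code I used: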

import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('/med/dataset.xml')
root = tree.getroot()


mrn = []
cui = []
for element in root:
    for item in element[0::2]:
        d=[]
        mrn.append(d)
        for child in item:
            d.append(child.attrib['value'])
    for item in element[1::2]:
        d=[]
        cui.append(d)
        for child in item:
            d.append(child.attrib['value'])
new_list = [a + b for a,b in zip(mrn, cui)]
print(new_list)
df = pd.DataFrame(new_list)
print(df)

I would like to do this using only standard Python libraries (pandas, numpy, xml.etree.ElementTree, and csv).

Any ideas?

Tags: python, xml, pandas, csv, numpy

Solution


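You can loop over the medications inside a loop over the MRNs, so that each MRN is paired with every CUI in the same record. A record with no MRN or no Item tags then contributes no rows at all. Try something like this.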

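mrn_li = []
cui_li = []
for record in root:  # each <Record> under <DataSet>
    for mrn in record[0]:  # <MRN> children of <Identifier> (zero or one)
        for med in record[1]:  # <Item> children of <Medication>
            # repeat the MRN once for every CUI in the same record
            mrn_li.append(mrn.attrib['value'])
            cui_li.append(med.attrib['value'])

new_list = [[i, j] for i, j in zip(mrn_li, cui_li)]
print(new_list)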
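
The nested loop above pairs each MRN with its CUIs, but it does not remove duplicate pairs or write the CSV file. One possible way to cover those two steps with only xml.etree.ElementTree and csv is sketched below. The sample XML, the helper names unique_pairs and write_pairs, and the MRN/CUI header row are assumptions made for illustration; they do not come from the question or the answer.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data following the structure described in the question.
SAMPLE_XML = """
<DataSet>
    <Record>
        <Identifier><MRN value="A1"></MRN></Identifier>
        <Medication><Item value="C1"></Item><Item value="C2"></Item><Item value="C1"></Item></Medication>
    </Record>
    <Record>
        <Identifier></Identifier>
        <Medication><Item value="C3"></Item></Medication>
    </Record>
    <Record>
        <Identifier><MRN value="B2"></MRN></Identifier>
        <Medication></Medication>
    </Record>
    <Record>
        <Identifier><MRN value="A1"></MRN></Identifier>
        <Medication><Item value="C2"></Item><Item value="C4"></Item></Medication>
    </Record>
</DataSet>
""".strip()


def unique_pairs(root):
    """Return unique (MRN, CUI) pairs in first-seen order, skipping empty values."""
    seen = set()
    pairs = []
    for record in root.iter("Record"):
        mrn = record.find("Identifier/MRN")
        if mrn is None or not mrn.get("value"):
            continue  # no MRN: nothing to pair the CUIs with
        for item in record.findall("Medication/Item"):
            cui = item.get("value")
            pair = (mrn.get("value"), cui)
            if cui and pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


def write_pairs(pairs, handle):
    """Write a header row followed by one row per (MRN, CUI) pair."""
    writer = csv.writer(handle)
    writer.writerow(["MRN", "CUI"])
    writer.writerows(pairs)


root = ET.fromstring(SAMPLE_XML)
pairs = unique_pairs(root)

# Write to an in-memory buffer here; a real file would work the same way.
buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

To write to disk instead, the buffer could be replaced with a file opened as open('pairs.csv', 'w', newline=''), where pairs.csv is a placeholder name. The set is only used for the membership check; the list keeps the pairs in the order they first appear in the XML.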
