How to extract very large XML data and store it in a dictionary in Python

Problem description

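I have an XML file named Comments.xml that is 15 GB in size. I want to get a dictionary with 2 keys, UserId and Text. Note that many UserId and Text values are missing in the file. I tried the following code, but because the file is so large it exhausted the RAM (13 GB) and crashed. Is there an efficient way to get the data out of the XML file for data analysis?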

Part of the XML file Comments.xml

<comments>
<row Id = '1' UserId = '143' Text = 'Hello World'>
<row Id = '2' UserId = '183' Text = 'Trigonometry is important.'>
<row Id = '3' UserId = '5645' Text = 'Mathematics is best.'>
<row Id = '4' UserId = '143' Text = 'Hello stack overflow'>
<row Id = '5' UserId = '143' Text = 'Hello'>

Code

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.iterparse('Comments.xml')

comments = {}  # Dictionary to store the required data

for event, root in tree:

  if (('Text' in root.attrib) and ('UserId' in root.attrib)):  # To check for missing values
    Text = root.attrib['Text']
    UserId = root.attrib['UserId']
    comments.update({UserId: Text})  # Adding data to dictionary
    root.clear()

Expected output

{'143':'Hello World','183':'Trigonometry is important.','5645':'Mathematics is best.','143':'Hello stack overflow','143':'Hello'}

OR

{'UserId':['143','183','5645','143','143'],'Text':['Hello World','Trigonometry is important.','Mathematics is best.','Hello stack overflow','Hello']}
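
One caveat about the first form: a Python dict keeps only one value per key, so repeated UserId values such as '143' would overwrite each other and only the last comment per user would survive. A minimal sketch of a grouped form follows; the helper name group_by_user and the idea of grouping are assumptions for illustration, not part of the original question.

```python
from collections import defaultdict


def group_by_user(pairs):
    """Collect every Text under its UserId instead of overwriting."""
    grouped = defaultdict(list)
    for user_id, text in pairs:
        grouped[user_id].append(text)
    return dict(grouped)


# Sample rows taken from the XML excerpt above.
rows = [
    ("143", "Hello World"),
    ("183", "Trigonometry is important."),
    ("5645", "Mathematics is best."),
    ("143", "Hello stack overflow"),
    ("143", "Hello"),
]
grouped = group_by_user(rows)
```

The second form (two parallel lists) has no such problem, since lists allow repeated values.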

Tags: python, python-3.x, xml, dictionary, xml-parsing

Solution


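Another approach: read the file line by line and parse each line on its own, so the whole XML tree is never held in memory.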

import io
from simplified_scrapy import SimplifiedDoc

def getComments(fileName):
    comments = {'UserId': [], 'Text': []}
    with io.open(fileName, "r", encoding='utf-8') as file:
        for line in file:  # Read data line by line
            doc = SimplifiedDoc(line)  # Instantiate a doc
            row = doc.getElement('row')  # Get row
            if row:
                comments['UserId'].append(row['UserId'])
                comments['Text'].append(row['Text'])
    return comments

comments = getComments('Comments.xml')  # This dictionary will be very large, too
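
For comparison, the same result can probably be obtained with the standard library alone. The sketch below is not the answer's method. It uses xml.etree.ElementTree.iterparse and clears the root element after each row so that processed rows are released. It assumes the real file is well-formed XML in which each row is a self-closing element (`<row ... />`) directly under `<comments>`; the function names iter_comments and get_comments are made up for illustration.

```python
import xml.etree.ElementTree as ET


def iter_comments(source):
    """Yield (UserId, Text) pairs from <row> elements, one at a time.

    `source` is a file name or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            user_id = elem.attrib.get("UserId")
            text = elem.attrib.get("Text")
            # Skip rows where either attribute is missing.
            if user_id is not None and text is not None:
                yield user_id, text
            # Drop the processed row from the tree so it does not grow.
            root.clear()


def get_comments(source):
    """Build the two-list dictionary form from the expected output."""
    comments = {"UserId": [], "Text": []}
    for user_id, text in iter_comments(source):
        comments["UserId"].append(user_id)
        comments["Text"].append(text)
    return comments
```

Calling get_comments('Comments.xml') keeps the parser's memory use roughly constant, but the returned dictionary itself still grows with the number of comments, so it may not fit in 13 GB either.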

More examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
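
Since the resulting dictionary will itself be very large, one option is to avoid building it and stream the extracted pairs to a CSV file instead. The sketch below assumes some generator that yields (UserId, Text) pairs one at a time, for example one built on the line-by-line reading shown above; the function name write_comments_csv is hypothetical.

```python
import csv


def write_comments_csv(pairs, out_file):
    """Write (UserId, Text) pairs to an open text file as CSV.

    Only one pair is held in memory at a time. Returns the row count.
    """
    writer = csv.writer(out_file)
    writer.writerow(["UserId", "Text"])  # header row
    count = 0
    for user_id, text in pairs:
        writer.writerow([user_id, text])
        count += 1
    return count
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, e.g. open('comments.csv', 'w', newline='', encoding='utf-8'). The CSV can then be analysed in pieces rather than loaded all at once.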

