Pandas DataFrame from a Stack Overflow data dump file

Problem description

The code shown further down works as expected and produces a CSV file with 8 records:

!cat stack_test1.csv

rowId,UserId,Date,Class
4,1,2008-07-31T21:42:52.667,696
6,1,2008-07-31T22:08:08.620,301
7,2,2008-07-31T22:17:57.883,463
9,1,2008-07-31T23:40:59.743,1941
11,1,2008-07-31T23:55:37.967,1556
12,2,2008-07-31T23:56:41.303,332
13,1,2008-08-01T00:42:38.903,633
14,1,2008-08-01T00:59:11.177,437

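Is there a way to read only the first few records from the source text file and save them as CSV in file1.csv, leaving the rest in file2.txt? I do not want to split the final file. I only need to read the first 3 or 4 rows from the source file, because that file is very large (about 80 GB).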

!wget https://testme162.s3.amazonaws.com/test1.xml
!echo '</posts>' > last.txt
!cat test1.xml last.txt > /root/test2.xml

from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/root/test2.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['PostTypeId'],
                          'Date': elem.attrib['CreationDate'],
                          'Class': elem.attrib['Score'] })

        # dict_list.append(elem.attrib)      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        elem.clear()

df = pd.DataFrame(dict_list)
df.to_csv('stack_test1.csv', index=False)

Tags: python, pandas

Solution
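
One possible approach, offered as a minimal sketch rather than a confirmed answer: since iterparse streams the file, the loop from the question can simply stop once enough rows have been collected, so the remainder of the 80 GB file is never read. The function names read_head_rows and write_head_csv and the n_rows parameter are hypothetical, introduced only for illustration. The sketch writes the CSV with the standard-library csv module instead of pandas; passing the returned list to pd.DataFrame(...).to_csv(..., index=False), as in the question, should give an equivalent file.

```python
import csv
from xml.etree.ElementTree import iterparse

# Column names used in the question's CSV output.
FIELDS = ["rowId", "UserId", "Date", "Class"]


def read_head_rows(xml_path, n_rows=4):
    """Return the first n_rows <row> records as dicts, then stop parsing.

    Hypothetical helper: the attribute-to-column mapping is copied from
    the question, everything else is an assumption.
    """
    rows = []
    if n_rows <= 0:
        return rows
    with open(xml_path, "rb") as source:
        for _, elem in iterparse(source, events=("end",)):
            if elem.tag != "row":
                continue
            rows.append({
                "rowId": elem.attrib["Id"],
                "UserId": elem.attrib["PostTypeId"],
                "Date": elem.attrib["CreationDate"],
                "Class": elem.attrib["Score"],
            })
            elem.clear()  # drop the element's data once it has been copied
            if len(rows) >= n_rows:
                break  # stop early: the rest of the file is not read
    return rows


def write_head_csv(xml_path, csv_path, n_rows=4):
    """Write the first n_rows records of xml_path to csv_path."""
    rows = read_head_rows(xml_path, n_rows)
    with open(csv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With the files from the question, a call such as write_head_csv("/root/test2.xml", "file1.csv", n_rows=4) would be the intended usage. Because the loop breaks before the end of the file is reached, appending the closing </posts> tag should not be necessary for this partial read, although that has not been verified against the 80 GB dump.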
