Extracting a .csv file into a 2D list without using any libraries

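Problem description

As part of an assignment, I have to extract a .csv file without using any libraries. The header and the first 3 records look like this: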

"ID","Name","Sex","Age","Height","Weight","Team","NOC","Games","Year","Season","City","Sport","Event","Medal"
"1","A Dijiang","M",24,180,80,"China","CHN","1992 Summer",1992,"Summer","Barcelona","Basketball","Basketball Men's Basketball",NA
"2","A Lamusi","M",23,170,60,"China","CHN","2012 Summer",2012,"Summer","London","Judo","Judo Men's Extra-Lightweight",NA
"3","Gunnar Nielsen Aaby","M",24,NA,NA,"Denmark","DEN","1920 Summer",1920,"Summer","Antwerpen","Football","Football Men's Football",NA

I tried to implement it as follows:

csv_data = []
with open('olympic.csv') as csv_file:
    for line in csv_file:
        line = line.strip()
        line = line.split(',')
        temp = []
        for element in line:
            if element[0] == '"' or element[-1] == '"':
                temp.append(element[1 : -1])
            else:
                temp.append(element)
        csv_data.append(temp)

This gives a roughly correct result, but it breaks when the Name or Event column contains a "," character, for example:

"," in Name column
"5965","Dionisio Augustine, II","M",24,153,65,"Federated States of Micronesia","FSM","2016 Summer",2016,"Summer","Rio de Janeiro","Swimming","Swimming Men's 50 metres Freestyle",NA
"7208","Carlos Zenon Balderas, Jr.","M",19,175,60,"United States","USA","2016 Summer",2016,"Summer","Rio de Janeiro","Boxing","Boxing Men's Lightweight",NA

"," in Event column
"2304","Michael Albasini","M",31,172,67,"Switzerland","SUI","2012 Summer",2012,"Summer","London","Cycling","Cycling Men's Road Race, Individual",NA
"250","Saeid Morad Abdevali","M",22,170,80,"Iran","IRI","2012 Summer",2012,"Summer","London","Wrestling","Wrestling Men's Welterweight, Greco-Roman",NA

Is there a proper way to solve this without using the standard library?

Tags: python, python-3.x, dataframe, csv

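Solution

Yes... but then you may also have to handle escaped quote characters, and then (why not?) newlines inside columns...

That is why, in real life, the best strategy is to use a library rather than reinvent the wheel (which in this case is really a complicated piece of clockwork).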

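You could try using regular expressions to capture the column values. A naive pattern for quoted columns might look like '"([^"]+)"'; for unquoted columns (numbers?), perhaps one with lookarounds: '(?<=,)(\d+)(?=,)'... and then try to put everything together.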
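One way the regex idea might be put together is sketched below. It assumes the `re` module is allowed, which a strict "no libraries" rule may well forbid. The names `FIELD` and `split_line` are hypothetical, and the single combined pattern is a substitution for the two separate patterns mentioned above. It does not handle escaped quotes or newlines inside a field.

```python
import re

# Hypothetical pattern: each field is preceded by the start of the line or a
# comma, and is either "quoted text" (group 1) or a run of non-commas (group 2).
FIELD = re.compile(r'(?:^|,)(?:"([^"]*)"|([^,]*))')

def split_line(line):
    """Split one CSV line into fields, keeping commas that sit inside quotes."""
    fields = []
    for match in FIELD.finditer(line):
        quoted, bare = match.groups()
        # Exactly one of the two groups participates in each match.
        fields.append(quoted if quoted is not None else bare)
    return fields

line = '"2304","Michael Albasini","M",31,"Cycling Men\'s Road Race, Individual",NA'
print(split_line(line))
```

Unquoted values such as 31 and NA come back as strings; converting them is left to the caller.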

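Alternatively (since this is a class assignment, efficiency and speed are probably not mandatory), you could write a state machine: read one character at a time and act accordingly: if it is a '"', keep reading until the next '"'; otherwise read until the next comma, and so on...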
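A minimal sketch of such a state machine follows, using only built-ins. The function name `parse_csv` is hypothetical, and the code assumes that a literal quote inside a quoted field is written as a doubled quote (""), which the sample data does not confirm. All values are returned as strings.

```python
def parse_csv(text):
    """Parse CSV text into a list of rows, each row a list of strings."""
    rows = []          # completed rows
    row = []           # fields of the row being built
    field = []         # characters of the field being built
    in_quotes = False  # True while inside a "..." field
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # assumed escape: "" means a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)       # commas and newlines are plain data here
        elif ch == '"':
            in_quotes = True           # opening quote
        elif ch == ',':
            row.append(''.join(field)) # end of field
            field = []
        elif ch == '\n':
            row.append(''.join(field)) # end of field and end of row
            field = []
            rows.append(row)
            row = []
        elif ch != '\r':
            field.append(ch)           # ordinary character outside quotes
        i += 1
    # Flush the last row when the text does not end with a newline.
    if field or row:
        row.append(''.join(field))
        rows.append(row)
    return rows

sample = '"ID","Name","Age"\n"5965","Dionisio Augustine, II",24\n'
print(parse_csv(sample))
```

For the file in the question, something like `with open('olympic.csv') as csv_file: csv_data = parse_csv(csv_file.read())` should work. Reading the whole file at once, rather than line by line, is what lets a quoted field span several lines.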

