How do I split a csv file in Python so that each smaller file keeps the header?

Problem description

I am using the code from here to split a csv file into many smaller files (scroll down to see the full code): https://dzone.com/articles/splitting-csv-files-in-python

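The file is split successfully and keeps its structure, but the header has disappeared. I suspect something is wrong with the arguments passed to pd.read_csv().

Please help me look into this:

Input file: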

    Text Header    tag
0    textbody1    Y
1    textbody2    N
2    textbody2    Y

Result (the structure is still there, but the header is missing from my split csv files):

0    textbody1    Y
1    textbody2    N
2    textbody2    Y

See the full script below:

    import pandas as pd

    #csv file name to be read in
    in_csv = 'iii_baiterEmailTagged.csv'

    #get the number of lines of the csv file to be read
    number_lines = sum(1 for row in (open(in_csv)))

    #size of rows of data to write to the csv,
    #you can change the row size according to your need
    rowsize = 10000

    #start looping through data writing it to a new file for each set
    for i in range(1,number_lines,rowsize):

        df = pd.read_csv(in_csv,
              header=None,
              nrows = rowsize,#number of rows to read at each loop
              skiprows = i)#skip rows that have been read

        #csv to write data to a new file with indexed name. input_1.csv etc.
        out_csv = 'Enronset' + str(i) + '.csv'

        df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',#append data to csv file
              chunksize=rowsize)#size of data to append for each loop

Thanks

Tags: python, pandas, dataframe, csv

Solution


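You are skipping the first row, which holds the header, because the for loop starts at 1 instead of 0:

    for i in range(1,number_lines,rowsize):

You are also explicitly telling pandas that there is no header to read (just omit the argument):

    pd.read_csv(...,header=None)

And you are not writing one (replace False with True):

    df.to_csv(...,header=False,...)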
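
The effect of these arguments can be checked on a tiny in-memory sample. This is only a sketch: the sample text is hypothetical, and it assumes the real file is comma-separated with the two columns shown in the question. It compares the original read_csv call with one that keeps the default header handling and skips only data lines.

```python
import io
import pandas as pd

# Hypothetical three-row sample mirroring the input shown in the question
# (assumes a comma-separated file).
sample = "Text Header,tag\ntextbody1,Y\ntextbody2,N\ntextbody2,Y\n"

# Original call: line 0 (the header) is skipped and no header is parsed,
# so the columns are labelled by integer position.
no_header = pd.read_csv(io.StringIO(sample), header=None, nrows=2, skiprows=1)

# Default header handling: line 0 supplies the column names, and only the
# data lines 1..k are skipped.
k = 1
with_header = pd.read_csv(io.StringIO(sample), nrows=2, skiprows=range(1, k + 1))

print(list(no_header.columns))    # integer positions instead of names
print(list(with_header.columns))  # names taken from the header line
```

Passing a list-like to skiprows names the exact line numbers to skip, whereas an integer skips that many lines from the top of the file, header included.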

Here is complete working code. Two further changes keep the header correct in every file: skiprows is a range starting at 1, so line 0 (the header) is never skipped, and the files are written with mode='w', so rerunning the script does not append a second header.

    import pandas as pd

    # csv file name to be read in
    in_csv = 'iii_baiterEmailTagged.csv'

    # count the data rows in the csv file (total lines minus the header line)
    with open(in_csv) as f:
        number_rows = sum(1 for row in f) - 1

    # number of data rows to write to each output file;
    # you can change the row size according to your need
    rowsize = 10000

    # loop through the data, writing each set of rows to a new file
    for i in range(0, number_rows, rowsize):

        df = pd.read_csv(in_csv,
                         nrows=rowsize,              # number of rows to read at each loop
                         skiprows=range(1, i + 1))   # skip data rows already read, keep the header

        # write the data to a new file with an indexed name: Enronset0.csv, Enronset10000.csv etc.
        out_csv = 'Enronset' + str(i) + '.csv'

        df.to_csv(out_csv,
                  index=False,
                  header=True,
                  mode='w')    # overwrite, so a rerun does not duplicate the header
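
As an alternative to tracking skiprows by hand, read_csv can return the file in pieces when given a chunksize. The sketch below is not part of the original answer: the function name split_csv and the out_prefix parameter are made up for illustration, and the output files are numbered 0, 1, 2 and so on rather than by row offset.

```python
import pandas as pd


def split_csv(in_csv, out_prefix, rowsize=10000):
    """Write consecutive chunks of in_csv to numbered files, each with the header."""
    out_paths = []
    # With chunksize set, read_csv returns an iterator of DataFrames that all
    # share the column names parsed from the first line of the file.
    for n, chunk in enumerate(pd.read_csv(in_csv, chunksize=rowsize)):
        out_csv = f"{out_prefix}{n}.csv"
        # header=True is the default for to_csv, so every file gets the header.
        chunk.to_csv(out_csv, index=False)
        out_paths.append(out_csv)
    return out_paths
```

For the file in the question this could be called as split_csv('iii_baiterEmailTagged.csv', 'Enronset'). The input is read once and no line count is needed beforehand.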
