How can I read a limited number of columns plus the rest of the line as a string into a Pandas DataFrame?

Problem description

I have data files that look like this:

   1         97289.7040474555       4115155.1896845801     0.00               !CBBT
   2        110001.7354024933       4137233.7577695986     0.00               !Kipp
   3         74939.1481210588       4112567.6513698865     0.00               !Sewell
   4         79229.7242711330       4187153.1872533290     0.00
   5         61695.8917768109       4228702.2687039487     0.00               !Lew
   6         60107.3402798224       4339752.4155515051     0.00
   7         50000.7890042169       4371192.7593760351     0.00
   8         83629.3190899160       4365284.1160797141     0.00               !Tolchester
   9        102439.5140892575       4294944.9082385190     0.00               !Cam
  10         57967.0160118927       4146324.9518534313     0.00
  11        134578.0549105595       4027953.9016875746     0.00             !Duck
  12         98480.8626             4069397.4626           0.0000           !       030102051108    Tull Bay-Northwest River
  13         79345.3319             4069313.0018           0.0000           !       030102051102    Culpeper Island-Dismal Swamp
  14         93152.7452             4069628.5738           0.0000           !       030102051104    Indian Creek-Northwest River
  15        105488.9967             4070521.3806           0.0000           !       030102051206    Milldam Creek-North Landing River
...

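I would like to read this into a Pandas DataFrame with the first four columns as numbers, and the rest of the line read into a fifth column as a single string containing the complete comment or note. I can't count on the spacing or on the ! symbol being there.

If I try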

import numpy as np
import pandas as pd

df = pd.read_csv('foo.txt', sep=r'\s+',
           dtype={'id': np.int32, 'x': np.float64, 'y': np.float64, 'note': str},)

I only get the first whitespace-delimited token of the comment in the note column.

I would like something like the LIMIT argument of Perl's split function: split REGEX, STRING, LIMIT.

Tags: python, pandas, csv

Solution


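First read the file into a single column by using a separator that does not occur in the file, such as |. Then, in the next step, split that column with Series.str.split (limiting the number of splits with n=4), assign the new column names, and convert the types with DataFrame.astype: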

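import numpy as np
import pandas as pd

# "|" does not occur in the file, so each whole line ends up in column 0
df = pd.read_csv('foo.txt', sep="|", header=None)
print (df)


d = {'id': np.int32, 'x': np.float64, 'y': np.float64,'z':np.float64, 'note': str}

# split on whitespace at most 4 times; the rest of the line stays in the 5th column
df = df[0].str.split(n=4, expand=True)
df.columns = list(d.keys())
df = df.astype(d)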

print (df)
    id              x             y    z  \
0    1   97289.704047  4.115155e+06  0.0   
1    2  110001.735402  4.137234e+06  0.0   
2    3   74939.148121  4.112568e+06  0.0   
3    4   79229.724271  4.187153e+06  0.0   
4    5   61695.891777  4.228702e+06  0.0   
5    6   60107.340280  4.339752e+06  0.0   
6    7   50000.789004  4.371193e+06  0.0   
7    8   83629.319090  4.365284e+06  0.0   
8    9  102439.514089  4.294945e+06  0.0   
9   10   57967.016012  4.146325e+06  0.0   
10  11  134578.054911  4.027954e+06  0.0   
11  12   98480.862600  4.069397e+06  0.0   
12  13   79345.331900  4.069313e+06  0.0   
13  14   93152.745200  4.069629e+06  0.0   
14  15  105488.996700  4.070521e+06  0.0   

                                                 note  
0                                               !CBBT  
1                                               !Kipp  
2                                             !Sewell  
3                                                 nan  
4                                                !Lew  
5                                                 nan  
6                                                 nan  
7                                         !Tolchester  
8                                                !Cam  
9                                                 nan  
10                                              !Duck  
11   !       030102051108    Tull Bay-Northwest River  
12  !       030102051102    Culpeper Island-Dismal...  
13  !       030102051104    Indian Creek-Northwest...  
14  !       030102051206    Milldam Creek-North La...  
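
For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same approach. It assumes three sample lines copied from the question and reads them from an in-memory buffer. The names sample, raw and parts are only illustrative, and the fillna("") step is an addition that is not part of the answer above: it turns a missing note into an empty string rather than a placeholder string, which may or may not be what you want.

```python
import io

import numpy as np
import pandas as pd

# Three lines copied from the question: a short note, no note,
# and a note that itself contains whitespace.
sample = (
    "   1         97289.7040474555       4115155.1896845801     0.00               !CBBT\n"
    "   4         79229.7242711330       4187153.1872533290     0.00\n"
    "  12         98480.8626             4069397.4626           0.0000           !       030102051108    Tull Bay-Northwest River\n"
)

# "|" does not occur in the data, so each whole line lands in column 0.
raw = pd.read_csv(io.StringIO(sample), sep="|", header=None)[0]

# Split on runs of whitespace at most 4 times; whatever remains of the
# line becomes the fifth column.
parts = raw.str.split(n=4, expand=True)
parts.columns = ["id", "x", "y", "z", "note"]

# Assumption (not in the original answer): replace missing notes with ""
# before casting, so they do not become a placeholder string.
parts["note"] = parts["note"].fillna("")

df = parts.astype(
    {"id": np.int32, "x": np.float64, "y": np.float64, "z": np.float64, "note": str}
)
print(df)
```

The n=4 argument plays the role of Perl's LIMIT here, in the same way that maxsplit does for Python's built-in str.split.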
