首页 > 解决方案 > 重新排列 csv 中的唯一组

问题描述

我有一个像这样排列的巨大数据集。每个 ID 对应于一组唯一的组。

0
0 0
NUMBER        22 ADD_FLD    5  15 &11111
ID  382 START_TIME 2001052306
POINT  63
2010052306 119.464119 15.870264 1.682708e+00 & 1.213053
2010052312 119.910667 15.874892 1.934127e+00 & 1.221175 
2010052318 120.368523 16.022879 2.260490e+00 & 1.227459
2010052400 120.611115 15.788021 2.787007e+00 & 1.229084
2010052406 121.286072 15.984570 3.253321e+00 & 1.230381

ID  413 START_TIME 2010061006
POINT  40
2010061006 156.424057 5.559299 1.059667e+00 & 1.578506 
2010061012 153.899506 6.450210 1.150635e+00 & 1.516614 
2010061018 152.346802 7.281753 1.187466e+00 & 1.501871

我想做的是将它们重新排列成这个。

ID   YR     MONTH   DAY   HR  LON         LAT        RESULT1        RESULT2
382  2010   05      23    06  119.464119  15.870264  1.682708e+00   1.213053
382  2010   05      23    12  119.910667  15.874892  1.934127e+00   1.221175 
382  2010   05      23    18  120.368523  16.022879  2.260490e+00   1.227459
382  2010   05      24    00  120.611115  15.788021  2.787007e+00   1.229084
382  2010   05      24    06  121.286072  15.984570  3.253321e+00   1.230381
413  2010   06      10    06  156.424057  5.559299   1.059667e+00   1.578506 
413  2010   06      10    12  153.899506  6.450210   1.150635e+00   1.516614 
413  2010   06      10    18  152.346802  7.281753   1.187466e+00   1.501871

ID 列基于为每个组分配的相应唯一 ID。YR、MONTH、DAY 和 HR 基于输入的第一列。

我将不胜感激任何帮助。谢谢

标签: pythonpython-3.xpandascsv

解决方案


我花了一些时间,我希望它有帮助:)

t=open('your_file.txt').read() #or your_file.csv'
l=t.split('\n')
l=l[3:]
l=[i for i in l if i[:5] not in ('POINT', '')]
d={}
current_key=0
for i in range(len(l)):
  if l[i][:2]=='ID':
    current_key=l[i].split(' ')[2]
    d[current_key]=[]
  else:
    d[current_key].append(l[i])

for i in d:
  for k in range(len(d[i])):
    s=d[i][k]
    s=s.split(' ')
    s=[p for p in s if p not in ('&', '')]
    s=[s[0][:4]]+[s[0][4:6]]+[s[0][6:8]]+[s[0][8:]]+s[1:]
    d[i][k]=[i]+s
    
rows=sum(d.values(), [])   
columns=['ID','YR','MONTH','DAY','HR','LON','LAT','RESULT1','RESULT2']

result=pd.DataFrame(rows, columns=columns)

print(result)

输出:

    ID    YR MONTH DAY  HR         LON        LAT       RESULT1   RESULT2
0  382  2010    05  23  06  119.464119  15.870264  1.682708e+00  1.213053
1  382  2010    05  23  12  119.910667  15.874892  1.934127e+00  1.221175
2  382  2010    05  23  18  120.368523  16.022879  2.260490e+00  1.227459
3  382  2010    05  24  00  120.611115  15.788021  2.787007e+00  1.229084
4  382  2010    05  24  06  121.286072  15.984570  3.253321e+00  1.230381
5  413  2010    06  10  06  156.424057   5.559299  1.059667e+00  1.578506
6  413  2010    06  10  12  153.899506   6.450210  1.150635e+00  1.516614
7  413  2010    06  10  18  152.346802   7.281753  1.187466e+00  1.501871

推荐阅读