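python - Converting a pandas DataFrame into an array of fixed-size segments
Question
I am struggling to convert my data frame into a set of fixed-size segments that I should feed to a convolutional neural network. Specifically, I want to convert df into a list of m arrays, each containing one segment of size (1,5,4). So in the end, I would have an array of shape (m,1,5,4).
To clarify my question, I explain it using this MWE. Suppose this is my df: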
import numpy as np
import pandas as pd

df = {
    'id': [1,1,1,1,1,1,1,1,1,1,1,1],
    'speed': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'acc': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'jerk': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'bearing': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label' : [3,3,3,3,3,3,3,3,3,3,3,3] }
df = pd.DataFrame.from_dict(df)
For this, I use the following function:
def df_transformer(dataframe, chunk_size=5):
    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])
    # loop over segments (id)
    for _, group in grouped:
        inputs = group.loc[:, 'speed':'bearing'].values
        label = group.loc[:, 'label'].values[0]
        # calculate number of splits
        N = len(inputs) // chunk_size
        if N > 0:
            inputs = np.array_split(inputs, [chunk_size]*N)
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
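The df above has 12 rows, so if it is converted correctly into the intended form, I should get an array of shape (3,1,5,4). In the function above, segments with fewer than 5 rows are zero-padded so that every segment has shape (1,5,4).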
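The padding step described above can be looked at in isolation. This is only a minimal sketch: the all-ones `short` array is a made-up stand-in for a segment that has 2 rows and 4 feature columns, and the variable names are illustrative rather than taken from the function.

```python
import numpy as np

chunk_size = 5
short = np.ones((2, 4))                  # hypothetical segment: 2 rows, 4 features
n_missing = chunk_size - len(short)      # number of zero rows to append

# Pad only at the end of the row axis; leave the feature axis untouched.
padded = np.pad(short, [(0, n_missing), (0, 0)], mode="constant")

# Add the leading axis so one segment has shape (1, chunk_size, 4).
segment = padded[np.newaxis]
print(segment.shape)
```

Note that `np.pad` rejects negative pad widths, so this only works while the segment has at most `chunk_size` rows.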
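Currently, I have two problems with this function:
- The function only works when I pass fewer than 10 rows of my df.
Like this (the last row is zero-padded, as it should be):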
X , y = df_transformer(df[:9])
X
array([[[[ 1.763e+01, 0.000e+00, 0.000e+00, 2.903e+01],
[ 1.763e+01, -9.000e-02, 1.000e-02, 5.612e+01],
[ 1.700e-01, 1.240e+00, -2.040e+00, 1.849e+01],
[ 1.410e+00, -8.000e-01, 5.100e-01, 1.185e+01],
[ 6.100e-01, -2.900e-01, 1.500e-01, 3.675e+01]]],
[[[ 3.200e-01, -1.400e-01, 3.900e-01, 2.752e+01],
[ 1.800e-01, 2.500e-01, -3.800e-01, 8.108e+01],
[ 4.300e-01, -1.300e-01, 2.900e-01, 5.106e+01],
[ 3.000e-01, 1.600e-01, 1.300e-01, 1.985e+01],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]]]])
But in this case an all-zero array (segment) is introduced:
X , y = df_transformer(df[:10])
X
array([[[[ 1.763e+01, 0.000e+00, 0.000e+00, 2.903e+01],
[ 1.763e+01, -9.000e-02, 1.000e-02, 5.612e+01],
[ 1.700e-01, 1.240e+00, -2.040e+00, 1.849e+01],
[ 1.410e+00, -8.000e-01, 5.100e-01, 1.185e+01],
[ 6.100e-01, -2.900e-01, 1.500e-01, 3.675e+01]]],
[[[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]]],
[[[ 3.200e-01, -1.400e-01, 3.900e-01, 2.752e+01],
[ 1.800e-01, 2.500e-01, -3.800e-01, 8.108e+01],
[ 4.300e-01, -1.300e-01, 2.900e-01, 5.106e+01],
[ 3.000e-01, 1.600e-01, 1.300e-01, 1.985e+01],
[ 4.600e-01, 2.900e-01, -6.700e-01, 1.076e+01]]]])
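- The function fails if I pass the whole df (I don't understand the error, but it seems to be related to the padding of segments with fewer than 5 rows).
So in this case, I get an "index can't contain negative values" error: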
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-1fc559db37eb> in <module>()
----> 1 X , y = df_transformer(df)
2 frames
<ipython-input-4-9e1c49985863> in df_transformer(dataframe, chunk_size)
24 inpt = np.pad(
25 inpt, [(0, chunk_size-len(inpt)),(0, 0)],
---> 26 mode='constant')
27 # add each inputs split to accumulators
28 X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
<__array_function__ internals> in pad(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
746
747 # Broadcast to shape (array.ndim, 2)
--> 748 pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
749
750 if callable(mode):
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
517
518 if as_index and x.min() < 0:
--> 519 raise ValueError("index can't contain negative values")
520
521 # Converting the array with `tolist` seems to improve performance
ValueError: index can't contain negative values
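A likely explanation for both symptoms is that `np.array_split` reads a list passed as its second argument as the indices to cut at, not as the sizes of the pieces. The sketch below uses a stand-in array named `rows` (an assumption, not the real data) with the same 12 x 4 shape to show which chunk lengths result.

```python
import numpy as np

rows = np.arange(12 * 4).reshape(12, 4)   # stand-in for the 12 x 4 feature matrix
chunk_size = 5
N = len(rows) // chunk_size               # 2 full chunks

# [5, 5] means "cut at row 5, then cut at row 5 again":
# the pieces are rows[:5], rows[5:5] (empty) and rows[5:].
bad = np.array_split(rows, [chunk_size] * N)
bad_lengths = [len(chunk) for chunk in bad]

# Cutting at multiples of chunk_size gives the intended pieces instead.
cut_points = list(range(chunk_size, len(rows), chunk_size))   # [5, 10]
good = np.array_split(rows, cut_points)
good_lengths = [len(chunk) for chunk in good]

print(bad_lengths, good_lengths)
```

With 12 rows the last "bad" piece has 7 rows, so `chunk_size - len(inpt)` is negative and `np.pad` raises. With 10 rows the empty middle piece is padded into a segment made entirely of zeros.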
Expected output:
X , y = df_transformer(df)
X
array([[[[ 1.763e+01, 0.000e+00, 0.000e+00, 2.903e+01],
[ 1.763e+01, -9.000e-02, 1.000e-02, 5.612e+01],
[ 1.700e-01, 1.240e+00, -2.040e+00, 1.849e+01],
[ 1.410e+00, -8.000e-01, 5.100e-01, 1.185e+01],
[ 6.100e-01, -2.900e-01, 1.500e-01, 3.675e+01]]],
[[[ 3.200e-01, -1.400e-01, 3.900e-01, 2.752e+01],
[ 1.800e-01, 2.500e-01, -3.800e-01, 8.108e+01],
[ 4.300e-01, -1.300e-01, 2.900e-01, 5.106e+01],
[ 3.000e-01, 1.600e-01, 1.300e-01, 1.985e+01],
[ 4.600e-01, 2.900e-01, -6.700e-01, 1.076e+01]]],
[[[ 7.500e-01, -3.800e-01, 6.500e-01, 1.451e+01],
[ 3.700e-01, 2.700e-01, 5.200e-01, 2.427e+01],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]]]])
Can someone help me fix this? The MWE above reproduces the error reliably.
Edit
RichieV's answer also has a bug. Although it works for the given MWE, it does not do the right thing in the case below (df extended to twice its size):
df = {
    'id': [1]*12+[2]*12,
    'speed': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37]*2,
    'acc': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27]*2,
    'jerk': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52]*2,
    'bearing': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27]*2,
    'label' : [3,3,3,3,3,3,3,3,3,3,3,3]*2 }
df = pd.DataFrame.from_dict(df)
X, y = df_transformer(df, chunk_size=5)
print(X[:3])
[[[[ 1.763e+01 0.000e+00 0.000e+00 2.903e+01]
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 3.700e-01 2.700e-01 5.200e-01 2.427e+01]]]
[[[ 7.500e-01 -3.800e-01 6.500e-01 1.451e+01]
[ 3.000e-01 1.600e-01 1.300e-01 1.985e+01]
[ 4.600e-01 2.900e-01 -6.700e-01 1.076e+01]
[ 1.800e-01 2.500e-01 -3.800e-01 8.108e+01]
[ 3.200e-01 -1.400e-01 3.900e-01 2.752e+01]]]
[[[ 6.100e-01 -2.900e-01 1.500e-01 3.675e+01]
[ 1.410e+00 -8.000e-01 5.100e-01 1.185e+01]
[ 1.700e-01 1.240e+00 -2.040e+00 1.849e+01]
[ 1.763e+01 -9.000e-02 1.000e-02 5.612e+01]
[ 4.300e-01 -1.300e-01 2.900e-01 5.106e+01]]]]
Note that the first element differs from the one in the answer (its 2nd, 3rd, and 4th rows are all zeros).
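One plausible cause of the scrambled rows, offered as an assumption and not a confirmed diagnosis, is sorting by id with a sort that is not stable, because an unstable sort may reorder rows that share the same id. The sketch below uses a small made-up frame (the `step` column and the `-1` padding marker are hypothetical) to show that `kind="mergesort"` keeps the original row order within each id and leaves the appended padding rows at the end of their group.

```python
import pandas as pd

# Hypothetical data: "step" records the original order of rows within each id.
frame = pd.DataFrame({"id": [2, 1, 2, 1, 1], "step": [0, 0, 1, 1, 2]})
padding = pd.DataFrame({"id": [1, 2], "step": [-1, -1]})   # -1 marks padding rows

stacked = pd.concat([frame, padding])

# mergesort is stable: rows with equal id keep their relative order.
stable = stacked.sort_values("id", kind="mergesort", ignore_index=True)

steps_id1 = stable.loc[stable["id"] == 1, "step"].tolist()
steps_id2 = stable.loc[stable["id"] == 2, "step"].tolist()
print(steps_id1, steps_id2)
```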
Answer
You can pad the df once, instead of padding on every iteration.
Take this data, which includes a second id
df = {
    'id': [1,1,1,1,1,1,1,1,1,2,2,2],
    'speed': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'acc': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'jerk': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'bearing': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label' : [3,3,3,3,3,3,3,3,3,3,3,3] }
df = pd.DataFrame.from_dict(df)
print(df)
    id  speed   acc  jerk  bearing  label
0    1  17.63  0.00  0.00    29.03      3
1    1  17.63 -0.09  0.01    56.12      3
2    1   0.17  1.24 -2.04    18.49      3
3    1   1.41 -0.80  0.51    11.85      3
4    1   0.61 -0.29  0.15    36.75      3
5    1   0.32 -0.14  0.39    27.52      3
6    1   0.18  0.25 -0.38    81.08      3
7    1   0.43 -0.13  0.29    51.06      3
8    1   0.30  0.16  0.13    19.85      3
9    2   0.46  0.29 -0.67    10.76      3
10   2   0.75 -0.38  0.65    14.51      3
11   2   0.37  0.27  0.52    24.27      3
and the code
def df_transformer(df, chunk_size=5):
    # pad df with zero rows so each id's row count is an exact multiple of chunk_size
    df = pd.concat([df,
        pd.DataFrame([[seg_id] + [0] * 5                        # add row with zeros
            for seg_id, ct in df.groupby('id').size().items()   # for each id
            for row in range(-ct % chunk_size)]                 # as many times as needed (none if already a multiple)
            , columns=df.columns)
        ]).sort_values('id', kind='mergesort', ignore_index=True)  # stable sort keeps row order within each id
    # print(df)
    X, y = [], []
    # after padding, every run of chunk_size consecutive rows belongs to a single id
    for _, group in df.groupby(df.index // chunk_size):
        X.append(group.iloc[:, 1:-1].values[np.newaxis, ...])
        y.append(group.iloc[0, -1]) # not sure how you want y to be structured
    return np.array(X), np.array(y)

X, y = df_transformer(df, chunk_size=5)
print(X)
Output
[[[[ 1.763e+01 0.000e+00 0.000e+00 2.903e+01]
[ 1.763e+01 -9.000e-02 1.000e-02 5.612e+01]
[ 1.700e-01 1.240e+00 -2.040e+00 1.849e+01]
[ 1.410e+00 -8.000e-01 5.100e-01 1.185e+01]
[ 6.100e-01 -2.900e-01 1.500e-01 3.675e+01]]]
[[[ 3.200e-01 -1.400e-01 3.900e-01 2.752e+01]
[ 1.800e-01 2.500e-01 -3.800e-01 8.108e+01]
[ 4.300e-01 -1.300e-01 2.900e-01 5.106e+01]
[ 3.000e-01 1.600e-01 1.300e-01 1.985e+01]
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00]]]
[[[ 4.600e-01 2.900e-01 -6.700e-01 1.076e+01]
[ 7.500e-01 -3.800e-01 6.500e-01 1.451e+01]
[ 3.700e-01 2.700e-01 5.200e-01 2.427e+01]
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00]]]]
Note that the first two segments come from id==1 and the last one from id==2, each with its own zero padding.
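As an alternative to padding the DataFrame itself, each group can be padded and reshaped directly in NumPy. This is only a sketch under assumptions: `segment_groups` and its `feature_cols` parameter are hypothetical names, the column layout is taken from the question, and the `demo` frame is made-up data, not the original values.

```python
import numpy as np
import pandas as pd

def segment_groups(df, chunk_size=5, feature_cols=("speed", "acc", "jerk", "bearing")):
    """Hypothetical helper: zero-pad each id group, then cut it into fixed-size chunks."""
    X, y = [], []
    for _, group in df.groupby("id", sort=False):
        values = group[list(feature_cols)].to_numpy(dtype=float)
        n_pad = -len(values) % chunk_size          # 0 when already a multiple
        padded = np.pad(values, [(0, n_pad), (0, 0)], mode="constant")
        # (rows, features) -> (n_chunks, 1, chunk_size, features)
        chunks = padded.reshape(-1, 1, chunk_size, len(feature_cols))
        X.append(chunks)
        # one label per chunk, taken from the first row of the group
        y.extend([group["label"].iloc[0]] * len(chunks))
    return np.concatenate(X, axis=0), np.array(y)

# Made-up demo data with the same column layout: one id, 12 rows.
speed = np.arange(1.0, 13.0)
demo = pd.DataFrame({
    "id": [1] * 12,
    "speed": speed,
    "acc": speed * 0.1,
    "jerk": speed * 0.01,
    "bearing": speed * 10,
    "label": [3] * 12,
})
X, y = segment_groups(demo)
print(X.shape, y)
```

Because padding happens per group, no sort is needed and rows from different ids never share a segment. The sketch assumes a non-empty df, since `np.concatenate` fails on an empty list.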