Pandas data to PyTorch tensor

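Problem description

I am trying to convert a pandas DataFrame to a PyTorch tensor to run an LSTM model, but I keep getting the error below: a ValueError saying that the shape of object type 'Series' could not be determined. The traceback points to the following code: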

class MicroESDataset(Dataset):

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence, label = self.sequences[idx]
        return dict (
            sequence=torch.Tensor(sequence.to_numpy()),
            label = torch.tensor(label).float ()
        )

Am I missing something completely obvious? Thanks.

Here is the exact error message and traceback:

    ValueError                                Traceback (most recent call last)
    <ipython-input-46-fb5c7eb803e1> in <module>()
----> 1 for item in data_module.train_dataloader():
  2   print(item["sequence"].shape)
  3   print(item["label"].shape)
  4   # print(item["label"])
  5   break

    3 frames
/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
427             # have message field
428             raise self.exc_type(message=msg)
--> 429         raise self.exc_type(msg)
  430 
  431 

  ValueError: Caught ValueError in DataLoader worker process 0.
 Original Traceback (most recent call last):
 File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
 File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "<ipython-input-30-36c44aae196d>", line 13, in __getitem__
label = torch.tensor(label).float()

ValueError: could not determine the shape of object type 'Series'

Tags: python, pandas, pytorch

Solution


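Two columns

First of all, idx in a Dataset should refer to a row inside the pd.DataFrame.

The way to get a row is df.iloc[idx] rather than df[idx]. The latter selects the column whose label is idx, which is probably not what you want (if it is, you should transpose your data).

Given that, we can do the following (a dummy pd.DataFrame with only 2 columns, see the code comments):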
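To make the difference concrete, here is a minimal sketch on a hypothetical dummy frame (the column names col1 and col2 are assumptions chosen for illustration). It shows positional row access with iloc succeeding, while plain [] indexing looks for a column labelled 0 and fails:

```python
import pandas as pd

# Hypothetical dummy frame with string column labels
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Positional row access: returns the first row as a Series
row = df.iloc[0]
print(row.tolist())  # [1, 3]

# Plain [] indexing looks up a column label; no column is named 0 here
try:
    df[0]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True
print(column_lookup_failed)  # True
```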

import pandas as pd
import torch


class MicroESDataset(torch.utils.data.Dataset):
    def __init__(self):
        # Dummy sequences dataframe
        self.sequences = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence, label = self.sequences.iloc[idx]
        return dict(
            # torch.tensor infers dtype, torch.Tensor is always float
            sequence=torch.tensor(sequence),
            label=torch.tensor(label).float(),
        )


dataset = MicroESDataset()
print(dataset[0])
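The code comment above contrasts torch.tensor with torch.Tensor. As a small sketch of that point, assuming PyTorch's default dtype has not been changed, the two constructors give different dtypes for the same integer input:

```python
import torch

# torch.tensor infers the dtype from the data (Python ints become int64)
inferred = torch.tensor([1, 2])
# torch.Tensor always builds a tensor of the default floating-point type
always_float = torch.Tensor([1, 2])

print(inferred.dtype)      # torch.int64
print(always_float.dtype)  # torch.float32
```

This is why the label is converted explicitly with .float() when a floating-point target is wanted.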

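More columns

If you have more columns (which is likely, since the Series in the error suggests multiple values), you have to:

  • Get the row first
  • Slice it by the appropriate columns

Given the above, one could do the following (4 columns in this case, the last one being the label, see the code comments):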

class MicroESDataset(torch.utils.data.Dataset):
    def __init__(self):
        # Dummy sequences dataframe
        self.sequences = pd.DataFrame(
            {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6], "col4": [7, 8]}
        )

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # No magic unpacking here: take the whole row first
        row = self.sequences.iloc[idx]
        # Only columns are left now, so we can slice them by position.
        # One could also slice by label (up to "col3"), but positions fit this case better
        sequence, label = row.iloc[:-1], row.iloc[-1]
        return dict(
            # Convert the Series to a NumPy array before building the tensor
            sequence=torch.tensor(sequence.to_numpy()),
            label=torch.tensor(label).float(),
        )


dataset = MicroESDataset()
print(dataset[0])
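To check that such a dataset batches as expected, one could wrap it in a DataLoader, much like the loop in the question. The sketch below is only an illustration under assumptions: the class name RowDataset, the dummy frame, and the choice to cast the features to float32 are not from the original post.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class RowDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, frame):
        self.frame = frame

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        # All columns except the last are features; cast them to float32
        features = row.iloc[:-1].to_numpy(dtype="float32")
        # The last column is the label, turned into a plain Python float
        label = float(row.iloc[-1])
        return dict(sequence=torch.tensor(features), label=torch.tensor(label))


# Dummy data: 2 rows, 3 feature columns and 1 label column
frame = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6], "col4": [7, 8]})
loader = DataLoader(RowDataset(frame), batch_size=2)

batch = next(iter(loader))
print(batch["sequence"].shape)  # torch.Size([2, 3])
print(batch["label"].shape)     # torch.Size([2])
```

The default collate function stacks the per-item dictionaries key by key, so each value in the batch gains a leading batch dimension.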
