Avoiding VisibleDeprecationWarning when splitting nested ragged arrays in a Pandas DataFrame

Problem description

Question

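How can a column containing nested ragged arrays be split into new columns of a Pandas DataFrame without triggering VisibleDeprecationWarning? A generally accepted, straightforward way, or an explanation of why this is currently not possible, would be appreciated.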

Terminology:

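Survey of existing posts

After extensive searching and my own experiments, I could not find a generally accepted, straightforward way. The two most relevant posts on SO at the time of posting are listed below.

These posts are at best loosely related to the question: post1, post2, post3, post4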

Experiments

Sample data and expected output

import pandas as pd

df = pd.DataFrame(
    data={
        "id": ['a', 'b', 'c'],
        "col1": [[[1, 2], [3, 4, 5]],
                 [[6], [7, 8, 9]],
                 [[10, 11, 12], []]
                 ]
    }
)

df
Out[81]: 
  id                 col1
0  a  [[1, 2], [3, 4, 5]]
1  b     [[6], [7, 8, 9]]
2  c   [[10, 11, 12], []]

As can be seen, the two outermost levels of df["col1"] have shape=(3, 2). Expected output:

df  # expected output  
Out[177]: 
  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []

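To save time, you can skip to the last subsection for the approach that works. All relevant strategies I tried are listed below in chronological order.

First attempt

The splitting function here produces a pd.Series of two-element tuples, which is reasonable.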

df["col1"].apply(lambda el: (el[0], el[1]))
Out[82]: 
0    ([1, 2], [3, 4, 5])
1       ([6], [7, 8, 9])
2     ([10, 11, 12], [])
Name: col1, dtype: object

However, assigning the result directly to separate columns raises a ValueError.

df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1]))

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-75-973c44fe294a>", line 1, in <module>
    df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1]))
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3037, in __setitem__
    self._setitem_array(key, value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3072, in _setitem_array
    self.iloc._setitem_with_indexer((slice(None), indexer), value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1755, in _setitem_with_indexer
    "Must have equal len keys and value "
ValueError: Must have equal len keys and value when setting with an iterable

This can be avoided by converting the Series to a list using .tolist().

df["col1"].apply(lambda el: (el[0], el[1])).tolist()
Out[84]: [([1, 2], [3, 4, 5]), ([6], [7, 8, 9]), ([10, 11, 12], [])]

Now the direct assignment works, but a VisibleDeprecationWarning pops up.

df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1])).tolist()

/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)

df  # this is expected
Out[86]: 
  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []
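
The path in the warning shows that it is emitted by NumPy rather than pandas: it fires when an ndarray is built from nested sequences of unequal length without dtype=object. As a minimal sketch of what the message asks for (the names `ragged` and `arr` are illustrative and do not come from the original post), an object array can be pre-allocated and filled one element at a time, so NumPy never has to infer a shape from the ragged nesting:

```python
import numpy as np

# Illustrative ragged data: two inner lists of different lengths.
ragged = [[1, 2], [3, 4, 5]]

# Pre-allocate a 1-D object array, then store each inner list as a single
# element. No shape inference over the nested lists takes place.
arr = np.empty(len(ragged), dtype=object)
for i, item in enumerate(ragged):
    arr[i] = item

print(arr.shape)  # one slot per inner list
print(arr[1])     # the second inner list, stored unchanged
```

In NumPy 1.24 and later the same implicit conversion raises a ValueError instead of a warning, so avoiding it is more than cosmetic.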

list-zip-star approach

This gives either a ValueError or a VisibleDeprecationWarning.

ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(zip(*ls))

df[["sep1", "sep2"]] = unpacked
# same ValueError message as above

df["sep1"] = unpacked[0]
# same VisibleDeprecationWarning message as above

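list-map-list-zip-star approach (works, but...)

Simply adding one more list-map layer does the trick. This time the desired output is finally obtained. But it is counterintuitive in the following ways:

  1. The new columns must be assigned individually. Why can't this be done in one step?
  2. The list-map-list-zip-star idiom is very convoluted.

Am I really supposed to do it this way by design?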

ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(map(list, zip(*ls)))  # a magical spell

df[["sep1", "sep2"]] = unpacked
# same ValueError message. Why?

# set the new columns individually.
df["sep1"] = unpacked[0]
df["sep2"] = unpacked[1]

df  # expected output  
Out[177]: 
  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []
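
The per-column assignment above can also be written without the zip-star idiom. The following is only a sketch of an alternative, not a method from the original post: each new column is built as a plain list of lists with a comprehension and attached with DataFrame.assign, so both columns are added in a single statement.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "id": ["a", "b", "c"],
        "col1": [[[1, 2], [3, 4, 5]],
                 [[6], [7, 8, 9]],
                 [[10, 11, 12], []]],
    }
)

# Each comprehension yields one list per row; pandas stores every inner
# list as a single object cell of the new column.
df = df.assign(
    sep1=[el[0] for el in df["col1"]],
    sep2=[el[1] for el in df["col1"]],
)

print(df)
```

Passing lists rather than tuples matters here: in the experiments above, assigning the tuple `unpacked[0]` produced the warning, while the list version did not.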

Tags: python, pandas

Solution

Why not give DataFrame a try:

df = df.join(pd.DataFrame(
    df["col1"].apply(lambda el: (el[0], el[1])).tolist(),
    index=df.index,
    columns=["sep1", "sep2"],
))
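
For reference, here is a self-contained sketch of the same idea using the sample data from the question. The intermediate names `pairs`, `parts` and `out` are illustrative. The DataFrame constructor treats each tuple in the list as one row, so each inner list ends up as a single object cell, and join then aligns the new columns on the index.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "id": ["a", "b", "c"],
        "col1": [[[1, 2], [3, 4, 5]],
                 [[6], [7, 8, 9]],
                 [[10, 11, 12], []]],
    }
)

# One (first, second) tuple per row of col1.
pairs = [(el[0], el[1]) for el in df["col1"]]

# Each tuple becomes a row; reusing df.index keeps the rows aligned.
parts = pd.DataFrame(pairs, index=df.index, columns=["sep1", "sep2"])

# join returns a new frame; the original df is left untouched.
out = df.join(parts)

print(out)
```

Keeping the result in a separate variable makes it safe to re-run: calling join a second time on a frame that already has sep1 and sep2 would fail because of the overlapping column names.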
