Slicing a hierarchical DataFrame in Pandas

Problem description

I have a hierarchical Excel sheet that looks like this:

Df
lev1    lev2   lev3    lev4   lev5   description
RD21    Nan    Nan     Nan    Nan    Oil
Nan     RD32   Nan     Nan    Nan    Oil/Canola
Nan     Nan    RD33    Nan    Nan    Oil/Canola/Wheat
Nan     Nan    RD34    Nan    Nan    Oil/Canola/Flour
Nan     Nan    Nan     RD55   Nan    Oil/Canola/Flour/Thick
ED54    Nan    Nan     Nan    Nan    Rice
Nan     ED66   Nan     Nan    Nan    Rice/White
Nan     Nan    ED88    Nan    Nan    Rice/White/Jasmine
Nan     Nan    ED89    Nan    Nan    Rice/White/Basmati
Nan     ED68   Nan     Nan    Nan    Rice/Brown

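I want to get the codes at every level of the hierarchy for whatever I select from the "description" column. Eg1: if I search for "Brown" in the description, it should give me something like this: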

ED54: Rice
ED68: Rice/Brown

Eg2: if I search for "Thick" in the description column, it should give me something like this:

RD21: Oil
RD32: Oil/Canola
RD34: Oil/Canola/Flour
RD55: Oil/Canola/Flour/Thick

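Searching for a word is easy with Df["description"].str.contains(word), and I can also use a regular expression to find a specific pattern if needed. But how do we get the codes that are hierarchically related to that word?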
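For reference, the search step mentioned above might look like the following sketch. It uses only the description values from the sample table; the names `descriptions`, `word`, `mask` and `matches` are illustrative and not part of the original question.

```python
import pandas as pd

# Sample descriptions copied from the table in the question.
descriptions = pd.Series([
    "Oil", "Oil/Canola", "Oil/Canola/Wheat", "Oil/Canola/Flour",
    "Oil/Canola/Flour/Thick", "Rice", "Rice/White",
    "Rice/White/Jasmine", "Rice/White/Basmati", "Rice/Brown",
], name="description")

word = "Brown"  # illustrative search term

# str.contains returns a boolean mask, not the matching strings.
# regex=False treats `word` literally; case=False ignores case.
mask = descriptions.str.contains(word, case=False, regex=False)

# Index the Series with the mask to get the matching descriptions.
matches = descriptions[mask].tolist()
print(matches)  # ['Rice/Brown']
```

Note that the mask only identifies the matching row; the ancestor codes (ED54 for "Rice") still have to be recovered separately, which is the actual question.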

Tags: python, pandas, dataframe

Solution


First, pair each row's code (taken from columns lev1 to lev5) with its split description:

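# For each row, the depth of the description path ('Oil/Canola' -> 2)
# gives the position of the level column that holds the row's code.
vv = df.apply(
    lambda x: (
        x.iloc[len(x.description.split('/'))-1],  # code, e.g. 'RD32'
        x.description.split('/')                  # path, e.g. ['Oil', 'Canola']
    ), axis=1
).values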

vv looks like:

array([('RD21', ['Oil']), ('RD32', ['Oil', 'Canola']),
       ('RD33', ['Oil', 'Canola', 'Wheat']),
       ('RD34', ['Oil', 'Canola', 'Flour']),
       ('RD55', ['Oil', 'Canola', 'Flour', 'Thick']), ('ED54', ['Rice']),
       ('ED66', ['Rice', 'White']),
       ('ED88', ['Rice', 'White', 'Jasmine']),
       ('ED89', ['Rice', 'White', 'Basmati']),
       ('ED68', ['Rice', 'Brown'])], dtype=object)
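To try this step without the Excel file, the following sketch rebuilds the sample table in memory and derives the same (code, path) pairs. It assumes the column order lev1 to lev5 followed by description, as in the question; the names `rows`, `level_cols` and `records` are illustrative, and a list comprehension over `iterrows()` stands in for `df.apply`.

```python
import numpy as np
import pandas as pd

level_cols = ["lev1", "lev2", "lev3", "lev4", "lev5"]

# (level position, code, description) for each row of the question's table.
rows = [
    (0, "RD21", "Oil"),
    (1, "RD32", "Oil/Canola"),
    (2, "RD33", "Oil/Canola/Wheat"),
    (2, "RD34", "Oil/Canola/Flour"),
    (3, "RD55", "Oil/Canola/Flour/Thick"),
    (0, "ED54", "Rice"),
    (1, "ED66", "Rice/White"),
    (2, "ED88", "Rice/White/Jasmine"),
    (2, "ED89", "Rice/White/Basmati"),
    (1, "ED68", "Rice/Brown"),
]

# Build one record per row: NaN in every level column except the one
# that holds the code.
records = []
for pos, code, desc in rows:
    record = {col: np.nan for col in level_cols}
    record[level_cols[pos]] = code
    record["description"] = desc
    records.append(record)
df = pd.DataFrame(records, columns=level_cols + ["description"])

# The depth of the description path selects the level column
# that holds the row's code.
vv = [
    (row.iloc[len(row["description"].split("/")) - 1],
     row["description"].split("/"))
    for _, row in df.iterrows()
]
print(vv[4])  # ('RD55', ['Oil', 'Canola', 'Flour', 'Thick'])
```

This relies on each code sitting in the level column that matches the depth of its description; a row that breaks this convention would yield NaN instead of a code.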

Then create the hierarchical dictionary from vv:

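d = {}
for i in vv:
    v = i[0] # code, e.g. 'RD33'
    k = i[1] # path, e.g. ['Oil', 'Canola', 'Wheat']

    # walk down to the parent node, creating missing levels on the way
    f_d = d
    for j in k[:-1]:
        f_d = f_d.setdefault(j, {})
    # store the code on the last node without discarding existing children
    f_d.setdefault(k[-1], {})['_value'] = v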

d looks like:

{'Oil': {'_value': 'RD21',
  'Canola': {'_value': 'RD32',
   'Wheat': {'_value': 'RD33'},
   'Flour': {'_value': 'RD34', 'Thick': {'_value': 'RD55'}}}},
 'Rice': {'_value': 'ED54',
  'White': {'_value': 'ED66',
   'Jasmine': {'_value': 'ED88'},
   'Basmati': {'_value': 'ED89'}},
  'Brown': {'_value': 'ED68'}}}

Then suppose you search for the word with Df["description"].str.contains(word) (or a regular expression), and the matching description is:

'Oil/Canola/Flour/Thick'

You can then get the result as follows:

desc_split = 'Oil/Canola/Flour/Thick'.split('/')
res = []
for i in range(len(desc_split)):
    all_keys = desc_split[:i+1]
    d2 = d
    for k in all_keys:
        d2 = d2[k]
    res.append(f"{d2['_value']}: {'/'.join(all_keys)}")

res looks like:

['RD21: Oil',
 'RD32: Oil/Canola',
 'RD34: Oil/Canola/Flour',
 'RD55: Oil/Canola/Flour/Thick']
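The steps above can also be combined into a single helper. The sketch below is an alternative, not the answer's method: it swaps the nested dict for a flat path-to-code lookup and uses plain Python instead of pandas. It assumes every ancestor path has its own row in the data; `pairs`, `path_to_code` and `codes_for` are hypothetical names.

```python
# (code, description) pairs, e.g. derived from the vv array above.
pairs = [
    ("RD21", "Oil"),
    ("RD32", "Oil/Canola"),
    ("RD33", "Oil/Canola/Wheat"),
    ("RD34", "Oil/Canola/Flour"),
    ("RD55", "Oil/Canola/Flour/Thick"),
    ("ED54", "Rice"),
    ("ED66", "Rice/White"),
    ("ED88", "Rice/White/Jasmine"),
    ("ED89", "Rice/White/Basmati"),
    ("ED68", "Rice/Brown"),
]

# Flat lookup from the full path to its code.
path_to_code = {desc: code for code, desc in pairs}


def codes_for(word):
    """Return 'code: path' lines for each matching description and its ancestors."""
    lines = []
    for _, desc in pairs:
        # Case-insensitive substring match on the description.
        if word.lower() not in desc.lower():
            continue
        parts = desc.split("/")
        # Emit every prefix of the path, from the root down to the match.
        for depth in range(1, len(parts) + 1):
            path = "/".join(parts[:depth])
            line = f"{path_to_code[path]}: {path}"
            if line not in lines:  # skip ancestors shared between matches
                lines.append(line)
    return lines


print(codes_for("Brown"))  # ['ED54: Rice', 'ED68: Rice/Brown']
print(codes_for("Thick"))
```

Because each description contains its full path, searching for an ancestor word such as "Flour" also matches its descendants, so their codes appear in the result as well.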
