Removing values in one column based on the value of another column with pandas

Problem description

Suppose I have a DataFrame like this:

       full_path             name             created           modified
0    C:\T1\1.txt            1.txt            14:04:30             NaN
1    C:\T1\1.txt            1.txt              NaN              14:04:30
2    C:\T1\T2\1.txt         1.txt            14:10:30              NaN
3    C:\T1\T2\1.txt         1.txt              NaN              14:10:30
4    C:\T1\T2\T3\1.txt      1.txt            14:15:30             NaN
5    C:\T1\T2\T3\1.txt      1.txt              NaN              14:15:30
6    C:\T1\T2\T3\T4\1.txt   1.txt            14:20:30             NaN

I use this code to build and reshape the DataFrame:

from pathlib import PurePath
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'full_path': {0: 'C:\\T1\\1.txt', 1: 'C:\\T1\\1.txt',
                  2: 'C:\\T1\\T2\\1.txt', 3: 'C:\\T1\\T2\\1.txt',
                  4: 'C:\\T1\\T2\\T3\\1.txt',
                  5: 'C:\\T1\\T2\\T3\\1.txt',
                  6: 'C:\\T1\\T2\\T3\\T4\\1.txt'},
    'name': {0: '1.txt', 1: '1.txt', 2: '1.txt', 3: '1.txt',
             4: '1.txt', 5: '1.txt', 6: '1.txt'},
    'created': {0: '14:04:30', 1: np.nan, 2: '14:10:30', 3: np.nan,
                4: '14:15:30', 5: np.nan, 6: '14:20:30'},
    'modified': {0: np.nan, 1: '14:04:30', 2: np.nan, 3: '14:10:30',
                 4: np.nan, 5: '14:15:30', 6: np.nan}
})

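# Name of the folder that directly contains the file, e.g. 'T2'.
df['folder'] = df['full_path'].apply(lambda x: PurePath(x).parent.name)
g = df.groupby('name')
# Use each file's last recorded path as its full_path.
df['full_path'] = g['full_path'].transform('last')
# One timestamp per row: 'created' where present, otherwise 'modified'.
df['c_m'] = df['created'].combine_first(df['modified'])
index_cols = ['full_path', 'name']
# One column per folder, holding the first timestamp seen in that folder.
df = df.pivot_table(index=index_cols,
                    columns='folder',
                    values='c_m',
                    aggfunc='first')
summary_cols = ['created', 'modified']
# Attach the first 'created' and the last 'modified' timestamp of each file.
df = df.reset_index() \
    .merge(g[summary_cols].agg({'created': 'first', 'modified': 'last'}),
           on='name')
# Column order: index columns, summary columns, then the folder columns.
df = df[[*index_cols,
         *summary_cols,
         *df.columns.difference(summary_cols + index_cols)]] \
    .rename_axis(None, axis=1)
print(df)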

This is the output DataFrame:

   full_path          name  created modified     T1       T2       T3       T4
C:\T1\T2\T3\T4\1.txt 1.txt 14:04:30 14:20:30 14:04:30 14:10:30 14:15:30 14:20:30

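What I want is this: if, for example, the file 1.txt is moved back to folder T3, the timestamp in column T4 should be removed. So, if I have a DataFrame like this: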

       full_path             name             created           modified
0    C:\T1\1.txt            1.txt            14:04:30             NaN
1    C:\T1\1.txt            1.txt              NaN              14:04:30
2    C:\T1\T2\1.txt         1.txt            14:10:30              NaN
3    C:\T1\T2\1.txt         1.txt              NaN              14:10:30
4    C:\T1\T2\T3\1.txt      1.txt            14:15:30             NaN
5    C:\T1\T2\T3\1.txt      1.txt              NaN              14:15:30
6    C:\T1\T2\T3\T4\1.txt   1.txt            14:20:30             NaN
7    C:\T1\T2\T3\1.txt      1.txt            14:30:30             NaN

I would like the output DataFrame to look like this:

   full_path          name  created modified     T1       T2       T3      T4
C:\T1\T2\T3\T4\1.txt 1.txt 14:04:30 14:20:30 14:04:30 14:10:30 14:30:30   NaN

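How can I modify the code to get this result? The file was in folder T4, where I recorded a timestamp, but it was then moved back to T3, so I also want to remove the timestamp in T4, because the file is no longer there.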

Tags: python, python-3.x, pandas, dataframe, python-requests

Solution
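No answer was recorded for this question, so what follows is only a minimal sketch of one possible approach, not a confirmed solution. The idea is to pivot with aggfunc='last' so that each folder column holds its most recent timestamp, and then blank out every folder column that is not on the file's final path. It makes several assumptions: rows are in chronological order, folders can be identified by name alone (two different folders with the same name would be confused), and PureWindowsPath is used in place of PurePath so that the backslash paths are also parsed correctly off Windows. The names events, wide and folders_on_path are made up for this illustration.

```python
from pathlib import PureWindowsPath

import numpy as np
import pandas as pd

# Event log from the question, including the move back to T3 (last row).
events = pd.DataFrame({
    'full_path': ['C:\\T1\\1.txt', 'C:\\T1\\1.txt',
                  'C:\\T1\\T2\\1.txt', 'C:\\T1\\T2\\1.txt',
                  'C:\\T1\\T2\\T3\\1.txt', 'C:\\T1\\T2\\T3\\1.txt',
                  'C:\\T1\\T2\\T3\\T4\\1.txt', 'C:\\T1\\T2\\T3\\1.txt'],
    'name': ['1.txt'] * 8,
    'created': ['14:04:30', np.nan, '14:10:30', np.nan,
                '14:15:30', np.nan, '14:20:30', '14:30:30'],
    'modified': [np.nan, '14:04:30', np.nan, '14:10:30',
                 np.nan, '14:15:30', np.nan, np.nan],
})

# Folder that directly contains the file, and one timestamp per row.
events['folder'] = events['full_path'].map(
    lambda p: PureWindowsPath(p).parent.name)
events['c_m'] = events['created'].combine_first(events['modified'])

# One column per folder, holding the latest timestamp seen in that folder
# (assumes the rows are already sorted chronologically).
wide = events.pivot_table(index='name', columns='folder',
                          values='c_m', aggfunc='last')


def folders_on_path(path):
    """Hypothetical helper: names of all folders above the file in `path`."""
    return {p.name for p in PureWindowsPath(path).parents if p.name}


# Last recorded location of each file.
final_path = events.groupby('name')['full_path'].last()

# Blank out folder columns the file has since left.
for name, path in final_path.items():
    keep = folders_on_path(path)
    stale = [col for col in wide.columns if col not in keep]
    if stale:
        wide.loc[name, stale] = np.nan

print(wide)
```

With the sample data, T4 ends up as NaN and T3 holds 14:30:30, while T1 and T2 keep their timestamps. The created/modified summary merge from the question is left out here; it does not depend on the folder columns and could be applied afterwards in the same way.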

