How to summarize a pandas DataFrame

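Problem description

I have a pandas DataFrame containing ~20,xxx records of bus boarding data. The dataset contains a cardNumber field that is unique to each passenger. A type field identifies the boarding type. A routeName column specifies which route the boarding occurred on, and finally a Date column identifies when the boarding took place. I have provided a mock DataFrame below.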

import pandas as pd

df = pd.DataFrame(
    {'cardNumber': ['999', '999', '999', '999', '901', '901', '888', '888'],
     'type': ['trip_pass', 'transfer', 'trip_pass', 'transfer', 'stored_value', 'transfer', 'trip_pass', 
              'trip_pass'],
     'routeName': ['1', '2', '2', '1', '20', '3', '4', '4'],
     'Date': ['2020-08-01 06:18:56 -04:00', '2020-08-01 06:46:12 -04:00', '2020-08-01 17:13:51 -04:00',
              '2020-08-01 17:47:32 -04:00', '2020-08-10 15:23:16 -04:00', '2020-08-10 15:44:45 -04:00',
              '2020-08-31 06:54:09 -04:00', '2020-08-31 16:23:41 -04:00']}
)
df['Date'] = pd.to_datetime(df['Date'])

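What I want to do is summarize the transfer activity: on average, how many transfers occur from Route 1 to Route 2, or from Route 2 to Route 1? There are 11 different routes in the dataset between which transfers can occur.

I would like the output to look like this (note that the output below was not generated from the sample provided above):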

From   |   To     |   Avg. Daily
----------------------------------
 1     |   2      |     45.7
 1     |   3      |     22.6
 20    |   1      |     12.2 

Tags: python, pandas, dataframe

Solution


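The following code works for the sample data you provided. If it does not work on your real data, let me know. There may be better ways to do this, but I think it is a good starting point.

The general idea here is to group by passenger to find the routes. Then, since you want a daily average, you need to group by date and then by origin and destination to calculate the daily average.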

# Define a function to get routes' relationship (origin vs destination)
def get_routes(x):
    if 'transfer' not in x.type.tolist(): # if no 'transfer' type in group, leave it as 0 (we'll remove them afterwards)
        return 0
    x = x[x.type == 'transfer'] # select target type
    date = x.Date.dt.strftime('%m/%d/%Y').unique() # dates of this passenger's transfers (the group, not a hardcoded card)
    if date.size == 1: # if there is more than one date per passenger, you'll need to change this code
        date = date[0]
    else:
        raise Exception("There is more than one date per passenger, please adapt your code.")
    s_from = x.routeName[x.Date.idxmin()] # get route from the first date
    s_to = x.routeName[x.Date.idxmax()] # get route from the last date
    return date, s_from, s_to

# Define a function to get the routes' daily average
def get_daily_avg(date_group):
    daily_avg = (
        date_group.groupby(['From', 'To'], as_index=False) # group the day by routes
        .apply(lambda route: route.shape[0] / date_group.shape[0]) # divide the total of trips of that route by the total trips of that day
    )
    return daily_avg

# Get route's relationship
routes_series = df.groupby('cardNumber').apply(get_routes) # retrieve routes per passenger
routes_series = routes_series[routes_series!=0] # remove groups without the target type

# Create a named dataframe from the series output
routes_df = pd.DataFrame(routes_series.tolist(), columns=['Date', 'From', 'To'])

# Create dataframe, perform filter and calculations
daily_routes_df = (
    routes_df.query('From != To') # remove routes with same destination as the origin
    .groupby('Date').apply(get_daily_avg) # calculate the mean per date
    .rename(columns={None: 'Avg. Daily'}) # set name to previous output
    .drop(['From','To'], axis = 1) # drop out redundant info since there's such info at the index
    .reset_index() # remove MultiIndex to get a tidy dataframe
)

# Visualize results
print(daily_routes_df)

Output:

         Date From To  Avg. Daily
0  08/01/2020    2  1         1.0

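Here the average is 1 because each group has only one count. Note that only the 'transfer' type has been taken into account. Passengers with no transfer, or whose route did not change, are dropped along the way.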
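
For real data, where a passenger may ride on more than one day, a variation on the same idea may be easier to adapt. The sketch below is not the approach used above but an alternative under two stated assumptions: (1) the route a transfer comes "from" is the passenger's previous boarding on the same day, and (2) "Avg. Daily" means the total number of transfers for a route pair divided by the number of distinct days in the data. The helper columns day and n are names introduced here for illustration. Note that get_daily_avg above returns each pair's share of that day's transfers, so the numbers from the two approaches are not directly comparable.

```python
import pandas as pd

# Mock data copied from the question.
df = pd.DataFrame(
    {'cardNumber': ['999', '999', '999', '999', '901', '901', '888', '888'],
     'type': ['trip_pass', 'transfer', 'trip_pass', 'transfer',
              'stored_value', 'transfer', 'trip_pass', 'trip_pass'],
     'routeName': ['1', '2', '2', '1', '20', '3', '4', '4'],
     'Date': ['2020-08-01 06:18:56 -04:00', '2020-08-01 06:46:12 -04:00',
              '2020-08-01 17:13:51 -04:00', '2020-08-01 17:47:32 -04:00',
              '2020-08-10 15:23:16 -04:00', '2020-08-10 15:44:45 -04:00',
              '2020-08-31 06:54:09 -04:00', '2020-08-31 16:23:41 -04:00']}
)
df['Date'] = pd.to_datetime(df['Date'])

# Put each passenger's boardings in chronological order.
df = df.sort_values(['cardNumber', 'Date'])

# Calendar day of each boarding (in the timestamps' own UTC offset).
df['day'] = df['Date'].dt.date

# Assumption (1): "From" is the previous route boarded by the same card on the same day.
df['From'] = df.groupby(['cardNumber', 'day'])['routeName'].shift()

# Keep transfers that have a known origin and actually change route.
is_transfer = (df['type'] == 'transfer') & df['From'].notna() & (df['From'] != df['routeName'])
transfers = df[is_transfer].rename(columns={'routeName': 'To'})

# Number of transfers per day and route pair.
daily_counts = transfers.groupby(['day', 'From', 'To']).size().reset_index(name='n')

# Assumption (2): average over every day present in the data,
# so days without a given transfer count as zero for that pair.
n_days = df['day'].nunique()
avg_daily = (
    daily_counts.groupby(['From', 'To'])['n'].sum()
    .div(n_days)
    .rename('Avg. Daily')
    .reset_index()
)

print(avg_daily)
```

On the mock data this yields three route pairs (1 to 2, 2 to 1, and 20 to 3), each with one transfer spread over the three days present, so each average is one third. If days with no boardings at all should also count, n_days would need to come from the calendar range rather than from the data.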

