首页 > 解决方案 > 如何获得每家航空公司的前 5 个理由?

问题描述

我只需要每家航空公司的前 5 个理由。我设法获得了所有航空公司的交叉表,但它没有排序,它显示了所有的原因。如何缩小结果范围?

pd.crosstab(df.airline, df.negativereason).apply(lambda x: x, axis=1)

>negativereason Bad Flight  Can't Tell  Cancelled Flight    Customer Service Issue  Damaged Luggage Flight Attendant Complaints Flight Booking Problems Late Flight Lost Luggage    longlines
airline                                     
>American   87  198 246 768 12  87  130 249 149 34
>Delta  64  186 51  199 11  60  44  269 57  14
>Southwest  90  159 162 391 14  38  61  152 90  29
>US Airways 104 246 189 811 11  123 122 453 154 50
>United 216 379 181 681 22  168 144 525 269 48

期望的结果

>American
>Customer Service Issue 768
>Late Flight 249
>Cancelled Flight 246
>Can't Tell 198
>Lost Luggage 149

这是数据集

>tweet_id   airline_sentiment   airline_sentiment_confidence    negativereason  negativereason_confidence   airline airline_sentiment_gold  name    negativereason_gold retweet_count   text    tweet_coord tweet_created   tweet_location  user_timezone
>0  570306133677760513  neutral 1.0000  NaN NaN Virgin America  NaN cairdin NaN 0   @VirginAmerica What @dhepburn said. NaN 2015-02-24 11:35:52 -0800   NaN Eastern Time (US & Canada)
>1  570301130888122368  positive    0.3486  NaN 0.0000  Virgin America  NaN jnardino    NaN 0   @VirginAmerica plus you've added commercials t...   NaN 2015-02-24 11:15:59 -0800   NaN Pacific Time (US & Canada)
>2  570301083672813571  neutral 0.6837  NaN NaN Virgin America  NaN yvonnalynn  NaN 0   @VirginAmerica I didn't today... Must mean I n...   NaN 2015-02-24 11:15:48 -0800   Lets Play   Central Time (US & Canada)
>3  570301031407624196  negative    1.0000  Bad Flight  0.7033  Virgin America  NaN jnardino    NaN 0   @VirginAmerica it's really aggressive to blast...   NaN 2015-02-24 11:15:36 -0800   NaN Pacific Time (US & Canada)
>4  570300817074462722  negative    1.0000  Can't Tell  1.0000  Virgin America  NaN jnardino    NaN 0   @VirginAmerica and it's a really big bad thing...   NaN 2015-02-24 11:14:45 -0800   NaN Pacific Time (US & Canada)

标签: pythonsorting

解决方案


这不是最好的解决方案,但它可以完成工作。

top_n = 5
gb = df.groupby(['airline', 'negativereason']).size().reset_index(name='freq')
df_tops = gb.groupby('airline').apply(lambda x: x.nlargest(top_n, ['freq'])).reset_index(drop=True)

它需要2个步骤。首先是计算每个航空公司每个负面原因的频率,其次是根据频率取 top_n 个原因。


推荐阅读