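python - How do I get the top 5 reasons for each airline?
Problem description
I only need the top 5 reasons for each airline. I managed to build a crosstab covering all airlines, but it is not sorted and it shows every reason. How can I narrow down the results?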
pd.crosstab(df.airline, df.negativereason).apply(lambda x: x, axis=1)
>negativereason  Bad Flight  Can't Tell  Cancelled Flight  Customer Service Issue  Damaged Luggage  Flight Attendant Complaints  Flight Booking Problems  Late Flight  Lost Luggage  longlines
>airline
>American                87         198               246                     768               12                           87                      130          249           149         34
>Delta                   64         186                51                     199               11                           60                       44          269            57         14
>Southwest               90         159               162                     391               14                           38                       61          152            90         29
>US Airways             104         246               189                     811               11                          123                      122          453           154         50
>United                 216         379               181                     681               22                          168                      144          525           269         48
Desired result
>American
>Customer Service Issue 768
>Late Flight 249
>Cancelled Flight 246
>Can't Tell 198
>Lost Luggage 149
Here is the dataset
>tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
>0 570306133677760513 neutral 1.0000 NaN NaN Virgin America NaN cairdin NaN 0 @VirginAmerica What @dhepburn said. NaN 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada)
>1 570301130888122368 positive 0.3486 NaN 0.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica plus you've added commercials t... NaN 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada)
>2 570301083672813571 neutral 0.6837 NaN NaN Virgin America NaN yvonnalynn NaN 0 @VirginAmerica I didn't today... Must mean I n... NaN 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada)
>3 570301031407624196 negative 1.0000 Bad Flight 0.7033 Virgin America NaN jnardino NaN 0 @VirginAmerica it's really aggressive to blast... NaN 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada)
>4 570300817074462722 negative 1.0000 Can't Tell 1.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica and it's a really big bad thing... NaN 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada)
Solution
This is not the best solution, but it gets the job done.
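top_n = 5
# Step 1: count how often each negative reason occurs for each airline
gb = df.groupby(['airline', 'negativereason']).size().reset_index(name='freq')
# Step 2: within each airline, keep the top_n rows with the largest freq
df_tops = gb.groupby('airline').apply(lambda x: x.nlargest(top_n, ['freq'])).reset_index(drop=True)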
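It takes two steps. The first counts the frequency of each negative reason for each airline; the second keeps the top_n reasons per airline, ranked by that frequency.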
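The two steps can also be tried on a small, self-contained example. The sketch below is not the answer's exact code: it swaps `apply` with `nlargest` for a sort followed by `groupby(...).head(top_n)`, which should give the same kind of result. The toy data and the helper name `top_reasons` are made up for illustration; only the column names `airline` and `negativereason` come from the question.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset): one row per tweet.
df = pd.DataFrame({
    "airline": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "negativereason": ["Late Flight", "Late Flight", "Late Flight",
                       "Lost Luggage", "Lost Luggage", "Bad Flight",
                       "Bad Flight", "Bad Flight", "Late Flight", None],
})


def top_reasons(frame, top_n=5):
    """Hypothetical helper: top_n most frequent reasons per airline."""
    # Step 1: count each (airline, reason) pair. Rows whose reason is
    # missing are left out, because groupby drops missing keys by default.
    counts = (
        frame.groupby(["airline", "negativereason"])
        .size()
        .reset_index(name="freq")
    )
    # Step 2: sort by frequency within each airline (largest first),
    # then keep the first top_n rows of every airline group.
    return (
        counts.sort_values(["airline", "freq"], ascending=[True, False])
        .groupby("airline")
        .head(top_n)
        .reset_index(drop=True)
    )


result = top_reasons(df, top_n=2)
print(result)
```

With the real data, calling `top_reasons(df)` with the default `top_n=5` should return five rows per airline. When two reasons have the same count at the cutoff, which one is kept depends on the sort order, so ties may need explicit handling.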