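python - Python + Pandas + data visualization: how to get the percentage for each row and visualize categorical data?
Problem description
I am doing exploratory data analysis on a loan prediction dataset (a Pandas dataframe). The dataframe has two columns: Property_Area, whose values fall into three types (Rural, Urban, Semiurban), and Loan_Status, whose values fall into two types (Y, N). I want to plot a chart with Property_Area along the X axis and, for each of the three areas, the percentage of loans accepted or rejected along the Y axis. How can I do that?
Here is a sample of my data: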
import pandas as pd

data = pd.DataFrame({'Loan_Status': ['N','Y','Y','Y','Y','N','N','Y','N','Y','N'],
                     'Property_Area': ['Rural', 'Urban','Urban','Urban','Urban','Urban',
                                       'Semiurban','Urban','Semiurban','Rural','Semiurban']})
I tried this:
status = data['Loan_Status']
index = data['Property_Area']
df = pd.DataFrame({'Loan Status' : status}, index=index)
ax = df.plot.bar(rot=0)
data is the dataframe for the original dataset
Edit: I was able to do what I wanted, but I had to write a very long piece of code to get there:
new_data = data[['Property_Area', 'Loan_Status']].copy()
count_rural_y = new_data[(new_data.Property_Area == 'Rural') & (data.Loan_Status == 'Y') ].count()
count_rural = new_data[(new_data.Property_Area == 'Rural')].count()
#print(count_rural[0])
#print(count_rural_y[0])
rural_y_percent = (count_rural_y[0]/count_rural[0])*100
#print(rural_y_percent)
#print("-"*50)
count_urban_y = new_data[(new_data.Property_Area == 'Urban') & (data.Loan_Status == 'Y') ].count()
count_urban = new_data[(new_data.Property_Area == 'Urban')].count()
#print(count_urban[0])
#print(count_urban_y[0])
urban_y_percent = (count_urban_y[0]/count_urban[0])*100
#print(urban_y_percent)
#print("-"*50)
count_semiurban_y = new_data[(new_data.Property_Area == 'Semiurban') & (data.Loan_Status == 'Y') ].count()
count_semiurban = new_data[(new_data.Property_Area == 'Semiurban')].count()
#print(count_semiurban[0])
#print(count_semiurban_y[0])
semiurban_y_percent = (count_semiurban_y[0]/count_semiurban[0])*100
#print(semiurban_y_percent)
#print("-"*50)
objects = ('Rural', 'Urban', 'Semiurban')
y_pos = np.arange(len(objects))
performance = [rural_y_percent,urban_y_percent,semiurban_y_percent]
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Loan Approval Percentage')
plt.title('Area Wise Loan Approval Percentage')
plt.show()
Output: (bar chart image not included here)
If possible, could you suggest a simpler way to do this?
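Solution
Pandas crosstabs make this simple with normalize.
An easy way to take 2+ columns and get the percentage for each row in a pandas dataframe is to use the pandas crosstab function with normalize = 'index', which divides every count by its row total so that each row sums to 1.
Here is how the crosstab call would look: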
# Crosstab with "normalize = 'index'".
df_percent = pd.crosstab(data.Property_Area,data.Loan_Status,
normalize = 'index').rename_axis(None)
# Multiply all percentages by 100 for graphing.
df_percent *= 100
This outputs df_percent as follows:
Loan_Status          N          Y
Rural        50.000000  50.000000
Semiurban    66.666667  33.333333
Urban        16.666667  83.333333
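As a cross-check on the crosstab result, the same row percentages can be derived with groupby and value_counts(normalize=True). This is only a sketch of an alternative route, not part of the original answer; the name alt_percent is made up for illustration, and the data is the answer's sample dataframe.

```python
import pandas as pd

# Sample data from the answer (the last Loan_Status value is 'Y').
data = pd.DataFrame({
    'Loan_Status': ['N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'Y'],
    'Property_Area': ['Rural', 'Urban', 'Urban', 'Urban', 'Urban', 'Urban',
                      'Semiurban', 'Urban', 'Semiurban', 'Rural', 'Semiurban'],
})

# Share of each Loan_Status within every Property_Area, as percentages.
# alt_percent is a hypothetical name used only in this sketch.
alt_percent = (
    data.groupby('Property_Area')['Loan_Status']
        .value_counts(normalize=True)   # fractions within each area
        .unstack(fill_value=0)          # one column per Loan_Status value
    * 100
)

print(alt_percent.round(2))
```

Each row sums to 100, and the Y column holds the same approval percentages as df_percent.Y. The crosstab version is shorter, so this form is mostly useful when a groupby is already part of the pipeline.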
You can then easily plot this as your bar chart:
import matplotlib.pyplot as plt

# Plot only approvals as bar graph.
plt.bar(df_percent.index, df_percent.Y, align='center', alpha=0.5)
plt.ylabel('Loan Approval Percentage')
plt.title('Area Wise Loan Approval Percentage')
plt.show()
And get the resulting chart (image not included here).
Here is the sample dataframe I generated for this answer:
data = pd.DataFrame({
    'Loan_Status': ['N','Y','Y','Y','Y','N','N','Y','N','Y','Y'],
    'Property_Area': ['Rural', 'Urban','Urban','Urban','Urban','Urban',
                      'Semiurban','Urban','Semiurban','Rural','Semiurban']})
Which creates this sample dataframe:
   Loan_Status Property_Area
0            N         Rural
1            Y         Urban
2            Y         Urban
3            Y         Urban
4            Y         Urban
5            N         Urban
6            N     Semiurban
7            Y         Urban
8            N     Semiurban
9            Y         Rural
10           Y     Semiurban
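If only the approval percentage is needed, and not the full N/Y table, one possible shortcut is to turn Loan_Status into a boolean and average it per area, since the mean of a True/False column is the fraction of True values. This is a sketch under that assumption rather than the answer's own method; approval_percent is an illustrative name.

```python
import pandas as pd

# Same sample dataframe as above.
data = pd.DataFrame({
    'Loan_Status': ['N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'Y'],
    'Property_Area': ['Rural', 'Urban', 'Urban', 'Urban', 'Urban', 'Urban',
                      'Semiurban', 'Urban', 'Semiurban', 'Rural', 'Semiurban'],
})

# True where the loan was approved; the per-area mean of that boolean
# column is the approval rate, scaled here to a percentage.
# approval_percent is a hypothetical name used only in this sketch.
approval_percent = (
    data['Loan_Status'].eq('Y')
        .groupby(data['Property_Area'])
        .mean()
    * 100
)

print(approval_percent.round(2))
```

The result is a Series indexed by Property_Area (Rural 50, Semiurban about 33.33, Urban about 83.33), which can be passed to plt.bar in place of df_percent.index and df_percent.Y.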