How to process multi-response data to build frequencies in Python 3?

Problem description

I am working with a multi-response dataset and want to build some frequency tables with Python pandas. Here is my dataset:

Student Id |1st_Lang |2nd_Lang |Core_Sub_1 |Core_Sub_2 |Core_Sub_3  |Additional
1          |Bengali  |English  |Math       |Life Sc    |Physical Sc |Work Education
2          |Bengali  |English  |Geography  |Life Sc    |Physical Sc |Physical Education
3          |Bengali  |English  |History    |Geography  |Economics   |Life Sc
4          |English  |Hindi    |History    |Geography  |Economics   |Life Sc
5          |Hindi    |English  |Math       |Life Sc    |Physical Sc |Work Education

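This is sample student data: the student ID plus the subjects each student chose as languages, core subjects, and an additional subject.

I want to generate the frequency of the subjects the students are studying.

Example: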

English - 5
Bengali - 3
Hindi - 2
Geography - 3
... etc.

I also want the frequency of subjects among the students whose language (from the 1st_Lang and 2nd_Lang columns) is English or Hindi.

Could you help me do this in Python?

Tags: python, pandas, multiple-columns

Solution

Since we don't need "Student Id" as data, we set it aside as the index (or drop it):

df = df.set_index("Student Id")
# df = df.drop(columns="Student Id")

           1st_Lang 2nd_Lang Core_Sub_1 Core_Sub_2   Core_Sub_3          Additional
Student Id
1           Bengali  English       Math    Life Sc  Physical Sc      Work Education
2           Bengali  English  Geography    Life Sc  Physical Sc  Physical Education
3           Bengali  English    History  Geography    Economics             Life Sc
4           English    Hindi    History  Geography    Economics             Life Sc
5             Hindi  English       Math    Life Sc  Physical Sc      Work Education
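
To reproduce the steps in this answer, the sample data first has to be in a DataFrame. A minimal sketch, assuming the table from the question is typed in by hand (the question does not say how the data is actually loaded):

```python
import pandas as pd

# Sample data copied from the question; in practice it would likely
# come from a file, which the question does not specify.
df = pd.DataFrame(
    {
        "Student Id": [1, 2, 3, 4, 5],
        "1st_Lang": ["Bengali", "Bengali", "Bengali", "English", "Hindi"],
        "2nd_Lang": ["English", "English", "English", "Hindi", "English"],
        "Core_Sub_1": ["Math", "Geography", "History", "History", "Math"],
        "Core_Sub_2": ["Life Sc", "Life Sc", "Geography", "Geography", "Life Sc"],
        "Core_Sub_3": ["Physical Sc", "Physical Sc", "Economics", "Economics", "Physical Sc"],
        "Additional": ["Work Education", "Physical Education", "Life Sc", "Life Sc", "Work Education"],
    }
)

# Move the identifier out of the data columns so it is not counted as a response.
df = df.set_index("Student Id")
print(df)
```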

Stacking df gives us a Series (with a MultiIndex):

ser = df.stack()

Student Id
1           1st_Lang                 Bengali
            2nd_Lang                 English
            Core_Sub_1                  Math
            Core_Sub_2               Life Sc
            Core_Sub_3           Physical Sc
            Additional        Work Education
2           1st_Lang                 Bengali
            2nd_Lang                 English
            Core_Sub_1             Geography
            Core_Sub_2               Life Sc
            Core_Sub_3           Physical Sc
            Additional    Physical Education
3           1st_Lang                 Bengali
            2nd_Lang                 English
            Core_Sub_1               History
            Core_Sub_2             Geography
            Core_Sub_3             Economics
            Additional               Life Sc
4           1st_Lang                 English
            2nd_Lang                   Hindi
            Core_Sub_1               History
            Core_Sub_2             Geography
            Core_Sub_3             Economics
            Additional               Life Sc
5           1st_Lang                   Hindi
            2nd_Lang                 English
            Core_Sub_1                  Math
            Core_Sub_2               Life Sc
            Core_Sub_3           Physical Sc
            Additional        Work Education
dtype: object

We can now count the frequencies:

ser.value_counts()

Life Sc               5
English               5
Physical Sc           3
Bengali               3
Geography             3
Work Education        2
Hindi                 2
Math                  2
History               2
Economics             2
Physical Education    1
dtype: int64
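
Note that the counts above are counts of responses. They match the number of students here only because no student lists the same subject in two columns. If that could happen in your data, a possible variant (a sketch, not part of the original answer; the names long_df and students_per_subject are only illustrative) is to reshape with melt and drop duplicate student/subject pairs before counting:

```python
import pandas as pd

columns = ["Student Id", "1st_Lang", "2nd_Lang",
           "Core_Sub_1", "Core_Sub_2", "Core_Sub_3", "Additional"]
rows = [
    [1, "Bengali", "English", "Math", "Life Sc", "Physical Sc", "Work Education"],
    [2, "Bengali", "English", "Geography", "Life Sc", "Physical Sc", "Physical Education"],
    [3, "Bengali", "English", "History", "Geography", "Economics", "Life Sc"],
    [4, "English", "Hindi", "History", "Geography", "Economics", "Life Sc"],
    [5, "Hindi", "English", "Math", "Life Sc", "Physical Sc", "Work Education"],
]
df = pd.DataFrame(rows, columns=columns).set_index("Student Id")

# Long format: one row per (student, source column, subject) response.
long_df = df.reset_index().melt(id_vars="Student Id", value_name="Subject")

# Keep one row per (student, subject) so a repeated response counts once.
students_per_subject = (
    long_df.drop_duplicates(["Student Id", "Subject"])["Subject"].value_counts()
)
print(students_per_subject)
```

For the sample data this gives the same numbers as ser.value_counts(), since there are no repeated responses.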

Now for the students who study Hindi. First set up the criterion:

critH = df[["1st_Lang", "2nd_Lang"]].eq("Hindi")

            1st_Lang  2nd_Lang
Student Id
1              False     False
2              False     False
3              False     False
4              False      True
5               True     False

Hindi may be either the first or the second language, so we accept a match in any of the two columns:

critH = critH.any(axis=1)

Student Id
1    False
2    False
3    False
4     True
5     True
dtype: bool

Select the matching rows (students) and count the frequencies in one step:

df.loc[critH].stack().value_counts()

Life Sc           2
Hindi             2
English           2
History           1
Work Education    1
Math              1
Economics         1
Physical Sc       1
Geography         1
dtype: int64
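
The question asks for students whose language is English or Hindi. The same pattern should cover that by swapping eq for isin with a list of languages. A minimal sketch, with the sample data rebuilt so it runs on its own (the names crit_langs and counts are only illustrative):

```python
import pandas as pd

columns = ["Student Id", "1st_Lang", "2nd_Lang",
           "Core_Sub_1", "Core_Sub_2", "Core_Sub_3", "Additional"]
rows = [
    [1, "Bengali", "English", "Math", "Life Sc", "Physical Sc", "Work Education"],
    [2, "Bengali", "English", "Geography", "Life Sc", "Physical Sc", "Physical Education"],
    [3, "Bengali", "English", "History", "Geography", "Economics", "Life Sc"],
    [4, "English", "Hindi", "History", "Geography", "Economics", "Life Sc"],
    [5, "Hindi", "English", "Math", "Life Sc", "Physical Sc", "Work Education"],
]
df = pd.DataFrame(rows, columns=columns).set_index("Student Id")

# True for students with English or Hindi in either language column.
crit_langs = df[["1st_Lang", "2nd_Lang"]].isin(["English", "Hindi"]).any(axis=1)

# Subject frequencies restricted to the matching students.
counts = df.loc[crit_langs].stack().value_counts()
print(counts)
```

In the sample data every student has English in one of the two language columns, so this filter keeps all five rows and the result equals the overall frequency table.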
