Converting categorical variables to quantitative variables in Python

Problem description

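I am trying to convert a categorical variable into quantitative variables. I am using the get_dummies function, which should return the quantitative (indicator) variables.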

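My idea is to create new columns in my DataFrame and put the returned quantitative variables into those columns, but when I print them, the output shows something else.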

My code:

    import pandas as pd
    import numpy as np

    df = pd.read_csv('/home/user/Documents/MOOC dataset cleaned/duplicate.csv')
    df['0_to_35'],df['35_to_55'],df['greater then 55'] = pd.get_dummies(df['age_band'])

    print(df['0_to_35'],df['35_to_55'],df['greater then 55'])

Output:

(0        0-35
1        0-35
2        0-35
3        0-35
4        0-35
5        0-35
6        0-35
7        0-35
8        0-35
9        0-35
10       0-35
11       0-35
12       0-35
13       0-35
14       0-35
15       0-35
16       0-35
17       0-35
18       0-35
19       0-35
20       0-35
21       0-35
22       0-35
23       0-35
24       0-35
25       0-35
26       0-35
27       0-35
28       0-35
29       0-35
         ... 
28755    0-35
28756    0-35
28757    0-35
28758    0-35
28759    0-35
28760    0-35
28761    0-35
28762    0-35
28763    0-35
28764    0-35
28765    0-35
28766    0-35
28767    0-35
28768    0-35
28769    0-35
28770    0-35
28771    0-35
28772    0-35
28773    0-35
28774    0-35
28775    0-35
28776    0-35
28777    0-35
28778    0-35
28779    0-35
28780    0-35
28781    0-35
28782    0-35
28783    0-35
28784    0-35
Name: 0_to_35, dtype: object, 0        35-55
1        35-55
2        35-55
3        35-55
4        35-55
5        35-55
6        35-55
7        35-55
8        35-55
9        35-55
10       35-55
11       35-55
12       35-55
13       35-55
14       35-55
15       35-55
16       35-55
17       35-55
18       35-55
19       35-55
20       35-55
21       35-55
22       35-55
23       35-55
24       35-55
25       35-55
26       35-55
27       35-55
28       35-55
29       35-55
         ...  
28755    35-55
28756    35-55
28757    35-55
28758    35-55
28759    35-55
28760    35-55
28761    35-55
28762    35-55
28763    35-55
28764    35-55
28765    35-55
28766    35-55
28767    35-55
28768    35-55
28769    35-55
28770    35-55
28771    35-55
28772    35-55
28773    35-55
28774    35-55
28775    35-55
28776    35-55
28777    35-55
28778    35-55
28779    35-55
28780    35-55
28781    35-55
28782    35-55
28783    35-55
28784    35-55
Name: 35_to_55, dtype: object, 0        55<=
1        55<=
2        55<=
3        55<=
4        55<=
5        55<=
6        55<=
7        55<=
8        55<=
9        55<=
10       55<=
11       55<=
12       55<=
13       55<=
14       55<=
15       55<=
16       55<=
17       55<=
18       55<=
19       55<=
20       55<=
21       55<=
22       55<=
23       55<=
24       55<=
25       55<=
26       55<=
27       55<=
28       55<=
29       55<=
         ... 
28755    55<=
28756    55<=
28757    55<=
28758    55<=
28759    55<=
28760    55<=
28761    55<=
28762    55<=
28763    55<=
28764    55<=
28765    55<=
28766    55<=
28767    55<=
28768    55<=
28769    55<=
28770    55<=
28771    55<=
28772    55<=
28773    55<=
28774    55<=
28775    55<=
28776    55<=
28777    55<=
28778    55<=
28779    55<=
28780    55<=
28781    55<=
28782    55<=
28783    55<=
28784    55<=
Name: greater then 55, dtype: object)

Output of pd.get_dummies(df['age_band']):

    0-35  35-55  55<=
0         0      0     1
1         0      1     0
2         0      1     0
3         0      1     0
4         1      0     0
5         0      1     0
6         1      0     0
7         1      0     0
8         1      0     0
9         0      0     1
10        0      1     0
11        1      0     0
12        0      1     0
13        1      0     0
14        0      1     0
15        1      0     0
16        0      1     0
17        0      1     0
18        0      1     0
19        0      1     0
20        1      0     0
21        1      0     0
22        0      1     0
23        0      1     0
24        1      0     0
25        0      1     0
26        1      0     0
27        1      0     0
28        0      1     0
29        0      1     0
...     ...    ...   ...
28755     0      1     0
28756     0      1     0
28757     1      0     0
28758     0      1     0
28759     0      1     0
28760     0      1     0
28761     0      1     0
28762     0      1     0
28763     0      1     0
28764     0      1     0
28765     0      1     0
28766     0      1     0
28767     0      1     0
28768     0      1     0
28769     1      0     0
28770     0      1     0
28771     0      1     0
28772     0      1     0
28773     1      0     0
28774     0      1     0
28775     1      0     0
28776     1      0     0
28777     1      0     0
28778     0      1     0
28779     1      0     0
28780     1      0     0
28781     0      1     0
28782     1      0     0
28783     0      1     0
28784     0      1     0

[28785 rows x 3 columns]
[Finished in 0.216s]

I don't understand why this happens. It should put the three variables above into the new columns. How can I fix this?

Tags: python, pandas

Solution


I think you need to assign to a list of the new column names:

    df[['0_to_35', '35_to_55', 'greater then 55']] = pd.get_dummies(df['age_band'])
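
The reason the original attempt prints strings is most likely that unpacking a DataFrame iterates over its column labels rather than its columns, so each new column is filled with a single label such as '0-35'. Below is a minimal sketch of that behaviour and of the list-style assignment; the four-row DataFrame is made-up toy data standing in for the real CSV, and the name `dummies` is only illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'age_band' column of the question.
df = pd.DataFrame({'age_band': ['55<=', '35-55', '0-35', '35-55']})
dummies = pd.get_dummies(df['age_band'])

# Unpacking (iterating over) a DataFrame yields its column labels, not its columns,
# so a, b and c are plain strings here.
a, b, c = dummies
print(a, b, c)  # 0-35 35-55 55<=

# Assigning to a list of column names copies the dummy columns over, in order.
df[['0_to_35', '35_to_55', 'greater then 55']] = dummies
print(df)
```

Depending on the pandas version, the dummy columns may be shown as 0/1 or as False/True.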

Or assign to a new DataFrame and join:

    df1 = pd.get_dummies(df['age_band'])
    # set new column names if necessary
    df1.columns = ['0_to_35', '35_to_55', 'greater then 55']
    df = df.join(df1)
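
Setting `df1.columns` from a list relies on the dummy columns coming back in the expected order. A slightly more defensive variant, sketched here on made-up toy data, renames by the original category label instead; the `names` mapping is an illustrative helper, and `dtype=int` is only there to force 0/1 output regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'age_band' column of the question.
df = pd.DataFrame({'age_band': ['55<=', '35-55', '0-35', '35-55']})

# Map each category label to the desired column name explicitly,
# so the result does not depend on the order of the dummy columns.
names = {'0-35': '0_to_35', '35-55': '35_to_55', '55<=': 'greater then 55'}
df1 = pd.get_dummies(df['age_band'], dtype=int).rename(columns=names)

# join aligns on the index, which get_dummies preserves from df['age_band'].
df = df.join(df1)
print(df)
```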
