python - How do I rearrange the row order of this MultiIndex pandas DataFrame?

Question

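Any tips are greatly appreciated. I start with a DataFrame named years69_19.

I then created a set of separate DataFrames from years69_19, one per department, split on the 'Field' column. Some departments appear under several labels, so I use the | operator to match all of the labels for a department.

Next, I put the new DataFrames into a list called listofdeps. I also made a list of strings, depstrings, that corresponds to listofdeps; it is only there so that each DataFrame gets the right name.

Finally, I loop over listofdeps and pivot each DataFrame. Here is my code: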

newlistofdeps = []

for dataframes, deptname in zip(listofdeps, depstrings):
    newlabel = deptname + ' Department at [REDACTED]'
    dataframes[newlabel] = 1  # column of ones; summing it counts rows per group
    deptable = pd.pivot_table(
        dataframes[['Year', 'Gender', 'Ethnicity', newlabel]],
        index=['Gender', 'Ethnicity'], columns=['Year'],
        aggfunc=np.sum, fill_value=0,
    )
    newlistofdeps.append(deptable)

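Now I have a list, newlistofdeps, with one pivoted DataFrame per department (field).

Stack Overflow community, I need help with the following:

I need to rearrange the Ethnicity index level into this order: 'Asian', 'Black', 'Chicano/Mexican-American', 'Other Hispanic/Latino', 'White', 'Other', 'International'. I have tried many different approaches, such as df.reindex and using the level argument, but I just can't figure out how to do it.
I need every ethnicity listed above to appear in every DataFrame in newlistofdeps, even if the department has no rows for that ethnicity. For example, one department has no Chicano/Mexican-American women and no Black men, but I still need rows for those groups, filled with 0. I really don't know how to approach this. I was thinking of creating a DataFrame in this format with every ethnicity filled with 0, and then merging each department's DataFrame with it so that the missing ethnicities still get rows. Any ideas?

Thanks!!!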

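Tags: python, pandas, database, dataframe, pivot-table

Answer

It looks like you are taking the long way round to perform a cross-tabulation. You can simply use pd.crosstab to do all the heavy lifting you are currently doing by hand.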
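As a minimal sketch of that equivalence, the snippet below builds the counts both ways on a tiny made-up table. The data and the helper column name n are assumptions for illustration only; the column names Gender, Ethnicity and Year mirror the ones used in the question.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those in the question.
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F"],
    "Ethnicity": ["Asian", "White", "Asian", "Asian", "Asian"],
    "Year": [2000, 2000, 2000, 2001, 2001],
})

# Approach from the question: add a column of ones and sum it in a pivot table.
with_ones = toy.assign(n=1)
pivoted = pd.pivot_table(
    with_ones,
    values="n",
    index=["Gender", "Ethnicity"],
    columns="Year",
    aggfunc="sum",
    fill_value=0,
)

# Cross-tabulation of the same columns; no helper column is needed.
tabbed = pd.crosstab(index=[toy["Gender"], toy["Ethnicity"]], columns=toy["Year"])

# Both tables should hold the same counts in the same layout.
print((pivoted.to_numpy() == tabbed.to_numpy()).all())
```

Both calls group by the same keys, so the only practical difference here is that pd.crosstab counts rows directly instead of summing a dummy column.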

Creating the data

import pandas as pd
import numpy as np
import itertools

ethnicities = ['Asian', 'Black', 'Chicano/Mexican-American', 'Other Hispanic/Latino', 'White', 'Other', 'Interational']
fields = ["economics", "physics", "political sciences", "chemistry", "english"]
sexes = ["M", "F"]
years = [2000, 2001, 2002, 2003]

records = itertools.product(ethnicities, fields, sexes, years)
base_df = pd.DataFrame(records, columns=["ethnicity", "field", "sex", "year"])

print(base_df.head(10))

  ethnicity      field sex  year
0     Asian  economics   M  2000
1     Asian  economics   M  2001
2     Asian  economics   M  2002
3     Asian  economics   M  2003
4     Asian  economics   F  2000
5     Asian  economics   F  2001
6     Asian  economics   F  2002
7     Asian  economics   F  2003
8     Asian    physics   M  2000
9     Asian    physics   M  2001

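base_df is the Cartesian product of all our categories, so it has one row for every unique combination of ethnicity, field, sex and year. With that in hand, we can sample from it to make the data more realistic. I deliberately undersample so that some combinations are entirely missing, which is closer to the data you are working with.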

df = base_df.sample(50, replace=True)

print(df.head())
                 ethnicity               field sex  year
183                  White  political sciences   F  2003
228                  Other           chemistry   F  2000
38                   Asian             english   F  2002
166                  White           economics   F  2002
146  Other Hispanic/Latino           chemistry   M  2002

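Now that we have a reasonable sample dataset, we can use pd.crosstab to get the counts you computed in your question. I am passing dropna=False, which tells pandas not to drop combinations that are entirely missing and to fill those unobserved combinations with 0 instead.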

xtab = pd.crosstab(index=[df["field"], df["sex"], df["ethnicity"]], columns=df["year"], dropna=False)

print(xtab.head(10))
year                                    2000  2001  2002  2003
field     sex ethnicity                                       
chemistry F   Asian                        0     0     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     0     0     0
              Interational                 0     0     0     1
              Other                        1     0     0     0
              Other Hispanic/Latino        0     0     1     0
              White                        1     0     0     0
          M   Asian                        0     1     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     1     0     0

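There you have a cross-tabulation of all our categories that also includes the missing category combinations.

For comparison, here is what happens with dropna=True: category combinations with 0 observations are dropped, which is the problem you posted.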

xtab = pd.crosstab(index=[df["field"], df["sex"], df["ethnicity"]], columns=df["year"], dropna=True)

print(xtab.head(10))
year                                    2000  2001  2002  2003
field     sex ethnicity                                       
chemistry F   Interational                 0     0     0     1
              Other                        1     0     0     0
              Other Hispanic/Latino        0     0     1     0
              White                        1     0     0     0
          M   Asian                        0     1     0     0
              Chicano/Mexican-American     0     1     0     0
              Other Hispanic/Latino        1     2     1     0
              White                        0     1     0     1
economics F   Asian                        0     0     0     1
              Black                        0     1     0     0

Note that with dropna=True some category combinations are now missing, because they were not observed in our sample.
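One caveat is worth sketching with plain string columns: dropna=False expands over the values that occur somewhere in the data, so a combination that is never observed gets a row of zeros, but an ethnicity that never occurs at all still has no row. The toy data below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: "Black" never occurs, and (F, White) is never observed.
toy = pd.DataFrame({
    "sex": ["F", "M", "M"],
    "ethnicity": ["Asian", "Asian", "White"],
    "year": [2000, 2000, 2001],
})

full = pd.crosstab(
    index=[toy["sex"], toy["ethnicity"]],
    columns=toy["year"],
    dropna=False,
)

# (F, White) is present as a row of zeros because "F" and "White" both occur.
print(("F", "White") in full.index)

# "Black" is absent because it does not occur anywhere in the data.
print("Black" in full.index.get_level_values("ethnicity"))
```

This is why an explicit index listing every category is still useful when every ethnicity has to appear in the result.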

To change the order of the rows, the simplest approach is to explicitly construct a new MultiIndex in the order you want and reindex with it.

# define the order of categories for each level
new_index = pd.MultiIndex.from_product([
    ["economics", "physics", "political sciences", "chemistry", "english"],
    ["M", "F"],
    ['Asian', 'Black', 'Chicano/Mexican-American', 'Other Hispanic/Latino', 'White', 'Other', 'Interational']],
    names=["field", "sex", "ethnicity"]
)

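# use the new index to reorder the data; fill_value=0 keeps any combination
# that is absent from xtab (e.g. after dropna=True) as a row of zeros
reordered_xtab = xtab.reindex(new_index, fill_value=0)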

print(reordered_xtab.head(10))
year                                    2000  2001  2002  2003
field     sex ethnicity                                       
economics M   Asian                        0     0     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     0     1     1
              Other Hispanic/Latino        0     0     0     0
              White                        0     1     1     0
              Other                        0     1     0     0
              Interational                 0     0     0     0
          F   Asian                        0     0     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     0     0     1

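Now everything respects the order I defined in new_index instead of alphabetical order, which is what pandas uses by default when it computes a crosstab.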
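If you still want one table per department, as in the question's newlistofdeps, the same idea can be applied per field. The following is only a sketch on made-up data: the names full_index and per_field are hypothetical, and the category spellings follow the sample data above. Passing fill_value=0 to reindex makes every missing sex and ethnicity combination show up as a row of zeros.

```python
import pandas as pd

# Category orders; spellings follow the sample data used in this answer.
sex_order = ["M", "F"]
ethnicity_order = ['Asian', 'Black', 'Chicano/Mexican-American',
                   'Other Hispanic/Latino', 'White', 'Other', 'Interational']

# Hypothetical toy data with many combinations missing.
df = pd.DataFrame({
    "field": ["physics", "physics", "physics", "english", "english"],
    "sex": ["F", "M", "F", "M", "M"],
    "ethnicity": ["White", "Asian", "White", "Other", "Black"],
    "year": [2000, 2000, 2001, 2001, 2001],
})

xtab = pd.crosstab(index=[df["field"], df["sex"], df["ethnicity"]], columns=df["year"])

# Every (sex, ethnicity) pair, in the desired order.
full_index = pd.MultiIndex.from_product(
    [sex_order, ethnicity_order], names=["sex", "ethnicity"]
)

# One table per field: drop the field level, then reindex so that
# missing pairs appear as rows of zeros.
per_field = {
    field: table.droplevel("field").reindex(full_index, fill_value=0)
    for field, table in xtab.groupby(level="field")
}

print(per_field["physics"])
```

Each value in per_field has the same rows in the same order, which makes the tables easy to compare or export side by side.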
