首页 > 解决方案 > 连接多个数据帧时如何解决这个 NaN 值(索引)熊猫问题?

问题描述

我正在尝试使用 pandas 读取和分析多个“KSA 中的 covid-19”CSV 文件,特别是 52 个文件,并绘制它们,然后将这些图插入到 folium 地图中。我在 Windows 10 上使用 python 3.9.1。IDE 是用于测试的 jupyter notebook,然后将在 pycharm 中编写最终代码。基本上,KSA 13 个地区中的每个地区都有 4 个文件:病例、康复、死亡、活跃。(delimiter=;) 示例:

    Daily / Cumulative  Indicator   Date    Event   City    Region  Cases (person)
0   Cumulative  Active  2020-08-04  NaN Shuwaq  Tabuk   5
1   Cumulative  Active  2020-08-05  NaN Tabuk   Tabuk   243
2   Cumulative  Active  2020-08-05  NaN Al Badʻ Tabuk   1
3   Cumulative  Active  2020-08-06  NaN Tabuk   Tabuk   229
4   Cumulative  Active  2020-08-06  NaN Taymaʼ  Tabuk   5
... ... ... ... ... ... ... ...
2220    Cumulative  Active  2020-12-22  NaN Al Wajh Tabuk   2
2221    Cumulative  Active  2020-12-26  NaN Al Wajh Tabuk   2
2222    Cumulative  Active  2021-01-01  NaN Al Wajh Tabuk   2
2223    Cumulative  Active  2021-01-02  NaN Al Wajh Tabuk   2
2224    Cumulative  Active  2020-12-06  NaN Abu Rakah   Tabuk   1

我清理了不需要的列,为串联重命名了类似的列:

import pandas as pd
import matplotlib.pyplot as plt

file1 = pd.read_csv("data/tabuk_total_active.csv", delimiter=";", parse_dates=True, index_col="Date")
file2 = pd.read_csv("data/tabuk_daily_cases.csv", delimiter=";", parse_dates=True, index_col="Date")
file3 = pd.read_csv("data/tabuk_daily_deaths.csv", delimiter=";", parse_dates=True, index_col="Date")
file4 = pd.read_csv("data/tabuk_daily_recovories.csv", delimiter=";", parse_date=True, index_col="Date")
data1 = file1.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
data2 = file2.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
data3 = file3.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
data4 = file4.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
df1 = data1.rename(columns={'Cases (person)': 'active'})
df2 = data2.rename(columns={'Cases (person)': 'cases'})
df3 = data3.rename(columns={'Cases (person)': 'deaths'})
df4 = data4.rename(columns={'Cases (person)': 'recovories'})

结果:

 df1
        Region  active
    Date        
    2020-08-04  Tabuk   5
    2020-08-05  Tabuk   243
    2020-08-05  Tabuk   1
    2020-08-06  Tabuk   229
    2020-08-06  Tabuk   5
    ... ... ...
    2020-12-22  Tabuk   2
    2020-12-26  Tabuk   2
    2021-01-01  Tabuk   2
    2021-01-02  Tabuk   2
    2020-12-06  Tabuk   1

df2

    Region  cases
Date        
2020-03-20  Tabuk   1
2020-03-28  Tabuk   3
2020-03-29  Tabuk   1
2020-03-30  Tabuk   1
2020-04-01  Tabuk   2
... ... ...
2021-02-03  Tabuk   1
2021-02-03  Tabuk   1
2021-02-04  Tabuk   1
2021-02-04  Tabuk   1
2021-02-05  Tabuk   1

df3

    Region  deaths
Date        
2020-04-16  Tabuk   1
2020-05-31  Tabuk   2
2020-06-02  Tabuk   2
2020-06-06  Tabuk   1
2020-06-07  Tabuk   1
2020-06-09  Tabuk   1
2020-06-11  Tabuk   1
2020-06-13  Tabuk   1
2020-06-14  Tabuk   2
2020-06-15  Tabuk   1
2020-06-21  Tabuk   1
2020-06-26  Tabuk   1
2020-06-28  Tabuk   2
2020-07-01  Tabuk   1
2020-07-04  Tabuk   3
2020-07-05  Tabuk   4
2020-07-07  Tabuk   2
2020-07-08  Tabuk   1
2020-07-11  Tabuk   3
2020-07-12  Tabuk   1
2020-07-14  Tabuk   3
2020-07-15  Tabuk   3
2020-07-16  Tabuk   1
2020-07-17  Tabuk   2
2020-07-23  Tabuk   2
2020-07-24  Tabuk   3
2020-08-02  Tabuk   1
2020-08-03  Tabuk   2
2020-08-10  Tabuk   3
2020-08-11  Tabuk   2
2020-08-16  Tabuk   1
2020-08-20  Tabuk   1
2020-08-21  Tabuk   1
2020-08-22  Tabuk   1
2020-08-23  Tabuk   1
2020-09-05  Tabuk   1
2020-09-08  Tabuk   2
2020-09-09  Tabuk   1
2020-09-11  Tabuk   1
2020-09-14  Tabuk   1
2020-09-15  Tabuk   1
2020-09-19  Tabuk   1
2020-09-25  Tabuk   1
2020-09-29  Tabuk   1
2020-10-02  Tabuk   2
2020-10-06  Tabuk   1
2020-10-07  Tabuk   1
2020-10-11  Tabuk   1
2020-10-12  Tabuk   1
2020-10-27  Tabuk   1
2020-10-29  Tabuk   1
2020-10-30  Tabuk   1
2020-11-15  Tabuk   1
2020-11-25  Tabuk   1
2020-12-04  Tabuk   1
2020-12-08  Tabuk   1
2020-12-12  Tabuk   1
2021-01-02  Tabuk   1
2021-01-11  Tabuk   1
2021-01-14  Tabuk   1

df4
    Region  recovories
Date        
2020-04-07  Tabuk   4
2020-04-12  Tabuk   1
2020-04-13  Tabuk   1
2020-04-18  Tabuk   2
2020-04-19  Tabuk   1
... ... ...
2021-02-03  Tabuk   1
2021-02-04  Tabuk   2
2021-02-04  Tabuk   1
2021-02-05  Tabuk   1
2021-02-05  Tabuk   1

排序凌乱的日期:

df1 = df1.sort_values(by="Date")
df2 = df2.sort_values(by="Date")
df3 = df3.sort_values(by="Date")
df4 = df4.sort_values(by="Date")

接下来我想合并数据框以便在一个图中轻松绘制它们:

tabuk_all= [df1, df2, df3, df4]
all = pd.concat(tabuk_all)
all.head()



    Region  active  cases   deaths  recovories
Date                    
2020-03-20  Tabuk   1.0 NaN NaN NaN
2020-03-21  Tabuk   1.0 NaN NaN NaN
2020-03-22  Tabuk   1.0 NaN NaN NaN
2020-03-23  Tabuk   1.0 NaN NaN NaN
2020-03-24  Tabuk   1.0 NaN NaN NaN

并且出现了这些 NaN 值。我认为原因是日期不一致。我怎样才能首先合并每个文件的同一月份的行,所以它看起来像这样:

   Region  active                   >>>  Date  active
    Date                                   march-2020   400
    2020-08-04  Tabuk   5                  April-2020   350
    2020-08-05  Tabuk   243                may-2020   670
    2020-08-05  Tabuk   1                  ...
    2020-08-06  Tabuk   229
    2020-08-06  Tabuk   5                  ...
    ... ... ...
    2020-12-22  Tabuk   2
    2020-12-26  Tabuk   2
    2021-01-01  Tabuk   2
    2021-01-02  Tabuk   2
    2020-12-06  Tabuk   1

然后合并每个区域的 4 个文件,如下所示:

    Date    active  cases   deaths  recovories
0   march-2020  34  234  22  1232
1   april-2020  55  345  23  1234
2   may-2020    100 367  5   234

什么是最有效的方式来读取和执行所有文件而不是每个文件?

标签: pythonpython-3.xpandasdataframecsv

解决方案


推荐阅读