python - 连接多个数据帧时如何解决这个 NaN 值(索引)熊猫问题?
问题描述
我正在尝试使用 pandas 读取和分析多个“KSA 中的 covid-19”CSV 文件,特别是 52 个文件,并绘制它们,然后将这些图插入到 folium 地图中。我在 Windows 10 上使用 python 3.9.1。IDE 是用于测试的 jupyter notebook,然后将在 pycharm 中编写最终代码。基本上,KSA 13 个地区中的每个地区都有 4 个文件:病例、康复、死亡、活跃。(delimiter=;) 示例:
Daily / Cumulative Indicator Date Event City Region Cases (person)
0 Cumulative Active 2020-08-04 NaN Shuwaq Tabuk 5
1 Cumulative Active 2020-08-05 NaN Tabuk Tabuk 243
2 Cumulative Active 2020-08-05 NaN Al Badʻ Tabuk 1
3 Cumulative Active 2020-08-06 NaN Tabuk Tabuk 229
4 Cumulative Active 2020-08-06 NaN Taymaʼ Tabuk 5
... ... ... ... ... ... ... ...
2220 Cumulative Active 2020-12-22 NaN Al Wajh Tabuk 2
2221 Cumulative Active 2020-12-26 NaN Al Wajh Tabuk 2
2222 Cumulative Active 2021-01-01 NaN Al Wajh Tabuk 2
2223 Cumulative Active 2021-01-02 NaN Al Wajh Tabuk 2
2224 Cumulative Active 2020-12-06 NaN Abu Rakah Tabuk 1
我清理了不需要的列,为串联重命名了类似的列:
import pandas as pd
import matplotlib.pyplot as plt
file1 = pd.read_csv("data/tabuk_total_active.csv", delimiter=";", parse_dates=True, index_col="Date")
file2 = pd.read_csv("data/tabuk_daily_cases.csv", delimiter=";", parse_dates=True, index_col="Date")
file3 = pd.read_csv("data/tabuk_daily_deaths.csv", delimiter=";", parse_dates=True, index_col="Date")
file4 = pd.read_csv("data/tabuk_daily_recovories.csv", delimiter=";", parse_date=True, index_col="Date")
data1 = file1.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
data2 = file2.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
data3 = file3.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
data4 = file4.drop(["Event", "Daily / Cumulative", "City", "Indicator"], axis=1)
df1 = data1.rename(columns={'Cases (person)': 'active'})
df2 = data2.rename(columns={'Cases (person)': 'cases'})
df3 = data3.rename(columns={'Cases (person)': 'deaths'})
df4 = data4.rename(columns={'Cases (person)': 'recovories'})
结果:
df1
Region active
Date
2020-08-04 Tabuk 5
2020-08-05 Tabuk 243
2020-08-05 Tabuk 1
2020-08-06 Tabuk 229
2020-08-06 Tabuk 5
... ... ...
2020-12-22 Tabuk 2
2020-12-26 Tabuk 2
2021-01-01 Tabuk 2
2021-01-02 Tabuk 2
2020-12-06 Tabuk 1
df2
Region cases
Date
2020-03-20 Tabuk 1
2020-03-28 Tabuk 3
2020-03-29 Tabuk 1
2020-03-30 Tabuk 1
2020-04-01 Tabuk 2
... ... ...
2021-02-03 Tabuk 1
2021-02-03 Tabuk 1
2021-02-04 Tabuk 1
2021-02-04 Tabuk 1
2021-02-05 Tabuk 1
df3
Region deaths
Date
2020-04-16 Tabuk 1
2020-05-31 Tabuk 2
2020-06-02 Tabuk 2
2020-06-06 Tabuk 1
2020-06-07 Tabuk 1
2020-06-09 Tabuk 1
2020-06-11 Tabuk 1
2020-06-13 Tabuk 1
2020-06-14 Tabuk 2
2020-06-15 Tabuk 1
2020-06-21 Tabuk 1
2020-06-26 Tabuk 1
2020-06-28 Tabuk 2
2020-07-01 Tabuk 1
2020-07-04 Tabuk 3
2020-07-05 Tabuk 4
2020-07-07 Tabuk 2
2020-07-08 Tabuk 1
2020-07-11 Tabuk 3
2020-07-12 Tabuk 1
2020-07-14 Tabuk 3
2020-07-15 Tabuk 3
2020-07-16 Tabuk 1
2020-07-17 Tabuk 2
2020-07-23 Tabuk 2
2020-07-24 Tabuk 3
2020-08-02 Tabuk 1
2020-08-03 Tabuk 2
2020-08-10 Tabuk 3
2020-08-11 Tabuk 2
2020-08-16 Tabuk 1
2020-08-20 Tabuk 1
2020-08-21 Tabuk 1
2020-08-22 Tabuk 1
2020-08-23 Tabuk 1
2020-09-05 Tabuk 1
2020-09-08 Tabuk 2
2020-09-09 Tabuk 1
2020-09-11 Tabuk 1
2020-09-14 Tabuk 1
2020-09-15 Tabuk 1
2020-09-19 Tabuk 1
2020-09-25 Tabuk 1
2020-09-29 Tabuk 1
2020-10-02 Tabuk 2
2020-10-06 Tabuk 1
2020-10-07 Tabuk 1
2020-10-11 Tabuk 1
2020-10-12 Tabuk 1
2020-10-27 Tabuk 1
2020-10-29 Tabuk 1
2020-10-30 Tabuk 1
2020-11-15 Tabuk 1
2020-11-25 Tabuk 1
2020-12-04 Tabuk 1
2020-12-08 Tabuk 1
2020-12-12 Tabuk 1
2021-01-02 Tabuk 1
2021-01-11 Tabuk 1
2021-01-14 Tabuk 1
df4
Region recovories
Date
2020-04-07 Tabuk 4
2020-04-12 Tabuk 1
2020-04-13 Tabuk 1
2020-04-18 Tabuk 2
2020-04-19 Tabuk 1
... ... ...
2021-02-03 Tabuk 1
2021-02-04 Tabuk 2
2021-02-04 Tabuk 1
2021-02-05 Tabuk 1
2021-02-05 Tabuk 1
排序凌乱的日期:
df1 = df1.sort_values(by="Date")
df2 = df2.sort_values(by="Date")
df3 = df3.sort_values(by="Date")
df4 = df4.sort_values(by="Date")
接下来我想合并数据框以便在一个图中轻松绘制它们:
tabuk_all= [df1, df2, df3, df4]
all = pd.concat(tabuk_all)
all.head()
Region active cases deaths recovories
Date
2020-03-20 Tabuk 1.0 NaN NaN NaN
2020-03-21 Tabuk 1.0 NaN NaN NaN
2020-03-22 Tabuk 1.0 NaN NaN NaN
2020-03-23 Tabuk 1.0 NaN NaN NaN
2020-03-24 Tabuk 1.0 NaN NaN NaN
并且出现了这些 NaN 值。我认为原因是日期不一致。我怎样才能首先合并每个文件的同一月份的行,所以它看起来像这样:
Region active >>> Date active
Date march-2020 400
2020-08-04 Tabuk 5 April-2020 350
2020-08-05 Tabuk 243 may-2020 670
2020-08-05 Tabuk 1 ...
2020-08-06 Tabuk 229
2020-08-06 Tabuk 5 ...
... ... ...
2020-12-22 Tabuk 2
2020-12-26 Tabuk 2
2021-01-01 Tabuk 2
2021-01-02 Tabuk 2
2020-12-06 Tabuk 1
然后合并每个区域的 4 个文件,如下所示:
Date active cases deaths recovories
0 march-2020 34 234 22 1232
1 april-2020 55 345 23 1234
2 may-2020 100 367 5 234
什么是最有效的方式来读取和执行所有文件而不是每个文件?
解决方案
推荐阅读
- android - 山图对数
- c# - 使用 Linq 动态查询数据集
- html - Angular 材质复选框自动取消选中
- java - 如何使用指定字符将字符串格式化 n 次,然后将其附加到另一个字符串?
- delphi - 带有两个变量delphi的嵌套If else条件
- php - 使用 Artisan 调用 php artisan 迁移
- r - 分析列表中的数据框并绑定结果
- python - 烧瓶 SQLAlchemy。AttributeError:模块 '<...>.db' 没有属性 'Model'
- javascript - 为什么幻灯片会跳过最后一张幻灯片?
- vb.net - 如何修复 vb.net 项目上未解决的引用异常