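python - csv column is read as "num" in R but as "object" in pandas.read_csv()
Problem description
Dataset link: https://www.kaggle.com/blastchar/telco-customer-churn
Why do R and pandas read the "TotalCharges" column with different data types? In pandas the column should be numeric, not object.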
Python pandas.read_csv()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ch_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
ch_data.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID 7043 non-null object
gender 7043 non-null object
SeniorCitizen 7043 non-null int64
Partner 7043 non-null object
Dependents 7043 non-null object
tenure 7043 non-null int64
PhoneService 7043 non-null object
MultipleLines 7043 non-null object
InternetService 7043 non-null object
OnlineSecurity 7043 non-null object
OnlineBackup 7043 non-null object
DeviceProtection 7043 non-null object
TechSupport 7043 non-null object
StreamingTV 7043 non-null object
StreamingMovies 7043 non-null object
Contract 7043 non-null object
PaperlessBilling 7043 non-null object
PaymentMethod 7043 non-null object
MonthlyCharges 7043 non-null float64
TotalCharges 7043 non-null object
Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
The data type of "TotalCharges" is object.
R read.csv()
gg<-read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
str(gg)
Result:
'data.frame': 7043 obs. of 21 variables:
$ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
$ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
$ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
$ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
$ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
$ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
$ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
$ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
$ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
$ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
$ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
$ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
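The data type of TotalCharges is num.
Solution
This is caused by different strategies for handling whitespace characters. R's read.csv() treats a blank field in an otherwise numeric column as NA, whereas pandas keeps the single space as a string, which makes the whole column object. You can use a regular-expression separator in pd.read_csv(sep=) to "eat" the whitespace in fields that contain only spaces, so that they are parsed as empty (missing) values: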
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=r"\,\s*", engine='python')
df.dtypes
Out[19]:
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges float64 <- correct
Churn object
# space -> nan
df["TotalCharges"][488]
Out[23]: nan
You can see that TotalCharges is now read correctly.
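As an alternative to the regex separator, the same result can probably be reached without switching to the Python engine. The sketch below is only an illustration: the three-row inline CSV is a hypothetical stand-in for the Telco file, and it shows two standard pandas options, the na_values argument of read_csv and pd.to_numeric with errors="coerce".

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the Telco CSV;
# the second row has a single space in TotalCharges.
csv_text = (
    "customerID,MonthlyCharges,TotalCharges\n"
    "A,29.85,29.85\n"
    "B,52.55, \n"
    "C,56.95,1889.5\n"
)

# Option 1: tell the parser that a lone space means "missing".
df_na = pd.read_csv(io.StringIO(csv_text), na_values=[" "])

# Option 2: read the file as-is, then coerce unparseable strings to NaN.
df_raw = pd.read_csv(io.StringIO(csv_text))
df_raw["TotalCharges"] = pd.to_numeric(df_raw["TotalCharges"], errors="coerce")

print(df_na["TotalCharges"].dtype, df_raw["TotalCharges"].dtype)
```

Option 1 keeps the default C parser; option 2 is useful when the file has already been loaded. Note that errors="coerce" silently turns every unparseable value into NaN, not only spaces, so it is worth inspecting the affected rows first.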
Note that the rows containing whitespace characters were found like this:
df = pd.read_csv("/mnt/ramdisk/WA_Fn-UseC_-Telco-Customer-Churn.csv")
for i in range(len(df)):
    try:
        _ = float(df["TotalCharges"][i])
    except ValueError:
        print(f'float() error: row={i}, val="{df.TotalCharges[i]}"')
# result
float() error: row=488, val=" "
float() error: row=753, val=" "
float() error: row=936, val=" "
float() error: row=1082, val=" "
float() error: row=1340, val=" "
float() error: row=3331, val=" "
float() error: row=3826, val=" "
float() error: row=4380, val=" "
float() error: row=5218, val=" "
float() error: row=6670, val=" "
float() error: row=6754, val=" "
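The same search can be written without an explicit loop. This is a minimal sketch, assuming a small hypothetical inline CSV in place of the real file; it uses pd.to_numeric with errors="coerce" to flag values that cannot be converted to a number.

```python
import io
import pandas as pd

# Hypothetical sample: rows 1 and 3 hold a single space instead of a number.
csv_text = (
    "customerID,TotalCharges\n"
    "A,29.85\n"
    "B, \n"
    "C,1889.5\n"
    "D, \n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Values that fail numeric conversion become NaN; keep only those that
# were not already missing in the raw column.
converted = pd.to_numeric(df["TotalCharges"], errors="coerce")
bad_rows = df.index[converted.isna() & df["TotalCharges"].notna()].tolist()
print(bad_rows)
```

Applied to the full dataset, bad_rows should list the same row numbers as the loop above, assuming the only non-numeric values are the whitespace-only ones.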
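R also decides to encode text columns internally as categorical variables (factors), while pandas does not. This was the read.csv() default before R 4.0.0, where stringsAsFactors changed to FALSE. In general, R tries to be a bit "smarter" for the potential convenience of the data analyst, because R was designed for statistics and analysis. This may or may not cause you trouble. Pandas, by contrast, is more general-purpose, so it makes fewer assumptions for the sake of consistency. This is simply a different choice of design philosophy, and any view on which is better is a matter of opinion, even when it comes from the functions' own authors.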
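If the factor-like behaviour is wanted in pandas, it can be requested explicitly. The following is a sketch with a hypothetical two-column sample rather than the real dataset; it converts text columns to the pandas category dtype, which is roughly comparable to R's factors.

```python
import pandas as pd

# Hypothetical sample of text columns from the dataset.
df = pd.DataFrame(
    {"gender": ["Female", "Male", "Male"], "Churn": ["No", "No", "Yes"]}
)

# Opt in to categorical storage for every column of this small frame.
as_cat = df.astype("category")

print(as_cat.dtypes)
print(list(as_cat["gender"].cat.categories))  # the distinct levels
```

On the real file, only the text columns would normally be converted, for example by passing a dict of column names to astype.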