CSV column is read as "num" in R but as "object" in pandas.read_csv()

Problem description

Dataset link: https://www.kaggle.com/blastchar/telco-customer-churn

Why do R and pandas read the "TotalCharges" column with different data types? The column should be numeric in pandas, not object.

Python pandas.read_csv()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ch_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
ch_data.info()

Result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

The data type of "TotalCharges" is object.

R read.csv()

gg<-read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
str(gg)

Result:

'data.frame':   7043 obs. of  21 variables:
 $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

The data type of TotalCharges is num.

Tags: python, r, pandas, dataframe

Solution


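This is caused by different strategies for handling whitespace characters. A few rows of the file contain a single space in TotalCharges. R's read.csv treats blank fields in numeric columns as missing values, so the column stays numeric. pandas does not count " " among its default missing-value markers, so it keeps the value as a string and the whole column falls back to object. You can use a regular-expression separator in pd.read_csv(sep=) to "eat" fields that contain only whitespace: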

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=r"\,\s*", engine='python')
df.dtypes
Out[19]: 
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64  <- correct
Churn                object

# space -> nan
df["TotalCharges"][488]
Out[23]: nan

You can see that TotalCharges is now read correctly.
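As a possible alternative to the regex separator (which requires the slower Python engine), the lone space can be declared as a missing-value marker at read time. This is only a sketch: the three-row inline CSV below is made up for illustration and stands in for the real file, while `na_values` is a standard `read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the Telco file;
# the second row has a single space in TotalCharges.
sample = io.StringIO(
    "customerID,MonthlyCharges,TotalCharges,Churn\n"
    "A,29.85,29.85,No\n"
    "B,52.55, ,No\n"
    "C,53.85,108.15,Yes\n"
)

# Treat a lone space as missing, in addition to pandas' default NA markers.
df_na = pd.read_csv(sample, na_values=[" "])

print(df_na["TotalCharges"].dtype)
print(df_na["TotalCharges"].isna().sum())
```

With the real file, the same call would presumably be `pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", na_values=[" "])`, keeping the default comma separator and C engine.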

Note that the rows containing the whitespace character were found like this:

df = pd.read_csv("/mnt/ramdisk/WA_Fn-UseC_-Telco-Customer-Churn.csv")
for i in range(len(df)):
    try:
        _ = float(df["TotalCharges"][i])
    except ValueError:
        print(f'float() error: row={i}, val="{df.TotalCharges[i]}"')

# result
float() error: row=488, val=" "
float() error: row=753, val=" "
float() error: row=936, val=" "
float() error: row=1082, val=" "
float() error: row=1340, val=" "
float() error: row=3331, val=" "
float() error: row=3826, val=" "
float() error: row=4380, val=" "
float() error: row=5218, val=" "
float() error: row=6670, val=" "
float() error: row=6754, val=" "
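The same check can probably be written without an explicit loop. The sketch below uses a small made-up Series in place of the TotalCharges column as read by default (as strings) and relies on `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN. One caveat: values that were already missing in the input would be flagged as well.

```python
import pandas as pd

# Hypothetical stand-in for df["TotalCharges"] as read by default (strings).
total_charges = pd.Series(["29.85", " ", "108.15", " "])

# Values that cannot be parsed as numbers become NaN under errors="coerce".
as_number = pd.to_numeric(total_charges, errors="coerce")

# Row labels whose value failed to parse.
bad_rows = total_charges.index[as_number.isna()].tolist()
print(bad_rows)  # prints [1, 3]
```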

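Meanwhile, R also decides to encode text internally as categorical variables (factors), whereas pandas does not. (This was the default for read.csv in R versions before 4.0.0.) In general, R tries to be a bit "smarter" for the potential convenience of data analysts, since R was designed for statistical and analytical work. This may or may not cause you trouble. pandas, by contrast, is more general-purpose, so it makes fewer assumptions for the sake of consistency. So this is just a different choice of function design philosophy, and any view on it is always purely opinion-based, even if the functions' creators were to answer themselves.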
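If factor-like columns are wanted in pandas as well, they can be requested explicitly. The sketch below is one way to do it, assuming a tiny made-up CSV rather than the real file; the `"category"` dtype is roughly analogous to an R factor, except that its integer codes start at 0 rather than 1.

```python
import io
import pandas as pd

# Hypothetical sample with two text columns from the dataset.
sample = io.StringIO(
    "gender,Churn\n"
    "Female,No\n"
    "Male,No\n"
    "Male,Yes\n"
)

# Ask read_csv to store these text columns as categoricals.
df_cat = pd.read_csv(sample, dtype={"gender": "category", "Churn": "category"})

print(df_cat.dtypes)
print(df_cat["gender"].cat.categories.tolist())  # the "levels"
print(df_cat["gender"].cat.codes.tolist())       # zero-based integer codes
```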

