How to clean a dataset from UCI in R

Problem description

Good afternoon,

Suppose we have the following function:

data_preprocessing <- function(link) {
  link <- as.character(link)
  dataset <- read.csv(link)
  dataset <- replace(dataset, dataset == "?", NA)
  return(dataset)
}

Example (problem with the https protocol):

Echocardiogram=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
 Error in file(file, "rt") : cannot open the connection 

After downloading the dataset (over http):

Echocardiogram=data_preprocessing("http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")

head(Echocardiogram)

    X11   X0    X71 X0.1 X0.260     X9 X4.600    X14    X1  X1.1 name X1.2 X0.2
1    19    0     72    0  0.380      6  4.100     14 1.700 0.588 name    1    0
2    16    0     55    0  0.260      4  3.420     14     1     1 name    1    0
3    57    0     60    0  0.253 12.062  4.603     16 1.450 0.788 name    1    0
4    19    1     57    0  0.160     22  5.750     18 2.250 0.571 name    1    0
5    26    0     68    0  0.260      5  4.310     12     1 0.857 name    1    0
6    13    0     62    0  0.230     31  5.430   22.5 1.875 0.857 name    1    0

Also:

str(Echocardiogram)
'data.frame':   130 obs. of  12 variables:
 $ X11   : Factor w/ 57 levels "",".03",".25",..: 18 16 54 18 27 14 50 18 26 12 ...
 $ X0    : Factor w/ 4 levels "","?","0","1": 3 3 3 4 3 3 3 3 3 4 ...
 $ X71   : Factor w/ 40 levels "","?","35","46",..: 30 12 17 14 26 19 17 4 11 34 ...
 $ X0.1  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X0.260: Factor w/ 74 levels "","?","0.010",..: 65 50 47 26 50 42 59 60 21 19 ...
 $ X9    : Factor w/ 93 levels "","?","0","10",..: 69 57 13 46 62 56 79 3 19 29 ...
 $ X4.600: Factor w/ 106 levels "","?","2.32",..: 25 6 54 92 38 85 76 70 47 33 ...
 $ X14   : Factor w/ 48 levels "","?","10","10.5",..: 16 16 21 27 8 36 16 21 19 27 ...
 $ X1    : Factor w/ 67 levels "","?","1","1.04",..: 48 3 37 60 3 52 3 11 16 50 ...
 $ X1.1  : Factor w/ 32 levels "","?","0.140",..: 14 30 25 13 27 27 30 31 29 21 ...
 $ X1.2  : Factor w/ 5 levels "","?","1","2",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ X0.2  : Factor w/ 5 levels "","?","0","1",..: 3 3 3 3 3 3 3 3 3 4 ...

Here, I would like to replace every "?" in the dataset with NA. It would also be good to remove duplicate and empty rows (such as row 50).

Thanks for your help!

Tags: r, data-cleaning

Solution


Something like this?

library(data.table)
DT <- data.table::fread("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data", 
                        fill = TRUE,
                        na.strings = "?")
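The call above takes care of the "?" values. The question also asks for duplicate and empty rows to be dropped, and the `head()` output in the question suggests that `read.csv` consumed the first record as column names. Below is a minimal base R sketch of those remaining steps. It is an illustration, not the answerer's method: `raw_text` is a small made-up sample standing in for the UCI file so the snippet runs without network access, and treating empty fields as NA is an assumption about what counts as an "empty row".

```r
# Made-up inline sample standing in for the UCI file (assumption: the real
# file is comma-separated, has no header line, and marks missing values with "?").
raw_text <- "11,0,71,0,0.260,name
19,0,72,0,?,name
19,0,72,0,?,name
,,,,,
16,?,55,0,0.260,name"

# header = FALSE keeps the first record as data instead of column names.
# na.strings converts both "?" and empty fields to NA while reading, so
# numeric columns are parsed as numbers rather than as factors or strings.
dataset <- read.csv(text = raw_text, header = FALSE,
                    na.strings = c("?", ""), stringsAsFactors = FALSE)

# Drop rows in which every field is NA (the empty rows).
dataset <- dataset[rowSums(!is.na(dataset)) > 0, , drop = FALSE]

# Drop exact duplicate rows, keeping the first occurrence of each.
dataset <- unique(dataset)

# Renumber the rows after filtering.
rownames(dataset) <- NULL

str(dataset)
```

With the real data, the same two filtering lines can be applied to the object returned by `fread` or `read.csv`; for a `data.table`, `unique(DT)` removes duplicate rows in the same way.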
