Need to make the import function faster

Problem description

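I wrote this function to import files from HDFS into RStudio, and it works correctly. The problem is that it takes a significant amount of time to return the result.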

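library(data.table)

import_file <- function(file_Path)
{
  # Stream the HDFS file(s) through the shell, one character string per line
  data.fichier <- as.data.table(system(paste("hadoop fs -cat", file_Path), intern = TRUE))
  # Split each line on commas and bind the pieces into a character matrix
  return(do.call(rbind, stringr::str_split(data.fichier$V1, ',')))
}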

Its input is the HDFS directory containing the files, given as:

/hdfs/data/lll/l111/l11/l1/InterfacePublique-Controle-PUB_1EEUC-201803-PR-20181004-100228-indicateurs-PUB_1EEUC/*

Here is an example of the output:

  [,1]                                [,2]                      [,3] [,4]       [,5]                   
   [1,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_COE"                 ""   "819832"   "3.2664467021013293"   
   [2,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_COT"                 ""   "937680"   "3.7359870603079344"   
   [3,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_EMP"                 ""   "3797954"  "15.132142095005504"   
   [4,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_SOU"                 ""   "1327439"  "5.288899120540168"    
   [5,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_TIT"                 ""   "13849361" "55.17984119265992"    
   [6,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "002_COE"                 ""   "33716"    "0.13433425019766052"  
   [7,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "002_COT"                 ""   "31649"    "0.1260987271475192"   
   [8,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "002_EMP"                 ""   "158625"   "0.632007665132397"    

Any suggestions for optimizing this code?

Tags: r hdfs hadoop2

Solution
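
One possible direction, offered as a sketch and not a measured fix: the function in the question reads every line into R as a string, then splits each line with stringr and binds the pieces with rbind. data.table's fread accepts a cmd argument, so the output of hadoop fs -cat can be parsed as comma-separated data in one pass. The name import_file_fast and its reader parameter are hypothetical, introduced only for illustration. The sketch assumes the files are plain comma-separated text with no header row, as the sample output suggests, and that the installed data.table supports fread(cmd = ...).

```r
library(data.table)

# Hypothetical helper, not part of the original question.
# reader is the shell command that writes the file contents to stdout.
import_file_fast <- function(file_path, reader = "hadoop fs -cat") {
  fread(
    cmd = paste(reader, file_path),  # fread runs the command and parses its output
    sep = ",",                       # fields are comma-separated
    header = FALSE,                  # the sample output shows no header row
    colClasses = "character"         # keep every column as text, like the original matrix
  )
}
```

The result is a data.table with columns V1, V2, ... instead of a character matrix; as.matrix() on the result gives a matrix if one is required. Dropping colClasses lets fread detect numeric columns on its own. Whether this is actually faster depends on the file sizes and the cluster, so timing both versions with system.time() on real data is worthwhile.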

