首页 > 解决方案 > 使用唯一 id 平均分割 3 个数据帧

问题描述

我有三个表:upload, pap1, pap2) 每个表有 50 列和 15 万行,我想使用唯一的主键将三个数据帧拆分匹配的多个数据帧(其中每个子集最多 1000 行),例如,subset_upload1必须有相同的IDsubset_pap1等等subset_pap2...

 employee_id<-c(1,2,3)
 employee <- c('John','Peter ','Jolie')
 salary <- c(21000, 23400, 26800)
 startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))

 upload<- data.frame(employee_id,employee, salary, startdate)

 employee_id<-c(1,2,3)
 line_1<-c('address1','address2','address3')
 line_2<-c('address1','address2','address3')
 postcode<-c('postcode1','postcode2','postcode')

 pap1<-data.frame(employee_id,line_1,line_2,postcode)


 age<-c(57,43,23)
 Height<-c(150,170,190)
 gender<-c('M','M','F')
 enddate<-as.Date(c('2020-11-1','2020-3-25','2020-3-14'))

 pap2<-data.frame(employee_id,age,Height,gender,enddate)

我希望的结果是:

    subupload1<-data.frame(employee_id =1,employee = "John",salary=21000,startdate=as.Date('2010-11-1'))
   subpap1_1<-data.frame(employee_id=1,line_1='address1',line_2='address1',postcode='postcode1')
         subpap2_1<-data.frame(age=57,Height=150,gender='M',enddate=as.Date('202011-1'))

标签: rdataframedata-manipulation

解决方案


我们可以使用while循环n从 3 个数据帧中随机抽取唯一 id 并对其进行子集化,以创建新的数据帧。

n <- 1 #Number of unique primary key in one dataframe
remaining_ids <- unique(upload$employee_id)
counter <- 1

while(length(remaining_ids) > n) {
   ids <- sample(remaining_ids, n)
   assign(paste0("subupload_", counter), subset(upload, employee_id %in% ids))
   assign(paste0("subpap1_", counter), subset(pap1, employee_id %in% ids))
   assign(paste0("subpap2_", counter), subset(pap2, employee_id %in% ids))
   counter <- counter + 1
   remaining_ids <- setdiff(remaining_ids, ids)
}

assign(paste0("subupload_", counter),subset(upload, employee_id %in% remaining_ids))
assign(paste0("subpap1_", counter), subset(pap1, employee_id %in% remaining_ids))
assign(paste0("subpap2_", counter), subset(pap2, employee_id %in% remaining_ids))

但是,尝试使用列表来更好地处理/管理数据,而不是用大量对象污染全局环境。


如果我们想将所有这些都写成 csv,我们可以使用write.csv而不是assignlike :

while(length(remaining_ids) > n) {
  ids <- sample(remaining_ids, n)
  write.csv(subset(upload, employee_id %in% ids), paste0("subupload_", counter, ".csv"))
  write.csv(subset(pap1, employee_id %in% ids), paste0("subpap1_", counter, ".csv"))
  write.csv(subset(pap2, employee_id %in% ids), paste0("subpap2_", counter, ".csv"))
  counter <- counter + 1
  remaining_ids <- setdiff(remaining_ids, ids)
}
write.csv(subset(upload, employee_id %in% remaining_ids), paste0("subupload_", counter, ".csv"))
write.csv(subset(pap1, employee_id %in% remaining_ids), paste0("subpap1_", counter, ".csv"))
write.csv(subset(pap2, employee_id %in% remaining_ids), paste0("subpap2_", counter, ".csv"))

推荐阅读