Looping over a large dataset in R

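Problem description

I have a large dataset, and I would like to run some code on it piece by piece, because it is too large for my computer to process all at once.

Here is my code so far. My dataset has the columns gene, month, and count.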

library(dplyr)  # provides %>%, group_by(), summarize() and n()

df <- read.table(file = "/Users/x/x.txt",
                 header = TRUE, sep = ",", fill = TRUE, comment.char = "")

# Number of rows per gene
count_by_gene <-
  df %>%
  group_by(gene) %>%
  summarize(count = n())

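Because the dataset is so large, I cannot import it. Is there a way to process it chunk by chunk and create a separate table (count_by_gene) for each chunk?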

Tags: r

Solution


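Hi. From what you describe, you only need to load the gene column in order to count genes. For large data frames I would also point you to the data.table package, which reads and processes CSV files more efficiently. If the gene names are strings, loading them as a factor (stored as integers) will further reduce the memory footprint.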

library(data.table)

# fread is data.table's counterpart to read.table. It detects the separator
# automatically, but you can still pass it explicitly (e.g. sep = ",")
dt <- fread("yourfile.txt", select = "gene", stringsAsFactors = TRUE)

# here we group by "gene" values and compute for each group the 
# count (using .N pseudo variable)
# the empty comma at the beginning means that we want all lines
count_by_gene <- dt[, by="gene", list(count = .N)]
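The point about factors can be checked with a small experiment in base R. This is only a sketch: the gene names below are made up for illustration, and the exact sizes depend on your R build and on how many distinct genes the real data contains.

```r
# Hypothetical gene names, used only to illustrate the size difference
gene_names <- c("BRCA1", "TP53", "EGFR", "MYC")

genes_chr <- rep(gene_names, times = 25000)  # 100,000 strings
genes_fct <- factor(genes_chr)               # integer codes plus 4 levels

# Compare the memory taken by the two representations
print(object.size(genes_chr), units = "Kb")
print(object.size(genes_fct), units = "Kb")
```

On a typical 64-bit R installation the factor should come out smaller, because each element is stored as a 4-byte integer code rather than an 8-byte pointer to a string.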

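If it is still too big, and provided you can split the file into several chunks (for example using the hints from "Split a CSV file into smaller files but keep the header", since it seems you are using Linux), then you can merge the results with the following code: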

file_parts <- c("fic1.txt", "fic2.txt")  # ... list the remaining part files here

# compute counts for each part
parts_counts <- lapply(file_parts, function(file) {
  dt <- fread(file, select = "gene", stringsAsFactors = TRUE)
  dt[, by="gene", list(count = .N)]
})

# merge part counts in a single table
merged_parts_counts <- rbindlist(parts_counts)

# then total count is sum of part counts
gene_counts <- merged_parts_counts[, by="gene", list(count=sum(count))]
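If splitting the file on disk is inconvenient, the same "count per part, then sum" idea can also be applied inside R by reading a fixed number of lines at a time from an open connection. The following is a minimal sketch in base R, not part of the original answer. The function name `count_genes_in_chunks` and the `chunk_size` parameter are hypothetical, and it assumes a comma-separated file with a header row, a column named `gene`, and no quoted fields that span several lines.

```r
# Count rows per gene by streaming the file in fixed-size chunks (base R only).
# `count_genes_in_chunks` and `chunk_size` are hypothetical names for this sketch.
count_genes_in_chunks <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)

  # Read the header line once; scan() also strips quotes around column names
  header_line <- readLines(con, n = 1L)
  if (length(header_line) == 0L) stop("file is empty: ", path)
  header <- scan(text = header_line, what = "", sep = ",", quiet = TRUE)
  if (!"gene" %in% header) stop("no 'gene' column found in: ", path)

  # "NULL" tells read.csv to skip a column, so only `gene` is kept in memory
  col_classes <- ifelse(header == "gene", "character", "NULL")

  parts <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break        # end of file reached
    lines <- lines[nzchar(lines)]         # ignore blank lines
    if (length(lines) == 0L) next

    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      colClasses = col_classes)

    # Per-chunk counts as a small data frame with columns gene and count
    parts[[length(parts) + 1L]] <- as.data.frame(
      table(gene = chunk$gene), responseName = "count",
      stringsAsFactors = FALSE)
  }

  if (length(parts) == 0L) {
    return(data.frame(gene = character(0), count = integer(0)))
  }

  # The total count per gene is the sum of the per-chunk counts
  aggregate(count ~ gene, data = do.call(rbind, parts), FUN = sum)
}
```

A call such as `count_genes_in_chunks("yourfile.txt", chunk_size = 500000L)` would return one row per gene. Only one chunk of lines plus the small per-chunk count tables are held in memory at a time, at the cost of being slower than `fread`.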

It is also worth looking at hdd (Easy Manipulation of Out of Memory Data Sets), which looks like the kind of package you are after.

