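r - How to keep one row and remove the others when a condition is met
Problem description
I am working with taxonomic data, and I have brought my data to the last step before I can display it graphically. However, I need the rows to match certain conditions, and this is where I am stuck, because I don't want to do it manually.
My data: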
x <- data.frame("Phylum" = c("Chordata", "Chordata", "Chordata", "Chordata", "Chordata", "Chordata"),
"Class" = c("NA", "Actinopterygii", "Actinopterygii", "Actinopterygii", "Actinopterygii", "Actinopterygii"),
"Order" = c("NA", "NA", "Gadiformes", "Gadiformes", "Gadiformes", "Gadiformes"),
"Family" = c("NA", "NA", "NA", "Moridae", "Moridae", "Moridae"),
"Genus" = c("NA", "NA", "NA", "NA", "Notophycis", "Notophycis"),
"Species" = c("NA", "NA", "NA", "NA", "NA", "Notophycis marginata"),
Number = c(21616, 12123, 1497, 730,730,730))
Desired end result:
y <- data.frame("Phylum" = c("Chordata", "Chordata", "Chordata", "Chordata"),
"Class" = c("NA", "Actinopterygii", "Actinopterygii", "Actinopterygii"),
"Order" = c("NA", "NA", "Gadiformes", "Gadiformes"), "Family" = c("NA", "NA", "NA", "Moridae"),
"Genus" = c("NA", "NA", "NA", "Notophycis"), "Species" = c("NA", "NA", "NA", "Notophycis marginata"),
Number = c(9493, 10626, 767, 730))
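This is a simple subset example from a much larger and more complex dataset. So, if I could somehow put this into code:
- sum of Number (Phylum == "P1" & Class == "NA") - sum of Number (Class == "C1" & Order == "NA"), IF Phylum matches; this would equal the new Number for P1
- sum of Number (Class == "C1" & Order == "NA") - sum of Number (Order == "O1" & Family == "NA"), IF Class matches; this would equal the new Number for C1, etc...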
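However, if a Number matches multiple rows, I need code that evaluates those rows, picks the row with the fewest NAs, and keeps that Number...
I think I want to write a function to do this, but I have no idea where to start!
Thanks for the help :)
Update
Tester: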
Phylum        Class          Order             Family           Genus         Species                    Reads_sum
Chordata      Elasmobranchii Carcharhiniformes NA               NA            NA                         31
Chordata      Actinopterygii Perciformes       Scombridae       NA            NA                         589
Chordata      Elasmobranchii Carcharhiniformes Pentanchidae     NA            NA                         31
Chordata      Actinopterygii Myctophiformes    Myctophidae      Notoscopelus  NA                         208
Chordata      Actinopterygii Perciformes       Scombridae       Katsuwonus    NA                         589
Chordata      Actinopterygii Myctophiformes    Myctophidae      Notoscopelus  Notoscopelus caudispinosus 178
Chordata      Actinopterygii Perciformes       Scombridae       Katsuwonus    Katsuwonus pelamis         589
Cnidaria      Hydrozoa       Leptothecata      Plumulariidae    NA            NA                         69
Cnidaria      Hydrozoa       Leptothecata      Plumulariidae    Plumularia    NA                         69
Echinodermata Ophiuroidea    NA                NA               NA            NA                         146
Echinodermata Ophiuroidea    Ophiurida         NA               NA            NA                         137
Echinodermata Ophiuroidea    Ophiurida         Ophiuridae       NA            NA                         137
Echinodermata Ophiuroidea    Ophiurida         Ophiuridae       Ophioplinthus NA                         137
Echinodermata Ophiuroidea    Ophiurida         Ophiuridae       Ophioplinthus Ophioplinthus accomodata   137
Mollusca      Cephalopoda    Oegopsida         Ommastrephidae   NA            NA                         34311
Ochrophyta    Phaeophyceae   Ectocarpales      Acinetosporaceae NA            NA                         29
Code that does what I want, but where I have to change the variables every time:
Tester$Reads_sum[Tester$Class == "Ophiuroidea" & Tester$Order == "NA"] - sum(Tester$Reads_sum[Tester$Class == "Ophiuroidea" & Tester$Order != "NA" & Tester$Family == "NA"])
So I was hoping something like this would work, where I would only need to change Class to the other chosen taxonomic ranks:
for (i in unique(Tester$Class)){
Tester$Test.1 <- ifelse(Tester$Class != "NA" & Tester$Order == "NA",
Tester$Reads_sum[Tester$Class == i & Tester$Order == "NA"] - sum(Tester$Reads_sum[Tester$Class == i & Tester$Order != "NA" & Tester$Family == "NA"]), 0)
}
But it gives me an NA instead of 9.
The final data should look like this:
Phylum        Class          Order             Family           Genus         Species                    Reads_sum
Chordata      Elasmobranchii Carcharhiniformes Pentanchidae     NA            NA                         31
Chordata      Actinopterygii Myctophiformes    Myctophidae      Notoscopelus  NA                         30
Chordata      Actinopterygii Myctophiformes    Myctophidae      Notoscopelus  Notoscopelus caudispinosus 178
Chordata      Actinopterygii Perciformes       Scombridae       Katsuwonus    Katsuwonus pelamis         589
Cnidaria      Hydrozoa       Leptothecata      Plumulariidae    Plumularia    NA                         69
Echinodermata Ophiuroidea    NA                NA               NA            NA                         9
Echinodermata Ophiuroidea    Ophiurida         Ophiuridae       Ophioplinthus Ophioplinthus accomodata   137
Mollusca      Cephalopoda    Oegopsida         Ommastrephidae   NA            NA                         34311
Ochrophyta    Phaeophyceae   Ectocarpales      Acinetosporaceae NA            NA                         29
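Solution
Thanks for the update. I have come up with something that I think does what you are looking for, but it needs some checking on your side.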
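Am I right in thinking that this is tree-like data, ordered as c("Phylum", "Class", "Order", "Family", "Genus", "Species")? And that, for each level of the tree, you want to subtract the values of the level below?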
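To make that reading concrete, here is a minimal sketch using only the Number values from the small example in the question. It assumes a single chain (one child per level) and that the three duplicated 730 rows have already been collapsed into one; number and new_number are illustrative names, not part of the full code further down.

```r
# Number per level of the single chain: Phylum, Class, Order, deepest row
number <- c(21616, 12123, 1497, 730)

# subtract the value of the next level down; the deepest level has no child,
# so nothing (0) is subtracted from it
new_number <- number - c(number[-1], 0)
new_number
```

If this matches the Number column of the desired result y in the question (9493, 10626, 767, 730), then the interpretation above is the right one.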
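I hope my code is not too confusing; I found the data challenging to work with in its current format. I preferred to split it up into the levels of the tree, i.e. from the rows that only have Phylum data all the way to the rows that have every level of the tree. For that, I like to use the data.table package.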
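I use lapply's as much as possible, because I find them easy to interpret once you use them regularly. I am sure there are more efficient solutions, but for a beginner I think it is more important to know and understand the steps required.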
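As a small illustration of that split-then-lapply pattern (a toy sketch only; toy, pieces and sizes are made-up names), split() cuts a table into a list of smaller tables and lapply() runs the same function on each piece:

```r
# toy table: N plays the role of "how deep in the tree this row is"
toy <- data.frame(id = 1:5, N = c(1, 2, 2, 3, 3))

# one data frame per distinct value of N, returned as a named list
pieces <- split(toy, toy$N)

# apply the same function to every piece; here, count its rows
sizes <- lapply(pieces, nrow)
sizes
```

The code below follows the same shape, only with more work done inside the function passed to lapply().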
# using data.table package, as I find it quicker and easier to work with
# for complex problems. Run the commented-out command below if you don't have it
# install.packages("data.table")
library(data.table)
# turning in to a data.table, similar to data.frame, but some differences.
dt <- as.data.table(Tester)
# I am making an id, which I will use to split up this data. Different rows
# have different structures, as it's a tree structure, so I am going to break
# the data up
dt[, id := 1:.N]
# to do so I need to know the order of significance of the tree. I believe
# they go in this order:
col_structure <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")
# I want to find out at which level of the tree each row is, so I am going
# to change the shape from wide to long, and then do some row aggregation on
# the single column, to group
melt_dt <- melt(dt, id.vars = "id",
measure.vars = col_structure)
# tip: try not to use "NA", but instead NA, they have different structures
# and built in commands like is.na make them easier to differentiate
melt_dt[value == "NA", value := NA]
melt_dt <- melt_dt[!is.na(value)]
melt_dt[]
# using a data.table command .N, grouped by id, to find out how many non NA
# values there are, this will tell me where it is in the tree
group_ids <- melt_dt[, .N, by = id]
# Ok, so now I will split up each row in to where it sits in the tree
split_ids <- split(group_ids, group_ids$N)
split_ids
# pull out the number of levels of tree for easy use
levels <- seq_along(split_ids)
# merge back in the original data, so we have the same data at the start, but
# split up in to new sets. Makes it easier to think about the problem
split_dt <- lapply(levels, function(x){
out <- merge(split_ids[[x]], dt, by = "id")
N <- as.numeric(names(split_ids)[x])
# using keys in my data, to make easy extraction. means rather than do
# Phylum == "a" & Class == "b" later on, if Phylum & Class are the keys,
# then can use command J("a", "b"). See next stage
setkeyv(out, col_structure[1:N])
out
})
# Now I'm going to add the value in. I will look at the next level of the tree
# and remove the values from that level from the reads_sum. Try it with setting
# x = 1.
# I've removed bottom element of the tree, don't know what to do with them
split_dt_with_value <- lapply(levels[1:(length(levels)-1)], function(x){
# similar to for loop, but using data.table keys to extract data
out <- split_dt[[x]]
out$Test.1 <- out$Reads_sum - sapply(1:nrow(out), function(i){
sum(split_dt[[(x+1)]][J(out[i, key(out), with = FALSE])]$Reads_sum,
na.rm = TRUE)
})
out
})
# combine results, and with the bottom tree level
combined <- rbindlist(c(split_dt_with_value,
split_dt[max(levels)]),
fill = TRUE)
# turn it back in to data frame form
combined <- as.data.frame(combined)
combined
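The tip about "NA" versus NA can also be tried on its own in base R. The following is only a sketch on a three-row toy table (toy_tax, rank_cols and depth are illustrative names): it turns the "NA" strings into real missing values and then counts the non-missing ranks per row, which is the same "where in the tree is this row" number that the melt and .N steps compute.

```r
toy_tax <- data.frame(
  Phylum = c("Chordata", "Chordata", "Chordata"),
  Class  = c("NA", "Actinopterygii", "Actinopterygii"),
  Order  = c("NA", "NA", "Gadiformes"),
  stringsAsFactors = FALSE
)
rank_cols <- c("Phylum", "Class", "Order")

# replace the string "NA" with a real missing value in every rank column
toy_tax[rank_cols] <- lapply(toy_tax[rank_cols], function(col) {
  replace(col, col == "NA", NA)
})

# is.na() now works as intended; count the filled-in ranks per row
depth <- rowSums(!is.na(toy_tax[rank_cols]))
depth
```

Here the first row sits at depth 1 (Phylum only), the second at depth 2 and the third at depth 3.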
Please have a look, and let me know if any of the steps are confusing or if any of the logic is incorrect :)
Cheers, Jonny