Creating a new column conditionally, based on repeated values and where a breakpoint occurs

Problem description

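My data cover about 40 animals (id) located by telemetry, and I have defined 3 areas: AR is the breeding area, AM is the migration area, and AA is the feeding area. Every animal's first location is in AR. Sometimes an animal is still in the breeding period (in AR) but goes out into AM a few times and then comes back to AR. Migration only starts once the animal enters AM and stays there until it reaches the feeding area AA. So the animals start in AR, then migrate through AM, and then arrive at the feeding area AA.

I am trying to create a new column using conditions that I do not yet know how to write. For example, I have this data frame: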

id     area   
2304   AR
2304   AR
2304   AR
2304   AM  # this AM run, for example, can repeat up to 20 times before coming back to AR
2304   AM
2304   AR
2304   AR
2304   AR
2304   AM
2304   AM
2304   AM
2304   AM
2304   ...
2304   AM
2304   AM
2304   AM
2304   AA
2304   AA
2304   ...
2304   AA

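So, when there are x AR rows, followed by anywhere from one to 20 AM rows, and then AR again, I want the new column to say AR for those AM rows. From the moment there are x AM rows that are only AM, with no return to AR, I want the new column to say AM.

AA does not matter: AA always stays AA.

I expect this: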

id    area    fixed_area
2304   AR      AR
2304   AR      AR
2304   AR      AR
2304   AM      AR  # this AM run, for example, can repeat up to 20 times before coming back to AR
2304   AM      AR
2304   AR      AR
2304   AR      AR
2304   AR      AR
2304   AM      AM
2304   AM      AM
2304   AM      AM
2304   AM      AM
2304   ...     ...
2304   AM      AM
2304   AM      AM
2304   AM      AM
2304   AA      AA
2304   AA      AA
2304   ...    ...
2304   AA      AA

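I tried the code below, but AA is missing from the result. Maybe the problem is that this separation needs to be done for each animal (id):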

> table(df$area)

   AA    AM    AR 
31460 39101 28820 

> class(df$area)
[1] "character"
> idx <- with(rle(as.character(df$area)), rep(seq_along(lengths), lengths))
> df$fixed_area <- with(df, replace(area, idx < max(idx[area == 'AM']), 'AR'))
> table(df$fixed_area)

   AM    AR 
  145 99236 
> 

Below is the dput output of my data frame. It has more than 90,000 rows, so I only copied the head:

> dput(head(df))
structure(list(DeployID = c("111868_16", "111868_16", "111868_16", 
"111868_16", "111868_16", "111868_16"), Start = structure(c(1477323868, 
1477323946, 1477324002, 1477324044, 1477324260, 1477324480), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), End = structure(c(1477323944, 1477324000, 
1477324042, 1477324170, 1477324458, 1477324542), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), What = structure(c(1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("Dive", "Message", "Surface"), class = "factor"), 
    Shape = structure(c(2L, 4L, 3L, 2L, 2L, 2L), .Label = c("", 
    "Square", "U", "V"), class = "factor"), DepthMean = c(14.5, 
    16.5, 13, 14.5, 11, 12.5), DurationMean = c(76, 54, 40, 126, 
    198, 62), DepthMin = c(14.5, 16.5, 13, 14.5, 11, 12.5), DepthMax = c(14.5, 
    16.5, 13, 14.5, 11, 12.5), depth_range = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("shallow", "deep"), class = c("ordered", 
    "factor")), MidTime = structure(c(1477323906, 1477323973, 
    1477324022, 1477324107, 1477324359, 1477324511), class = c("POSIXct", 
    "POSIXt"), tzone = "GMT"), year = c(2016, 2016, 2016, 2016, 
    2016, 2016), id = c("111868_16", "111868_16", "111868_16", 
    "111868_16", "111868_16", "111868_16"), segmentid = c("111868_16", 
    "111868_16", "111868_16", "111868_16", "111868_16", "111868_16"
    ), mu.x = c(-4446545.25191192, -4446557.10576816, -4446565.77504969, 
    -4446580.81370994, -4446625.40007808, -4446652.29459533), 
    mu.y = c(-2305423.86124176, -2305461.88537725, -2305489.69364377, 
    -2305537.93137917, -2305680.93056743, -2305767.17264774), 
    lon = c(-39.9439956132156, -39.944102098218, -39.944179975699, 
    -39.9443150702825, -39.9447155964422, -39.9449571940013), 
    lat = c(-20.3985940756941, -20.3989161274532, -20.3991516537744, 
    -20.3995602097098, -20.4007713539709, -20.4015017842338), 
    lq_closest_filt = c(7L, 7L, 7L, 7L, 7L, 7L), dt_closest_filt = c(0.0516666666666667, 
    0.0702777777777778, 0.0838888888888889, 0.1075, 0.1775, 0.219722222222222
    ), dist_closest_filt = c(0.103680210832692, 0.141026573116106, 
    0.168339162761167, 0.215717097671267, 0.356168027785347, 
    0.440874049523752), rel.angle = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_), speed = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), depth_bin = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("(0,50]", "(50,100]", "(100,150]", 
    "(150,200]", "(200,250]", "(250,300]", "(300,350]", "(350,400]", 
    "(400,450]", "(450,500]", "(500,550]", "(550,600]", "(600,650]", 
    "(650,700]"), class = "factor"), bat = structure(list(depth = c(-59L, 
    -59L, -59L, -59L, -59L, -59L)), row.names = c(NA, 6L), class = "data.frame"), 
    area = c("AR", "AR", "AR", "AR", "AR", "AR")), row.names = c(NA, 
6L), class = "data.frame") 

Does anyone know how to solve this? Thanks!

Tags: r, conditional-statements, repeat, breakpoints, conditional-breakpoint

Solution


It sounds like you may need some rules to decide which AM rows become AR:

  • the number of consecutive AM rows is < 20
  • the following destination is not AA

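One approach is to use rle to add a column for each of these two rules. One column holds the lengths, that is, the number of consecutive values in the run each row belongs to. The other column holds the "next" area, which tells you whether the animal's destination is back to the breeding area or onward to the feeding area.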
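As a small illustration of what these two helper columns look like, here is a sketch on a toy vector for a single animal. The vector is a shortened, made-up version of the example in the question, and the names `runs`, `count`, and `next_area` are chosen here for illustration only.

```r
# Toy sequence for one animal (assumed data, shortened from the question)
area <- c("AR", "AR", "AR", "AM", "AM", "AR", "AR", "AR",
          "AM", "AM", "AM", "AA", "AA")

# Run-length encoding: one entry per block of identical consecutive values
runs <- rle(area)
runs$values   # "AR" "AM" "AR" "AM" "AA"
runs$lengths  # 3 2 3 3 2

# Length of the run that each row belongs to
count <- rep(runs$lengths, runs$lengths)

# Area of the run that follows each row's run (NA for the last run)
next_area <- rep(c(runs$values[-1], NA), runs$lengths)

data.frame(area, count, next_area)
```

The first AM block has count 2 and next_area "AR", so it qualifies as a short excursion. The second AM block is followed by "AA", so it is treated as the actual migration.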

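Finally, you can use a conditional statement to change AM to AR in the rows that meet all of the following conditions:

  • the current area is AM
  • the next area is not AA
  • the number of repeated values is less than 20

Here is the code: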

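# Run-length encode area (runs are computed over the whole data frame)
df_rle <- rle(df$area)
# next_area: value of the following run; count: length of the current run
df2 <- cbind(df, next_area = with(df_rle, rep(c(values[-1], NA), lengths)),
                 count = with(df_rle, rep(lengths, lengths)))
# Relabel short AM runs not followed by AA as AR; the is.na() guard leaves a
# trailing AM run (no next run) unchanged instead of turning it into NA
df2$area <- ifelse(with(df2, area == "AM" & !is.na(next_area) &
                               next_area != "AA" & count < 20),
                   "AR", df2$area)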
