首页 > 解决方案 > 非 equi 连接不保留原始列值

问题描述

我在运行非 equi 连接(来自 R 的 data.table 库)时发现了一个奇怪的行为,我不知道为什么会发生这种情况。

为什么在运行非等连接时,如果我想保留左表的原始值,我需要写x.colname而不是只colnamej连接的属性内?

这是我正在谈论的一个可重复的小示例:

library(tidyverse)
library(data.table)

# Setting seed for reproducibility
set.seed(666)

# data.table that contains roadway segments.
# The "frm_dfo" and "to_dfo" columns represent the start and end mileposts 
# of each roadway segment. For example, the segment with road_ID=101 refers 
# to the portion of IH20 that starts at milepost 10 and ends at milepost 20.
roads = data.table(road_id=101:109,
                   hwy=c('IH20','IH20','IH20','SH150','SH150','SH150','TX66','TX66','TX66'),
                   frm_dfo=c(10,20,30,10,20,30,10,20,30),
                   to_dfo=c(20,30,40,20,30,40,20,30,40),
                   seg_name=c('Seg 1','Seg 2', 'Seg 3','Seg 10','Seg 20', 'Seg 30','Seg 100','Seg 200', 'Seg 300'))

# data.table that contains crashes. 
# The "dfo" column represents the milepost of the roadway on which the 
# crash occurs. For example, the crash with crash_id=1 happens on milepost 33.23105 of IH20.
crashes = data.table(crash_id=1:30,
                     hwy=rep(c('IH20','SH150','BOB11'),each=10),
                     dfo=runif(min=10,max=40, n=30))

# Non-equi join that finds which segment each crash happens on.
joined_data_v1 = crashes %>%
                  .[roads, 
                    j  = list(crash_id, hwy, x.dfo, seg_name, frm_dfo, to_dfo),
                    on = list(hwy=hwy, dfo >= frm_dfo, dfo <= to_dfo)] %>%
                  arrange(crash_id, by_group = TRUE)

# Again, joining crashes and roadway segments. 
# Here, though, note that I've swapped x.dfo for just dfo inside the `j` argument 
joined_data_v2 = crashes %>%
                  .[roads, 
                    j  = list(crash_id, hwy, dfo, seg_name, frm_dfo, to_dfo),
                    on = list(hwy=hwy, dfo >= frm_dfo, dfo <= to_dfo)] %>%
                  arrange(crash_id, by_group = TRUE)

这是joined_data_v1x.dfoj参数中使用)的快照: 加入数据v1

这是joined_data_v2dfoj参数中使用)的快照: 加入数据v2

请注意,在 中joined_data_v1,被调用的列如何包含来自data.table的列x.dfo的确切值。但是,在 中,被调用的列包含来自 data.table 列的值(而不是来自data.table列的实际数据)。dfocrashesjoined_data_v2dfofrm_dforoadscrashesdfo

这里发生了什么?为什么这行为如此奇怪?为什么dfo/x.dfo结果 data.table 的列中包含的值并不总是准确地反映 data.table 的原始dfo列中包含的crashes内容?

我尝试查看非 equi 连接的一些文档,但在这里找不到任何可以帮助我的东西。

是一个相关的相关问题,但他们没有提到为什么会发生这种行为。

标签: rjoindata.tabletidyversenon-equi-join

解决方案


推荐阅读