首页 > 解决方案 > Why to use secondary index (`on`) in data.table, unless for joining?

问题描述

I'm a big fan of data.table, but have this confusion about use of on

Is this

A) flights["JFK", on = "origin"]

not the same as

B) flights["JFK" == origin]; # Or  flights["JFK" == get("origin")];

?

What would be the reason to use the former (A) instead of latter (B) ? In other words, if anyone can use dt[this == "that"], what was the reason to introduce another way of doing exactly the same with dt["that" , on = "this"]? Can't see the reason for this from the vignette.

PS. I do understand why on was introduced for merging files (like it in dtA[dtB, on=.(A2=B2)]). I used all the time and love it - as it makes the code so much shorter and easier to read, and is also very fast!

标签: rdata.table

解决方案


有很多优化。有趣的是,如果您将 order 切换到flights[origin == "JFK"]then 将在有足够的行时创建一个索引。以下是一些verbose = TRUE用于帮助查看正在发生的事情的选项:

library(data.table)
flights <- fread("vignettes/flights14.csv")

## Using binary merge method explicitly
invisible(flights["JFK", on = "origin", verbose = TRUE])

## i.V1 has same type (character) as x.origin. No coercion needed.
## forder.c received 253316 rows and 11 columns
## Calculated ad hoc index in 0.010s elapsed (0.000s cpu) 
## Starting bmerge ...
## forder.c received 1 rows and 1 columns
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 



## No optimization. Very quiet
invisible(flights["JFK" == origin, verbose = TRUE])



## Note an index is created and the message is similar to the first option
invisible(flights[origin == "JFK", verbose = TRUE])
## Creating new index 'origin'
## Creating index origin done in ... forder.c received 253316 rows and 11 columns
## forder took 0.09 sec
## 0.500s elapsed (0.570s cpu) 
## Optimized subsetting with index 'origin'
## forder.c received 1 rows and 1 columns
## forder took 0 sec
## x is already ordered by these columns, no need to call reorder
## i.origin has same type (character) as x.origin. No coercion needed.
## on= matches existing index, using index
## Starting bmerge ...
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

那么对于您的问题,为什么要使用on = join 方法?它将使用二进制合并来有效地查找匹配项和子集,当存在现有索引时,这可以超快。此外,on = 不会自动创建可能需要的索引。

相关的,通过额外的处理dt["a", on = "ID"]被翻译成dt[data.table(V1 = "a"), on = "ID"]帮助名称。换句话说,它只是用户方便到dtA[dtB, on=.(A2=B2)]您喜欢的更常见的。

你为什么要使用dt[this == "that"]代码是非常直接的 - R 中的任何人都会认识到正在发生的事情。此外,对于较大的 data.tables,将自动创建一个新索引,这可能是可取的。这可能是我将继续前进的代码。


推荐阅读