r - Why to use secondary index (`on`) in data.table, unless for joining?
问题描述
I'm a big fan of data.table
, but have this confusion about use of on
Is this
A) flights["JFK", on = "origin"]
not the same as
B) flights["JFK" == origin]; # Or flights["JFK" == get("origin")];
?
What would be the reason to use the former (A) instead of latter (B) ? In other words, if anyone can use dt[this == "that"], what was the reason to introduce another way of doing exactly the same with dt["that" , on = "this"]? Can't see the reason for this from the vignette.
PS. I do understand why on
was introduced for merging files (like it in dtA[dtB, on=.(A2=B2)]
). I used all the time and love it - as it makes the code so much shorter and easier to read, and is also very fast!
解决方案
data.table有很多优化。有趣的是,如果您将 order 切换到flights[origin == "JFK"]
then data.table将在有足够的行时创建一个索引。以下是一些verbose = TRUE
用于帮助查看正在发生的事情的选项:
library(data.table)
flights <- fread("vignettes/flights14.csv")
## Using binary merge method explicitly
invisible(flights["JFK", on = "origin", verbose = TRUE])
## i.V1 has same type (character) as x.origin. No coercion needed.
## forder.c received 253316 rows and 11 columns
## Calculated ad hoc index in 0.010s elapsed (0.000s cpu)
## Starting bmerge ...
## forder.c received 1 rows and 1 columns
## bmerge done in 0.000s elapsed (0.000s cpu)
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
## No optimization. Very quiet
invisible(flights["JFK" == origin, verbose = TRUE])
## Note an index is created and the message is similar to the first option
invisible(flights[origin == "JFK", verbose = TRUE])
## Creating new index 'origin'
## Creating index origin done in ... forder.c received 253316 rows and 11 columns
## forder took 0.09 sec
## 0.500s elapsed (0.570s cpu)
## Optimized subsetting with index 'origin'
## forder.c received 1 rows and 1 columns
## forder took 0 sec
## x is already ordered by these columns, no need to call reorder
## i.origin has same type (character) as x.origin. No coercion needed.
## on= matches existing index, using index
## Starting bmerge ...
## bmerge done in 0.000s elapsed (0.000s cpu)
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
那么对于您的问题,为什么要使用on =
join 方法?它将使用二进制合并来有效地查找匹配项和子集,当存在现有索引时,这可以超快。此外,on =
不会自动创建可能需要的索引。
相关的,通过额外的处理dt["a", on = "ID"]
被翻译成dt[data.table(V1 = "a"), on = "ID"]
帮助名称。换句话说,它只是用户方便到dtA[dtB, on=.(A2=B2)]
您喜欢的更常见的。
你为什么要使用dt[this == "that"]
代码是非常直接的 - R 中的任何人都会认识到正在发生的事情。此外,对于较大的 data.tables,将自动创建一个新索引,这可能是可取的。这可能是我将继续前进的代码。
推荐阅读
- ios - 创建一个airplay服务器nodejs
- python - 借助 Posh-SSH 模块 (W10) 在远程服务器上运行本地 python 脚本
- python - python 3.7.4:FLASK_SQLALCHEMY 没有名为“_sqlite3”的模块
- java - 每个数据库的 JPA 列定义更改
- angular - Angular:更改动态表的执行顺序
- macros - 通用构建枚举的一些错误
- css - CSS - 更改元素标签并替换其内容
- azure - 能否使用 Azure VM 设置网络带宽限制?
- angular - 尽管通过了审核,但 pwa 无法通过 chrome 安装
- python-2.7 - Python2.7-DB2v11.1.4.4 连接失败