Using str_split with a defined split pattern works on a sample (data.frame) but not on the full data (data.frame and data.table)

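Problem description

I am trying to split an address into addr1 (street address) and addr2 (unit); the split only needs to be roughly accurate. I built a "split pattern" value to identify unit numbers and tried to apply it to the address strings. Running the code below produces the error shown after it: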

split_patterns <- c(" APT "," STE "," UNIT "," # ")
split_patterns <- paste(split_patterns,collapse="|")

addsimple[c("addr1","addr2")] <- apply(str_split(addsimple$address,split_patterns,simplify=TRUE,n=2),2,str_trim)

Error in `[.data.table`(x, i, which = TRUE) : 
  When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

I tried to build a sample data set from a subset so that I could ask for advice here, and the sample works:

library(stringr)  # provides str_split() and str_trim()

address <- c("100 W 26TH ST", "100 W 26TH ST APT 7H", "11 PENN PLZ FL 6", "1170 BROADWAY", "1186 BROADWAY",
"1186 BROADWAY # 1003", "1200 BROADWAY", "1200 BROADWAY APT 3G", "125 W 31ST ST", "125 W 31ST ST APT 39F",
"126 W 34TH ST", "1261 BROADWAY", "130 W 29TH ST RM 500", "134 W 32ND ST", "151 W 26TH ST FL 3",
"154 W 27TH ST", "154 W 27TH ST RM 4W", "155 W 29TH ST", "165 W 26TH ST", "20 W 27TH ST")

df_address <- as.data.frame(address)

# Combine the unit markers into a single regex alternation: " APT | STE | UNIT | # "
split_patterns <- c(" APT ", " STE ", " UNIT ", " # ")
split_patterns <- paste(split_patterns, collapse = "|")

# Split each address into at most two pieces, trim whitespace, and store them as addr1 and addr2
df_address[c("addr1", "addr2")] <- apply(str_split(df_address$address, split_patterns, simplify = TRUE, n = 2), 2, str_trim)

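I am not allowed to post the full original data frame. But I suspect that the reason the original fails while the sample works is that the original's class is "data.table" "data.frame".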

> class(addsimple)
[1] "data.table" "data.frame"
> 
> class(df_address)
[1] "data.frame"

Is there any cost to recasting it as a data.frame?

Tags: r

Solution
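The error message suggests that data.table is interpreting the character vector in `addsimple[c("addr1","addr2")]` as a row selection (a join on `i`) rather than as column names, which would explain why the same line works on a plain data.frame. The following is a minimal sketch of two possible workarounds, not a confirmed answer from the original thread. The small `addsimple` table here is a hypothetical stand-in for the real data, which cannot be shared, and `str_split_fixed()` is swapped in for `str_split(..., simplify = TRUE)` only to make the matrix result explicit.

```r
library(data.table)
library(stringr)

# Hypothetical stand-in for the original table (the real data is not available).
addsimple <- data.table(address = c("100 W 26TH ST",
                                    "100 W 26TH ST APT 7H",
                                    "1186 BROADWAY # 1003",
                                    "130 W 29TH ST RM 500"))

# Same pattern as in the question: " APT | STE | UNIT | # "
split_patterns <- paste(c(" APT ", " STE ", " UNIT ", " # "), collapse = "|")

# Split each address into at most two pieces; the result is a character matrix
# with one row per address and two columns.
parts <- str_split_fixed(addsimple$address, split_patterns, n = 2)

# Keep an untouched copy so that both options can be shown side by side.
as_df <- copy(addsimple)

# Option 1: stay in data.table and add both columns by reference with `:=`.
addsimple[, c("addr1", "addr2") := list(str_trim(parts[, 1]),
                                        str_trim(parts[, 2]))]

# Option 2: convert to a plain data.frame by reference with setDF(),
# after which the data.frame-style assignment from the question applies.
setDF(as_df)
as_df[c("addr1", "addr2")] <- apply(parts, 2, str_trim)

print(addsimple)
print(class(as_df))
```

On the cost question: `setDF()` is documented as converting by reference, so it should avoid copying the data, whereas `as.data.frame()` generally returns a new object. If the rest of the workflow relies on data.table features, Option 1 avoids the conversion altogether.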
