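r - How to update one data.frame based on information from another data.frame
Question
I have two tables: Display and Review. The Review table contains information about reviews of products in an online store. Each row holds the date of a review, along with the cumulative number of reviews and the product's average rating as of that date.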
page_id<-c("1072659", "1072659" , "1072659","1072650","1072660","1072660")
review_id<-c("1761023","1761028","1762361","1918387","1761427","1863914")
date<-as.Date(c("2013-07-11","2013-08-12","2014-07-15","2014-09-10","2013-07-27","2014-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(1,2,3,1,1,2)
average_rating<-c(5,3.5,4,3,5,5)
Review<-data.frame(page_id,review_id,date,cumulative_No_reviews,average_rating)
page_id review_id date cumulative_No_reviews average_rating
1072659 1761023 2013-07-11 1 5
1072659 1761028 2013-08-12 2 3.5
1072659 1762361 2014-07-15 3 4
1072650 1918387 2014-09-10 1 3
1072660 1761427 2013-07-27 1 5
1072660 1863914 2014-08-12 2 5
The Display table captures data on customer visits to product pages.
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
Display<-data.frame(page_id,date)
page_id date
1072659 2013-07-10
1072659 2013-08-03
1072659 2015-02-11
1072650 2014-08-10
1072650 2014-09-09
1072660 2013-08-12
1072660 2014-09-12
1072660 2015-08-12
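I want to add two columns to the Display table (call the result Display2) that reflect the most recent review information at the time of each product page visit, as follows: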
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(0,1,3,0,0,1,2,2)
average_rating<-c(NA,5,4,NA,NA,5,5,5)
Display2<-data.frame(page_id,date,cumulative_No_reviews,average_rating)
page_id date cumulative_No_reviews average_rating
1072659 2013-07-10 0 NA
1072659 2013-08-03 1 5
1072659 2015-02-11 3 4
1072650 2014-08-10 0 NA
1072650 2014-09-09 0 NA
1072660 2013-08-12 1 5
1072660 2014-09-12 2 5
1072660 2015-08-12 2 5
I would appreciate your help.
Answer
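You can do this with a data.table join. Join the Review table to the Display table where the page_ids match and the Review date is earlier than the Display date. For some rows of Display, several rows of Review will match on these criteria, so we keep only the last match with mult = 'last'. Since Review is sorted by date, the last match is the one with the most recent date.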
library(data.table) # 1.12.6 for nafill (used below)
setDT(Display)
setDT(Review)
Display2 <- Review[Display, on = .(page_id, date < date), mult = 'last']
Display2
# page_id review_id date cumulative_No_reviews average_rating
# 1: 1072659 <NA> 2013-07-10 NA NA
# 2: 1072659 1761023 2013-08-03 1 5
# 3: 1072659 1762361 2015-02-11 3 4
# 4: 1072650 <NA> 2014-08-10 NA NA
# 5: 1072650 <NA> 2014-09-09 NA NA
# 6: 1072660 1761427 2013-08-12 1 5
# 7: 1072660 1863914 2014-09-12 2 5
# 8: 1072660 1863914 2015-08-12 2 5
This output now almost matches what you showed in the question. We only need to drop the review_id column and replace the NAs in the cumulative_No_reviews column with 0.
Display2[, review_id := NULL]
Display2[, cumulative_No_reviews := nafill(cumulative_No_reviews, fill = 0)][]
# page_id date cumulative_No_reviews average_rating
# 1: 1072659 2013-07-10 0 NA
# 2: 1072659 2013-08-03 1 5
# 3: 1072659 2015-02-11 3 4
# 4: 1072650 2014-08-10 0 NA
# 5: 1072650 2014-09-09 0 NA
# 6: 1072660 2013-08-12 1 5
# 7: 1072660 2014-09-12 2 5
# 8: 1072660 2015-08-12 2 5
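For comparison, here is a minimal base R sketch of the same lookup without data.table. It is not part of the answer above. The helper latest_review and the result name Display2_base are hypothetical names introduced for illustration, and the sketch does not rely on Review being sorted by date because it picks the latest matching row explicitly.

```r
# Rebuild the question's two tables as plain data.frames (review_id omitted).
Review <- data.frame(
  page_id = c("1072659", "1072659", "1072659", "1072650", "1072660", "1072660"),
  date = as.Date(c("2013-07-11", "2013-08-12", "2014-07-15",
                   "2014-09-10", "2013-07-27", "2014-08-12")),
  cumulative_No_reviews = c(1, 2, 3, 1, 1, 2),
  average_rating = c(5, 3.5, 4, 3, 5, 5),
  stringsAsFactors = FALSE
)
Display <- data.frame(
  page_id = c("1072659", "1072659", "1072659", "1072650",
              "1072650", "1072660", "1072660", "1072660"),
  date = as.Date(c("2013-07-10", "2013-08-03", "2015-02-11", "2014-08-10",
                   "2014-09-09", "2013-08-12", "2014-09-12", "2015-08-12")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row index in Review of the latest review posted
# strictly before visit i on the same page, or NA if there is none.
latest_review <- function(i) {
  hits <- which(Review$page_id == Display$page_id[i] &
                  Review$date < Display$date[i])
  if (length(hits) == 0) return(NA_integer_)
  hits[which.max(as.numeric(Review$date[hits]))]
}
idx <- vapply(seq_len(nrow(Display)), latest_review, integer(1))

# Copy the matched review columns across; visits with no earlier review
# get a count of 0 and keep NA as their average rating.
Display2_base <- Display
Display2_base$cumulative_No_reviews <- Review$cumulative_No_reviews[idx]
Display2_base$cumulative_No_reviews[is.na(idx)] <- 0
Display2_base$average_rating <- Review$average_rating[idx]
Display2_base
```

This loops over every visit and scans Review each time, so it is likely to be much slower than the data.table join on large tables. It is meant only to make the matching rule explicit.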