"Arguments must have same length" error in R

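Question

I am trying to create a key-value store in which the keys are entities and the values are the average sentiment scores of those entities across news articles.

I have a data frame containing news articles, and a list called organization1 of the entities that a classifier identified in those articles. The first entry of organization1 holds the entities identified in the article in the first row of the news_us data frame. I am trying to iterate over the organization list and create a key-value store in which the keys are the entity names in organization1 and the values are the sentiment scores of the news descriptions that mention each entity.

I can get the sentiment score for an entity from a single article, but I want to add the scores together and average them per entity.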

library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization", 
    NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC", 
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN", 
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC", 
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat", 
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR", 
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.", 
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.", 
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.", 
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.", 
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?", 
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org", 
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L, 
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L, 
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA, 
6L), class = "data.frame")

setNames(lapply(news_us$description, get_sentiment), unlist(organization1))

#$US
#[1] 0

#$Bath
#[1] -0.4

#$`Animal Crossing`
#[1] -0.1

#$`World Health Organization`
#[1] 1.1

#$Microsoft
#[1] -0.6

#$Facebook
#[1] -1.9

tapply(sapply(news_us$description, get_sentiment), unlist(organization1), mean) #this line throws the error

Tags: r, data-science, lapply, sentiment-analysis, tapply

Answer


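Your problem seems to come from using unlist. Avoid it here, because it drops NULL values and flattens list entries that hold several values into separate elements. Your organization1 list has 7 entries (two of them NULL, and one of length 2). To match the news_us data.frame it should have 6 entries, one per row, so something is out of sync there.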
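As a quick check of that claim, the sketch below rebuilds the organization1 list from the question and compares its length before and after unlist. It uses only base R; the variable name flat is introduced here for illustration.

```r
# Same list as in the question: 7 entries, two NULL, one with two names.
organization1 <- list(NULL, "US", "Bath", "Animal Crossing",
                      "World Health Organization", NULL,
                      c("Microsoft", "Facebook"))

length(organization1)    # number of list entries
lengths(organization1)   # size of each entry; a NULL entry counts as 0

# unlist() drops the NULL entries and splits the two-name entry in two,
# so positions no longer correspond to rows of news_us.
flat <- unlist(organization1)
length(flat)
```

With this sample the flattened vector happens to have as many elements as news_us has rows, which would explain why the setNames call in the question runs without complaint even though the names no longer line up with the right descriptions.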

Let's assume the first 6 entries of organization1 are correct. I would bind them to your data.frame to avoid further "sync errors":

news_us$organization1 = organization1[1:6]
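To see what this assignment produces, here is a minimal sketch on a toy data.frame (df, id and orgs are hypothetical names, not from the question). It suggests that a list assigned with $ becomes a list column that keeps exactly one entry per row, including NULL entries and entries with several values.

```r
# Toy data.frame with three rows, standing in for news_us.
df <- data.frame(id = 1:3)

# Assign a list with one entry per row; entries may be NULL or hold
# more than one value.
df$orgs <- list(NULL, "US", c("Microsoft", "Facebook"))

nrow(df)                               # row count is unchanged
lengths(df$orgs)                       # entities detected per row
vapply(df$orgs, is.null, logical(1))   # rows with no detected entity
```

Because the entities now travel with their rows, subsetting or reordering the data.frame cannot shift them against the descriptions.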

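Then you need to run the sentiment analysis on each row of the data.frame and bind the result to the organization1 value(s). The code below may not be the most elegant way to do this, but I think it does what you need: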

results = do.call("rbind", apply(news_us, 1, function(item){
    if(!is.null(item$organization1[[1]])) cbind(item$organization1, get_sentiment(item$description))
}))

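This code drops any row for which no organization1 value was detected. If more than one organization was detected for a row, it should also repeat the sentiment score for each of them. The result looks like this (which I believe is what you are after):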

     [,1]                        [,2]  
[1,] "US"                        "-0.4"
[2,] "Bath"                      "-0.1"
[3,] "Animal Crossing"           "1.1" 
[4,] "World Health Organization" "-0.6"

You can then collapse these to a mean score per organization using by, aggregate, or something similar.

[Edit: examples with by and aggregate]

# results is a character matrix, so index its columns by position rather than with $
by(as.numeric(results[, 2]), results[, 1], mean)
aggregate(as.numeric(results[, 2]), list(results[, 1]), mean)
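An alternative sketch, not part of the original answer, builds a long data.frame directly with rep and lengths and then averages per organization. To keep it runnable without syuzhet, the scores vector is copied from the output shown in the question; in real use it would presumably come from get_sentiment(news_us$description). The names scores, orgs, long and means are introduced here for illustration.

```r
# One score per row of news_us, copied from the question's output.
scores <- c(0, -0.4, -0.1, 1.1, -0.6, -1.9)

# First six entries of organization1, one per row of news_us.
orgs <- list(NULL, "US", "Bath", "Animal Crossing",
             "World Health Organization", NULL)

# Repeat each score once per detected organization. Rows with NULL have
# length 0 and drop out, which matches what unlist() does to orgs.
long <- data.frame(
  organization = unlist(orgs),
  score = rep(scores, lengths(orgs)),
  stringsAsFactors = FALSE
)

# Mean score per organization.
means <- aggregate(score ~ organization, data = long, FUN = mean)
means
```

Since rep and unlist both expand by the same per-row lengths, the two columns stay aligned even when a row mentions several organizations, and the scores stay numeric instead of being coerced to character by cbind.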
