How to extract the longest string between occurrences of a pattern in HTML [in R]

Question

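I am extracting text from the HTML of a series of articles, but I still need to get the articles into a format I can work with. Specifically, I want to find the longest string between occurrences of a pattern ("\n").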

The code I am currently using is:

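library(newsanchor)  # get_everything()
library(htm2txt)
library(RCurl)       # getURL()
library(XML)         # htmlParse(), xpathSApply(), xmlValue()

# Fetch article metadata; the article URLs are in results_df$url.
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
test$txt <- NA_character_

for (i in seq_len(nrow(test))) {
  tryCatch({
    html <- getURL(test$url[i], followlocation = TRUE)
    doc <- htmlParse(html, asText = TRUE)
    # Take the text of every <p> element and join the pieces with newlines.
    plain.text <- xpathSApply(doc, "//p", xmlValue)
    test$txt[i] <- paste(plain.text, collapse = "\n")
  }, error = function(e) {
    # txt stays NA for pages that fail; report why instead of failing silently.
    message("Article ", i, " failed: ", conditionMessage(e))
  })
  print(i)
}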

The result looks like this:

[1] "EDITION\nUS President Donald Trump has made his first meaningful remarks on the Huawei firestorm since his administration blacklisted the Chinese tech giant last week.\nThe president was speaking at a news conference announcing a $US16 billion aid package for farmers caught up in the China trade war when he addressed Huawei, which has been placed on a list that means US firms need permission to do business with the Chinese company.\nTrump started out by saying that Huawei poses a huge security threat to the US. US officials have long floated suspicions that Huawei acts as a conduit for Chinese surveillance.\n“Huawei is something that’s very dangerous. You look at what they have done from a security standpoint, from a military standpoint, it’s very dangerous,” the president told reporters.\n  Read more: Here are all the companies that have cut ties with Huawei, dealing the Chinese tech giant a crushing blow\nHe then immediately switched gears to suggest that Huawei could form part of a trade deal with America and China. “So it’s possible that Huawei even would be included in some kind of a trade deal. If we made a deal, I could imagine Huawei being possibly included in some form,” he said.\n\"Huawei is very dangerous,\" Trump says, adding that an exception for the company could be made in a trade deal with China pic.twitter.com/TFlClewBNt\n— TicToc by Bloomberg (@tictoc) May 23, 2019\n\nTrump: “Huawei is something that’s very dangerous. You look at what they have done from a security standpoint, from a military standpoint, it’s very dangerous. So, it’s possible that Huawei even would be included in some kind of a trade deal. If we made a deal, I could imagine Huawei being possibly included in some form of, or some part of a trade deal.”\nJournalist: “How would that look?”\nTrump: “It would look very good for us.”\nJournalist: 

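I want to get the most important part, the actual article. I am not sure how best to do this, but I think it could be done by finding the longest string between two occurrences of "\n". Can anyone help with this, or suggest a better approach?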

Tags: r, string

Answer


Edit: @user101 pointed out that nchar is vectorized. Here is a more efficient solution:

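# i holds the text of one article, e.g. test$txt[1]
splitarticle <- unlist(strsplit(i, "\n"))
# nchar() is vectorized; which.max() returns the index of the longest piece
splitarticle[which.max(nchar(splitarticle))]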
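To make the steps concrete, here is a minimal sketch on a made-up string. The `article` text and the names `paragraphs`, `para_lengths` and `longest` are invented for illustration and are not part of the original answer.

```r
# Toy article text; the content is made up for illustration.
article <- "EDITION\nA short teaser.\nThis is the long body paragraph of the article.\nRead more"

# Split on literal newlines; fixed = TRUE avoids treating the pattern as a regex.
paragraphs <- strsplit(article, "\n", fixed = TRUE)[[1]]

# nchar() is vectorized, so it returns one length per paragraph.
para_lengths <- nchar(paragraphs)

# which.max() returns the index of the first maximum.
longest <- paragraphs[which.max(para_lengths)]
print(longest)
```

Note that this returns a single paragraph. If the article body spans several `<p>` elements, only the longest of them is kept.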

Unless I have misunderstood what you are trying to do, something like this might work.

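# i holds the text of one article, e.g. test$txt[1]
splitarticle <- unlist(strsplit(i, "\n"))
# Length of each piece, computed one element at a time
lengths <- unlist(lapply(splitarticle, nchar))
# match() returns the position of the first piece with the maximum length
splitarticle[match(max(lengths), lengths)]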
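Both snippets work on one string at a time. One way to apply the idea to a whole column such as `test$txt` is sketched below. The helper `longest_paragraph` and the sample vector `txt` are hypothetical and only illustrate the approach. The sketch assumes failed downloads are stored as `NA`, as in the question's loop.

```r
# Hypothetical helper: longest newline-delimited piece of a single string.
longest_paragraph <- function(txt) {
  # Pages that failed to download are NA; pass that through unchanged.
  if (is.na(txt)) return(NA_character_)
  parts <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  # Guard against input that yields no pieces at all.
  if (length(parts) == 0) return("")
  parts[which.max(nchar(parts))]
}

# Made-up stand-in for test$txt.
txt <- c("Menu\nFirst article body text.\nFooter", NA, "Only one paragraph here")

# vapply() enforces that every result is a single character string.
longest <- vapply(txt, longest_paragraph, character(1), USE.NAMES = FALSE)
print(longest)
```

With the question's data frame, the equivalent call would be `vapply(test$txt, longest_paragraph, character(1), USE.NAMES = FALSE)`.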
