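r - For-loop alternative needed to speed up a working script
Problem description
I already have this working but would like to optimize it. Extracting the article data takes a long time because my approach uses a for loop. I have to go row by row, and each row takes a little over a second to run. My actual dataset has about 10,000 rows, so the whole job takes a very long time. Is there a way to extract the full text other than a for loop? I apply the same method to every row, so I am wondering whether R has a function for this that works like multiplying a column by a number, which is very fast.
Create the dummy dataset: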
date<- as.Date(c('2020-06-25', '2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25'))
text <- c('Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays',
'GMRC now a law; to be integrated in school curriculum',
'QC to impose stringent measures to screen applicants for PWD ID',
'‘Baka kalaban ka:’ Cops intimidate dzBB reporter',
'Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so',
'PNP records highest single-day COVID-19 tally as cases rise to 579',
'IBP tells new lawyers: ‘Excel without sacrificing honor’',
'Senators express concern over DepEd’s preparedness for upcoming school year',
'Angara calls for probe into reported spread of ‘fake’ PWD IDs',
'Grab PH eyes new scheme to protect food couriers vs no-show customers')
link<- c('https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays',
'https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum',
'https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id',
'https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter',
'https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so',
'https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579',
'https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor',
'https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year',
'https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids',
'https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers')
df<-data.frame(date, text, link)
The dummy dataset:
df
date text link
1 2020-06-25 Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays
2 2020-06-25 GMRC now a law; to be integrated in school curriculum https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum
3 2020-06-25 QC to impose stringent measures to screen applicants for PWD ID https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id
4 2020-06-25 ‘Baka kalaban ka:’ Cops intimidate dzBB reporter https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter
5 2020-06-25 Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so
6 2020-06-25 PNP records highest single-day COVID-19 tally as cases rise to 579 https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579
7 2020-06-25 IBP tells new lawyers: ‘Excel without sacrificing honor’ https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor
8 2020-06-25 Senators express concern over DepEd’s preparedness for upcoming school year https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year
9 2020-06-25 Angara calls for probe into reported spread of ‘fake’ PWD IDs https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids
10 2020-06-25 Grab PH eyes new scheme to protect food couriers vs no-show customers https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers
Code that fetches the article data for each link:
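library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports the %>% pipe

now <- Sys.time()
for (i in 1:nrow(df)) {
  # Download the page and collapse all article paragraphs into one string
  test_article <- read_html(df[i, 3]) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()
  df[i, 4] <- test_article  # store the full text in a fourth column
  print(paste(i, "/", nrow(df), sep = ""))  # progress indicator
}
finish <- Sys.time()
finish - now  # elapsed time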
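So for just 10 articles it took about 10 seconds, which I think is really long. I would like to know whether there is a faster way to do this.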
Solution
You can parallelize the loop:
library(foreach)
library(doParallel)  # also attaches parallel (detectCores(), makeCluster())

# Set up a parallel backend that uses several cores
cores <- detectCores()
cl <- makeCluster(cores[1] - 1)  # leave one core free so the computer is not overloaded
registerDoParallel(cl)
now <- Sys.time()
# Each iteration returns a one-row data frame; .combine = rbind stacks them
result <- foreach(i = 1:nrow(df), .combine = rbind, .packages = c('dplyr', 'rvest')) %dopar% {
  test_article <- read_html(df[i, 3]) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()
  data.frame(test_article = test_article, ID = paste(i, "-", nrow(df), sep = ""))
}
finish <- Sys.time()
finish - now  # elapsed time
# Stop the cluster
stopCluster(cl)
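Note that you cannot write to the original data frame from inside the foreach loop, because each task runs in a separate environment (a separate worker process). Instead, each iteration returns a value, and .combine assembles those values into result.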
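As a minimal sketch of that last point, the results can be attached to the data frame in the parent session after the loop finishes. The data frame demo_df and the toupper() call are stand-ins invented for this illustration so that it runs without network access; in the real script the loop body would be the read_html() pipeline shown above. It relies on foreach returning results in iteration order, which is its default behaviour.

library(foreach)
library(doParallel)

# Hypothetical stand-in for df; only the link column matters here
demo_df <- data.frame(link = c("a", "b", "c"), stringsAsFactors = FALSE)

cl <- makeCluster(2)
registerDoParallel(cl)

# Each task returns a value; nothing is assigned to demo_df inside the loop
articles <- foreach(i = seq_len(nrow(demo_df)), .combine = c) %dopar% {
  # Placeholder for the scraping step (read_html(...) %>% ... %>% toString())
  toupper(demo_df$link[i])
}

stopCluster(cl)

# Attach the collected results once, back in the parent session
demo_df$article <- articles

Returning a plain character vector with .combine = c keeps the combine step simple; the rbind version above is the better fit when each iteration yields several columns.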