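r - How do you scrape multiple pages from the same website in RStudio?
Problem description
I want to use RStudio to download data from multiple pages of the same website: https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2
The only difference between page 2 and page 3 is that the hyperlink ends in 3 instead of 2. I can get what I need for the 25 jobs on one page, but I want to get 100 jobs from 4 pages. I am using the SelectorGadget Chrome extension.
I tried a for loop: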
for (page_result in seq(from = 1, to = 101, by = 25)) {
  link = paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2")
  page = read_html(link)
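I don't know what to do next.
I think I need to put page_result into the link, but I don't know where. I have the rvest and dplyr packages, and I want the for loop to go through every page. Any ideas on how best to do this are welcome. Thanks.
Solution
The four page links can easily be generated in a for loop. Copy the CSS selector of a job title from the DOM and iterate its nth-child index from 5 to 30 to collect the 25 jobs on each page. That range covers 26 positions, so any position that is not a job card returns NA.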
library(rvest)  # provides read_html(), html_node(), html_text() and %>%

AllJOBS <- character()
for (i in 1:4) {
  # Build the URL of results page i and download that page once
  url <- paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=", i)
  page <- read_html(url)
  for (k in 5:30) {
    # k is the nth-child index of one result block; a non-matching index yields NA
    jobs <- page %>% html_node(css = paste0("#page > div.container > div.column-wrap.order-one-two > div.two-thirds > div:nth-child(", k, ") > div > div.job-result-logo-title > div.job-result-title > h2 > a")) %>% html_text()
    AllJOBS <- append(AllJOBS, jobs)
  }
  Sys.sleep(runif(1, 1, 2))  # pause 1 to 2 seconds between page requests
  print(paste0("Page", i))
}
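As an alternative to indexing each result with nth-child, here is a rough sketch that selects all title links on a page in one call. The helper names (build_page_urls, extract_titles, scrape_titles) are hypothetical, and the shortened selector "div.job-result-title > h2 > a" is an assumption derived from the tail of the selector above; it has not been checked against the live site.

```r
library(rvest)

# Base search URL taken from the question; the page number is appended.
base_url <- paste0(
  "https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data",
  "&autosuggestEndpoint=%2fautosuggest&Location=0&Category=",
  "&Recruiter=Company&btnSubmit=Search&Page="
)

# Hypothetical helper: build the URLs for a set of result pages.
build_page_urls <- function(pages) paste0(base_url, pages)

# Hypothetical helper: pull every job title out of one parsed page.
# Assumption: this shortened selector matches only the job-title links.
extract_titles <- function(page) {
  page %>%
    html_nodes("div.job-result-title > h2 > a") %>%
    html_text(trim = TRUE)
}

# Hypothetical helper: download each page once and collect the titles.
scrape_titles <- function(pages, pause = 1) {
  titles <- character()
  for (url in build_page_urls(pages)) {
    titles <- c(titles, extract_titles(read_html(url)))
    Sys.sleep(pause)  # wait between requests to go easy on the server
  }
  titles
}

# Not run here because it needs network access:
# job_titles <- scrape_titles(1:4)
```

If the shortened selector holds, this should return only real titles, so no position on the page has to be guessed and no NA entries are produced.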
Output of the nested loop:
> AllJOBS
[1] "Senior Consultant - Fund Static Data"
[2] "Data Warehouse Engineer"
[3] "Senior Software Engineer - Big Data DevOps"
[4] "HR Data Analyst"
[5] "Data Insights Engineer - Dublin - Permanent/Contract - SQL Server"
[6] NA
[7] "Data Engineer - Master Data Services - SQL Server - Permanent/Contract"
[8] "Senior Data Protection Officer (DPO) - Contract"
[9] "QC Data Analyst (Trending)"
[10] "Senior Data Warehouse Developer"
[11] "Senior Data Analyst FTC"
[12] "Compliance Advisory and Data Protection Relationship Manager"
[13] "Contracts Manager-Data Center"
[14] "Payments Product Data Analyst"
[15] "Data Center Product Hardware Platform Engineer"
[16] "People Data Privacy Program Lead"
[17] "Head of Data Science"
[18] "Data Protection Counsel (Product or Compliance)"
[19] "Data Engineer, GMS"
[20] "Data Protection Associate General Counsel"
[21] "Senior Data Engineer"
[22] "Geospatial Data Scientist"
[23] "Data Solutions Manager"
[24] "Data Protection Solicitor"
[25] "Junior Data Scientist"
[26] "Master Data Specialist"
[27] "Temp QC Electronic Data Management Analyst"
[28] "20725 -Data Scientist - Limerick"
[29] "Technical Support Specialist - Data Centre"
[30] "Lead QC Micro Analyst (data review and compliance)"
[31] "Temp QC Data Analyst"
[32] "#Abbvie Compliance Engineer (Data Integrity)"
[33] "People Data Analyst"
[34] "Senior Electrical Design Engineer - Data Centre Ex"
[35] "Laboratory Data Entry Assistant, UCD NVRL"
[36] "Data Migrations Specialist"
[37] "Data Protection Officer"
[38] "Data Center Operations Engineer (Linux)"
[39] "Senior Electrical Engineer | Data Centre LV Design"
[40] "Data Scientist - (Process Sciences)"
[41] "Mgr Supply Logistics Global Materials Data"
[42] "Data Protection / Privacy Delivery Consultant"
[43] "Global Supply Chain Data Analyst"
[44] "QC Data Analyst"
[45] "0582GradeVIIFOIOLOL1120 - Grade VII Data Protection / Freedom of Information & Compliance Officer"
[46] "DPO001 - Deputy Data Protection Officer (General Manager) Office of the Head of Data Protection, HSE"
[47] "Senior Campaign Data Analyst"
[48] "Data & Reporting Analyst II"
[49] "Azure Data Analytics Solution Architect"
[50] "Head of Risk Assurance for IT, Data, Projects and Outsourcing"
[51] "Trainee Data Technician, Ireland"
[52] NA
You can handle the NAs separately. Does this answer your question, or did I misunderstand it?
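One simple way to handle the NA entries, sketched with base R only. The vector all_jobs is a toy stand-in for AllJOBS that reuses two titles from the output above; the variable names are illustrative.

```r
# Toy vector standing in for AllJOBS; titles copied from the output above.
all_jobs <- c("HR Data Analyst", NA, "Senior Data Engineer", NA)

# Positions of the missing entries, useful for checking which
# nth-child indices did not match a job title.
missing_idx <- which(is.na(all_jobs))

# Keep only the entries that actually contain a job title.
clean_jobs <- all_jobs[!is.na(all_jobs)]

print(missing_idx)
print(clean_jobs)
```

Applied to the real result, AllJOBS[!is.na(AllJOBS)] would drop the empty positions while keeping the remaining titles in their original order.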