javascript - PuppeteerCrawler: multi-user login and scraping
Problem description
I am using Apify with PuppeteerCrawler to scrape pages for multiple users. I have to log each user into the system, scrape 5 pages, then log out and continue with the next user.
What is the best approach - calling the crawler once per user, or calling it only once and having it handle login/logout itself?
I am extending the example at https://sdk.apify.com/docs/examples/puppeteercrawler and running it in the Apify cloud. Currently I am modifying the request.userData object and adding a "login" label to it, so the login case can be handled first. After logging in, the relevant 5 pages to scrape are enqueued.
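The label-based routing described above can be sketched as follows. The helper name `buildPageRequests` and the label values are illustrative, not part of the Apify SDK; only `request.userData` and `uniqueKey` are real SDK concepts:

```javascript
// A minimal sketch of label-based routing (names are illustrative).
// After a successful login, the 5 pages for that user are enqueued
// with a 'PAGE' label; handlePageFunction then dispatches on
// request.userData.label ('LOGIN' vs. 'PAGE').

// Build the requests to enqueue for one user after login.
function buildPageRequests(user, urls) {
  return urls.map((url) => ({
    url,
    uniqueKey: `${url}_${user}`, // keep per-user copies from being deduplicated
    userData: { label: 'PAGE', user },
  }));
}

// Inside handlePageFunction you would then branch:
// if (request.userData.label === 'LOGIN') { /* log in, enqueue buildPageRequests(...) */ }
// else if (request.userData.label === 'PAGE') { /* scrape */ }
```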
Solution
I would say both options are valid. Having multiple crawlers is certainly simpler, although doing everything in one crawler can be more efficient (you can handle all users at once). I would suggest starting with the first option until you get a better feel for how to handle the second one properly.
The version I present here is the simplest, as it assumes the pages you access automatically redirect to the login page and back. If that is not the case, you just need to handle it with labels.
// Let's assume you have some object with your users.
// This may in fact be loaded from input or somewhere else but for simplicity, let's define it right away
const users = {
stevejobs: {
credentials: {
username: 'stevejobs@gmail.com',
password: '123',
},
cookies: null, // Cookies can be also loaded so you can use persistent login
},
billgates: {
credentials: {
username: 'billgates@gmail.com',
password: '123',
},
cookies: null,
},
// etc...
};
const myUrls = ['https://resource1.com', 'https://resource2.com']; // replace with real URLs
// initialize request queue
const requestQueue = await Apify.openRequestQueue();
// Enqueue every user's pages into a single queue; the handlers below
// look up the user from request.userData
for (const user of Object.keys(users)) {
    // enqueue some pages
    for (const url of myUrls) {
        await requestQueue.addRequest({
            url,
            uniqueKey: `${url}_${user}`, // otherwise the queue would dedup them
            userData: { user }, // carry the user so the handlers know whose session this is
        });
    }
}
const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    gotoFunction: async ({ page, request }) => {
        const { user } = request.userData;
        // if you have cookies for this user, simply add them to the page
        const { cookies } = users[user];
        if (cookies) {
            await page.setCookie(...cookies);
        }
        return page.goto(request.url);
    },
    handlePageFunction: async ({ page, request }) => {
        const { user } = request.userData;
        // Check if you are logged in via some selector; if not, log in
        const loggedIn = await page.$('am-i-logged'); // change to a real selector
        if (!loggedIn) {
            // log in with credentials
            const { username, password } = users[user].credentials;
            // do your login
            // ...
            // wait for the redirect,
            // then save the cookies so later requests reuse the session
            const cookies = await page.cookies();
            users[user].cookies = cookies;
        }
        // Usually the login page redirects straight back to the resource, so we can scrape data right away
        const data = scrapeData(); // replace with a real function
        await Apify.pushData(data);
    },
});
await crawler.run();
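The uniqueKey in the snippet above matters because the request queue deduplicates requests by that key; without it, the same URL enqueued for two users would be stored only once. A minimal illustration of that behaviour, using a plain Map in place of the real queue:

```javascript
// Simulate the queue's dedup behaviour with a Map keyed by uniqueKey.
// (Illustrative only - Apify's RequestQueue does this internally.)
function enqueueAll(requests) {
  const queue = new Map();
  for (const req of requests) {
    const key = req.uniqueKey || req.url; // by default the key is derived from the URL
    if (!queue.has(key)) queue.set(key, req);
  }
  return [...queue.values()];
}

// Without per-user uniqueKeys, the second user's request is dropped:
const noKeys = enqueueAll([
  { url: 'https://resource1.com' },
  { url: 'https://resource1.com' },
]);

// With per-user keys, both requests survive:
const withKeys = enqueueAll([
  { url: 'https://resource1.com', uniqueKey: 'https://resource1.com_stevejobs' },
  { url: 'https://resource1.com', uniqueKey: 'https://resource1.com_billgates' },
]);
```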