PuppeteerCrawler: logging in and scraping for multiple users

Problem description

I am using Apify with PuppeteerCrawler to scrape pages for multiple users. For each user I have to log them into the system, scrape 5 pages, then log out and continue with the next user.

What is the best approach: calling the crawler once per user, or calling it only once and letting it handle the logins and logouts?

I am extending the example from https://sdk.apify.com/docs/examples/puppeteercrawler and running it in the Apify cloud. Right now I am modifying the request.userData object and adding a 'login' label to it, so the login case can be handled first. After logging in, the relevant 5 pages to scrape are enqueued.
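
Roughly, the idea is something like this (a simplified sketch of what I described above; the login URL, form selectors and the per-user list of 5 pages are placeholders):

    // Sketch only: start with a 'login' request, then enqueue the 5 pages after logging in.
    const pagesForThisUser = [/* the 5 URLs to scrape for this user */];

    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: 'https://example.com/login', // placeholder login URL
        userData: { label: 'login' },
    });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            if (request.userData.label === 'login') {
                // fill the login form (placeholder selectors and credentials)
                await page.type('#username', 'user@example.com');
                await page.type('#password', 'secret');
                await Promise.all([page.click('#submit'), page.waitForNavigation()]);
                // after logging in, enqueue the 5 pages for this user
                for (const url of pagesForThisUser) {
                    await requestQueue.addRequest({ url, userData: { label: 'page' } });
                }
            } else {
                // scrape the page here
            }
        },
    });
    await crawler.run();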

Tags: javascript, web-scraping, puppeteer, apify

Solution


I would say both options are equally valid. Having multiple crawlers is certainly simpler, although doing everything in one crawler can be more efficient (since you can handle all users at once). I would suggest starting with the first option until you get a better feel for how to handle the second one properly.

The version I present here is the simplest one, as it assumes the pages you access automatically redirect to the login page and back from it. If that is not the case, you just need to handle it with labels (a rough sketch of that is shown after the code below).

    // Apify SDK v1 style; this code is assumed to run inside Apify.main(async () => { ... })
    const Apify = require('apify');

    // Let's assume you have some object with your users.
    // This may in fact be loaded from input or somewhere else but for simplicity, let's define it right away
    const users = {
        stevejobs: {
            credentials: {
                username: 'stevejobs@gmail.com',
                password: '123',
            },
            cookies: null, // Cookies can be also loaded so you can use persistent login
        },
        billgates: {
            credentials: {
                username: 'billgates@gmail.com',
                password: '123',
            },
            cookies: null,
        },
        // etc...
    };

    const myUrls = ['https://resource1.com', 'https://resource2.com']; // replace with real URLs

    // initialize request queue
    const requestQueue = await Apify.openRequestQueue();

    // Now we will loop over the users and for each of them define a crawler and run it
    for (const user of Object.keys(users)) {

        // enqueue some pages
        for (const url of myUrls) {
            await requestQueue.addRequest({
                url,
                uniqueKey: `${url}_${user}` // Otherwise the queue would dedup them
            });
        }

        const crawler = new Apify.PuppeteerCrawler({
            requestQueue,
            gotoFunction: async ({ page, request }) => {
                // if you have cookies, you simply add them to the page
                const { cookies } = users[user];
                if (cookies) {
                    await page.setCookie(...cookies);
                }
                return page.goto(request.url);
            },
            handlePageFunction: async ({ page, request }) => {
                // Check if you are logged in by some selector, if not log in
                const loggedIn = await page.$('am-i-logged'); // Change to a real selector
                if (!loggedIn) {
                    // log in with credentials
                    const { username, password } = users[user].credentials;
                    // do your login
                    // ...
                    // wait for redirect
                    // then we save cookies
                    const cookies = await page.cookies();
                    users[user].cookies = cookies;
                }
                // Usually the log in page will redirect directly to the resource so we can scrape data right away
                const data = await scrapeData(page); // replace with your real scraping function
                await Apify.pushData(data);
            }
        });
        await crawler.run();
    }
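
If the site does not redirect you to the login page by itself, the label-based variant mentioned above could look roughly like this (a sketch only; the login URL and form selectors are assumptions, not part of the original example):

    // Sketch: enqueue an explicit 'LOGIN' request when the site does not redirect on its own.
    await requestQueue.addRequest({
        url: 'https://example.com/login', // placeholder login URL
        uniqueKey: `login_${user}`,
        userData: { label: 'LOGIN' },
    });

    // ...and inside handlePageFunction:
    if (request.userData.label === 'LOGIN') {
        const { username, password } = users[user].credentials;
        await page.type('#username', username); // placeholder selectors
        await page.type('#password', password);
        await Promise.all([page.click('#submit'), page.waitForNavigation()]);
        users[user].cookies = await page.cookies(); // keep the session for the next requests
        return;
    }
    // Any other request is a regular page, scrape it as before.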

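As the comment next to the cookies field suggests, the cookies could also be persisted so the login survives between actor runs. A minimal sketch using the default key-value store (the 'SESSION_COOKIES' key is just an illustrative name):

    // Sketch: persist cookies in the default key-value store between runs.
    const storedCookies = (await Apify.getValue('SESSION_COOKIES')) || {};

    // Before crawling, load previously saved cookies into the users object.
    for (const user of Object.keys(users)) {
        if (storedCookies[user]) users[user].cookies = storedCookies[user];
    }

    // After the crawls finish, save whatever cookies were collected.
    for (const user of Object.keys(users)) {
        storedCookies[user] = users[user].cookies;
    }
    await Apify.setValue('SESSION_COOKIES', storedCookies);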