Can't modify the function to stick with one working proxy

Problem description

I've written a script in Python that uses proxies and multiprocessing to send requests to several links concurrently in order to parse the product names from them. My current attempt does the job, but badly: it slows the whole process down by trying three new proxies on every call, no matter whether the proxy currently in use is a good one or a bad one.

Since I'm using multiprocessing.dummy within the script, I would like to modify the parse_product_info() function so that it only calls process_proxy() to produce a new proxy when the current proxy has actually been identified as a bad one, instead of calling it several times regardless. To be clearer: with my current attempt I can see that, no matter whether the running proxy is good or bad, every time a link is handled inside parse_product_info(link), three new proxies come into play, just like the 3 I used within Pool().

I've tried with:

import random
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO', 
    'https://www.amazon.com/dp/B00TPKOPWA', 
    'https://www.amazon.com/dp/B00TH42HWE', 
    'https://www.amazon.com/dp/B00TPKNREM', 
]

def process_proxy():
    global proxyVault
    if len(proxyVault)!=0:
        random.shuffle(proxyVault)
        proxy_url = proxyVault.pop()
        proxy = {'https': f'http://{proxy_url}'}
    else:
        proxy = None
    return proxy


def parse_product_info(link):
    global proxy
    try:
        if not proxy:raise #if proxy variable doesn't contain any proxy yet, it goes to the exception block to get one as long as the proxy list is not empty
        print("proxy to be used:",proxy)
        res = requests.get(link,proxies=proxy,timeout=5)
        soup = BeautifulSoup(res.text,"html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception: product_name = ""
        print(link,product_name)

    except Exception:
        proxy = process_proxy()
        if proxy!=None:
            return parse_product_info(link)
        else:
            pass


if __name__ == '__main__':
    proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
    pool = Pool(3)
    pool.map(parse_product_info,linklist)

How can I modify the parse_product_info() function so that it sticks with one proxy as long as it is a working one?

Tags: python, multithreading, web-scraping, proxy, multiprocessing

Solution


First of all: although you import from the multiprocessing module, what you are actually using here is multithreading, because .dummy uses threads instead of processes.
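To make that visible, here is a tiny sketch (my own illustration, not part of the OP's code) showing that the pool "workers" of multiprocessing.dummy are just threads of the current process:

import threading
from multiprocessing.dummy import Pool

def worker_id(_):
    # return the name of the thread that executes this task
    return threading.current_thread().name

with Pool(3) as pool:
    print(pool.map(worker_id, range(3)))  # thread names, not separate process ids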

I initially assumed multithreading was fine for the OP, since there is no sign of heavy CPU-bound work in the example. But since we now know the OP may indeed want to use multiprocessing, I'm also providing a multiprocessing solution.

The OP's example needs a rework of the synchronization around the whole proxy handling. I've simplified the example by "mocking" the request part and dropping the BeautifulSoup part, since neither is essential to the problem.


Multiprocessing

This solution uses a multiprocessing.Value as a shared counter that serves as an index into the proxy list. If a worker runs into a timeout, it increments the shared index. The shared counter and the proxy list are registered once at (worker-)process start-up with the help of Pool's initializer parameter.

It's important to use a lock for any non-atomic operation on shared, non-static resources. A multiprocessing.Value comes with an attached multiprocessing.RLock by default, which we can use here.
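As a minimal illustration of that attached lock (separate from the full solution below):

from multiprocessing import Value

counter = Value('i', 0)      # 'i' -> C signed int, initial value 0
with counter.get_lock():     # the RLock that ships attached to the Value
    counter.value += 1       # += reads then writes, so it is not atomic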

import time
import random
import logging
from multiprocessing import Pool, Value, get_logger, log_to_stderr


def request_get(link, proxies, timeout):
    """Dummy request.get()"""
    res = random.choices(["Result", "Timeout"], [0.5, 0.5])
    if res[0] == "Result":
        time.sleep(random.uniform(0, timeout))
        return f"{res[0]} from {link}"
    else:
        time.sleep(timeout)
        raise TimeoutError


def parse_product_info(link):
    global proxy_list, proxy_index    
    while True:
        with proxy_index.get_lock():
            idx = proxy_index.value
        try:
            proxy = {'https': proxy_list[idx]}
        except IndexError:
            # get_logger().info(f"No proxies left.")
            return    
        try:
            # get_logger().info(f"attempt using: {proxy}")
            res = request_get(link, proxies=proxy, timeout=5)
        except TimeoutError:
            # get_logger().info(f"timeout with: {proxy}")
            with proxy_index.get_lock():
                # check with lock held if index is still the same
                if idx == proxy_index.value:
                    proxy_index.value += 1
                    # get_logger().info(f"incremented index: {proxy_index.value}")
        else:
            # get_logger().info(f"processing: {res}")
            return    


def _init_globals(proxy_list, proxy_index):
    globals().update(
        {'proxy_list': proxy_list, 'proxy_index': proxy_index}
    )

Main:

if __name__ == '__main__':

    log_to_stderr(logging.INFO)

    links = [
        'https://www.amazon.com/dp/B00OI0RGGO',
        'https://www.amazon.com/dp/B00TPKOPWA',
        'https://www.amazon.com/dp/B00TH42HWE',
        'https://www.amazon.com/dp/B00TPKNREM',
    ]

    proxies = [
        '103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632',
        '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128',
        '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312',
        '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251',
        '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080',
        '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243'
    ]
    proxies = [f"http://{proxy}" for proxy in proxies]
    proxy_index = Value('i', 0)

    with Pool(
            processes=3,
            initializer=_init_globals,
            initargs=(proxies, proxy_index)
    ) as pool:

        pool.map(parse_product_info, links)

Example output:

[INFO/MainProcess] allocating a new mmap of length 4096
[INFO/ForkPoolWorker-1] child process calling self.run()
...
[INFO/ForkPoolWorker-1] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-2] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-2] processing: Result from https://www.amazon.com/dp/B00TPKOPWA
[INFO/ForkPoolWorker-2] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] timeout with: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] incremented index: 1
[INFO/ForkPoolWorker-3] attempt using: {'https': 'http://180.254.218.229:8080'}
[INFO/ForkPoolWorker-1] timeout with: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-1] attempt using: {'https': 'http://180.254.218.229:8080'}
[INFO/ForkPoolWorker-3] processing: Result from https://www.amazon.com/dp/B00TH42HWE
[INFO/ForkPoolWorker-2] processing: Result from https://www.amazon.com/dp/B00TPKNREM
[INFO/ForkPoolWorker-1] processing: Result from https://www.amazon.com/dp/B00OI0RGGO
[INFO/ForkPoolWorker-3] process shutting down
[INFO/ForkPoolWorker-2] process shutting down
...

Process finished with exit code 0

Multithreading

The proposal below synchronizes the proxy handling with the help of a threading.Lock (also available wrapped as multiprocessing.dummy.Lock), which is possible because multiprocessing.dummy only uses threads.

Note that multiprocessing.Lock (the one not from .dummy), by comparison, is a heavyweight (relatively slow) IPC lock, which would have an impact on overall performance depending on how often you synchronize.
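If you want to see that difference for yourself, a rough single-threaded timing sketch like the following (my own illustration; absolute numbers will vary by machine and OS) compares the uncontended acquire/release overhead of the two lock types:

import time
from threading import Lock as ThreadLock
from multiprocessing import Lock as ProcessLock

def time_lock(lock, n=100_000):
    # measure how long n uncontended acquire/release cycles take
    start = time.perf_counter()
    for _ in range(n):
        with lock:
            pass
    return time.perf_counter() - start

if __name__ == '__main__':
    print(f"threading.Lock:       {time_lock(ThreadLock()):.3f} s")
    print(f"multiprocessing.Lock: {time_lock(ProcessLock()):.3f} s")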

Edit:

The multithreading solution has been refactored from an earlier draft so that it follows the design of the multiprocessing solution above. parse_product_info() is now nearly identical for multithreading and multiprocessing.

import time
import random
import logging
from itertools import repeat
from multiprocessing.dummy import Pool, Lock
get_logger = logging.getLogger


def request_get(link, proxies, timeout):
    ... # same as in multiprocessing solution above


def parse_product_info(link):
    global proxies, proxy_index
    while True:
        with proxy_lock:
            idx_proxy = proxy_index
        try:
            proxy = {'https': proxies[idx_proxy]}
        except IndexError:
            # get_logger().info(f"No proxies left.")
            return
        try:
            # get_logger().info(f"attempt using: {proxy}")
            res = request_get(link, proxies=proxy, timeout=5)
        except TimeoutError:
            # get_logger().info(f"timeout with: {proxy}")
            with proxy_lock:
                if idx_proxy == proxy_index:
                    proxy_index += 1
                    # get_logger().info(f"incremented index:{proxy_index}")
        else:
            # get_logger().info(f"processing: {res}")
            return    


def init_logging(level=logging.INFO):
    fmt = '[%(asctime)s %(threadName)s] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)
    return logging.getLogger()

Main:

if __name__ == '__main__':

    init_logging()

    links = ...  # same as in multiprocessing solution above
    proxies = ... # same as in multiprocessing solution above
    proxy_index = 0
    proxy_lock = Lock()

    with Pool(processes=3) as pool:
        pool.map(parse_product_info, links)

Example output:

[2019-12-18 01:40:25,799 Thread-1] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:25,799 Thread-2] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:25,799 Thread-3] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:26,164 Thread-1] --- processing: Result from https://www.amazon.com/dp/B00OI0RGGO
[2019-12-18 01:40:26,164 Thread-1] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:29,568 Thread-1] --- processing: Result from https://www.amazon.com/dp/B00TPKNREM
[2019-12-18 01:40:30,800 Thread-2] --- timeout with: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:30,800 Thread-2] --- incremented index: 1
[2019-12-18 01:40:30,800 Thread-2] --- attempt using: {'https': 'http://180.254.218.229:8080'}
[2019-12-18 01:40:30,800 Thread-3] --- timeout with: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:30,801 Thread-3] --- attempt using: {'https': 'http://180.254.218.229:8080'}
[2019-12-18 01:40:32,941 Thread-3] --- processing: Result from https://www.amazon.com/dp/B00TH42HWE
[2019-12-18 01:40:34,677 Thread-2] --- processing: Result from https://www.amazon.com/dp/B00TPKOPWA

Process finished with exit code 0

In reply to the OP's latest comment:

If you like, you can swap in a new proxy list inside the IndexError exception-handler block once all proxies are used up. In that block, replace the return with:

        with proxy_lock:
            proxies = new_proxies
            proxy_index = 0
        continue
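For context, this is roughly what the IndexError branch of the threaded parse_product_info() would then look like (new_proxies stands for whatever source of fresh proxies you have; it is not defined in this answer):

        try:
            proxy = {'https': proxies[idx_proxy]}
        except IndexError:
            # instead of giving up, install a fresh proxy list and start over
            with proxy_lock:
                proxies = new_proxies   # however you obtain the fresh list
                proxy_index = 0
            continue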
