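python - Unable to modify a function to stick with one working proxy
Problem description
I've written a script in Python that uses proxies and multiprocessing to send requests to several links concurrently and parse the product name from each. My current attempt does the job without errors, but it slows the process down by trying three new proxies in each call, regardless of whether the proxy currently in use is good or bad.
Since I use multiprocessing via multiprocessing.dummy in the script, I would like to modify parse_product_info() so that, even when a proxy is identified as bad, it does not call process_proxy() multiple times to produce three new proxies. To be clearer: with my current attempt, three new proxies come into play in each call when a link is passed to parse_product_info(link), whether the running proxy is good or bad, because I used 3 within Pool().
I've tried: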
import random
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE',
    'https://www.amazon.com/dp/B00TPKNREM',
]

def process_proxy():
    global proxyVault
    if len(proxyVault)!=0:
        random.shuffle(proxyVault)
        proxy_url = proxyVault.pop()
        proxy = {'https': f'http://{proxy_url}'}
    else:
        proxy = None
    return proxy

def parse_product_info(link):
    global proxy
    try:
        if not proxy:raise #if proxy variable doesn't contain any proxy yet, it goes to the exception block to get one as long as the proxy list is not empty
        print("proxy to be used:",proxy)
        res = requests.get(link,proxies=proxy,timeout=5)
        soup = BeautifulSoup(res.text,"html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception: product_name = ""
        print(link,product_name)
    except Exception:
        proxy = process_proxy()
        if proxy!=None:
            return parse_product_info(link)
        else:
            pass

if __name__ == '__main__':
    proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
    pool = Pool(3)
    pool.map(parse_product_info,linklist)
How can I modify the parse_product_info() function so that it sticks with one proxy as long as that proxy is a working one?
Solution
First off, despite using the multiprocessing module, you are using multithreading here, because multiprocessing.dummy uses threads instead of processes.
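One way to check this is the minimal sketch below. The helper name worker_identity is made up for illustration. It maps a function over a multiprocessing.dummy.Pool and records which process and thread handled each task; every task should report the parent's process ID, since the workers are threads inside the same process.

```python
import os
import threading
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API


def worker_identity(_):
    """Return the process ID and thread name of the worker handling this task."""
    return os.getpid(), threading.current_thread().name


if __name__ == '__main__':
    with Pool(3) as pool:
        identities = pool.map(worker_identity, range(6))

    # Collect the distinct process IDs seen by the workers.
    pids = {pid for pid, _ in identities}
    print("all tasks ran in the parent process:", pids == {os.getpid()})
    print("threads used:", sorted({name for _, name in identities}))
```

Swapping the import for `from multiprocessing import Pool` would instead report the process IDs of separate worker processes.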
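I initially assumed the OP would be fine with multithreading, because nothing in the example indicates heavy CPU-bound work. But since we now know the OP might indeed want to use multiprocessing, I'm also providing a multiprocessing solution.
The OP's example needs a rework of the synchronization around the whole proxy handling. I simplified the example by "mocking" the request part and dropping the soup (BeautifulSoup parsing) part, since it is not important to the problem.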
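Multiprocessing
This solution uses a multiprocessing.Value as a shared counter that serves as an index into the proxy list. If a worker gets a timeout, it increments the shared index. The shared counter and the proxy list are registered once at (worker) process start with the help of Pool's initializer parameter.
It's important to use a lock for any non-atomic operation on a non-static shared resource. By default, multiprocessing.Value comes with a multiprocessing.RLock attached, which we can use.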
import time
import random
import logging
from multiprocessing import Pool, Value, get_logger, log_to_stderr

def request_get(link, proxies, timeout):
    """Dummy request.get()"""
    res = random.choices(["Result", "Timeout"], [0.5, 0.5])
    if res[0] == "Result":
        time.sleep(random.uniform(0, timeout))
        return f"{res[0]} from {link}"
    else:
        time.sleep(timeout)
        raise TimeoutError

def parse_product_info(link):
    global proxy_list, proxy_index
    while True:
        with proxy_index.get_lock():
            idx = proxy_index.value
        try:
            proxy = {'https': proxy_list[idx]}
        except IndexError:
            # get_logger().info(f"No proxies left.")
            return
        try:
            # get_logger().info(f"attempt using: {proxy}")
            res = request_get(link, proxies=proxy, timeout=5)
        except TimeoutError:
            # get_logger().info(f"timeout with: {proxy}")
            with proxy_index.get_lock():
                # check with lock held if index is still the same
                if idx == proxy_index.value:
                    proxy_index.value += 1
                    # get_logger().info(f"incremented index: {proxy_index.value}")
        else:
            # get_logger().info(f"processing: {res}")
            return

def _init_globals(proxy_list, proxy_index):
    globals().update(
        {'proxy_list': proxy_list, 'proxy_index': proxy_index}
    )
Main:
if __name__ == '__main__':
    log_to_stderr(logging.INFO)
    links = [
        'https://www.amazon.com/dp/B00OI0RGGO',
        'https://www.amazon.com/dp/B00TPKOPWA',
        'https://www.amazon.com/dp/B00TH42HWE',
        'https://www.amazon.com/dp/B00TPKNREM',
    ]
    proxies = [
        '103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632',
        '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128',
        '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312',
        '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251',
        '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080',
        '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243'
    ]
    proxies = [f"http://{proxy}" for proxy in proxies]
    proxy_index = Value('i', 0)
    with Pool(
        processes=3,
        initializer=_init_globals,
        initargs=(proxies, proxy_index)
    ) as pool:
        pool.map(parse_product_info, links)
Example output:
[INFO/MainProcess] allocating a new mmap of length 4096
[INFO/ForkPoolWorker-1] child process calling self.run()
...
[INFO/ForkPoolWorker-1] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-2] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-2] processing: Result from https://www.amazon.com/dp/B00TPKOPWA
[INFO/ForkPoolWorker-2] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] timeout with: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] incremented index: 1
[INFO/ForkPoolWorker-3] attempt using: {'https': 'http://180.254.218.229:8080'}
[INFO/ForkPoolWorker-1] timeout with: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-1] attempt using: {'https': 'http://180.254.218.229:8080'}
[INFO/ForkPoolWorker-3] processing: Result from https://www.amazon.com/dp/B00TH42HWE
[INFO/ForkPoolWorker-2] processing: Result from https://www.amazon.com/dp/B00TPKNREM
[INFO/ForkPoolWorker-1] processing: Result from https://www.amazon.com/dp/B00OI0RGGO
[INFO/ForkPoolWorker-3] process shutting down
[INFO/ForkPoolWorker-2] process shutting down
...
Process finished with exit code 0
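The check made while the lock is held ("is the index still the same?") is what keeps several workers that time out on the same proxy from each skipping ahead. Here is a minimal sketch of that step in isolation; the function name advance_if_unchanged is an illustrative assumption and does not appear in the solution above.

```python
from multiprocessing import Value


def advance_if_unchanged(counter, seen):
    """Increment `counter` only if it still holds the value `seen`.

    Returns True if this call performed the increment.
    """
    with counter.get_lock():  # the RLock attached to the Value by default
        if counter.value == seen:
            counter.value += 1
            return True
        return False


if __name__ == '__main__':
    index = Value('i', 0)
    # Two workers both read index 0 before their requests timed out.
    print(advance_if_unchanged(index, 0))  # True: the first reporter moves on to proxy 1
    print(advance_if_unchanged(index, 0))  # False: the index was already advanced
    print(index.value)  # 1
```

Without the comparison, the second report would push the index to 2 and proxy 1 would be discarded without ever being tried.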
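Multithreading
The proposal below synchronizes the proxy handling with the help of a threading.Lock (also available wrapped as multiprocessing.dummy.Lock), which is possible because multiprocessing.dummy uses threads only.
Note that multiprocessing.Lock (the one not from .dummy), in contrast, is a heavy (comparatively slow) IPC lock, which will affect overall performance depending on how frequently you synchronize.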
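To get a feel for the difference on your own machine, a rough sketch like the following times uncontended acquire/release cycles for both lock types. The helper time_lock is a made-up name, the numbers depend heavily on platform and Python version, and uncontended timings say nothing about behaviour under contention, so treat it as an illustration rather than a benchmark.

```python
import threading
import multiprocessing
import multiprocessing.dummy
from timeit import timeit


def time_lock(lock, n=100_000):
    """Return the seconds taken by n uncontended acquire/release cycles."""
    def cycle():
        with lock:
            pass
    return timeit(cycle, number=n)


if __name__ == '__main__':
    # multiprocessing.dummy re-exports the threading primitive.
    print("dummy.Lock is threading.Lock:", multiprocessing.dummy.Lock is threading.Lock)
    print(f"threading.Lock:       {time_lock(threading.Lock()):.4f}s")
    print(f"multiprocessing.Lock: {time_lock(multiprocessing.Lock()):.4f}s")
```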
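EDIT:
The multithreading solution has been refactored from an earlier draft to pick up the design of the multiprocessing solution above. parse_product_info() is now nearly identical for multithreading and multiprocessing.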
import time
import random
import logging
from multiprocessing.dummy import Pool, Lock

get_logger = logging.getLogger

def request_get(link, proxies, timeout):
    ... # same as in multiprocessing solution above

def parse_product_info(link):
    global proxies, proxy_index
    while True:
        with proxy_lock:
            idx_proxy = proxy_index
        try:
            proxy = {'https': proxies[idx_proxy]}
        except IndexError:
            # get_logger().info(f"No proxies left.")
            return
        try:
            # get_logger().info(f"attempt using: {proxy}")
            res = request_get(link, proxies=proxy, timeout=5)
        except TimeoutError:
            # get_logger().info(f"timeout with: {proxy}")
            with proxy_lock:
                if idx_proxy == proxy_index:
                    proxy_index += 1
                    # get_logger().info(f"incremented index:{proxy_index}")
        else:
            # get_logger().info(f"processing: {res}")
            return

def init_logging(level=logging.INFO):
    fmt = '[%(asctime)s %(threadName)s] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)
    return logging.getLogger()
Main:
if __name__ == '__main__':
    init_logging()
    links = ... # same as in multiprocessing solution above
    proxies = ... # same as in multiprocessing solution above
    proxy_index = 0
    proxy_lock = Lock()
    with Pool(processes=3) as pool:
        pool.map(parse_product_info, links)
Example output:
[2019-12-18 01:40:25,799 Thread-1] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:25,799 Thread-2] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:25,799 Thread-3] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:26,164 Thread-1] --- processing: Result from https://www.amazon.com/dp/B00OI0RGGO
[2019-12-18 01:40:26,164 Thread-1] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:29,568 Thread-1] --- processing: Result from https://www.amazon.com/dp/B00TPKNREM
[2019-12-18 01:40:30,800 Thread-2] --- timeout with: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:30,800 Thread-2] --- incremented index: 1
[2019-12-18 01:40:30,800 Thread-2] --- attempt using: {'https': 'http://180.254.218.229:8080'}
[2019-12-18 01:40:30,800 Thread-3] --- timeout with: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:30,801 Thread-3] --- attempt using: {'https': 'http://180.254.218.229:8080'}
[2019-12-18 01:40:32,941 Thread-3] --- processing: Result from https://www.amazon.com/dp/B00TH42HWE
[2019-12-18 01:40:34,677 Thread-2] --- processing: Result from https://www.amazon.com/dp/B00TPKOPWA
Process finished with exit code 0
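In reply to the OP's latest comment:
If you wish, you can swap in a new proxy list in the IndexError exception handler block once all proxies are used up. In the code, you replace return with: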
with proxy_lock:
    proxies = new_proxies
    proxy_index = 0
continue
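For illustration, here is a self-contained sketch of such a refill in the multithreading setup. It is a variation on the snippet above, not the answer's exact code: the exhaustion check is done while the lock is held, so that two threads cannot both swap the list. The proxy names, the spare_batches list and the deterministic request_get stand-in are all assumptions made up for this example.

```python
from multiprocessing.dummy import Pool, Lock

# Placeholder data: only proxies containing "good" succeed in this sketch.
proxy_list = ['http://bad-1', 'http://bad-2']
spare_batches = [['http://bad-3', 'http://good-1']]  # refills, used one at a time
proxy_index = 0
proxy_lock = Lock()


def request_get(link, proxies, timeout):
    """Deterministic stand-in for requests.get()."""
    if 'good' not in proxies['https']:
        raise TimeoutError
    return f"Result from {link} via {proxies['https']}"


def parse_product_info(link):
    global proxy_list, proxy_index
    while True:
        with proxy_lock:
            if proxy_index >= len(proxy_list):
                if not spare_batches:
                    return None  # no proxies and no refills left
                proxy_list = spare_batches.pop(0)
                proxy_index = 0
            # Remember which list and index this attempt is based on.
            idx, current = proxy_index, proxy_list
        try:
            return request_get(link, proxies={'https': current[idx]}, timeout=5)
        except TimeoutError:
            with proxy_lock:
                # Advance only if nobody advanced or swapped the list meanwhile.
                if current is proxy_list and idx == proxy_index:
                    proxy_index += 1


if __name__ == '__main__':
    links = ['link-1', 'link-2', 'link-3', 'link-4']
    with Pool(3) as pool:
        for result in pool.map(parse_product_info, links):
            print(result)
```

The identity check `current is proxy_list` matters after a swap: a thread that timed out on the old list could otherwise see a matching index in the new list and skip a proxy that was never tried.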