python-3.x - Downloading image files from a website using threads and a queue
Question
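I am trying to write a simple Python 3 program that uses threads and a queue to download images concurrently from URLs. It should use 4 or more threads to download 4 images at a time into the PC's Downloads folder, and it should avoid downloading the same image twice by sharing information between the threads. I suppose I could use something like URL1 = "Link1"? Here are some example links.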
https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg
https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg
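But I don't understand how to use threads together with a queue, and I don't know how to go about it.
I searched for pages that explain how to combine threads with a queue for concurrent downloads, but I only found material about threads on their own.
Here is code that partially works. What I need is for the program to ask how many threads you want and then download images until it reaches image 20. As written, if you enter 5 it downloads only 5 images, and so on. The intended behavior is that with 5 threads it downloads the first 5 images, then the next 5, and so on up to 20. With 4 threads it would go 4, 4, 4, 4, 4. With 6 threads it would go 6, 6, 6 and then download the remaining 2. Somehow I have to use a queue in the code, but I only started learning about threads a few days ago and I am lost on how to combine threads and queues.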
import threading
import urllib.request
import queue  # I need to use this somehow


def worker(cont):
    print("The worker is ON", cont)
    image_download = "URL" + str(cont) + ".jpeg"
    download = urllib.request.urlopen(image_download)
    file_save = open("Image " + str(cont) + ".jpeg", "wb")
    file_save.write(download.read())
    file_save.close()
    return cont + 1


threads = []
q_threads = int(input("Choose input amount of threads between 4 and 20"))
for i in range(0, q_threads):
    # args must be a one-element tuple because worker() takes a single argument
    h = threading.Thread(target=worker, args=(i + 1,))
    threads.append(h)
for i in range(0, q_threads):
    threads[i].start()
Answer
I adapted the following from some code I use to run multithreaded PSO.
import threading
import queue


# defined before the main block so the name exists when the threads are created
class picture_getter(threading.Thread):
    def __init__(self, url, picture_queue):
        self.url = url
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        # --- get your picture --- #
        picture = None  # placeholder: replace with the downloaded picture
        self.picture_queue.put(picture)


if __name__ == "__main__":
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    picture_urls = ["string.com", "string2.com"]

    # create and start the threads
    for url in picture_urls:
        picture_thread = picture_getter(url, picture_queue)
        picture_threads.append(picture_thread)
        picture_thread.start()

    # wait for threads to finish
    for picture_thread in picture_threads:
        picture_thread.join()

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())
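As you know, people on Stack Overflow like to see what you have tried before they offer a solution. However, I had this code on hand anyway. Welcome, fellow newcomer!
One thing I will add is that this does not avoid duplication by sharing information between threads. It avoids duplication because each thread is told exactly what to download. If your file names are numbered, as they appear to be in your question, this should not be a problem, since you can easily build the list of names.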
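If the threads themselves should share the work, one common pattern is to put every file name into a `queue.Queue` and let a fixed number of worker threads pull from it. The following is a minimal sketch of that idea, not the code from the answer: `fetch_to_disk`, `worker`, `download_all`, and the `fetch` parameter are hypothetical names introduced here, and the sketch assumes the numbered file names and URL prefix from the question. Because `Queue.get_nowait()` hands each item to exactly one thread, no image is downloaded twice, and a thread that finishes early takes the next name instead of waiting for a whole batch.

```python
import queue
import threading
import urllib.request

URL_PREFIX = "https://unab-dw2018.s3.amazonaws.com/ldp2019/"  # taken from the question


def fetch_to_disk(name):
    # Assumption: each image is saved as "Image <name>" in the current directory.
    with urllib.request.urlopen(URL_PREFIX + name) as response:
        data = response.read()
    with open("Image " + name, "wb") as file_save:
        file_save.write(data)
    return "Image " + name


def worker(tasks, results, fetch):
    # Keep pulling names until the shared task queue is empty.
    while True:
        try:
            name = tasks.get_nowait()
        except queue.Empty:
            return
        try:
            results.put(fetch(name))
        except OSError as error:  # urllib's URLError is a subclass of OSError
            print("Failed to download {}: {}".format(name, error))


def download_all(names, num_threads, fetch=fetch_to_disk):
    tasks = queue.Queue()
    results = queue.Queue()
    for name in names:
        tasks.put(name)  # fill the queue before any worker starts
    threads = [
        threading.Thread(target=worker, args=(tasks, results, fetch))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Every worker has finished, so draining with empty() is safe here.
    finished = []
    while not results.empty():
        finished.append(results.get())
    return finished
```

With the 20 numbered names from the question, a call such as `download_all(["{}.jpeg".format(i + 1) for i in range(20)], 5)` would keep 5 threads busy until all names have been taken from the queue. The `fetch` parameter exists only so the download step can be swapped out, for example when trying the pattern without network access.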
Updated the code to address the edit to Treyon's original post:
import threading
import urllib.request
import queue
import time


class picture_getter(threading.Thread):
    def __init__(self, url, file_name, picture_queue):
        self.url = url
        self.file_name = file_name
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        print("{}: Simulating delay".format(self.file_name))
        time.sleep(1)
        # download and save image
        download = urllib.request.urlopen(self.url)
        file_save = open("Image " + self.file_name, "wb")
        file_save.write(download.read())
        file_save.close()
        self.picture_queue.put("Image " + self.file_name)


def remainder_or_max_threads(num_pictures, num_threads, iterations):
    # remaining pictures
    remainder = num_pictures - (num_threads * iterations)
    # if there are equal or more pictures remaining than max threads
    # return max threads, otherwise remaining number of pictures
    if remainder >= num_threads:
        # use the parameter rather than relying on the global max_threads
        return num_threads
    else:
        return remainder


if __name__ == "__main__":
    # store the response from the threads
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    num_pictures = 20
    url_prefix = "https://unab-dw2018.s3.amazonaws.com/ldp2019/"
    picture_names = ["{}.jpeg".format(i + 1) for i in range(num_pictures)]
    max_threads = int(input("Choose input amount of threads between 4 and 20: "))
    iterations = 0

    # during the majority of runtime iterations * max threads is
    # the number of pictures that have been downloaded
    # when it exceeds num_pictures all pictures have been downloaded
    while iterations * max_threads < num_pictures:
        # this returns max_threads if there are max_threads or more pictures left to download
        # else it will return the number of remaining pictures
        threads = remainder_or_max_threads(num_pictures, max_threads, iterations)

        # loop through the next section of pictures, create and start their threads
        for name, i in zip(picture_names[iterations * max_threads:], range(threads)):
            picture_threads.append(picture_getter(url_prefix + name, name, picture_queue))
            picture_threads[i + iterations * max_threads].start()

        # wait for threads to finish
        for picture_thread in picture_threads:
            picture_thread.join()

        # increment the iterations
        iterations += 1

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())
    print("Successfully downloaded")
    print(picture_list)
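The loop above waits for a whole batch of threads to finish before it starts the next batch. As an alternative that is not part of the original answer, the standard library's `concurrent.futures.ThreadPoolExecutor` keeps its own internal queue of pending work and hands each item to whichever thread becomes free. A minimal sketch, assuming the same URL prefix and numbered file names; `save_picture`, `download_with_pool`, and the `fetch` parameter are hypothetical names introduced for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL_PREFIX = "https://unab-dw2018.s3.amazonaws.com/ldp2019/"  # taken from the question


def save_picture(name):
    # Assumption: each image is saved as "Image <name>" in the current directory.
    with urllib.request.urlopen(URL_PREFIX + name) as response:
        data = response.read()
    with open("Image " + name, "wb") as file_save:
        file_save.write(data)
    return "Image " + name


def download_with_pool(names, max_threads, fetch=save_picture):
    # The executor queues the names internally; at most max_threads run at once.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in the order of `names`, not in completion order.
        return list(pool.map(fetch, names))
```

Here the thread count only limits how many downloads run at the same time, so 20 pictures with 6 threads no longer proceed as 6, 6, 6, 2. If the batch-by-batch behavior is a requirement of the exercise, the loop-based version is the closer match.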