How to rotate proxies with concurrent.futures ThreadPoolExecutor

Problem Description

This Python code scrapes titles and content from Wikipedia: it looks up different search terms from a CSV on Wikipedia and writes each article's title and body into a .csv file. The code runs fine with ThreadPoolExecutor.

I need to rotate proxies, but I don't know how to do that with ThreadPoolExecutor.

(I forgot to mention that I'm using scraped free proxies.)

scrapelist.csv
apple
banana
mango
etc.


proxylist.csv
161.35.22.17:80
165.227.108.19:80
161.35.52.72:80
135.125.107.126:80
15.188.22.231:3128
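
For reference, requests takes proxies as a per-request dict mapping each scheme to a proxy URL, so a single entry from this list would be used roughly as below (the http:// prefix assumes these are plain HTTP proxies, which is typical for scraped free lists):

import requests

# Assumes the entries in proxylist.csv are plain HTTP proxies;
# HTTPS traffic is tunneled through the same proxy via CONNECT.
proxies = {
    "http": "http://161.35.22.17:80",
    "https": "http://161.35.22.17:80",
}
r = requests.get("https://en.wikipedia.org/wiki/apple",
                 proxies=proxies, timeout=10)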


import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup
import concurrent.futures

namelist = []
proxylist = []

# Load the search terms, one per row
with open('scrapelist.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        namelist.append(row[0])

# Load the proxies, one host:port per row
with open('proxylist.csv', 'r') as prlist:
    reader = csv.reader(prlist)
    for row in reader:
        proxylist.append(row[0])

def scrapecontent(search_term):
    # Fetch the Wikipedia article and pull out its title and body text
    url = "https://en.wikipedia.org/wiki/" + search_term
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    title = soup.find('h1', {"class": "firstHeading"}).text
    alltext = soup.find('div', {"class": "mw-parser-output"}).text.replace("\n", " ")

    # Write the result to a timestamped CSV, one file per search term
    date_time = datetime.now().strftime("%Y%m%d_%H%M%S")
    myfilename = "%s_%s.csv" % (search_term, date_time)
    print("-----\n" + myfilename)
    with open(myfilename, 'w', encoding="utf-8", newline='') as myfile:
        writer = csv.writer(myfile)
        writer.writerow([title, alltext])

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(scrapecontent, namelist)

Tags: python, web-scraping, proxy, threadpoolexecutor

Solution
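
One common approach, sketched below: pass a proxy to each requests.get call through its proxies argument, and, because scraped free proxies are often dead or slow, pick a random proxy per attempt and retry on failure. The helper name get_with_proxy, the max_attempts parameter, and the 10-second timeout are illustrative choices rather than part of the original code, and the http:// scheme assumes the listed proxies are plain HTTP.

import requests
import csv
import random
from datetime import datetime
from bs4 import BeautifulSoup
import concurrent.futures

namelist = []
proxylist = []

with open('scrapelist.csv', 'r') as f:
    for row in csv.reader(f):
        namelist.append(row[0])

with open('proxylist.csv', 'r') as prlist:
    for row in csv.reader(prlist):
        proxylist.append(row[0])

def get_with_proxy(url, max_attempts=5):
    # Illustrative helper: try up to max_attempts randomly chosen proxies
    # from the list; a dead or slow free proxy just triggers another attempt.
    for _ in range(max_attempts):
        proxy = random.choice(proxylist)
        proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
        try:
            r = requests.get(url, proxies=proxies, timeout=10)
            r.raise_for_status()
            return r
        except requests.exceptions.RequestException:
            continue  # proxy failed or timed out: rotate to another one
    raise RuntimeError("all %d proxy attempts failed for %s" % (max_attempts, url))

def scrapecontent(search_term):
    url = "https://en.wikipedia.org/wiki/" + search_term
    r = get_with_proxy(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    title = soup.find('h1', {"class": "firstHeading"}).text
    alltext = soup.find('div', {"class": "mw-parser-output"}).text.replace("\n", " ")

    date_time = datetime.now().strftime("%Y%m%d_%H%M%S")
    myfilename = "%s_%s.csv" % (search_term, date_time)
    print("-----\n" + myfilename)
    with open(myfilename, 'w', encoding="utf-8", newline='') as myfile:
        csv.writer(myfile).writerow([title, alltext])

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(scrapecontent, namelist)

If you prefer strict round-robin over random choice, itertools.cycle(proxylist) yields proxies in order, but since scrapecontent runs on several threads at once you would need to guard each next() call with a threading.Lock; random.choice avoids that shared state entirely. Also note that executor.map only raises a worker's exception when you iterate its results, so a term whose proxy attempts all fail is skipped silently here, just as unhandled errors are in the original code.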

