Python multiprocessing scraping, duplicate results

Problem description

I am building a scraper that needs to run quickly across a large number of web pages. The output of the code below is a CSV file containing a list of links (among other things). Essentially, I build a list of web pages that each contain several links, and for each page I collect those links.

Adding multiprocessing produces some strange results that I cannot explain. If I run this code with the pool size set to 1 (so, no multithreading), the final output contains about 0.5% duplicate links (which is fair enough). As soon as I speed things up by setting the pool size to 8, 12, or 24, the final output contains roughly 25% duplicate links.

I suspect my mistake lies either in how I write the results to the CSV file or in how I use the imap() function (the same happens with imap_unordered, map, etc.), causing the threads to somehow access the same elements of the iterable that is passed in. Any suggestions?
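
For what it's worth, a minimal test along these lines (a hypothetical sketch with a dummy worker, not part of my scraper) suggests that imap hands each element of the iterable to exactly one worker:

from multiprocessing.pool import ThreadPool
from collections import Counter

def echo(x):
    ### Dummy worker: just hand the input back
    return x

pool = ThreadPool(8)
counts = Counter(pool.imap(echo, range(1000)))
pool.close()
pool.join()
### If imap dispatched an element twice, some count would exceed 1
print(all(c == 1 for c in counts.values()))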

#!/usr/bin/env python
#  coding: utf8
import sys
import requests, re, time
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
import random
import unicodecsv as csv
import progressbar
import multiprocessing
from multiprocessing.pool import ThreadPool

keyword = "keyword"

def openup():
    global crawl_list
    try:
        ### Generate list URLS based on the number of results for the keyword, each of these contains other links. The list is subsequently randomized
        startpage = 1
        ## Get endpage
        url0 = myurl0
        r0 = requests.get(url0)
        print "First request: "+str(r0.status_code)
        tree = html.fromstring(r0.content)
        endpage = tree.xpath("//*[@id='habillagepub']/div[5]/div/div[1]/section/div/ul/li[@class='adroite']/a/text()")
        print str(endpage[0]) + " pages found"
        ### Generate random sequence for crawling
        crawl_list = random.sample(range(1,int(endpage[0])+1), int(endpage[0]))
        return crawl_list
    except Exception as e:
        ### Catch any error in openup and return an empty crawl list
        print e
        crawl_list = []
        return crawl_list

def worker_crawl(x):
    ### Open page
    url_base = myurlbase
    r = requests.get(url_base)
    print "Connecting to page " + str(x) +" ..."+ str(r.status_code)
    if r.status_code == 200:
        tree = html.fromstring(r.content)
        ### Get data 
        titles = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/text()')
        links = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')
        abstracts = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/p/text()')
        footers = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/span/text()')
        dates = []
        pagenums = []
        for f in footers:
            pagenums.append(x)
            match = re.search(r'\| .+$', f)
            if match:
                date = match.group()
                dates.append(date)
        pageindex = zip(titles,links,abstracts,footers,dates,pagenums) #what if there is a missing value?
        return pageindex
    else:
        ### Keep the row shape consistent and record the failed status code
        pageindex = [[str(r.status_code),"","","","",str(x)]]
        return pageindex

def mp_handler():
    ### Write down:
    with open(keyword+'_results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=["title","link","abstract","footer","date","pagenum"])
        wr.writeheader()
        results = p.imap(worker_crawl, crawl_list)
        for result in results:
            for x in result:
                wr.writerow({
                    #"keyword": str(keyword),
                    "title": x[0],
                    "link": x[1],
                    "abstract": x[2],
                    "footer": x[3],
                    "date": x[4],
                    "pagenum": x[5],
                    })

if __name__=='__main__':
    p = ThreadPool(4)
    openup()
    mp_handler()
    p.close()
    p.join()
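
The duplicate percentages above can be reproduced by counting repeated link values in the output file; a minimal sketch, assuming the CSV written by the script above:

import unicodecsv as csv

keyword = "keyword"  # same keyword as in the script above
with open(keyword + '_results.csv', 'rb') as incsv:
    links = [row['link'] for row in csv.DictReader(incsv)]
dupes = len(links) - len(set(links))
print("%d duplicate links out of %d" % (dupes, len(links)))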

Tags: python, multithreading, web-scraping, web-crawler, python-multiprocessing

Solution

Are you sure the page returns the correct response to a rapid sequence of requests? I have been in situations where the site being scraped responded differently when the requests were fast than when they were spaced out in time. Meaning: everything was perfect while debugging, but once the requests were rapid and back-to-back, the website decided to give me a different response. Beyond that, I would ask whether the fact that you are writing in a non-thread-safe environment has an impact: to minimize interference and data problems in the final CSV output, you can pace and verify each request, and buffer the rows so that the file is written in a single pass from a single thread, as sketched below.
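
A minimal sketch of the first point, pacing requests and retrying on unexpected status codes (the delay bounds, retry count, and the fetch_politely name are illustrative assumptions, not from the original code):

import time
import random
import requests

def fetch_politely(url, max_retries=3):
    ### Space requests out so the site sees a slower, human-like pace
    time.sleep(random.uniform(0.5, 2.0))
    for attempt in range(max_retries):
        r = requests.get(url)
        if r.status_code == 200:
            return r
        ### Back off before retrying on an unexpected status code
        time.sleep(2 ** attempt)
    return r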


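And a minimal sketch of the second point: buffer every row, deduplicate on the link column, and write the file in one pass. The column layout matches the question's fieldnames; write_unique and the choice of the link as the dedup key are assumptions for illustration:

import itertools
import unicodecsv as csv

def write_unique(rows, path):
    ### Buffer every row, drop rows whose link was already seen
    seen = set()
    unique_rows = []
    for row in rows:
        link = row[1]
        if link not in seen:
            seen.add(link)
            unique_rows.append(row)
    ### Write header and all surviving rows in a single pass
    with open(path, 'wb') as outcsv:
        wr = csv.writer(outcsv)
        wr.writerow(["title", "link", "abstract", "footer", "date", "pagenum"])
        wr.writerows(unique_rows)

### Usage: flatten the per-page results before writing
# all_rows = itertools.chain.from_iterable(p.imap(worker_crawl, crawl_list))
# write_unique(all_rows, keyword + '_results.csv')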