Multiprocessing not working when extracting data from a list of URLs in Python

Problem description

I am trying to extract data from a list of URLs stored in a local text file. The code runs correctly but is slow, so I added multiprocessing as shown below.

import requests
from lxml import html
import pandas as pd
from pandas import ExcelWriter
from multiprocessing import Pool

filepath = open('d:\links.txt', 'r')

allnames = list()
alltitles = list()

def extractm():
    for ii in filepath:
        geturl = requests.get(ii)
        soup = html.fromstring(geturl.content)
        names = soup.xpath("//dt[contains(text(),'Corresponding Author')]//following::dd//text()[1]")
        titles = soup.xpath('//body//title//text()')
        for (name, title) in zip(names, titles):
            allnames.append(name)
            alltitles.append(title)
            fullfile = pd.DataFrame({'Names': allnames, 'Title': alltitles})
            writer = ExcelWriter('D:\\data.xlsx')
            fullfile.to_excel(writer, 'Sheet1', index=False)
            writer.save()

if __name__ == '__main__':
    pp = Pool(10)
    pp.apply_async(extractm())
    pp.close()
    pp.join()

Even after applying multiprocessing, it runs just as slowly as before. Can anyone help me?

Tags: python, web-scraping, multiprocessing

Solution
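The multiprocessing here never actually runs in the pool. pp.apply_async(extractm()) calls extractm() immediately in the parent process and then hands its return value (None) to apply_async, so the ten workers sit idle while the main process does all the scraping. Even pp.apply_async(extractm) would not help much, because extractm loops over every URL itself, so the whole job would land on a single worker. The global lists allnames and alltitles would also not be shared across processes, and the Excel file is rewritten on every row.

The usual fix is to make the worker function handle one URL and let Pool.map distribute the list. Below is a minimal sketch along those lines (extract_one is just an illustrative helper name, and the 30-second timeout is an added assumption); it assumes each line of d:\links.txt holds one URL, and it writes the spreadsheet once at the end instead of inside the loop:

import requests
from lxml import html
import pandas as pd
from multiprocessing import Pool

def extract_one(url):
    # Fetch a single URL (30 s timeout added as a safety net) and
    # return the (name, title) pairs found on the page.
    page = requests.get(url, timeout=30)
    soup = html.fromstring(page.content)
    names = soup.xpath("//dt[contains(text(),'Corresponding Author')]//following::dd//text()[1]")
    titles = soup.xpath('//body//title//text()')
    return list(zip(names, titles))

if __name__ == '__main__':
    # Read the URL list once in the parent process; strip the trailing
    # newline that iterating over a file leaves on each line.
    with open(r'd:\links.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    # Pool.map splits the URL list across 10 worker processes and
    # collects one result list per URL, in order.
    with Pool(10) as pp:
        results = pp.map(extract_one, urls)

    # Flatten the per-URL lists and write the Excel file exactly once.
    rows = [pair for result in results for pair in result]
    fullfile = pd.DataFrame(rows, columns=['Names', 'Title'])
    fullfile.to_excel(r'D:\data.xlsx', sheet_name='Sheet1', index=False)

Since the work is dominated by network I/O rather than CPU, a thread pool (for example multiprocessing.dummy.Pool with the same map call) would speed this up just as well and avoids the overhead of pickling data between processes.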

