首页 > 解决方案 > 无法使用两个线程在一个脚本中执行两个函数

问题描述

我结合使用 python 创建了一个刮板,Thread以加快执行速度。刮板应该解析网页中以不同字母结尾的所有可用链接。它确实解析了它们。

但是,我希望再次使用这些单独的链接来解析所有的namesphone数字Thread。我可以设法运行的第一部分,Thread但我不知道如何创建另一个Thread来执行脚本的后半部分?

我本可以将它们包装在一个Thread中,但我的目的是知道如何使用两个Threads来执行两个功能。

对于第一部分:我尝试如下,它有效

import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(link).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97,123)]:
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist+=[thread]

    for thread in linklist:
        thread.join()

我的问题是:如何sub_links()在另一个函数中使用函数Thread

import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(link).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)

    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception: name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception: phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97,123)]:
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist+=[thread]

    for thread in linklist:
        thread.join()

标签: pythonpython-3.xweb-scrapingpython-multithreading

解决方案


尝试alphabetical_links使用自己的线程进行更新:

import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"


def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    links_on_page = [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]
    threads = []
    for link in links_on_page:
        thread = threading.Thread(target=sub_links, args=(link,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()


def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)

    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception: name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception: phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97,123)]:
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist+=[thread]


    for thread in linklist:
        thread.join()

请注意,这只是如何管理“内部线程”的示例。由于同时启动了许多线程,您的系统可能由于缺乏资源而无法启动其中一些线程,并且您将遇到RuntimeError: can't start new thread异常。在这种情况下,您应该尝试实现ThreadPool


推荐阅读