How to run a threaded function that returns a variable?

Problem Description

Using Python 3.6, what I want to accomplish is a function that continuously scrapes dynamic/changing data from a webpage while the rest of the script executes, with the rest of the script able to reference the data returned by that continuous function.

I understand this is probably a threading task, but I'm not very familiar with threading yet. In pseudocode, I imagine it would look something like this:

def continuous_scraper():
    # Pull data from webpage
    scraped_table = pd.read_html(url)
    return scraped_table

# start the continuous scraper function here, to run either indefinitely, or preferably stop after a predefined amount of time
scraped_table = thread(continuous_scraper)

# the rest of the script is run here, making use of the updating "scraped_table"
while True:
    print(scraped_table["Col_1"].iloc[0])

Tags: python-3.x, web-scraping, python-multithreading

Solution


Here is a fairly simple example using a stock-market page that appears to update every few seconds.

import threading, time

import pandas as pd

# A lock is used to ensure only one thread reads or writes the variable at any one time
scraped_table_lock = threading.Lock()

# Initially set to None so we know when its value has changed
scraped_table = None

# This bad-boy will be called only once in a separate thread
def continuous_scraper():
    # Tell Python this is a global variable, so it rebinds scraped_table 
    # instead of creating a local variable that is also named scraped_table
    global scraped_table
    url = r"https://tradingeconomics.com/australia/stock-market"
    while True:
        # Pull data from webpage
        result = pd.read_html(url, match="Dow Jones")[0]
        
        # Acquire the lock to ensure thread-safety, then assign the new result
        # This is done after read_html returns so it doesn't hold the lock for so long
        with scraped_table_lock:
            scraped_table = result
        
        # You don't wanna flog the server, so wait 2 seconds after each 
        # response before sending another request
        time.sleep(2)

# Make the thread daemonic, so the thread doesn't continue to run once the 
# main script and any other non-daemonic threads have ended
scraper_thread = threading.Thread(target=continuous_scraper, daemon=True)

# start the continuous scraper function here, to run either indefinitely, or 
# preferably stop after a predefined amount of time
scraper_thread.start()

# the rest of the script is run here, making use of the updating "scraped_table"
for _ in range(100):
    print("Time:", time.time())
    
    # Acquire the lock to ensure thread-safety
    with scraped_table_lock:
        # Check if it has been changed from the default value of None
        if scraped_table is not None:
            print("     ", scraped_table)
        else:
            print("scraped_table is None")
    
    # You probably don't wanna flog your stdout, either, dawg!
    time.sleep(0.5)

Be sure to read up on multithreaded programming and thread safety. It is easy to get wrong, and when there is a bug it usually shows up only on rare, seemingly random occasions, which makes it hard to debug.
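The question also asked for a way to stop the scraper after a predefined amount of time. One common pattern for that is a `threading.Event`: the worker loops until the event is set, and the main thread sets it when time is up. The sketch below assumes a hypothetical `fetch_data` stand-in for the `pd.read_html` call, so it runs without any network access; the names `stop_event`, `scraped_value`, and `fetch_data` are illustrative, not part of the original code.

```python
import threading, time

# Hypothetical stand-in for pd.read_html(url): any slow fetch function works here.
def fetch_data():
    return time.time()

scraped_value = None
scraped_value_lock = threading.Lock()
stop_event = threading.Event()

def continuous_scraper(stop_event):
    global scraped_value
    # Loop until the main thread signals us to stop. Event.wait doubles as the
    # polite sleep between requests, but wakes immediately if stop is signalled.
    while not stop_event.is_set():
        result = fetch_data()
        with scraped_value_lock:
            scraped_value = result
        stop_event.wait(2)  # sleep up to 2 seconds, or until stop_event is set

scraper_thread = threading.Thread(target=continuous_scraper, args=(stop_event,))
scraper_thread.start()

time.sleep(0.5)        # let the scraper run for the predefined amount of time
stop_event.set()       # ask the scraper to stop
scraper_thread.join()  # wait for it to actually finish
```

Because the thread exits cleanly on its own, it does not need to be daemonic here; `join()` guarantees the scraper has finished before the script continues.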
