Script is generating duplicate output

Problem description

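I have a scraping bot that currently checks two websites for changes to a button. All of the code is available on GitHub, but here are the important parts.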

This is the function that prints the output:

def notify_difference(card, original_text):    
    print("#######################################")
    print(f"            {card.get_model()} STOCK ALERT           ")
    print(f"           {time.ctime()}")
    print(f"Button has changed from {original_text} to {card.get_button_text()} for {card.get_name()}.")
    if "newegg" in card.get_url():
        print(
            f"Add it to your cart: https://secure.newegg.com/Shopping/AddToCart.aspx?ItemList={card.get_item_id()}&Submit=ADD&target=NEWEGGCART\n\n")
    print(f"Current price: {card.get_price()}.")
    print(f"Please visit {card.get_url()} for more information.")
    print("#######################################")
    print("")
    print("")

This is the function that builds the request tasks:

async def get_stock():    
    # Get the current time and append to the end of the url just to add some minor difference
    # between scrapes.
    t = int(round(time.time() * 1000))

    urls = {
        "..."
    }
    s = AsyncHTMLSession()

    tasks = (parse_url(s, url.split("-=")[1], url.split("-=")[0]) for url in urls)

    return await asyncio.gather(*tasks)
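
To make the split in the generator expression concrete, here is a small sketch that assumes each entry has the form <model>-=<url>. The real entries are elided above, so the example.com entry and the split_entry helper are made up for illustration, and appending t as a query parameter is only a guess at how the timestamp is used.

```python
import time


def split_entry(entry):
    """Split a '<model>-=<url>' entry into (model, url) at the first '-='."""
    model, separator, url = entry.partition("-=")
    if not separator:
        raise ValueError(f"entry has no '-=' separator: {entry!r}")
    return model, url


# Millisecond timestamp, computed the same way as in get_stock().
t = int(round(time.time() * 1000))

# Hypothetical entry; the real URLs are not shown in the question.
entry = f"3070-=https://example.com/search?q=rtx+3070&t={t}"

model, url = split_entry(entry)
print(model)
print(url)
```

Unlike calling split("-=") twice and indexing the result, partition splits only at the first separator, so a URL that happens to contain "-=" is kept whole.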

This is the code that fetches a URL and calls the class that parses the HTML:

async def parse_url(s, url, model):
    # Narrow HTML search down using HTML class selectors.
    r = await s.get(url)
    cards = r.html.find('.right-column')

    for item in cards:
        card = Card.create(item, model)

        if card is not None:
            card_id = card.get_item_id()
            if card_id in card_set.keys():
                if card_set[card_id].get_button_text() != card.get_button_text():
                    original_text = card_set[card_id].get_button_text()
                    if card.is_in_stock():
                        notify_difference(card, original_text)

            card_set[card_id] = card

Everything starts from __main__ here:

if __name__ == '__main__':
    print(f"{time.ctime()} ::: Checking Stock...")
    Util.clear_card_shelf()

    while True:
        card_set = Util.get_card_dict()

        try:
            asyncio.run(get_stock())
        except Exception as e:
            if "SSLError" in type(e).__name__:
                # SSL Error. Wait 8-15 seconds and try again.
                print(f"{time.ctime()} ::: {type(e).__name__} error. Retrying in 8-15 seconds...")
            else:
                print(f"{type(e).__name__} Exception: {str(e)}")

        Util.set_card_shelf(card_set)
        time.sleep(random.randint(8, 15))

Now look at this sample output. Note the timestamps. These duplicates appear on subsequent runs of the loop:

#######################################
            3070 STOCK ALERT           
           Wed Nov 18 11:38:10 2020
Button has changed from Sold Out to Add to cart for MSI GeForce RTX 3070 DirectX 12 RTX 3070 VENTUS 3X OC 8GB 256-Bit GDDR6 PCI Express 4.0 HDCP Ready Video Card.
Add it to your cart: https://secure.newegg.com/Shopping/AddToCart.aspx?ItemList=N82E16814137601&Submit=ADD&target=NEWEGGCART


Current price: $549.99.
Please visit https://www.newegg.com/msi-geforce-rtx-3070-rtx-3070-ventus-3x-oc/p/N82E16814137601 for more information.
#######################################


#######################################
            3070 STOCK ALERT           
           Wed Nov 18 11:40:12 2020
Button has changed from Sold Out to Add to cart for MSI GeForce RTX 3070 DirectX 12 RTX 3070 VENTUS 3X OC 8GB 256-Bit GDDR6 PCI Express 4.0 HDCP Ready Video Card.
Add it to your cart: https://secure.newegg.com/Shopping/AddToCart.aspx?ItemList=N82E16814137601&Submit=ADD&target=NEWEGGCART


Current price: $549.99.
Please visit https://www.newegg.com/msi-geforce-rtx-3070-rtx-3070-ventus-3x-oc/p/N82E16814137601 for more information.
#######################################


#######################################
            3070 STOCK ALERT           
           Wed Nov 18 11:40:50 2020
Button has changed from Sold Out to Add to cart for MSI GeForce RTX 3070 DirectX 12 RTX 3070 VENTUS 3X OC 8GB 256-Bit GDDR6 PCI Express 4.0 HDCP Ready Video Card.
Add it to your cart: https://secure.newegg.com/Shopping/AddToCart.aspx?ItemList=N82E16814137601&Submit=ADD&target=NEWEGGCART


Current price: $549.99.
Please visit https://www.newegg.com/msi-geforce-rtx-3070-rtx-3070-ventus-3x-oc/p/N82E16814137601 for more information.
#######################################

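I can't for the life of me figure out why the output is duplicated. It happens every time. Is it a problem with the parallel requests? Is it a problem with shelve? Or something else entirely?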

Help!

Additional code

Here are some of the utility functions referenced above:

def get_card_dict():
    s = shelve.open('cards')

    stocks = s.items()
    stock_dict = convert_tuple_to_dict(stocks)    

    s.close()

    return stock_dict

def set_card_shelf(dic):
    s = shelve.open('cards')
    s.update(dic)
    s.close()

def clear_card_shelf():
    if path.exists("cards.dat"):
        card_dat_list = glob.glob("cards.*")
        for card_dat in card_dat_list:
            remove(card_dat)

def convert_tuple_to_dict(tup):
    dic = {}
    for a, b in tup:
        dic.setdefault(a, b)
    
    return dic
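
One detail in clear_card_shelf is worth checking. The file names that shelve creates depend on the dbm backend available on the platform, so a shelf opened as 'cards' may be stored as cards.dat, cards.dir and cards.bak, as cards.db, or as a single file named cards. In the last two cases the path.exists("cards.dat") guard is false and the old shelf survives a restart. Whether this applies to the machine the bot runs on is not known from the question. A minimal sketch, using a temporary directory and the hypothetical helpers shelf_files and clear_shelf, that lists whatever files the shelf actually produced and removes all of them:

```python
import glob
import os
import shelve
import tempfile


def shelf_files(base_path):
    """Return every file on disk whose name starts with the shelf's base path."""
    return sorted(glob.glob(glob.escape(base_path) + "*"))


def clear_shelf(base_path):
    """Delete all files belonging to the shelf, whatever extension they have."""
    for file_path in shelf_files(base_path):
        os.remove(file_path)


with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "cards")

    # Write one entry, then close the shelf so it is flushed to disk.
    with shelve.open(base) as shelf:
        shelf["N82E16814137601"] = "Sold Out"

    # The names printed here differ between platforms and Python versions.
    created = shelf_files(base)
    print(created)

    # Reopen and read everything back as a plain dict.
    with shelve.open(base) as shelf:
        stored = dict(shelf)

    clear_shelf(base)
    remaining = shelf_files(base)
```

Printing the created file names once on the target machine shows whether the cards.dat check can ever be true there.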

Tags: python, web-scraping

Solution
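
One thing worth ruling out is that the button text seen for the same item ID really does flip between "Sold Out" and "Add to cart" from one scrape to the next, for example because the same item is listed on more than one scraped page or because the site serves differently cached pages. That is an assumption, not something the sample output confirms. Below is a minimal sketch of a helper that records every observation and suppresses repeat alerts inside a cooldown window. AlertTracker, should_notify, and the 300-second cooldown are hypothetical names and values, not part of the original bot.

```python
import time


class AlertTracker:
    """Record button-text observations and rate-limit alerts per item (hypothetical helper)."""

    def __init__(self, cooldown_seconds=300.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.last_alert = {}  # item_id -> clock value of the last alert
        self.history = []     # (clock value, item_id, old_text, new_text)

    def should_notify(self, item_id, old_text, new_text, in_stock):
        now = self.clock()
        # Keep every observation so flapping can be inspected afterwards.
        self.history.append((now, item_id, old_text, new_text))

        # Same conditions as the original code: text changed and item is in stock.
        if old_text == new_text or not in_stock:
            return False

        # Suppress a repeat alert for the same item inside the cooldown window.
        last = self.last_alert.get(item_id)
        if last is not None and now - last < self.cooldown_seconds:
            return False

        self.last_alert[item_id] = now
        return True
```

In parse_url, the call would sit where notify_difference is invoked today, passing card_id, original_text, card.get_button_text() and card.is_in_stock(). The tracker has to be created once, outside the while loop, so that it survives between scrapes. If tracker.history shows the text alternating for one item ID, that would point to the scraped input rather than to asyncio or shelve.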

