Python literally hides a line and skips the function call

Problem description

I know this may not make much sense. There is a function calling another function that is supposed to call `selenium.webdriver.Chrome().get('some_website')`. Here is a simplified version of the code (which works fine):

from selenium.webdriver import Chrome


def func1(driver, url):
    driver.get(url)


def func2():
    driver = Chrome()
    func1(driver, 'https://stackoverflow.com/questions/ask')


if __name__ == '__main__':
    func2()

Since this is likely to be met with disbelief, I made a gif showing the strange behavior:

[gif: the debugger appears to skip straight over the line]

This is the code that should not have a problem, yet something strange is clearly going on here. By the way, I tried the following:

all of the above lead to the same result

from selenium.webdriver import Chrome
from lxml import html
import time


def get_table_id(website):
    """
    Get table html id value.
    Args:
        website: url containing supported domains.

    Returns:
        website respective table id.
    """
    if 'free-proxy.cz' in website:
        return 'proxy_list'
    if 'proxyrack' in website:
        return 'proxy_table'
    raise ValueError(f'Unsupported website {website}')


def scrape_page(driver, page, wait_time=0):
    """
    Scrape a page from the following websites:
    - http://free-proxy.cz/en/proxylist/main/
    - https://www.proxyrack.com/free-proxy-list/
    Args:
        driver: selenium.webdriver class
        page: url.
        wait_time: seconds to wait for page load.

    Yields:
        dictionary per scraped ip address.
    """
    driver.get(page)
    content = driver.page_source
    if wait_time:
        time.sleep(wait_time)
    tree = html.fromstring(content)
    columns = [
        tree.xpath(f'//*[@id="proxy_list"]/tbody//tr//td[{i}]//text()')
        for i in range(1, 12)
    ]
    columns = [
        [item for item in row if 'adsbygoogle' not in item] for row in columns[:3]
    ]
    if 'proxyrack' not in page:
        columns[0] = [columns[0][i] for i in range(1, len(columns[0]), 2)]
    assert (
        len(set(len(item) for item in columns)) == 1
    ), f'row length mismatch \n{columns}'
    for row in zip(*columns):
        yield dict(zip(('ip_address', 'port', 'protocol'), row))


def scrape_pr_pages(total_pages=765):
    driver = Chrome()
    scrape_page(driver, 'http://free-proxy.cz/en/', 15)  # ?????
    driver.find_element_by_xpath(
        '//*[@id="dynatable-pagination-links-proxy_table"]/li[8]/a'
    ).click()
    time.sleep(2)
    print(driver.page_source)


if __name__ == '__main__':
    scrape_pr_pages()

Tags: python, python-3.x, selenium

Solution


(Copying my comment as an answer)

In `scrape_page`, `yield` is used. This turns the function into a generator: calling it does not execute its body. To process the results, you need to iterate over them:

for row in scrape_page(driver, 'http://free-proxy.cz/en/', 15):
    print(row)
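The behavior can be reproduced without selenium. A minimal sketch (hypothetical names, with `driver.get(page)` stood in for by a list append) showing that the body of a generator function only runs once you iterate:

```python
executed = []

def scrape_like():
    """Mimics scrape_page: the presence of yield makes this a generator."""
    executed.append("side effect")  # stands in for driver.get(page)
    yield {"ip_address": "1.2.3.4"}

gen = scrape_like()          # calling the function does NOT run the body
ran_before = list(executed)  # still empty: nothing has happened yet
rows = list(gen)             # iterating finally executes the body
```

This is why the debugger appears to "skip" the `scrape_page(...)` line: the call itself only builds a generator object, which is then discarded.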

As the OP mentioned, removing the `yield` and returning the complete dictionaries instead also solves the problem.
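A sketch of that return-based variant (hypothetical name, selenium and lxml calls stubbed out with a plain list of row tuples): because there is no `yield`, the body runs eagerly as soon as the function is called.

```python
def scrape_page_eager(raw_rows):
    """Return-based variant of scrape_page: executes on call, no iteration needed."""
    results = []
    for row in raw_rows:
        # same row-to-dict mapping as the original yield line
        results.append(dict(zip(("ip_address", "port", "protocol"), row)))
    return results

table = scrape_page_eager([("1.2.3.4", "8080", "http")])
```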

