Optimizing scraping code: choosing a URL based on parameters

Problem description

This is a simple function that builds a URL from search parameters. It works, but I think it needs optimizing.

def target_url(search_term, include_term, intext_term, target_site_in, page):
    
    # Replace spaces before building the templates; f-strings are evaluated
    # immediately, so a later replace would not affect them.
    search_term = search_term.replace(' ', '+')

    base_template_0 = f'https://www.google.com/search?q={search_term}+"{include_term}"+intext:{intext_term}+site:{target_site_in}&hl=en&rlz='
    base_template_1 = f'https://www.google.com/search?q={search_term}+"{include_term}"+intext:{intext_term}&hl=en&rlz='
    base_template_2 = f'https://www.google.com/search?q={search_term}+"{include_term}"&hl=en&rlz='
    base_template_3 = f'https://www.google.com/search?q={search_term}&hl=en&rlz='

    base_url_0 = base_template_0.format(search_term)
    base_url_1 = base_template_1.format(search_term)
    base_url_2 = base_template_2.format(search_term)
    base_url_3 = base_template_3.format(search_term)

    url_template_0 = base_url_0 + '&start={}'
    url_template_1 = base_url_1 + '&start={}'
    url_template_2 = base_url_2 + '&start={}'
    url_template_3 = base_url_3 + '&start={}'

    if page == 0 and search_term and include_term and intext_term and target_site_in:
        return base_url_0
    if page == 0 and search_term and include_term and intext_term:
        return base_url_1
    if page == 0 and search_term and include_term:
        return base_url_2
    if page == 0 and search_term:
        return base_url_3
    else:
        if search_term and include_term and intext_term and target_site_in:
            return url_template_0.format(page)
        if search_term and include_term and intext_term:
            return url_template_1.format(page)
        if search_term and include_term:
            return url_template_2.format(page)
        if search_term:
            return url_template_3.format(page)


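It takes four search parameters (search_term, include_term, intext_term, target_site_in) plus page, and the URL is built differently depending on which of them are supplied.

Please suggest a better way to structure this.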

Tags: python, web-scraping, optimization

Solution


Instead of keeping several template strings and choosing among them, you can write a function that builds the final search query:


def get_search_query(search_term, include_term, intext_term, target_site_in):
  response = search_term.replace(' ', '+')
  if include_term:
    response = f'{response}+"{include_term}"'  # keep the exact-phrase quotes used in the question
  if intext_term:
    response = f"{response}+intext:{intext_term}"
  if target_site_in:
    response = f"{response}+site:{target_site_in}"
  return response
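
The same idea can also be written by collecting the non-empty parts in a list and joining them once at the end. This is only a sketch: `build_query` is a hypothetical name, and the default empty strings are an assumption so that optional terms can be left out by the caller.

```python
def build_query(search_term, include_term="", intext_term="", target_site_in=""):
    """Join only the non-empty search parts with '+'."""
    parts = [search_term.replace(" ", "+")]
    if include_term:
        # Exact-phrase match, quoted as in the question's templates.
        parts.append(f'"{include_term}"')
    if intext_term:
        parts.append(f"intext:{intext_term}")
    if target_site_in:
        parts.append(f"site:{target_site_in}")
    return "+".join(parts)


print(build_query("web scraping", "python", "tutorial", "example.com"))
print(build_query("web scraping"))
```

Note that, unlike the cascade of `if` checks in the question, each optional term here is included independently of the others.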

Now you can call it from your function:

def target_url(search_term, include_term, intext_term, target_site_in, page):
  query = get_search_query(search_term, include_term, intext_term, target_site_in)
  url = f'https://www.google.com/search?q={query}&hl=en&rlz='
  if page != 0:
    url = f"{url}&start={page}"  # the question's URLs use 'start', not 'page'
  return url
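
If the terms may contain characters such as `&`, `#` or quotes, building the URL by hand can produce a broken query string. One possible variant, sketched below, lets `urllib.parse.urlencode` from the standard library do the escaping. The name `target_url_encoded` and the default argument values are assumptions, and the empty `rlz=` parameter from the question is left out here.

```python
from urllib.parse import urlencode


def target_url_encoded(search_term, include_term="", intext_term="",
                       target_site_in="", page=0):
    """Build the search URL and let urlencode escape the query string."""
    parts = [search_term]
    if include_term:
        parts.append(f'"{include_term}"')
    if intext_term:
        parts.append(f"intext:{intext_term}")
    if target_site_in:
        parts.append(f"site:{target_site_in}")

    # urlencode turns spaces into '+' and percent-encodes reserved characters.
    params = {"q": " ".join(parts), "hl": "en"}
    if page:
        # Passed through unchanged as 'start', as in the question.
        params["start"] = page
    return "https://www.google.com/search?" + urlencode(params)


print(target_url_encoded("web scraping", "python"))
print(target_url_encoded("cats", page=10))
```

With this approach the quotes and colons appear percent-encoded (`%22`, `%3A`) in the resulting URL, which is equivalent once the server decodes the query.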
