Getting exact matches when scraping Google results

Question

I am using the following code to scrape Google search results:

from googlesearch import search   

query = "water outage site:https://www.heraldsun.com/"

for j in search(query, tld="com", num=100, stop=None, pause=2):
    print(j)

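This currently returns articles that contain the word "water" as well as articles that contain "outage", but I am looking for articles that contain the phrase "water outage". In a Google search, this is like searching for "water outage" in quotes. I tried this: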

query= "\"water outage\" site:https://www.heraldsun.com/"

However, I still see the same number of results. Is there a way to get exact matches?

Tags: python, web-scraping, google-search

Answer


You only need to change the search query slightly:

# from this
query = "water outage site:https://www.heraldsun.com/"

# to this (removing https part and backslashes)
query = "water outage site:heraldsun.com"

So your code would look like this:

from googlesearch import search   

query = "water outage site:heraldsun.com"

for j in search(query, tld="com", num=100, stop=None, pause=2):
    print(j)
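
Note that the query above no longer has quotation marks around the phrase, and Google generally treats a double-quoted phrase as an exact-match request. A minimal sketch of combining both ideas (quoted phrase plus a bare domain for `site:`) might look like the following. `build_exact_query` is a hypothetical helper written for this illustration, not part of the `googlesearch` package, and stripping a leading `www.` is an assumption made to match the query in the answer.

```python
from urllib.parse import urlparse


def build_exact_query(phrase: str, site: str) -> str:
    """Hypothetical helper: quote the phrase and reduce `site` to a bare host."""
    # urlparse only fills in netloc when a scheme or a leading "//" is present
    parsed = urlparse(site if "//" in site else "//" + site)
    host = (parsed.netloc or parsed.path).strip("/")
    # Assumption: drop "www." so subdomains are not excluded by the site: operator
    if host.startswith("www."):
        host = host[4:]
    # Remove quotes already in the phrase so the result has exactly one pair
    cleaned = phrase.replace('"', "").strip()
    return f'"{cleaned}" site:{host}'


query = build_exact_query("water outage", "https://www.heraldsun.com/")
print(query)
```

The resulting string can be passed as `query` to the same `search(...)` call shown above. Whether Google returns fewer results for the quoted form depends on the pages it has indexed.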

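Alternatively, if you need to parse Google or other search engines, you could try SerpApi. It is a paid API with a free plan, and it currently supports parsing data from seven search engines as well as marketplaces such as Walmart and the App Store.

Example code that does the same thing you are trying to do: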

import os, json
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "water outage site:heraldsun.com",
    "hl": "en",                               # language
    "gl": "us",                               # country to search from
    "api_key": os.getenv("API_KEY")           # your API key
}

search = GoogleSearch(params)
results = search.get_dict()

# prints full JSON response from the first page
for result in results["organic_results"]:
    print(json.dumps(result, indent=2))  
    
    # want a title and a link?
    # print(result['title'])
    # print(result['link'])


# part of the output
'''
{
  "position": 1,
  "title": "Water out in parts of Hillsborough after contractor ruptures main",
  "link": "https://amp.heraldsun.com/news/local/counties/orange-county/article233990732.html",
  "displayed_link": "https://amp.heraldsun.com \u203a counties \u203a article233990732",
  "date": "Aug 14, 2019",
  "snippet": "A water main break Wednesday morning has left some Hillsborough water customers without service. A contractor installing communications ...",
  "about_this_result": {
    "source": {
      "description": "The Herald-Sun is an American, English language daily newspaper in Durham, North Carolina, published by the McClatchy Company.",
      "source_info_link": "https://en.wikipedia.org/wiki/The_Herald-Sun_(Durham,_North_Carolina)",
      "security": "secure",
      "icon": "https://serpapi.com/searches/61922a67c7ad41c5dfd89667/images/2a749afb909e95f5dfd0a23ae293037290151a6162f1bfcd35e02e2733abee61f2ede2c6f00c6c9c5e8cd5c2fcb2f610.png"
    },
    "keywords": [
      "water",
      "outage"
    ],
    "related_keywords": [
      "main break"
    ],
    "languages": [
      "English"
    ],
    "regions": [
      "the United States"
    ]
  },
  "cached_page_link": null
}
# other results..
'''
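
If the search engine still returns loose matches, one option is to filter the parsed results on the client side. The sketch below assumes each result is a dict with `title` and `snippet` keys, as in the output above. `contains_phrase` is a hypothetical helper, and the second sample entry is invented purely for illustration.

```python
def contains_phrase(result: dict, phrase: str) -> bool:
    """Case-insensitive check for the literal phrase in the title or snippet."""
    text = " ".join([result.get("title") or "", result.get("snippet") or ""])
    return phrase.lower() in text.lower()


# Sample data shaped like the organic_results entries shown above.
# The first entry is shortened from that output; the second is made up.
sample_results = [
    {
        "title": "Water out in parts of Hillsborough after contractor ruptures main",
        "snippet": "A water main break Wednesday morning has left some "
                   "Hillsborough water customers without service.",
    },
    {
        "title": "Hypothetical headline",
        "snippet": "Crews respond to a water outage downtown.",
    },
]

# Keep only the results that contain the exact phrase
exact = [r for r in sample_results if contains_phrase(r, "water outage")]
for r in exact:
    print(r["title"])
```

Because snippets are truncated, this filter can discard pages that contain the phrase only in the article body, so it is best treated as a rough second pass rather than a replacement for a quoted query.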

Disclaimer: I work for SerpApi.

