python - Scraping exact matches while scraping Google results
Question
I am scraping Google search results with the following code:
from googlesearch import search
query = "water outage site:https://www.heraldsun.com/"
for j in search(query, tld="com", num=100, stop=None, pause=2):
    print(j)
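This currently gives me articles that contain the word "water" as well as articles that contain "outage", but I am looking for articles that contain the phrase "water outage", which in Google Search is like searching for "water outage" in double quotes. I tried this:
query = "\"water outage\" site:https://www.heraldsun.com/"
However, I still see the same number of results. Is there a way to get exact matches?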
Answer
You only need to change the search query slightly:
# from this
query = "water outage site:https://www.heraldsun.com/"
# to this (removing the https:// part and the trailing slash)
query = "water outage site:heraldsun.com"
So your code would look like this:
from googlesearch import search
query = "water outage site:heraldsun.com"
for j in search(query, tld="com", num=100, stop=None, pause=2):
    print(j)
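The query above narrows the search to the site but contains no quotation marks, so it does not request a phrase match on its own. If the exact phrase is still wanted, one option is to build the query string with the phrase wrapped in double quotes and the site reduced to a bare domain. The sketch below is only an illustration: `build_query` is a hypothetical helper (not part of `googlesearch`), dropping a leading `www.` is my own choice, and whether Google then returns only exact matches is not guaranteed.

```python
from urllib.parse import urlparse


def build_query(phrase, site, exact=True):
    """Return a Google query string limited to one site.

    Hypothetical helper for illustration; not part of any library.
    """
    # Accept either a bare domain or a full URL and keep only the host name.
    parsed = urlparse(site if "://" in site else "//" + site)
    domain = parsed.netloc
    if domain.startswith("www."):
        domain = domain[len("www."):]

    # Drop any quotes the caller already added, then re-add them if requested.
    phrase = phrase.strip().strip('"')
    if exact:
        phrase = '"' + phrase + '"'

    return phrase + " site:" + domain


query = build_query("water outage", "https://www.heraldsun.com/")
print(query)
```

The resulting string can be passed to `search(...)` in place of the hard-coded `query`; building it this way also avoids the backslash escapes used in the question.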
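Alternatively, if you need to parse Google or other search engines, you can try SerpApi. It is a paid API with a free plan that currently supports parsing data from seven search engines, as well as from the search engines of marketplaces such as Walmart, the App Store, and others.
Example code that does the same thing you are trying to do: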
import json
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "water outage site:heraldsun.com",
    "hl": "en",  # language
    "gl": "us",  # country to search from
    "api_key": os.getenv("API_KEY"),  # your API key
}
search = GoogleSearch(params)
results = search.get_dict()
# prints full JSON response from the first page
for result in results["organic_results"]:
    print(json.dumps(result, indent=2))
    # want a title and a link?
    # print(result['title'])
    # print(result['link'])
# part of the output
'''
{
  "position": 1,
  "title": "Water out in parts of Hillsborough after contractor ruptures main",
  "link": "https://amp.heraldsun.com/news/local/counties/orange-county/article233990732.html",
  "displayed_link": "https://amp.heraldsun.com \u203a counties \u203a article233990732",
  "date": "Aug 14, 2019",
  "snippet": "A water main break Wednesday morning has left some Hillsborough water customers without service. A contractor installing communications ...",
  "about_this_result": {
    "source": {
      "description": "The Herald-Sun is an American, English language daily newspaper in Durham, North Carolina, published by the McClatchy Company.",
      "source_info_link": "https://en.wikipedia.org/wiki/The_Herald-Sun_(Durham,_North_Carolina)",
      "security": "secure",
      "icon": "https://serpapi.com/searches/61922a67c7ad41c5dfd89667/images/2a749afb909e95f5dfd0a23ae293037290151a6162f1bfcd35e02e2733abee61f2ede2c6f00c6c9c5e8cd5c2fcb2f610.png"
    },
    "keywords": [
      "water",
      "outage"
    ],
    "related_keywords": [
      "main break"
    ],
    "languages": [
      "English"
    ],
    "regions": [
      "the United States"
    ]
  },
  "cached_page_link": null
}
# other results..
'''
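The sample result above mentions "water" and "main break" but not the literal phrase "water outage", so a client-side check can help when strict phrase matching matters. The following is a minimal sketch, assuming each result is a dict with `title`, `link`, and `snippet` keys as in the output shown. `filter_exact_phrase` is a hypothetical helper, the second sample entry is made up for illustration, and the first snippet is abridged. Because snippets are truncated, this check can miss pages whose body contains the phrase.

```python
def filter_exact_phrase(results, phrase):
    """Keep results whose title or snippet contains `phrase`, ignoring case.

    Hypothetical helper for illustration; not part of SerpApi.
    """
    needle = phrase.casefold()
    matches = []
    for result in results:
        # Missing keys are treated as empty text rather than raising KeyError.
        text = (result.get("title", "") + " " + result.get("snippet", "")).casefold()
        if needle in text:
            matches.append({"title": result.get("title"), "link": result.get("link")})
    return matches


# Sample data shaped like the "organic_results" entries above.
sample_results = [
    {
        "title": "Water out in parts of Hillsborough after contractor ruptures main",
        "link": "https://amp.heraldsun.com/news/local/counties/orange-county/article233990732.html",
        "snippet": "A water main break Wednesday morning has left some Hillsborough water customers without service.",
    },
    {
        # Made-up entry, included only to show a positive match.
        "title": "Hypothetical headline about a water outage",
        "link": "https://example.com/hypothetical-article",
        "snippet": "Made-up snippet for illustration only.",
    },
]

exact_matches = filter_exact_phrase(sample_results, "water outage")
for match in exact_matches:
    print(match["title"], match["link"])
```

With real data, `results["organic_results"]` from the SerpApi response would take the place of `sample_results`.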
Disclaimer: I work for SerpApi.