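python - How to find URLs on Google Images using Beautiful Soup
Question
I'm trying to find copyright-free images on Google, but I can't get the right image URLs. My code applies the correct filters and takes me to the correct page, but it retrieves the URLs of images without the copyright-free and size filters applied, and I'm not sure why. Thanks in advance.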
import requests
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
url = 'https://google.com/search?q='
input = 'cat'
#string: tbm=isch --> means image search
#tbs=isz:m --> size medium
#il:cl --> copy right free(i think)
url = url+input+'&tbm=isch&tbs=isz:m%2Cil:cl'
print(url)
html = urlopen(Request(url, headers={'User-Agent': 'Google Chrome'}))
'''with urllib.request.urlopen(url) as response:
html = response.read()
print(html)'''
#print(str(r.content))
soup = BeautifulSoup(html.read(),'html.parser')
#using soup to find all img tags
results = soup.find_all('img')
str_result = str(results)
lst_result = str_result.split(',')
#trying to get the first link for the images with the appropriate settings
link = lst_result[4].split(' ')[4].split('"')[1]
# writing into the appropriate testing file, to be changed
file = open('.img1.png','wb')
get_img = requests.get(link)
file.write(get_img.content)
file.close()
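Answer
You can try a simpler approach: rather than specifying the tbs=il:cl parameter and playing a guessing game about which images are definitely licensed under Creative Commons, search for "pexels cat" or "unsplash cat".
Alternatively, you can add the filter parameter (tbs=il:cl) and also put pexels/unsplash at the beginning of the query.
Images from these sites are completely free by default, since the sites are intended to provide free images for commercial or non-commercial use, and Google will show results only from these sites.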
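As a rough sketch of the two options above, the search URL can be assembled with the standard library. The helper name build_image_search_url is hypothetical and only for illustration; tbm=isch and tbs=il:cl are the parameters used in the question, and this sketch does not guarantee how Google responds to them.

```python
from urllib.parse import urlencode


def build_image_search_url(query, site_hint=None, license_filter=False):
    """Build a Google Images search URL (hypothetical helper, for illustration)."""
    # Prepend a free-stock site name such as "pexels" or "unsplash" to the query.
    full_query = f"{site_hint} {query}" if site_hint else query
    params = {"q": full_query, "tbm": "isch"}  # tbm=isch selects image search
    if license_filter:
        params["tbs"] = "il:cl"  # the license filter from the question
    return "https://www.google.com/search?" + urlencode(params)


# Option 1: no filter, rely on the site name in the query.
print(build_image_search_url("cat", site_hint="pexels"))
# Option 2: site name plus the tbs=il:cl filter.
print(build_image_search_url("cat", site_hint="unsplash", license_filter=True))
```

urlencode percent-encodes the colon in il:cl and turns the space in the query into a plus sign, so nothing has to be escaped by hand.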
To find and extract the original image URLs, you need to parse them out of the <script> tags with regex.
First, find all script tags using bs4:
soup.select('script')
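A minimal, offline illustration of this step, assuming a made-up HTML snippet in place of a real Google response (sample_html and its contents are invented for the example):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a Google Images response.
sample_html = """
<html><body>
<img src="thumb.jpg">
<script>AF_initDataCallback({key: 'ds:1', data: [1, 2, 3]});</script>
<script>var unrelated = true;</script>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
all_script_tags = soup.select("script")  # list of <script> Tag objects

# Join the raw script bodies into one string for later regex matching.
script_text = "\n".join(tag.string or "" for tag in all_script_tags)
print(len(all_script_tags))
print(script_text)
```

The image data lives inside the script bodies rather than in the img tags, which is why the next step runs a regex over this text.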
Second, match the desired pattern using regex:
# one of the regex patterns to find original size URL
re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", SOME_VARIABLE)
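To see what this pattern expects, here is a small sketch on invented data. The sample string and its example.com URLs are assumptions, not real Google output. The pattern needs a ' or , directly before ,[" and in this sample that context only appears once the thumbnail entries have been removed, which is the same order of operations the full example follows.

```python
import re

# Invented data: each entry holds a thumbnail triple followed by a full-size triple.
sample_data = (
    '[1,[0,"id1",["https://encrypted-tbn0.gstatic.com/images?q=tbn:AAA",183,275]'
    ',["https://images.example.com/photos/cat-1.jpeg",1280,853],null],'
    '[1,[0,"id2",["https://encrypted-tbn0.gstatic.com/images?q=tbn:BBB",194,259]'
    ',["https://images.example.com/photos/cat-2.jpeg",1920,1080],null]]'
)

THUMBNAIL_PATTERN = r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]'
FULL_RES_PATTERN = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"

# Removing the thumbnail triples leaves ',,[' in front of each full-size triple.
without_thumbnails = re.sub(THUMBNAIL_PATTERN, "", sample_data)

full_res_urls = re.findall(FULL_RES_PATTERN, without_thumbnails)
print(full_res_urls)
```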
Third, iterate over the matches, extracting and decoding each URL in turn:
# matched_google_full_resolution_images is the list returned by re.findall above
for fixed_full_res_image in matched_google_full_resolution_images:
    # it needs to be decoded twice.
    # otherwise Unicode escape sequences will still be present after the first decode.
    # yes, it is stupid.
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
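The double decode is easier to follow on a concrete value. In this sketch the URL is invented and decode_escaped_url is a hypothetical helper; the input deliberately contains doubly escaped sequences such as \\u003d, which is the situation the comment above describes.

```python
def decode_escaped_url(escaped, passes=2):
    """Apply the 'unicode-escape' decode repeatedly (hypothetical helper)."""
    for _ in range(passes):
        # bytes(..., "ascii") raises UnicodeEncodeError on non-ASCII input.
        escaped = bytes(escaped, "ascii").decode("unicode-escape")
    return escaped


# Invented URL whose '=' and '&' are escaped twice (note the double backslashes).
raw = r"https://images.example.com/cat.jpg?auto\\u003dcompress\\u0026w\\u003d500"

after_one_pass = decode_escaped_url(raw, passes=1)
after_two_passes = decode_escaped_url(raw)

print(after_one_pass)    # still contains \u003d and \u0026
print(after_two_passes)  # fully decoded URL
```

The first pass only collapses each double backslash into a single one; the second pass turns the resulting \u003d and \u0026 into = and &.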
Full code, also available as an example in an online IDE, that scrapes more data:
import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "tbs": "il:cl",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decode, Unicode escape sequences are still present; the second decode removes them.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nGoogle Full Resolution Images:')  # in order
    for fixed_full_res_image in matched_google_full_resolution_images:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)


get_images_data()
--------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSb48h3zks_bf6y7HnZGyGPn3s2TAHKKm_7kzxufi5nzbouJcQderHqoEoOZ4SpOuPDjfw&usqp=CAU
...
Google Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
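Alternatively, you can skip this process by using the Google Images API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to iterate over structured JSON, since everything else is already done for the end user.
Code to integrate: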
import os, json  # json for pretty output
from serpapi import GoogleSearch


def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "minecraft shaders 8k photo",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))


get_google_images()
----------
'''
...
{
"position": 60, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
"source": "pexels.com",
"title": "1,000+ Best Cats Videos · 100% Free Download · Pexels Stock Videos",
"link": "https://www.pexels.com/search/videos/cats/",
"original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
...
'''
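Iterating over that structured JSON might look like the following sketch. It runs on a hard-coded sample rather than a live API call: the first entry reuses fields from the output above, the second entry is made up to show a missing key, and extract_original_urls is a hypothetical helper name.

```python
# Sample shaped like the 'images_results' output above; the second entry is made up.
images_results = [
    {
        "position": 60,
        "source": "pexels.com",
        "link": "https://www.pexels.com/search/videos/cats/",
        "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    },
    {"position": 61, "source": "example.com", "link": "https://example.com/cats/"},
]


def extract_original_urls(results):
    """Return the 'original' URL of every entry that has one."""
    return [item["original"] for item in results if item.get("original")]


original_urls = extract_original_urls(images_results)
print(original_urls)
```

Using item.get("original") skips entries without a full-size URL instead of raising a KeyError.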
P.S. I wrote a blog post about scraping Google Images that goes into more depth, with visual illustrations.
Disclaimer: I work for SerpApi.