Fail to load a subpart of "open-images-v6" with Fiftyone

Question

Context

I'm trying to retrieve a large amount of data to train a CNN. More specifically, I'm looking for pictures of swimming pools. I found a lot of them in the open-images-v6 dataset published by Google. I only want to download these particular images (I don't want 9 million images to end up in my download folder).

Problem

To do this, I carefully followed the instructions on the Download page (see https://storage.googleapis.com/openimages/web/download.html). I installed "fiftyone" and tried the "testing" procedure (loading the "quickstart" dataset and browsing the data) without encountering any issues.

But when I tried to retrieve the swimming pool images with the following code, I ran into several issues:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    label_types="detections",
    classes="Swimming pool"
)
session = fo.launch_app(dataset)

I will skip straight to the problem I couldn't figure out: when I run the code, it properly downloads a bunch of .csv files, but when it tries to download the data (the images), it fails with this error:

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

What I tried

After hours of searching for the origin of the error, I eventually discovered that it was somehow linked to AWS, but I have no idea what to do about it.

I saw a tutorial online that recommended installing "awscli" via pip, but nothing changed. I tried to load other datasets with the same procedure (e.g. foz.load_zoo_dataset("coco-2017")) and it seemed to work (at least the download started, but I stopped it early).

Thank you for your time.

Tags: python, amazon-web-services, dataset

Answer


Thank you for the AWS hint, that finally got me on the right trail.

Fiftyone builds the download path with Python's os.path.join(), which produces Windows-style paths (backslash separators) when running on Windows. S3 object keys use forward slashes, so the request for a backslash-separated key finds nothing and raises the 404 error.
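To illustrate the difference, here is a minimal sketch that can be run on any operating system. It uses the standard library modules ntpath and posixpath, which are the implementations behind os.path on Windows and on Linux/macOS respectively. The image ID is a made-up placeholder, not a real Open Images ID.

```python
import ntpath      # what os.path resolves to on Windows
import posixpath   # what os.path resolves to on Linux/macOS

split = "validation"
image_id = "example_image_id"  # hypothetical ID, for illustration only

# Same call, different separator depending on the platform's path module
windows_style = ntpath.join(split, image_id + ".jpg")
posix_style = posixpath.join(split, image_id + ".jpg")

print(windows_style)  # validation\example_image_id.jpg
print(posix_style)    # validation/example_image_id.jpg
```

Only the second form matches how the objects are named in the bucket, which is consistent with the download working on Linux/macOS and failing on Windows.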

Since this is a bug in fiftyone itself (I will create a PR to get it fixed), you will need to patch your local fiftyone installation yourself.

Go to your Python site-packages directory, then open fiftyone/utils/openimages.py

In this file, add the following code to the import statements:

import re

Then search for the _download_images_if_necessary method and replace this line:

fp_download = os.path.join(split, image_id + ".jpg")

with this one:

fp_download = re.sub(r"\\", "/", os.path.join(split, image_id + ".jpg"))

This fixed the problem for me.
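As an alternative to the regex, the same normalization can be sketched without the extra re import. The helper names below (s3_key, normalize_key) are hypothetical and are not part of fiftyone; this is only an illustration of the idea, not the fix that may end up upstream.

```python
import posixpath

def s3_key(split, image_id):
    """Build the object key with forward slashes on every platform."""
    # posixpath.join always uses "/" regardless of the host OS
    return posixpath.join(split, image_id + ".jpg")

def normalize_key(path):
    """Convert an already-built Windows-style path to forward slashes."""
    return path.replace("\\", "/")

print(s3_key("validation", "example_image_id"))
print(normalize_key("validation\\example_image_id.jpg"))
```

Both calls yield a forward-slash key. Inside the patched line, either approach would replace the re.sub call; str.replace is enough here because a single literal character is being substituted.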

