What is the most Pythonic way to handle a list of nested dictionaries?

Problem description

What is the most Pythonic way to identify the different types of nested dictionaries returned by an API, so that the correct kind of parsing can be applied to each one?

I am making API calls to Reddit to pull URLs, and the responses come back as nested dictionaries with different key names and different nesting structures.
I am extracting the URLs I need, but I want a more Pythonic way to identify the different key names and structures of these nested dictionaries, because the if statements I try inside a for loop keep erroring out: if a dictionary does not contain the key the if statement "asks" for, I get an error, and when the cell holds None I get a NoneType error from the if statement as well.

In the next few paragraphs I describe the problem, but you may be able to dig straight into the dictionary examples and the code below and see the issue: I cannot identify which of the three dictionary types I have in a single pass. The nested dictionaries do not share the same structure, and my code is littered with trys and what I believe are redundant for loops.

I have a function that handles the three types of nested dictionaries. topics_data (used below) is a pandas DataFrame, and vid is the name of the column in topics_data that holds the nested dictionaries. Sometimes the object in a vid cell is None, because the post I am reading is not a video post.

The API returns only three main types of nested dictionaries (when the value is not None). My biggest problem is identifying the first key name without raising an error: an if statement that checks for reddit_video blows up on dictionaries that start with another key such as oembed instead (or gives me a NoneType error). Because of this, I iterate over the list of nested dictionaries three times, once for each of the three dictionary types. I would like to loop over the list once, identifying and parsing each type of nested dictionary in that single pass.
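To make the failure mode concrete, here is a minimal illustration with a stand-in dictionary (the key names match the real payloads shown below):

# A stand-in for one of the API payloads shown below
post = {'type': 'gfycat.com',
        'oembed': {'thumbnail_url': 'https://thumbs.gfycat.com/Example-size_restricted.gif'}}

try:
    post['reddit_video']            # raises KeyError: the key is not present
except KeyError as err:
    print('KeyError:', err)

vid = None                          # non-video posts store None in the vid column
try:
    vid['reddit_video']             # raises TypeError: None is not subscriptable
except TypeError as err:
    print('TypeError:', err)

# Membership tests and .get() never raise for a missing key
print('reddit_video' in post)       # False
print(post.get('reddit_video'))     # None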

Below are examples of the three different types of nested dictionaries I get, along with the ugly code I currently have in place to handle them. My code works, but it is ugly. Please dig in and take a look.

The nested dictionaries...

Nested dictionary one

{'reddit_video': {'fallback_url': 'https://v.redd.it/te7wsphl85121/DASH_2_4_M?source=fallback',
  'height': 480,
  'width': 480,
  'scrubber_media_url': 'https://v.redd.it/te7wsphl85121/DASH_600_K',
  'dash_url': 'https://v.redd.it/te7wsphl85121/DASHPlaylist.mpd?a=1604490293%2CYmQzNDllMmQ4MDVhMGZhODMyYmIxNDc4NTZmYWNlNzE2Nzc3ZGJjMmMzZGJjMmYxMjRiMjJiNDU4NGEzYzI4Yg%3D%3D&v=1&f=sd',
  'duration': 17,
  'hls_url': 'https://v.redd.it/te7wsphl85121/HLSPlaylist.m3u8?a=1604490293%2COTg2YmIxZmVmZGNlYTVjMmFiYjhkMzk5NDRlNWI0ZTY4OGE1NzgxNzUyMDhkYjFiNWYzN2IxYWNkZjM3ZDU2YQ%3D%3D&v=1&f=sd',
  'is_gif': False,
  'transcoding_status': 'completed'}}

Nested dictionary two

{'type': 'gfycat.com',
 'oembed': {'provider_url': 'https://gfycat.com',
  'description': 'Hi! We use cookies and similar technologies ("cookies"), including third-party cookies, on this website to help operate and improve your experience on our site, monitor our site performance, and for advertising purposes. By clicking "Accept Cookies" below, you are giving us consent to use cookies (except consent is not required for cookies necessary to run our site).',
  'title': 'Protestors in Hong Kong are cutting down facial recognition towers.',
  'type': 'video',
  'author_name': 'Gfycat',
  'height': 600,
  'width': 600,
  'html': '<iframe class="embedly-embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2Fedibleunrulyargentineruddyduck&display_name=Gfycat&url=https%3A%2F%2Fgfycat.com%2Fedibleunrulyargentineruddyduck-hong-kong-protest&image=https%3A%2F%2Fthumbs.gfycat.com%2FEdibleUnrulyArgentineruddyduck-size_restricted.gif&key=ed8fa8699ce04833838e66ce79ba05f1&type=text%2Fhtml&schema=gfycat" width="600" height="600" scrolling="no" title="Gfycat embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>',
  'thumbnail_width': 280,
  'version': '1.0',
  'provider_name': 'Gfycat',
  'thumbnail_url': 'https://thumbs.gfycat.com/EdibleUnrulyArgentineruddyduck-size_restricted.gif',
  'thumbnail_height': 280}}

Nested dictionary three

{'oembed': {'provider_url': 'https://gfycat.com',
  'description': 'Hi! We use cookies and similar technologies ("cookies"), including third-party cookies, on this website to help operate and improve your experience on our site, monitor our site performance, and for advertising purposes. By clicking "Accept Cookies" below, you are giving us consent to use cookies (except consent is not required for cookies necessary to run our site).',
  'title': 'STRAYA! Ski-roos.   Stephan Grenfell for Australian Geographic',
  'author_name': 'Gfycat',
  'height': 338,
  'width': 600,
  'html': '<iframe class="embedly-embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2Fhairyvibrantamericanratsnake&display_name=Gfycat&url=https%3A%2F%2Fgfycat.com%2Fhairyvibrantamericanratsnake-snow-kangaroos&image=https%3A%2F%2Fthumbs.gfycat.com%2FHairyVibrantAmericanratsnake-size_restricted.gif&key=ed8fa8699ce04833838e66ce79ba05f1&type=text%2Fhtml&schema=gfycat" width="600" height="338" scrolling="no" title="Gfycat embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>',
  'thumbnail_width': 444,
  'version': '1.0',
  'provider_name': 'Gfycat',
  'thumbnail_url': 'https://thumbs.gfycat.com/HairyVibrantAmericanratsnake-size_restricted.gif',
  'type': 'video',
  'thumbnail_height': 250},
 'type': 'gfycat.com'}  

My function that handles the three types of nested dictionaries. topics_data is the pandas DataFrame, and vid is the name of the column that holds the nested dictionary (or None).

import urllib.request
import youtube_dl

def download_vid(topics_data, ydl_opts):
    for i in topics_data['vid']:
        try:
            if i['reddit_video']:
                B = i['reddit_video']['fallback_url']
                with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                    ydl.download([B])

                print(B)
        except:
            pass
    for n, i in enumerate(topics_data['vid']):
        try:
            if i['type'] == 'gfycat.com':
                C = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                C = 'https://giant.gfycat.com/'+ C +'.mp4'
                sub = str(topics_data.loc[n]['subreddit']).lower()
                urllib.request.urlretrieve(C,
                                           '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+C.split('/')[-1:][0])

                print(C)
        except:
            pass
    for n, i in enumerate(topics_data['vid']):
        try:
            if i['oembed']['thumbnail_url']:
                D = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                D = 'https://giant.gfycat.com/'+ D +'.mp4'
                sub = str(topics_data.loc[n]['subreddit']).lower()
                urllib.request.urlretrieve(D, '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+D.split('/')[-1:][0])
                print(D)
        except:
            pass  

After writing this code, I realized the if statements are redundant, because each try block will successfully parse topics_data.loc[n]['vid']['oembed'] whenever that is possible anyway. Do not get hung up on how the nested dictionaries are parsed, because that is not my question. My question is mainly how to identify which type of nested dictionary the iterator currently holds; I assume this can all be handled inside a single for loop instead of three. One last wrinkle: occasionally there is a fourth, fifth, or sixth type of dictionary that I am not interested in parsing, because they are too rare.

This last piece of code is probably not necessary, but I am adding it to make the question complete. My dictionary-identifying-and-parsing function also takes arguments for youtube-dl.

def my_hook(d):
    if d['status'] == 'finished':
        print('Done downloading, now converting ...')

def yt_dl_opts(topics_data):
    ydl_opts = {
        'format': 'bestvideo+bestaudio/37/22/18/best',
        'merge': 'mp4',
        'noplaylist' : True,        
        'progress_hooks': [my_hook],
        'outtmpl' : '/media/iii/Q2/tor/Reddit/Subs/'+ str(topics_data.loc[0]['subreddit']).lower()+'/%(id)s'
    }
    return ydl_opts  
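For completeness, a minimal sketch of how these two helpers would be wired together (assuming topics_data has already been built from the API results elsewhere):

# Hypothetical call site; topics_data is assumed to be a pandas DataFrame
# with 'vid' (nested dict or None) and 'subreddit' columns.
ydl_opts = yt_dl_opts(topics_data)
download_vid(topics_data, ydl_opts)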

Update
Here is the answer to the question, arrived at with Neil's help, posted so the Q and A are clearer for posterity.
Everything is still wrapped in a try: except: pass, because there are still some random, always-new dict structures coming back. I also wrote a loop to count the video results that are not None, and used os.walk to count all the videos that were successfully downloaded (a sketch of that counting follows the function below).

def download_vid(topics_data, ydl_opts):
    y_base = 'https://www.youtube.com/watch?v='
    for n, i in enumerate(topics_data['vid']):
        try:
            if 'type' in i:
                if 'youtube.com' in i['type']:
                    print('This is a Youtube Video')
                    A = i['oembed']['html'].split('embed/')[1].split('?')[0]
                    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                        ydl.download([A])
                    print(y_base+A)

            if 'reddit_video' in i:
                print('This is a reddit_video Video')
                B = i['reddit_video']['fallback_url']
                with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                    ydl.download([B])
                print(B)

            if 'type' in i:
                if 'gfycat.com' in i['type']:
                    print('This is a type, gfycat Video')
                    C = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                    C = 'https://giant.gfycat.com/'+ C +'.mp4'
                    sub = str(topics_data.loc[n]['subreddit']).lower()
                    urllib.request.urlretrieve(C,
                                       '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+C.split('/')[-1:][0])
                    print(C)

            if 'oembed' in i:
                print('This is a oembed, gfycat Video')
                D = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                D = 'https://giant.gfycat.com/'+ D +'.mp4'
                sub = str(topics_data.loc[n]['subreddit']).lower()
                urllib.request.urlretrieve(D, '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+D.split('/')[-1:][0])
                print(D)
        except:
            pass
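The counting mentioned in the update is not shown in the question; a minimal sketch of what it might look like, reusing topics_data and the download directory from the code above:

import os

# Candidate video posts: rows whose 'vid' cell is not None
n_candidates = topics_data['vid'].notna().sum()

# Files actually downloaded: walk the target directory used by the code above
download_root = '/media/iii/Q2/tor/Reddit/Subs/'
n_downloaded = sum(len(files) for _, _, files in os.walk(download_root))

print(f'{n_downloaded} files downloaded for {n_candidates} video posts')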

Tags: python, dictionary, reddit

Solution


Update: realized the OP's text is dealing with non-unique lookups. Added a paragraph describing how to handle that.

If you find yourself looping over a list of dictionaries multiple times to perform lookups, restructure the list into a dictionary so that the lookup becomes a key. For example, this:

a = [{"id": 1, "value": "foo"}, {"id": 2, "value": "bar"}]
for item in a:
    if item["id"] == 1:
        print(item["value"])

can become this:

a = [{"id": 1, "value": "foo"}, {"id": 2, "value": "bar"}]
a = {item["id"]: item for item in a} # index by lookup field

print(a[1]["value"]) # no loop
... # Now we can continue to look up by id, e.g. a[2], without a loop

If it is a non-unique lookup, you can do something like this:

indexed = {}
a = [{"category": 1, "value": "foo"}, {"category": 2, "value": "bar"}, {"category": 1, "value": "baz"}]
for item in a: # This loop only has to be executed once
    if indexed.get(item["category"], None) is not None:
        indexed[item["category"]].append(item)
    else:
        indexed[item["category"]] = [item]

# Now we can do:
all_category_1_data = indexed[1]
all_category_2_data = indexed[2]
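As a side note, the same grouping can be written a little more compactly with collections.defaultdict; a small sketch:

from collections import defaultdict

a = [{"category": 1, "value": "foo"}, {"category": 2, "value": "bar"}, {"category": 1, "value": "baz"}]

indexed = defaultdict(list)        # missing keys start out as empty lists
for item in a:                     # still a single pass over the list
    indexed[item["category"]].append(item)

all_category_1_data = indexed[1]   # same result as the explicit if/else above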

If you run into indexing errors, use dictionary lookups with a default to handle them more gracefully:

if a.get(1, None) is not None:
    print(a[1]["value"])
else:
    print("1 was not in the dictionary")

There is nothing particularly "Pythonic" about this IMO, but if the API returns a list that you have to loop over for lookups, it may simply be a poorly designed API.

Update: OK, I'll take a shot at fixing your code:

def download_vid(topics_data, ydl_opts):
    indexed_data = {'reddit': [], 'gfycat': [], 'thumbnail': []}

    for item in topics_data['vid']:
        if item.get('reddit_video', None) is not None:
            indexed_data['reddit'].append(item)
        elif item.get('type', None) == "gfycat.com":
            indexed_data['gfycat'].append(item)
        elif item.get('oembed', None) is not None:
            if item['oembed'].get('thumbnail_url', None) is not None:
                indexed_data['thumbnail'].append(item)

    for k, items in indexed_data.items():
        assert k in ('reddit', 'gfycat', 'thumbnail')
        for v in items:
            if k == 'reddit':
                B = v['reddit_video']['fallback_url']
                ...
            elif k == 'gfycat':
                C = v['oembed']['thumbnail_url']
                ...
            elif k == 'thumbnail':
                D = v['oembed']['thumbnail_url']
                ...

In case it is not clear why this is better:

  • The OP loops over topics_data['vid'] three times. I do it twice.

  • More importantly, if more topics are added, I still only loop twice. The OP would have to loop again.

  • There is no exception handling.

  • Each group of objects is now indexed, so the OP can do, e.g., indexed_data['gfycat'] to get all of those objects if needed; that is a hash-table lookup, so it is fast.
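To see the classification pass in isolation, here is a small self-contained sketch with made-up dictionaries in the three shapes from the question (in the real code they come from topics_data['vid']):

samples = [
    {'reddit_video': {'fallback_url': 'https://v.redd.it/example/DASH_480'}},
    {'type': 'gfycat.com', 'oembed': {'thumbnail_url': 'https://thumbs.gfycat.com/A-size_restricted.gif'}},
    {'oembed': {'thumbnail_url': 'https://thumbs.gfycat.com/B-size_restricted.gif'}, 'type': 'gfycat.com'},
    None,  # a non-video post
]

indexed_data = {'reddit': [], 'gfycat': [], 'thumbnail': []}
for item in samples:
    if item is None:
        continue                                    # skip non-video posts
    if item.get('reddit_video') is not None:
        indexed_data['reddit'].append(item)
    elif item.get('type') == 'gfycat.com':
        indexed_data['gfycat'].append(item)
    elif item.get('oembed', {}).get('thumbnail_url') is not None:
        indexed_data['thumbnail'].append(item)

print({k: len(v) for k, v in indexed_data.items()})  # {'reddit': 1, 'gfycat': 2, 'thumbnail': 0}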

