Get only the list of subfolder names in a directory in Azure Data Lake using Python

Problem description

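I have data in an Azure Data Lake container named infaqa. The directories in infaqa look like this: `infqa/EIM//Sales/Raw/APXTConga4__Composer_Setting__mdt`, `infqa/EIM//Sales/Raw/Account`, and so on.

I am using an Azure notebook and the BlobServiceClient library to list blobs, but the listing includes every subfolder and every nested sub-subfolder. I only want the first-level subfolder names in a list, for example ['APXTConga4__Composer_Setting__mdt', 'Account', ...], taken from the output shown below.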

    Input:
        blobPrefix = "/EIM/Sales/Raw/"     
        mylist=[]
        objects=[]
        blob_list = container_client.list_blobs(blobPrefix)
        for blob in blob_list:
            mylist.append(blob.name)
            print(blob.name)
    
    Output:
    EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt
    EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt/2020
    EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt/2020/12
    EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt/2020/12/02
    EIM/Sales/Raw/Account
    EIM/Sales/Raw/Account/2020
    EIM/Sales/Raw/Account/2020/12
    EIM/Sales/Raw/Account/2020/12/02

Tags: python, azure, azure-data-lake-gen2

Solution


Try this solution; I tested it on my system.

I have a folder structure like the following, where test is the container and account is a folder:

1) test/account/main/sub1/
2) test/account/test1/sub2
3) test/account/test2/sub3


    from azure.storage.blob import BlobPrefix, BlobServiceClient

    source_key = 'Key'                    # storage account access key
    source_account_name = 'Account Name'  # storage account name
    blob_service_client = BlobServiceClient(
        account_url=f'https://{source_account_name}.blob.core.windows.net/',
        credential=source_key)
    source_container_client = blob_service_client.get_container_client(
        'Container name')

    prefix = 'account/'  # parent folder; keep the trailing slash
    allfolders = []

    # With a delimiter, walk_blobs lists only one level below the prefix.
    # Virtual directories come back as BlobPrefix items, e.g. 'account/main/'.
    for item in source_container_client.walk_blobs(
            name_starts_with=prefix, delimiter='/'):
        if isinstance(item, BlobPrefix):
            # Drop the parent prefix and the trailing slash to keep the name only.
            allfolders.append(item.name[len(prefix):].rstrip('/'))

    print(allfolders)

Output

[Screenshots: folder structure in the storage account]

With this, I was able to get all the subfolder names.
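If a flat listing like the one in the question is already in memory, the first-level folder names can also be derived from the blob names alone, with no extra service call. The following is a minimal sketch of that idea, not part of the original answer; the helper name `immediate_subfolders` is hypothetical. It treats a name as a folder only when something is nested below it, so empty folders are not detected.

```python
def immediate_subfolders(blob_names, prefix):
    """Return the sorted first-level folder names found under `prefix`.

    `immediate_subfolders` is a hypothetical helper for illustration.
    A segment counts as a folder only if at least one name is nested below it.
    """
    # Normalise the prefix: no leading slash, exactly one trailing slash.
    prefix = prefix.strip("/")
    prefix = prefix + "/" if prefix else ""

    folders = set()
    for name in blob_names:
        name = name.lstrip("/")
        if not name.startswith(prefix):
            continue
        # Split the part after the prefix at the first slash.
        first, sep, _rest = name[len(prefix):].partition("/")
        if first and sep:
            folders.add(first)
    return sorted(folders)


# Blob names copied from the output shown in the question.
names = [
    "EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt",
    "EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt/2020",
    "EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt/2020/12",
    "EIM/Sales/Raw/APXTConga4__Composer_Setting__mdt/2020/12/02",
    "EIM/Sales/Raw/Account",
    "EIM/Sales/Raw/Account/2020",
    "EIM/Sales/Raw/Account/2020/12",
    "EIM/Sales/Raw/Account/2020/12/02",
]
print(immediate_subfolders(names, "/EIM/Sales/Raw/"))
# prints ['APXTConga4__Composer_Setting__mdt', 'Account']
```

This still downloads the full recursive listing, so for large containers the delimiter-based `walk_blobs` call is likely the cheaper choice.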

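Since the question is tagged azure-data-lake-gen2, another option is the separate `azure-storage-file-datalake` package, which is not what the answer above uses. The sketch below assumes the storage account has the hierarchical namespace enabled and that this package is installed; `list_subfolders` and `make_file_system_client` are hypothetical helper names. It relies on `FileSystemClient.get_paths` with `recursive=False` and on the `is_directory` flag of each returned path.

```python
import posixpath


def list_subfolders(file_system_client, directory):
    """Return the names of the directories directly under `directory`.

    `file_system_client` is expected to be an
    azure.storage.filedatalake.FileSystemClient (hypothetical helper).
    """
    # recursive=False limits the listing to one level below `directory`.
    paths = file_system_client.get_paths(path=directory, recursive=False)
    # Each item carries its full path in `name`; keep the last segment only.
    return [posixpath.basename(p.name.rstrip("/")) for p in paths if p.is_directory]


def make_file_system_client(account_name, account_key, file_system):
    """Build a FileSystemClient from an account key (hypothetical helper)."""
    # Imported lazily so list_subfolders can be used or tested without the SDK.
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=account_key,
    )
    return service.get_file_system_client(file_system)
```

A call would look like `list_subfolders(make_file_system_client('Account Name', 'Key', 'Container name'), 'EIM/Sales/Raw')`, with the same placeholder values as in the answer's script.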

