Retrieving data based on date partitions in a dynamic frame

Problem description

I want to load data from Amazon S3 into a dynamic frame, but restrict it to a date range. My data is stored as Parquet files in S3 in this layout:
s3://bucket/all-dates/year=2021/month=4/day=13/
s3://bucket/all-dates/year=2021/month=4/day=14/
s3://bucket/all-dates/year=2021/month=4/day=15/
s3://bucket/all-dates/year=2021/month=4/day=16/

At the moment I load the data in my script like this:

ds1 = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options =
        {"paths":
            [
                "s3://bucket/all-dates/"
            ],
            "recurse": True
        },
    format = "parquet"
)

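This works fine, since it currently loads all of the data into the dynamic frame. What I would like to do instead is recurse only into the files from the last week or the last two weeks, counted back from the day the script runs.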

Any help is appreciated. Thanks.

Tags: amazon-s3, pyspark, aws-glue

Solution


You can build a list of dates, turn it into a list of S3 paths, and pass that list in the connection options:

import pandas as pd

start_date = '2020-01-01'
end_date = '2020-01-10'
# One S3 prefix per day in the range (both endpoints included)
paths = [f's3://bucket/all-dates/year={d.year}/month={d.month}/day={d.day}/' for d in pd.date_range(start_date, end_date)]
# ['s3://bucket/all-dates/year=2020/month=1/day=1/',
#  's3://bucket/all-dates/year=2020/month=1/day=2/',
#  's3://bucket/all-dates/year=2020/month=1/day=3/',
#  's3://bucket/all-dates/year=2020/month=1/day=4/',
#  's3://bucket/all-dates/year=2020/month=1/day=5/',
#  's3://bucket/all-dates/year=2020/month=1/day=6/',
#  's3://bucket/all-dates/year=2020/month=1/day=7/',
#  's3://bucket/all-dates/year=2020/month=1/day=8/',
#  's3://bucket/all-dates/year=2020/month=1/day=9/',
#  's3://bucket/all-dates/year=2020/month=1/day=10/']

ds1 = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options =
        {
            "paths": paths,
            "recurse": True # probably unnecessary since we gave the exact paths
        },
    format = "parquet"
)
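
The question asks for the last week or two counted back from the run date, rather than a fixed range. A minimal sketch of that, using only the standard library, might look like the following. The helper name `recent_partition_paths` and its parameters are hypothetical, and it assumes the unpadded `year=/month=/day=` layout shown in the question.

```python
from datetime import date, timedelta

def recent_partition_paths(base, days, today=None):
    """Return one S3 prefix per day for the `days` days ending at `today`.

    `base` is assumed to end with a slash, e.g. "s3://bucket/all-dates/".
    `today` defaults to the date on which the script runs.
    """
    today = today or date.today()
    # Oldest day first; the last entry is `today` itself.
    dates = [today - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
    return [f"{base}year={d.year}/month={d.month}/day={d.day}/" for d in dates]

# Fixed date used here only to make the example reproducible;
# omit `today` in a real job to use the run date.
paths = recent_partition_paths("s3://bucket/all-dates/", days=7, today=date(2021, 4, 16))
```

The resulting `paths` list can be passed as `"paths": paths` in `connection_options`, exactly as in the answer above; use `days=14` for two weeks. If some days may have no data, it is probably worth checking that each prefix exists before passing it along, since the original answer does not say how missing prefixes are handled.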
