Running an MPI Python script in an Azure ML pipeline

Problem description

I am trying to run a distributed Python job through an Azure ML pipeline using the MpiStep pipeline class, following this sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb

I tried to implement the same thing, but even after changing the node count parameter in the MpiStep class, the script always reports a size (i.e. comm.Get_size()) of 1 when it runs. Could you help me figure out what I am missing here? Is any specific setup required on the cluster?

Code snippets:

Pipeline code snippet:

model_dir = model_ds.path('./'+saved_model_blob+'/',data_reference_name='saved_model_path').as_mount()
label_dir = model_ds.path('./'+model_label_blob+'/',data_reference_name='model_label_blob').as_mount()

input_images = result_ds.path('./'+score_blob_name+'/',data_reference_name='Input_images').as_mount()

output_container = 'abc'
inti_container = 'xyz'



distributed_batch_score_step = MpiStep(
    name="batch_scoring",
    source_directory=SCRIPT_FOLDER,
    script_name="batch_scoring_script_mpi.py",
    arguments=["--dataset_path", input_images, 
               "--model_name", model_dir,
               "--label_dir", label_dir, 
               "--intermediate_data_container", inti_container, 
               "--output_container", output_container],
    compute_target=gpu_cluster,
    inputs=[input_images, model_dir,label_dir],
    pip_packages=["tensorflow","tensorflow-gpu==1.13.1","pillow","azure-keyvault","azure-storage-blob"],
    conda_packages=["mesa-libgl-cos6-x86_64","mpi4py==3.0.2","opencv=3.4.2","scikit-learn=0.21.2"],                                 
    use_gpu=True,
    allow_reuse=False,
    node_count=nodecount_param,
    process_count_per_node=1
)
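
Note that nodecount_param is not defined in the snippet above. A minimal sketch of how it could be declared as a PipelineParameter (before constructing the MpiStep) and how the pipeline might be submitted is shown below; the workspace variable ws, the experiment name, and the default node count are assumptions for illustration:

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline, PipelineParameter

ws = Workspace.from_config()  # assumed workspace configuration

# Declared before the MpiStep above, so node_count can be changed at submission time
nodecount_param = PipelineParameter(name="node_count", default_value=2)

# Assemble the pipeline around the MpiStep and submit it (experiment name is an assumption)
pipeline = Pipeline(workspace=ws, steps=[distributed_batch_score_step])
pipeline_run = Experiment(ws, "batch-scoring-mpi").submit(
    pipeline, pipeline_parameters={"node_count": 2})
pipeline_run.wait_for_completion(show_output=True)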

Python script snippet:

# Imports the snippet relies on (helpers such as get_file_names, init, and
# variables such as args, score_df, pp_output, currentDT are defined elsewhere)
from mpi4py import MPI
import pandas as pd
import tensorflow as tf


def run(input_dataset, comm):
    rank = comm.Get_rank()
    size = comm.Get_size()
    print("Rank:", rank)
    print("Size:", size)  # always shows 1, even when the input node count is > 1
    print(MPI.Get_processor_name())

    file_names = sorted(get_file_names(args.dataset_path))

    # split the file list evenly across the MPI ranks
    partition_size = len(file_names) // size
    print("partition_size-->", partition_size)
    partitioned_filenames = file_names[rank * partition_size: (rank + 1) * partition_size]
    print("RANK {} - is processing {} images out of the total {}".format(
        rank, len(partitioned_filenames), len(file_names)))

    # call to Function 01

    # call to Function 02

    img_names = score_df['image_name'].unique()
    output_batch = pd.DataFrame()
    for i in img_names:
        # call to Function 3
        output_batch = output_batch.append(pp_output, ignore_index=True)
        output_paths_list = comm.gather(output_batch, root=0)

    print("RANK {} - number of pre-aggregated output files {}".format(rank, len(output_batch)))

    print("saved in", currentDT + '\\' + 'data.csv')

    if rank == 0:
        print("RANK {} - number of aggregated output files {}".format(rank, len(output_paths_list)))
        print("RANK {} - end".format(rank))


if __name__ == "__main__":
    with tf.device('/GPU:0'):
        init()
        comm = MPI.COMM_WORLD
        run(args.dataset_path, comm)
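
As a quick sanity check that is independent of the scoring logic, a minimal mpi4py script (illustrative only, not part of the original code) can be used as the step's entry script to confirm whether multiple ranks are actually launched:

# check_mpi.py - hypothetical sanity-check script for the MPI world size
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank {} of {} on host {}".format(
    comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

If this still reports a size of 1 while node_count is greater than 1, the problem likely lies in how mpi4py was installed on the cluster rather than in the scoring script itself, which is what the accepted fix below confirms.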

Tags: python-3.x, azure-pipelines, mpi4py, azure-machine-learning-service

Solution


It turned out the problem was caused by the package installation: mpi4py was previously installed through conda with conda_packages=["mpi4py==3.0.2"]. After switching the installation to pip with pip_packages=["mpi4py"], it worked.
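
In practice this means removing mpi4py from conda_packages and adding it to pip_packages (likely because the pip build compiles against the MPI runtime already present on the cluster image, while the conda package pulls in its own MPI). A sketch of the adjusted package lists for the MpiStep above, with everything else unchanged, would be:

# mpi4py moved from conda_packages to pip_packages; all other MpiStep arguments
# stay exactly as in the question's snippet
pip_packages = ["tensorflow", "tensorflow-gpu==1.13.1", "pillow",
                "azure-keyvault", "azure-storage-blob",
                "mpi4py"]                         # now installed via pip, unpinned
conda_packages = ["mesa-libgl-cos6-x86_64",       # mpi4py no longer listed here
                  "opencv=3.4.2", "scikit-learn=0.21.2"]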

