python-3.x - 在 MPI azure ml 管道中运行 MPI python 脚本
问题描述
我正在尝试通过使用 MPIStep 管道类的 azure ML 管道运行分布式 python 作业,方法是参考以下示例链接 - https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/机器学习管道/管道样式传输/管道样式传输.ipynb
我尝试实现相同但即使我更改了 MpiStep 类中的节点计数参数,在运行脚本时它总是显示大小(即 comm.Get_size())为 1。你能帮我解决我在这里缺少的东西吗?集群上是否需要任何特定设置?
代码片段:
管道代码片段:
model_dir = model_ds.path('./'+saved_model_blob+'/',data_reference_name='saved_model_path').as_mount()
label_dir = model_ds.path('./'+model_label_blob+'/',data_reference_name='model_label_blob').as_mount()
input_images = result_ds.path('./'+score_blob_name+'/',data_reference_name='Input_images').as_mount()
output_container = 'abc'
inti_container = 'xyz'
distributed_batch_score_step = MpiStep(
name="batch_scoring",
source_directory=SCRIPT_FOLDER,
script_name="batch_scoring_script_mpi.py",
arguments=["--dataset_path", input_images,
"--model_name", model_dir,
"--label_dir", label_dir,
"--intermediate_data_container", inti_container,
"--output_container", output_container],
compute_target=gpu_cluster,
inputs=[input_images, model_dir,label_dir],
pip_packages=["tensorflow","tensorflow-gpu==1.13.1","pillow","azure-keyvault","azure-storage-blob"],
conda_packages=["mesa-libgl-cos6-x86_64","mpi4py==3.0.2","opencv=3.4.2","scikit-learn=0.21.2"],
use_gpu=True,
allow_reuse = False,
node_count = nodecount_param,
process_count_per_node = 1
)
Python 脚本代码片段:
def run(input_dataset,comm):
rank = comm.Get_rank()
size = comm.Get_size()
print("Rank:" , rank)
print("Size:", size) # shows always 1, even the input node count is >1
print(MPI.Get_processor_name())
file_names = get_file_names(args.dataset_path)
sorted(file_names)
partition_size = len(file_names) // size
print("partition_size-->",partition_size)
partitioned_filenames = file_names[rank * partition_size: (rank + 1) * partition_size]
print("RANK {} - is processing {} images out of the total {}".format(rank, len(partitioned_filenames),
len(file_names)))
# call to Function 01
# call to Function 02
img_names = score_df['image_name'].unique()
output_batch = pd.DataFrame()
for i in img_names:
# call to Function 3
output_batch = output_batch.append(pp_output, ignore_index=True)
output_paths_list = comm.gather(output_batch, root=0)
print("RANK {} - number of pre-aggregated output files {}".format(rank, len(output_batch)))
print("saved in", currentDT + '\\' + 'data.csv')
if rank == 0:
print("RANK {} - number of aggregated output files {}".format(rank, len(output_paths_list)))
print("RANK {} - end".format(rank))
if __name__ == "__main__":
with tf.device('/GPU:0'):
init()
comm = MPI.COMM_WORLD
run(args.dataset_path,comm)
解决方案
知道问题是由于软件包版本引起的,之前它是通过 conda 使用 conda_packages=["mpi4py==3.0.2"] 安装的,它在通过 pip - pip_packages=["mpi4py"] 更改安装后工作
推荐阅读
- c - ARM Cortex 上的超级简单 Tasker
- http - groovy - 下载带有身份验证的文件
- android - 如何在 SQLite 中对数据行求和?
- python - 抽象类的属性?
- python-3.x - UnicodeDecodeError:“charmap”编解码器无法解码位置 591 中的字节 0x8f:字符映射到
- corda - Corda - 为什么 deployNodes 输出一个无用的 JAR?
- c++ - 为什么不能订购函数参数评估?
- ios - RxSwift,依赖链的下载返回相同的 Observable 类型
- jquery - Jquery替代mailto链接与html正文
- android - 奥利奥(8.1)上的吐司重叠问题