airflow - Creating Airflow subprocesses based on the number of files
Question
I'm new to Airflow. I have a list of 100+ servers in a text file. Currently, a Python script logs in to each server, reads a file, and writes the output, and it takes a long time to finish. If this job is converted to an Airflow DAG, is it possible to split the servers into multiple groups and start a new task for each group using some operator? Or can this only be achieved by modifying the Python script (for example, using async) and running it with the Python operator? I'm looking for advice or best practices. I searched for examples but could not find any. Thanks!
Solution
Airflow is not really a "map-reduce" type of framework, which is what you seem to be trying to implement. Airflow tasks are not (at least currently) designed to split work between them. It is very atypical for Airflow to have N tasks that each do the same thing on a subset of the data. Airflow is meant for orchestrating logic, so each task conceptually does a different thing, and there are rarely cases where N parallel tasks do the same thing on different subsets of data. More often than not, Airflow tasks do not do the job themselves; they tell other systems what to do and wait until it gets done.
Typically, Airflow is used to orchestrate services that excel at this kind of processing. For example, you could have it trigger a Hadoop job, or another dedicated tool, that handles the parallel, map-reduce style work. You could also, as you mentioned, run async, multi-threaded, or even multi-process code inside a Python operator, but beyond a certain scale I think dedicated tools are usually much easier to use and give you more value, for example through more efficient use of parallelism.
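As a rough illustration of the multi-threading option, here is a minimal sketch in plain Python using the standard library's `concurrent.futures`. The names `read_remote_file` and `collect_outputs` are hypothetical, and the per-server function is only a stub: the real login and file-reading logic from the existing script would go there. The worker count of 10 is likewise an assumption to tune.

```python
from concurrent.futures import ThreadPoolExecutor


def read_remote_file(server: str) -> str:
    """Hypothetical placeholder for the per-server work (log in, read a file).

    A real version would open a remote session here; this stub only
    returns a string so the sketch can run on its own.
    """
    return f"contents from {server}"


def collect_outputs(servers, max_workers=10):
    """Run read_remote_file for every server on a thread pool.

    Returns a dict mapping each server name to its output, or to the
    exception raised for it, so one failure does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all servers up front; the pool runs at most max_workers at once.
        futures = {server: pool.submit(read_remote_file, server) for server in servers}
        for server, future in futures.items():
            try:
                results[server] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[server] = exc
    return results


if __name__ == "__main__":
    # Assumed server names, for illustration only.
    servers = [f"server-{i:03d}" for i in range(1, 6)]
    for name, output in collect_outputs(servers).items():
        print(name, output)
```

Threads suit this case because the work is mostly waiting on the network. A function like `collect_outputs` could then be passed as the `python_callable` of a single `PythonOperator`, so Airflow schedules one task while the parallelism stays inside the script.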