January 8, 2020


Migrating Apache Oozie workflows to Apache Airflow

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.

Talk Title Migrating Apache Oozie workflows to Apache Airflow
Speakers Feng Lu (Google Cloud), James Malone (Google), Apurva Desai (Google Cloud), Cameron Moberg (Truman State University)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date April 30-May 2, 2019

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems. Oozie lets users schedule Hadoop-related jobs out of the box (Java MapReduce, Pig, Hive, Sqoop, etc.) and supports some other system-specific jobs (SSH, Java programs, shell scripts, etc.). An Oozie workflow is defined as an XML file containing, among others, control nodes that direct the flow of the workflow and action nodes that execute work. Oozie additionally supports subworkflows and allows workflow node properties to be parameterized and dynamically evaluated using EL functions.

In contrast, Airflow is a generic workflow orchestration platform for programmatically authoring, scheduling, and monitoring workflows. A workflow (a.k.a. a directed acyclic graph, or DAG) is expressed in Python code using APIs provided by Airflow, such as DAG and Operator. Airflow not only supports Hadoop/Spark tasks (actions in Oozie) but also includes connectors to many other systems, such as GCP and common RDBMSs. Neither Oozie nor Airflow allows cycles in its workflows.

Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as part of creating an effective cross-cloud and cross-system solution.

The high-level design can be summarized as follows. Because the Oozie XML schema defines only a finite number of top-level node types (e.g., control and action), the tool first converts the Oozie XML file into a collection of nodes (stored in an ordered dictionary). It then processes these nodes in order and converts them into their corresponding Airflow representations. Finally, based on the type of each control node (fork, join, etc.), it retrofits the dependency relationships among the converted Airflow operators and tasks. The design is purposefully structured as a number of easily extendable modules.
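
The parse-then-connect idea can be illustrated with a minimal Python sketch. It is not the migration tool's actual code: the workflow XML is a hypothetical, heavily simplified example (no XML namespaces, only a few node types), and the helper names parse_nodes, downstream_names, and build_edges are assumptions made for illustration. It collects the top-level nodes into an ordered dictionary, then walks them in order to derive the upstream/downstream pairs that would become Airflow task dependencies.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# Hypothetical, simplified workflow; real Oozie files use XML namespaces
# and many more node types than shown here.
WORKFLOW_XML = """
<workflow-app name="demo">
  <start to="fork-node"/>
  <fork name="fork-node">
    <path start="pig-node"/>
    <path start="shell-node"/>
  </fork>
  <action name="pig-node">
    <pig/>
    <ok to="join-node"/>
    <error to="fail"/>
  </action>
  <action name="shell-node">
    <shell/>
    <ok to="join-node"/>
    <error to="fail"/>
  </action>
  <join name="join-node" to="end"/>
  <kill name="fail"/>
  <end name="end"/>
</workflow-app>
"""


def parse_nodes(xml_text):
    """Collect top-level nodes, in document order, keyed by node name."""
    nodes = OrderedDict()
    for element in ET.fromstring(xml_text):
        # <start> has no name attribute, so fall back to the tag itself.
        nodes[element.get("name", element.tag)] = element
    return nodes


def downstream_names(element):
    """Return the names of the nodes that follow `element` on the success path."""
    if element.tag == "fork":
        return [path.get("start") for path in element.findall("path")]
    if element.tag == "action":
        return [element.find("ok").get("to")]
    if element.tag in ("start", "join"):
        return [element.get("to")]
    return []  # end and kill nodes have no successors


def build_edges(nodes):
    """Derive (upstream, downstream) pairs by walking the nodes in order."""
    return [
        (name, target)
        for name, element in nodes.items()
        for target in downstream_names(element)
    ]


nodes = parse_nodes(WORKFLOW_XML)
edges = build_edges(nodes)
for upstream, downstream in edges:
    print(f"{upstream} -> {downstream}")
```

In this sketch the fork node fans out to two actions and both actions converge on the join node; in generated Airflow code, each pair would correspond to one upstream/downstream relationship between tasks. Error transitions are deliberately ignored to keep the example short.
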
For example, you can extend the base ActionMapper module to support converting a new Oozie action node.

Feng, James, Apurva, and Cameron start with an overview of Oozie and Airflow, including a brief comparison, followed by a number of migration use cases. They then outline the design of the Oozie-to-Airflow migration tool, emphasizing its flexibility and extensibility, and wrap up with a quick demo and some ideas for future improvements.
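
To make the ActionMapper extension point mentioned above more concrete, here is a hedged Python sketch of what such a module could look like. Only the name ActionMapper comes from the abstract; its constructor, the to_airflow_code method, the ShellMapper subclass, the MAPPERS registry, and the convert_action helper are all assumptions for illustration and do not reflect the tool's real interface. The sketch emits Python source text for an Airflow BashOperator rather than importing Airflow, so it runs with the standard library alone.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class ActionMapper(ABC):
    """Hypothetical base class: maps one Oozie action node to Airflow task code."""

    def __init__(self, name, action_element):
        self.name = name
        self.action_element = action_element

    @abstractmethod
    def to_airflow_code(self):
        """Return Python source text that defines the equivalent Airflow task."""


class ShellMapper(ActionMapper):
    """Illustrative mapper for a simplified <shell> action."""

    def to_airflow_code(self):
        command = self.action_element.findtext("exec")
        args = [arg.text for arg in self.action_element.findall("argument")]
        bash_command = " ".join([command] + args)
        # Hyphens are not valid in Python identifiers, so swap them out.
        task_var = self.name.replace("-", "_")
        return (
            f"{task_var} = BashOperator("
            f"task_id={self.name!r}, bash_command={bash_command!r})"
        )


# Registry keyed by action tag; a new action type means one new entry here.
MAPPERS = {"shell": ShellMapper}


def convert_action(action_xml):
    """Pick the mapper matching the action's child tag and generate task code."""
    node = ET.fromstring(action_xml)
    for child in node:
        mapper_cls = MAPPERS.get(child.tag)
        if mapper_cls is not None:
            return mapper_cls(node.get("name"), child).to_airflow_code()
    raise ValueError(f"no mapper registered for action {node.get('name')!r}")


# Hypothetical, simplified action node (no XML namespace).
ACTION_XML = """
<action name="shell-node">
  <shell>
    <exec>echo</exec>
    <argument>hello</argument>
  </shell>
  <ok to="end"/>
  <error to="fail"/>
</action>
"""

generated = convert_action(ACTION_XML)
print(generated)
```

Under these assumptions, supporting another action type would mean writing one more ActionMapper subclass and registering it, without touching the parsing or dependency logic.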
