Airflow Trigger Rules for Building Complex Data Pipelines Explained, and My Initial Days of Airflow Selection and Experience

Posted May 1, 2022 by Gowri Shankar  ‐  9 min read

Dell acquiring Boomi (circa 2010) was a big topic of discussion among my peers then; I was just starting to shift my career from developing system software and device drivers to building distributed IT products at enterprise scale. I was ignorant enough to ask, 'Why would someone pay so much for a piece of code that connects systems and schedules events?' I argued that those data pipeline processes could easily be built in-house rather than depending on an external product. To understand the value of an integration platform or a workflow management system, one should strive for excellence in maintaining and serving reliable data at large scale. Moving from building in-house data pipelines, to using Pentaho Kettle at enterprise scale, to enjoying the flexibility of Apache Airflow is one of the most significant parts of my data journey.

In this post, we shall explore the challenges involved in managing data, people issues, conventional approaches that can be improved without much effort, and, in particular, the trigger rules of Apache Airflow. This post falls under a new topic, Data Engineering (at scale).

Source code for all the dags explained in this post can be found in this repo

Flow

I thank Marc Lamberti for his guide to Apache Airflow; this post is just an attempt to complete what he started in his blog.

Objective

The objective of this post is to explore a few obvious challenges of designing and deploying data engineering pipelines, with a specific focus on the trigger rules of Apache Airflow 2.0. In the second section, we shall study the different branching strategies that Airflow provides to build complex data pipelines.

Introductions

A robust data engineering framework cannot be deployed without a sophisticated workflow management tool. I used Pentaho Kettle extensively for large-scale deployments for a significant period of my career. When I say large scale, I mean significantly large, though not of the order of social media platforms: the most voluminous data transfer was around 25-30 million records every 30 minutes, with a promise of 100% data integrity, for an F500 company. It worked, but not without problems; we had a rough journey and paid hefty prices in the process, but eventually succeeded.

The major problems I encountered were the following:

  • Workflow management tools, popularly known as ETLs, are usually graphical tools where the data engineer drags and drops actions and tasks in a closed environment.
  • They usually had connectors that made things easy, but they were not extensible for custom logic or complex algorithms.
  • Reusability was practically impossible; we had to make copies for every new deployment.
  • Last but not least, licensing them is almost always obscure. Those ETL tools came with a conditional free version whose conditions even best-of-class attorneys fail to understand.

Small companies aspiring to add major enterprises to their customer list often trivialize the technology problem and treat it as a people issue, because:

  0. Creating a wow factor is the primary, secondary, and tertiary concern for acquiring new labels, with seldom a focus on the nuances of technology to achieve excellence in delivery.

  1. Low or no funding to invest in tools (and the right people) to make life easy for every stakeholder including the paying customer, e.g. ETLs, data experts, etc.
  2. Assuming software engineers can solve everything and there is no need for a data engineering speciality among the workforce.
  3. Delivering reports and analytics from OLTP databases is common even among large corporations - a significant number of companies fail to deploy a Hadoop system or an OLTP-to-OLAP scheme despite having huge funds, because of issues #1 and #2.
  4. Further, legacy systems make it almost impossible for the IT team to even simplify the periodic data backup process.
  5. Stored procedures are the de facto choice when it comes to data munging. Please get rid of them.

Those issues are sufficient to keep a caring data engineer awake for almost 16 hours a day. If he has no supporting DevOps and infrastructure team around, those 16 hours shoot north to 18 hours or maybe more. Yet our system ran successfully and catered to the needs of my ever-demanding customers. Deep down in my heart I knew that, if not this one, the next customer deployment - i.e. one larger than the current one - was destined to fail.

Technology Selection: Kettle vs AWS Glue vs Airflow

The following are the three critical reasons for undergoing a technology transfer:

  • Single instance: our ETL system was running on a single EC2 node, and we were vertically scaling as and when the need arose
  • We couldn't decipher the licensing limitations of the free version, and
  • There was insufficient funding to move to an enterprise edition

I thought it was just a technology transfer operation; I did not fathom the new situation on its way to hit me so hard that I would never recover from it. The following are the technologies/tools I picked for the initial study:

  • Talend
  • Apache Airflow
  • Pentaho Kettle
  • AWS Glue
  • Apache Beam

Challenges

  • It was easy for me to reject Talend - one more licensing nightmare.
  • There was significant buy-in from the stakeholders for AWS Glue because it had the Amazon brand to resell to the customers. Amazon stickiness was an issue for my leadership team; they were in a perpetual fear of Amazon that was absolute stupidity. I took advantage of that fear to reject AWS Glue.
  • Beam vs Airflow was a difficult one because I had opposition from almost everyone for both - people still (circa 2018) feared open-source technologies in the enterprise world.
  • Never did I imagine that I would end up justifying Airflow over Pentaho Kettle, because it wasn't just a technology transfer but an org transformation.
    • We had built a Pentaho Kettle workforce over a while, and there was reluctance from the business toward a new tech stack
    • Upskilling/reskilling the Kettle workforce without extra investment was impossible
    • Learning Python is easy - but only for those who are willing to learn and believe that lifelong learning is the only way forward to excellence

Yet, I Picked Airflow and I am Glad That I Did

Despite knowing that user adoption would be a challenge, I picked Apache Airflow 1.0 (circa 2018-2020) and built a sophisticated, reusable, and robust pipeline that could be deployed in a distributed environment. I went with my gut feeling, trusted my instincts, and believed that beyond a certain point there isn't a need to have buy-in from anyone but oneself.

Rather than brood over the hardships I faced, let us study the most awesome Airflow, focusing on its triggering schemes, in the next section.

Apache Airflow 2.0 - Few Thoughts

With version 2.0 out there, Airflow has come a long way since my last attempt. Some of the things that caught my attention:

  • an elegant user interface,
  • decorators for creating dags and tasks, and
  • improved and highly available schedulers.

However, the key feature I am excited about is the effort that went into making the Airflow ecosystem modular. The following are the key contributors towards achieving a clean and elegant dag:

  • Task Groups
  • Separation of Airflow Core and Airflow Providers

There is talk that sub-dags are about to be deprecated in the forthcoming releases. To be frank, sub-dags are a bit painful to debug/maintain, and when things go wrong, sub-dags make them go truly wrong.
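To give a flavour of that modularity (a minimal sketch, not code from the linked repo; the dag and task names here are made up), the TaskFlow decorators and TaskGroup let a clean dag read like plain Python:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup


@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def modular_demo():
    start = DummyOperator(task_id="start")

    # A Task Group keeps related tasks folded together in the graph view
    with TaskGroup(group_id="transform") as transform:
        clean = DummyOperator(task_id="clean")
        aggregate = DummyOperator(task_id="aggregate")
        clean >> aggregate

    @task
    def publish():
        print("publishing results")

    start >> transform >> publish()


demo_dag = modular_demo()
```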

Trigger Rules

Branching the DAG flow is a critical part of building complex workflows, and the branching schemes often result in a state of confusion. In this section, we shall study the different branching schemes available in Apache Airflow 2.0.

All Success (default)

Definition: all parents have succeeded

[Figures: All Success (default) | All Success with a failure]
  • All Success (default): this is the default behavior, where both upstream tasks are expected to succeed in order to fire task 3
  • All Success with a failure: task 5 failed, which resulted in the subsequent downstream tasks being skipped
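A minimal sketch of the default rule (illustrative task ids, not the repo's dags): since all_success is the default, task_3 fires only when both parents succeed, and a failed parent keeps it from running.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="all_success_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task_1 = BashOperator(task_id="task_1", bash_command="echo 1")
    task_2 = BashOperator(task_id="task_2", bash_command="echo 2")
    # No trigger_rule set: defaults to "all_success"
    task_3 = BashOperator(task_id="task_3", bash_command="echo 3")

    [task_1, task_2] >> task_3
```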

All Failed

Definition: all parents are in a failed or upstream_failed state

[Figure: All Failed]

  • Task 1 and task 2 failed, which resulted in the execution of their downstream task
  • Subsequently, task 3 succeeded, which caused its downstream task to be skipped
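A hedged sketch of the same idea (hypothetical task ids): the downstream task fires only when both of its parents fail, which is handy for alerting or rollback paths.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="all_failed_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task_1 = BashOperator(task_id="task_1", bash_command="exit 1")  # fails
    task_2 = BashOperator(task_id="task_2", bash_command="exit 1")  # fails
    alert = BashOperator(
        task_id="alert_on_total_failure",
        bash_command="echo 'both parents failed'",
        trigger_rule=TriggerRule.ALL_FAILED,
    )

    [task_1, task_2] >> alert
```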

One Failed

Definition: fires as soon as at least one parent has failed, it does not wait for all parents to be done

[Figure: One Failed]

  • This rule expects at least one upstream task to fail. That is demonstrated in the first level, where task 2 failed and fired the fire_since_2_failed task.
  • In level 2, the success of task 4, task 5, and task 6 caused the skip_since_all_upstream_succeeded task to be skipped.
  • At level 3, even though not all the tasks had completed (the wait_and_fail task was still running), a single failure resulted in the final task firing.
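A sketch under assumed task names: the one_failed task reacts as soon as any parent fails, without waiting for the slower parent to finish.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="one_failed_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    slow_task = BashOperator(task_id="slow_task", bash_command="sleep 300")
    failing_task = BashOperator(task_id="failing_task", bash_command="exit 1")
    react_fast = BashOperator(
        task_id="fire_as_soon_as_one_parent_fails",
        bash_command="echo 'a parent failed, reacting immediately'",
        trigger_rule=TriggerRule.ONE_FAILED,
    )

    [slow_task, failing_task] >> react_fast
```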

All Done

Definition: all parents are done with their execution

[Figures: All Done waiting | All Done completed]
  • The key capability of all_done, compared to the previous examples we studied, is that execution waits until all upstream tasks are completed, regardless of their outcome.
  • In the first illustration, the last task, wait_and_fire_when_all_upstream_done, is waiting because the wait_and_fail task is yet to complete.
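A sketch of that waiting behaviour (illustrative names; wait_and_fail and wait_and_fire_when_all_upstream_done mirror the figure): the final task runs only after every parent reaches a terminal state, whether success or failure.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="all_done_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    quick_success = BashOperator(task_id="quick_success", bash_command="echo ok")
    wait_and_fail = BashOperator(task_id="wait_and_fail", bash_command="sleep 60 && exit 1")
    wrap_up = BashOperator(
        task_id="wait_and_fire_when_all_upstream_done",
        bash_command="echo 'all parents finished, in whatever state'",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    [quick_success, wait_and_fail] >> wrap_up
```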

One Success

Definition: fires as soon as at least one parent succeeds; it does not wait for all parents to be done

[Figures: One Success waiting | One Success completed]
  • In this example, task 6 is a time-consuming task, and wait_for_task_6_to_complete waits until task 6's execution completes.
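A sketch of the contrast (assumed task ids): the one_success task fires as soon as the fast parent succeeds, while a sibling with the default rule keeps waiting for the slow task_6.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="one_success_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    fast_task = BashOperator(task_id="fast_task", bash_command="echo done")
    task_6 = BashOperator(task_id="task_6", bash_command="sleep 300")

    fire_early = BashOperator(
        task_id="fire_on_first_success",
        bash_command="echo 'one parent succeeded, not waiting for the rest'",
        trigger_rule=TriggerRule.ONE_SUCCESS,
    )
    wait_for_task_6_to_complete = BashOperator(
        task_id="wait_for_task_6_to_complete",
        bash_command="echo 'all parents succeeded'",  # default all_success
    )

    [fast_task, task_6] >> fire_early
    [fast_task, task_6] >> wait_for_task_6_to_complete
```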

None Failed

Definition: all parents have not failed (failed or upstream_failed), i.e. all parents have succeeded or been skipped

  • As the name indicates, all downstream tasks of task 1, task 2, and task 3 were skipped because task 2 failed

[Figure: None Failed]
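A sketch with hypothetical task ids: the none_failed task tolerates skipped parents but not failed ones, so with task_2 failing it will not run.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="none_failed_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task_1 = BashOperator(task_id="task_1", bash_command="echo 1")
    task_2 = BashOperator(task_id="task_2", bash_command="exit 1")  # fails
    task_3 = BashOperator(task_id="task_3", bash_command="echo 3")

    downstream = BashOperator(
        task_id="run_only_if_no_parent_failed",
        bash_command="echo 'no parent failed'",
        trigger_rule=TriggerRule.NONE_FAILED,
    )

    [task_1, task_2, task_3] >> downstream
```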

None Skipped

Definition: no parent is in a skipped state, i.e. all parents are in a success, failed, or upstream_failed state

[Figures: One Success waiting | None Skipped completed]
  • To demonstrate None Skipped, I took the One Success sample as a reference. Ideally, all downstream tasks should be skipped if one upstream task is skipped. Then how did the dummy_7 task in the first picture fire, resulting in the execution of its downstream tasks? Detailed answers are in the next section.
  • The default behavior: if one upstream task is skipped, then all of its downstream tasks will be skipped. Refer to the second picture (the right-side one).
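A sketch of the rule itself (illustrative names): the none_skipped task runs as long as no parent ended up skipped; success and failure are both acceptable parent states.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="none_skipped_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    succeeding_task = BashOperator(task_id="succeeding_task", bash_command="echo ok")
    failing_task = BashOperator(task_id="failing_task", bash_command="exit 1")

    downstream = BashOperator(
        task_id="run_if_no_parent_was_skipped",
        bash_command="echo 'no parent was skipped'",
        trigger_rule=TriggerRule.NONE_SKIPPED,
    )

    [succeeding_task, failing_task] >> downstream
```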

Dummy

Definition: dependencies are just for show, trigger at will

[Figure: Dummy]

  • Using the dummy trigger rule, we can kick-start execution from the middle of the DAG. This capability is very useful when we want parallel execution, where a downstream block of tasks does not depend on the upstream ones.
  • The no_impact_of_upstream task fires at will and subsequently executes its downstream tasks.
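A sketch with assumed task ids (no_impact_of_upstream mirrors the figure): the task with the dummy trigger rule ignores the state of its parents and runs as soon as it is scheduled, effectively starting a parallel block from the middle of the graph.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="dummy_rule_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    long_upstream = BashOperator(task_id="long_upstream", bash_command="sleep 300")

    no_impact_of_upstream = BashOperator(
        task_id="no_impact_of_upstream",
        bash_command="echo 'firing regardless of upstream state'",
        trigger_rule=TriggerRule.DUMMY,  # "dummy"; later Airflow releases call this "always"
    )
    downstream = BashOperator(task_id="downstream", bash_command="echo downstream")

    long_upstream >> no_impact_of_upstream >> downstream
```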

Epilogue

Hope you all enjoyed reading this post. I tried my best to present how ignorant I was when it came to data during my initial years, and the subsequent journey. One important lesson I learned and wish to convey is "never stick with one technology or technology stack". Things are moving faster than ever, and better things are coming out in the tech world every day. It may be a bit hard to manage many stacks in the beginning, but over time we will be able to find a pattern among them all and maneuver through the challenges easily.

Reference

Source Code

Source code for all the dags explained in this post can be found in this repo

  • The Airflow installation and configuration process is extremely tedious; I made sure you do not have to undergo that pain.
  • A docker-compose file is provided in the repo. Just run the command specified in the README.md to start the Airflow server.