Airflow Trigger Rules for Building Complex Data Pipelines Explained, and My Initial Days of Airflow Selection and Experience
Posted May 1, 2022 by Gowri Shankar ‐ 9 min read
Dell acquiring Boomi(circa 2010) was a big topic of discussion among my peers then, I was just start shifting my career from developing system software, device driver development to building distributed IT products at enterprise scale. I was so ignorant and questioned, 'why would someone pay so much for a piece of code that connects systems and schedules events'. I argued that those data pipeline processes can easily built in-house rather than depending on an external product. To understand the value of an integration platform or a workflow management system - one should strive for excellence in maintaining and serving reliable data at large scale. Building in-house data-pipelines, using Pentaho Kettle at enterprise scale to enjoying the flexibility of Apache Airflow is one of the most significant parts of my data journey.
In this post, we shall explore the challenges involved in managing data, people issues, conventional approaches that can be improved without much effort and a focus on Trigger rules of Apache Airflow. This post falls under a new topic
Data Engineering(at scale).
I thank Marc Lamberti for his guide to Apache Airflow, this post is just an attempt to complete what he had started in his blog.
The objective of this post is to explore a few obvious challenges of designing and deploying data engineering pipelines with a specific focus on trigger rules of Apache Airflow 2.0. In the second section, we shall study the 10 different branching strategies that Airflow provides to build complex data pipelines.
A robust data engineering framework cannot be deployed without using a sophisticated workflow management tool, I was using Pentaho Kettle extensively for large-scale deployments for a significant period of my career. When I say large scale, I meant significantly large but not of the size of Social Media platforms' order. i.e The most voluminous data transfer was around 25-30 million records at the frequency of 30 minutes with a promise of 100% data integrity for an F500 company. It worked but not without problems, we had a rough journey, we paid hefty prices in the process but eventually succeeded.
The major problems I encountered are the following,
- Workflow management tools that are popularly known as ETLs are usually graphical tools where the data engineer drags and drops actions and tasks in a closed environment.
- They usually had connectors that made things easy but not extensible for a custom logic or complex algorithms
- Reusability is practically impossible where we have to make copies for every new deployment and
- Last but not least, licensing them is almost always obscure. Those ETL tools came with a conditional free version where the best-of-the-class attorneys fail to understand the conditions.
Small companies aspire for acquiring major enterprises in their customer list often trivialize the technology problem and treat them as people issues because 0. Creating a wow factor is the primary, secondary, and tertiary concern to acquiring new labels, and seldom focusing on the nuances of technology to achieve excellence in delivery
- Low or no funding to invest in tools(and right people) to make life easy for every stakeholder including the paying customer, e.g ETLs, Data Experts, etc
- Assuming software engineers can solve everything and there is no need for a data engineering speciality among the workforce.
- Delivering reports and analytics from OLTP databases is common even among the large corporations - Significant numbers of companies fail to deploy a Hadoop system or OLTP to OLAP scheme despite having huge funds because of the issues #1 and #2.
- Further, the legacy systems make it almost impossible for the IT team to even simplify the periodical backup(data) process.
- Stored procedures are the de-facto choice when it comes to data munging, Pls get rid of them.
Those issues are sufficient for a caring data engineer to keep himself awake for almost 16 hours a day. If he has no supporting DevOps and infrastructure team around, that 16 hours shoot north to 18 hours or may be more. Yet our system rand successfully to cater the needs of my ever-demanding customers. Deep down in my heart I know if not now, the next customer deployment - i.e larger than the current one is designed to fail.
Technology Selection: Kettle vs AWS Glue vs Airflow
The following are the 3 critical reasons for undergoing a technology transfer,
- Single Instance, Our ETL system was running on a single EC2 node and we are vertically scaling as and when the need arose
- We couldn’t decipher the licensing limitations of the free version and
- There was no sufficient funding for moving to an enterprise edition
I thought It is just a technology transfer operation and did not fathom a new situation on the way to hit me hard that I will never recover from. The following are the technologies/tools I picked for initial study,
- Apache Airflow
- Pentaho Kettle
- AWS Glue
- Apache Beam
- It was easy for me to reject Talend, one more licensing nightmare.
- There was a significant buy-in from the stakeholders for AWS Glue because it has the brand Amazon to resell to the customers. Amazon Stickiness is an issue for my leadership team, they are in perpetual fear of Amazon that was absolute stupidity. I took advantage of that fear to reject AWS Glue.
- Beam vs Airflow was a difficult one because I had opposition from almost everyone for both - People still(Circa 2018) fear open source technologies in the enterprise world.
- Never did I imagine, I will end up justifying Airflow over Pentaho Kettle because it wasn’t just a technology transfer but an org transformation.
- We built a Pentaho Kettle workforce over a while, there was reluctance from the business for a new tech stack
- Upskilling/reskilling the Kettle workforce without an extra investment was impossible
- Learning Python is easy - only for those who are willing to learn and believe in lifelong learning is the only way forward to excellence
Yet, I Picked Airflow and I am Glad That I Did
Despite knowing the fact that user adoption is a challenge, I picked Apache Airflow 1.0(circa 2018-2020) and built a sophisticated, reusable, and robust pipeline that can be deployed in a distributed environment. I went with my gut feel, trusted my instincts, and believed that beyond a certain point there isn’t a need to have buy-in from everyone but the self.
Rather than brood over the hardships I faced, let us study the most awesome Airflow focusing on its triggering schemes of it in the next section.
Apache Airflow 2.0 - Few Thoughts
With version 2.0 out there, Airflow had come a long way since my last attempt. Some of the things that caught my attention,
- An elegant user interface,
- decorators for creating dags, tasks and,
- improved and highly available schedulers etc.
However, the key feature I am excited about is the effort that went into making the airflow ecosystem modular. Following are the key contributors towards achieving a clean and elegant dag,
- Task Groups
- Separation of Airflow Core and Airflow Providers There is a talk that sub-dags are about to get deprecated in the forthcoming releases. To be frank sub-dags are a bit painful to debug/maintain and when things go wrong, sub-dags make them go truly wrong.
Branching the DAG flow is a critical part of building complex workflows. Often the branching schemes result in a state of confusion. In this section, we shall study 10 different branching schemes in Apache Airflow 2.0.
All Success (default)
Definition: all parents have succeeded
|All Success: Default||All Success: Failure|
- All Success(Default): This is the default behavior where both the upstream tasks are expected to succeed to fire
- All Failure: Task 5 failed to result in skipping the subsequent downstream tasks
Definition: all parents are in a failed or upstream_failed state
task 1 and 2failed to result in the execution of the downstream task
- Subsequently, task 3 succeeded to skip the downstream task
Definition: fires as soon as at least one parent has failed, it does not wait for all parents to be done
- This rule expects at least one upstream task to fail and that is demonstrated in the first level where
task 2failed to fire the
fire since 2 failed.
- In the level 2,
task 6success skipped
- At level 3, though all the tasks are not completed i.e.
wait_and_failtask was still running - A single failure result in firing the final task.
Definition: all parents are done with their execution
|All Done: Waiting||All Done: Completed|
- The key capability of all done compared to the previous example we studied is that the execution waits until all upstream tasks are completed.
- In the first illustration, the last task
wait_and_fire_when_all_upstream_doneis waiting because
wait_and_failtask is yet to complete.
fires as soon as at least one parent succeeds, it does not wait for all parents to be done
|One Success: Waiting||One Success: Completed|
- In this example,
task 6is a time consuming tasks and
task 6execution completes.
all parents have not failed (failed or upstream_failed) i.e. all parents have succeeded or been skipped
- As the name indicates, all downstream tasks of
task 2, and
task 3skipped because
task 2is failed
no parent is in a skipped state, i.e. all parents are in a success, failed, or upstream_failed state
|One Success: Waiting||None Skipped: Completed|
- To demonstrate
None Skipped, I took reference from
One Successsample - Ideally, all downstream should be skipped if one upstream task is skipped. Then how
dummy_7task in the first pic fire to result in subsequent downstream tasks execution? Detailed answers are in the next section.
- The default behavior - if one upstream task is skipped then all its downstream tasks will be skipped. Refer: Pic 2(right side one)
dependencies are just for show, trigger at will
- Using a
dummytask we can kick start the execution from the middle. This capability is very useful if we want parallel executions where the downstream block of tasks is not depend on the upstream ones.
no_impact_of_upstreamtask fires at will to subsequently executes its downstream ones.
Hope you all enjoyed reading this post. I tried my best to present how ignorant I was when it comes to data during the initial years and the subsequent journey. One important lesson I learned and wish to convey is “never stick with one technology or technology stack”. Things are moving faster than ever and there are better things in the tech world coming out every day. It may be a bit hard to manage many stacks in the beginning but over some time - we will be able to find a pattern among them all and maneuver through the challenges easily.
- Airflow Trigger Rules: All you need to know! a blog by Mark Lemberti
- Apache Airflow Documentation from Apache Airflow
- Airflow installation and configuration process is extremely tedious, I made sure you do not require to undergo that pain.
- A docker compose file is provided in the repo. Just run the command specified in the README.md to start the airflow server: