Deploy a pipeline with Airflow and Google Composer
Before you can deploy a pipeline, you will need to install dlt and create a pipeline.
While this walkthrough deals specifically with Google Composer, it generates DAGs and configuration files that you can use on any Airflow deployment. DAGs are generated using the dlt Airflow helper, which maps dlt resources to Airflow tasks and provides a clean working environment, a retry mechanism, metrics, and logging via Airflow loggers.
If you want to explore other ways to run dlt with Airflow, such as using PythonOperator, PythonVirtualenvOperator, KubernetesPodOperator, or external services like Cloud Run, check out this guide by Francesco Mucio. It explains the trade-offs of each approach and helps you choose the right one for your setup.
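For orientation only, here is a rough sketch of the plain PythonOperator route mentioned above, without the dlt Airflow helper; every module, name, schedule, and destination in it is a placeholder rather than something this walkthrough generates:

    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_dlt_pipeline():
        # Import inside the callable so the Airflow worker resolves dlt and the
        # source at run time.
        import dlt
        from my_source import my_source  # placeholder: your dlt source module

        pipeline = dlt.pipeline(
            pipeline_name="my_pipeline",   # placeholder name
            destination="bigquery",        # placeholder destination
            dataset_name="my_dataset",
        )
        pipeline.run(my_source())

    with DAG(
        dag_id="run_dlt_with_python_operator",
        schedule=None,  # Airflow 2.4+ spelling; older versions use schedule_interval
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ):
        PythonOperator(task_id="load", python_callable=run_dlt_pipeline)

The rest of this walkthrough uses the dlt Airflow helper instead, which handles working directories, retries, and logging for you.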
1. Add your dlt project directory to GitHub
You will need a GitHub repository for your project. If you don't have one yet, you need to
initialize a Git repository in your dlt project directory and push it to GitHub as described in
Adding locally hosted code to GitHub.
2. Ensure your pipeline works
Before you can deploy, you must run your pipeline locally at least once.
python3 {pipeline_name}_pipeline.py
This should successfully load data from the source to the destination once and allow dlt to gather the information required for deployment.
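If you want a mental model of what such a script contains, here is a minimal, self-contained sketch; it is not the script that dlt init generates for you, and the resource, pipeline name, and destination below are placeholders:

    # {pipeline_name}_pipeline.py -- minimal sketch, not the generated script.
    import dlt

    @dlt.resource(name="items", write_disposition="append")
    def items():
        yield from [{"id": 1}, {"id": 2}]  # placeholder data

    if __name__ == "__main__":
        pipeline = dlt.pipeline(
            pipeline_name="my_pipeline",   # placeholder name
            destination="duckdb",          # placeholder destination for a local test run
            dataset_name="my_dataset",
        )
        load_info = pipeline.run(items())
        print(load_info)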
3. Initialize deployment
First, you need to add additional dependencies that the deploy command requires:
pip install "dlt[cli]"
then:
dlt deploy {pipeline_name}_pipeline.py airflow-composer
This command checks if your pipeline has run successfully before and creates the following folders:
- build
  - This folder contains a file called cloudbuild.yaml with a simple configuration for cloud deployment. We will use it below.
- dags
  - This folder contains the Python script dag_{pipeline_name}.py, which is an example of a simple serialized DAG using the Airflow PipelineTasksGroup wrapper.
  - Note: This folder is only needed to store DAG scripts, but it is not the Airflow dags_folder. Please refer to the Troubleshooting section for more information.
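To give you an idea of what the generated script contains, here is a rough sketch of a DAG built around PipelineTasksGroup; the schedule, names, source import, and destination are placeholders, so check the actual dag_{pipeline_name}.py that the deploy command wrote for you:

    import dlt
    from airflow.decorators import dag
    from dlt.common import pendulum
    from dlt.helpers.airflow_helper import PipelineTasksGroup

    default_task_args = {
        "owner": "airflow",
        "depends_on_past": False,
        "retries": 0,
    }

    @dag(
        schedule=None,  # placeholder; Airflow 2.4+ spelling
        start_date=pendulum.datetime(2024, 1, 1),
        catchup=False,
        max_active_runs=1,
        default_args=default_task_args,
    )
    def load_data():
        # The task group gives the run an isolated working directory and routes
        # dlt logs and retries through Airflow.
        tasks = PipelineTasksGroup("my_pipeline", use_data_folder=False, wipe_local_data=True)

        # Import your dlt source here (placeholder module and function names).
        from my_source import my_source

        pipeline = dlt.pipeline(
            pipeline_name="my_pipeline",   # placeholder name
            dataset_name="my_dataset",
            destination="bigquery",        # placeholder destination
        )
        tasks.add_run(pipeline, my_source(), decompose="serialize", trigger_rule="all_done", retries=0)

    load_data()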
By default, the dlt deploy command shows you the deployment secrets as a TOML fragment intended for the dlt_secrets_toml Airflow variable; pass --secrets-format env (shown below) to get them as environment variables instead.
Example with the pipedrive pipeline
1. Run the deploy command
dlt deploy pipedrive_pipeline.py airflow-composer
where pipedrive_pipeline.py is the pipeline script that you just ran and airflow-composer is the deployment method. The command will create the deployment files and print instructions for setting up the credentials.
Your airflow-composer deployment for the pipedrive pipeline is ready!
* The airflow cloudbuild.yaml file was created in the build directory.
* The dag_pipedrive.py script was created in the dags directory.
You must prepare your repository first:
1. Import your sources in dag_pipedrive.py and change default_task_args if necessary.
2. Run the airflow pipeline locally.
See Airflow getting started: https://airflow.apache.org/docs/apache-airflow/stable/start.html
If you are planning to run the pipeline with Google Cloud Composer, follow the next instructions:
1. Read this doc and set up the environment: https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer
2. Set _BUCKET_NAME up in the build/cloudbuild.yaml file.
3. Add the following toml-string to the Airflow UI as the dlt_secrets_toml variable.
[sources.pipedrive]
pipedrive_api_key = "c66..."
The deploy command will use an Airflow variable called dlt_secrets_toml to store all the required secrets as a TOML fragment. You can also use environment variables by passing the --secrets-format env option:
dlt deploy pipedrive_pipeline.py airflow-composer --secrets-format env
which will output the environment variable names and their values.
3. Add the following secret values (typically stored in ./.dlt/secrets.toml):
SOURCES__PIPEDRIVE__PIPEDRIVE_API_KEY
in ENVIRONMENT VARIABLES using Google Composer UI
Name: SOURCES__PIPEDRIVE__PIPEDRIVE_API_KEY
Secret: c66c..
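Whichever format you choose, the values mirror what you keep locally in .dlt/secrets.toml, and dlt resolves them through the same configuration lookup. As a quick sanity check you can resolve the key explicitly from a local Python session; this is a sketch with a placeholder value, not part of the deployment itself:

    # dlt resolves sources.pipedrive.pipedrive_api_key from the
    # SOURCES__PIPEDRIVE__PIPEDRIVE_API_KEY environment variable, from
    # .dlt/secrets.toml, or (inside Airflow) from the dlt_secrets_toml variable.
    import os

    os.environ.setdefault("SOURCES__PIPEDRIVE__PIPEDRIVE_API_KEY", "placeholder-key")

    import dlt

    print(dlt.secrets["sources.pipedrive.pipedrive_api_key"])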