.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. contents:: :local:
Contributions
=============
Contributions are welcome and are greatly appreciated! Every little bit helps,
and credit will always be given.
This document explains how to contribute, even if you have never contributed to
any Open Source project. It will also help people who have contributed to other projects learn about the
rules of this community.
If you are a new contributor, please follow the `Contributors Quick Start <https://github.com/apache/airflow/blob/main
/CONTRIBUTORS_QUICK_START.rst>`__ guide to get a gentle step-by-step introduction to setting up the development
environment and making your first contribution.
Get Mentoring Support
---------------------
If you are new to the project, you might need some help in understanding how the dynamics
of the community work, and you might need some mentorship from other members of the
community - mostly committers. Mentoring new members of the community is part of the committers'
job, so do not be afraid of asking committers to help you. You can do it
via comments in your Pull Request, by asking on the devlist or via Slack. For your convenience,
we have a dedicated #newbie-questions Slack channel where you can ask any questions
you want - it's a safe space where it is expected that people asking questions do not know
a lot about Airflow (yet!).
If you are looking for a more structured mentoring experience, you can apply to the Apache Software Foundation's
`Official Mentoring Programme <http://community.apache.org/mentoringprogramme.html>`_. Feel free
to apply to the programme and follow up with the community.
Report Bugs
-----------
Report bugs through `GitHub <https://github.com/apache/airflow/issues>`__.
Please report relevant information and preferably code that exhibits the
problem.
Fix Bugs
--------
Look through the GitHub issues for bugs. Anything is open to whoever wants to
implement it.
Issue reporting and resolution process
--------------------------------------
The Apache Airflow project uses a set of labels for tracking and triaging issues, as
well as a set of priorities and milestones to track how and when the enhancements and bug
fixes make it into an Airflow release. This is documented as part of
the `Issue reporting and resolution process <ISSUE_TRIAGE_PROCESS.rst>`_.
Implement Features
------------------
Look through the `GitHub issues labeled "kind:feature"
<https://github.com/apache/airflow/labels/kind%3Afeature>`__ for features.
Any unassigned feature request issue is open to whoever wants to implement it.
We've created the operators, hooks, macros and executors we needed, but we've
made sure that this part of Airflow is extensible. New operators, hooks, macros
and executors are very welcome!
Improve Documentation
---------------------
Airflow could always use better documentation, whether as part of the official
Airflow docs, in docstrings, ``docs/*.rst`` or even on the web as blog posts or
articles.
Submit Feedback
---------------
The best way to send feedback is to `open an issue on GitHub <https://github.com/apache/airflow/issues/new/choose>`__.
If you are proposing a new feature:
- Explain in detail how it would work.
- Keep the scope as narrow as possible to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions are
welcome :)
Roles
=============
There are several roles within the Airflow Open-Source community.
For detailed information for each role, see: `Committers and PMC's <./COMMITTERS.rst>`__.
The PMC (Project Management Committee) is a group of maintainers that drives changes in the way that
Airflow is managed as a project.
Within Apache, the role of the PMC is primarily to ensure that Airflow conforms to Apache's processes
and guidelines.
Committers/Maintainers
----------------------
Committers are community members that have write access to the project’s repositories, i.e., they can modify the code,
documentation, and website by themselves and also accept other contributions.
The official list of committers can be found `here <https://airflow.apache.org/docs/apache-airflow/stable/project.html#committers>`__.
Additionally, committers are listed in a few other places (some of these may only be visible to existing committers):
* https://whimsy.apache.org/roster/committee/airflow
* https://github.com/orgs/apache/teams/airflow-committers/members
Committers are responsible for:
* Championing one or more items on the `Roadmap <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home>`__
* Reviewing & Merging Pull-Requests
* Responding to questions on the dev mailing list (dev@airflow.apache.org)
Contributors
------------
A contributor is anyone who wants to contribute code, documentation, tests, ideas, or anything to the
Apache Airflow project.
Contributors are responsible for:
* Fixing bugs
* Adding features
* Championing one or more items on the `Roadmap <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home>`__.
Contribution Workflow
=====================
Typically, you start your first contribution by reviewing open tickets
at `GitHub issues <https://github.com/apache/airflow/issues>`__.
If you create a pull request, you don't have to create an issue first, but if you want, you can do it.
Creating an issue will allow you to collect feedback or share plans with other people.
For example, you want to have the following sample ticket assigned to you:
`#7782: Add extra CC: to the emails sent by Airflow <https://github.com/apache/airflow/issues/7782>`_.
In general, your contribution includes the following stages:
.. image:: images/workflow.png
:align: center
:alt: Contribution Workflow
1. Make your own `fork <https://help.github.com/en/github/getting-started-with-github/fork-a-repo>`__ of
the Apache Airflow `main repository <https://github.com/apache/airflow>`__.
2. Create a `local virtualenv <LOCAL_VIRTUALENV.rst>`_,
initialize the `Breeze environment <BREEZE.rst>`__, and
install `pre-commit framework <STATIC_CODE_CHECKS.rst#pre-commit-hooks>`__.
If you want to add more changes in the future, set up your fork and enable GitHub Actions.
3. Join `devlist <https://lists.apache.org/list.html?dev@airflow.apache.org>`__
and set up a `Slack account <https://s.apache.org/airflow-slack>`__.
4. Make the change and create a `Pull Request from your fork <https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request-from-a-fork>`__.
5. Ping @ #development slack, comment @people. Be annoying. Be considerate.
Step 1: Fork the Apache Airflow Repo
------------------------------------
From the `apache/airflow <https://github.com/apache/airflow>`_ repo,
`create a fork <https://help.github.com/en/github/getting-started-with-github/fork-a-repo>`_:
.. image:: images/fork.png
:align: center
:alt: Creating a fork
Step 2: Configure Your Environment
----------------------------------
You can use either a local virtual env or a Docker-based env. The differences
between the two are explained `here <https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#development-environments/>`__.
The local env's instructions can be found in full in the `LOCAL_VIRTUALENV.rst`_ file.
.. _LOCAL_VIRTUALENV.rst:
https://github.com/apache/airflow/blob/main/LOCAL_VIRTUALENV.rst
The Docker env is here to maintain a consistent and common development environment so that you can replicate CI failures locally and work on solving them locally rather than by pushing to CI.
You can configure the Docker-based Breeze development environment as follows:
1. Install the latest versions of the `Docker Community Edition`_ and `Docker Compose`_ and add them to the PATH.
.. _Docker Community Edition:
https://github.com/apache/airflow/blob/main/BREEZE.rst#docker-community-edition
.. _Docker Compose: https://github.com/apache/airflow/blob/main/BREEZE.rst#docker-compose
2. Install `jq`_ on your machine. The exact command depends on the operating system (or Linux distribution) you use.
.. _jq: https://stedolan.github.io/jq/
For example, on Ubuntu:
.. code-block:: bash

   sudo apt install jq

or on macOS with `Homebrew <https://formulae.brew.sh/formula/jq>`_
.. code-block:: bash
brew install jq
3. Enter Breeze, and run the following in the Airflow source code directory:
.. code-block:: bash

   ./breeze

Breeze starts with downloading the Airflow CI image from
the Docker Hub and installing all required dependencies.
This will enter the Docker environment and mount your local sources
to make them immediately visible in the environment.
4. Create a local virtualenv, for example:
.. code-block:: bash
mkvirtualenv myenv --python=python3.6
5. Initialize the created local virtualenv:
.. code-block:: bash
./breeze initialize-local-virtualenv --python 3.6
6. Open your IDE (for example, PyCharm) and select the virtualenv you created
as the project's default virtualenv in your IDE.
Step 3: Connect with People
---------------------------
For effective collaboration, make sure to join the following Airflow groups:
- Mailing lists:
- Developer’s mailing list `<dev-subscribe@airflow.apache.org>`_
(quite substantial traffic on this list)
- All commits mailing list: `<commits-subscribe@airflow.apache.org>`_
(very high traffic on this list)
- Airflow users mailing list: `<users-subscribe@airflow.apache.org>`_
(reasonably small traffic on this list)
- `Issues on GitHub <https://github.com/apache/airflow/issues>`__
- `Slack (chat) <https://s.apache.org/airflow-slack>`__
Step 4: Prepare PR
------------------
1. Update the local sources to address the issue.
For example, to address this example issue, do the following:
* Read about `email configuration in Airflow </docs/apache-airflow/howto/email-config.rst>`__.
* Find the class you should modify. For the example GitHub issue,
this is `email.py <https://github.com/apache/airflow/blob/main/airflow/utils/email.py>`__.
* Find the test class where you should add tests. For the example ticket,
this is `test_email.py <https://github.com/apache/airflow/blob/main/tests/utils/test_email.py>`__.
* Make sure your fork's main is synced with Apache Airflow's main before you create a branch. See
`How to sync your fork <#how-to-sync-your-fork>`_ for details.
* Create a local branch for your development. Make sure to use latest
``apache/main`` as base for the branch. See `How to Rebase PR <#how-to-rebase-pr>`_ for some details
on setting up the ``apache`` remote. Note, some people develop their changes directly in their own
``main`` branches - this is OK and you can make a PR from your main to ``apache/main``, but we
recommend always creating a local branch for your development. This allows you to easily compare
changes, work on several changes at the same time, and much more.
If you have ``apache`` set as remote then you can make sure that you have latest changes in your main
by ``git pull apache main`` when you are in the local ``main`` branch. If you have conflicts and
want to override your locally changed main you can override your local changes with
``git fetch apache; git reset --hard apache/main``.
* Modify the class and add necessary code and unit tests.
* Run the unit tests from the `IDE <TESTING.rst#running-unit-tests-from-ide>`__
or `local virtualenv <TESTING.rst#running-unit-tests-from-local-virtualenv>`__ as you see fit.
* Run the tests in `Breeze <TESTING.rst#running-unit-tests-inside-breeze>`__.
* Run and fix all the `static checks <STATIC_CODE_CHECKS.rst>`__. If you have
`pre-commits installed <STATIC_CODE_CHECKS.rst#pre-commit-hooks>`__,
this step is automatically run while you are committing your code. If not, you can do it manually
via ``git add`` and then ``pre-commit run``.
2. Rebase your fork, squash commits, and resolve all conflicts. See `How to rebase PR <#how-to-rebase-pr>`_
if you need help with rebasing your change. Remember to rebase often if your PR takes a lot of time to
review/fix. This will make rebase process much easier and less painful and the more often you do it,
the more comfortable you will feel doing it.
3. Re-run the static code checks.
4. Make sure your commit has a good title and description of the context of your change, enough
for the committer reviewing it to understand why you are proposing a change. Make sure to follow other
PR guidelines described in `pull request guidelines <#pull-request-guidelines>`_.
Create Pull Request! Make yourself ready for the discussion!
5. Depending on the "scope" of your changes, your Pull Request might go through one of a few paths after approval.
We run a non-standard workflow with a high degree of automation that allows us to optimize the usage
of queue slots in GitHub Actions. Our automated workflows determine the "scope" of changes in your PR
and send it through the right path:
* In case of a "no-code" change, approval will generate a comment that the PR can be merged and no
tests are needed. This is usually when the change modifies some non-documentation related RST
files (such as this file). No python tests are run and no CI images are built for such a PR. Usually
it can be approved and merged a few minutes after it is submitted (unless there is a big queue of jobs).
* In case of a change involving python code changes or documentation changes, a subset of the full test matrix
will be executed. This subset performs the relevant tests for a single combination of python and backend
version and only builds one CI image and one PROD image. Here the scope of tests depends on the
scope of your changes:
* when your change does not change the "core" of Airflow (Providers, CLI, WWW, Helm Chart) you will get the
comment that the PR is likely ok to be merged without running the "full matrix" of tests. However, the decision
on that is left to the committer who approves your change. The committer might set a "full tests needed"
label for your PR and ask you to rebase your request or re-run all jobs. PRs with "full tests needed"
run the full matrix of tests.
* when your change changes the "core" of Airflow you will get the comment that the PR needs full tests and
the "full tests needed" label is set for your PR. An additional check is set that prevents
accidental merging of the request until the full matrix of tests succeeds for the PR.
More details about the PR workflow can be found in `PULL_REQUEST_WORKFLOW.rst <PULL_REQUEST_WORKFLOW.rst>`_.
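
The branch preparation from item 1 above can be sketched as a pair of shell helpers. This is a minimal
sketch rather than an official script: it assumes a remote named ``apache`` is already configured
(see `How to Rebase PR <#how-to-rebase-pr>`_), and the function names and the branch name in the
usage line are hypothetical.

.. code-block:: bash

    # Fast-forward the local main branch to the latest apache/main.
    # --ff-only refuses to create a merge commit if main has diverged;
    # in that case "git fetch apache; git reset --hard apache/main"
    # (which discards local changes on main) is the fallback named above.
    sync_main_with_apache() {
        git checkout main &&
        git pull --ff-only apache main
    }

    # Start a development branch on top of the freshly synced main.
    # $1 is a branch name of your choice.
    start_feature_branch() {
        sync_main_with_apache &&
        git checkout -b "$1"
    }

For example, ``start_feature_branch add-extra-cc-to-emails`` leaves you on a new branch based on the
latest ``apache/main``, ready for ``git add`` and ``pre-commit run``.
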
Step 5: Pass PR Review
----------------------
.. image:: images/review.png
:align: center
:alt: PR Review
Note that committers will use **Squash and Merge** instead of **Rebase and Merge**
when merging PRs and your commits will be squashed into a single commit.
You need to have a review from at least one committer (if you are a committer yourself, it has to be
another committer). Ideally you should have 2 or more committers reviewing the code that touches
the core of Airflow.
Pull Request Guidelines
=======================
Before you submit a pull request (PR) from your forked repo, check that it meets
these guidelines:
- Include tests, either as doctests, unit tests, or both, to your pull
request.
The airflow repo uses `GitHub Actions <https://help.github.com/en/actions>`__ to
run the tests and `codecov <https://codecov.io/gh/apache/airflow>`__ to track
coverage. You can set up both for free on your fork. It will help you make sure you do not
break the build with your PR and that you help increase coverage.
- Follow our project's `Coding style and best practices`_.
These are things that aren't currently enforced programmatically (either because they are too hard or just
not yet done.)
- `Rebase your fork <http://stackoverflow.com/a/7244456/1110993>`__, and resolve all conflicts.
- When merging PRs, committers will use **Squash and Merge**, which means that your PR will be merged as one commit, regardless of the number of commits in your PR. During the review cycle, you can keep a commit history for easier review, but if you need to, you can also squash all commits to reduce the maintenance burden during rebase.
- Add an `Apache License <http://www.apache.org/legal/src-headers.html>`__ header
to all new files.
If you have `pre-commit hooks <STATIC_CODE_CHECKS.rst#pre-commit-hooks>`__ enabled, they automatically add
license headers during commit.
- If your pull request adds functionality, make sure to update the docs as part
of the same PR. Doc string is often sufficient. Make sure to follow the
Sphinx compatible standards.
- Make sure your code fulfills all the
`static code checks <STATIC_CODE_CHECKS.rst#pre-commit-hooks>`__ we have in our code. The easiest way
to make sure of that is to use `pre-commit hooks <STATIC_CODE_CHECKS.rst#pre-commit-hooks>`__
- Run tests locally before opening a PR.
- You can use any supported python version to run the tests, but the best is to check
if it works for the oldest supported version (Python 3.6 currently). In rare cases
tests might fail with the oldest version when you use features that are available in newer Python
versions. For that purpose we have ``airflow.compat`` package where we keep back-ported
useful features from newer versions.
- Adhere to guidelines for commit messages described in this `article <http://chris.beams.io/posts/git-commit/>`__.
This makes the lives of those who come after you a lot easier.
Airflow Git Branches
====================
All new development in Airflow happens in the ``main`` branch. All PRs should target that branch.
We also have ``v2-*-test`` branches that are used to test the ``2.*.x`` series of Airflow and where committers
cherry-pick selected commits from the main branch.
Cherry-picking is done with the ``-x`` flag.
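
For illustration, a cherry-pick of a commit that is already merged to ``main`` might look like the sketch
below. The function name ``cherry_pick_to_test_branch`` is hypothetical, and ``v2-1-test`` in the usage
line is only one possible instance of the ``v2-*-test`` pattern. The ``-x`` flag makes Git append a
``(cherry picked from commit ...)`` line to the new commit message, so the origin of the change stays
traceable.

.. code-block:: bash

    # $1: target branch (for example v2-1-test)
    # $2: hash of the commit on main that should be cherry-picked
    cherry_pick_to_test_branch() {
        git checkout "$1" &&
        git cherry-pick -x "$2"
    }

A call such as ``cherry_pick_to_test_branch v2-1-test <commit-hash>`` stops with a conflict if the commit
does not apply cleanly; resolving it is left to the person doing the cherry-pick.
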
The ``v2-*-test`` branch might be broken at times during testing. Expect force-pushes there so
committers should coordinate between themselves on who is working on the ``v2-*-test`` branch -
usually these are developers with the release manager permissions.
The ``v2-*-stable`` branch is rather stable - there are minimal changes coming from approved PRs that
passed the tests. This means that the branch is rather, well, "stable".
Once the ``v2-*-test`` branch stabilises, the ``v2-*-stable`` branch is synchronized with ``v2-*-test``.
The ``v2-*-stable`` branches are used to release ``2.*.x`` releases.
The general approach is that cherry-picking a commit that has already had a PR and unit tests run
against main is done to ``v2-*-test`` branches, but PRs from contributors towards 2.0 should target
``v2-*-stable`` branches.
The ``v2-*-test`` branches and ``v2-*-stable`` ones are merged just before the release and that's the
The production images are released in DockerHub from:
* main branch for development
* ``2.*.*``, ``2.*.*rc*`` releases from the ``v2-*-stable`` branch when we prepare release candidates and
Development Environments
========================
There are two environments, available on Linux and macOS, that you can use to
develop Apache Airflow:
- `Local virtualenv development environment <#local-virtualenv-development-environment>`_
that supports running unit tests and can be used in your IDE.
- `Breeze Docker-based development environment <#breeze-development-environment>`_ that provides
an end-to-end CI solution with all software dependencies covered.
The table below summarizes differences between the two environments:
========================= ================================ =====================================
**Property**              **Local virtualenv**             **Breeze environment**
========================= ================================ =====================================
Test coverage             - (-) unit tests only            - (+) integration and unit tests
------------------------- -------------------------------- -------------------------------------
Setup                     - (+) automated with breeze cmd  - (+) automated with breeze cmd
------------------------- -------------------------------- -------------------------------------
Installation difficulty   - (-) depends on the OS setup    - (+) works whenever Docker works
------------------------- -------------------------------- -------------------------------------
Team synchronization      - (-) difficult to achieve       - (+) reproducible within team
------------------------- -------------------------------- -------------------------------------
Reproducing CI failures   - (-) not possible in many cases - (+) fully reproducible
------------------------- -------------------------------- -------------------------------------
Ability to update         - (-) requires manual updates    - (+) automated update via breeze cmd
------------------------- -------------------------------- -------------------------------------
Disk space and CPU usage  - (+) relatively lightweight     - (-) uses GBs of disk and many CPUs
------------------------- -------------------------------- -------------------------------------
IDE integration           - (+) straightforward            - (-) via remote debugging only
========================= ================================ =====================================
Typically, you are recommended to use both of these environments depending on your needs.
Local virtualenv Development Environment
----------------------------------------
All details about using and running local virtualenv environment for Airflow can be found
in `LOCAL_VIRTUALENV.rst <LOCAL_VIRTUALENV.rst>`__.
Benefits:
- Packages are installed locally. No container environment is required.
- You can benefit from local debugging within your IDE.
- With the virtualenv in your IDE, you can benefit from autocompletion and running tests directly from the IDE.
Limitations:
- You have to keep your dependencies and local environment consistent with
other development environments that you have on your local machine.
- You cannot run tests that require external components, such as mysql,
postgres database, hadoop, mongo, cassandra, redis, etc.
The tests in Airflow are a mixture of unit and integration tests and some of
them require these components to be set up. Local virtualenv supports only
real unit tests. Technically, to run integration tests, you can configure
and install the dependencies on your own, but it is usually complex.
Instead, you are recommended to use
`Breeze development environment <#breeze-development-environment>`__ with all required packages
pre-installed.
- You need to make sure that your local environment is consistent with other
developer environments. This often leads to a "works for me" syndrome. The
Breeze container-based solution provides a reproducible environment that is
consistent with other developers.
- You are **STRONGLY** encouraged to also install and use `pre-commit hooks <STATIC_CODE_CHECKS.rst#pre-commit-hooks>`_
for your local virtualenv development environment.
Pre-commit hooks can speed up your development cycle a lot.
Breeze Development Environment
------------------------------
All details about using and running Airflow Breeze can be found in
`BREEZE.rst <BREEZE.rst>`__.
The Airflow Breeze solution is intended to ease your local development as "*It's
a Breeze to develop Airflow*".
Benefits:
- Breeze is a complete environment that includes external components, such as
mysql database, hadoop, mongo, cassandra, redis, etc., required by some of
Airflow tests. Breeze provides a preconfigured Docker Compose environment
where all these services are available and can be used by tests
automatically.
- Breeze environment is almost the same as used in the CI automated builds.
So, if the tests run in your Breeze environment, they will work in the CI as well.
See `<CI.rst>`_ for details about Airflow CI.
Limitations:
- Breeze environment takes significant space in your local Docker cache. There
are separate environments for different Python and Airflow versions, and
each of the images takes around 3GB in total.
- Though Airflow Breeze setup is automated, it takes time. The Breeze
environment uses pre-built images from DockerHub and it takes time to
download and extract those images. Building the environment for a particular
Python version takes less than 10 minutes.
- Breeze environment runs in the background taking precious resources, such as
disk space and CPU. You can stop the environment manually after you use it
or even use a ``bare`` environment to decrease resource usage.
**NOTE:** Breeze CI images are not supposed to be used in production environments.
They are optimized for repeatability of tests, maintainability and speed of building rather
than production performance. The production images are not yet officially published.
Airflow dependencies
====================
.. note::
Only ``pip`` installation is currently officially supported.
While there are some successes with using other tools like `poetry <https://python-poetry.org/>`_ or
`pip-tools <https://pypi.org/project/pip-tools/>`_, they do not share the same workflow as
``pip`` - especially when it comes to constraint vs. requirements management.
Installing via ``Poetry`` or ``pip-tools`` is not currently supported.
If you wish to install airflow using those tools you should use the constraint files and convert
them to appropriate format and workflow that your tool requires.
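
A plain ``pip`` installation with constraints might look like the sketch below. The URL pattern follows
the constraint files published in the ``apache/airflow`` repository; the version numbers and the helper
name ``constraints_url`` are illustrative assumptions, so substitute the Airflow and Python versions you
actually use.

.. code-block:: bash

    # Build the URL of a constraints file.
    # $1: Airflow version (or "main"), $2: Python major.minor version.
    constraints_url() {
        echo "https://raw.githubusercontent.com/apache/airflow/constraints-$1/constraints-$2.txt"
    }

    # Example (versions are placeholders):
    #   pip install "apache-airflow==2.1.0" --constraint "$(constraints_url 2.1.0 3.6)"
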
Extras
------
There are a number of extras that can be specified when installing Airflow. Those
extras can be specified after the usual pip install - for example
``pip install -e .[ssh]``. For development purposes there is a ``devel`` extra that
installs all development dependencies. There is also ``devel_ci`` that installs
all dependencies needed in the CI environment.
This is the full list of those extras:
.. START EXTRAS HERE
airbyte, alibaba, all, all_dbs, amazon, apache.atlas, apache.beam, apache.cassandra, apache.drill,
apache.druid, apache.hdfs, apache.hive, apache.kylin, apache.livy, apache.pig, apache.pinot,
apache.spark, apache.sqoop, apache.webhdfs, asana, async, atlas, aws, azure, cassandra, celery,
cgroups, cloudant, cncf.kubernetes, crypto, dask, databricks, datadog, deprecated_api, devel,
devel_all, devel_ci, devel_hadoop, dingding, discord, doc, docker, druid, elasticsearch, exasol,
facebook, ftp, gcp, gcp_api, github_enterprise, google, google_auth, grpc, hashicorp, hdfs, hive,
http, imap, influxdb, jdbc, jenkins, jira, kerberos, kubernetes, ldap, leveldb, microsoft.azure,
microsoft.mssql, microsoft.psrp, microsoft.winrm, mongo, mssql, mysql, neo4j, odbc, openfaas,
opsgenie, oracle, pagerduty, pandas, papermill, password, pinot, plexus, postgres, presto, qds,
qubole, rabbitmq, redis, s3, salesforce, samba, segment, sendgrid, sentry, sftp, singularity, slack,
snowflake, spark, sqlite, ssh, statsd, tableau, telegram, trino, vertica, virtualenv, webhdfs,
winrm, yandex, zendesk
.. END EXTRAS HERE
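
Several extras from the list above can be combined in one installation by separating them with commas.
The sketch below builds the argument for ``pip install -e``; the helper name ``editable_spec`` is
hypothetical and not part of Airflow. Quoting the result matters because some shells (zsh, for example)
treat unquoted square brackets as glob patterns.

.. code-block:: bash

    # Join the given extras with commas into a spec such as ".[ssh,devel]".
    # With no arguments the plain "." (no extras) is returned.
    editable_spec() {
        local IFS=,
        if [ "$#" -eq 0 ]; then
            echo "."
        else
            echo ".[$*]"
        fi
    }

    # Run from the root of the Airflow sources:
    #   pip install -e "$(editable_spec ssh devel)"
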
Provider packages
-----------------
Airflow 2.0 is split into core and providers. They are delivered as separate packages:
* ``apache-airflow`` - core of Apache Airflow
* ``apache-airflow-providers-*`` - More than 50 provider packages to communicate with external services
In Airflow 1.10 all those providers were installed together within one single package and when you installed
airflow locally, from sources, they were also installed. In Airflow 2.0, providers are separated out,
and not packaged together with the core, unless you set ``INSTALL_PROVIDERS_FROM_SOURCES`` environment
variable to ``true``.
In Breeze - which is a development environment, ``INSTALL_PROVIDERS_FROM_SOURCES`` variable is set to true,
but you can add ``--skip-installing-airflow-providers-from-sources`` flag to Breeze to skip installing providers when
building the images.
One watch-out - providers are still always installed (or rather available) if you install airflow from
sources using ``-e`` (or ``--editable``) flag. In such case airflow is read directly from the sources
without copying airflow packages to the usual installation location, and since 'providers' folder is
in this airflow folder - the providers package is importable.
Some of the packages have cross-dependencies with other providers packages. This typically happens for
transfer operators where operators use hooks from the other providers in case they are transferring
data between the providers. The list of dependencies is maintained (automatically with pre-commits)
in the ``airflow/providers/dependencies.json``. Pre-commits are also used to generate dependencies.
The dependency list is automatically used during PyPI packages generation.
Cross-dependencies between provider packages are converted into extras - if you need functionality from
the other provider package you can install it adding [extra] after the
``apache-airflow-providers-PROVIDER`` for example:
``pip install apache-airflow-providers-google[amazon]`` in case you want to use GCP
transfer operators from Amazon ECS.
If you add a new dependency between different providers packages, it will be detected automatically during
pre-commit phase and pre-commit will fail - and add an entry in dependencies.json so that the package extra
dependencies are properly added when package is installed.
You can regenerate the whole list of provider dependencies by running this command (you need to have
``pre-commits`` installed).
.. code-block:: bash
pre-commit run build-providers-dependencies
Here is the list of packages and their extras:
.. START PACKAGE DEPENDENCIES HERE
========================== ===========================
Package                    Extras
========================== ===========================
amazon                     apache.hive,cncf.kubernetes,exasol,ftp,google,imap,mongo,mysql,salesforce,ssh
apache.druid               apache.hive
apache.hive                amazon,microsoft.mssql,mysql,presto,samba,vertica
apache.livy                http
dingding                   http
discord                    http
google                     amazon,apache.beam,apache.cassandra,cncf.kubernetes,facebook,microsoft.azure,microsoft.mssql,mysql,oracle,postgres,presto,salesforce,sftp,ssh,trino
hashicorp                  google
microsoft.azure            google,oracle,sftp
mysql                      amazon,presto,trino,vertica
postgres                   amazon
salesforce                 tableau
sftp                       ssh
slack                      http
snowflake                  slack
========================== ===========================
.. END PACKAGE DEPENDENCIES HERE
Developing community managed provider packages
----------------------------------------------
While you can develop your own providers, Apache Airflow has 60+ providers that are managed by the community.
They are part of the same repository as Apache Airflow (we use ``monorepo`` approach where different
parts of the system are developed in the same repository but then they are packaged and released separately).
All the community-managed providers are in 'airflow/providers' folder and they are all sub-packages of
'airflow.providers' package. All the providers are available as ``apache-airflow-providers-<PROVIDER_ID>``
packages.
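
To make the naming concrete, the sketch below maps a provider id, as used in the tables above, to its
package name. The function name ``provider_package_name`` is hypothetical; the mapping assumes that dots
in the provider id become dashes in the package name, as in ``apache-airflow-providers-apache-hive``.

.. code-block:: bash

    # $1: provider id such as "google" or "apache.hive"
    provider_package_name() {
        echo "apache-airflow-providers-$(echo "$1" | tr '.' '-')"
    }
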
The capabilities of the community-managed providers are the same as the third-party ones. When
the providers are installed from PyPI, they provide the entry-point containing the metadata as described
in the previous chapter. However, when they are developed locally, together with Airflow, the mechanism
of discovery of the providers is based on the ``provider.yaml`` file that is placed in the top folder of
the provider. Similarly to the entry-point metadata, the ``provider.yaml`` file is compliant with the
`json-schema specification <https://github.com/apache/airflow/blob/main/airflow/provider.yaml.schema.json>`_.
Thanks to that mechanism, you can develop community managed providers in a seamless way directly from
Airflow sources, without preparing and releasing them as packages. This is achieved by:
* When Airflow is installed locally in editable mode (``pip install -e``), the provider packages installed
  from PyPI are uninstalled and the provider discovery mechanism finds the providers in the Airflow
  sources by searching for ``provider.yaml`` files.
* When you want to install Airflow from sources, you can set the ``INSTALL_PROVIDERS_FROM_SOURCES`` variable
  to ``true``. The providers are then not installed from PyPI packages but from local sources, as part
  of the ``apache-airflow`` package. Additionally, the ``provider.yaml`` files are copied together with
  the sources, so that capabilities and names of the providers can be discovered. This mode is especially
  useful when you are developing a new provider that cannot be installed from PyPI and you want to check
  if it installs cleanly.
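As a rough illustration of the source-based discovery described above, the sketch below walks a directory
tree and derives a dotted provider id from every ``provider.yaml`` file it finds. The ``find_provider_ids``
helper is hypothetical and uses only the standard library. It is not Airflow's actual discovery code, which
also parses and validates each file.

.. code-block:: python

    from pathlib import Path
    from typing import List


    def find_provider_ids(providers_root: Path) -> List[str]:
        """Return dotted ids of the folders under ``providers_root`` that hold a provider.yaml."""
        provider_ids = []
        for yaml_path in providers_root.rglob("provider.yaml"):
            relative_folder = yaml_path.parent.relative_to(providers_root)
            if relative_folder.parts:  # Skip a provider.yaml placed directly in the root.
                provider_ids.append(".".join(relative_folder.parts))
        return sorted(provider_ids)

Called with the path of a local ``airflow/providers`` folder, such a helper would return ids in the same
dotted form as the package names in the table above, for example ``apache.hive``.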
Regardless of whether you plan to contribute your provider, when you are developing your own custom
providers you can use the above functionality to make your development easier. You can add your provider
as a sub-folder of the ``airflow.providers`` package, add the ``provider.yaml`` file and install airflow
in development mode. The capabilities of your provider will then be discovered by airflow and you will see
the provider among other providers in the ``airflow providers`` command output.
Documentation for the community managed providers
-------------------------------------------------
When you are developing a community-managed provider, you are supposed to make sure it is well tested
and documented. Part of the documentation is the ``integration`` and ``version`` information in the
``provider.yaml`` file. This information is stripped out of the provider info available at runtime;
however, it is used to automatically generate documentation for the provider.
If you have pre-commits installed, pre-commit will warn you and let you know what changes need to be
made in the ``provider.yaml`` file when you add a new Operator, Hook, Sensor or Transfer. You can
also take a look at the other ``provider.yaml`` files as examples.
A well-documented provider contains:
* index.rst with references to packages, API used and example dags
* configuration reference
* class documentation generated from PyDoc in the code
* example dags
* how-to guides
See, for example, the ``google`` provider, which has very comprehensive documentation:
* `Documentation <docs/apache-airflow-providers-google>`_
* `Example DAGs <airflow/providers/google/cloud/example_dags>`_
Example dags are part of the documentation. We use them for various purposes in
providers:
* showing real examples of how your provider classes (Operators/Sensors/Transfers) can be used
* snippets of the examples are embedded in the documentation via ``exampleinclude::`` directive
* examples are executable as system tests
Testing the community managed providers
---------------------------------------
We have high requirements when it comes to testing the community managed providers. We have to be sure
that we have enough coverage and ways to test for regressions before the community accepts such
providers.

* Unit tests have to be comprehensive and they should test for possible regressions and edge cases,
  not only the "green path"
* Integration tests where 'local' integration with a component is possible (for example tests with
  MySQL/Postgres DB/Trino/Kerberos all have integration tests which run with real, dockerized components)
* System Tests which provide end-to-end testing, usually testing together several operators, sensors,
  transfers connecting to a real external system

You can read more about our approach for tests in `TESTING.rst <TESTING.rst>`_ but here
are some highlights.
Dependency management
=====================
Airflow is not a standard python project. Most of the python projects fall into one of two types -
application or library. As described in
`this StackOverflow question <https://stackoverflow.com/questions/28509481/should-i-pin-my-python-dependencies-versions>`_,
the decision whether to pin (freeze) dependency versions for a python project depends on the type. For
applications, dependencies should be pinned, but for libraries, they should be open.
For applications, pinning the dependencies makes them more stable to install in the future, because new
(even transitive) dependencies might cause installation to fail. For libraries, the dependencies should
be open to allow several different libraries with the same requirements to be installed at the same time.
The problem is that Apache Airflow is a bit of both - application to install and library to be used when
you are developing your own operators and DAGs.
This - seemingly unsolvable - puzzle is solved by having pinned constraints files. Those are available
as of airflow 1.10.10 and were further improved with 1.10.12 (moved to separate orphan branches).
Pinned constraint files
=======================
.. note::

    Only ``pip`` installation is officially supported.

    While there are some successes with using other tools like `poetry <https://python-poetry.org/>`_ or
    `pip-tools <https://pypi.org/project/pip-tools/>`_, they do not share the same workflow as
    ``pip`` - especially when it comes to constraint vs. requirements management.
    Installing via ``Poetry`` or ``pip-tools`` is not currently supported.

    If you wish to install airflow using those tools you should use the constraint files and convert
    them to the appropriate format and workflow that your tool requires.
By default, when you install the ``apache-airflow`` package, the dependencies are as open as possible while
still allowing the package to install. This means that the ``apache-airflow`` package might fail to
install when a direct or transitive dependency is released that breaks the installation. In such a case,
when installing ``apache-airflow``, you might need to provide additional constraints (for
example ``pip install apache-airflow==1.10.2 'Werkzeug<1.0.0'``).
There are several sets of constraints we keep:
* "constraints" - those are constraints generated by matching the current airflow version from sources
  and providers that are installed from PyPI. Those are the constraints used by the users who want to
  install airflow with pip. They are named ``constraints-<PYTHON_MAJOR_MINOR_VERSION>.txt``.
* "constraints-source-providers" - those are constraints generated by using providers installed from
  current sources. While adding new providers their dependencies might change, so this set
  is the current set of the constraints for airflow and providers from the current main sources.
  Those constraints are used by the CI system to keep a "stable" set of constraints. They are named
  ``constraints-source-providers-<PYTHON_MAJOR_MINOR_VERSION>.txt``.
* "constraints-no-providers" - those are constraints generated from only Apache Airflow, without any
  providers. If you want to manage airflow separately and then add providers individually, you can
  use those. Those constraints are named ``constraints-no-providers-<PYTHON_MAJOR_MINOR_VERSION>.txt``.
We also have constraints with "source-providers" but they are used i
The first set can be used as a constraints file when installing Apache Airflow in a repeatable way.
It can be done from the sources:

.. code-block:: bash

    pip install -e . \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-3.6.txt"

or from the PyPI package:

.. code-block:: bash

    pip install apache-airflow \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-3.6.txt"
This also works with extras, for example:

.. code-block:: bash

    pip install .[ssh] \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-3.6.txt"
As of apache-airflow 1.10.12 it is also possible to use constraints directly from GitHub using a specific
tag/hash name. We tag the commits that work for a particular release with a ``constraints-<version>`` tag.
For example, the fixed, valid constraints for 1.10.12 can be used via the ``constraints-1.10.12`` tag:

.. code-block:: bash

    pip install apache-airflow[ssh]==1.10.12 \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.12/constraints-3.6.txt"
There are different sets of fixed constraint files for different python major/minor versions and you should
use the right file for the right python version.
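Since the file name depends on the branch or tag, the constraint set and the python version, it can help
to assemble the URL programmatically. The sketch below follows the naming scheme shown in the examples
above. The ``constraints_url`` helper and its parameter names are illustrative assumptions and not part
of Airflow.

.. code-block:: python

    import sys

    URL_TEMPLATE = "https://raw.githubusercontent.com/apache/airflow/{reference}/{file_name}"


    def constraints_url(reference: str = "constraints-main", python_version: str = "", variant: str = "") -> str:
        """Build the URL of a constraints file.

        reference: branch or tag name, e.g. "constraints-main" or "constraints-1.10.12".
        python_version: "MAJOR.MINOR"; defaults to the version of the running interpreter.
        variant: "" for the default set, or "no-providers" / "source-providers".
        """
        if not python_version:
            python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
        prefix = f"constraints-{variant}" if variant else "constraints"
        file_name = f"{prefix}-{python_version}.txt"
        return URL_TEMPLATE.format(reference=reference, file_name=file_name)

The returned string can be passed to ``pip install`` via the ``--constraint`` option. Defaulting to the
running interpreter's version is one way to avoid picking a file for the wrong python version.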
If you want to update just the airflow dependencies, without paying attention to providers, you can do it
using the ``-no-providers`` constraint files as well.

.. code-block:: bash

    pip install . --upgrade \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-no-providers-3.6.txt"
The ``constraints-<PYTHON_MAJOR_MINOR_VERSION>.txt`` and ``constraints-no-providers-<PYTHON_MAJOR_MINOR_VERSION>.txt``
files are automatically regenerated by a CI job every time ``setup.py`` is updated and pushed,
provided that the tests are successful.
Documentation
=============
Documentation for the ``apache-airflow`` package and other packages that are closely related to it, i.e.
provider packages, is in the ``/docs/`` directory. For detailed information on documentation development,
see `docs/README.rst <docs/README.rst>`_.
Static code checks
==================
We check our code quality via static code checks. See
`STATIC_CODE_CHECKS.rst <STATIC_CODE_CHECKS.rst>`_ for details.
Your code must pass all the static code checks in the CI in order to be eligible for Code Review.
The easiest way to make sure your code is good before pushing is to use pre-commit checks locally
as described in the static code checks documentation.
.. _coding_style:
Coding style and best practices
===============================
Most of our coding style rules are enforced programmatically by flake8 and mypy (which are run automatically
on every pull request), but there are some rules that are not yet automated and are more Airflow specific or
semantic than style.
Don't Use Asserts Outside Tests
-------------------------------
Our community agreed that, for various reasons, we do not use ``assert`` in production code of Apache Airflow.
For details check the relevant `mailing list thread <https://lists.apache.org/thread.html/bcf2d23fcd79e21b3aac9f32914e1bf656e05ffbcb8aa282af497a2d%40%3Cdev.airflow.apache.org%3E>`_.
In other words, instead of doing:

.. code-block:: python

    assert some_predicate()

you should do:

.. code-block:: python

    if not some_predicate():
        handle_the_case()
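One well-known reason, independent of the linked discussion, is that Python removes ``assert`` statements
when it runs with the ``-O`` flag, so a check written as an assert can silently disappear. Below is a
minimal sketch of the explicit form. The ``validate_retries`` function and the choice of ``ValueError``
are illustrative assumptions, not Airflow code.

.. code-block:: python

    def validate_retries(retries: int) -> int:
        """Reject negative retry counts with an explicit check instead of ``assert``."""
        if retries < 0:
            # Unlike an assert, this check still runs under ``python -O``.
            raise ValueError(f"retries must be non-negative, got {retries}")
        return retries
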
The one exception is when you need an assert for type checking (which should be almost a last resort).
In that case you can do this:

.. code-block:: python

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        assert isinstance(x, MyClass)
Database Session Handling
-------------------------
**Explicit is better than implicit.** If a function accepts a ``session`` parameter it should not commit the
transaction itself. Session management is up to the caller.
To make this easier, there is the ``create_session`` helper:
.. code-block:: python

    from sqlalchemy.orm import Session

    from airflow.utils.session import create_session


    def my_call(*args, session: Session):
        ...
        # You MUST not commit the session here.


    with create_session() as session:
        my_call(*args, session=session)
If this function is designed to be called by "end-users" (i.e. DAG authors) then using the ``@provide_session`` wrapper is okay:
.. code-block:: python

    from sqlalchemy.orm import Session

    from airflow.utils.session import NEW_SESSION, provide_session


    @provide_session
    def my_method(arg, *, session: Session = NEW_SESSION):
        ...
        # You SHOULD not commit the session here. The wrapper takes care of commit(),
        # or rollback() if an exception is raised.
In both cases, the ``session`` argument is a `keyword-only argument`_. This is the preferred form where
possible, although there are some exceptions in the code base where this cannot be used, due to backward
compatibility considerations. In most cases, the ``session`` argument should be last in the argument list.
.. _`keyword-only argument`: https://www.python.org/dev/peps/pep-3102/
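To illustrate why the callee should leave the transaction alone, here is a standard-library sketch of the
same division of responsibility. ``demo_session`` is a hypothetical stand-in built on ``sqlite3``, not
Airflow's ``create_session``. It commits once when the block succeeds and rolls back if an exception
escapes, so a function that receives the connection never needs to commit.

.. code-block:: python

    import sqlite3
    from contextlib import contextmanager


    @contextmanager
    def demo_session(database: str = ":memory:"):
        """Hypothetical stand-in for ``create_session``: the context manager owns the transaction."""
        connection = sqlite3.connect(database)
        try:
            yield connection
            connection.commit()  # Commit once, after the caller's block succeeded.
        except Exception:
            connection.rollback()  # Undo everything if the block raised.
            raise
        finally:
            connection.close()


    def add_row(value: str, *, session: sqlite3.Connection) -> None:
        """Use the session that was passed in and do not commit; that is up to the caller."""
        session.execute("INSERT INTO items (name) VALUES (?)", (value,))

Because ``add_row`` takes ``session`` as a keyword-only argument and never commits, several such calls can
be grouped into one transaction by the caller.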
Don't Use time() for Duration Calculations
------------------------------------------
If you wish to compute the time difference between two events within the same process, use
``time.monotonic()``, not ``time.time()`` or ``timezone.utcnow()``.
If you are measuring duration for performance reasons, then ``time.perf_counter()`` should be used. (On many
platforms, this uses the same underlying clock mechanism as monotonic, but ``perf_counter`` is guaranteed to be
the highest-resolution clock available on the system, while ``monotonic`` is simply "guaranteed" not to go backwards.)
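The following sketch shows both clocks in use. The helper names ``measure`` and ``wait_until`` are
hypothetical and only serve to illustrate the rule: ``perf_counter`` for measuring how long a call took,
``monotonic`` for deadlines and elapsed-time checks.

.. code-block:: python

    import time


    def measure(func, *args, **kwargs):
        """Call ``func`` and return ``(result, elapsed_seconds)`` measured with perf_counter."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed


    def wait_until(predicate, timeout: float, poll_interval: float = 0.01) -> bool:
        """Poll ``predicate`` until it is true or ``timeout`` seconds pass on the monotonic clock."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if predicate():
                return True
            time.sleep(poll_interval)
        return predicate()

Neither helper is affected if the system wall clock is adjusted while it runs, which is the main reason
to avoid ``time.time()`` here.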
If you wish to time how long a block of code takes, use ``Stats.timer()`` -- either with a metric name, which