CUSTOM AIRFLOW IMAGE
--------------------

The purpose of this repository is to provide a docker image featuring **Apache Airflow** on which you can rely to launch the whole stack (see [docker compose](compose/docker-compose.yml)).

There is also a readme covering the **Kubernetes environment** set up, as an alternative to the docker compose services. Kubernetes also gives you the ability to test the WITSML parser and the CSV parser :

[KUBERNETES SET UP](KUBE_README.md)

You will find a few details about the **Airflow Stable API** here :

[Airflow Stable API for OSDU](AIRFLOW_README.md)

Main audience : 

    - DAG developers

For more details about the stack, please refer to [Airflow docker stack](https://airflow.apache.org/docs/docker-stack/build.html).

**Important note :**

The running Airflow instance is linked to the CSP APIs. You may test your DAGs against any CSP.

# Content

## Mandatory libraries

If you take a look at the [Dockerfile](Dockerfile) you will notice the installation of the following libraries :

- [Airflow lib](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib)
- [Osdu Api](https://community.opengroup.org/osdu/platform/system/sdks/common-python-sdk)

Those libraries are mandatory.

## DAGs

Then come the ingestion DAGs, which are simply copied to the [dags](compose/dags) folder (the folder should be writable - chmod -Rf 777 dags). 

This is the place where you can add your own dags for development.

For the sake of the example, we included only one DAG and its dependencies (Osdu_ingest). You can get a more up-to-date version from the [ingestion dags repository](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/tree/master/src/osdu_dags)

## Plugins

In case you need to test some Airflow plugins, you can make the plugins folder writable (chmod -Rf 777 plugins).

## Data folder

Under the [data folder](data/) you may add/alter some payload json files for testing purposes. 

Note that you will need to trigger the dags from the airflow container directly (see below).

Make sure you provide read/write access to the data folder (eg: chmod -Rf 777 data)

## Logs folder

Logs of the workflow are being written in the logs folder, you should also make it writable (eg: chmod -Rf 777 logs).


# Build the image

    docker build . --tag osdu-airflow:0.0.1
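
Assuming the image is based on the upstream Airflow image (so the standard entrypoint is available), a quick, optional way to check the build is to print the Airflow version from the freshly built image :

    docker run --rm osdu-airflow:0.0.1 airflow version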

# Docker Compose    

Starting from the [original docker compose](https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml) setup, on which you can rely, we've modified it to use our own image (*eg : osdu-airflow:0.0.1*) plus a few environment variables, including :

    - common variables (ex: CLOUD_PROVIDER...)
    - CSP specific environment variables

Again for the sake of the example, we've included a sample [docker-compose.yml](compose/docker-compose.yml) file.
You will need to alter it to suit your needs.

Only the part at the top is usually modified :

```yaml
version: "3.7"
x-airflow-common:
  &airflow-common
  image: osdu-airflow:0.0.1
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    CI_COMMIT_TAG: v1.0
    CLOUD_PROVIDER: ###CSP NAME###
    client_id: #####
    client_secret: #####
    username: #####
    password: #####
    # Place here additional required variables - see below
  volumes:
    - ./airflow/airflow.cfg:/opt/airflow/airflow.cfg
    - ./data:/opt/airflow/data
    - ./logs:/opt/airflow/logs
```

- For instance, on IBM you will need these variables :

```yaml
    CLOUD_PROVIDER: ibm
    KEYCLOACK_URI: KEYCLOAK_AUTH_URL
    REALM_NAME: KEYCLOAK REALM
    COS_URL: URL_OF_STORAGE
    COS_ACCESS_KEY: STORAGE_ACCESS_KEY
    COS_SECRET_KEY: STORAGE_SECRET_KEY
    COS_REGION: STORAGE_REGION
    client_id: IBM_API_CLIENT_ID
    client_secret: IBM_API_CLIENT_SECRET
    username: IBM_GENERIC_USERNAME
    password: IBM_GENERIC_PASSWORD
```

- On GCP :

```yaml
    CLOUD_PROVIDER: gcp
    client_id: GCP_API_CLIENT_ID
    client_secret: GCP_API_CLIENT_SECRET
    username: GCP_GENERIC_USERNAME
    password: GCP_GENERIC_PASSWORD
    # more to come...
```

- On AWS :

```yaml
    CLOUD_PROVIDER: aws
    client_id: AWS_API_CLIENT_ID
    client_secret: AWS_API_CLIENT_SECRET
    username: AWS_GENERIC_USERNAME
    password: AWS_GENERIC_PASSWORD
    # more to come...
```

- On Azure/Microsoft :

```yaml
    CLOUD_PROVIDER: ms
    client_id: MS_API_CLIENT_ID
    client_secret: MS_API_CLIENT_SECRET
    username: MS_GENERIC_USERNAME
    password: MS_GENERIC_PASSWORD
    # more to come...
```

You may find those details in the **Airflow variables** (refer to the Airflow instance running on the pre-shipping environments), but also in the [preshipping team repository](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/tree/master/).


## Airflow Configuration file

Using the [default Airflow config](https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg), we can easily tweak the configuration (ex: launch Airflow in debug mode - *logging_level = DEBUG*). 

Make sure to fill in the proper Airflow paths (eg: */opt/airflow*).

In case of an Airflow upgrade, just replace it with the latest default configuration.
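
As an alternative to editing *airflow.cfg*, Airflow also reads its configuration from environment variables named AIRFLOW__{SECTION}__{KEY}. A minimal sketch to enable debug logging (on recent Airflow 2 versions the section is *logging*; on older ones it was *core*) :

    export AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG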


## Run the stack

From the compose folder :

    Have the data and logs folders owned by user:root (chown -Rf required)

Make sure the logs, data and dags folders are accessible with write access :

    sudo chmod -Rf 777 logs data dags

- Start the stack :

    docker-compose up
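
If you prefer to run the stack in the background and check that all services came up, standard docker-compose options can be used (a small sketch) :

    docker-compose up -d
    docker-compose ps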

- Export CSP specific Airflow variables :

From the CSP Airflow instance of your choice, head to the configuration variables listing :

![list of variables](airflow-vars.png)

Then select all the variables as follows :

![checkbox - all variables](var-select.png)

Then click on export to download your configuration in JSON format :

![export variables](var-export.png)

- Import variables into your instance :

ex : *variables_CSP.json*
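
The import can be done from the Airflow UI, or from inside the webserver/worker container with the Airflow CLI (a sketch, assuming the exported file was copied into the container) :

    airflow variables import variables_CSP.json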

- Add some other variables depending on your needs : 

    core__config__show_skipped_ids=true

The above one is nice to have for debugging purposes.
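
The same variable can also be set from the CLI inside the container (a sketch using the standard Airflow CLI) :

    airflow variables set core__config__show_skipped_ids true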

### Windows users

If you are using Windows, there is a procedure to follow in order to share volumes from docker-compose :

You need to add the volumes' paths (logs, dags, data) to the *File Sharing* settings of Docker Desktop.


# Report an issue

Please use the current repository issues board for any question/issue you may encounter.
You may also ask the community on the Ingestion DAGs Slack channel.


------------------
DAGs' specifics
------------------

# Preparation

When your stack is running, you may check the containers :

    docker ps

or (Kubernetes) :

    kubectl get pods -n airflow

Retrieve the id of the Airflow worker and get inside :

    docker exec -it #### bash

or (Kubernetes) :

    kubectl exec -it airflow-worker-0 bash -n airflow


Note : remove the -n airflow if you are not using a namespace.
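
As a convenience, you can grab the worker container id in one go (assuming the service name from the compose file contains *airflow-worker*) :

    docker ps --filter "name=airflow-worker" --format "{{.ID}}"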


# Manifest ingestion

## Trigger a Workflow

You can issue the following commands to trigger the DAG (make sure to copy the payload inside a new file payload.json within the container) :

    json=$(cat payload.json)
    
    airflow dags trigger -c "$json" Osdu_ingest -r manual_manifest_1
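
To follow the status of the triggered run from the same container, you can use the standard Airflow CLI (a small sketch) :

    airflow dags list-runs -d Osdu_ingest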


**IMPORTANT NOTE** : the run id passed with *-r* (eg: *manual_1* or *witsml_1*) should match the RUN ID you have in the payload.json sample.
But it must **also** match an existing run ID in the Workflow service of the CSP you are testing on.

In order to have a matching RUN ID in the Workflow service, you should trigger the workflow through the Workflow API with a value for the run id.
You will need to do this step just before running the above command.

Example :

    curl --location --request POST '.../api/workflow/v1/workflow/Osdu_ingest/workflowRun' \
    --header 'data-partition-id: ###' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer ###' \
    --data-raw '{
        "runId": "manual_1",
        "executionContext" : {
        "acl" : {
            ...

---------------
Troubleshooting
---------------

# osdu_api.ini missing / config_file_path not found

The code involved is the config_manager under the common_python_sdk project.
Environment variables must be filled in from the osdu_api.ini file, and the path to this file must be set via the OSDU_API_CONFIG_INI variable.

- The first task of the DAG shows the following in its log :

    configparser.Error(f"Could not find the config file in '{config_file_path}'.")
    configparser.Error: Could not find the config file in 'osdu_api.ini'.

- Solution : 

Make sure the config_file_path is properly set under the environment fields of the Airflow chart values.
However, we fill in the variables there as it is common to all CSPs for our custom setup (we rather use the variables defined in the overridden chart values).

ex : 
```yaml
  - name: OSDU_API_CONFIG_INI
    value: /opt/airflow/data/osdu_api.ini
```
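
To double-check from inside a worker (assuming the Kubernetes setup described in [KUBE_README.md](KUBE_README.md) and the path used above) :

    kubectl exec -it airflow-worker-0 -n airflow -- printenv OSDU_API_CONFIG_INI
    kubectl exec -it airflow-worker-0 -n airflow -- ls -l /opt/airflow/data/osdu_api.ini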

**Note** : the Dockerfile provided by the WITSML repository might not be up to date - please check with the CSP team. As an alternative, post an issue in the current repository (using the customized CSPWitsmlParserDockerfile).


# Windows install

The hostPath has to be modified to match your machine and the location of the dags.