Lineage and Asset Loader¶
This package creates lineage and (optionally) assets based on an input CSV file that contains source-to-target mappings.
Use Cases¶
- ELT/ETL is performed in a tool for which Atlan does not have a native connector, but source-to-target mapping information can be extracted or produced from the tool
- Lineage cannot be extracted from any tool, but it is feasible to generate the source-to-target mapping information through simple logic or manual effort
Pre-requisites¶
- An AWS S3 bucket with access provisioned via an AWS Access Key and Secret that allow access by a non-AWS application
Input¶
Each instance of the custom package workflow has one input mapping file. The file is a CSV with fields that identify/map the source and target assets to one another.
The fields in the input mapping file can be divided into 3 sets:
The precise fields that are required for Source and Target sets depend on the asset types.
Currently, the package supports Relational Database and S3 types only.
Each file can have only one source type and one target type.
Source Identifiers¶
Regardless of the source type, the following two fields are required:
- SOURCE_TYPE: Object type of the source asset that the lineage is to be created for. Must be "Table" for database type and "S3 Object" for S3 type.
- SOURCE_CONN: Qualified Name of the connection where the source assets reside or will be created
Use the following fields when the source assets are from a Relational Database.
- SOURCE_DB: Name of the database where the source table object is found
- SOURCE_SCHEMA: Name of the schema where the source table object is found
- SOURCE_TABLE: Name of the source table object
Use the following fields when the source assets are from S3.
- SOURCE_BUCKET: Name of the bucket where the S3 objects are found
- SOURCE_BUCKET_ARN: ARN of the bucket
- SOURCE_OBJECT: Name (key) of the S3 object
- SOURCE_OBJECT_ARN: ARN of the S3 object
Target Identifiers¶
Regardless of the target type, the following two fields are required:
- TARGET_TYPE: Object type of the target asset that the lineage is to be created for. Must be "Table" for database type and "S3 Object" for S3 type.
- TARGET_CONN: Qualified Name of the connection where the target assets reside or will be created
Use the following fields when the target assets are from a Relational Database.
- TARGET_DB: Name of the database where the target table object is found
- TARGET_SCHEMA: Name of the schema where the target table object is found
- TARGET_TABLE: Name of the target table object
Use the following fields when the target assets are from S3.
- TARGET_BUCKET: Name of the bucket where the S3 objects are found
- TARGET_BUCKET_ARN: ARN of the bucket
- TARGET_OBJECT: Name (key) of the S3 object
- TARGET_OBJECT_ARN: ARN of the S3 object
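The field rules above can be checked before a file is handed to the workflow. The following is a minimal sketch, not part of the package: the helper names `missing_fields` and `check_single_type` are hypothetical, and it assumes that the CSV column headers match the field names exactly and that every field listed for a type is required.

```python
# Required columns per asset type, taken from the field lists above.
# Assumption: CSV header names match the documented field names exactly,
# and every field listed for a type must be filled in.
COMMON = ("TYPE", "CONN")
BY_TYPE = {
    "Table": ("DB", "SCHEMA", "TABLE"),
    "S3 Object": ("BUCKET", "BUCKET_ARN", "OBJECT", "OBJECT_ARN"),
}


def missing_fields(row: dict, side: str) -> list:
    """Return required fields that are absent or blank for one side.

    `side` is "SOURCE" or "TARGET".
    """
    asset_type = (row.get(f"{side}_TYPE") or "").strip()
    if asset_type not in BY_TYPE:
        raise ValueError(
            f"{side}_TYPE must be one of {sorted(BY_TYPE)}, got {asset_type!r}"
        )
    required = [f"{side}_{suffix}" for suffix in COMMON + BY_TYPE[asset_type]]
    return [name for name in required if not (row.get(name) or "").strip()]


def check_single_type(rows: list) -> None:
    """Enforce the one-source-type, one-target-type rule for a whole file."""
    for side in ("SOURCE", "TARGET"):
        types = {row.get(f"{side}_TYPE") for row in rows}
        if len(types) > 1:
            raise ValueError(
                f"file mixes {side} types: {sorted(map(str, types))}"
            )
```

For a row whose SOURCE_TYPE is "Table" and whose SOURCE_TABLE is blank, `missing_fields(row, "SOURCE")` would return `["SOURCE_TABLE"]`.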
Asset creation controllers¶
- CREATE_SOURCE_IF_NOT_EXISTS: Controls whether the workflow creates the source asset referenced by the source identifier fields, or requires it to already exist independently of the workflow in order to generate the lineage. Valid values are "TRUE" and "FALSE".
- CREATE_TARGET_IF_NOT_EXISTS: Controls whether the workflow creates the target asset referenced by the target identifier fields, or requires it to already exist independently of the workflow in order to generate the lineage. Valid values are "TRUE" and "FALSE".
Lineage metadata fields¶
- DESCRIPTION: Description to be saved on the lineage/process asset that connects the source/target objects on that row of the mapping file.
- EXPRESSION: SQL/Expression to be saved on the lineage/process asset.
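To make the file layout concrete, here is a hedged sketch that builds a mapping file for one relational source and one S3 target using Python's standard `csv` module. The column order, the function name `build_mapping_csv`, and every value in `EXAMPLE_ROW` (connection qualified names, ARNs, table and bucket names) are illustrative assumptions, not values taken from the package.

```python
import csv
import io

# Columns for a Table -> S3 Object mapping file.
# Assumption: headers equal the documented field names; order is arbitrary.
COLUMNS = [
    "SOURCE_TYPE", "SOURCE_CONN", "SOURCE_DB", "SOURCE_SCHEMA", "SOURCE_TABLE",
    "TARGET_TYPE", "TARGET_CONN", "TARGET_BUCKET", "TARGET_BUCKET_ARN",
    "TARGET_OBJECT", "TARGET_OBJECT_ARN",
    "CREATE_SOURCE_IF_NOT_EXISTS", "CREATE_TARGET_IF_NOT_EXISTS",
    "DESCRIPTION", "EXPRESSION",
]

# Placeholder values for illustration only.
EXAMPLE_ROW = {
    "SOURCE_TYPE": "Table",
    "SOURCE_CONN": "default/snowflake/1700000000",
    "SOURCE_DB": "SALES",
    "SOURCE_SCHEMA": "RAW",
    "SOURCE_TABLE": "ORDERS",
    "TARGET_TYPE": "S3 Object",
    "TARGET_CONN": "default/s3/1700000001",
    "TARGET_BUCKET": "example-bucket",
    "TARGET_BUCKET_ARN": "arn:aws:s3:::example-bucket",
    "TARGET_OBJECT": "exports/orders.csv",
    "TARGET_OBJECT_ARN": "arn:aws:s3:::example-bucket/exports/orders.csv",
    "CREATE_SOURCE_IF_NOT_EXISTS": "FALSE",
    "CREATE_TARGET_IF_NOT_EXISTS": "TRUE",
    "DESCRIPTION": "Nightly export of orders",
    "EXPRESSION": "SELECT order_id, amount FROM SALES.RAW.ORDERS",
}


def build_mapping_csv(rows: list) -> str:
    """Serialize mapping rows to CSV text with a header line.

    DictWriter quotes values containing commas (such as SQL expressions)
    and raises ValueError if a row has a key outside COLUMNS.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_mapping_csv([EXAMPLE_ROW]))
```

Writing through `csv.DictWriter` rather than joining strings by hand keeps EXPRESSION values that contain commas or quotes intact when the file is read back.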
Templates¶
Workflow Setup¶
Credentials¶
Input Method¶
- Input: This identifies the method by which the package will access the input file. Currently, the only option is S3 Bucket.
S3 Input Option Parameters¶
- AWS Access Key: AWS Access Key used to gain access to the S3 bucket where the mapping file is located.
- AWS Access Secret: AWS Access Secret used to gain access to the S3 bucket where the mapping file is located.
- S3 Bucket Name: Name of the S3 bucket where the input file is located.
- Mapping Filename/Key: Name of the CSV mapping file/key in S3 including the prefix.
- S3 Region: AWS Region where the S3 bucket is located.
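The parameters above describe where the workflow looks for the mapping file, so the file has to be placed there first. One way to do that is sketched below with the third-party `boto3` library; this is an assumption about how you might stage the file, not something the package provides. The function names `mapping_key` and `upload_mapping_file` are hypothetical.

```python
def mapping_key(prefix: str, filename: str) -> str:
    """Join an optional S3 prefix and a filename into a full object key.

    The result is what would be entered as "Mapping Filename/Key".
    """
    cleaned = prefix.strip("/")
    return f"{cleaned}/{filename}" if cleaned else filename


def upload_mapping_file(local_path: str, bucket: str, key: str,
                        region: str, access_key: str, secret_key: str) -> None:
    """Upload a local mapping CSV to S3 with explicit credentials."""
    # boto3 is a third-party dependency (pip install boto3); it is imported
    # here so that the helper above can be used without it installed.
    import boto3

    client = boto3.client(
        "s3",
        region_name=region,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    client.upload_file(local_path, bucket, key)
```

For example, `mapping_key("lineage/inputs/", "mapping.csv")` yields `lineage/inputs/mapping.csv`, which is the value that includes the prefix.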
Configuration¶
- Connection QN: Qualified Name of the connection where the lineage/process assets will be created. NOTE: The connection must be created via the API, separately from running the workflow.
- Name: Name of the custom metadata set created/used by the workflow to store the reference info about this source. Can be shared by multiple instances of this package.
- Instance Name: Name of the custom metadata property that will store the identity of the workflow.
- Instance Unique ID: Unique identifier to be stored on each asset that is created by this workflow. MUST BE UNIQUE TO THE WORKFLOW.
How it works¶
- The Custom Metadata set/property used for identifying the workflow may be pre-created, or it will be created by the workflow. It is recommended to let the workflow create it as it will be "locked" in the UI so that it cannot be inadvertently modified.
- Every unique asset in the input file (database, schema, table, S3 bucket, S3 object, etc.) that has the "CREATE_SOURCE/TARGET_IF_NOT_EXISTS" field set to "TRUE" will be created by the workflow.
- If an asset has the "CREATE_SOURCE/TARGET_IF_NOT_EXISTS" field set to "FALSE", the workflow will not create the asset. It must already exist in Atlan under the specified connection if the lineage represented by that line is to be generated.
- Lineage will be created in Atlan for every row in the input file for which both the source and target assets exist in Atlan and are active (either created by the workflow or pre-existing).
- If the Description or Expression lineage metadata are updated in the input mapping file after they were created by the workflow, they will be updated on the process asset in a subsequent run.
- On every asset (including lineage), the Custom Metadata property identified in the configuration will be set to the CIF Unique ID value, so that the workflow can locate the assets it authored in subsequent runs and deprecate them if needed.
- If assets or lineage previously created by the workflow are no longer found in the input mapping file, the workflow will deprecate them (archive the assets, purge/delete the lineage).
- If, on a subsequent run, the "CREATE_SOURCE/TARGET_IF_NOT_EXISTS" field is set to "FALSE" for an asset that was previously created by the workflow, the asset (and corresponding lineage) will be deprecated.
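The rules above amount to a reconciliation between the input file and what the workflow authored in earlier runs. The sketch below is a simplified model of that behavior, not the package's code: the function `plan_run`, its arguments, and the string asset identifiers are all hypothetical, metadata updates to Description and Expression are left out, and an asset flagged "TRUE" on any row is assumed to be wanted.

```python
def plan_run(rows, preexisting, owned_assets, owned_lineage):
    """Work out what a run would create and deprecate.

    rows          -- dicts with "source", "target" (asset identifiers) and
                     "create_source", "create_target" (booleans)
    preexisting   -- assets that exist in Atlan independently of the workflow
    owned_assets  -- assets created by earlier runs of this workflow
    owned_lineage -- (source, target) pairs created by earlier runs
    """
    # Assets the file asks the workflow to create (flag set to TRUE).
    wanted_assets = set()
    for row in rows:
        if row["create_source"]:
            wanted_assets.add(row["source"])
        if row["create_target"]:
            wanted_assets.add(row["target"])

    # After the run, these assets exist and are active.
    available = set(preexisting) | wanted_assets

    # Lineage is generated only where both ends are available.
    wanted_lineage = {
        (row["source"], row["target"])
        for row in rows
        if row["source"] in available and row["target"] in available
    }

    return {
        "create_assets": wanted_assets - set(preexisting) - set(owned_assets),
        "create_lineage": wanted_lineage - set(owned_lineage),
        # Owned assets missing from the file, or now flagged FALSE.
        "archive_assets": set(owned_assets) - wanted_assets,
        "delete_lineage": set(owned_lineage) - wanted_lineage,
    }
```

In this model, a row whose source is flagged "FALSE" and does not already exist produces no lineage, while its target is still created if flagged "TRUE", in line with the rule that every asset flagged for creation is created.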