Lineage and Asset Loader¶
This package creates lineage and (optionally) assets based on an input CSV file that contains source-to-target mappings.
Use Cases¶
- ELT/ETL is performed in a tool that Atlan does not have a native connector for, but source-to-target mapping information can be extracted or produced from the tool.
- Lineage cannot be extracted from any tool, but the source-to-target mapping information can be generated through simple logic or manual effort.
Pre-requisites¶
- An AWS S3 bucket with access provisioned via an AWS access key and secret, which allows access by a non-AWS application.
Authentication Mechanism¶
User-based authentication¶
To configure user-based authentication:
- Create an AWS IAM user by following the steps in the AWS Identity and Access Management User Guide.
- On the Set permissions page, attach your IAM policy to this user.
- Once the user is created, view or download the user's access key ID and secret access key.
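The IAM policy itself is not spelled out on this page. As a minimal sketch, assuming the package only needs to list the bucket and read the mapping file, a read-only policy document could be generated as follows. The function name `build_read_only_policy`, the bucket name, and the key prefix are hypothetical; adjust the actions to whatever your security team requires.

```python
import json


def build_read_only_policy(bucket_name: str, key_prefix: str = "") -> dict:
    """Return an IAM policy document granting read access to one bucket.

    Assumption: listing the bucket and reading objects is sufficient.
    The exact permissions the package needs are not stated in the source.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Reading applies to the objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"{bucket_arn}/{key_prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    policy = build_read_only_policy("my-mapping-bucket", "lineage/")
    print(json.dumps(policy, indent=2))
```

The same policy document can be attached to the IAM user, the EC2 role, or the delegated role described in the following sections.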
Role-based authentication¶
To configure role-based authentication, attach your IAM policy to the EC2 role that Atlan uses for its EC2 instances in the EKS cluster. Please raise a support ticket to use this option.
Role delegation-based authentication¶
To configure role delegation-based authentication:
- Raise a support ticket to get the ARN of the Node Instance Role for your Atlan EKS cluster.
- Create a new role in your AWS account by following the steps in the AWS Identity and Access Management User Guide.
- When prompted for policies, attach your IAM policy to this role.
- When prompted, create a trust relationship for the role using the following trust policy. (Replace <atlan_nodeinstance_role_arn> with the ARN received from Atlan support.)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<atlan_nodeinstance_role_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
- Reach out to Atlan support with:
  - The name of the role you created above.
  - The ID of the AWS account where the role was created.
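To avoid leaving the placeholder in the trust policy by accident, the document can be rendered from the ARN you receive. The following is a minimal sketch; `build_trust_policy` is a hypothetical helper, and the ARN in the usage example is a made-up value.

```python
import json


def build_trust_policy(node_instance_role_arn: str) -> str:
    """Render the trust policy shown above for a given Node Instance Role ARN."""
    arn = node_instance_role_arn.strip()
    # Guard against pasting the unreplaced placeholder or an empty value.
    if not arn.startswith("arn:") or "<" in arn or ">" in arn:
        raise ValueError(f"not a usable role ARN: {node_instance_role_arn!r}")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": arn},
                "Action": "sts:AssumeRole",
                "Condition": {},
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    print(build_trust_policy("arn:aws:iam::123456789012:role/example-node-instance-role"))
```

The printed JSON can be pasted into the trust relationship editor when creating the role.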
Input¶
Each instance of the custom package workflow has one input mapping file. The file is a CSV with fields that identify/map the source and target assets to one another.
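The fields in the input mapping file fall into the sets described in the sections below: source identifiers, target identifiers, asset creation controllers, and lineage metadata fields.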
The precise fields that are required for Source and Target sets depend on the asset types.
Currently, the package supports Relational Database and S3 types only.
Each file can have only one source type and one target type.
Source Identifiers¶
Regardless of the source type, the following two fields are required:
- SOURCE_TYPE: Object type of the source asset that the lineage is to be created for. Must be "Table" for database type and "S3 Object" for S3 type.
- SOURCE_CONN: Qualified Name of the connection where the source assets reside or will be created
Use the following fields when the source assets are from a Relational Database.
- SOURCE_DB: Name of the database where the source table object is found
- SOURCE_SCHEMA: Name of the schema where the source table object is found
- SOURCE_TABLE: Name of the source table object
Use the following fields when the source assets are from S3.
- SOURCE_BUCKET: Name of the bucket where the S3 objects are found
- SOURCE_BUCKET_ARN: ARN of the bucket
- SOURCE_OBJECT: Name (key) of the S3 object
- SOURCE_OBJECT_ARN: ARN of the S3 object
Target Identifiers¶
Regardless of the target type, the following two fields are required:
- TARGET_TYPE: Object type of the target asset that the lineage is to be created for. Must be "Table" for database type and "S3 Object" for S3 type.
- TARGET_CONN: Qualified Name of the connection where the target assets reside or will be created
Use the following fields when the target assets are from a Relational Database.
- TARGET_DB: Name of the database where the target table object is found
- TARGET_SCHEMA: Name of the schema where the target table object is found
- TARGET_TABLE: Name of the target table object
Use the following fields when the target assets are from S3.
- TARGET_BUCKET: Name of the bucket where the S3 objects are found
- TARGET_BUCKET_ARN: ARN of the bucket
- TARGET_OBJECT: Name (key) of the S3 object
- TARGET_OBJECT_ARN: ARN of the S3 object
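The identifier rules above can be checked before uploading a file. The sketch below is not part of the package; `validate_rows` is a hypothetical helper. It assumes that every field listed for a type is mandatory (the text only says to "use" them) and that rows have been read with `csv.DictReader`.

```python
from typing import Dict, List

# Fields listed above for each supported type, without the SOURCE_/TARGET_ prefix.
FIELDS_BY_TYPE = {
    "Table": ("DB", "SCHEMA", "TABLE"),
    "S3 Object": ("BUCKET", "BUCKET_ARN", "OBJECT", "OBJECT_ARN"),
}


def validate_rows(rows: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors: List[str] = []
    for side in ("SOURCE", "TARGET"):
        # Each file can have only one source type and one target type.
        types = {(row.get(f"{side}_TYPE") or "").strip() for row in rows}
        if len(types) > 1:
            errors.append(f"{side}_TYPE has more than one value: {sorted(types)}")
    for number, row in enumerate(rows, start=1):
        for side in ("SOURCE", "TARGET"):
            asset_type = (row.get(f"{side}_TYPE") or "").strip()
            if asset_type not in FIELDS_BY_TYPE:
                errors.append(f"data row {number}: unsupported {side}_TYPE {asset_type!r}")
                continue
            # The connection field is required regardless of type.
            for suffix in ("CONN",) + FIELDS_BY_TYPE[asset_type]:
                if not (row.get(f"{side}_{suffix}") or "").strip():
                    errors.append(f"data row {number}: {side}_{suffix} is empty")
    return errors
```

Running this on the parsed file and refusing to upload when the returned list is non-empty catches mixed types and blank identifiers early.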
Asset creation controllers¶
- CREATE_SOURCE_IF_NOT_EXISTS: Controls whether the script creates the source asset referenced by the source identifier fields, or requires it to already exist independently of the script in order to generate the lineage. Valid values are "TRUE" and "FALSE".
- CREATE_TARGET_IF_NOT_EXISTS: Controls whether the script creates the target asset referenced by the target identifier fields, or requires it to already exist independently of the script in order to generate the lineage. Valid values are "TRUE" and "FALSE".
Lineage metadata fields¶
- DESCRIPTION: Description to be saved on the lineage/process asset that connects the source/target objects on that row of the mapping file.
- EXPRESSION: SQL/Expression to be saved on the lineage/process asset.
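To make the file layout concrete, here is a sketch that writes a table-to-table mapping file using the field names defined above. The column order, the omission of the unused S3 columns, and every value in `SAMPLE_ROWS` (connection qualified names, database, schema, and table names) are assumptions for illustration; `build_mapping_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List

# Columns for a file whose source and target are both relational tables.
COLUMNS = [
    "SOURCE_TYPE", "SOURCE_CONN", "SOURCE_DB", "SOURCE_SCHEMA", "SOURCE_TABLE",
    "TARGET_TYPE", "TARGET_CONN", "TARGET_DB", "TARGET_SCHEMA", "TARGET_TABLE",
    "CREATE_SOURCE_IF_NOT_EXISTS", "CREATE_TARGET_IF_NOT_EXISTS",
    "DESCRIPTION", "EXPRESSION",
]

# Placeholder values only; replace them with your own identifiers.
SAMPLE_ROWS = [
    {
        "SOURCE_TYPE": "Table",
        "SOURCE_CONN": "default/snowflake/1700000000",
        "SOURCE_DB": "RAW",
        "SOURCE_SCHEMA": "SALES",
        "SOURCE_TABLE": "ORDERS_RAW",
        "TARGET_TYPE": "Table",
        "TARGET_CONN": "default/snowflake/1700000000",
        "TARGET_DB": "ANALYTICS",
        "TARGET_SCHEMA": "SALES",
        "TARGET_TABLE": "ORDERS",
        "CREATE_SOURCE_IF_NOT_EXISTS": "FALSE",
        "CREATE_TARGET_IF_NOT_EXISTS": "TRUE",
        "DESCRIPTION": "Nightly load of cleaned orders",
        "EXPRESSION": "INSERT INTO ORDERS SELECT * FROM ORDERS_RAW",
    }
]


def build_mapping_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize mapping rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_mapping_csv(SAMPLE_ROWS), end="")
```

Using `csv.DictWriter` rather than string concatenation keeps descriptions and expressions that contain commas or quotes correctly escaped.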
Templates¶
Workflow Setup¶
Credentials¶
Input Method¶
- Input: This identifies the method by which the package will access the input file. Currently, the only option is S3 Bucket.
S3 Input Option Parameters¶
The user can access the S3 bucket either via IAM User or IAM Role. AWS Access Key and AWS Access Secret are required for IAM User authentication. IAM Role Access supports both role-delegation and role-identity based access.
- AWS Access Key: AWS Access Key used to gain access to the S3 bucket where the mapping file is located.
- AWS Access Secret: AWS Access Secret used to gain access to the S3 bucket where the mapping file is located.
- AWS Role ARN: AWS Role used to gain access to the S3 bucket where the mapping file is located. This is an optional field in case of role-identity based access.
- S3 Bucket Name: Name of the S3 bucket where the input file is located.
- Mapping Filename/Key: Name of the CSV mapping file/key in S3 including the prefix.
- S3 Region: AWS Region where the S3 bucket is located.
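The combination rules for these parameters can be expressed as a small pre-flight check. This is a sketch, not the package's own logic: `check_s3_input` and the dictionary keys are hypothetical names, and treating bucket name, key, and region as mandatory is an assumption.

```python
from typing import Mapping, Optional

# Assumed to be needed for every access method.
REQUIRED = ("s3_bucket_name", "mapping_key", "s3_region")


def check_s3_input(params: Mapping[str, Optional[str]]) -> str:
    """Validate the S3 input parameters and return the implied access method.

    Returns "iam-user" when a key and secret are supplied, otherwise "iam-role".
    The Role ARN is not checked because it is optional for role-identity access.
    """
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    has_key = bool(params.get("aws_access_key"))
    has_secret = bool(params.get("aws_access_secret"))
    if has_key != has_secret:
        raise ValueError("AWS Access Key and AWS Access Secret must be supplied together")
    return "iam-user" if has_key else "iam-role"
```

For example, supplying only the bucket name, key, and region yields `"iam-role"`, while supplying a key without a secret raises a `ValueError`.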
Configuration¶
- Connection QN: Qualified Name of the connection where the lineage/process assets will be created. NOTE: This connection must be created via the API, separately from running the workflow.
- Name: Name of the custom metadata set created or used by the workflow to store the reference information about this source. The same set can be used for multiple instances of this package.
- Instance Name: Name of the custom metadata property that stores the identity of the workflow.
- Instance Unique ID: Unique identifier stored on each asset that is created by this workflow. It must be unique to the workflow.
How it works¶
- The Custom Metadata set/property used for identifying the workflow may be pre-created, or it will be created by the workflow. It is recommended to let the workflow create it as it will be "locked" in the UI so that it cannot be inadvertently modified.
- Every unique asset in the input file (database, schema, table, S3 bucket, S3 object, etc.) that has the "CREATE_SOURCE/TARGET_IF_NOT_EXISTS" field set to "TRUE" will be created by the workflow.
- If an asset has the "CREATE_SOURCE/TARGET_IF_NOT_EXISTS" field set to "FALSE", the workflow will not create the asset. It must already exist in Atlan under the specified connection if the lineage represented by that line is to be generated.
- Lineage will be created in Atlan for every row in the input file for which both the source and target assets exist in Atlan and are active (either created by the workflow or pre-existing).
- If the Description or Expression lineage metadata are updated in the input mapping file after the lineage was created by the workflow, they will be updated on the process asset in a subsequent run.
- Every asset (including lineage) will have the Custom Metadata property identified in the configuration set to the Instance Unique ID value, so that the workflow can locate the assets it authored in subsequent runs and deprecate them if needed.
- If assets or lineage previously created by the workflow are no longer found in the input mapping file, the workflow will deprecate them (archive the assets, purge/delete the lineage).
- If, on a subsequent run, the "CREATE_SOURCE/TARGET_IF_NOT_EXISTS" field is set to "FALSE" for an asset that was previously created by the workflow, the asset (and corresponding lineage) will be deprecated.
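The rules above amount to a reconciliation between the mapping file and what the workflow created earlier. The following sketch models that reasoning with plain sets; it is an illustration, not the package's implementation. It handles tables only (parent databases and schemas are ignored), `asset_key` and `plan_run` are hypothetical names, and treating an asset as workflow-managed when any row sets its flag to "TRUE" is an assumption.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Edge = Tuple[str, str]


def asset_key(row: Row, side: str) -> str:
    """Build a hypothetical identifier for the table on one side of a row."""
    parts = ("CONN", "DB", "SCHEMA", "TABLE")
    return "/".join(row[f"{side}_{part}"] for part in parts)


def plan_run(rows: List[Row], pre_existing: Set[str],
             created_before: Set[str], lineage_before: Set[Edge]) -> Dict[str, set]:
    """Work out what a run would create, archive, link, and purge.

    pre_existing: active assets in Atlan that this workflow did not create.
    created_before: assets this workflow created in earlier runs.
    lineage_before: (source, target) pairs this workflow linked in earlier runs.
    """
    managed: Set[str] = set()
    for row in rows:
        for side in ("SOURCE", "TARGET"):
            flag = row[f"CREATE_{side}_IF_NOT_EXISTS"].strip().upper()
            if flag == "TRUE":
                managed.add(asset_key(row, side))
    # "IF_NOT_EXISTS": assets that already exist elsewhere are not taken over.
    managed -= pre_existing
    available = managed | pre_existing
    lineage: Set[Edge] = set()
    for row in rows:
        edge = (asset_key(row, "SOURCE"), asset_key(row, "TARGET"))
        # Lineage needs both ends to exist after the run.
        if edge[0] in available and edge[1] in available:
            lineage.add(edge)
    return {
        "create": managed - created_before,
        "archive": created_before - managed,
        "lineage": lineage,
        "purge_lineage": lineage_before - lineage,
    }
```

An asset whose flag flips from "TRUE" to "FALSE" drops out of `managed`, so it lands in `archive` and any lineage touching it lands in `purge_lineage`, matching the last rule above.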