S3 Hierarchy Crawler

The S3 hierarchy crawler package crawls a specified set of buckets that are structured hierarchically. The directory/prefix structure is represented in Atlan as a database hierarchy (database, schema, table).

For example, with a structure such as:

Amazon S3 > Buckets > atlan-csa-raw > analytics-datamodels/ > customer/ > dt=20240710/ > hr=01/

  • Bucket atlan-csa-raw maps to the domain or database-level
  • Prefix Level 1 analytics-datamodels/ maps to the schema-level
  • Prefix Level 2 customer/ maps to the entity or table-level
  • Prefix Level 3 dt=20240710/ represents a partition for the entity (may not be present)
  • Prefix Level 4 hr=01/ represents a second-level partition for the entity (may not be present)

Therefore, when the bucket is crawled, each of these should create the respective assets (database, schema, table) in Atlan.

Partitions are optional, but if they exist, they will always be in the format partition_column=partition_value. There can be multiple nested levels of partitions, but only one partition is defined per folder level. The package configuration includes an option to write the partition column name(s) (e.g. dt hr) to a custom metadata property. The order of the names matches the order in the folder hierarchy. Note that in this example the partitions form a date hierarchy, but in practice any text value could be used to designate a partition.
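The partition convention above can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name partition_columns and its table_level parameter are hypothetical, and it assumes the object key is given without the bucket name, so the first key segment is prefix level 1.

```python
def partition_columns(key: str, table_level: int = 2) -> list[str]:
    """Return partition column names found below the table level, in folder order.

    Assumption: `key` excludes the bucket, so its first segment is prefix level 1.
    """
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in key.split("/") if s]
    columns = []
    # Everything after the table prefix is a candidate partition folder.
    for segment in segments[table_level:]:
        name, sep, value = segment.partition("=")
        if not (sep and name and value):
            # First segment that is not column=value ends the partition levels.
            break
        columns.append(name)
    return columns
```

For the key analytics-datamodels/customer/dt=20240710/hr=01/ this returns ["dt", "hr"], which could then be joined with a space to give the dt hr value mentioned above.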

Configuration

Connection Name

Provide a connection name to associate with the catalog.

Credentials

Two authentication models are available.

IAM user: provide the AWS Access Key and Secret Key for an IAM user that has access to the S3 bucket. The policy below illustrates the permissions needed.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetEncryptionConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket>",
        "arn:aws:s3:::<s3_bucket>/*"
      ]
    }
  ]
}

IAM role: allows for role delegation. To configure it:

  • Raise a support ticket to get the ARN of the Node Instance Role for your Atlan EKS cluster.

  • Create a new policy in your AWS account with the permissions below:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetEncryptionConfiguration"
          ],
          "Resource": [
            "arn:aws:s3:::<s3_bucket>",
            "arn:aws:s3:::<s3_bucket>/*"
          ]
        }
      ]
    }
    
  • Create a new role in your AWS account by following the steps in the AWS Identity and Access Management User Guide.

  • When prompted for policies, attach the policy created earlier to this role.

  • When prompted, create a trust relationship for the role using the following trust policy. (Replace <atlan_nodeinstance_role_arn> with the ARN received from Atlan support.)

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "<atlan_nodeinstance_role_arn>"
          },
          "Action": "sts:AssumeRole",
          "Condition": {}
        }
      ]
    }
    
  • Now, reach out to Atlan support with:

    • The name of the role you created above.
    • The ID of the AWS account where the role was created.

Warning

Wait until the support team confirms the account is allowlisted to assume the role before setting up the workflow.

Bucket Details

Specify the S3 Bucket name (without the s3:// prefix), Prefix and Region. To catalog all assets in the bucket including those at the root, leave Prefix empty.

Bucket List

Upload a JSON file containing the set of buckets to be crawled along with an optional 'domain-name'. Following is an example:

[
 {
   "bucket-name": "atlan-csa-raw",
   "domain-name": "atlan-cs"
 },
 {
   "bucket-name": "atlan-csm-raw",
   "domain-name": "atlan-cs"
 }
]
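Before uploading, it can help to check the file locally. The following is a rough sketch, not part of the package: load_bucket_list is a hypothetical helper, and it assumes only what the example shows, namely a JSON array of objects with a required 'bucket-name' and an optional 'domain-name'.

```python
import json


def load_bucket_list(text: str) -> list[dict]:
    """Parse bucket list JSON and check the fields shown in the example above."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("bucket list must be a JSON array")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} must be a JSON object")
        name = entry.get("bucket-name")
        if not isinstance(name, str) or not name:
            raise ValueError(f"entry {i} is missing 'bucket-name'")
        # 'domain-name' is optional, but must be a string when present.
        domain = entry.get("domain-name")
        if domain is not None and not isinstance(domain, str):
            raise ValueError(f"entry {i} has a non-string 'domain-name'")
    return entries
```

Calling load_bucket_list(open("buckets.json").read()) on the example above would return both entries unchanged; the file name here is only an illustration.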

Database naming option

Select one of the following options as the database naming strategy:

  • Bucket Name
  • Domain Name
  • Static
  • Static + Bucket
  • Static + Domain
  • Prefix in hierarchy

Database prefix static string

Required if one of the “static” options is selected above. This is a string value that is used as the name of the database (Static), or prepended to the bucket name or domain-name (Static + Bucket, Static + Domain).
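One way to picture how the naming options combine is the sketch below. It is illustrative only: database_name and its parameters are hypothetical, the static string is assumed to be concatenated directly with no separator, and raising an error when a needed value is missing is also an assumption rather than documented behavior.

```python
from typing import Optional


def database_name(option: str, bucket: str, domain: Optional[str] = None,
                  static: str = "", prefix: Optional[str] = None) -> str:
    """Resolve a database name for one bucket under the chosen naming option."""

    def require(value: Optional[str], label: str) -> str:
        # Assumption: a missing input is treated as a configuration error.
        if not value:
            raise ValueError(f"{label} is required for option {option!r}")
        return value

    if option == "Bucket Name":
        return bucket
    if option == "Domain Name":
        return require(domain, "domain-name")
    if option == "Static":
        return require(static, "static string")
    if option == "Static + Bucket":
        # Assumption: plain concatenation, no separator added.
        return require(static, "static string") + bucket
    if option == "Static + Domain":
        return require(static, "static string") + require(domain, "domain-name")
    if option == "Prefix in hierarchy":
        # `prefix` is the folder name found at the Database prefix level.
        return require(prefix, "prefix")
    raise ValueError(f"unknown database naming option: {option!r}")
```

Under these assumptions, database_name("Static + Bucket", "atlan-csa-raw", static="raw_") yields raw_atlan-csa-raw.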

Database prefix level

A numeric value that defines the level in the hierarchy at which the “database” is defined. Defaults to “0”, which corresponds to the bucket.

Schema prefix level

A numeric value that defines the level in the hierarchy at which the “schema” is defined. Defaults to “1”, meaning that the schema is defined directly under the bucket. Must be at least Database prefix level + 1.

Table prefix level

A numeric value that defines the level in the hierarchy at which the “tables” are defined. Defaults to “2”. Must be at least Schema prefix level + 1. Setting this to a value two or more greater than the Schema prefix level allows for intermediate levels in the prefix/folder structure that are not part of the data model hierarchy (such as a date folder). In this case, only one set of intermediate folders should be traversed. Additionally, in this case, the item at the Table prefix level may be an object rather than a directory.
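The constraints between the three prefix levels can be summarized in a short sketch. The names validate_levels and split_hierarchy are hypothetical and this is not the package's code. It assumes level 0 is the bucket and each "/"-separated key segment is one further level, and it returns the raw names at each level without applying the database naming option.

```python
from typing import Optional, Tuple


def validate_levels(database_level: int = 0, schema_level: int = 1,
                    table_level: int = 2) -> None:
    """Check the ordering rules stated for the three prefix levels."""
    if database_level < 0:
        raise ValueError("Database prefix level must be 0 or greater")
    if schema_level < database_level + 1:
        raise ValueError("Schema prefix level must be at least Database prefix level + 1")
    if table_level < schema_level + 1:
        raise ValueError("Table prefix level must be at least Schema prefix level + 1")


def split_hierarchy(bucket: str, key: str, database_level: int = 0,
                    schema_level: int = 1,
                    table_level: int = 2) -> Optional[Tuple[str, str, str]]:
    """Return (database, schema, table) names for a key, or None if too shallow."""
    validate_levels(database_level, schema_level, table_level)
    # Level 0 is the bucket; each non-empty key segment adds one level.
    levels = [bucket] + [s for s in key.split("/") if s]
    if len(levels) <= table_level:
        return None
    return levels[database_level], levels[schema_level], levels[table_level]
```

With the defaults, the example key analytics-datamodels/customer/dt=20240710/hr=01/ in bucket atlan-csa-raw resolves to ("atlan-csa-raw", "analytics-datamodels", "customer"). With a Table prefix level of 3, a key such as analytics-datamodels/2024/customer.csv skips the intermediate 2024 folder and treats the object customer.csv as the table.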

Add domain as custom metadata

Boolean input (defaults to False) that indicates whether the name of the domain should be added to all assets created by the package (database, schema, table).

Domain Custom Metadata Set Name

The name of the custom metadata set that contains the property to which the Domain value should be written. The value comes from the domain-name of the bucket under which the assets were crawled. (Required if “Add domain as custom metadata” is True.)

Domain Custom Metadata Property Name

The name of the property in the Domain Custom Metadata Set to which the Domain value should be written. The value comes from the domain-name of the bucket under which the assets were crawled. (Required if “Add domain as custom metadata” is True.)

Partition levels

A numeric value that defines the maximum number of nested partition levels that could exist, and therefore how many levels beyond the table level the package must inspect in the folder hierarchy.

Partition Custom Metadata Set Name

The name of the custom metadata set to store the partition column name(s) from any partitions in the levels defined.

Partition Custom Metadata Property Name

The name of the property in the custom metadata set to store the partition column name(s) from any partitions in the levels defined.

What it does

Iterates over the buckets listed in the bucket list file and creates or updates databases, schemas and tables as determined by the prefix levels specified. All of these assets are associated with the connection specified in Connection Name. Domain information is added to the database, schema and table in the custom metadata property specified for the domain. Partition information is added to the table in the custom metadata property specified for partition information.