S3 Hierarchy Crawler¶

The S3 hierarchy crawler package crawls the metadata files produced by Amazon S3 Inventory for a specified set of buckets. The directory/prefix structure is represented in Atlan as a database hierarchy.

For example, with a structure such as:

Amazon S3 > Buckets > atlan-csa-raw > analytics-datamodels/ > customer/ > dt=20240710/ > hr=01/

Bucket atlan-csa-raw maps to the domain or database level
Prefix level 1 analytics-datamodels/ maps to the schema level
Prefix level 2 customer/ maps to the entity or table level
Prefix level 3 dt=20240710/ represents a partition for the entity (may not be present)
Prefix level 4 hr=01/ represents a second-level partition for the entity (may not be present)

Therefore, when the bucket is crawled, each of these levels creates the corresponding asset (database, schema, table) in Atlan.

Partitions are optional, but if they exist, they are always in the format partition_column=partition_value. There can be multiple nested levels of partitions, but only one partition is defined per folder level. The order of the partitions must match the order in the folder hierarchy. Note that in this example the partitions form a date hierarchy, but in practice any text value can be used to designate a partition.
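The mapping described above can be sketched in a few lines of Python. This is only an illustration of the default levels (bucket, prefix level 1, prefix level 2, then partitions), not the package's actual implementation. The names HierarchyEntry and map_key and the file name part-0000.parquet are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchyEntry:
    database: str
    schema: str
    table: str
    partitions: list = field(default_factory=list)  # [(column, value), ...]


def map_key(bucket: str, key: str) -> HierarchyEntry:
    """Map a bucket name and object key to database, schema, table, and partitions."""
    parts = [p for p in key.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"key {key!r} has no schema and table level")
    partitions = []
    for segment in parts[2:]:
        column, sep, value = segment.partition("=")
        if not sep:
            break  # the first non-partition segment (e.g. a file name) ends the run
        partitions.append((column, value))
    return HierarchyEntry(bucket, parts[0], parts[1], partitions)


# Hypothetical object key that follows the example structure above.
entry = map_key(
    "atlan-csa-raw",
    "analytics-datamodels/customer/dt=20240710/hr=01/part-0000.parquet",
)
print(entry)
```

Partitions are kept as an ordered list so that the column order follows the folder hierarchy.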

Configuration¶

Connection Name¶

Provide a connection name to associate with the catalog.

Inventory configuration name¶

The name of the S3 Inventory configuration.

Credentials¶

Two authentication models are available.

IAM User

Provide the AWS Access Key and Secret Key for an IAM user that has access to the S3 bucket. The policy below illustrates the permissions needed.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetEncryptionConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket>",
        "arn:aws:s3:::<s3_bucket>/*"
      ]
    }
  ]
}
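If the same policy has to cover several buckets, the Resource list can be generated rather than edited by hand. The sketch below assumes that one statement listing every bucket is acceptable in your account. The function name build_policy is hypothetical, and the actions are copied from the policy above.

```python
import json

# Actions copied from the policy shown above.
ACTIONS = [
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:GetObject",
    "s3:GetEncryptionConfiguration",
]


def build_policy(bucket_names):
    """Return a policy document granting the read actions on each bucket."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself
        resources.append(f"arn:aws:s3:::{name}/*")  # every object in the bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": list(ACTIONS),
                "Resource": resources,
            }
        ],
    }


policy = build_policy(["atlan-csa-raw"])
print(json.dumps(policy, indent=2))
```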

IAM Role

Allows for role delegation. To configure:

Raise a support ticket to get the ARN of the Node Instance Role for your Atlan EKS cluster.

Create a new policy in your AWS account with the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetEncryptionConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket>",
        "arn:aws:s3:::<s3_bucket>/*"
      ]
    }
  ]
}

Create a new role in your AWS account by following the steps in the AWS Identity and Access Management User Guide.
When prompted for policies, attach the policy created earlier to this role.

When prompted, create a trust relationship for the role using the following trust policy. Replace <atlan_nodeinstance_role_arn> with the ARN received from Atlan support.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<atlan_nodeinstance_role_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}

Now, reach out to Atlan support with:
- The name of the role you created above.
- The ID of the AWS account where the role was created.

Warning

Wait until the support team confirms the account is allowlisted to assume the role before setting up the workflow.

Bucket Details¶

Specify the S3 Bucket name (without the s3:// prefix), Prefix and Region where the S3 Inventory reports can be found.

Bucket List¶

Upload a JSON file containing the set of buckets to be crawled, along with an optional 'domain-name', a list of 'schemas-to-include', and a list of 'schemas-to-exclude'. For example:

[
  {
    "bucket-name": "autodesk-test",
    "domain-name": "atlan-cs",
    "schemas-to-include": [],
    "schemas-to-exclude": [
      "fusion_f3d_data"
    ]
  }
]
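Before uploading, it can help to check the file locally. The following sketch validates the structure shown above and applies the include/exclude rule. The function names are hypothetical. It also assumes that a non-empty 'schemas-to-include' list takes precedence over 'schemas-to-exclude', which the documentation does not spell out.

```python
import json

EXAMPLE = """
[
  {
    "bucket-name": "autodesk-test",
    "domain-name": "atlan-cs",
    "schemas-to-include": [],
    "schemas-to-exclude": ["fusion_f3d_data"]
  }
]
"""


def load_bucket_list(text):
    """Parse the bucket list and check the fields used by the crawler."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("bucket list must be a JSON array")
    for entry in entries:
        if not isinstance(entry, dict) or not entry.get("bucket-name"):
            raise ValueError("every entry needs a 'bucket-name'")
        for key in ("schemas-to-include", "schemas-to-exclude"):
            if not isinstance(entry.get(key, []), list):
                raise ValueError(f"'{key}' must be a list")
    return entries


def schema_selected(entry, schema):
    """Return True if the schema should be crawled for this bucket entry."""
    include = entry.get("schemas-to-include", [])
    exclude = entry.get("schemas-to-exclude", [])
    if include:
        # Assumption: a non-empty include list wins over the exclude list.
        return schema in include
    return schema not in exclude


entries = load_bucket_list(EXAMPLE)
print(schema_selected(entries[0], "fusion_f3d_data"))
```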

Database naming option¶

Select one of the following options to use as the database naming strategy:

- Bucket Name
- Domain Name
- Static
- Static + Bucket
- Static + Domain
- Prefix in hierarchy

Database prefix static string¶

Required if one of the “static” options is selected above. This string is used as the name of the database, or is prepended to the bucket name or domain name, depending on the “static” option chosen.
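One way to picture the naming options is as a small lookup function. The sketch below is an interpretation, not the package's code. The function name database_name and the "_" separator are hypothetical. It also assumes that the bucket name is used when no domain name is given, and that "Prefix in hierarchy" selects the segment at the database prefix level.

```python
def database_name(option, bucket, domain=None, static="",
                  prefix_segments=(), prefix_level=0, separator="_"):
    """Return a database name for the chosen naming option."""
    domain_or_bucket = domain or bucket  # assumption: fall back to the bucket
    if option.startswith("Static") and not static:
        raise ValueError("a static string is required for the static options")
    if option == "Bucket Name":
        return bucket
    if option == "Domain Name":
        return domain_or_bucket
    if option == "Static":
        return static
    if option == "Static + Bucket":
        return f"{static}{separator}{bucket}"  # separator is an assumption
    if option == "Static + Domain":
        return f"{static}{separator}{domain_or_bucket}"
    if option == "Prefix in hierarchy":
        hierarchy = [bucket, *prefix_segments]  # level 0 is the bucket
        if not 0 <= prefix_level < len(hierarchy):
            raise ValueError("prefix level is outside the hierarchy")
        return hierarchy[prefix_level]
    raise ValueError(f"unknown option {option!r}")


print(database_name("Static + Domain", "autodesk-test",
                    domain="atlan-cs", static="raw"))
```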

Database prefix level¶

A numeric value that defines the level in the hierarchy at which the “database” is defined. Defaults to “0”, which corresponds to the bucket.

Schema prefix level¶

A numeric value that defines the level in the hierarchy at which the “schema” is defined. Defaults to “1”, meaning that the schema is defined directly under the bucket. Must be at least Database prefix level +1.

Table prefix level¶

A numeric value that defines the level in the hierarchy at which the “tables” are defined. Defaults to “2”. Must be at least Schema prefix level +1. Setting this to Schema prefix level +2 or more allows intermediate levels to exist in the prefix/folder structure that are not part of the data model hierarchy (such as a date folder). In this case, only one set of intermediate folders should be traversed. Additionally, in this case, the table prefix level may be an object rather than a directory.
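The constraints between the three prefix levels can be checked up front. The sketch below validates the levels and resolves one object key against them. The function names and the example key are hypothetical, and level 0 is taken to be the bucket, as stated for the database prefix level.

```python
def validate_levels(database_level=0, schema_level=1, table_level=2):
    """Check the ordering constraints between the three prefix levels."""
    if database_level < 0:
        raise ValueError("database prefix level must be 0 or greater")
    if schema_level < database_level + 1:
        raise ValueError("schema level must be at least database level + 1")
    if table_level < schema_level + 1:
        raise ValueError("table level must be at least schema level + 1")
    return database_level, schema_level, table_level


def resolve(bucket, key, levels):
    """Return (database, schema, table) for a key, or None if it is too shallow."""
    hierarchy = [bucket] + [p for p in key.split("/") if p]  # level 0 is the bucket
    database_level, schema_level, table_level = levels
    if len(hierarchy) <= table_level:
        return None
    return (hierarchy[database_level],
            hierarchy[schema_level],
            hierarchy[table_level])


# Table level 3 skips one intermediate folder ("2024"); the table is an object here.
levels = validate_levels(0, 1, 3)
print(resolve("atlan-csa-raw", "analytics-datamodels/2024/customer.csv", levels))
```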

Add domain as custom metadata¶

Boolean input (defaults to False) that indicates whether the name of the domain should be added to all assets created by the package (database, schema, table).

Domain Custom Metadata Set Name¶

The name of the custom metadata set that contains the property to which the domain value should be written. The value is determined by the bucket under which the assets were crawled. Required if “Add domain as custom metadata” is True.

Domain Custom Metadata Property Name¶

The name of the property in the domain custom metadata set to which the domain value should be written. The value is determined by the bucket under which the assets were crawled. Required if “Add domain as custom metadata” is True.

Partition levels¶

A numeric value that defines the maximum number of nested partition levels that could exist, and therefore how many levels beyond the table level the package must search in the folder hierarchy.

Partition Custom Metadata Set Name¶

The name of the custom metadata set to store the partition column name(s) from any partitions in the levels defined.

Partition Custom Metadata Property Name¶

The name of the property in the custom metadata set to store the partition column name(s) from any partitions in the levels defined.

Count Custom Metadata Set Name¶

The name of the custom metadata set to store the count of files associated with a table.

Count Custom Metadata Property Name¶

The name of the property in the custom metadata set to store the count of files associated with a table.
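To make the partition and count settings concrete, the sketch below derives both values for each table from a list of object keys. It is a rough illustration under stated assumptions: the function name summarize_tables and the sample keys are hypothetical, keys ending in "/" are treated as folder placeholders and not counted, and partition columns are collected only within the configured number of partition levels.

```python
def summarize_tables(bucket, keys, table_level=2, partition_levels=2):
    """Collect partition column names and a file count for each table."""
    summary = {}
    for key in keys:
        if key.endswith("/"):
            continue  # assumption: folder placeholder objects are not files
        hierarchy = [bucket] + [p for p in key.split("/") if p]
        if len(hierarchy) <= table_level:
            continue  # the key does not reach the table level
        table = "/".join(hierarchy[: table_level + 1])
        info = summary.setdefault(
            table, {"partition_columns": [], "file_count": 0})
        info["file_count"] += 1
        start = table_level + 1
        for segment in hierarchy[start : start + partition_levels]:
            column, sep, _ = segment.partition("=")
            if not sep or not column:
                break  # not in partition_column=partition_value format
            if column not in info["partition_columns"]:
                info["partition_columns"].append(column)
    return summary


# Hypothetical keys following the example structure at the top of this page.
keys = [
    "analytics-datamodels/customer/dt=20240710/hr=01/a.parquet",
    "analytics-datamodels/customer/dt=20240710/hr=02/b.parquet",
    "analytics-datamodels/orders/c.parquet",
]
summary = summarize_tables("atlan-csa-raw", keys)
for table, info in summary.items():
    print(table, info)
```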

What it does¶

The package performs the following steps:

- Locates the latest inventory file in the specified location.
- Processes only the buckets specified in the uploaded bucket list. If a domain-name is given for a bucket-name, it is used.
- Includes all schemas found if no schemas-to-include list is specified, except any schema found in the schemas-to-exclude list.
- Processes the data in the inventory file and creates or updates databases, schemas, and tables as determined by the prefix levels specified.
- Associates all of the objects with the connection specified in Connection Name.
- Adds domain information to the database, schema, and table in the custom metadata property specified for the domain.
- Adds partition information to the table in the custom metadata property specified for partition information.
- Adds count information to the table in the custom metadata property specified for count information.
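As an illustration of locating the latest inventory file, the sketch below picks the newest manifest.json from a list of object keys. It relies on the timestamp folder (YYYY-MM-DDTHH-MMZ) that S3 Inventory writes above each manifest. Whether the package selects the file this way is an assumption, and the function name latest_manifest and the sample keys are hypothetical.

```python
from datetime import datetime


def latest_manifest(keys):
    """Return the key of the newest manifest.json, or None if there is none."""
    candidates = []
    for key in keys:
        parts = key.split("/")
        if len(parts) < 2 or parts[-1] != "manifest.json":
            continue
        try:
            # S3 Inventory names the parent folder like 2024-07-10T01-00Z.
            stamp = datetime.strptime(parts[-2], "%Y-%m-%dT%H-%MZ")
        except ValueError:
            continue  # not under a timestamp folder, so skip it
        candidates.append((stamp, key))
    if not candidates:
        return None
    return max(candidates)[1]


# Hypothetical listing of an inventory destination prefix.
keys = [
    "inventory/atlan-csa-raw/daily/2024-07-09T01-00Z/manifest.json",
    "inventory/atlan-csa-raw/daily/2024-07-10T01-00Z/manifest.json",
    "inventory/atlan-csa-raw/daily/2024-07-10T01-00Z/manifest.checksum",
    "inventory/atlan-csa-raw/daily/data/example.csv.gz",
]
latest = latest_manifest(keys)
print(latest)
```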