S3 Crawler¶
The S3 Crawler package catalogs S3 buckets in Atlan, ingesting all the S3 objects residing in the buckets or, optionally, only those that match the chosen filters.
Warning
The package only catalogs S3 buckets and objects. It does not address lineage.
The assets crawled are:
- Buckets
- Objects
The relationship between the crawled assets can be understood from the developer portal reference.
Configuration¶
Connection¶
-
Connection Name
: Name of the connection that will be created in Atlan to associate it with the catalog. The connection name must be unique across all S3 connections.
-
Authentication
: Select one of the authentication methods to be used to access the S3 buckets. These credentials are used for whichever ingestion method is selected in the next step.
Provide the AWS Access Key and Secret Key for an IAM user that has access to the S3 bucket. The policy below illustrates the access needed.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessAllBucketsListingOnly",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets"],
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "AllowAccessBucketsAndObjects",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetEncryptionConfiguration",
        "s3:GetBucketVersioning"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket>",
        "arn:aws:s3:::<s3_bucket>/*"
      ]
    }
  ]
}
Role-based authentication allows for role delegation. To configure:
-
Raise a support ticket to get the ARN of the Node Instance Role for your Atlan EKS cluster.
-
Create a new policy in your AWS account with the following permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessAllBucketsListingOnly",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets"],
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "AllowAccessBucketsAndObjects",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetEncryptionConfiguration",
        "s3:GetBucketVersioning"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket>",
        "arn:aws:s3:::<s3_bucket>/*"
      ]
    }
  ]
}
-
Create a new role in your AWS account by following the steps in the AWS Identity and Access Management User Guide.
-
When prompted for policies, attach the policy created earlier to this role.
-
When prompted, create a trust relationship for the role using the following trust policy. (Replace <atlan_nodeinstance_role_arn> with the ARN received from Atlan support.)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<atlan_nodeinstance_role_arn>" },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
-
Now, reach out to Atlan support with:
- The name of the role you created above.
- The ID of the AWS account where the role was created.
Warning
Wait until the support team confirms the account is allowlisted to assume the role before setting up the workflow.
-
Region
: Region of the buckets to be accessed. This region is used to create the View in S3 button for the catalog assets in Atlan. The region is also used for whichever ingestion method is selected in the next step.
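The policy and trust relationship shown under Authentication can also be generated programmatically when several buckets need to be covered. The following is a minimal sketch; the helper names `build_crawler_policy` and `build_trust_policy` are hypothetical and not part of any Atlan or AWS tooling, and the statements simply mirror the JSON documents shown above.

```python
import json
from typing import Any, Dict, List


def build_crawler_policy(bucket_names: List[str]) -> Dict[str, Any]:
    """Build the read-only policy shown above for one or more buckets.

    Hypothetical helper: it only reproduces the documented statements.
    """
    resources: List[str] = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # bucket-level actions
        resources.append(f"arn:aws:s3:::{name}/*")  # object-level actions
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAccessAllBucketsListingOnly",
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": ["arn:aws:s3:::*"],
            },
            {
                "Sid": "AllowAccessBucketsAndObjects",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetEncryptionConfiguration",
                    "s3:GetBucketVersioning",
                ],
                "Resource": resources,
            },
        ],
    }


def build_trust_policy(atlan_role_arn: str) -> Dict[str, Any]:
    """Build the trust policy for role delegation.

    `atlan_role_arn` is the Node Instance Role ARN received from Atlan support.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": atlan_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {},
            }
        ],
    }


def to_json(document: Dict[str, Any]) -> str:
    """Serialize a policy document for pasting into the IAM console."""
    return json.dumps(document, indent=2)
```

Calling `to_json(build_crawler_policy(["my-bucket"]))` yields a document that can be pasted into the IAM policy editor; `my-bucket` is a placeholder name.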
Ingestion Method¶
-
Ingestion Method
: Select the ingestion method to be used for cataloging S3 assets.
Recommendation
- The method supports multi-bucket ingestion and thus has to perform a getAllBucket operation. If your goal is to ingest only one bucket (optionally, a specific prefix from that one bucket), use the filters in the next step.
- We recommend using the inventory method for ingestion if you anticipate the total object count (across all buckets) to be very large (>= 1,000,000).
The direct method fetches all the buckets available to the provided credentials (user or role), iterates over each bucket, and fetches all the objects inside each prefix. It uses boto3, the AWS SDK for Python, for the raw extraction.
Note
For the inventory method, set up an S3 Inventory report for the buckets that are to be cataloged in Atlan. To set up the inventory report for a bucket, follow the AWS documentation on configuring inventory reports. Some important points to keep in mind while configuring:
- Select all metadata fields to be present in the report (extremely important for the workflow to ingest correct metadata).
- If you set up an optional prefix for the reports in the bucket, remember this prefix for the workflow configuration later.
Warning
- Only CSV and Apache Parquet file formats are supported by the workflow.
- The region for the inventory report bucket is picked from the region mentioned in the step above.
-
S3 Bucket Name
: Name of the bucket that holds these inventory reports (without the s3:// prefix).
- (Optional)
S3 Bucket Prefix
: Provide a prefix if one was configured while setting up the inventory report; otherwise, leave this input empty. When providing the prefix, append a trailing / to it (e.g.: prefix-name/).
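The direct extraction described above (list every visible bucket, then list the objects in each) can be sketched with boto3 as follows. This is an illustration under stated assumptions, not the package's actual implementation: the function names are hypothetical, and the client is passed in as an argument so the sketch does not depend on how credentials are configured. It relies only on the boto3 S3 client calls `list_buckets` and the `list_objects_v2` paginator.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_bucket_objects(
    s3_client: Any, bucket: str, prefix: str = ""
) -> Iterator[Dict[str, Any]]:
    """Yield object summaries from one bucket, following pagination.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching keys.
        for obj in page.get("Contents", []):
            yield obj


def crawl_all_buckets(s3_client: Any) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (bucket_name, object_summary) for every bucket the caller can see."""
    # list_buckets needs the s3:ListAllMyBuckets permission from the policy.
    for bucket in s3_client.list_buckets().get("Buckets", []):
        name = bucket["Name"]
        for obj in iter_bucket_objects(s3_client, name):
            yield name, obj


# Example wiring (requires boto3 and valid credentials; region is a placeholder):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   for bucket_name, obj in crawl_all_buckets(s3):
#       print(bucket_name, obj["Key"], obj["Size"])
```

For the single-bucket case, calling `iter_bucket_objects` directly with a bucket name and prefix avoids the `list_buckets` call altogether.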
Filters¶
Single Bucket and Single Prefix (Optional)
If your goal with the workflow is to catalog a single bucket into the connection, use the Include Bucket filter to specify the bucket name.
Include Bucket
: Instead of a regular expression, provide the bucket name, e.g.: bucket-name
- (Optional)
Include Prefix
: Within that single bucket, if you want to ingest objects from a specific prefix, mention that prefix in this filter, e.g.: folder-1/folder-2/
This method is only supported for the direct ingestion method.
The filters below also work with the inventory method for single-bucket and single-prefix ingestion. The statement "This method is only supported for the direct ingestion method" means that, with the direct method, the s3:ListAllMyBuckets permission is not needed if the policy grants access to a specific prefix of a specific bucket. In that case, use the filters as defined above.
All filters use regular expression patterns to identify matching criteria.
- (Optional)
Include Bucket
: Buckets to be included in the catalog, e.g.: bucket-name-1 | bucket-name-2
- (Optional)
Exclude Bucket
: Buckets to be excluded from the catalog, e.g.: bucket-name-3 | bucket-name-4
- (Optional)
Include Prefix
: Prefixes to be included in each bucket. This filter is applied to objects inside each bucket, e.g.: folder-1/folder-2/ | folder-1/folder-3/
- (Optional)
Exclude Prefix
: Prefixes to be excluded from each bucket. This filter is applied to objects inside each bucket, e.g.: folder-1/folder-4/ | folder-1/folder-5/
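To make the include/exclude semantics concrete, the sketch below applies a pair of regular expressions to a bucket name or object key. How the package actually evaluates the patterns is not stated here, so two assumptions are built in and labeled in the code: patterns are matched from the start of the name (`re.match`), and an exclude match takes precedence over an include match. The function name `passes_filters` is hypothetical. The example patterns are written without spaces around `|`, because whitespace is literal in Python regular expressions.

```python
import re
from typing import Optional


def passes_filters(
    name: str, include: Optional[str] = None, exclude: Optional[str] = None
) -> bool:
    """Return True if `name` should be cataloged under the given filters.

    Assumptions (not confirmed by the documentation):
    - patterns are anchored at the start of the name via re.match;
    - exclude wins when both patterns match.
    """
    if exclude and re.match(exclude, name):
        return False
    if include and not re.match(include, name):
        return False
    # No include filter means everything not excluded is kept.
    return True


# Bucket-level filtering
selected = passes_filters("bucket-name-1", include="bucket-name-1|bucket-name-2")

# Prefix-level filtering on an object key
skipped = not passes_filters(
    "folder-1/folder-4/report.csv",
    include="folder-1/",
    exclude="folder-1/folder-4/|folder-1/folder-5/",
)
```

Under these assumptions a pattern such as `bucket-name-1` also matches `bucket-name-10`; append `$` to the pattern if an exact name is intended.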
What it does¶
The package performs the following steps:
- Gathers basic information on the buckets, including versioning and encryption details, based on the filters configured.
- Retrieves a list of objects in the buckets, along with their associated attributes, based on the filters.
- Creates a new connection upon the first run and ingests the identified buckets and objects along with their source URLs.
- For subsequent runs, compares the object listing derived from the buckets against the asset catalog in Atlan, then adds, updates, or removes assets as needed to address the delta.
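The delta step in the last bullet can be pictured as a comparison of two keyed snapshots. The sketch below is an illustration only: the `Delta` type, the `compute_delta` function, and the use of a per-object fingerprint (for example an ETag or last-modified value) to detect changes are all assumptions, since the comparison criteria used by the package are not described here.

```python
from typing import Dict, List, NamedTuple


class Delta(NamedTuple):
    to_add: List[str]
    to_update: List[str]
    to_remove: List[str]


def compute_delta(source: Dict[str, str], catalog: Dict[str, str]) -> Delta:
    """Compare the current S3 listing with what is already cataloged.

    Both arguments map an asset identifier (e.g. "bucket/key") to a
    fingerprint of its metadata. The fingerprint choice is hypothetical.
    """
    to_add = sorted(k for k in source if k not in catalog)
    to_remove = sorted(k for k in catalog if k not in source)
    to_update = sorted(
        k for k in source if k in catalog and source[k] != catalog[k]
    )
    return Delta(to_add, to_update, to_remove)


current_listing = {"b/a.csv": "v2", "b/b.csv": "v1", "b/new.csv": "v1"}
existing_catalog = {"b/a.csv": "v1", "b/b.csv": "v1", "b/old.csv": "v1"}
delta = compute_delta(current_listing, existing_catalog)
```

With the sample data, `delta.to_add` holds the new object, `delta.to_update` the object whose fingerprint changed, and `delta.to_remove` the object that no longer exists at the source.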
Warning
The View in S3 button, created for each object and bucket, depends on the region specified in the configuration step. In case of a region mismatch in the button URL (which happens when the source buckets and the inventory report bucket are in different regions), S3 handles the redirect. These URLs are private and require users to:
- be logged into the AWS Management Console.
- have permissions (s3:ListBucket, s3:GetObject) for the bucket or object as per IAM policies.