Google Cloud Storage crawler

The Google Cloud Storage crawler fetches assets from Google Cloud Storage and publishes them to Atlan for discovery. The assets crawled are:

  • Buckets
  • Objects

Configuration

Credentials

  • Project ID: ID of the Google Cloud project that contains the buckets.
  • Service Account JSON key: follow this article to create a service account JSON key. The role assigned to the Service Account must be granted the following permissions: storage.buckets.list and storage.objects.list.

Metadata

  • Bucket prefix: publish to Atlan only the buckets whose names start with the prefix specified in this parameter. Leave empty to include all buckets.
  • Object prefix: publish to Atlan only the objects whose names start with the prefix specified in this parameter. Leave empty to include all objects.
  • Object delimiter (applicable only if the inventory report is NOT selected): used together with the object prefix to list only the blobs directly inside a "folder", e.g. "public/". With a delimiter, the results are restricted to the "files" in the given "folder"; without it, the entire tree under the prefix is returned. For example, given these blobs:

    • a/1.txt
    • a/b/2.txt

    If prefix='a/' and no delimiter is set, the following blobs will be published to Atlan:

    • a/1.txt
    • a/b/2.txt

    However, if prefix='a/' and delimiter='/', only the file directly under 'a/' will be published to Atlan:

    • a/1.txt
  • Bucket exclusion list: comma-separated list of buckets to exclude.

  • Use inventory report:

    • Inventory bucket name: bucket where the inventory is stored.
    • Inventory prefix: prefix within the inventory bucket where the inventory is located.
    • Inventory file format: file format used to generate the inventory report, CSV or Parquet.

    Warning

    The role assigned to the Service Account must be granted the storage.buckets.list and storage.objects.list permissions, and the Service Account must also be granted the roles/storage.objectViewer role.

  • Build abstraction layer: whether to build an abstraction layer on top of files. Default: No.

  • Publish as-is patterns: comma-separated list of patterns to be published as-is (without the abstraction layer). Applicable only if Build abstraction layer = Yes.
  • Regex to match characters to replace: regular expression that matches the characters to replace. It acts on the full file name (without the bucket prefix).
  • Regex with replacement characters: regular expression with the replacement characters. It acts on the full file name (without the bucket prefix).
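
The prefix, delimiter, and exclusion-list filters described above can be sketched in plain Python. This is a minimal illustration, not the crawler's actual code: the function names `select_objects` and `select_buckets` are hypothetical, and combining the bucket prefix with the exclusion list in a single pass is an assumption.

```python
def select_objects(names, prefix="", delimiter=None):
    """Return the object names that a prefix/delimiter filter would keep."""
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # With a delimiter, anything nested below the prefix counts as a
        # sub-"folder" and is skipped.
        if delimiter and delimiter in remainder:
            continue
        selected.append(name)
    return selected


def select_buckets(names, prefix="", excluded=""):
    """Keep buckets that match the prefix and are not in the comma-separated exclusion list."""
    skip = {b.strip() for b in excluded.split(",") if b.strip()}
    return [n for n in names if n.startswith(prefix) and n not in skip]


blobs = ["a/1.txt", "a/b/2.txt"]
print(select_objects(blobs, prefix="a/"))                 # ['a/1.txt', 'a/b/2.txt']
print(select_objects(blobs, prefix="a/", delimiter="/"))  # ['a/1.txt']
```

The two printed lists match the example given for the Object delimiter parameter.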

Assets

  • Input handling: how to handle assets in the CSV file that do not exist in Atlan. The options are:

    • Create a full-fledged asset that can be discovered and maintained like other assets in Atlan.
    • Create a "partial" asset. These are only shown in lineage, and cannot be discovered through search. These are useful when you want to represent a placeholder for an asset that you lack full context about, but also do not want to ignore completely.
    • Only update assets that already exist in Atlan, and do not create any asset, of any kind.

    Does not apply to related READMEs and links

    READMEs and links in Atlan are technically separate assets — but these will still be created, even in Update only mode.

  • Delta handling: Whether to treat the input file as an initial load, full replacement (deleting any existing assets not in the file) or only incremental (no deletion of existing assets).

  • Remove attributes: How to delete any assets not found in the latest file.
  • Reload which assets: Which assets to reload from the latest input CSV file. The "Changed assets only" option calculates which assets have changed between the files and attempts to reload only those.
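
To make the delta options above more concrete, the following sketch compares two versions of an input CSV file and reports which assets changed and which are missing from the latest file. It is only an illustration of the idea: the column name `qualifiedName`, the helper names, and the sample rows are all hypothetical, and the package may compute changes differently.

```python
import csv
import io


def load_rows(csv_text, key="qualifiedName"):
    """Index CSV rows by a key column (the column name is hypothetical)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_assets(previous, latest):
    """Return (changed or new keys, keys missing from the latest file)."""
    changed = [k for k, row in latest.items() if previous.get(k) != row]
    removed = [k for k in previous if k not in latest]
    return changed, removed


previous = load_rows(
    "qualifiedName,description\nb1/a.csv,old\nb1/b.csv,same\nb1/c.csv,gone\n"
)
latest = load_rows(
    "qualifiedName,description\nb1/a.csv,new\nb1/b.csv,same\nb1/d.csv,added\n"
)
changed, removed = diff_assets(previous, latest)
print(changed)  # ['b1/a.csv', 'b1/d.csv']
print(removed)  # ['b1/c.csv']
```

In this sketch, a "changed assets only" reload would touch only the keys in `changed`, while a full replacement would also act on the keys in `removed`.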

Configurations

  • Connection: name of the connection that will be created in Atlan.

Warning

The connection name must be unique across all Google Cloud Storage connections.

What it does

The package performs the following steps:

  • Create a connection in Atlan. If the connection already exists, this step is skipped.
  • Fetch the list of buckets in the Google Cloud project (according to the prefix defined in the config section).
  • For each bucket, fetch the list of objects (according to the prefix defined in the config section).
  • Compute the abstraction and add the file names as the object description (if Build abstraction layer = Yes).
  • Publish buckets and objects into Atlan.
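
The abstraction step can be pictured as a regular-expression substitution on each full file name, followed by grouping the files that end up with the same name. The sketch below is an assumption about how the two regex parameters might combine, not the package's implementation: `abstract_names`, the date pattern, and the `{date}` replacement text are hypothetical.

```python
import re
from collections import defaultdict


def abstract_names(object_names, match_regex, replacement):
    """Group object names by the name obtained after the regex substitution."""
    groups = defaultdict(list)
    for name in object_names:
        groups[re.sub(match_regex, replacement, name)].append(name)
    return dict(groups)


files = ["sales/2024-01-01.csv", "sales/2024-01-02.csv", "README.md"]
groups = abstract_names(files, r"\d{4}-\d{2}-\d{2}", "{date}")
for abstract_name, members in groups.items():
    # The member file names play the role of the object description.
    print(abstract_name, "<-", ", ".join(members))
```

Here the two dated files collapse into a single abstracted name, `sales/{date}.csv`, whose description would list both original file names, while `README.md` is left unchanged.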

Warning

Buckets and Objects deleted/archived in Google Cloud Storage are automatically archived in Atlan as well.