Google Cloud Storage crawler

The Google Cloud Storage crawler fetches assets from Google Cloud Storage and publishes them to Atlan for discovery. The assets crawled are:

  • Buckets
  • Objects

Configuration

Credentials

  • Project id: ID of the Google Cloud project that contains the buckets.
  • Service Account JSON key: follow this article to create a service account JSON key. The role assigned to the service account must be granted the following permissions: storage.buckets.list and storage.objects.list.
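
As a quick pre-flight check, the two permissions listed above can be compared against the permissions of the role you plan to assign. The sketch below is only illustrative: the function name `missing_permissions` and the way the granted permissions are obtained (here, a plain list you supply) are assumptions, not part of the crawler.

```python
# Permissions the crawler needs, as stated in the Credentials section.
REQUIRED_PERMISSIONS = {"storage.buckets.list", "storage.objects.list"}


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name.

    `granted` is any iterable of permission strings, for example copied
    from the role definition in the Google Cloud console (hypothetical input).
    """
    return sorted(REQUIRED_PERMISSIONS - set(granted))


if __name__ == "__main__":
    role_permissions = ["storage.objects.list", "storage.objects.get"]
    print(missing_permissions(role_permissions))
```

An empty result means the role covers both permissions the crawler requires.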

Metadata

  • Bucket prefix: publish to Atlan only the buckets whose names start with this prefix. Leave empty to include all buckets.
  • Object prefix: publish to Atlan only the objects whose names start with this prefix. Leave empty to include all objects.
  • Object delimiter: use together with the object prefix to list only the blobs directly inside a "folder", e.g. "public/". With a delimiter, the results are restricted to the "files" in the given "folder". Without a delimiter, the entire tree under the prefix is returned. For example, given these blobs:

    • a/1.txt
    • a/b/2.txt

    If prefix='a/', without a delimiter, the following blobs will be published to Atlan:

    • a/1.txt
    • a/b/2.txt

    However, if prefix='a/' and delimiter='/', only the file directly under 'a/' will be published to Atlan:

    • a/1.txt
  • Bucket exclusion list: list of buckets (comma separated) to be excluded.

  • Build abstraction layer: whether to build an abstraction layer on top of files. Default: No.
  • Publish as-is patterns: comma-separated list of patterns to be published as-is (without the abstraction layer). Applicable only if Build abstraction layer = Yes.
  • Regex to match characters to replace: regular expression that matches the characters to replace. It acts on the full file name (without the bucket prefix).
  • Regex with replacement characters: regular expression with the replacement characters substituted for the matched ones. It acts on the full file name (without the bucket prefix).
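
The interaction between the object prefix and the object delimiter can be sketched in a few lines of plain Python. This is a local simulation of the selection rule described above, not the crawler's code and not a call to Google Cloud Storage; the function name `select_objects` is an assumption made for illustration.

```python
def select_objects(names, prefix="", delimiter=""):
    """Return the object names that would be selected for a prefix/delimiter pair.

    An object is kept if its name starts with `prefix`. When `delimiter`
    is set, objects whose remaining path still contains the delimiter
    (i.e. objects in a deeper "folder") are dropped.
    """
    selected = []
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if delimiter and delimiter in remainder:
            continue  # lives below a nested "folder", not directly under the prefix
        selected.append(name)
    return selected


if __name__ == "__main__":
    blobs = ["a/1.txt", "a/b/2.txt"]
    print(select_objects(blobs, prefix="a/"))                 # ['a/1.txt', 'a/b/2.txt']
    print(select_objects(blobs, prefix="a/", delimiter="/"))  # ['a/1.txt']
```

The two calls reproduce the example from the Object delimiter parameter: the whole tree without a delimiter, and only the file directly under 'a/' with delimiter='/'.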

Configurations

  • Connection: name of the connection that will be created in Atlan.

Warning

The connection name must be unique across all Google Cloud Storage connections.

What it does

The package performs the following steps:

  • Create a connection in Atlan. If the connection already exists, the step is skipped.
  • Fetch the list of buckets in the Google Cloud project (filtered by the bucket prefix defined in the configuration).
  • For each bucket, fetch the list of objects (filtered by the object prefix defined in the configuration).
  • Compute abstraction (if Build abstraction layer = Yes).
  • Publish buckets and objects into Atlan.
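
The steps above can be sketched as a small in-memory simulation. Everything here is hypothetical: the function names, the dictionary standing in for a Google Cloud project, and in particular the abstraction step, which is assumed to collapse object names by regex substitution and de-duplicate the results. The package's actual abstraction logic is not documented in this page, and connection creation and publishing to Atlan are omitted.

```python
import re


def select_buckets(buckets, bucket_prefix="", excluded=()):
    """Keep buckets matching the prefix and not in the exclusion list."""
    return [b for b in buckets if b.startswith(bucket_prefix) and b not in excluded]


def crawl(project, bucket_prefix="", object_prefix="", excluded=(),
          match_regex=None, replacement=""):
    """Simulate the fetch and abstraction steps on a dict of {bucket: [objects]}."""
    result = {}
    for bucket in select_buckets(sorted(project), bucket_prefix, excluded):
        objects = [o for o in project[bucket] if o.startswith(object_prefix)]
        if match_regex is not None:
            # Assumed abstraction: rewrite names, then drop duplicates (order kept).
            rewritten = (re.sub(match_regex, replacement, o) for o in objects)
            objects = list(dict.fromkeys(rewritten))
        result[bucket] = objects
    return result


if __name__ == "__main__":
    project = {
        "logs-prod": ["events/2023-01-01.csv", "events/2023-01-02.csv"],
        "logs-dev": ["events/2023-01-01.csv"],
        "tmp": ["x.txt"],
    }
    print(crawl(project, bucket_prefix="logs-", excluded=("logs-dev",),
                match_regex=r"\d{4}-\d{2}-\d{2}", replacement="{date}"))
    # {'logs-prod': ['events/{date}.csv']}
```

In this sketch the two daily files collapse into a single abstracted entry, which is one plausible reading of what an abstraction layer on top of files provides.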

Warning

Buckets and Objects deleted/archived in Google Cloud Storage are automatically archived in Atlan as well.