Azure Data Lake Storage crawler

The Azure Data Lake Storage crawler fetches assets from Azure Data Lake Storage and publishes them to Atlan for discovery. The assets crawled are:

  • Accounts
  • Containers
  • Objects

Configuration

Credentials

  • Azure Client ID: unique application (client) ID assigned to your app by Azure AD when the app was registered.
  • Azure Client Secret: client secret generated for the registered app.
  • Azure Tenant ID: unique identifier of the Azure Active Directory instance.
  • Storage Account Name List: comma-separated list of Azure storage account names.
  • Network configuration: whether to use the public endpoint or a private link to connect to the storage account.

    • Private Link(s): comma-separated list of private links. If there are multiple private links, they must be listed in the same order as the corresponding account names in the Storage Account Name List input parameter.

    Warning

    Please contact the Atlan support team to set up the private link and private DNS if required.
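
The ordering rule for private links can be illustrated with a short sketch. It is not the package's implementation: the function names and sample values are hypothetical, and it assumes exactly one private link per storage account, which the text above implies but does not state outright.

```python
def split_csv(value: str) -> list[str]:
    """Split a comma-separated input into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def pair_accounts_with_links(account_names: str, private_links: str) -> dict[str, str]:
    """Match each storage account to the private link at the same position.

    Assumption: one private link per account, listed in the same order.
    """
    accounts = split_csv(account_names)
    links = split_csv(private_links)
    if len(accounts) != len(links):
        raise ValueError(
            f"{len(accounts)} account name(s) but {len(links)} private link(s)"
        )
    return dict(zip(accounts, links))


# Hypothetical values: the first link belongs to the first account, and so on.
pairs = pair_accounts_with_links("salesdata, hrdata", "link-sales, link-hr")
```

Swapping the two links in the second argument would silently pair each account with the wrong link, which is why the order matters.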

Permissions

The following permissions have to be granted to the Service Principal in order for the package to work correctly:

Metadata

  • Container prefix: publish to Atlan only the containers whose names start with the prefix specified in this parameter. Leave empty to include all containers.
  • Object prefix: publish to Atlan only the objects whose names start with the prefix specified in this parameter. Leave empty to include all objects.
  • Object regex: regular expression used to match the object name. Leave empty to include all objects.
  • Object create date (after): select the ADLS object creation date from which ingestion should start. ADLS objects created before this date are not ingested into Atlan. Leave as default to apply no date filter.
  • Object create date (before): select the ADLS object creation date at which ingestion should stop. ADLS objects created after this date are not ingested into Atlan. Leave as default to apply no date filter.
  • Object update date (after): select the ADLS object update date from which ingestion should start. ADLS objects last updated before this date are not ingested into Atlan. Leave as default to apply no date filter.
  • Object update date (before): select the ADLS object update date at which ingestion should stop. ADLS objects last updated after this date are not ingested into Atlan. Leave as default to apply no date filter.
  • Include folders: select Yes to publish folders as assets.

Warning

All filters except Container prefix and Object prefix are applied after the objects have been extracted from the storage account. The service principal must therefore have access to all objects, including those that are filtered out before publishing to Atlan.
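
As an illustration of filters that run after extraction, the sketch below applies the regex and date filters to objects that have already been listed. The class and function names are hypothetical, and two details are assumptions rather than documented behavior: the regex is matched with `re.search` (a match anywhere in the name), and the date bounds are treated as inclusive.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AdlsObject:
    """Hypothetical stand-in for an object already extracted from ADLS."""
    name: str
    created: datetime
    updated: datetime


@dataclass
class ObjectFilters:
    """Unset (None) fields mean 'no filter', mirroring the defaults above."""
    object_regex: Optional[str] = None
    created_after: Optional[datetime] = None
    created_before: Optional[datetime] = None
    updated_after: Optional[datetime] = None
    updated_before: Optional[datetime] = None


def should_publish(obj: AdlsObject, f: ObjectFilters) -> bool:
    """Return True if the extracted object passes every configured filter."""
    if f.object_regex and not re.search(f.object_regex, obj.name):
        return False
    if f.created_after and obj.created < f.created_after:
        return False
    if f.created_before and obj.created > f.created_before:
        return False
    if f.updated_after and obj.updated < f.updated_after:
        return False
    if f.updated_before and obj.updated > f.updated_before:
        return False
    return True


# Every object must be readable first; filtering happens only afterwards.
extracted = [
    AdlsObject("raw/a.csv", datetime(2023, 1, 1), datetime(2023, 6, 1)),
    AdlsObject("raw/b.json", datetime(2024, 1, 1), datetime(2024, 2, 1)),
]
filters = ObjectFilters(object_regex=r"\.csv$", created_after=datetime(2022, 1, 1))
to_publish = [o.name for o in extracted if should_publish(o, filters)]
```

Because `should_publish` only sees objects that were already listed, restricting the service principal's access to the filtered subset would cause the extraction itself to fail or return incomplete results.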

Configurations

  • Connection: name of the connection that will be created in Atlan.

Warning

The connection name must be unique across all Azure Data Lake Storage connections.

What it does

The package performs the following steps:

  • Create a connection in Atlan. If the connection already exists, this step is skipped.
  • Fetch the list of containers in each storage account.
  • For each container, fetch the list of objects.
  • Publish the containers and objects to Atlan.

Warning

Containers and objects that are deleted or archived in Azure Data Lake Storage are automatically archived in Atlan as well.
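
The crawl and archive behavior described above can be sketched roughly as follows. This is a simplified illustration, not the package's code: the function `crawl`, its callable parameters, and the in-memory `storage` dictionary are all hypothetical, connection creation is omitted, and detecting removed assets by comparing against the previously published set is an assumption.

```python
from typing import Callable, Iterable


def crawl(
    list_containers: Callable[[], Iterable[str]],
    list_objects: Callable[[str], Iterable[str]],
    previously_published: set[str],
) -> tuple[set[str], set[str]]:
    """Return (assets to publish, assets to archive) for one storage account."""
    current: set[str] = set()
    for container in list_containers():
        current.add(container)
        for object_name in list_objects(container):
            current.add(f"{container}/{object_name}")
    # Assumption: anything published earlier but no longer present gets archived.
    to_archive = previously_published - current
    return current, to_archive


# In-memory stand-in for a storage account (hypothetical data).
storage = {"landing": ["2024/a.csv"], "curated": []}
to_publish, to_archive = crawl(
    lambda: storage.keys(),
    lambda container: storage[container],
    previously_published={"landing", "landing/2023/old.csv"},
)
```

Here `landing/2023/old.csv` was published by an earlier run but no longer exists in the account, so it ends up in `to_archive`.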