Skip to content

Databricks Cross Workspace Miner

The databricks cross workspace miner custom package calculates usage and popularity metrics across different Databricks Workspaces. This package is an extension to the Databricks Miner package available in marketplace packages.

Pre-requisites

Grant Permissions

  1. Grant access to system tables for popularity calculation:

    This custom package supports extracting usage and popularity for your Databricks assets using system tables. You must have a Unity Catalog-enabled workspace to use system tables.

  2. Enable a system schema:

  3. You will need your account admin to enable the system.query schema using the SystemSchemas API. This will allow Atlan to extract query history using system tables.
  4. To enable a system schema, refer to Databricks documentation:
  5. Replace {schema_name} with access.
  6. To ensure that system schemas is enabled for each schema, follow the steps in Databricks documentation. If enabled for any given schema, the state will be EnableCompleted.

  7. Grant permissions:

Atlan supports extracting Databricks usage and popularity using system tables for three authentication methods including Personal access token authentication, AWS service principal authentication and Azure service principal authentication. You will need to grant the following permissions:

  1. CAN_USE on a SQL warehouse
  2. USE SCHEMA on system.query
  3. SELECT on system.query.history
  1. Locate warehouse ID:

To extract usage and popularity using system tables, you will also need the warehouse ID of your SQL warehouse. To locate the warehouse ID:

  1. Log in to your Databricks workspace as a workspace admin.
  2. From the left menu of your workspace, click SQL Warehouses.
  3. On the Compute page, select the warehouse you want to use.
  4. From the Overview tab of your warehouse page, next to the Name of your warehouse, copy the value for your SQL warehouse ID. For example, example-warehouse (ID: 123ab4c5def67890), copy the value 123ab4c5def67890 and store it in a secure location.

  5. Disable Popularity on the Databricks Miner:

Disable the usage popularity calculation on the Databricks Miner configuration by setting Fetch Query History and Calculate Popularity to No.

Configuration

  1. Connection : For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)

  2. SQL Warehouse ID : Enter the Warehouse ID located in steps above.

  3. Popularity Window: For Popularity Window (days), 30 days is the maximum limit. You can set a shorter popularity window of less than 30 days.

  4. Start time: For Start time, choose the earliest date from which to mine query history.

Run the Databricks Cross Workspace Miner Custom package

To run the Databricks Cross Workspace Miner Custom package, after completing the steps above: 1. To check for any permissions or other configuration issues before running the package, click Preflight checks. 2. To run the package once immediately, at the bottom of the screen, click the Run button. 3. To schedule the package to run hourly, daily, weekly, or monthly, at the bottom of the screen, click the Schedule Run button.

Order of Operation

This custom package should be run after the Databricks Miner has completed its run.

What it does

The databricks cross workspace miner custom package connects to your database, queries the query.history system table for the defined period, and calculates popularity across different workspaces.