
Duplicate detector

Source code

The duplicate detector package identifies assets (tables, views, etc.) that are potential duplicates by comparing the sets of columns they contain, ignoring both case and column order.

Configuration

  • Qualified name prefix: starting value of a qualifiedName; only assets whose qualifiedName begins with this value are checked for duplicates.
  • Asset types: comma-separated list of asset types in scope for duplicate detection, default: Table,View,MaterialisedView.
  • Batch size: maximum number of results to process at a time (per API request), default: 50.
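As an illustration only, the three options and their documented defaults could be modelled as below. The class name, field names, helper function and the example prefix are hypothetical; they are not the package's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class DuplicateDetectorConfig:
    """Hypothetical container for the three options described above."""

    # Only assets whose qualifiedName starts with this value are in scope.
    qualified_name_prefix: str
    # Documented default: Table,View,MaterialisedView
    asset_types: list = field(
        default_factory=lambda: ["Table", "View", "MaterialisedView"]
    )
    # Documented default: 50 results per API request
    batch_size: int = 50


def parse_asset_types(raw: str) -> list:
    """Split a comma-separated list of asset types, dropping blanks."""
    return [t.strip() for t in raw.split(",") if t.strip()]


# Example prefix is made up for illustration.
config = DuplicateDetectorConfig(qualified_name_prefix="default/snowflake")
```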

What it does

Glossary

A single glossary named Duplicate assets will be created (or updated, if it already exists).

Terms

One term will be created (or updated, if it already exists) for each set of potentially duplicate assets, to link those assets together. The name of the term will be Dup. (00000000), where 00000000 is the hash calculated from the column names shared by those assets.

Relationships

A relationship will be created between each asset that has potential duplicates and the term for its set of duplicates.

How it works

The package runs a search for all assets of the specified types whose qualifiedName starts with the configured prefix. Then, for each result:

  • Retrieve the set of columns in that asset
  • Normalize the name of each column (ignore case, remove underscores)
  • Ignore ordering of the columns within the asset
  • Calculate a unique numeric hash for the set of normalized, unordered columns

Then compare these hashes between assets, to look for identical hashes (indicating an identical set of normalized, unordered columns).
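The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the function names are hypothetical, the input is a plain dictionary rather than search results, and the choice of SHA-256 truncated to 32 bits is an assumption (the package's actual hash function is not described here).

```python
import hashlib
from collections import defaultdict


def normalize(name: str) -> str:
    """Ignore case and underscores in a column name."""
    return name.replace("_", "").casefold()


def column_set_hash(column_names) -> int:
    """Numeric hash of the normalized, unordered set of column names.

    Assumption: SHA-256 truncated to 32 bits, chosen only for illustration.
    """
    # Sorting the set makes the result independent of column order.
    normalized = sorted({normalize(c) for c in column_names})
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def find_potential_duplicates(assets: dict) -> dict:
    """Group asset names by hash and keep groups with more than one asset.

    `assets` maps an asset name to its list of column names.
    """
    groups = defaultdict(list)
    for asset_name, columns in assets.items():
        groups[column_set_hash(columns)].append(asset_name)
    return {h: names for h, names in groups.items() if len(names) > 1}


# Made-up example data.
assets = {
    "db.ORDERS": ["Order_ID", "amount", "CUSTOMER_ID"],
    "db.orders_copy": ["customerid", "ORDERID", "Amount"],
    "db.users": ["user_id", "email"],
}
duplicates = find_potential_duplicates(assets)
```

Here `db.ORDERS` and `db.orders_copy` end up in the same group because their column names match once case, underscores and ordering are ignored, while `db.users` is not reported. Because distinct column sets can in principle collide on the same hash, matches are best treated as potential duplicates to review.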