Data governance is intended to help you answer basic questions about your data – what data do I have? How much is there? Who uses it? Which assets are sensitive? How can it be used? Who makes these decisions?
One of the most valuable tools for finding these answers is a data catalog. But a data catalog is both more and less than that…
It’s more, because in addition to answering these questions, a catalog will provide insights into questions you might not even have considered, such as usage patterns, data quality, hidden or “dark” data, and asset valuation and monetization.
And it’s less, because even if you don’t have the bandwidth or organizational will to implement an enterprise-wide data governance program, implementing a data catalog can still be a huge asset for your data analysts and data scientists to accelerate their work and improve the reliability of their results.
Data Catalogs FTW
First let’s talk a little bit about what a data catalog is and does. In the simplest terms, it’s a centralized inventory of data where your data professionals can search for data, understand its uses and meaning, learn where it is located, find information that helps them to evaluate fitness for various purposes, and potentially gain access to the data. It is a repository of metadata where the existing tribal knowledge of your data community is stored, enriched, curated, and made accessible to your entire organization.
This does several beneficial things for your operation. It reduces the amount of time that data analysts and scientists spend looking for information about the data they need – all the business rules, metadata, filters, joins, queries, and existing reports are placed at their fingertips in the catalog. Next, it allows your analysts to discover sources of data they didn’t know about, instead of relying only on their familiar datasets or spending hours combing through documentation and asking around. And it allows users of data to make informed decisions about which data to use and how to use it, speeding data prep operations.
How a catalog does these things varies between vendors, but most use connectors to ingest content from data sources, plus a robust search engine that makes finding the information you need a simple, Google-like search returning results ranked by popularity and relevance. If you’re a Snowflake customer considering a data catalog, pay close attention to the strong coupling between Snowflake and Alation. This partnership has created a powerful synergy based on bringing together contextual business knowledge with Snowflake datasets in a single searchable repository.
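To make the ranking idea concrete, here is a minimal, hypothetical sketch (the asset names and scoring formula are illustrative, not any vendor’s actual algorithm) of how a catalog search might combine text relevance with usage popularity:

```python
import math

# Hypothetical catalog inventory; query_count stands in for popularity signals.
catalog = [
    {"name": "sales_orders",  "description": "daily sales orders",   "query_count": 420},
    {"name": "sales_staging", "description": "temp sales load",      "query_count": 3},
    {"name": "customers",     "description": "customer master data", "query_count": 180},
]

def search(term, assets):
    """Score each asset: text relevance weighted by log-scaled popularity."""
    results = []
    for asset in assets:
        # 1 relevance point per field (name, description) containing the term
        relevance = sum(term in asset[field] for field in ("name", "description"))
        if relevance:
            score = relevance * (1 + math.log1p(asset["query_count"]))
            results.append((score, asset["name"]))
    return [name for score, name in sorted(results, reverse=True)]

print(search("sales", catalog))  # the heavily used sales_orders ranks first
```

Real catalog engines weigh many more signals (lineage, stewardship, freshness), but the principle is the same: popularity breaks ties between equally relevant assets.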
Once you’ve migrated your data to Snowflake and set up your data warehouse, data mesh, or data lake exactly as you want it, you may be tempted to point your catalog only at the result of all your modeling, analysis, and organizational efforts, and ignore the rest. The temptation to catalog only the “good stuff” can be strong. But that’s not how catalogs work best. In fact, bringing a catalog in only after the hard work of migration and modeling has been done misses out on some of the most powerful capabilities of a modern data catalog: you can and should be leveraging the catalog as a tool for dealing with your data sprawl before you migrate! And it’s not difficult or time-consuming either – it’s as simple as entering credentials for your sources into the catalog and letting it run the ingest automatically.
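Under the hood, that automatic ingest amounts to connecting with the credentials you supplied and harvesting metadata from the source’s system catalog. A minimal sketch, using SQLite as a stand-in for a real warehouse (a Snowflake connector would query `INFORMATION_SCHEMA` instead; the table names are illustrative):

```python
import sqlite3

# Stand-in source database with a couple of tables to harvest.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER, name TEXT);
""")

def ingest_metadata(conn):
    """Return {table_name: [column names]} harvested from the source's system catalog."""
    inventory = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        inventory[table] = [col[1] for col in cols]
    return inventory

print(ingest_metadata(conn))
# {'orders': ['id', 'customer_id', 'total'], 'customers': ['id', 'name']}
```

Commercial connectors also pull query logs, row counts, and permissions, but the core pattern is the same: the catalog reads metadata about your data, not the data itself.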
When you catalog your entire data landscape, the visibility you gain enables you to easily identify and migrate only the good stuff to the cloud in the first place! This gives your modelers and engineers a huge advantage as they embark on their efforts to clean up and refine your data into the streamlined and targeted solution you need.
Another benefit of cataloging the data sprawl you’ve inevitably accumulated over the years is the ability to identify unused datasets and products. Once you’ve ingested the metadata from all of your sprawling sources, analysts will be able to identify places where temporary, not-so-temporary, and even secret structures have been set up to do potentially duplicative work. We all know how these happen – you need something fast that’s slightly different from the existing set, so you create a quick staging table to load your custom sets and then…there it sits. Maybe you’re still using it, but maybe you’re not. Maybe you’re not using it, but somebody else found it and built a process on top of it, and you don’t know until you catalog it and see the lineage automatically diagrammed out for you in Alation (or any other catalog tool you may be using)!

Now you’re able to do impact analysis on removing old, stale, or redundant data structures. When it’s time to modernize your data stack, you need to know what data you have, where it sits, and what the impact will be if you reduce, remove, or reuse. Cataloging all of your data sources will give you the insights you need to make a clean migration, whether or not you are ready to deploy an enterprise-wide data governance program.
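The impact analysis described above boils down to a graph walk over the lineage the catalog has diagrammed: collect everything downstream of an asset before deciding to drop it. A toy sketch with an entirely hypothetical lineage graph (asset → downstream consumers):

```python
# Hypothetical lineage of the kind a catalog diagrams automatically;
# all asset names are illustrative only.
lineage = {
    "raw_orders":      ["stg_orders"],
    "stg_orders":      ["sales_dashboard"],
    "tmp_custom_load": [],                   # the forgotten staging table
    "old_extract":     ["partner_report"],   # somebody built on top of it!
}

def impact_of_removal(asset, lineage):
    """Walk the lineage graph and collect every asset downstream of `asset`."""
    impacted, stack = set(), list(lineage.get(asset, []))
    while stack:
        node = stack.pop()
        if node not in impacted:
            impacted.add(node)
            stack.extend(lineage.get(node, []))
    return sorted(impacted)

print(impact_of_removal("tmp_custom_load", lineage))  # [] -> safe to drop
print(impact_of_removal("old_extract", lineage))      # ['partner_report'] -> not safe
```

An empty result means nothing depends on the asset and it is a candidate for removal; a non-empty result is exactly the surprise the catalog saves you from discovering in production.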
More from Hakkoda:
Download our eBook: THE STATE OF DATA AND THE HIGH COSTS OF DATA SPRAWL
Read other blog posts: