Data Governance and Snowflake Dynamic Data Masking

November 9, 2021

Defining Data Governance

What’s data governance? Well, much like the blind men and the elephant, if you ask three different people, you will get three different answers. One person might say, “Data cataloging.” Another will say, “Securing sensitive data.” The third says, “Data lineage mapping.” A fourth person (an uninvolved bystander with an opinion, because of course there is one) chimes in, “Formalizing data stewardship and authority over data!”

And like with the elephant, it’s all of these things (and more!). But (I hear you say) each one of these things is an entire discipline on its own! How do you take something so big and, let’s face it, nebulously overarching, and execute it? And this is a valid concern – data governance can seem opaque and overwhelming as a concept. It doesn’t help that there are factual yet completely unhelpful definitions to be found online. Additionally, most data governance advice is aimed at large corporations that have the resources to throw entire departments into creating a framework and organizational structure squarely focused on data. That’s great, if you have thousands of employees and the cash to spend. But what if you’re not a big-time corp, but you still want or need to ensure that your data is fully governed?

My favorite way to attack this elephant is one step at a time (or one bite at a time, if you’re into eating elephants (boy I hope you’re not)). Each of the facets or activities of data governance can be broken down into relatively simple, straightforward tasks. And most of them involve just documenting, or writing down, information about processes, data, and requirements.

In the data governance world, there are some very basic questions that you are trying to answer with your DG program: What data do I have? Where is my data located? Who (and/or what system) is using it, and for what purpose? Does that use meet all regulatory and business requirements? Is the quality acceptable for the use? If you can answer all of these questions, in detail, you have data governance.

Snowflake’s Data Governance Features

Now, if you’re a Snowflake customer, you probably know about the data governance features that it provides. As a data governance professional, I think this is pretty cool stuff! These tools simplify implementing data governance concepts on a technical level. Specifically, you can use these features to answer the questions above about where your data is, and who is allowed to use it for what purpose. But these features rely on you already having a framework of data governance policies and standards to work within. What if you don’t? Or what if you do, but you are new to Snowflake and need a starting point for working with them?

That’s when we start breaking it down into steps, and take it one at a time.

Snowflake has four main features they promote – column-level masking, row access policies, user access auditing, and sensitive data tagging.

Let’s start with column-level masking today – I’ll cover the others in future posts. The first thing you need to do is decide which type of masking you need: dynamic data masking, or external tokenization. It’s possible you need both – in some cases dynamic data masking, in others, tokenization. So what’s the difference?

Dynamic data masking means you’re loading the data as-is into Snowflake and masking it at query time, i.e., when the user tries to view it. This means that you can control who sees what. It’s role-based masking, and because Snowflake allows full or partial masking of data fields, you can choose what level of masking each role needs. For example, your sales reps need to see phone numbers for customers, but report analysts never do, so you create a role for each and assign your users to the role that can see the data they need to do their job, but not the data they don’t. This keeps both users and owners of data safe.
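
To make the idea concrete, here is a minimal Python sketch of role-based masking at query time. It only mimics the behavior; in Snowflake itself this logic lives in a masking policy attached to the column. The role names are hypothetical (SUPPORT_AGENT is an extra role I added to show partial masking), and the phone number is made up.

```python
# Illustrative only: role names and masking formats are assumptions,
# not anything Snowflake defines for you.

def mask_phone(value: str, role: str) -> str:
    """Return the phone number as the given role would see it at query time."""
    if role == "SALES_REP":
        return value                       # no masking: stored value shown as-is
    if role == "SUPPORT_AGENT":
        return "***-***-" + value[-4:]     # partial masking: last 4 digits only
    return "***-***-****"                  # every other role: full masking


stored = "555-010-0199"                    # the value stored in the table is unchanged
for role in ("SALES_REP", "SUPPORT_AGENT", "REPORT_ANALYST"):
    print(role, mask_phone(stored, role))
```

Note that the stored value never changes; only what each role gets back does. Falling through to full masking for any role not listed is a deliberate choice, so a newly created role sees nothing sensitive until someone decides it should.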

Tokenization, which in Snowflake is accomplished using external functions, takes a different approach. The data is replaced with a token before being loaded into Snowflake, and can be de-tokenized at query time based on role. The difference is that the real data value is never loaded into the platform, so users, even high-level account administrators, can’t accidentally (or intentionally) view the data unless they are specifically granted the privileges to do so. Also, and this is a big deal – tokens allow you to group and analyze values without ever viewing them. Because the same original value always maps to the same token, users can still filter, group, and aggregate on the tokens without ever seeing the real value.
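
Here is a small Python sketch of why identical values still group together once tokenized. To be clear about the stand-in: I’m using a keyed hash (HMAC-SHA256) to produce deterministic tokens, which is not how any particular tokenization provider works, and unlike a real provider it can’t de-tokenize. The key, the `tok_` prefix, and the sample values are all made up for illustration.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret that would live with the tokenization service,
# never inside the data platform.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str) -> str:
    """Deterministic stand-in token: the same input always gives the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


# Made-up sample values, tokenized before they are ever "loaded".
diagnoses = ["asthma", "diabetes", "asthma", "asthma", "diabetes"]
tokens = [tokenize(d) for d in diagnoses]

# An analyst can count rows per token without knowing what any token means.
counts = Counter(tokens)
print(sorted(counts.values(), reverse=True))  # [3, 2]
```

The analyst learns that one value appears three times and another twice, but not which diagnosis is which.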

So how do you start implementing these column-level masking features? At first it might seem like playing 3-D chess, but you just need to have three things: a set of roles, a classification system for your data, and a clear vision of what your goal is for masking.

The easy way to do this is to start with your list of roles. Snowflake recommends you create custom roles in addition to its standard pre-defined roles. Data governance provides a perfect use case for creating custom roles. It’s one of our basic DG questions: Who needs to see what?

Classification System

Ok, so you’ve defined your set of roles within Snowflake. Now pull out your classification system. It’s common to have four classifications for data (Public, Internal, Confidential, and Restricted is one typical scheme), but your company may have different levels or names for them based on the laws it needs to follow. You’ll mostly be concerned with data that falls into the Confidential or Restricted categories. If that data has been tagged properly, you should now have a list of columns or keys that need protection. Write down what type of masking each one needs based on the applicable laws and your expected use:

Data Asset	Masking Method
SSN	Dynamic Masking
FName/LName	Dynamic Masking
DOB	External Tokenization
Diagnosis	External Tokenization
…

Now create a matrix, laying out your roles and data, and specify where and how much masking should be applied when that role queries the data in Snowflake. It can be as simple as an Excel spreadsheet; a very basic template will suffice!

This is where you apply what you know about the type of use each role makes of the data. Is full masking required, or is partial masking (for example, showing only the last four digits of an SSN) acceptable? Fill in your matrix with your answers.

	SSN	FName/LName	DOB	Diagnosis
Role 1	Full masking	Full masking	Full masking	Full masking
Role 2	Partial – show last 4 digits	N/A	N/A	Full masking
Role 3	Full masking	N/A	Partial – show year	Full masking
Role 4	Full masking	N/A	N/A	N/A
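
If it helps to see the matrix as data rather than a spreadsheet, here is a rough Python sketch of the same table and how a row would look to each role. Two assumptions to flag: I’m reading “N/A” as “no masking applied,” and I’m assuming dates are stored as YYYY-MM-DD. The rule names (`full`, `last4`, `year`, `none`) and the sample row are invented for illustration.

```python
# The matrix from the table above. "N/A" is ASSUMED to mean "no masking applied".
MATRIX = {
    "Role 1": {"SSN": "full",  "FName/LName": "full", "DOB": "full", "Diagnosis": "full"},
    "Role 2": {"SSN": "last4", "FName/LName": "none", "DOB": "none", "Diagnosis": "full"},
    "Role 3": {"SSN": "full",  "FName/LName": "none", "DOB": "year", "Diagnosis": "full"},
    "Role 4": {"SSN": "full",  "FName/LName": "none", "DOB": "none", "Diagnosis": "none"},
}


def apply_rule(rule: str, value: str) -> str:
    """Apply one masking rule to one value."""
    if rule == "none":
        return value
    if rule == "last4":
        return "***-**-" + value[-4:]      # partial: show last 4 digits
    if rule == "year":
        return value[:4] + "-**-**"        # partial: show year (assumes YYYY-MM-DD)
    return "********"                      # "full" and any unknown rule: mask everything


def view_row(role: str, row: dict) -> dict:
    """Return the row as the given role would see it."""
    rules = MATRIX[role]
    return {col: apply_rule(rules.get(col, "full"), val) for col, val in row.items()}


# Made-up sample record.
row = {"SSN": "123-45-6789", "FName/LName": "Ada Example",
       "DOB": "1990-04-12", "Diagnosis": "asthma"}

for role in MATRIX:
    print(role, view_row(role, row))
```

Anything the matrix doesn’t explicitly cover falls back to full masking, which is the safer default while the matrix is still waiting on sign-off.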

Once your matrix has received the necessary sign-off from the data steward or owner (assuming you yourself aren’t the steward or data owner), hand off that matrix to your Snowflake admin and you’ve got column-level data masking based on your data governance rules.
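
For a sense of what your admin might do with the SSN column of that matrix, here is a hedged Python sketch that writes out Snowflake masking-policy SQL. The statements follow Snowflake’s documented `CREATE MASKING POLICY` and `ALTER TABLE ... SET MASKING POLICY` syntax as I understand it, but the policy name, role name, table, and column are all hypothetical, and this is a starting point to check against Snowflake’s documentation rather than a finished implementation.

```python
from typing import Iterable


def ssn_policy_sql(policy_name: str, partial_roles: Iterable[str]) -> str:
    """Build a masking policy: listed roles see the last 4 digits, everyone else sees none."""
    role_list = ", ".join(f"'{r}'" for r in partial_roles)
    return (
        f"CREATE OR REPLACE MASKING POLICY {policy_name} AS (val STRING) RETURNS STRING ->\n"
        "  CASE\n"
        f"    WHEN CURRENT_ROLE() IN ({role_list}) THEN CONCAT('***-**-', RIGHT(val, 4))\n"
        "    ELSE '***-**-****'\n"
        "  END;"
    )


def attach_policy_sql(table: str, column: str, policy_name: str) -> str:
    """Build the statement that attaches a masking policy to a column."""
    return f"ALTER TABLE {table} MODIFY COLUMN {column} SET MASKING POLICY {policy_name};"


# Hypothetical names: ROLE_2 stands in for "Role 2" from the matrix.
print(ssn_policy_sql("ssn_mask", ["ROLE_2"]))
print(attach_policy_sql("patients", "ssn", "ssn_mask"))
```

One policy per data asset, with one branch per level of masking, maps neatly onto the columns of the matrix.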

Next time: In upcoming posts we’ll walk through row access policies, user access auditing, and sensitive data tagging. Each of these works hand-in-hand with the column-level masking features, but we’re eating this elephant one bite at a time!