Daanalytics

Five days inside Snowflake Data Governance – Know your Data

In the previous blog I mentioned that Snowflake’s Data Governance Framework focuses on three key areas:

  • Know your Data
  • Protect your data
  • Connect your Ecosystem
Five days inside Snowflake Data Governance - Know your Data

Know your Data

In this blog we will focus on the key area; ‘Know your Data’. This aspect of Snowflake’s Data Governance Framework is all about understanding your data and its quality, lineage, and usage. It answers questions about the What & the Where (Classification & Object Tagging) and the Who (Account Usage and Access History) of your data.

For this blog series we will use the Tasty Bytes Quistart dataset.

Classification

Snowflake has built-in Automatic Data Classification, which makes it easier for customers to know and then protect their data.

There are several key use cases are for Data Classification

  • Data Stewards and Privacy Officers that need to know what sensitive personal data their organization has,
  • Data Security Admins that need to make sure the right protection is in place for this data (including anonymization which classification helps streamline),
  • Compliance Officers need to produce audit reports of who has accessed sensitive personal data.

Automatic Data Classification analyzes columns in structured data for personal information and provides customers with a set of pre-defined Snowflake System Tags to automatically categorize this data according to the supported types.

These System Tags automate the process of classifying data.

The System Tags are:

  • semantic_category – tells you what the cells of the column contain
    • some possible values are: name, gender, age, us_postal_code, phone
  • privacy_category – tells you what kind of personal information it is
    • some possible values are: identifier, quasi_identifier, insensitive

Now that the System Tags have been applied an admin can define policies based on their values.  

In this way the admin can control who gets access to what columns in which tables thereby protecting your data and they can also use the System Tags to audit who has accessed columns that contain personal data.

Classification in Snowflake is a multi-step process where Snowflake-defined tags kan be applied to table columns. This is a three-step process:

  1. Analyze
  2. Review
  3. Apply

Analyze

First step is to run the EXTRACT_SEMANTIC_CATEGORIES-function in Snowflake, which takes a specific table as input.

This statement returns a VARIANT in JSON format.

Review

This statement returns a VARIANT in JSON format like part of the output below.

By using the FLATTEN table function readability can be enhanced:

If you review the output of the above query, you will see that e.g. the column; ‘BIRTHDAY_DATE’ has the following output:

  • Secondary Category – ‘DATE_OF_BIRTH’
  • Privacy Category – ‘QUASI_IDENTIFIER’
  • Probability – 1
  • Alternates – ‘[]’

Check the documentation for further details about the above output and their relationship(s).

Apply

In the last Apply-step, the Calssification System tags can be assigned to the columns. This can both be done manual as well as automated;

The automated proces is a call to the following procedure:

If the proces runs successfully, the procedure ends with a message like; ‘Applied tag semantic_category to <n> columns. Applied tag privacy_category to <n> columns.’

Object Tagging

One of the Data Governance operations is Object Tagging. This operation enables us to categorize Data Objects and / or describe data stored in these objects. As we could see while reviewing the example above, there are several PII-columns. These columns are eligible for object tagging.

  • FIRST_NAME
  • LAST_NAME
  • PHONE_NUMBER
  • E_MAIL

Create tags

“A tag is a schema-level object that can be assigned to another Snowflake object. A tag can be assigned an arbitrary string value upon assigning the tag to a Snowflake object. Snowflake stores the tag and its string value as a key-value pair. The tag must be unique for your schema, and the tag value is always a string.”

We will create three different tags:

  • pii_name_tag (FIRST_NAME, LAST_NAME)
  • pii_phone_number_tag (PHONE_NUMBER)
  • pii_email_tag (E_MAIL)

Creating tags can e.g. be done by the following statement:

You can verify whether everything is been done correctly via the below statements where you can verify the DDL and see the allowed values for the specified tag.

Apply tags

Tags need to be assigned to (existing or new) Snowflake objects to be able to use them.

Track the tags

To validate the tags we have created and applied we can run e.g. the following query to track the tags.

This one way to identify the tag assignments. There are more. Follow the documentation to find out.

Object Dependencies

Identifying how Snowflake’s Data Objects are related can provide insight into various Data Tracking use cases, like Identifying dependencies and downstream impact. This can be achieved by querying the Account Usage view OBJECT_DEPENDENCIES. Check out the following query to see how the CUSTOMER_LOYALTY_V-view is related to the CUSTOMER_LOYALTY-table.

For more information, check out the documentation.

Who – ACCOUNT_USAGE & ACCESS_HISTORY

Up until now we covered the ‘What’ and the ‘Where’ about knowing your data. Understanding ‘Who’ has access to your data and how they are using it is a critical aspect of Data Governance.

Snowflake provides a set of built-in views, including the views in the ACCOUNT_USAGE-schema and the ACCESS_HISTORY-view, that enable you to monitor who is accessing your data and how they are using it.

The ACCOUNT_USAGE-schema provides information about user activity in your Snowflake account, such as who is logging in, when they are logging in, and which objects they are accessing. This schema also provides information about queries executed by users, including the SQL statement, execution time, and resource usage.

The ACCESS_HISTORY-view in Snowflake provides a detailed history of all access to databases, schemas, tables, views, and other database objects within a Snowflake account. This view allows administrators to monitor and audit user access to sensitive data, and identify potential security risks or violations.

By analyzing the information provided by these views, you can gain insights into who is accessing your data, how they are using it, and whether they are complying with Data Governance policies and regulations. This information can help you identify potential risks and vulnerabilities, and take appropriate actions to mitigate them.

Till next time.

Director Data & AI at Pong and Snowflake Data Superhero. Online better known as; DaAnalytics.

Daan Bakboord

Bekijk ook:

Snowflake’s Data Classification in Snowsight

Snowflake Data Governance directly from Snowsight

Last year I blogged about how to use Snowflake functionality to; “Know your Data”. Especially in these times where Generative AI becomes more and more mainstream, it’s essential to know what data is input for the LLM’s. Now Snowflake has made this a few clicks easier, offering classifying functionalities directly from Snowsight.

Lees verder »