Introduction to Snowflake

Last month I attended the first Snowflake Summit in San Francisco. I will try to share the things I learned at the Summit and give a short introduction to Snowflake. But first: "What is Snowflake?". One of the keynotes at the Snowflake Summit was "The Origin Story of Snowflake" by Benoît Dageville (current President of Products); I wrote a blog, "The Origin Story of Snowflake", about that keynote earlier. Don't confuse the name Snowflake with the database modelling technique; Snowflake explains its name here: Behind the Snowflake Name.

“Simply load and query data”

The founders of Snowflake wanted to solve three challenges with Snowflake. Firstly, analysing machine-generated (Big) Data, which often comes with enormous volume (scale) and great variety (structure). Secondly, elasticity: compute on demand, with the simplicity of Software-as-a-Service (SaaS). Thirdly, keeping all the good things from the RDBMS world (e.g. SQL).

Architecture

Snowflake's architecture is built on three important pillars, on top of a Cloud Agnostic Layer for AWS, Azure and, later this year, GCP. Snowflake makes use of the specific capabilities of the chosen Cloud. Nevertheless, it is possible to switch clouds, so you won't be locked into a single Cloud provider.

Snowflake - Multi-Cluster, Shared Data Architecture

I will try to compare these three components to the parts of a PC.

  • Storage – The hard drive. No limit on storage, logically separated by databases.
  • Compute – The CPU, the processing power of the machine. Different Virtual Warehouses per workload. Automatically or programmatically suspend and resume.
  • Service – This is basically the software which tells the computer what to do. The Service Layer is the brain of the system.

Snowflake's unique architecture is based on the physical separation of storage, compute and services. While these three layers are physically separated, they are designed to work together seamlessly on a logical level: "Multi-Cluster, Shared Data".

Storage

Data is centrally stored and optimized; no data silos. Storage is in a Snowflake proprietary columnar format in the cloud's blob storage (AWS S3, Azure Blob Storage or Google Cloud Storage). Snowflake manages how and where the data is stored (data replication, scaling and availability). There is basically no limit on storage. On top of that, Snowflake is capable of storing and querying all kinds of data, both structured and semi-structured (e.g. JSON, Avro).
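
As a minimal sketch (table and column names are hypothetical), a single Snowflake table can hold structured columns and semi-structured records side by side:

    -- Hypothetical table combining structured columns and semi-structured data
    CREATE TABLE raw_events (
        event_id    NUMBER,
        received_at TIMESTAMP_NTZ,
        payload     VARIANT   -- stores JSON (or Avro) records as-is
    );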

The cost of storage depends on the chosen Cloud; the Cloud provider's storage costs are charged 1-on-1 (pass-through). Because of Snowflake's compression techniques you will save on storage costs.

Compute

Snowflake Compute can be seen as a set of independent MPP clusters, in Snowflake terms: Virtual Warehouses. Depending on the workload you can choose from various T-shirt sizes, from X-Small to 4X-Large. Virtual Warehouses work independently of each other, so there is no competition for resources. In other words, the performance of each workload (e.g. loading, querying or machine learning) is not affected by the others.
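
For example (the warehouse names are made up), you could give each workload its own Virtual Warehouse with its own size:

    -- Hypothetical warehouses, one per workload, each with its own T-shirt size
    CREATE WAREHOUSE load_wh  WITH WAREHOUSE_SIZE = 'XSMALL';
    CREATE WAREHOUSE query_wh WITH WAREHOUSE_SIZE = 'LARGE';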

Compute is charged on a per-second basis with a minimum of 60 seconds. To keep costs under control, Snowflake can automatically or programmatically suspend and resume a Virtual Warehouse. In addition, you can scale up or down depending on demand.
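
A sketch of what this looks like in SQL, using the hypothetical warehouse from above:

    -- Suspend after 60 seconds of inactivity and resume automatically on the next query
    ALTER WAREHOUSE query_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

    -- Suspend or resume programmatically
    ALTER WAREHOUSE query_wh SUSPEND;
    ALTER WAREHOUSE query_wh RESUME;

    -- Scale up when demand increases
    ALTER WAREHOUSE query_wh SET WAREHOUSE_SIZE = 'XLARGE';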

Service

Snowflake is a true Software-as-a-Service offering. DWaaS may sound silly in Dutch (it sounds like "dwaas", foolish), but Snowflake is a Data Warehouse as a Service. The "Brain of the System" takes care of a series of Cloud Services, like authentication, infrastructure management, metadata management, query parsing & optimization, and access control. No need for manual intervention; Snowflake takes care of it all.

Why love Snowflake?

Apart from the Snowflake capabilities mentioned above, there are some additional features which will make you love Snowflake.

ANSI SQL support – SQL is the language to query data. Snowflake is capable of querying both structured and semi-structured data; the latter makes it possible to query JSON directly via the VARIANT data type.
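
As a small illustration (table, column and attribute names are hypothetical), a JSON attribute inside a VARIANT column can be addressed directly in a query:

    -- Query JSON attributes in a VARIANT column with plain SQL
    SELECT payload:customer.name::STRING AS customer_name
    FROM   raw_events
    WHERE  payload:status::STRING = 'shipped';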

Data Sharing – The Snowflake Secure Data Sharehouse: no data movement, live access and ready to use. The recently announced Data Exchange is based on the Data Sharing technology.
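
A minimal sketch from the provider side (share, database and account names are hypothetical):

    -- Create a share and grant a consumer account read access, without copying data
    CREATE SHARE sales_share;
    GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
    GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
    GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;
    ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;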

Time Travel – Revert your databases, tables and schemas to any point in time in the past, e.g. via the UNDROP feature.
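
Both flavours in a short sketch (the table name is hypothetical):

    -- Query a table as it was one hour (3,600 seconds) ago
    SELECT * FROM orders AT (OFFSET => -3600);

    -- Bring back a table that was dropped by accident
    UNDROP TABLE orders;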

Zero-copy cloning – Multiple copies of the data (e.g. for Test or QA environments) without extra storage costs.
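
For example (database names hypothetical), a full test copy of a database is a single statement:

    -- Clone a complete database for testing; no data is physically copied
    CREATE DATABASE sales_db_test CLONE sales_db;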

Last but not least, Snowflake is Secure by Design. All data is automatically encrypted. The various security options depend on the chosen Snowflake edition: Standard, Premier, Enterprise, Enterprise for Sensitive Data (ESD) or Virtual Private Snowflake (VPS).

What’s next?

This was only a short introduction to Snowflake. Want to have a look yourself? Navigate over here and get started with a 30-day trial that comes with $400 worth of credits.

If you want to discuss the architecture of Snowflake in more detail, please let me know; I am happy to dive deeper.

Thanks for reading.

Cheers,

Daan Bakboord

DaAnalytics


Hi, I am Daan Bakboord. I am an entrepreneurial Data & Analytics consultant. My passion is collecting, processing, storing and presenting data. I help organisations use this internal and external data in such a way that they can take action. By relying on trustworthy data, entrepreneurs can back up their 'gut feeling' with facts.
