The Origin of Snowflake
Benoît Dageville (current President of Products) and Thierry Cruanes (current CTO) are the founders of Snowflake. What was the idea when they started Snowflake? Benoît and Thierry had a clear and simple vision: “Simply load and query data”. In his keynote, Benoît shared the Origin of Snowflake and walked us through the Snowflake Secret Sauce. Below are some notes from my side to summarise Benoît’s talk.
Simply load and query data
August 2012 – In this period there were three challenges when it came to loading and querying data. The founders of Snowflake wanted to solve these with Snowflake.
The challenge of analysing machine-generated (Big) Data. The volume (scale) and variety (structure) of this data could not easily be handled by traditional relational databases. Hadoop was the answer at that time. Benoît and Thierry thought it would fail. There were two important reasons for that.
Firstly, the complexity. Hadoop is built by engineers for engineers. Just try to set up a Hadoop environment with ‘all’ the required functionality yourself: it quickly becomes a spaghetti of different tools and projects. That’s probably one of the reasons why Hadoop distributions like Cloudera, Hortonworks and MapR became so popular. They handle a big part of the complexity.
Elasticity and Compute on demand. Simplicity as in Software-as-a-Service (SaaS). Take all the complexity away from the user. The reverse of Hadoop. Everybody should be able to use data, not only the people skilled enough to find their way through the Command Line.
All the good things from the RDBMS. And… SQL is the language to query data. According to Benoît, SQL is not dead.
“Build the best Data Warehouse for the Cloud”
So the challenge was there: “Build the best Data Warehouse for the Cloud”. They started whiteboarding, in other words re-inventing an architecture based on Micro-partitioning (sub-second response time) and delivered as-a-Service (simplicity).
Traditional architectures are based on a Single Cluster. This is often a bottleneck because it’s not elastic and cannot scale. Therefore the Snowflake architecture had to be Multi-Cluster. This means that as many compute clusters as needed should be able to query the same data independently, without competing for resources. Data should be shared and centralised. Above all, no data silos.
Snowflake’s Architecture consists of three different components on top of a Cloud Agnostic Layer for AWS, Azure and the recently announced GCP:
- Centralised Elastic Storage – No limit on storage, logically separated by databases
- Multi-cluster Compute – Different virtual warehouses per workload. Automatically or programmatically suspend and resume
- Scale out Services – The Brain of the System
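To make the separation of storage and compute a bit more tangible, here is a minimal toy sketch in Python. It is not Snowflake code: the class names `CentralStorage` and `VirtualWarehouse`, the warehouse names and the auto-resume behaviour are assumptions made purely for illustration. It only shows the idea that several independent compute clusters read one shared copy of the data, and that each one can be suspended and resumed on its own.

```python
from dataclasses import dataclass, field


@dataclass
class CentralStorage:
    """Toy stand-in for the centralised storage layer: one shared copy of the data."""
    tables: dict = field(default_factory=dict)


@dataclass
class VirtualWarehouse:
    """Toy stand-in for an independent compute cluster reading the shared storage."""
    name: str
    storage: CentralStorage
    running: bool = False

    def resume(self) -> None:
        self.running = True

    def suspend(self) -> None:
        self.running = False

    def count_rows(self, table: str) -> int:
        # Assumption for this sketch: a suspended warehouse resumes on demand.
        if not self.running:
            self.resume()
        return len(self.storage.tables[table])


storage = CentralStorage(tables={"orders": [101, 102, 103]})
etl = VirtualWarehouse("etl_wh", storage)
reporting = VirtualWarehouse("reporting_wh", storage)

# Both warehouses query the same data; nothing is copied between them.
print(etl.count_rows("orders"), reporting.count_rows("orders"))

# Suspending one warehouse does not affect the other.
etl.suspend()
print(etl.running, reporting.running)
```

The point of the design is that compute is the only thing that is switched on and off per workload, while the data stays in one place.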
There are different use cases to leverage the architecture of Snowflake. You can run various Warehouses for your Workloads, split by environment (DEV/TEST/PROD), consumer type or business unit (for chargebacks).
Use as many Warehouses as you need and scale up or down depending on the demand. Warehouses come in various T-shirt sizes, from X-Small to 4XL (the choice may depend on your SLA). Scale up for (individual query) performance: speed doubles between T-shirt sizes. Start conservatively until you truly understand your workload. Use multiple clusters for more concurrent users (peak-time reporting), in other words scale out for concurrency. Compute on demand.
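As a back-of-the-envelope illustration of "speed doubles between T-shirt sizes", the sketch below assumes an idealised workload that scales perfectly with warehouse size. The helper names `relative_capacity` and `estimated_runtime` and the 640-second baseline are made up for this example; real queries will rarely scale this cleanly, which is exactly why it pays to start small and measure.

```python
# Warehouse sizes from X-Small up to 4X-Large, smallest first.
SIZES = [
    "X-Small", "Small", "Medium", "Large",
    "X-Large", "2X-Large", "3X-Large", "4X-Large",
]


def relative_capacity(size: str) -> int:
    """Capacity relative to X-Small, assuming it doubles with every size step."""
    return 2 ** SIZES.index(size)


def estimated_runtime(baseline_seconds: float, size: str) -> float:
    """Idealised runtime, given the runtime measured on an X-Small warehouse."""
    return baseline_seconds / relative_capacity(size)


# Hypothetical query that takes 640 seconds on X-Small.
for size in SIZES:
    capacity = relative_capacity(size)
    runtime = estimated_runtime(640, size)
    print(f"{size:>9}: {capacity:>3}x capacity, ~{runtime:.0f}s")
```

Under these assumptions a Medium warehouse has four times the capacity of an X-Small, so the hypothetical query drops from 640 to roughly 160 seconds.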
Other interesting features include:
- Instantly Clone Database (e.g. QA and DEV)
- Secure Data Share – As a provider, create a secure view on your data for consumers to select data from. Hence Snowflake’s announcement about the Data Exchange.
- Serverless capabilities like Snowpipe
- Micro-Partitioning – Automatically created at runtime (background re-clustering service). Small, Columnar, Structured/semi-Structured, with a Partition Map Index (to find the partitions relevant to your query). Challenges: Blob Storage is immutable, and performance.
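The Partition Map Index idea from the list above can be sketched in a few lines of Python. This is a simplified illustration, not Snowflake's actual implementation: `MicroPartition`, `build_partitions` and `query_equals` are hypothetical names, and the only metadata kept here is the minimum and maximum value per partition. The sketch shows how such metadata lets a query skip every partition that cannot contain the requested value.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    """Hypothetical micro-partition: a few rows plus min/max metadata for one column."""
    rows: list
    min_value: int
    max_value: int


def build_partitions(values, size):
    """Split values into fixed-size chunks and record each chunk's value range."""
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    return [MicroPartition(chunk, min(chunk), max(chunk)) for chunk in chunks]


def query_equals(partitions, target):
    """Scan only the partitions whose [min, max] range could contain target."""
    candidates = [p for p in partitions if p.min_value <= target <= p.max_value]
    matches = [v for p in candidates for v in p.rows if v == target]
    return matches, len(candidates)


# 100 sorted values split into 10 partitions of 10 rows each.
partitions = build_partitions(list(range(1, 101)), size=10)
matches, scanned = query_equals(partitions, 42)
print(matches, f"scanned {scanned} of {len(partitions)} partitions")
```

Because the values in this example are sorted, the ranges do not overlap and only one partition has to be read. With poorly clustered data more ranges would overlap, which is where a background re-clustering service earns its keep.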
The Snowflake Approach
Try to eliminate the problems and solve them at runtime, because at that time you know which data is in use for that workload. Normally you would have to set up the following:
- infrastructure (setup, upgrades, patching, availability, backups, etc.)
- physical design (index, partitioning, etc.)
- query tuning (statistics, workload management, etc.).
Snowflake just wants you to “Simply load and query data”. As-a-Service.
Now that we know the Origin of Snowflake, the question is: what’s next?
- Global – One single system connected for the whole world. Which means one virtual datacenter, cross cloud and cross area. No boundaries.
- All the data – Including non-structured data and all types of access
- Real-Time – The time from when the data is born to the time you can see the data in Snowflake. End-to-End latency. As little as possible.
- Beyond SQL – To overcome the challenges SQL cannot solve. Having an extensible framework.
- Data Services – Data Sharing + add value. Data enrichment and potentially share back.
Some interesting times lie ahead for us.
Thanks for reading.