High Concurrency and Low Latency on Data Lakes with Lakehouse

As far back as I can remember, users have wanted their data fast. Patience may be a virtue, but few people have the patience to wait for a cluster to boot. Data platform and cloud vendors got crafty by creating a new era of Serverless products, where they house and manage pre-warmed compute resources in their own accounts and hand customers slices of those resources on demand. This relieves customers of having to manage and maintain compute resources and gives them instant access to compute as they need it, often with attractive per-second billing.

While researching cloud data platforms over the past decade, I have seen a common set of business requirements take priority. Supporting a large number of simultaneous users is crucial for medium to large enterprises, where hundreds to thousands of analysts and users may be reaching out for data from the system at once. This capability is called concurrency; a system that can serve a large number of simultaneous users is said to offer High Concurrency. A related measure is latency, the time it takes to round trip a request from the client to the server. A system that responds to a user's request quickly is said to offer Low Latency.
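To make those two measures concrete, here is a minimal sketch that fires a batch of simultaneous queries at an endpoint and reports response times. The run_query helper is a hypothetical placeholder for whatever client call you use (JDBC, ODBC, REST); the point is simply that concurrency is how many requests are in flight at once, and latency is how long each one takes.

```python
# Hypothetical load sketch: N concurrent clients, each timing one query.
# run_query() stands in for a real client call against your endpoint.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def run_query(sql: str) -> float:
    start = time.perf_counter()
    # ... send `sql` to the endpoint and wait for the result here ...
    time.sleep(0.05)  # placeholder for the real round trip
    return time.perf_counter() - start  # latency in seconds

CONCURRENCY = 100  # simultaneous clients -> "high concurrency"
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(run_query, ["SELECT 1"] * CONCURRENCY))

print(f"median latency: {median(latencies) * 1000:.0f} ms")  # -> "low latency"
```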

One of the largest pain points of on-premises data platforms is that you have to plan ahead for scaling concurrency; increasing capacity often requires months of planning to acquire, build, provision, and integrate more hardware. The cloud came along and solved some of these pain points by offering infrastructure services that let organizations scale out faster, without having to plan in advance for increased demand. Freed from having to invest massive amounts of capital and planning, business leaders can now use pay-as-you-go, consumption-based pricing in the cloud, which gives them the flexibility to experiment with new ideas and scale them quickly when the return on investment is appealing. One aspect that often goes unnoticed at first is that while infrastructure costs have become more flexible in the cloud, the premium for data warehouse compute has not seen the same degree of commoditization. Most vendors still charge a heavy premium to enable highly concurrent, low latency workloads, locking your data into proprietary file formats that can only be read with their software.

Low Latency SQL Query Response

Modern data warehouses have delivered this type of functionality over the years, lowering latency by leveraging statistics and indexes, and increasing concurrency by scaling out infrastructure. Most people know Apache Spark as the de facto standard for processing big data: it excels at long-running, resource-intensive jobs. What open source Apache Spark was never great at was serving small results to many users very fast. While that is still true today, the team at Databricks has cracked the code on delivering this capability with open technologies. Most people never thought it could happen, and most cloud data warehouse vendors will tell you all the reasons it can't be done.

Data lakes in the cloud are great: they provide vast amounts of storage for all data types at amazingly low rates. Unfortunately, while cheap and abundant, data access at this layer is slow, missing the mark on low latency. Given the right technology to parallelize connections, data lakes can be made highly concurrent, but then a slew of issues come out of the woodwork and cause headaches for data architects and engineers. Most of these issues were solved in the data warehousing era, like ACID transactions (Atomicity, Consistency, Isolation, and Durability). Essentially, ACID transactions guarantee that when you issue a command, your data never falls into an unpredictable state because of an operation that only partially finishes. Modern data lake transaction systems like Delta Lake solve this by providing an API layer that handles transactions on the data lake. With Delta Lake, you can create tables on a data lake with SQL or Python, and it handles all the file operations for you, as in the sketch below. Delta Lake is the foundation of a Lakehouse, providing the table-based ACID properties on which data warehouses can be built.
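Here is a minimal sketch of what that looks like with PySpark and open source Delta Lake. It assumes you have the delta-spark package available and a writable path; on Databricks the session configuration below is already done for you, and the table and path names are just illustrative.

```python
# Minimal Delta Lake sketch: write a table, then mutate it transactionally.
# Assumes `pip install pyspark delta-spark`; paths/names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a DataFrame as a Delta table; Delta manages the underlying
# Parquet files and the transaction log (_delta_log) for you.
df = spark.range(0, 1000).withColumnRenamed("id", "user_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Each statement below is an ACID transaction: readers never see
# a partially applied update, even if the job dies midway.
spark.sql("CREATE TABLE IF NOT EXISTS users USING DELTA LOCATION '/tmp/delta/users'")
spark.sql("UPDATE users SET user_id = user_id + 1 WHERE user_id < 10")
```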

With Databricks SQL Endpoints, businesses can now take advantage of cheap commodity compute and storage in the cloud to deliver data to a massive number of users, really fast. Endpoints come in several flavors: Classic, which has two variants, Cost Optimized and Reliability Optimized, and Serverless. Cost Optimized uses Spot instances for the workers, while Reliability Optimized uses On-Demand instances. Spot instances offer up to 80% cost savings by using spare capacity in the cloud; they can be preempted when a user comes along willing to pay full price, but Apache Spark was built to be resilient and can replace a node at any time. Unfortunately, Classic Endpoints can take anywhere from 3 to 10 minutes to boot, depending on the cloud and available capacity. Serverless, on the other hand, boots almost instantly, but the compute lives in Databricks' account, compared to compute in your own account with Classic. There are pros and cons to where the compute lives: with compute in your account you can negotiate discounts directly with the cloud vendor, but you have to manage it. Serverless provides instant access to compute without having to manage it, and you don't pay for idle time. Most people I talk to would rather not manage compute resources and would gladly take advantage of the benefits of Serverless. We think Serverless can provide amazing value by reacting faster to user demand, booting and scaling near instantaneously.
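As a rough illustration of how these flavors map to configuration, here is a sketch that creates an endpoint through the Databricks REST API with Python's requests library. The host, token, and exact field names reflect my reading of the SQL Endpoints API and should be treated as assumptions; check the current API documentation before relying on them.

```python
# Hedged sketch: create a Databricks SQL endpoint via the REST API.
# DATABRICKS_HOST / DATABRICKS_TOKEN are placeholders; the payload field
# names are assumptions drawn from the SQL Endpoints API and may differ
# in your workspace's API version.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

payload = {
    "name": "bi-serving",
    "cluster_size": "Large",            # t-shirt sizing, as in the UI
    "min_num_clusters": 1,
    "max_num_clusters": 7,              # scale out to absorb concurrency
    "enable_serverless_compute": True,  # False -> Classic
    "spot_instance_policy": "COST_OPTIMIZED",  # or "RELIABILITY_OPTIMIZED"
}

resp = requests.post(
    f"{host}/api/2.0/sql/endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("id"))  # endpoint id to use for connections
```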

Well, the results are interesting. When the dbstress tool does not get a response to a query within a certain timeout, it fails the scenario. Databricks SQL Classic Endpoints boot in the customer's account, which has certain cloud discount and security architecture benefits, but they take 3 to 10 minutes to boot depending on the weather in your cloud at the time; they did not respond within the timeout, so those scenarios failed. The Serverless Endpoints, however, booted within seconds and started serving queries right away.

When the SQL Endpoint gets the initial rush of commands, it queues the queries, so you see a wall of long query durations when the experiment first executes. As the Endpoint warms up additional resources to handle the concurrency demand, query durations decrease through the middle of the run and begin to average out toward the end. The average query response time for the entire run was about 15 seconds, with 33 queries responding in 1 second or less.

Finally, the experiment, which took around 7 minutes to serve about 1,500 queries, cost about $22 in total. Serverless is $0.70 per DBU, and the Large Endpoint scaled up to 7 clusters at its peak. Running this same workload on the best cloud data warehouse on the market, Snowflake, would probably cost around $37. Why pay more for just warehousing, when you can get a data platform with ETL orchestration and visibility through Delta Live Tables, Machine Learning and AI built right in with AutoML capabilities, and economical warehouse serving, all on one copy of your data in Delta Lake?
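For anyone who wants to sanity-check that bill, here is a back-of-the-envelope calculation. The 40 DBU per cluster-hour rate for a Large endpoint is my assumption, and treating the 7-cluster peak as the average for the whole run makes this an upper-bound sketch rather than an exact invoice.

```python
# Back-of-the-envelope cost check for the ~7 minute run.
dbu_per_cluster_hour = 40   # ASSUMED rate for a Large endpoint
clusters = 7                # peak cluster count from the run
hours = 7 / 60              # ~7 minute experiment
price_per_dbu = 0.70        # Serverless list price, $/DBU

dbus = dbu_per_cluster_hour * clusters * hours
print(f"{dbus:.1f} DBUs -> ${dbus * price_per_dbu:.2f}")
# ~32.7 DBUs -> ~$22.87, in line with the ~$22 observed
```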
