1 yottabyte [YB] = 1,000,000,000,000,000,000,000,000 bytes
or 1000^8 bytes = 10^24 bytes = 1K zettabytes = 1M exabytes = 1B petabytes.
It is truly remarkable how BigData is at the peak of its hype cycle these days, yet how few understand the issues at hand when dealing with really large platforms.
BigData comes in so many definitions and shapes that one could argue at length about the 'proper' definition of BigData. I am going to take a shortcut here and simply assert that it is, at the very least, the following:
BigData Simplified: Too much data to fit into a single server.
That was easy. It is not meant to be a complete definition; it is the core of what I am getting at. BigData runs on big distributed platforms, commonly massively parallel processing (MPP) platforms, where work is performed in parallel to scale out the ability of servers to process vast amounts of data. Many enterprises have arrived at 100s of terabytes of data - not all are storing and processing all of it - and some are deep into the petabytes and beyond. So it's really about parallel processing and how to make that easy.
Whether you look at commercial implementations or open source platforms like Hadoop, they all share one universal principle: parallel processing.
So BigData and parallel processing commonly go hand in hand. There are only very few exceptions, and for now I will simply pretend they don't matter.
Many of us have grown up with the understanding that parallel processing is hard to come by, hard to write and hard to understand. This is where various parallel processing platforms like Hadoop, Teradata, Greenplum, Netezza, Vertica, and so many more come into play. The promise of all of them is to make parallel processing easy and simple to use - so much so that 99% of the people using these platforms have no real understanding of what parallel processing means. As powerful as it is in a perfect world, it gets just as nasty when things go wrong. And wrong it goes very quickly; but since most users lack an understanding of the foundational basics, the problem is often ignored until you simply run out of scalability.
Unless you are only doing straight roll-ups of web or other logs, real-world use cases are quick to kill the biggest and fastest parallel supercomputers. We are talking data and processing skew. If ignored and unmanaged, it can eat away at vast amounts of processing capacity, and more often than not it delays delivery and processing on the majority of systems out there. Think of a 1000 node Hadoop cluster that performs as if it only had 10 nodes, or a 100 node instance that looks like 5 nodes. The larger it gets, the worse it usually is.
A 1000 node Hadoop cluster can run at the capacity of 10 nodes or less for workloads with low Parallel Efficiency [PE].
Having a 1000 node cluster means nothing if you are not extremely careful about managing it. There are other factors, like the networking interconnect, that will impact larger clusters as well, but nothing kills a massively parallel system faster than skew, or a lack of Parallel Efficiency (PE). I will provide the definitions of PE and some other key metrics in a followup article in the near future. For now let's just assume that a PE of 100% is a perfectly parallel operating task. It's what everybody assumes when they run a job on one of these large systems.
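To make the notion concrete, here is a minimal sketch of one common proxy for PE - an assumption on my part, not the formal definition I will give in the followup: compare the average work per unit of parallelism against the busiest unit. A perfectly balanced job has average == max, i.e. 100% PE.

```python
# Sketch of a PE proxy (assumed formula, not the article's formal definition):
# average work per unit of parallelism divided by the work on the busiest unit.

def parallel_efficiency(cpu_seconds_per_node):
    """cpu_seconds_per_node: CPU seconds a job consumed on each node."""
    busiest = max(cpu_seconds_per_node)
    if busiest == 0:
        return 1.0                          # no work done, nothing skewed
    average = sum(cpu_seconds_per_node) / len(cpu_seconds_per_node)
    return average / busiest

# A badly skewed job on a 1000 node cluster: one node does almost everything.
skewed = [1000] + [10] * 999
pe = parallel_efficiency(skewed)
print(f"PE = {pe:.1%}")                             # -> PE = 1.1%
print(f"effective nodes: {len(skewed) * pe:.0f}")   # -> effective nodes: 11
```

With numbers like these, the '1000 nodes behaving like 10' claim above is just arithmetic: effective nodes roughly equal node count times PE.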
Now let's take a look into the reality of a large system. Below you find a snapshot of how jobs and tasks distribute when you plot PE against the CPU consumed by a job. In order to see the entire spectrum of workloads and how they cluster, CPU is plotted on a logarithmic scale. I remember the first time I saw this S-curve years ago and how everything finally made sense. We have been looking at PE for many years (since the late 1990s) on every system with more than 2 units of parallelism. Given that workloads can run 100-1000x slower when PE drops, it is more important than almost anything else you monitor.
You might be throwing away 99% of your n node cluster for certain workloads without knowing it. Really.
The plot shows the S-curve of PE for a given time period (in this case a single day and all the jobs executed on it). It turns out to be a universal curve that ALL parallel systems follow. Only the scale of the x-axis (CPU) shifts depending on the technology. Parallel databases like Teradata shift to the left, while batch platforms like Hadoop shift to the right. It's simply a matter of the size of the jobs being executed. On Hadoop it takes a bit of digging to get to the PE of individual jobs, namely exploring some of the data on the name node. On the parallel database side, systems like Teradata come fully instrumented to measure PE at the individual job and SQL level.
What you are looking at is a plot of individual jobs that ran on a given day. Without much surprise, you will not find a single job that achieved 100% PE. The larger the job, the better the conditions for getting close to the optimum. To make the real story of the graph visible, we color code the plot with a simple 5-Bin Red-Yellow-Green spectrum, from PE Bin 5 in dark green down to PE Bin 1 in dark red. 5 is good, 1 is bad. As simple as that.
As you can see, the Bins are sub-curves that follow the shape of the theoretical optimum: a 50,000 CPU second job at 35% PE is dark red, while a 5 CPU second job at 35% PE is dark green. There is no such thing as a meaningful average PE for a system, hence the Bins we overlay. In the end we only care about which Bin a job ended up in. You will find that we use the 5-Bin approach a lot with all BigData system KPIs. It keeps things very simple, and even with that simplicity, a population of developers and users will still struggle to understand why they should care.
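Since this article does not publish the actual Bin boundaries, here is a purely illustrative sketch of size-dependent binning. The log-scaled 'expected PE' curve and the thresholds are invented stand-ins, tuned only to reproduce the two examples above:

```python
import math

# Illustrative only: the 'expected PE' curve and ratio thresholds below are
# made-up stand-ins, not the real Bin boundaries used on production systems.

def pe_bin(cpu_seconds, pe):
    """Return a PE Bin from 5 (good) down to 1 (bad) for a job of a given size."""
    # Expectations rise with job size: tiny jobs cannot reach high PE,
    # while big jobs are expected to get close to the optimum.
    expected = min(1.0, 0.2 * math.log10(max(cpu_seconds, 1)) + 0.1)
    ratio = pe / expected
    if ratio >= 0.85:
        return 5
    if ratio >= 0.70:
        return 4
    if ratio >= 0.55:
        return 3
    if ratio >= 0.40:
        return 2
    return 1

print(pe_bin(5, 0.35))        # small job at 35% PE  -> 5 (dark green)
print(pe_bin(50_000, 0.35))   # large job at 35% PE  -> 1 (dark red)
```

The design point is only that the same PE value earns a different Bin depending on job size, which is exactly what the sub-curves in the plot express.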
In addition, I have overlaid the length of a job as the size of the bubble. As expected, you will find the longer running jobs on the right side of the CPU axis. A perfectly linear system would also show many of the red dots to be larger, but workload management helps to smooth things out a bit - not that this is necessarily a good thing, as you will soon see. You can also see a few outliers throughout the chart; to understand those, and the Bin 1 and 2 populations, we are going to take another look at the chart with different data overlaid.
In order to better understand the impact of the PE Bin 1 and 2 jobs, we track another metric in our repertoire of BigData KPIs: Skew Overhead [SO]. Without spending too much time on how we calculate it, SO is basically a measure of how much larger a job or query 'felt' to the system due to its lack of Parallel Efficiency. Let's take the previous chart and overlay Skew Overhead instead of the runtime of the jobs.
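One plausible reading - my assumption, since the exact calculation is deliberately skipped here - is that a job 'feels' like its actual CPU divided by its PE, and SO is the excess over what it actually consumed:

```python
# Assumed SO sketch (not the production formula): a job with bad PE 'feels'
# like actual CPU / PE, and the overhead is everything beyond the actual CPU.

def skew_overhead(cpu_seconds, pe):
    effective_cpu = cpu_seconds / pe        # how large the job 'felt'
    return effective_cpu - cpu_seconds      # the part that is pure overhead

# A 1,000 CPU second job at 10% PE feels like 10,000 CPU seconds:
print(skew_overhead(1_000, 0.10))           # -> 9000.0
```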
Overlaying Skew Overhead on the PE curve starts telling a slightly different story - the one that gets missed all the time. In this view we start getting the real picture of what bad-PE workloads do to a BigData system. The largest impact from individual jobs and workloads is now clearly visible in the PE Bin 1 and 2 portion of the plot.
If you look closely you will find all kinds of patterns revealing themselves and without going into much more detail in this article, there are clearly similarity patterns visible that can be leveraged in other algorithms and analysis that go beyond the scope of the PE work we are looking at here.
So how is it that these large-SO jobs don't also become the longest running jobs on the system, as in the prior chart? For that we need to take another step and overlay yet another metric on the PE curve: Burn Rate [BR] - the % of the system that was applied to an individual job at the time it was executed. As you might imagine, this one takes you deep into workload management territory - oh so many topics to talk about, yet so little time... ;-)
Let's take a look at what happens with BR as an overlay on PE. The blue dots receive as little as <0.1% of the system, while the orange jobs get in excess of 20-40% of the system. Size is again the runtime of the individual jobs. What you are seeing is very typical behavior of a deployed BigData system.
Large and inefficient jobs get higher priorities to make them run faster. "I need High Priority" is a very common statement thrown out there to make things happen faster. 'Good' (aka efficient) jobs are almost starved of resources, while the ugly ones are cranked through the system at record rates. Some of these workloads see as much as 40% of system resources for a single job or query, just so it finishes in the time we would like to see.
Here is the dilemma: not only are these bad jobs with bad PE; because we throw so much of the system at them, we starve everything else by that same amount. If a single bad-PE job gets 40% of the entire system, the rest has 40% less available to it. This is where the overall system impact becomes very real. A handful of bad-PE jobs receiving 'preferred' treatment can impact an entire system far beyond the obvious.
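A hedged back-of-the-envelope with made-up numbers makes the starvation effect tangible: grant one bad-PE job a 40% Burn Rate and let 100 well-behaved jobs share whatever is left.

```python
# Made-up numbers for illustration: one bad-PE job at a 40% Burn Rate,
# 100 well-behaved jobs splitting the remainder.
system = 1.0
bad_job_br = 0.40
others = 100

fair_share = system / (others + 1)            # equal split across all 101 jobs
actual_share = (system - bad_job_br) / others # what each other job really gets
slowdown = 1 - actual_share / fair_share
print(f"each remaining job loses {slowdown:.0%} of its fair share")  # -> 39%
```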
These X/Y plots are great for showing us the patterns at work, but they are horrible at conveying the volumetrics of the problem. For that, let's use some good old bar charts to look at how much of a problem we are actually dealing with.
This should make things very obvious. The perfect PE Bin 5 of our plots in fact contains 99.5% of all jobs. Don't expect this from your average Hadoop installation; this is a very well managed system, where PE has been monitored and measured.
In this example only 0.03% of the queries ended up in Bin 1 of the PE distribution. If you slide one chart to the right and look at the CPU distribution, you will find PE Bin 5 still holds a commanding 85% of the workload, and Bin 1 only makes up about 3% of the CPU consumed. The third, or middle, chart calculates the Effective CPU [EC], sometimes also referred to as Impact CPU. This is how the workload really 'feels' to the system.
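The kind of roll-up behind these bar charts can be sketched with made-up numbers (not the actual data shown here), assuming Effective CPU = actual CPU / PE - an assumption, since the formula is not spelled out in this article:

```python
# Illustrative roll-up with invented jobs: aggregate actual CPU and Effective
# CPU (assumed here as CPU / PE) per PE Bin, as the bar charts do.
jobs = [                        # (pe_bin, cpu_seconds, pe)
    (5, 10_000, 0.95),
    (5, 10_000, 0.92),
    (5, 10_000, 0.90),
    (1, 1_000, 0.02),           # one badly skewed job
]

actual, effective = {}, {}
for b, cpu, pe in jobs:
    actual[b] = actual.get(b, 0) + cpu
    effective[b] = effective.get(b, 0) + cpu / pe

total_ec = sum(effective.values())
for b in sorted(effective, reverse=True):
    print(f"Bin {b}: {actual[b]:>6} actual CPU s, "
          f"{effective[b] / total_ec:.0%} of Effective CPU")
```

In this toy example the one Bin 1 job holds about 3% of the actual CPU yet accounts for roughly 60% of the Effective CPU - the same lopsidedness the bar charts reveal.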
The impact of what we are looking at is starting to become clear: 0.03% of queries produce 17% of the Effective CPU [EC] of the system. Now move over to the 4th chart and look at the Skew Overhead [SO] we discussed before: 0.03% of queries make up 36% of the Skew Overhead of the system - more than all of the great queries (the other 99.5%+) on the system combined.
Just for completeness, I have added the average Burn Rate [BR] on the very right side. On average, the system highlighted puts 0.2% of its resources behind individual jobs and queries in PE Bin 5, and almost 20% behind its worst, PE Bin 1, jobs. Again, that reflects the very common behavior of throwing high priorities at bad workloads to make them complete faster.
If you combine PE Bins 1 and 2 you will see that 0.1% of queries in this example make up for 53% of the Overhead of the system.
0.1% of jobs are responsible for 53% of the PE overhead of the system.
The numbers in this article come from well managed platforms that have been optimized for throughput and to minimize the impact of lost PE. In reality, most BigData systems out there that have not been measured and optimized will perform far worse. I have seen statistics from systems where 0.1% of jobs made up 98+% of system overhead. That is millions of dollars out the window every year.
The question is: do you measure the PE of your BigData implementations? If not - and I expect that is true for 99% of the readers of this article - you are in for a potentially HUGE surprise.
The main takeaway of this article is that you really have to focus on the PE of BigData platforms. If you spend your time on 0.1% or less of the jobs and fix them, the return is equal to about 30-50% of overall capacity, while at the same time some of these jobs will run 100x or more faster without the need for ridiculous Burn Rates.
A very special thanks goes out to Nachum Shacham - my former Chief Data Scientist at eBay - who helped me evolve some of these metrics and their applications, specifically for our Hadoop and Teradata clusters (some of the world's largest). He presented his findings at the XLDB 2011 event at Stanford.
Coming Soon: Big Data System KPIs - A deep dive into the KPIs every BigData system should be measured with.
Looking forward to your thoughts and comments!