July 25, 2012 | By Rey Villar
“Let’s get this right. One user accounts for more than half of our ad hoc usage?” The data warehouse manager was incredulous. The warehouse had hundreds of ad hoc users, but it appeared that a tiny group of users – precisely one – accounted for the vast majority of ad hoc usage.
Observation of unexpected patterns like this is nothing new. Vilfredo Pareto was an Italian economist who observed similarly unexpected distributions in 19th-century Italy. Amongst other things, he was surprised to discover that 80% of the land and wealth in Italy was owned by just 20% of the population. Pareto introduced us to the counterintuitive concept that there can be a large divide between inputs and expected results.
Pareto’s observation would later become the ‘80/20 principle.’ This timeless concept has profound implications in our data-wielding efforts – for things like resource usage, report utilization, and data consumption. In the data arena, Pareto’s 80/20 rule might suggest that 80% of our resources are consumed by just 20% of our users, that 80% of business demand is satisfied by 20% of the reports, or that 80% of database needs are satisfied by just 20% of the tables. Expect the percentages to vary.
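Spotting a Pareto split like the one that startled the warehouse manager is a short exercise once you have a usage log. The sketch below is a minimal, hypothetical illustration – the user names and query counts are invented, not from the article – that ranks users by ad hoc usage and walks the cumulative share until 80% is covered.

```python
from collections import Counter

# Hypothetical ad hoc query log: one entry per query, keyed by user.
# All names and counts here are illustrative assumptions.
query_log = (
    ["analyst_07"] * 550   # the lone heavy hitter from the anecdote
    + ["analyst_01"] * 120
    + ["analyst_02"] * 90
    + ["analyst_03"] * 60
    + ["analyst_04"] * 40
    + ["analyst_05"] * 25
    + ["analyst_06"] * 15
)

usage = Counter(query_log)
total = sum(usage.values())

# Rank users by usage, then find the smallest group of users
# that accounts for at least 80% of all ad hoc queries.
cumulative = 0
heavy_hitters = []
for user, count in usage.most_common():
    cumulative += count
    heavy_hitters.append(user)
    if cumulative / total >= 0.80:
        break

print(f"{len(heavy_hitters)} of {len(usage)} users "
      f"({len(heavy_hitters) / len(usage):.0%}) drive "
      f"{cumulative / total:.0%} of ad hoc usage")
```

With these invented numbers, a single user already accounts for over half the queries, and three of seven users cover more than 80% – the same shape of surprise the warehouse manager ran into.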
Observations in a world without data skew are highly predictable and user-friendly. A unit of work input yields a unit of work output. Doubling the data warehouse staff cuts the project duration in half. Much to our frustration, Pareto’s Principle eats our simple linear model for lunch. Data skew moves the markers, disses assumptions, and explains periods of spectacular gains and similarly spectacular failures.
Data quality efforts owe a particular debt to Pareto’s line of thinking. We can tackle the majority of the data quality issues at the lowest cost and effort by concentrating on the most frequent offenders. Ridding the data of the final fifth may absorb an immense effort that can be difficult to justify (or fund). Concentrate efforts on the big issues that bubble up out of the froth – and do Pareto proud.
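The same cumulative-share walk identifies the “most frequent offenders” in a data quality backlog. This is a hypothetical sketch – the defect types and counts are invented for illustration – showing how ranking defects by frequency reveals the small set worth fixing first.

```python
from collections import Counter

# Hypothetical tally of data quality defects by type; the defect
# names and counts are illustrative assumptions, not from the article.
defects = Counter({
    "missing_customer_id": 4200,
    "malformed_phone": 2600,
    "stale_address": 1100,
    "duplicate_record": 600,
    "bad_country_code": 300,
    "invalid_date": 150,
    "misc_oddities": 50,
})

total = sum(defects.values())

# Take the most frequent defect types until 80% of all
# defect instances are covered: the cheapest wins come first.
covered = 0
fix_first = []
for defect, count in defects.most_common():
    covered += count
    fix_first.append(defect)
    if covered / total >= 0.80:
        break

print(f"Fixing {len(fix_first)} of {len(defects)} defect types "
      f"clears {covered / total:.0%} of all defects: {fix_first}")
```

In this made-up example, fixing three of seven defect types clears nearly 88% of all defect instances – while “ridding the data of the final fifth” would mean chasing the long tail of rare oddities.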
We need to regularly vet our work against Pareto-type benchmarks. Are we providing the 20% of the data that will satisfy 80% of the needs? Do we have complete requirements for the 20% of the requested reports that will satisfy 80% of our users? Conversely, where can we concentrate our efforts to deliver the 80% of output with just 20% of the effort?
The brave think through the smoke and noise to observe and measure, leveraging the Pareto Principle. (Beware that the foolhardy may also do the same.) Awareness of the Principle is just the first step. Experience and knowing how to ask the right questions can help create those magical moments when leverage is king.
Another arena where Pareto would run in mad circles is project management. Project managers need to identify the 20% of the effort (and resources) that may generate 80% of the output (and results). Your management team may be spending roughly 80% of their time on much less productive tracks. Inputs to our analytical and management efforts can become voluminous to the point of ridiculousness – as one blogger noted, “you can always find more data.” Our job in both project and data management is to filter these copious inputs down to what is essential and important. Pareto thinking plays well in this space.
Before embarking on a massive IT effort, or even a modest one, review the Pareto principle with your team and ask them to identify the 20% of the effort that will yield 80% of the gains. Ask your team where their individual efforts may possibly yield Pareto-style returns.
This so-called ‘low-hanging fruit’ is often hiding in plain sight in the data. Be ever mindful of the single user consuming over 50% of your resources, or the one discovery that can be leveraged to steer your project to a ridiculous level of success. Find the 20% that will drive remarkable gains.
Pareto provided us with a model of analysis that exploits the obvious before getting tied down in details. His 19th century advice continues to be timely.