October 7, 2011 | By Rey Villar
Hadoop is a powerful data delivery alternative to traditional relational database management systems (RDBMS). Unlike an RDBMS, Hadoop does not store data in tables of rows and columns, and it does not use set manipulation (SQL) to process data. Hadoop offers open source code, the flexibility to run on commodity hardware, lightning-fast output and virtually linear scalability. No wonder Hadoop is quickly gaining momentum in the BI world.
Hadoop, whose storage layer is the Hadoop Distributed File System (HDFS), can quickly scale data distribution based on the number of available resources. The power of Hadoop lies in its ability to distribute, process, and rapidly return analysis on very large data sets.
Traditional ETL and RDBMS systems process data using set-based transformations and store it in tabular form for end-user access. These SQL-style operations demand intensive data processing before the data is ready for consumption. Such massively parallel platforms, while fast, are challenged to process and deliver data at a low enough latency to meet near real-time aggregation needs.
Unlike traditional RDBMS, Hadoop stores data in key and value pairs. Each data point contains a key and an associated value/metric. The key is used to access the data value. Incremental data can be made available for near real-time access.
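As a minimal sketch of this storage model, plain Python can illustrate how each data point is a key with an associated value/metric, and how the key is used to access the value (the "patient:<id>|<month>" key format here is an illustrative assumption, not anything Hadoop prescribes):

```python
# Hypothetical key/value records: each data point is a key plus a value/metric.
records = [
    ("patient:1001|2011-10", 3),  # key -> visit count for the month
    ("patient:1002|2011-10", 1),
    ("patient:1001|2011-11", 2),  # incremental data: simply append new pairs
]

# The key is used to access the data value.
index = dict(records)
print(index["patient:1001|2011-10"])  # -> 3
```

Because new pairs can simply be appended rather than merged into an existing table structure, incremental data can be made available with very little processing.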
Scalability is embedded in the architecture of a Hadoop cluster. A cluster is composed of a group of nodes managed by HDFS, which parses data into chunks that are distributed to and processed at each node. Each node is functionally independent of the rest and can transform, translate, cleanse, filter and apply business rules to the input data it receives. Nodes work in isolation to process and produce output. Each node executes a “mapper” program to produce key/value pairs that are stored locally on the node. The output of the mapper process is then supplied as the input to the “reducer” process.
The reducer process merges and combines the value results within a specific key. Aggregates like sum, min and max are applied on the values against unique sets of keys. The output of the reducer process is then stored locally on each of the nodes. The data dispersed across the nodes can then be gathered for presentation of the desired data analytics or metrics.
The mapper and reducer processes together are known as the “MapReduce” model. MapReduce programs for Hadoop can be written in many languages, including Java, Python, PHP, and Perl.
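The mapper/reducer flow described above can be sketched in plain Python with the classic word-count example (a real Hadoop job would read and write files or stdin/stdout via Hadoop Streaming; this sketch only models the logic):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (key, value) pair for every word in one line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    """Merge and combine the values within each key (here, a sum aggregate)."""
    pairs = sorted(pairs, key=itemgetter(0))  # Hadoop sorts by key between phases
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

lines = ["big data big results", "big cluster"]
mapped = [kv for line in lines for kv in mapper(line)]
reduced = dict(reducer(mapped))
print(reduced)  # -> {'big': 3, 'cluster': 1, 'data': 1, 'results': 1}
```

In a real cluster, each node would run the mapper over its own chunk of the input, and the sort-and-group step would happen across nodes before the reducers apply their aggregates.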
A MapReduce process is tested with a small chunk of data and, once debugging is completed, executed on the full cluster. Scaling the infrastructure up or down and reprioritizing nodes in a cluster is very easy. Little to no programming change is required to scale a MapReduce process from one node to n nodes.
During run time the Hadoop infrastructure manages distribution of data onto all available nodes. The number of nodes can be scaled up to meet temporary needs with near-linear scalability: for example, a MapReduce program that uses twice the number of nodes will cut processing time roughly in half. This is perfect for dealing with peak processing times – like end of month, quarter, and year – and a cost-effective alternative to the ramp-up costs associated with peak demand management of traditional ETL data processing.
The Hadoop ecosystem contains four open source projects that handle major tasks familiar to the BI crowd:
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- ZooKeeper: A high-performance coordination service for distributed applications.
Hadoop/HDFS has some powerful advantages:
- Fast performance and low latency on large volumes of data.
- Development and analysis of additional sets of data points are very straightforward.
- Resources devoted to data processing can be scaled up incrementally and even temporarily.
- Hadoop allows for incremental ETL development. It is not necessary to have the data models, business rules, and all data points fleshed out for the entire initiative. File handling allows for more versatile development without the need to run through the entire SDLC.
- Hadoop/HDFS is an open source infrastructure. The developer community is continuously updating and improving the code base, so significant updates to the Hadoop project may arrive far more often than the rigid release schedules of traditional software vendors allow. With vendor-controlled software, you may have to wait for a patch or bug fix.
- Scalability of data and performance enables addition of temporary nodes leveraging cloud infrastructure to meet peak demands.
As well as some disadvantages:
- Data may be subject to duplication into individual repositories by different groups and analysts. A governance policy should be put in place if this is a concern.
- Hadoop MapReduce program outputs are files. You can bridge Hadoop into the world of SQL by combining the two: process large volumes of data with Hadoop, then bulk load the aggregated output files of the MapReduce programs into an RDBMS/appliance database. Analysts can then use their SQL-based tools and skills for analysis and reporting.
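The bulk-load bridge in the last point can be sketched as follows. This hedged example assumes the reducer output is tab-delimited key/value text (a common but not universal format) and uses an in-memory SQLite database as a stand-in for a real RDBMS or appliance:

```python
import csv
import io
import sqlite3

# Sample reducer output as it might appear in an HDFS part file (assumed format).
part_file = io.StringIO("cluster\t1\nbig\t3\ndata\t1\n")

# Stand-in for a real warehouse/appliance database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, n INTEGER)")

# Bulk load the aggregated key/value pairs into the table.
rows = ((word, int(n)) for word, n in csv.reader(part_file, delimiter="\t"))
conn.executemany("INSERT INTO word_counts VALUES (?, ?)", rows)
conn.commit()

# Analysts can now work with familiar SQL.
total = conn.execute("SELECT SUM(n) FROM word_counts").fetchone()[0]
print(total)  # -> 5
```

In practice the load step would use the target database's bulk loader (or a tool such as Sqoop) rather than row-by-row inserts, but the division of labor is the same: Hadoop does the heavy aggregation, SQL does the reporting.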
I had the opportunity to use Hadoop at a recent engagement, and was eager to see the benefits applied to the health care arena. The results, while anticipated, were astounding. My next blog, “Using Hadoop for BI: Healthcare datasets,” will recount my experiences.
If you have experiences with Hadoop, please reach out and share them as well.