July 14, 2013
The following is the most comprehensive and up-to-date information that I have found on Hadoop. It was written by Kai Wähner (see bio at the end) on July 9, 2013 right after the Hadoop Summit.
Big data is becoming a relevant topic in many companies this year. Although there is no standard definition of the term "big data", Hadoop is the de facto standard for processing big data. Almost all big software vendors such as IBM, Oracle, SAP, or even Microsoft use it. However, once you have decided to use Hadoop, the first question is how to start and which product to choose for your big data processes. Several alternatives exist for installing a version of Hadoop and realizing big data processes. This article discusses the different alternatives and recommends when to use which one.
Alternatives for Hadoop Platforms
The following picture shows different alternatives for Hadoop platforms. You can install just the Apache release, choose one of several distributions from different vendors, or decide to use a big data suite. It is important to understand that every distribution contains Apache Hadoop, and that almost every big data suite contains or uses a distribution.
Let’s now take a closer look at the different alternatives, beginning with Apache Hadoop in the next section.
The current Apache Hadoop project (version 2.0) includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
It is very easy to install it as a standalone installation on a local system (just unzip, set a few environment variables, and start using it). However, this is just for getting started and doing some basic tutorials.
If you want to install it on one or more "real nodes", installation becomes more complex.
Problem 1: Complex Cluster Setup
Pseudo-distributed mode installation helps you simulate a multi-node installation on a single node. Instead of installing Hadoop on different servers, you simulate it on a single server. Even in this mode, you already have to do a lot of configuration. If you want to set up a cluster with several nodes, it becomes more complex, of course. If you are not an experienced administrator, you will struggle a lot with user rights, access rights, and similar issues.
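To give a feeling for the kind of configuration meant here, the following is a minimal sketch that renders two Hadoop-style `*-site.xml` documents. It assumes the property names `fs.defaultFS` and `dfs.replication` as used for single-node setups of Hadoop 2.x (Hadoop 1.x used `fs.default.name` instead); the host, port and the helper name `hadoop_site_xml` are illustrative assumptions, not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed values for a single-node (pseudo-distributed) setup:
# HDFS listens on localhost, and each block is stored only once.
core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = hadoop_site_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

A real multi-node cluster needs considerably more than this (host lists, directories, memory settings, user permissions), which is exactly where the effort grows.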
Problem 2: Usage of Hadoop Ecosystem
At Apache, all projects are independent. This is a good thing! However, the Hadoop ecosystem contains not just Hadoop, but many other Apache projects such as:
- Pig: A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: A data warehouse system for Hadoop that offers a SQL-like query language to facilitate easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
- HBase: A distributed, scalable, big data store with random, real-time read/write access.
- Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
- Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
- ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- and many others.
These projects have to be installed and integrated manually into Hadoop.
You have to take care of the different versions and releases yourself. Unfortunately, not all releases work together perfectly. You have to compare release notes and figure out which combinations work. Hadoop itself already comes in many different versions, branches, feature sets, etc. There is not just a version 1.0, 1.1, 2.0, etc., as you know it from other projects. See the article "Genealogy of elephants" for more details about "Hadoop versioning hell".
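The manual bookkeeping described above can be sketched as a small lookup. The component names and tested version lines below are made up purely for illustration; real values would have to come from each project's release notes.

```python
def incompatible_components(hadoop_line, selected, tested_with):
    """Return the selected components not listed as tested against hadoop_line."""
    return sorted(
        name for name in selected
        if hadoop_line not in tested_with.get(name, set())
    )


# Made-up data for illustration only (hypothetical components and versions).
TESTED_WITH = {
    "query-tool 0.9": {"1.x", "2.x"},
    "import-tool 1.2": {"1.x"},
    "log-collector 1.1": {"2.x"},
}

problems = incompatible_components("2.x", list(TESTED_WITH), TESTED_WITH)
print(problems)  # ['import-tool 1.2']
```

With only three components the table is trivial; with a dozen projects and several Hadoop branches, maintaining it by hand is the work a distribution takes off your shoulders.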
Problem 3: Commercial Support
Apache Hadoop is "just" an open source project. This has a lot of benefits. You can access and change the source code. Several companies use and extend the code base and add features. Discussions, articles, blog posts and mailing lists offer a lot of information.
However, a real problem is to get commercial support for an open source project such as Apache Hadoop. Companies usually just offer support for their products, not for an open source project (this is not just a problem for Hadoop, but for many open source projects).
When to use Apache Hadoop?
Apache Hadoop is good for a first try due to its 10-minute installation on a local system in standalone mode. You can try out the WordCount example (which is the "hello world" example of Hadoop) and take a look at some MapReduce Java code.
If you do not intend to use a "real" Hadoop distribution (see next section), Apache Hadoop is also the right choice. However, I wonder if there is any reason not to use a Hadoop distribution, as they are also available in a free, non-commercial edition.
So, for real Hadoop projects, I really recommend using a Hadoop distribution instead of just Apache Hadoop. The advantages are explained in the upcoming section.
A Hadoop distribution solves the problems mentioned in the previous section. The business model of these vendors relies one hundred percent on their Hadoop distributions. They offer packaging, tooling and commercial support. This reduces effort a lot, not just for development, but also for operations.
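For readers who have not seen WordCount yet, here is a minimal in-process sketch of the idea in Python. It does not use the Hadoop API at all (the real example is written in Java against the MapReduce classes); the function names are illustrative and only mimic the map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["Hello Hadoop", "hello big data"])
print(counts)
```

On a cluster, the map and reduce functions run in parallel on many nodes and the framework performs the grouping step across the network; the logic per record stays this simple.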
A distribution packages different projects of the Hadoop ecosystem. This ensures that all included versions work together smoothly. There are regular releases with updated versions of the different projects.
On top of the packaging, vendors of distributions offer graphical tooling for deployment, administration and monitoring of Hadoop clusters. This makes it a lot easier to set up, manage and monitor complex clusters.
As mentioned, it is also difficult to get support for plain Apache Hadoop, while vendors provide commercial support for their own Hadoop distribution.
Vendors of Hadoop Distributions
Besides Apache Hadoop, it is more or less a three-horse race between Hortonworks, Cloudera and MapR for Hadoop distributions at the moment. However, other Hadoop distributions are emerging, too. For example, there is Pivotal HD by EMC Corporation or IBM InfoSphere BigInsights. With Amazon Elastic MapReduce (EMR), Amazon even offers a hosted, preconfigured solution in its cloud.
Many other software vendors do not develop their own Hadoop distribution, but work together with one of the existing vendors. For example, Microsoft partners with Hortonworks, especially to bring Apache Hadoop to its operating system Windows Server and to its cloud service Windows Azure. Another example is Oracle offering a big data appliance which combines hardware and software of Oracle with Cloudera’s Hadoop distribution. Some vendors such as SAP or Talend offer support for several different distributions.
How to choose the right Hadoop Distribution?
The evaluation of Hadoop distributions is out of scope of this article. Nevertheless, the major players are described briefly. Often, there are just subtle differences between distributions, which vendors consider their secret sauce and main differentiators. The following list explains the differences:
- Cloudera: The most established distribution by far, with the largest number of referenced deployments. Powerful tooling for deployment, management and monitoring is available. Impala is developed and contributed by Cloudera to offer real-time processing of big data.
- Hortonworks: The only vendor which uses 100% open source Apache Hadoop without its own (non-open) modifications. Hortonworks is the first vendor to use Apache HCatalog functionality for metadata services. Besides, their Stinger initiative optimizes the Hive project massively. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks developed and committed enhancements into the core trunk that make Apache Hadoop run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.
- MapR: Uses some concepts that differ from those of its competitors, especially support for a native Unix file system instead of HDFS (with non-open-source components) for better performance and ease of use. Native Unix commands can be used instead of Hadoop commands. Besides, MapR differentiates itself from its competitors with high availability features such as snapshots, mirroring or stateful failover. The company is also spearheading the Apache Drill project, an open-source re-envisioning of Google’s Dremel for SQL-like queries on Hadoop data, to offer real-time processing.
- Amazon Elastic MapReduce (EMR): Differs from the others as it is a hosted solution running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Besides Amazon’s distribution, you can also use MapR on EMR. A major use case is ephemeral clusters. If you need one-time or infrequent big data processing, EMR might save you a lot of money. However, there are some disadvantages, too. Of the Hadoop ecosystem, only Pig and Hive are included, so many other projects are missing by default. Besides, EMR is highly tuned for working with data in S3, which has a higher latency and does not locate the data on your computational nodes. So file I/O on EMR is slower and has higher latency than I/O on your own Hadoop cluster or on your own EC2 cluster.
The above distributions have in common that they can be used flexibly, either by themselves or in combination with different big data suites. Some other distributions that are emerging these days are not as flexible and bind you to a specific software and/or hardware stack. For example, EMC’s Pivotal HD was natively fused with Greenplum’s analytic database to offer real SQL queries and very good performance on top of Hadoop. Another example is the Intel Distribution for Apache Hadoop, which is optimized for solid-state drives, something that other Hadoop companies haven’t done so far.
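The cost argument for ephemeral clusters can be made concrete with a back-of-the-envelope calculation. All numbers below (node count, run time, price per node hour) are made-up assumptions, not Amazon prices, and the sketch ignores storage and data transfer costs.

```python
def ephemeral_cost(nodes, hours_per_run, runs_per_month, price_per_node_hour):
    """Monthly cost of a cluster that only exists while a job is running."""
    return nodes * hours_per_run * runs_per_month * price_per_node_hour


def always_on_cost(nodes, price_per_node_hour, hours_per_month=730):
    """Monthly cost of keeping the same cluster running around the clock."""
    return nodes * hours_per_month * price_per_node_hour


# Hypothetical workload: 10 nodes, two 4-hour jobs per month, 0.50 per node hour.
on_demand = ephemeral_cost(nodes=10, hours_per_run=4, runs_per_month=2,
                           price_per_node_hour=0.5)
permanent = always_on_cost(nodes=10, price_per_node_hour=0.5)

print(on_demand)  # 40.0
print(permanent)  # 3650.0
```

The gap shrinks quickly as jobs become more frequent, so the comparison is worth redoing with your own workload and current prices.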
So, if you already have a specific vendor stack in your enterprise, be sure to check which Hadoop distributions are supported. For example, if you use the Greenplum database, then Pivotal HD might be a perfect match, while in other cases more flexible solutions might be more appropriate. For instance, if you are already using Talend ESB and want to start your big data project with Talend Big Data, then you can choose your desired Hadoop distribution, as Talend does not rely on a specific vendor of a Hadoop distribution.
To make the right choice, read about their concepts and try out different distributions. Check out the tooling and analyze costs for enterprise versions plus commercial support. Afterwards, you can decide which distribution is the right one for you.
When to use a Hadoop distribution?
Due to its advantages such as packaging, tooling and commercial support, a Hadoop distribution should be used in most use cases. There are rare use cases where it is a good idea to use the plain Apache Hadoop release and build your own distribution on top of it. You would have to test your packaging, build your own tooling, and write patches yourself. Other people have already had the problems you would run into. So be sure there are good reasons not to use a Hadoop distribution!
However, even a Hadoop distribution requires a lot of effort. You still need to write a lot of code for your MapReduce jobs, and for integrating all your different data sources into Hadoop. This is where big data suites come in.
Big Data Suite
On top of Apache Hadoop or a Hadoop distribution, you can use a big data suite. A big data suite often supports different Hadoop distributions under the hood. However, some vendors implement their own Hadoop solution. Either way, a big data suite adds several further features to distributions for processing big data:
- Tooling: Usually, a big data suite is built on top of an IDE such as Eclipse. Additional plugins ease the development of big data applications. You can create, build and deploy big data services within your familiar development environment.
- Modeling: Apache Hadoop or a Hadoop distribution offers the infrastructure for Hadoop clusters. However, you still have to write a lot of complex code to build your MapReduce program. You can write this code in plain Java, or you can use optimized languages such as PigLatin or the Hive Query Language (HQL), which generate MapReduce code. A big data suite offers graphical tooling to model your big data services. All required code is generated. You just have to configure your jobs (i.e. define any parameters). Realizing big data jobs becomes much easier and more efficient.
- Code Generation: All code is generated. You do not have to write, debug, analyze and optimize your MapReduce code.
- Scheduling: Execution of big data jobs has to be scheduled and monitored. Instead of writing cron jobs or other code for scheduling, you can use the big data suite for defining and managing execution plans easily.
- Integration: Hadoop needs to integrate data from all kinds of technologies and products. Besides files and SQL databases, you also have to integrate NoSQL databases, social media such as Twitter or Facebook, messages from messaging middleware, or data from B2B products such as Salesforce or SAP. A big data suite helps a lot by offering connectors from all these different interfaces to Hadoop and back. You do not have to write the glue code by hand; you just use the graphical tooling to integrate and map all this data. Integration capabilities often also include data quality features such as data cleansing to improve the quality of imported data.
Vendors of Big Data Suites
The number of big data suites is growing constantly. You can choose between several open source and proprietary vendors. Most big software vendors such as IBM, Oracle, Microsoft, and so on, integrate some kind of big data suite into their software portfolio. Most of them support just one Hadoop distribution, either their own or that of a partner vendor.
On the other side, there are vendors which specialize in processing data. They offer products for data integration, data quality, enterprise service bus, business process management, and further integration components. There are proprietary vendors such as Informatica, and open source vendors such as Talend or Pentaho. Some of these vendors support not just one Hadoop distribution, but several. For instance, at the time of writing this article, Talend can be used together with Apache Hadoop, Cloudera, Hortonworks, MapR, Amazon Elastic MapReduce or a custom self-created distribution (for example to use EMC’s Pivotal HD).
How to choose the right Big Data Suite?
The evaluation of big data suites is out of scope of this article. There are several aspects which should be considered when choosing a big data suite. The following aspects should help you make the right choice for your big data problem:
- Simplicity: Try out the big data suite by yourself. This means: install it, connect it to your Hadoop installation, integrate your interfaces (files, DBs, B2B, etc.), and finally model, deploy and execute some big data jobs. Learn how easy it is to use the big data suite by yourself – it is not enough to let some consultants of a vendor show you how it works. Do a proof of concept by yourself!
- Prevalence: Does the big data suite support widely used open standards, i.e. not just Apache Hadoop and its ecosystem, but also integration of data via SOAP and REST web services, etc.? Is it open source and easy to change or extend regarding your specific problems? Is there a large community including documentation, forums, articles, blog posts and conference talks?
- Features: Are all required features supported? The Hadoop distribution (if you are already using one)? All parts of the Hadoop ecosystem you want to use? All interfaces / technologies / products you have to integrate? Be aware that too many features might increase complexity and costs a lot. So also check out if you really need a very heavyweight solution. Do you really need all of its features?
- Pitfalls: Be aware of several pitfalls. Some big data suites apply data-driven costs ("data tax"), i.e. you have to pay for every data row which you process. This can get very expensive as we are talking about BIG DATA. Not all big data suites generate native Apache Hadoop code; often a proprietary engine must be installed on each server of the Hadoop cluster. This increases license costs and removes vendor independence. Also think about what you really want to do with the big data suite. Some solutions just support Hadoop for ETL to load data into a data warehouse, others also offer features such as post-processing, transforming or analyzing big data on Hadoop clusters. ETL is just one use case of Apache Hadoop and its ecosystem.
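To see how quickly a "data tax" adds up, here is a small calculation. The pricing model (a fee per million processed rows) and all numbers are hypothetical assumptions for illustration; actual license models differ from vendor to vendor.

```python
def monthly_data_tax(rows_per_day, price_per_million_rows, days=30):
    """Monthly fee under an assumed per-row pricing model."""
    rows_per_month = rows_per_day * days
    return rows_per_month / 1_000_000 * price_per_million_rows


# Hypothetical: 500 million rows per day at 2.00 per million rows.
fee = monthly_data_tax(rows_per_day=500_000_000, price_per_million_rows=2.0)
print(fee)  # 30000.0
```

Because the fee scales linearly with data volume, it grows exactly when a big data project becomes successful, which is worth checking before you commit to a suite.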
Decision Tree: Framework vs. Distribution vs. Suite
You now know the different Hadoop alternatives. Finally, let’s summarize and discuss when to choose the Apache Hadoop framework, a Hadoop distribution, or a big data suite.
The following "decision tree" will help you choose the right one:
- Apache Hadoop: Learn and understand low-level details? Expert who wants to choose and configure everything yourself?
- Hadoop distribution: Easy setup? Learning (newbie)? Tooling for deployment? Commercial support needed?
- Big data suite: Integration of several different sources? Commercial support needed? Code generation instead of coding? Graphical scheduling of big data jobs? Realizing big data processes (integration, manipulation, analysis)?
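The questions above can be read as a simple rule set. The following sketch is one possible encoding, assuming the first two questions point to Apache Hadoop, the next four to a Hadoop distribution, and the remaining ones to a big data suite; the keyword names are illustrative, and since commercial support appears for both distributions and suites, it is counted here as a distribution-level need.

```python
APACHE = "Apache Hadoop"
DISTRIBUTION = "Hadoop distribution"
SUITE = "big data suite"

# Assumed grouping of the decision-tree questions into requirement keywords.
SUITE_NEEDS = {"integration", "code_generation", "graphical_scheduling",
               "big_data_processes"}
DISTRIBUTION_NEEDS = {"easy_setup", "learning", "deployment_tooling",
                      "commercial_support"}


def recommend(needs):
    """Map a set of requirement keywords to one of the three alternatives."""
    if needs & SUITE_NEEDS:
        return SUITE
    if needs & DISTRIBUTION_NEEDS:
        return DISTRIBUTION
    # Remaining cases: learning low-level details, or experts configuring everything.
    return APACHE


print(recommend({"low_level_details"}))
print(recommend({"easy_setup", "commercial_support"}))
print(recommend({"commercial_support", "integration"}))
```

The order of the checks reflects the layering described in this article: a suite sits on top of a distribution, so any suite-level need wins over distribution-level needs.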
Several alternatives exist for Hadoop installations. You can use just the Apache Hadoop project and create your own distribution out of the Hadoop ecosystem. Vendors of Hadoop distributions such as Cloudera, Hortonworks or MapR add several features on top of Apache Hadoop, such as tooling or commercial support, to reduce effort. On top of Hadoop distributions, you can use a big data suite for additional features such as modeling, code generation and scheduling of big data jobs, plus integration of all kinds of different data sources. Be sure to evaluate the different alternatives to make the right decision for your big data project.
About the Author
Kai Wähner works as Principal Consultant at Talend. His main area of expertise lies within the fields of Java EE, SOA, Cloud Computing, BPM, Big Data, and Enterprise Architecture Management. He is a speaker at international IT conferences such as JavaOne, ApacheCon or OOP, writes articles for professional journals, and shares his experiences with new technologies on his blog. Find more details and references (presentations, articles, blog posts) on his website, or contact him on Twitter: @KaiWaehner.