To Hadoop or Not to Hadoop?
Hadoop is often positioned as the one framework your business needs to solve nearly all its problems. Mention “Big Data” or “Analytics” and pat comes the reply: Hadoop! Hadoop, however, was purpose-built for a clear set of problems; for some it is, at best, a poor fit, and for others, even worse, a mistake. While data transformation (or, broadly, ETL operations) benefits significantly from a Hadoop setup, if your business needs fall into any of the following five categories, Hadoop might be a misfit.
1. Big Data cravings
Businesses like to believe that they have a Big Data dataset; sadly, that is often not the case. On data volume and the common perception that one possesses “Big Data”, the research article Nobody Ever Got Fired For Buying a Cluster reveals that while Hadoop was designed for tera/petabyte-scale computation, the majority of real-world jobs process less than 100 GB of input (median job sizes at Microsoft and Yahoo are under 14 GB, and 90% of jobs at Facebook are well under 100 GB). It therefore puts forth the case for a single “scale-up” server over a “scale-out” setup running Hadoop.
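As a rough, back-of-the-envelope illustration of the scale-up option, the sketch below aggregates a large CSV on a single machine by streaming it in bounded chunks. The file name and the customer_id/amount columns are invented for the example; this is a minimal sketch, not a benchmark.

```python
# A minimal sketch of "scale-up" processing: aggregate a multi-GB CSV on one
# machine by streaming it in chunks instead of standing up a Hadoop cluster.
# The file name and column names (customer_id, amount) are hypothetical.
import pandas as pd
from collections import Counter

totals = Counter()
# Read ~1M rows at a time so memory stays bounded regardless of file size.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000,
                         usecols=["customer_id", "amount"]):
    totals.update(chunk.groupby("customer_id")["amount"].sum().to_dict())

print(totals.most_common(10))   # top customers by total spend
```

A single server with enough RAM and fast disks handles this style of job comfortably well into the tens or hundreds of gigabytes.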
Ask Yourself:
- Do I have several terabytes of data or more?
- Do I have a steady, huge influx of data?
- How much of my data am I going to operate on?
2. You are in the queue
Hadoop's minimum job latency is about a minute. This means that it takes the system a minute or more to respond to, say, a customer's purchase with recommendations. Only a loyal and patient customer would stare at the screen for 60+ seconds waiting for a response. An alternative is to pre-compute related items for every item in the inventory ahead of time using Hadoop, and give the website or mobile app one-second-or-less access to the stored result. Hadoop is an excellent Big Data pre-computation engine. Of course, as the nature of your response gets more complex, complete pre-computation becomes very inefficient.
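Here is a sketch of that pre-compute-then-serve pattern, with invented orders and item names, and a plain dictionary standing in for whatever low-latency store (HBase, Redis, etc.) would hold the Hadoop job's output: a nightly batch step derives related items from co-purchase history, while the serving step stays a constant-time lookup.

```python
# Sketch of the pre-computation pattern: a batch job (here a plain Python
# function standing in for a Hadoop job) derives "related items" from
# co-purchase history, and the web tier answers requests with a simple lookup.
# Orders and item IDs are invented for illustration.
from collections import defaultdict
from itertools import combinations

orders = [
    {"tent", "sleeping_bag", "lantern"},
    {"tent", "lantern"},
    {"sleeping_bag", "camp_stove"},
]

def precompute_related(orders):
    """Batch step: count how often each pair of items is bought together."""
    co_counts = defaultdict(lambda: defaultdict(int))
    for order in orders:
        for a, b in combinations(sorted(order), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    # Keep only the top few co-purchased items per product.
    return {item: sorted(related, key=related.get, reverse=True)[:3]
            for item, related in co_counts.items()}

related_store = precompute_related(orders)   # persisted to a fast KV store in practice

def recommend(item_id):
    """Serving step: a constant-time lookup, fast enough for a web request."""
    return related_store.get(item_id, [])

print(recommend("tent"))
```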
Ask Yourself:
- What are user expectations around response time?
- Which of my jobs can be batched up?
3. Your call will be answered in...
Hadoop does not serve businesses that need real-time responses to their queries. Jobs go through the map and reduce phases and spend further time in the shuffle phase; none of these stages is time-bound, which makes developing real-time applications on top of Hadoop very difficult. Volume-weighted average price trading is an example where responses need to be time-bound in order to place buys.
Analysts sorely miss SQL. Hadoop does not handle random access to its datasets well (even with Hive, which essentially compiles your query into MapReduce jobs). Google's Dremel architecture (and, by extension, BigQuery) is designed to answer ad-hoc queries over huge row-sets in seconds, and SQL lets you do joins. Shark from the University of California, Berkeley's AMPLab and the Stinger initiative led by Hortonworks are other alternatives to look out for.
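For a toy illustration of why per-query scans hurt interactivity, the sketch below (plain Python, with invented records and an invented user_id field) contrasts the Hive/MapReduce approach of reading every record for each query with a point lookup against a pre-built index, which is the kind of random access Hadoop itself does not offer.

```python
# Toy contrast: a full scan per query (how a MapReduce-backed query answers a
# point lookup) versus a lookup against a pre-built index.
# Records and the "user_id" field are invented for illustration.
records = [{"user_id": i, "country": "IN" if i % 3 else "US"}
           for i in range(100_000)]

def scan_lookup(uid):
    """Hive/MapReduce style: every ad-hoc query re-reads the whole dataset."""
    return [r for r in records if r["user_id"] == uid]

# Index built once; each subsequent lookup is O(1).
index = {r["user_id"]: r for r in records}

def indexed_lookup(uid):
    return index.get(uid)

print(scan_lookup(42)[0], indexed_lookup(42))
```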
Ask Yourself:
- What is the level of interaction users/analysts expect with my data?
- Do they need interactive access to terabytes of data, or just to a subset?
Let's say it together: Hadoop works in batch mode. That means as new data is added, the jobs need to run over the entire set again, so analysis time keeps increasing. Chunks of fresh data, mere updates or small changes may arrive in real time, and businesses often need to make decisions based on these events; however rapidly the incoming data is ingested, Hadoop still processes it in batch mode. YARN promises to address this in the future, while Twitter's Storm is already a popular and available alternative. Combining Storm with a distributed messaging system like Kafka opens up a variety of use cases for stream aggregation and processing (a small sketch of this per-event style of processing follows below). Load balancing, however, is sorely missing in Storm, while it is available in Yahoo's S4.
Ask Yourself:
- What is the shelf-life of my data?
- How rapidly should my business produce value from incoming data?
- How important is it for my business to respond to live changes or updates?
Real-time advertising and the monitoring of sensor data demand real-time processing of streaming input, and Hadoop, or tools built on top of it, are not the only options. SAP's HANA in-memory database was used in the McLaren team's ATLAS suite of analytics tools, along with MATLAB, to run simulations and respond to telemetry during the recent Indy 500. Many analysts opine that the future of Hadoop is interactive and real-time.
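To make the contrast with batch reprocessing concrete, here is a minimal sketch of per-event processing in plain Python: a running aggregate is updated as each sensor reading arrives, roughly the way a Storm bolt fed from a Kafka topic would update it. The event source, sensor IDs and readings are all invented stand-ins for a real stream.

```python
# Sketch of stream-style processing: update an aggregate per event as it
# arrives, rather than re-running a batch job over the whole history.
# In a real deployment the events would come from Kafka and the update logic
# would live in a Storm bolt; here a toy generator stands in for the stream.
import time

def event_stream():
    """Toy source: yields (sensor_id, reading) events; invented data."""
    for i in range(10):
        yield ("sensor-1", 20.0 + i * 0.5)
        time.sleep(0.1)          # simulate events arriving over time

running_count = 0
running_sum = 0.0
for sensor_id, reading in event_stream():
    # Incremental update: O(1) work per event, the answer is always current.
    running_count += 1
    running_sum += reading
    print(f"{sensor_id}: mean so far = {running_sum / running_count:.2f}")
```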
4. I Just Broke Up With My Social Network
Hadoop, and especially MapReduce, is best suited to data that can be decomposed into key-value pairs without fear of losing context or implicit relationships. Graphs possess implicit relationships (edges, sub-trees, child and parent relationships, weights, etc.), and not all of them will live on the same node. This forces most graph algorithms to carry a portion of the graph, or the entire graph, through each iteration, which is often infeasible, or at best convoluted, to express in MapReduce. There is also the problem of choosing a strategy for partitioning the data across nodes. If your primary data structure is a graph or a network, you are probably better off using a graph database like Neo4j or Dex, or exploring more recent entrants such as Google's Pregel or Apache Giraph.
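To see why, here is a rough sketch of a single MapReduce-style iteration of a PageRank-like computation, written in plain Python over an invented three-node graph: the mapper has to re-emit each node's adjacency list alongside the partial scores purely so that the graph structure survives into the next iteration.

```python
# One MapReduce-style iteration of a PageRank-like computation on a toy graph,
# showing that the mapper must re-emit the graph structure itself so it is
# still available in the next iteration. The graph is invented for illustration.
from collections import defaultdict

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {node: 1.0 / len(graph) for node in graph}

def mapper(node, neighbours, rank):
    # Carry the adjacency list forward, or it is lost after this iteration.
    yield node, ("GRAPH", neighbours)
    for n in neighbours:                     # distribute rank to neighbours
        yield n, ("RANK", rank / len(neighbours))

def reducer(node, values):
    neighbours, incoming = [], 0.0
    for kind, payload in values:
        if kind == "GRAPH":
            neighbours = payload
        else:
            incoming += payload
    return node, neighbours, 0.15 / len(graph) + 0.85 * incoming

# Simulate the shuffle phase, then reduce.
shuffled = defaultdict(list)
for node in graph:
    for key, value in mapper(node, graph[node], ranks[node]):
        shuffled[key].append(value)

for node, values in shuffled.items():
    print(reducer(node, values))
```

A full run repeats this map-shuffle-reduce round until the ranks converge, paying the cost of shipping the graph around on every pass; Pregel and Giraph avoid exactly this overhead.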
Ask Yourself:
- Is the underlying structure of my data as vital as the data itself?
- Does the insight I wish to gain reflect the structure as much as, or more than, the data itself?
5. The Mold of MapReduce
Some tasks, jobs and algorithms simply do not yield to the MapReduce programming model. One such set of problems was touched upon in the previous section. Tasks that need the results of intermediate steps to compute the current step are another category (an academic example is computing the Fibonacci series). Some machine learning algorithms, such as gradient-based learning or expectation maximisation, also do not fit the MapReduce paradigm well. Researchers have suggested specific optimisations and strategies (global state, passing data structures along for reference, etc.) for each of these issues, but they still make the implementation more unintuitive and complicated than necessary.
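As a minimal sketch of the gradient-based case (with a tiny invented dataset), the loop below fits a slope by gradient descent: each update needs the parameters produced by the previous step, so a faithful MapReduce port would need one full job, and one full pass over the data, per iteration.

```python
# Sketch of why gradient-based learning fits MapReduce awkwardly: every
# iteration depends on the parameters from the previous one, so each pass of
# the loop below would become a separate full MapReduce job over the data.
# The data points are invented for illustration.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]   # (x, y) pairs
w, lr, steps = 0.0, 0.01, 100

for _ in range(steps):            # sequential: steps cannot be parallelised
    # Within a step, the gradient is a sum over records, which does map nicely
    # to map (per-record gradient) + reduce (sum). The chain of steps does not.
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    w -= lr * grad

print(f"fitted slope ~ {w:.2f}")
```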
Ask Yourself:
- Does my business place great emphasis on highly specialised algorithms or domain-specific processes?
- Wouldn't the technical team be better equipped to analyse whether these algorithms are MapReducible?
Added to these are business cases where the data is not significantly large, or where the total data set is large but made up of billions of small files that cannot be concatenated (e.g. many image files, each of which needs to be scanned for a particular shape). As already mentioned, jobs that do not lend themselves to the MapReduce paradigm of divide and aggregate also make adopting Hadoop contrived.
Now that we have explored when Hadoop might be a misfit, let's look at when it might make sense.
Ask Yourself:
Does your organization...
- Want to extract information from piles of text logs?
- Want to transform largely unstructured or semi-structured data into some other usable and structured format?
- Have tasks that can run over the entire set of data, overnight (like credit card companies do with the day’s transactions)?
- Treat conclusions drawn from a single processing of the data as valid until the next scheduled processing (unlike stock market prices, which certainly change between end-of-day values)?
Then you should most certainly explore Hadoop.
These represent a sizeable list of categories of business problems that fit well into the Hadoop model (although reports suggest that, even for those, taking it to production is a non-trivial challenge). Typical jobs that have to go over huge quantities of unstructured or semi-structured data, and either summarise the contents or transform relevant observations into a structured form to be utilised by other components in the system, are very well suited to the Hadoop model. If your collected data has elements that can easily be captured as an identifier with a corresponding value (key-value pairs, in Hadoop-speak), you can utilise that simple association to perform several kinds of aggregation.
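As a minimal illustration of that kind of key-value aggregation, the sketch below runs a word-count-style map, shuffle and reduce over a few invented web-server log lines in plain Python; in practice the same mapper and reducer logic would run over files in HDFS, for instance via Hadoop Streaming.

```python
# A minimal map/shuffle/reduce sketch of key-value aggregation: count status
# codes in web-server log lines. The log lines are invented; in practice the
# same mapper and reducer would run over HDFS files.
from collections import defaultdict

log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]

def mapper(line):
    status = line.split()[-1]
    yield status, 1                      # emit one (key, value) pair per record

def reducer(key, values):
    return key, sum(values)              # aggregate all values for one key

# Shuffle: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for line in log_lines:
    for key, value in mapper(line):
        grouped[key].append(value)

print([reducer(k, v) for k, v in grouped.items()])
```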
At the end of the day, the key is to recognise the resources your business has available and to understand the nature of the problem you wish to solve. That, together with the elaboration above, should help you choose the best tools for your business.
And it may very well be Hadoop.
What has been your experience? Share in the comments section.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.