Hadoop & Spark are Dominating Big Data, But the Market Demands Even More

According to industry insider and InfoWorld columnist Andy Oliver, what you need to know about Hadoop is that it is no longer Hadoop. At least, it isn’t the Hadoop that everyone once knew and may or may not have loved. Hadoop’s co-creator Doug Cutting believes that the changes are a direct result of the open source roots of Hadoop and related projects, most notably Spark. Together, Hadoop and Spark are dominating the big data marketplace, with Hadoop commanding half of big data’s $100 billion annual market value, and Spark surpassing MapReduce in terms of popularity (at least among those searching for big data products on Google).

While Hadoop is the go-to big data framework and Spark reigns supreme when it comes to processing engines, these two do not comprise the sum total of what businesses need for big data analytics. As organizations familiarize themselves with the tools and techniques, there is ever more demand for products to meet certain needs that aren’t addressed with the basic frameworks and engines (though, as mentioned, Hadoop has come light-years from its inception). What else do big data pros demand to get the job done?


Apache Solr

Spark is not capable of serving as Hadoop’s full-text search engine. That job falls to Solr, an enterprise-grade search server that can work within the Hadoop framework as a stand-alone product.
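To make "full-text search" concrete, here is a minimal sketch of an inverted index, the data structure that underlies Lucene, the library Solr is built on. This is not Solr's API: the `build_index` and `search` functions and the sample documents are hypothetical, chosen only to illustrate the idea of mapping each term to the documents that contain it.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Hadoop stores big data",
    2: "Solr searches big data in Hadoop",
    3: "Kafka moves messages",
}
index = build_index(docs)
both_terms = search(index, "big data")
narrow = search(index, "hadoop solr")
missing = search(index, "spark")
```

A real search server adds tokenization, relevance ranking, and distributed index storage on top of this lookup, which is why it is a separate product rather than a feature of a processing engine.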

Apache Impala

Spark is not made to act as an interactive query engine, either. For that activity, Hadoop users turn to Impala. Impala is still in the Apache Incubator, but it is already integrated into CDH (Cloudera Distribution Including Apache Hadoop) and Cloudera Enterprise. Impala is an analytical MPP (massively parallel processing) database and interactive query engine designed for use with Hadoop, and it is most notable for delivering extremely fast analytics.

Apache Kafka

When it comes to message brokering, Kafka outpaces the alternatives by significant margins. Kafka is a publish-subscribe messaging service rethought as a distributed commit log, a design that makes it faster, more scalable, and more durable than traditional message brokers. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. A single Kafka cluster can serve even the largest enterprises, and it can be scaled elastically and transparently without incurring any downtime.
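The "commit log" idea can be sketched in a few lines. The toy class below is not Kafka's client API. The `CommitLog` class, its methods, and the consumer names are hypothetical, and everything lives in one process's memory. It only illustrates the core design: messages are appended to an ordered log, and each consumer tracks its own read position (offset), so many subscribers can read the same data independently.

```python
from collections import defaultdict


class CommitLog:
    """Toy append-only log; records are never modified, only appended."""

    def __init__(self):
        self._records = []
        # Next offset to read, tracked separately for each consumer.
        self._offsets = defaultdict(int)

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.publish(event)

first = log.poll("billing", max_records=2)   # billing reads two records
second = log.poll("billing")                 # then resumes where it left off
audit = log.poll("audit")                    # audit starts from the beginning
```

Because reading does not remove anything from the log, adding another subscriber costs only one more offset to track. The real system partitions and replicates this log across brokers.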

Apache Phoenix

Apache Phoenix delivers operational and OLTP (online transaction processing) analytics in a Hadoop environment. It is designed for low-latency applications: it combines the power of a traditional SQL database, including JDBC APIs and a full complement of ACID (Atomicity, Consistency, Isolation, Durability) transaction capabilities, with the flexibility of late-bound, schema-on-read capabilities from the NoSQL world, using HBase as its backing store. Phoenix is completely integrated with Spark, MapReduce, Hive, Pig, Flume, and other Hadoop products.

Apache Falcon

Finally, Falcon meets the need for feed processing and management, making it easier for Hadoop users to move that work onto their Hadoop clusters. It establishes the relationships between the data and processing elements within the Hadoop environment and handles feed management services such as feed retention, replication across clusters, archiving, and more. Falcon is also fully integrated with Hive’s HCatalog.

While there are other niche needs that different big data tools (particularly the open source Apache products) meet, this body of solutions rounds out the features and functionality that Spark and Hadoop lack. Are you ready to delve more deeply into Hadoop and big data? Get more information on hosting your own Big Data Week event now.
