How does Cloudera's Impala work?

A first encounter with Cloudera usually begins with a misunderstanding: "No, we are not a cloud company," explains Matt Elson, Vice President Technology Alliances, in a conversation with CW. "We're in the big data business." The company's founders may originally have had the idea of selling their framework as a cloud offering in 2008, the manager suspects. If so, they have since abandoned that goal. Today Cloudera sells its core product, "Enterprise Data Hub", either through partners (in Germany so far exclusively through T-Systems as integration partner) or directly to companies, preferably in the finance, insurance, retail and consumer goods industries.

EDH - the Hadoop distribution from Cloudera

The basis for all Cloudera projects is the company's own distribution of the Apache Hadoop framework, which it enriches with services and sells as the "Enterprise Data Hub". The platform offers only the infrastructure for handling big data, i.e. the file system, libraries and tools as well as query, filter and processing mechanisms such as MapReduce. Cloudera does not offer analysis or business intelligence applications itself; according to Elson, these functions are provided by around 700 certified software partners.

Most Cloudera customers start with small, internal installations for evaluating structured data, says Elson, describing the typical user strategy. Further projects often follow quickly, in which unstructured data (e.g. from social media platforms or financial information) is processed. Cloudera neither has nor plans its own public cloud infrastructure in which users could run big data queries, Elson stresses.

Amazon is setting the pace in the cloud

That could be a problem, at least according to Jeff Kelly. The Wikibon analyst complains that Cloudera is "a one-product company". "They have to care more about the cloud." Almost all of today's Cloudera installations run in internal environments behind a firewall, which shuts out a large group of potential users.

The criticism was triggered by Amazon's announcement of its big data streaming service "Kinesis". The new service attracted enormous attention in the industry because it is supposed to allow real-time data analysis. Thanks to the cloud operating model, it is easy to use, quick to deploy and highly scalable. In combination with "Redshift" (data warehousing), "Elastic MapReduce" (processing based on Hadoop) and "DynamoDB" (database), Amazon can now handle the entire big data workload in the cloud, Kelly said approvingly.

Cloudera vs. Amazon: Better to cooperate than compete

Elson counters the market watchers' call for Cloudera to work the cloud market more intensively by pointing to an established cooperation with Amazon: if necessary, the company's Hadoop framework runs on Amazon's cloud platform "EC2" and can be moved quickly into the cloud with the help of open-source packages such as OpenStack and CloudStack. Each partner is free to do so. At the moment, however, demand for public cloud environments is limited because companies prefer to run their big data projects internally.

The strength of the company's own Hadoop framework is said to be the simple setup of a technical environment for processing large amounts of data. "Hadoop is a complex technology," warns the manager. "We want to make it easier to use." With Cloudera's product, he says, Hadoop is no longer a solution for a few thousand specialists but for millions of users who know neither Java programming nor database administration. "Anyone can write text-based queries," says Elson, promoting the company's solution.

Hadoop mastermind on board

It seems certain, however, that Cloudera has the specialist knowledge needed to master and further develop Hadoop. The three founders all have extensive experience in big data projects. Amr Awadallah, the company's Chief Technology Officer, comes from Yahoo, an internet company that uses the Hadoop framework intensively. Mike Olson, Chief Strategy Officer, was previously with Oracle and Sleepycat Software, maker of "Berkeley DB". Jeff Hammerbacher, now Chief Scientist at Cloudera, comes from Facebook, where the world's largest Hadoop cluster is said to be in operation.

Doug Cutting, who made the famous yellow elephant the Hadoop mascot, joined later. He is considered the guiding spirit of the Hadoop framework because he initiated it and advanced it so far that the Apache Software Foundation raised it to the status of a top-level project. Today Cutting is Chief Architect at Cloudera.

Impala does real-time analysis

For Cloudera, this concentrated expertise has paid off in, among other things, the product "Impala", which came onto the market around a year ago. It extends Hadoop's traditional batch-based processing with real-time processing. With a one-year head start in this segment, Cloudera is taking a relaxed view of Amazon's efforts around big data streaming. Wikibon analyst Kelly, at least, doubts whether that serenity will last: "Amazon has reversed the proportions and thus changed the big data game. Cloudera has transformed itself from a big player into a small player."