Top tools for big data sets

Brian Gentile talks about choosing the right approach to taking on a customer's big data

To make data more usable and standardised, we implemented relational database technology, which has become less capable over the years as data types have grown more varied. To generate new value and insight from that data, we implemented data warehouses, which provide insufficient agility for today's fast-paced businesses.

The result is that many organisations today are not able to generate genuine insight from the variety of data available to them, leaving a huge untapped opportunity to put this data to productive use. Creating new value from data, therefore, remains a thorn in the side of enterprises across the globe.

The value of mastering big data stems from being able to adapt to all kinds of change, from a shift in a product's price to an acquisition, and from being able to obtain the relevant information in a timely manner.

Forbes magazine recently reported on how competing on time and information will drive the next major economic era. If you are a business analyst or technologist responsible for mapping data to decisions, the variety, velocity and volume of data available to you today have never been richer, and the responsibility never greater.

Begin with an understanding of the different classes of data source technologies that can legitimately be used to harness or tame big data.

Hadoop is the most popular software framework associated with this rising trend. Others include NoSQL databases, massively parallel processing (MPP) data stores and even ETL/data integration approaches for moving big data, batch by batch, into a more usable format. Each aligns with an appropriate use case.

Live exploration is the most dynamic approach because it involves native connectivity directly from the BI tool to the big data source and can yield reports and analyses in near real time. Hadoop HBase, Hadoop HDFS and MongoDB are just three of the most popular data sources for which this direct connection is an advantage.
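As a rough illustration of what live exploration looks like in practice, the Python sketch below queries a MongoDB collection directly with the pymongo client, much as a BI tool with native connectivity would. The connection URI, database and collection names, and field names are assumptions made for the example.

```python
# A minimal sketch of "live exploration": aggregating directly against a live
# MongoDB store. The URI, "sales" database, "orders" collection and field names
# are illustrative assumptions, not a prescribed setup.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["sales"]["orders"]

# Aggregate revenue per region in near real time, straight from the source.
pipeline = [
    {"$group": {"_id": "$region", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["revenue"])
```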

Data scientists and data analysts especially will thrive in this environment where their deep understanding of the data domain enables fast, rich interaction and possibly even faster decisions.

Direct batch reporting, which relies on tried-and-true SQL access to big data, is an important and mainstream approach, especially this early in the game. Hadoop Hive is the best-known example, but Cassandra offers CQL access that delivers similar results and functionality.

Although this technique introduces greater latency, the added analytic flexibility creates value for a broader class of user.
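To illustrate the SQL-based route, here is a hedged Python sketch that runs a HiveQL aggregation through the PyHive client. The host, username, table name and columns are assumptions, and because Hive compiles the query into batch jobs, results arrive with the latency noted above.

```python
# A sketch of direct batch reporting over Hive's SQL interface via PyHive.
# The host, username and the "web_logs" table are assumed for the example.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# A familiar SQL aggregation; Hive runs it as a batch job, so expect latency.
cursor.execute(
    "SELECT status, COUNT(*) AS hits "
    "FROM web_logs "
    "GROUP BY status "
    "ORDER BY hits DESC"
)
for status, hits in cursor.fetchall():
    print(status, hits)
```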

Batch ETL, which means using extract, transform and load techniques to create a more usable subset of the big data, is also popular – especially when the insight being sought is less urgent, typically arriving hours or days after data capture.

Almost every ETL tool has been improved to connect to and transform big data. Some even integrate nicely with underlying Hadoop technologies, such as Pig, making the data steward's life potentially easier.

Ultimately, this offers the most flexibility for big data, including combining and correlating it with a wide variety of traditional data, which potentially delivers value to many people across an enterprise.
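As a simple illustration of the batch ETL pattern, the Python sketch below extracts raw event data, transforms it into a smaller, analysis-ready subset and loads it into a traditional reporting database. The file path, column names and target table are assumptions; a production pipeline would more likely run through a dedicated ETL tool or Hadoop jobs such as Pig or Hive.

```python
# A minimal batch-ETL sketch: extract, transform, load. Paths, column names and
# the "daily_revenue" table are illustrative assumptions only.
import pandas as pd
from sqlalchemy import create_engine

# Extract: raw clickstream exported from the big data store as CSV.
raw = pd.read_csv("exports/clickstream_2013-05-01.csv")

# Transform: keep completed purchases and aggregate revenue per product per day.
purchases = raw[raw["event_type"] == "purchase"]
daily = (
    purchases.groupby(["date", "product_id"])["amount"]
    .sum()
    .reset_index()
    .rename(columns={"amount": "revenue"})
)

# Load: write the usable subset into a traditional reporting database,
# where it can be combined with conventional enterprise data.
engine = create_engine("postgresql://reporting:secret@localhost/warehouse")
daily.to_sql("daily_revenue", engine, if_exists="append", index=False)
```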

In nearly all our conversations with enterprise customers who are beyond the proof-of-concept stage with big data, one of these three routes would suit their needs. Each approach has pros and cons, though, that should be considered before heading down the path.

By understanding the need for and planned use of the data in advance, an organisation can make the most logical choice. Businesses really do have a chance to master big data.

Brian Gentile is chief executive of Jaspersoft