The BI Implications of Highly Customizable Data

December 11, 2012 | BI Innovation, Big Data, Tips

As seen on TDWI.org

Although storage prices are dropping, you must still consider how your customized data set will work for self-service or ad hoc reporting, especially in a real-time environment.

Modern database management systems were designed in an era when someone was in charge of all data structures. A DBA, or sometimes a small committee of them, dictated what could be stored and how it should be structured.

At the time, data storage was unbelievably expensive by today’s standards. Imagine oil being $10,000 a barrel, and consider how carefully it might be controlled and utilized. In a world where billion-dollar companies could not even store as much data as you have on your cell phone, relational data structures represented efficient ways to store various types of data such as strings, dates and numbers in a structured way that enabled quick look-up through indexing.

The idea of storing a new piece of information had to be considered very seriously, because adding a column to a table meant every single record would take up more space, even if the new field for that record was blank (NULL). The advantages of structured data disappeared when information was sparse. Additionally, arbitrarily adding new types of data was an involved process.

Today’s knowledge workers are blessed to have access to technological resources that were once unimagined. Storage capacity and availability have exploded; the scarcity is now in people with the technical skills to manipulate the data. With the cloud came the idea that anyone could simply launch and deploy applications without any IT resources, and businesses quickly wanted the ability to store any available and relevant information. Without a committee of DBAs in the way, custom fields have flourished, and the business world probably won’t be going back any time soon.

Lower storage prices mean that the overhead of custom fields is insignificant and you can have as many as you like.

NoSQL Databases

I’m a strong proponent of the SQL language, even though I think the underlying data architectures of SQL database servers need an overhaul. Progress is coming from next-generation Big Data storage vendors delivering technologies such as Hadoop that support schema-less operation, where every bit of data is treated as a custom field. The big challenge, however, is the engineering cost of working with such systems. Even basic aggregate operations require Java programmers, and many companies resort to making a nightly relational copy of their Hadoop data for their analytical needs.

Next-generation relational databases such as SQL 2012 integrate natively with Hadoop, giving you the best of both worlds. Just as hybrid cars deliver the benefits of electric technology while keeping you compatible with gasoline infrastructure, SQL 2012’s hybrid approach keeps the easy access of SQL while providing the performance and flexibility of Hadoop. With such technologies emerging, NoSQL is transitioning from meaning “No SQL” to “Not Only SQL.”

SQL-Based Entity-Attribute-Value (EAV)

A more traditional variation of a hybrid approach that can be implemented without next-generation data storage is EAV. Essentially, this means creating tables that store the names of virtual entities and attributes. These long, skinny tables store data in a flexible manner, and the PIVOT operations provided by the database let you convert them on the fly into more traditional data structures. This is the approach many enterprises use with Fusion, a product from Izenda. The main disadvantage of this approach is somewhat limited scalability: the PIVOT operations create challenges when you have more than a few gigabytes of data.
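As a concrete illustration, here is a minimal EAV sketch in SQL. The table and attribute names (EntityAttributeValue, Color, Weight, Region) are hypothetical rather than taken from any particular product, and the PIVOT syntax shown is the SQL Server flavor.

```sql
-- A "long, skinny" EAV table: one row per entity/attribute pair.
CREATE TABLE EntityAttributeValue (
    EntityID       INT            NOT NULL,
    AttributeName  NVARCHAR(128)  NOT NULL,
    AttributeValue NVARCHAR(4000) NULL,
    PRIMARY KEY (EntityID, AttributeName)
);

-- Pivot the rows back into a traditional wide shape on the fly.
SELECT EntityID, [Color], [Weight], [Region]
FROM (
    SELECT EntityID, AttributeName, AttributeValue
    FROM EntityAttributeValue
) AS src
PIVOT (
    MAX(AttributeValue)
    FOR AttributeName IN ([Color], [Weight], [Region])
) AS p;
```

Because the attribute list in the PIVOT clause must be known when the query is written, reporting tools typically generate such queries dynamically from the stored attribute names.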

Custom Field Tables

One of the best solutions to the custom fields challenge, if you only have a few dozen fields per object, is to create a single table with columns that get used in purpose-specific ways. Each system object gets linked to a single custom fields table with generic column names such as Field1 and Field2, and software aliasing displays these in a context-specific way that hides the generic names from the user. This approach works well even for very large data sets containing billions of records, and it is easy to manage and report on.
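A minimal sketch of what this might look like, again in SQL Server syntax; the Orders object, the column types, and the aliases (Sales Region, Contract Type, Discount Rate) are purely illustrative assumptions.

```sql
-- One custom fields table per system object (here: orders).
-- Generic columns are given meaning per deployment via software aliasing.
-- Assumes an existing Orders table with an OrderID key.
CREATE TABLE OrderCustomFields (
    OrderID INT           NOT NULL PRIMARY KEY,
    Field1  NVARCHAR(255) NULL,   -- aliased in the UI as, e.g., "Sales Region"
    Field2  NVARCHAR(255) NULL,   -- aliased as "Contract Type"
    Field3  DECIMAL(18,2) NULL    -- aliased as "Discount Rate"
);

-- The reporting layer applies the aliases so end users never see Field1/Field2.
SELECT o.OrderID,
       c.Field1 AS [Sales Region],
       c.Field2 AS [Contract Type],
       c.Field3 AS [Discount Rate]
FROM Orders AS o
JOIN OrderCustomFields AS c ON c.OrderID = o.OrderID;
```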

A Final Word

When considering a data storage methodology, it’s important to consider the reporting and BI ramifications. As storage becomes more affordable, analyzing all of that data becomes the true business cost. It’s important to weigh how the data set will work in a self-service or ad hoc reporting environment, especially one that is real-time.