As seen on TDWI.org
With so much data in so many places, how can you quickly connect to the sources you need? Data virtualization may be the answer.
In a world where yesterday’s data is like yesterday’s news and users are accustomed to finding out “what’s happening right now” on real-time social media platforms such as Twitter, virtualization is quickly becoming an ideal data management strategy.
Traditional data integration requires building a data warehouse and moving enterprise data to it on a periodic basis. Today’s data virtualization capabilities connect directly to live databases and pull information “just in time.” This approach delivers real-time results without the costs and delays of a complex ETL (extract, transform, and load) project. Business users accustomed to instant, real-time access to data in their consumer lives expect the same experience from their enterprise systems. Generation Y employees who have accessed the Internet from an early age are often surprised that ETL technologies must move data to a data warehouse before it can be analyzed, meaning business reports deliver yesterday’s data.
OLTP versus OLAP
The current thinking is that online transaction processing (OLTP) and online analytical processing (OLAP) require radically different technology stacks running on separate physical servers. Before you can analyze data, you must use ETL to copy it from your OLTP to your OLAP system. If users want a dashboard that displays key metrics from different systems, a data warehouse must be built with ETL technologies to move all this data into one place. Although this assumption was absolutely true in the 90s, the 1,000-fold increase in processing power we’ve experienced gives us the opportunity to evaluate new possibilities.
What We Can Learn From GM’s EV1
In the late 90s, General Motors introduced an electric car called the EV1 capable of cruising at highway speeds. It was a revolutionary design that could have changed the world, kept GM the top global car manufacturer, and perhaps prevented the need for a government bail-out a decade later. After selling over 400 million gas-powered cars, however, GM was afraid of disrupting its own market. Not only did the automaker abandon the program, it actually rounded up the cars from their owners and crushed them to eliminate the idea from our collective memory. The innovation had so much potential to disrupt the status quo that the then-No. 1 car manufacturer could not take the risk. Several years later, Toyota rolled out the Prius hybrid and went on to become the No. 1 global car manufacturer. Hybrid cars are now paving the path to zero-emission electric vehicles and a future where we no longer need to burn fossil fuels to get to work.
Data Virtualization Provides a Hybrid Approach
Hybrid cars use both electric and gas engines, combining long range and electric efficiency with compatibility with existing infrastructure. In the same way, data virtualization layers OLAP capabilities on top of existing OLTP databases. Beyond delivering results in real time, a virtualized approach makes data integration agile. You avoid the costs and risk of a long-term, waterfall-style systems development life cycle that might deliver the wrong thing too late. Instead, you can stand up a virtualized data mart over a weekend with a limited set of data and add more information based on feedback, delivering a minimum viable product with little investment that pleases users with a solution in record time.
Mankind tends to overestimate what can be done in two years but underestimate what can be done in 10. The current generation of OLAP technologies was designed at a time when a typical server came with 32 megabytes of memory. A server can now be purchased with 32 gigabytes of memory for less than $2,000. This vast increase in capacity makes options affordable that would have been unimaginable in years past. Because most departmental systems have OLAP workloads of only a few gigabytes, a 64-bit system can run an aggregate analysis over a hundred million records in a few seconds.
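To make that scale concrete, here is a minimal sketch of the kind of full-scan, in-memory aggregation described above, using only the Python standard library and scaled down to one million rows so it runs anywhere; the column of random "amounts" is invented for illustration, not data from the article:

```python
import array
import random
import time

random.seed(42)
N = 1_000_000  # scaled down from the article's hundred-million-row claim

# Simulate a numeric fact column held entirely in RAM.
amounts = array.array("d", (random.random() * 100 for _ in range(N)))

start = time.perf_counter()
total = sum(amounts)  # full-table aggregate, no disk I/O involved
elapsed = time.perf_counter() - start

print(f"aggregated {N:,} rows in {elapsed:.3f}s, total={total:,.2f}")
```

Because the whole column fits in memory, the scan is bound only by CPU and memory bandwidth, which is exactly why a modern 64-bit server can aggregate far larger datasets in seconds.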
Next-Gen Technologies Giving Us Another Quantum Leap
Commodity 64-bit processors and operating systems mean that your data can be addressed as a single chunk, like one massive library that contains every book ever written. In contrast, a 32-bit system is limited to about 2 gigabytes of addressable data and, in our analogy, has to put in a “special order” to another library when it does not have the book you need. Consumer devices such as tablets and the MacBook Air feel so fast because they replace the spinning platter, the main bottleneck in modern systems, with solid-state storage. That same technology is what lets you scale beyond a terabyte of data: some SSD hardware can now sustain over a gigabyte per second, a throughput profile that suits the random access patterns of OLAP queries very well.
Unfortunately, databases grow as business requirements change and are rarely refactored or redesigned, because doing so would break application code that relies on a specific structure. Many legacy databases contain indecipherable field names and obsolete fields that should not be shown. A great strategy for hiding this complexity from the user is to join, alias, and filter data sources with views, either through a CREATE VIEW statement or through an API that presents a simplified, human-readable structure.
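As a sketch of that view-based strategy, the snippet below uses SQLite (via Python’s built-in sqlite3 module) to wrap a legacy table with cryptic column names in a clean, human-readable view; every table and column name here is hypothetical, chosen only to illustrate the pattern:

```python
import sqlite3

# In-memory database standing in for a legacy OLTP schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cust_mstr (
        c_id     INTEGER PRIMARY KEY,
        c_nm     TEXT,     -- customer name
        c_rgn_cd TEXT,     -- region code
        c_del_flg INTEGER  -- soft-delete flag: a legacy field to hide
    );
    INSERT INTO cust_mstr VALUES
        (1, 'Acme Corp', 'NE', 0),
        (2, 'Globex',    'SW', 0),
        (3, 'Initech',   'NE', 1);

    -- The view aliases cryptic names and hides soft-deleted rows,
    -- presenting a simplified, human-readable structure.
    CREATE VIEW customers AS
        SELECT c_id     AS customer_id,
               c_nm     AS customer_name,
               c_rgn_cd AS region
        FROM cust_mstr
        WHERE c_del_flg = 0;
""")

rows = list(conn.execute("SELECT customer_name, region FROM customers"))
print(rows)
```

Application code written against the view keeps working even as the underlying table accumulates new legacy fields, which is the point of the indirection.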
The Power of Dynamic Pivots and Virtualization
One of the primary uses of OLAP cubes is cross-tab reporting. Modern databases can generate pivot data directly, without copying or pre-aggregating it into a cube, which also lets users make quick ad hoc changes to their reports. Traditional ETL-based integration requires complex scripts to move the data into a single database or a specialized OLAP cube. These technologies generally involve steep learning curves for staff or an investment in consulting time (many billable hours). The long-term costs can be staggering because the scripts must be updated whenever new fields are added.
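To illustrate a pivot computed live against the source table, here is a minimal sketch using conditional aggregation in SQLite through Python’s sqlite3 module (SQL Server and Oracle offer a dedicated PIVOT clause for the same idea); the sales schema and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('NE', 'Q1', 100), ('NE', 'Q2', 150),
        ('SW', 'Q1',  80), ('SW', 'Q2', 120);
""")

# Cross-tab report: quarters become columns via conditional
# aggregation, computed on the fly -- no cube, no data copy.
pivot = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(pivot)
```

Adding a new quarter means adding one more CASE expression to the query, an ad hoc change a user can make immediately rather than waiting for a cube rebuild.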
Such scripts also frustrate new developers who must understand code they did not write. Because new products and M&A activity dictate that there will always be new data, ETL is an ongoing investment. In contrast, today’s data virtualization capabilities instantly adapt to changes in data structure without any manual intervention.
Although storage capacities and data volumes are growing exponentially, the performance of data transfer technologies is lagging behind; networks and storage interconnects are just not keeping up. We may have 1,000 times the storage of a few years ago, but it takes far longer to copy all that data. At some point, transferring it via ETL takes more than 24 hours, and batch processing becomes unworkable. Moreover, the 24-hour global economy means there is no “down time” anymore. Just because employees in New York are sleeping does not mean that suppliers in China can wait for the ETL to finish. Data virtualization means never having to shut down or strain the system during a certain part of the day or the weekend.
There is still a place for traditional data warehouses, but affordable 64-bit servers packed with memory are making data virtualization possible for ever larger databases. Today, databases up to 100 gigabytes are a great fit for virtualization, and SSD technology is pushing that boundary toward the 10-terabyte range. Luckily, most systems fall within these limits, and virtualization is ready to become the number-one choice for data management.