Tag Archives: Star Schema

Death of the Star Schema?

With the release of the next version of PowerPivot around the corner (mid-March, I think), I’ve been re-acquainting myself with its new features. Most of the current version’s annoyances (no drill-thru and no hierarchy support, for example) have been remedied, and the additional enhancements to the DAX language (crossjoins, alternate relationships etc.) make modelling most any problem possible (and generally easy).

The more I come to know PowerPivot, the more I believe that modelled data warehouses’ days are numbered. I didn’t say data warehouses per se, rather those that attempt to centrally model end user reporting structures (usually as star-schemas).

There will continue to be a need for centrally controlled data warehouses (or at least simplified data views (and/or copies) of operational datasets, either provided by system vendors or by in-house IT) to bridge the raw-to-actionable data gap. But I suspect the emphasis will change from providing finished goods to providing semi-processed raw materials.

So, will the star-schema become redundant? No, as it’s still a valid method of modelling a reporting requirement in order to make many queries simpler to phrase (this obviously applies to SQL, but also to DAX queries). But those who build them will be doing so closer to the problem at hand, and specific to that problem (I’ve discussed this before in http://blog.gobansaor.com/2011/01/08/slowly-changing-dimensions-time-to-stop-worrying/).

For many reports the barely modified operational data model will be all that’s required (for example, DAX doesn’t require “fact” header/detail tables to be flattened to detail level, as would be the case with a classic star).
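To make the header/detail point concrete, here is a minimal sketch (in Python with the standard sqlite3 module, not DAX) of the flattening step a classic star would typically call for: header columns get repeated on every detail row so that the fact table sits at line level. The invoice_header and invoice_detail tables, their columns and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical operational tables: one header row per invoice,
# one detail row per invoice line.
SETUP_SQL = """
CREATE TABLE invoice_header (invoice_id INTEGER PRIMARY KEY, customer TEXT, invoice_date TEXT);
CREATE TABLE invoice_detail (invoice_id INTEGER, line_no INTEGER, product TEXT,
                             qty INTEGER, unit_price REAL);
INSERT INTO invoice_header VALUES (1, 'Acme', '2011-01-05');
INSERT INTO invoice_header VALUES (2, 'Bolt', '2011-01-06');
INSERT INTO invoice_detail VALUES (1, 1, 'Widget', 2, 10.0);
INSERT INTO invoice_detail VALUES (1, 2, 'Gadget', 1, 25.0);
INSERT INTO invoice_detail VALUES (2, 1, 'Widget', 5, 10.0);
"""


def flatten_invoices(conn):
    """Return one row per invoice line, with the header columns repeated on each row."""
    return conn.execute(
        """
        SELECT h.invoice_id, h.customer, h.invoice_date,
               d.product, d.qty * d.unit_price AS line_amount
        FROM invoice_header AS h
        JOIN invoice_detail AS d ON d.invoice_id = h.invoice_id
        ORDER BY h.invoice_id, d.line_no
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SETUP_SQL)
flat_rows = flatten_invoices(conn)
for row in flat_rows:
    print(row)
```

In a PowerPivot model the two tables could instead be loaded as they are and related on invoice_id, which is the saving the paragraph above refers to.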

“Good Enough” models will become the norm; classic “Everything You Ever Wanted to Know” centralised models a luxury for most (especially as such models tend to “age” very quickly).

If you’re about to invest in or refurbish your data warehouse or your reporting data sub-systems, don’t do so without first taking a serious look at PowerPivot. This is a game-changer, not just for full-stack Microsoft BI shops, but for any business that finds that its reporting datasets invariably end up in Excel.

If you need any help evaluating PowerPivot or modelling your reporting needs in PowerPivot, I’m for hire.

Update:

Just in case you think I’m a dimensional-model unbeliever or likely to abandon my star-schema roots, read this…

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.

Micro ETL in the PowerPivot age

Although PowerPivot has many of the characteristics of an ETL tool, i.e. the ability to connect to disparate datasources, to filter that data and to transform it, it will still hit a brick wall when confronted by the typical data spewed out by operational systems. I’m sure this is by design as a sophisticated ETL tool is both complex to design and, probably even more relevant, is difficult to use.

Mind you, a few years back we IT pros would have said the same about front-end BI cube configuration, and behold, today we have tools such as PowerPivot that prove that this doesn’t always hold true. Perhaps subsequent versions of PowerPivot will do the same for ETL as this one has for BI cubes. In the meantime, much of the necessary ETL will have to take place prior to loading into PowerPivot. But where?

First off, what’s ETL?

The term ETL applies to one of the trinity of activities that have, over the last two decades or so, been at the heart of reporting/decision support systems. The other two terms, DW (data warehousing) & BI (business intelligence), are sometimes used to refer to the whole process but can also be used to refer to two distinct sub-processes. Confused? Well, so you should be; these terms have been abused and redefined by scores of vendors over the years, but for our purpose here we’ll stick to their roles as acronyms for two of the processes involved in the preparation & presentation of reporting data.

BI is the term now most commonly associated by non-IT folks with decision support systems, as its role is the most obvious, i.e. front-end presentation and manipulation of data; the dashboards, pivots, charts, summary lists etc…

DW, data warehousing, is the term that most IT people who’ve been in the business for a while would use to describe the techniques, best practice etc. associated with this area. The heart of traditional DW was the data warehouse itself, a mighty repository of historical data optimized for reporting purposes. When DW as a concept started it was very rare indeed for operational (OLTP) systems to hold transactional data for more than a few weeks, usually just long enough to get through month-end. Such specially built datastores’ days may be numbered as the data capacity of operational systems grows and the data munching ability of new ETL techniques (MapReduce for example) to transform vast amounts of data continues to increase.

ETL stands for Extract, Transform and Load; it is sometimes also styled ELT, Extract, Load & Transform (PowerPivot would fall into this category). This is the process which traditionally swallowed most of the development budget in DW/BI projects (and kept me gainfully employed for years). It was the area where the dark arts of datasmiths collided with the often frightening reality of raw untamed data, with the added venom of corporate-politics-driven “data ownership” battles. A messy business, and it continues to be, even in these days of open data and open APIs.

ETL tools vary from text-editor written SQL to hugely expensive point’n’click ETL packages. Packaged ETL vendors promised (and continue to do so) that their tools would vanquish the dark arts of datasmiths with products that were so easy to use that the CEO would chip in with a few scripts to get the project finished. The reality was that IT types found they had to learn yet another sub-optimal “language” and more often than not had to drop down to “proper” languages to actually drive the thing to completion. ETL was (and still largely is) the preserve of IT.

ETL is easier than herding cats but just about ...

The tools have improved a lot since those early days, and open source has at least removed, for some, the six-figure licensing costs from the equation. ETL, like programming in general, is hard, so get over it. Tools, basic knowledge of SQL and data modelling skills can help to make ETL approachable to non-IT types, but it still has the potential to make your head hurt.

So what’s a PowerPivot’r to do?

If your organisation already has a data warehouse in place you’re in luck, as it’s quite likely a lot of the data you require will exist in the optimal PowerPivot import format, i.e. a star schema. You might be out of luck though: a significant percentage of DWs will not have used dimensional modelling and you could find yourself looking at a complex OLTP-like data model. In that case, and in the case of pulling the data directly from an operational system, you’re in the micro ETL business. Even if your IT infrastructure provides you with cleansed and understandable data, you’ll be faced with integrating external or shadow-IT data (probably one of the main reasons why PowerPivot appeals); again you’ll either need IT support or else you must learn how to do it yourself.

Long before the likes of PowerPivot appeared I regularly found myself in need of a micro ETL toolbox, i.e. a set of tools that would enable me to quickly and cost-effectively prepare data for loading into some system or other. Nine times out of ten that system was an Excel PivotTable (the rest of the time it was usually a master data take-on task or some variation of systems’ commissioning). Although the consumers of my datasmithing services would most likely assume that I used Excel alone to perform these works of wonder, I usually had an Oracle database (along with its data loaders and superb PL/SQL language) as my secret ingredient. This combination of Excel and Oracle served me (and my clients) well, but it wasn’t the Oracle bit that gave me the edge; I could, and did, substitute SQL Server and MS Access for the SQL layer. The real trick was the combination of Excel’s flexibility/presentational strengths with SQL’s list handling power.

The problem with this approach was that the interface between the SQL engines & the spreadsheets often involved quite a number of manual steps, and the presence of database software (even MS Access) could not always be depended on. It took my discovery of SQLite to enable me to finally combine the two worlds; microETL was born!

This combination of Excel and an in-process SQL engine (provided by SQLite), with the added optional ability to call either VBA or Python scripts, has provided me with a hugely flexible and powerful micro ETL tool. Now, with the arrival of PowerPivot, I have both the micro ETL and micro BI tools to build cost-effective Excel-based decision support systems.
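The general idea of pairing spreadsheet-style rows with an in-process SQL engine can be sketched in a few lines of Python using the standard sqlite3 module. This is not microETL’s own code, just an illustration of the pattern; the staging table, its columns and the sample rows are assumptions of mine.

```python
import sqlite3

# Hypothetical rows as they might arrive from a spreadsheet range or a CSV export:
# everything is text, and the customer names are inconsistently typed.
raw_rows = [
    ("2011-01-05", " acme ", "Widget", "20.00"),
    ("2011-01-05", "ACME", "Gadget", "25.00"),
    ("2011-01-06", "Bolt", "Widget", "50.00"),
]


def summarise_by_customer(rows):
    """Load raw rows into an in-memory SQLite table, clean them and total by customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staging (sale_date TEXT, customer TEXT, product TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT UPPER(TRIM(customer)) AS customer_name,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM staging
        GROUP BY UPPER(TRIM(customer))
        ORDER BY 1
        """
    ).fetchall()
    conn.close()
    return result


summary = summarise_by_customer(raw_rows)
print(summary)
```

The cleaned list that comes back is the sort of thing that could then be written to a worksheet or handed to a PivotTable or PowerPivot as a source.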

Being Excel-based means that the end result is delivered in a format that many business people are comfortable with; PowerPivot is designed very much with “civilian” datasmiths in mind. Likewise, microETL’s ability to use VBA, simple SQL and Excel formulas to perform data transformations makes a large part (if not all) of the ETL process “civilian friendly”.

I’m not saying that everything I can do with microETL will be as easy for a non-IT datasmith; many datasources are either too difficult and/or extremely time consuming for end-users to navigate; but much of the business logic can be expressed in Excel terms with the highly technical or time-consuming tasks handled by SQL or VBA/Python/JavaScript. MicroETL is not only for once-off transformations but can also be used to automate ETL, report generation and refresh tasks (including refreshing PowerPivot itself).

So, if you’re thinking about utilising PowerPivot, but need help in preparing your data and automating the tasks involved, perhaps we should talk.

UPDATE:

I’m in the process of developing an off-shoot of microETL called HAMMER which I believe will be even more suitable as a micro ETL tool for the PowerPivot age. You can follow its progress here …

Star Schemas – to boldly go where no Excel spreadsheet has gone before

Need help designing PowerPivot models 
see  www.gobansaor.com/powerpivot

One of the many things that delights me about PowerPivot is the central role played by the Star Schema. Those of you reading with a data-warehousing background would shrug your shoulders and say: “So what, what else would you expect to find at the core of a BI tool?”.

Those from an Excel PivotTable background would ask: “What’s a Star Schema, why do we need one, what’s wrong with the good old-fashioned single flattened table?”.

Those from a classic MOLAP background (Essbase, TM1, Palo) might also ask: “Why do we need this extra layer? Load the cube directly from the operational data model and get on with it!”.

A quick Q&A is perhaps the best way for me to explain why star schema design is a powerful skill in a datasmith’s toolset.

First off, what’s a Star Schema?

A Star Schema (also known as the dimensional model) is a denormalised (flattened) data model used to simplify an operational (OLTP) data model to better accommodate reporting and what-if analysis.

At its simplest, it consists of a central fact table with links back to a “surrounding” set of dimensional tables, hence the star name. A variation is the snow-flake schema, where the dimensional tables are not fully denormalised (e.g. Product Category->Product->Fact instead of  Product->Fact).

The role of the fact table (besides being the table that hosts most of the measure fields) is to create linkages between dimensions (such as Customer, Product, Date), usually based on an actual transactional event (e.g. Invoice) or a proposed event (such as a Budget or Forecasted Sale). In effect, it simplifies the often complex work-flow-driving connections of a typical operational system by using a single many-to-many relationship (modern ERP/CRM systems’ data models consist of scores of configurable many-to-many relationships).
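As a rough illustration of the shape described above, here is a small star built in Python with the standard sqlite3 module: three dimension tables and one fact table that links them and carries the measure. Every table name, column and sample row is hypothetical. Note that category sits directly on dim_product (star style) rather than in its own table (which would make it a snowflake).

```python
import sqlite3

# Hypothetical star: the fact table holds one key per dimension plus the measure.
STAR_DDL = """
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, iso_date TEXT, year INTEGER);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    date_key     INTEGER REFERENCES dim_date (date_key),
    amount       REAL
);
"""


def build_star():
    """Create the star in memory and load a handful of sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_DDL)
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                     [(1, "Acme", "North"), (2, "Bolt", "South")])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                     [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(20110105, "2011-01-05", 2011), (20110106, "2011-01-06", 2011)])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                     [(1, 1, 20110105, 20.0), (1, 2, 20110105, 25.0), (2, 1, 20110106, 50.0)])
    return conn


def sales_by_region_and_product(conn):
    """A typical star query: join the fact to the dimensions it needs, then group."""
    return conn.execute(
        """
        SELECT c.region, p.name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        JOIN dim_product  AS p ON p.product_key  = f.product_key
        GROUP BY c.region, p.name
        ORDER BY c.region, p.name
        """
    ).fetchall()


star_result = sales_by_region_and_product(build_star())
print(star_result)
```

Every report question follows the same pattern (join the fact to whichever dimensions are wanted, then group), which is the simplification the star is there to provide.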

Many wrongly believe the star-schema was adopted for performance reasons and that, now that in-memory OLAP is becoming the norm, it’s no longer necessary to use dimensional modelling techniques. In fact, in the early days of data-warehousing, RDBMSs had great difficulty efficiently handling star-queries (and some, such as MySQL and SQLite, still do).

The original primary purpose of the star schema was to simplify the SQL required to access reporting data; to make the model more approachable to non-technical users. Of course, even simple SQL was beyond the knowledge or interest of most end-users, but a sizeable proportion were happy to write it (often helped by SQL “generators” such as MicroStrategy or Business Objects). But even in situations where SQL-wielding civilians were not to be found, the simplicity of the dimensional models proved to be a valuable aid when establishing and developing the warehouse data requirements. PowerPivot requires no SQL knowledge to manipulate the dimensional model, which brings the original concept full-circle, but this time opening its possibilities to a much wider audience.

But surely, concentrating on the actual reports would be a more valuable requirements gathering exercise?

A so-called “bottom-up approach” is often the best way to approach a reporting request, particularly if the reports are simple one-off “traditional” reports. But for self-service BI, this needs to be combined with a top-down dimensional design. The idea is not to build out each and every report or indeed cube but to build a structure that’ll support likely queries. The process of building a star schema provides both a logical model and a physical implementation of that model against which potential queries can be tested. I’ve worked on several POCs destined for implementation in Essbase where the star-schema was built and potential cubes mocked up using Excel PivotTables, and which subsequently never went any further (except for the star-schema ETL process). The end-users derived sufficient value from the denormalised star-schema pivoted and reported in Excel.

In traditional ROLAP data-warehouses where the cubes were built directly against star-schemas, the pure logical approach to the data model often had to take a back-seat to the necessity of fine-tuning it to make response times (be that ETL or user-pivoting) acceptable. This is why I much preferred situations where the star acted as a logical model from which MOLAP cubes were built.

With PowerPivot, ROLAP has a new champion. The column-oriented high-compression in-memory architecture means that the compromises of the past are no longer necessary. The fact table reverts to its primary role as a many-to-many connector. In a pure hypercube, measures are just another dimension (the approach that Palo takes); this is also now true of PowerPivot models: measures can be sourced from dimension tables and dimensions from fact tables, as it logically should be, but without the performance hit of old.

But what’s the advantage of a star schema over a flattened table when using PowerPivot?

It is true that the same flattened table model as used to backend a PivotTable can be used within PowerPivot. But doing so would limit the potential of the DAX language to construct measures such as average sales spread over potential customers (rather than actual customers that would typically be represented on a flattened table). Also, by creating “conformed dimensions” (single cross-business views of Customer, Product etc.) and using such tables as dimensional sources for multiple fact tables, “virtual cubes” that combine values from multiple fact tables can be built.
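The “potential versus actual customers” point is easy to show with a toy example. The sketch below uses Python and the standard sqlite3 module rather than DAX, and its table names and figures are invented: four customers exist in the dimension but only two appear in the sales facts, so a flattened sales table on its own can only ever “see” the two who bought something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical customer dimension: every customer we could sell to.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO dim_customer VALUES (1, 'Acme');
    INSERT INTO dim_customer VALUES (2, 'Bolt');
    INSERT INTO dim_customer VALUES (3, 'Crate');
    INSERT INTO dim_customer VALUES (4, 'Dune');

    -- Hypothetical fact table: only customers 1 and 2 have bought anything.
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 20.0);
    INSERT INTO fact_sales VALUES (1, 25.0);
    INSERT INTO fact_sales VALUES (2, 50.0);
    """
)

# Average over actual customers: all a flattened sales table can offer,
# because customers with no sales have no rows in it.
avg_per_actual = conn.execute(
    "SELECT SUM(amount) / COUNT(DISTINCT customer_key) FROM fact_sales"
).fetchone()[0]

# Average over potential customers: needs the separate dimension table
# to supply the full customer count.
avg_per_potential = conn.execute(
    "SELECT (SELECT SUM(amount) FROM fact_sales) / (SELECT COUNT(*) FROM dim_customer)"
).fetchone()[0]

print(avg_per_actual, avg_per_potential)
```

With these made-up figures the two averages differ by a factor of two, purely because the dimension knows about customers the facts have never mentioned.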

If you’re new to dimensional modelling I’d recommend the books & articles of Ralph Kimball as a good starting point. You do have to be aware that some of the advice regarding efficiency trade-offs, surrogate keys etc. does not apply in a PowerPivot scenario (even though other performance issues still do), but the logical design tips still hold.

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.

More on data modelling for PowerPivot …
