Category Archives: BI

PowerPivot & The Parable of the Snow Sock

For those of us in North West Europe this winter has been one of the coldest on record; not only cold but early, in this part of the world winter usually only kicks in after Christmas. The beauty of a perfect White Christmas Day did make up for the previous month of extreme cold and inconvenience but only just; and would not have at all if I had not invested in a simple piece of relatively new technology; a pair of snow socks.

For it was this pair of snow socks that enabled our household to continue to get to work and school, to Christmas shop, visit a sick relative and drive on Christmas Day to the winter wonderland that was a frozen Blessington Lake.

As the cold spell continued the main roads (and eventually the motorways) became hazardous to travel on; minor roads and housing estates became close to impassable for many normal cars. As the ability of the county council to keep roads functioning decreased (due to an early winter catching it on the hop with insufficient salt stocks and no Plan B) so my family and our neighbours realised than self-help was our best bet. People started investing in snow shovels, sand and salt, and hiring JCBs; gritting hills, clearing school and community hall carparks, helping dig out neighbours’ and strangers’ cars. And I invested in a pair of Weissenfels WeissSock Snow Socks (from http://www.micksgarage.ie/tyre-snow-socks.aspx).

Snow chains or even winter tyres are not seen as necessary by most drivers in Ireland or the UK; our winters (including this and last year’s nearly as cold one) are not that severe. Most are unaware of the alternative, the snow sock, a relatively new concept; in effect, a fabric snow chain. A fraction of the cost of winter tyres; easier to use (and nearly as effective) as snow chains.

Having a pair meant we had our own self-service road treating device, insuring we could safely get to the local town or nearest motorway and  when the M7 motorway was backed-up being able to use the old un-treated N7 road to bypass it.

PowerPivot is to Excel what a Snow Sock is to a car in a snowy winter. The ideas and a lot of the technology used in both products have been around for some time, but the accessibility, low-cost and “good enough” packaging are game changers.

Like our neighbourhood waiting on an over-stretched, under-funded county council to solve our “last mile” problems, countless small companies, departments and individuals in large organisations are likewise awaiting their IT Godots often in vain and if delivered, often too little and/or too late.

Two technologies that I’ve been introduced to in the last 12 months:

both liberating, both having the ability to make aspects of my life much better and much easier. I expect to see a lot more people adopt both in the year to come.

So if you’re stuck for some New Years’s resolutions this year:

  • Buy some snow socks now; i.e. do NOT wait until the next “snow event”!
  • Learn PowerPivot.

Happy New Year ;)

 

UPDATE: Looks like Marco & Alberto are doing one of their PowerPivot courses in Dublin this March, so no excuses for not learning PowerPivot http://sqlblog.com/blogs/marco_russo/archive/2011/02/15/new-powerpivot-workshop-dates-copenhagen-dublin-and-zurich.aspx

The 64-bit question, and the birth of MicroETL

Like the original $64 question, I have for a long time being pondering the Excel 64-bit question. Whether “To take it, or leave it”?

When first announced, I believed 64bit Excel would only be of interest to a minority of demented quants in investment banks hacking their way to yet another evil model to bankrupt the world. The problem of incompatible add-ins, COM controls etc. would also make its widespread adoption less likely.

But three things changed my mind:

  • first, the appearance of Windows 7 powered 64-bit PCs on many worker’s desks
  • second, my embedding of SQLite into Excel which enables me to effectively use the extra memory available to a 64-bit instance to hold, analyse and transform a huge amount of raw relational data (in the form of a “:memory:” database)
  • and finally, the arrival on the scene of PowerPivot.

Although a PowerPivot workbook doesn’t allow more than 4Gig of memory to be addressed (this is a SharePoint restriction, as a 4G workbook is likely to compress down to <= the 2G upload limit imposed by SharePoint), this represents a doubling of the 32bit limits in memory terms, but much more in raw data terms as PowerPivot can achieve up to 10x-20x times data compression.

So, I decided to bite the bullet; fired up an AWS 64-bit image and started to upgrade my xLite code-base to handle this new world.

I decided on the TDN-GCC MinGW-w64 distro to re-compile 64-bit versions of my SQLite wrapper  libraries, and set about converting  my VBA code to handle both 64-bit and 32-bit dlls. So far, easy peasy.

The problems started when I went to convert my JavaScript & Python wrappers. I couldn’t get JSDB to compile to either 32-bit or 64-bit using MinGW-w64, so I abandoned the attempt as my main use of JSDB is as a “command-line data-crunching tool” similar to, and used along side, CSVFix and SFK (Swiss File Knife). Having the ability to call JavaScript in-process was relatively new and I could live without it.

More worrying was the problem with Python embedding, in-process Python would be harder to live without. The APSW SQLite library which I had used is 32-bit only. Luckily the Python 2.7 version of its SQLite3 standard module  (aka pysqlite) now allows the loading of SQLite C extensions. This enables me to load databases under Python and extract the C handle using a specially written extension, thus enabling Excel/VBA to attach to a Python opened SQLite shared memory.

Without this new  loaded extension trick I would have had to modify the SQLite3 library itself as the previous 2.6 version would not divulge SQLite’s handle, while APSW did, which was  the reason I picked APSW in the first place.  As a result, xLite’s Python functionality now requires Python 2.7.

So the end result is a somewhat reduced-functionality xLite  (no in-process JavaScript, missing APSW functionality such as Python-based SQLite virtual tables), but running on the ultimate data-smithing platform:

  • Excel 64-bit combining with
  • the relational power of SQLite 64bit,
  • the fast and easy scripting power of Python 64-bit,
  • and topped off with the king of pivots, 64-bit PowerPivot.

A week well spent, I think!

Oh, I’ve also decided to rename the add-in to MicroETL. This is mainly to reflect the fact that xLite (SQLite embedded in Excel) is just one element within the tool. The  pure-VBA detached-ADO recordset functionality, embedded Python and xLiteScript elements can exist independently of xLite.  That, and I’m fed-up getting enquiries about X-Lite http://www.counterpath.com/x-lite.html and I own the http://www.microETL.com URL ;-)

Micro ETL in the PowerPivot age

Although PowerPivot has many of the characteristics of an ETL tool, i.e. the ability to connect to disparate datasources, to filter that data and to transform it, it will still hit a brick wall when confronted by the typical data spewed out by operational systems. I’m sure this is by design as a sophisticated ETL tool is both complex to design and, probably even more relevant, is difficult to use.

Mind you, a few years back we IT pros would have said the same about front-end BI cube configuration, and behold today we have tools such as PowerPivot that prove that this doesn’t always hold true. Perhaps subsequent versions of PowerPivot will do the same for ETL as it has for BI cubes. In the mean time much of the necessary ETL will have to take place prior to loading into PowerPivot.  But where?

First off, what’s ETL?

The term ETL applies to one of the trinity of activities that have, over the last two decades or so, been at the heart of reporting/decision support systems. The other two terms: DW (data warehousing)  & BI (business intelligence), are sometimes used to refer to the whole process but can also be used to refer to two distinct sub-processes. Confused? Well, so you should be; these terms have been abused and redefined by scores of vendors over the years but for our purpose here we’ll stick to their roles as acronyms for the two of the  processes involved in the preparation & presentation of reporting data.

BI is the term now most commonly associated by non-IT folks with decision support systems, as it’s role is the most obvious i.e. front-end presentation and manipulation of data; the dashboards, pivots, charts, summary lists etc…

DW, data warehousing, is the term that most IT people who’ve been in the business for a while would use to describe the techniques, best practice etc. associated with this area. The heart of traditional DW was the data warehouse itself, a mighty repository of historical data optimized for reporting purposes. When DW as a concept started it was very rare indeed for operational (OLTP) systems to hold transactional data for more that a few weeks, usually just long enough to get through month-end.  Such specially built datastores’ days may be numbered as the data capacity of operational systems grow and the data munching ability of new ETL techniques (MapReduce for example) to transform vast amounts of data continues to increase.

ETL stands for Extract,Transform and Load; sometimes also styled, ELT, extract load & transform (PowerPivot would fall into this catergory). This is the process which traditionally swallowed most of the development budget in DW/BI projects (and kept me gainfully employed for years). It was the area where the dark arts of datasmiths collided with the often frightening reality of raw untamed data, with the added venom of corporate-politics-driven “data ownership” battles.  A messy business, and continues to be, even in these days of open data and open APIs.

ETL tools vary from text-editor written SQL to hugely expensive point’n'click ETL packages. Packaged ETL vendors promised (and continue to do so) that their tools would vanquish the dark arts of datasmiths with products that were so easy to use that the CEO would chip in with a few scripts to get the project finished. The reality was that IT types  found they had to learn yet another sub-optimal “language” and more often than not had to drop-down to “proper” languages to actually drive the thing to completion. ETL was (and still largely is) the preserve of IT.

ETL is easier than herding cats but just about ...

The tools have improved a lot since those early days and open source has at least removed for some the 6 figure licensing costs from the equation. ETL, like programming in general, is hard, so get over it. Tools, basic knowledge of SQL and data modelling skills can help to make ETL approachable to non-IT types, but it still has the potential to make your head hurt.

So what’s a PowerPivot’r to do?

If your organisation already has a data warehouse in place you’re in luck as it’s quite likely a lot of the data you require will exist in the optimal PowerPivot import format, i.e. a star schema. You might be out of luck though, a significant percentage of DWs will not have used dimensional modeling and you could find yourself looking a complex OLTP like data model. In that case, and in the case of pulling the data directly from an operational system, you’re in the micro ETL business. Even if your IT infrastructure provides you with cleansed and understandable data, you’ll be faced with integrating external or shadow-IT data (probably one of the main reasons why PowerPivot appeals); again you’ll either need IT support or else you must learn how to do it yourself.

Long before the likes of PowerPivot appeared I regularly found myself  in need of a micro ETL toobox i.e. a set of tools that would enable me to quickly and cost-effectively prepare data for loading into some system or other. Nine out of ten times that system was an Excel PivotTable (the rest of the time it was usually a master data take-on task or some variation of systems’ commissioning). Although the consumers of my datasmithing services would most likely assume that I used Excel alone to perform these works of wonder, I usually had an Oracle database (along with its data loaders and superb PL/SQL language) as my secret ingredient. This combination of Excel and Oracle served me (and my clients) well, but it wasn’t the Oracle bit that gave me the edge; I could, and did, substitute SQL Server and MS Access for the SQL layer. The real trick was the combination of Excel’s flexibility/presentational strengths with SQL’s list handling power.

The problem with this approach was that the interface between the SQL engines & the spreadsheets often involved quite a number of manual steps, and the presence of database software (even MS Access) could not always be depended on. It took my discovery of SQLite to enabled me to finally combine the two worlds; microETL was born!

This combination of Excel, and an in-process SQL engine (provided by SQLite) with the added optional ability to call either VBA or Python scripts, has provided me with a hugely flexible and powerful micro ETL tool. Now, with the arrival of PowerPivot, I have both the micro ETL and micro BI tools to build cost-effective Excel-based decision support systems.

Being Excel based, means that the end result is delivered in a format that many business people are comfortable with; PowerPivot is designed very much with “civilian” datasmiths in mind. Likewise, microETL’s ability to use VBA, simple SQL and Excel formula to perform data transformations makes a large part (if not all) of the ETL process “civilian friendly”.

I’m not saying that everything I can do with microETL will be as easy for an non-IT datasmith; many datasources are either too difficult and/or extremely time consuming for end-users to navigate; but much of the business logic can be expressed in Excel terms with the highly technical or time-consuming tasks handled by SQL or VBA/Python/JavaScript. MicroETL is not only for once-off transformations but can also be used to automate ETL, report generation and refresh tasks (including refreshing PowerPivot itself).

So, if you’re thinking about utilising PowerPivot, but need help in preparing your data and automating the tasks involved, perhaps we should talk.

UPDATE:

I’m in the process of developing an off-shoot of mciroETL called HAMMER which I believe will be even more suitable as a micro ETL tool for the PowerPivot age. You can follow it’s progress here …

PowerPivot – Show Detail not allowed!

Last week, I at long last set aside some time to put PowerPivot through its paces, triggered by my purchasing of Excel 2010 (in itself a momentous occasion as without the attraction of PowerPivot I would have followed my, and most other Office users’, usual pattern of waiting 3-5 years or so before investigating the ‘latest’ edition).

The verdict?

In general, still impressed, the concept demo’d to me over a year ago has evolved into an impressive first version. Some things such as hierarchies did not make it (hierarchies are constructed from cross-joins of field /attribute sets as per a normal pivot table) so no hierarchy rules as would be the case with standard OLAP cubes; but perhaps for many end users the ‘traditional’ construct-on-the-fly hierarchies will be more approachable.

The DAX functionality is better than I expected; easier for non-techies than MDX but still powerful.

Importing data is hassle free and intuitive, the VertiPaq engine does a wonderful job of compressing the imported data and the resulting in-memory column-store is certainly very fast. I like the ‘linked-table’ option which allows for normal Excel tables (the one useful new feature that Excel 2007 introduced) to be added to the PowerPivot star-schema.  Being able to import any datasource publishable in AtomPub format (such as Google Docs spreadsheets!) is also nice.

So it’s all good then? Yes, except for one really annoying missing; no drill-through (aka Show Detail, aka drill-thru) allowed.

What? surely some mistake!

Afraid not, I initially thought the measures I attempted to Show Detail on were too complex (as the error message “Show Details cannot be completed on a calculated cell” suggested). Then I assumed it had been cut to meet a delivery deadline and would appear in a subsequent version. But no, Donald Farmer confirmed that is was intentionally removed as the feedback from IT organisations was not to allow end-users the ability to drill-through to potentiality millions of rows when running under a SharePoint server. As for those of us running PowerPivot on the client, we already have all the data, so no need for show-detail!

Okay, I can understand IT’s reluctance to allow a multi-million row drill-through but surely that should be the decision of individual IT groups to allow or not, and if allowed, to provide the ability to limit the amount of data returned.

A million row result is of limited use, most Show Details are a few thousand at max, and typically sub-1000, so sensible limits can easily be enforced.

The client-side (the side that matters most to me) is a very different story. Here the excuse that the data is already there, is exactly that, an excuse. Using the same logic the drill-through on a normal pivot-table should be unnecessary. Yet, if you watch end-users construct pivots they use it constantly; not just to discover the detail behind a figure but as often as not as a way of validating that the model they’ve constructed is correct.

This spot-checking of figures is the main ‘test methodology’ used in the wild. Spreadsheet ‘developers’ do not construct sophisticated test harnesses and procedures. You might argue they should, but they don’t and likely never will. And as for the ‘multi-million row result’ problem, end-users are not idiots they’re just end-users, they’ll do it once, and learn to be more careful the next time (or they’ll use the limit-rows option).

This lack of drill-though will definitely mean I will continue to use normal pivot tables for situations that would otherwise be better solved using PowerPivot. As many such models will be based on relatively small datasets ( sub-100,000  ’facts’) it might be suggested there’s no need for PowerPivot. But this is to miss the ‘intellectual’ power (as opposed to the massive data crunching power of VeriPaq) at the heart of PowerPivot; the star-schema.

Most of  the commentary on PowerPivot has focused on its ability to handle really large datasets but this emphasis on ‘big-data’ (something the rest of the BI industry share) often ignores the power of human-scale small-data (i.e. the world of the spreadsheet jockeys). The power of a star-schema to model BI problems (be they small or large) is something I’ll return to in a later post. (UPDATE: http://blog.gobansaor.com/2010/07/09/star-schemas-to-boldly-go-where-no-excel-spreadsheet-has-gone-before/)

Star-schema models (particularly when the speed of access worries are removed by an in-memory column-store) are superior in many respects to the fully denormalised flattened tables that we currently build to support pivot-tables and are also more flexible than the multi-dimensional cell approach of pure MOLAP cubes. Combining such data models with the user-friendliness of spreadsheets,alongside the added magic of a modelling language such as DAX (and some MDX where necessary) on a datasmith’s laptop is the true beauty of PowerPivot.

So, lack of drill-through aside, “Well done Microsoft!”

UPDATE:

For a semi-acceptable alternative to no drill-thru see http://blog.gobansaor.com/2011/03/22/powerpivot-show-details/

JavaScript as an Excel scripting language via JSDB

A few years back I posted about JavaScript as an Excel scripting language via ExcelDNA. That involved using JavaScript (in the guise of JScript.NET) as an ExcelDNA scripting language. It was purely an academic exercise to prove it could be done, I continued to use C# (or increasingly VB.NET) to build .NET user defined functions. This time however, I’ve managed to embed JavaScript (in the guise of Mozilla Foundation’s SpiderMonkey) directly via a native C interface not to prove I could do it (even though there’s a definite satisfaction in simply doing it) but to use it.

Why add another scripting language to xLite, hasn’t it already got Python?

True, Python is and remains a very powerful add-on to xLite. It’s a mature and long-established language, popular amongst IT professionals and “citizen programmers” alike. But, it’s a bit of a monster and can be awkward to package, particularly on Windows. By using Py2Exe and after a lot of digging on the issue of manifest files I have managed to package and isolate xLite’s Pythonic bits so that it can be used on a PC without first installing the required Python version (I’ve only tested against V2.6, Python’s lack of a side-by-side Windows installation capability is a major pain-in-the-butt bad news: tested against Python 2.5 & it doesn’t work; good news: side-by-side is possible; simply change the system path to reflect which-ever version you wish to run at the command line; xLite will (must) continue to use V2.6). This “version-hell” mitigates against using Python as a core-element of xLite, fine for those of us who are comfortable with and require the full power of Python, but not as the tool’s primary scripting environment.

No, what I need is:

  • a light (single EXE or DLL preferably) and an approachable popular language,
  • with native SQLite support,
  • runable as standalone executable (on both Windows & Linux),
  • embeddable (is that a word?) in Excel via VBA-friendly DLL.

Add to that essential list some nice-to-haves such as:

  • native COM-interface support (for the likes of ADO etc.),
  • native networking support, for HTTP, raw TCP sockets etc.,
  • native (and easy to use) XML and JSON parsers and emitters,
  • ability to spawn detached/attached command line processes and the ability to stream data to and from such processes; allowing me to easily orchestrate & provide a “grid” of processes (scripted, command line executables, Excel instances) both local and remote (with remote being either traditional servers, http servers or Hadoop Streaming grids).

While looking at the various implementations of Javascipt as a server/shell language, I came across JSDB – JavaScript for Databases. It’s a C++ wrapper around Mozilla’s SpiderMonkey, with lots of useful data related utility classes added. To make JSDB a perfect fit it simply required:

  • a few minor changes to the SQlite class (allow the loading of Virtual Table extensions, add the ability to pass in the address of already open SQLite memory structures);
  • a linker change to use the DLL version of SQLite;
  • plus a VBA-friendly DLL wrapper ( & VBA declares, to call the DLL) to replace the JSDB shell when embedding in Excel.

From my SQLite as the MP3 of Data post:  ”Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.” I can now add a distributed “fractional horsepower” processing engine for that distributed data. This transforms xLite from a micro-ETL platform into one capable of handling (or at least orchestrating) practically any ETL (Extract, Transform & Load),DI (data integration) or “Time Asset” (see this post) process.

UPDATE:

Here’s a VBA source with examples of using the embedded JSDB engine:

http://www2.gobansaor.com/share/example_jsdb_vb_embed.bas

… and the source code for the DLL wrapper:

http://www2.gobansaor.com/share/wrap_sqlite.cpp

… plus the VBA Declares and helper Windows APIs:

http://www2.gobansaor.com/share/jsdb.bas

UPDATE:

See JSDB being used to front-end Excel as a web service http://blog.gobansaor.com/2011/02/06/excel-as-a-diy-web-service/