Category Archives: BI

Micro ETL in the PowerPivot age

Although PowerPivot has many of the characteristics of an ETL tool, i.e. the ability to connect to disparate datasources, to filter that data and to transform it, it will still hit a brick wall when confronted by the typical data spewed out by operational systems. I’m sure this is by design as a sophisticated ETL tool is both complex to design and, probably even more relevant, is difficult to use.

Mind you, a few years back we IT pros would have said the same about front-end BI cube configuration, and behold today we have tools such as PowerPivot that prove that this doesn’t always hold true. Perhaps subsequent versions of PowerPivot will do the same for ETL as it has for BI cubes. In the mean time much of the necessary ETL will have to take place prior to loading into PowerPivot.  But where?

First off, what’s ETL?

The term ETL applies to one of the trinity of activities that have, over the last two decades or so, been at the heart of reporting/decision support systems. The other two terms: DW (data warehousing)  & BI (business intelligence), are sometimes used to refer to the whole process but can also be used to refer to two distinct sub-processes. Confused? Well, so you should be; these terms have been abused and redefined by scores of vendors over the years but for our purpose here we’ll stick to their roles as acronyms for the two of the  processes involved in the preparation & presentation of reporting data.

BI is the term now most commonly associated by non-IT folks with decision support systems, as it’s role is the most obvious i.e. front-end presentation and manipulation of data; the dashboards, pivots, charts, summary lists etc…

DW, data warehousing, is the term that most IT people who’ve been in the business for a while would use to describe the techniques, best practice etc. associated with this area. The heart of traditional DW was the data warehouse itself, a mighty repository of historical data optimized for reporting purposes. When DW as a concept started it was very rare indeed for operational (OLTP) systems to hold transactional data for more that a few weeks, usually just long enough to get through month-end.  Such specially built datastores’ days may be numbered as the data capacity of operational systems grow and the data munching ability of new ETL techniques (MapReduce for example) to transform vast amounts of data continues to increase.

ETL stands for Extract,Transform and Load; sometimes also styled, ELT, extract load & transform (PowerPivot would fall into this catergory). This is the process which traditionally swallowed most of the development budget in DW/BI projects (and kept me gainfully employed for years). It was the area where the dark arts of datasmiths collided with the often frightening reality of raw untamed data, with the added venom of corporate-politics-driven “data ownership” battles.  A messy business, and continues to be, even in these days of open data and open APIs.

ETL tools vary from text-editor written SQL to hugely expensive point’n'click ETL packages. Packaged ETL vendors promised (and continue to do so) that their tools would vanquish the dark arts of datasmiths with products that were so easy to use that the CEO would chip in with a few scripts to get the project finished. The reality was that IT types  found they had to learn yet another sub-optimal “language” and more often than not had to drop-down to “proper” languages to actually drive the thing to completion. ETL was (and still largely is) the preserve of IT.

ETL is easier than herding cats but just about ...

The tools have improved a lot since those early days and open source has at least removed for some the 6 figure licensing costs from the equation. ETL, like programming in general, is hard, so get over it. Tools, basic knowledge of SQL and data modelling skills can help to make ETL approachable to non-IT types, but it still has the potential to make your head hurt.

So what’s a PowerPivot’r to do?

If your organisation already has a data warehouse in place you’re in luck as it’s quite likely a lot of the data you require will exist in the optimal PowerPivot import format, i.e. a star schema. You might be out of luck though, a significant percentage of DWs will not have used dimensional modeling and you could find yourself looking a complex OLTP like data model. In that case, and in the case of pulling the data directly from an operational system, you’re in the micro ETL business. Even if your IT infrastructure provides you with cleansed and understandable data, you’ll be faced with integrating external or shadow-IT data (probably one of the main reasons why PowerPivot appeals); again you’ll either need IT support or else you must learn how to do it yourself.

Long before the likes of PowerPivot appeared I regularly found myself  in need of a micro ETL toobox i.e. a set of tools that would enable me to quickly and cost-effectively prepare data for loading into some system or other. Nine out of ten times that system was an Excel PivotTable (the rest of the time it was usually a master data take-on task or some variation of systems’ commissioning). Although the consumers of my datasmithing services would most likely assume that I used Excel alone to perform these works of wonder, I usually had an Oracle database (along with its data loaders and superb PL/SQL language) as my secret ingredient. This combination of Excel and Oracle served me (and my clients) well, but it wasn’t the Oracle bit that gave me the edge; I could, and did, substitute SQL Server and MS Access for the SQL layer. The real trick was the combination of Excel’s flexibility/presentational strengths with SQL’s list handling power.

The problem with this approach was that the interface between the SQL engines & the spreadsheets often involved quite a number of manual steps, and the presence of database software (even MS Access) could not always be depended on. It took my discovery of SQLite to enabled me to finally combine the two worlds; xLite was born!

This combination of Excel, and an in-process SQL engine (provided by SQLite) with the added optional ability to call either VBA, JavaScript or Python scripts, has provided me with a hugely flexible and powerful micro ETL tool. Now, with the arrival of PowerPivot, I have both the micro ETL and micro BI tools to build cost-effective Excel-based decision support systems.

Being Excel based, means that the end result is delivered in a format that many business people are comfortable with; PowerPivot is designed very much with “civilian” datasmiths in mind. Likewise, xLite’s ability to use VBA, simple SQL and Excel formula to perform data transformations makes a large part (if not all) of the ETL process “civilian friendly”.

I’m not saying that everything I can do with xLite will be as easy for an non-IT datasmith; many datasources are either too difficult and/or extremely time consuming for end-users to navigate; but much of the business logic can be expressed in Excel terms with the highly technical or time-consuming tasks  handled by SQL or VBA/Python/JavaScript. xLite is not only for once-off transformations but can also be used to automate ETL, report generation and refresh tasks (including refreshing PowerPivot itself).

So, if you’re thinking about utilising PowerPivot, but need help in preparing your data and automating the tasks involved, perhaps we should talk.

PowerPivot – Show Detail not allowed!

Last week, I at long last set aside some time to put PowerPivot through its paces, triggered by my purchasing of Excel 2010 (in itself a momentous occasion as without the attraction of PowerPivot I would have followed my, and most other Office users’, usual pattern of waiting 3-5 years or so before investigating the ‘latest’ edition).

The verdict?

In general, still impressed, the concept demo’d to me over a year ago has evolved into an impressive first version. Some things such as hierarchies did not make it (hierarchies are constructed from cross-joins of field /attribute sets as per a normal pivot table) so no hierarchy rules as would be the case with standard OLAP cubes; but perhaps for many end users the ‘traditional’ construct-on-the-fly hierarchies will be more approachable.

The DAX functionality is better than I expected; easier for non-techies than MDX but still powerful.

Importing data is hassle free and intuitive, the VertiPaq engine does a wonderful job of compressing the imported data and the resulting in-memory column-store is certainly very fast. I like the ‘linked-table’ option which allows for normal Excel tables (the one useful new feature that Excel 2007 introduced) to be added to the PowerPivot star-schema.  Being able to import any datasource publishable in AtomPub format (such as Google Docs spreadsheets!) is also nice.

So it’s all good then? Yes, except for one really annoying missing; no drill-through (aka Show Detail, aka drill-thru) allowed.

What? surely some mistake!

Afraid not, I initially thought the measures I attempted to Show Detail on were too complex (as the error message “Show Details cannot be completed on a calculated cell” suggested). Then I assumed it had been cut to meet a delivery deadline and would appear in a subsequent version. But no, Donald Farmer confirmed that is was intentionally removed as the feedback from IT organisations was not to allow end-users the ability to drill-through to potentiality millions of rows when running under a SharePoint server. As for those of us running PowerPivot on the client, we already have all the data, so no need for show-detail!

Okay, I can understand IT’s reluctance to allow a multi-million row drill-through but surely that should be the decision of individual IT groups to allow or not, and if allowed, to provide the ability to limit the amount of data returned.

A million row result is of limited use, most Show Details are a few thousand at max, and typically sub-1000, so sensible limits can easily be enforced.

The client-side (the side that matters most to me) is a very different story. Here the excuse that the data is already there, is exactly that, an excuse. Using the same logic the drill-through on a normal pivot-table should be unnecessary. Yet, if you watch end-users construct pivots they use it constantly; not just to discover the detail behind a figure but as often as not as a way of validating that the model they’ve constructed is correct.

This spot-checking of figures is the main ‘test methodology’ used in the wild. Spreadsheet ‘developers’ do not construct sophisticated test harnesses and procedures. You might argue they should, but they don’t and likely never will. And as for the ‘multi-million row result’ problem, end-users are not idiots they’re just end-users, they’ll do it once, and learn to be more careful the next time (or they’ll use the limit-rows option).

This lack of drill-though will definitely mean I will continue to use normal pivot tables for situations that would otherwise be better solved using PowerPivot. As many such models will be based on relatively small datasets ( sub-100,000  ’facts’) it might be suggested there’s no need for PowerPivot. But this is to miss the ‘intellectual’ power (as opposed to the massive data crunching power of VeriPaq) at the heart of PowerPivot; the star-schema.

Most of  the commentary on PowerPivot has focused on its ability to handle really large datasets but this emphasis on ‘big-data’ (something the rest of the BI industry share) often ignores the power of human-scale small-data (i.e. the world of the spreadsheet jockeys). The power of a star-schema to model BI problems (be they small or large) is something I’ll return to in a later post. (UPDATE: http://blog.gobansaor.com/2010/07/09/star-schemas-to-boldly-go-where-no-excel-spreadsheet-has-gone-before/)

Star-schema models (particularly when the speed of access worries are removed by an in-memory column-store) are superior in many respects to the fully denormalised flattened tables that we currently build to support pivot-tables and are also more flexible than the multi-dimensional cell approach of pure MOLAP cubes. Combining such data models with the user-friendliness of spreadsheets,alongside the added magic of a modelling language such as DAX (and some MDX where necessary) on a datasmith’s laptop is the true beauty of PowerPivot.

So, lack of drill-through aside, “Well done Microsoft!”

JavaScript as an Excel scripting language via JSDB

A few years back I posted about JavaScript as an Excel scripting language via ExcelDNA. That involved using JavaScript (in the guise of JScript.NET) as an ExcelDNA scripting language. It was purely an academic exercise to prove it could be done, I continued to use C# (or increasingly VB.NET) to build .NET user defined functions. This time however, I’ve managed to embed JavaScript (in the guise of Mozilla Foundation’s SpiderMonkey) directly via a native C interface not to prove I could do it (even though there’s a definite satisfaction in simply doing it) but to use it.

Why add another scripting language to xLite, hasn’t it already got Python?

True, Python is and remains a very powerful add-on to xLite. It’s a mature and long-established language, popular amongst IT professionals and “citizen programmers” alike. But, it’s a bit of a monster and can be awkward to package, particularly on Windows. By using Py2Exe and after a lot of digging on the issue of manifest files I have managed to package and isolate xLite’s Pythonic bits so that it can be used on a PC without first installing the required Python version (I’ve only tested against V2.6, Python’s lack of a side-by-side Windows installation capability is a major pain-in-the-butt   bad news: tested against Python 2.5 & it doesn’t work; good news: side-by-side is possible; simply change the system path to reflect which-ever version you wish to run at the command line; xLite will (must) continue to use V2.6). This “version-hell” mitigates against using Python as a core-element of xLite, fine for those of us who are comfortable with and require the full power of Python, but not as the tool’s primary scripting environment.

No, what I need is:

  • a light (single EXE or DLL preferably) and an approachable popular language,
  • with native SQLite support,
  • runable as standalone executable (on both Windows & Linux),
  • embeddable (is that a word?) in Excel via VBA-friendly DLL.

Add to that essential list some nice-to-haves such as:

  • native COM-interface support (for the likes of ADO etc.),
  • native networking support, for HTTP, raw TCP sockets etc.,
  • native (and easy to use) XML and JSON parsers and emitters,
  • ability to spawn detached/attached command line processes and the ability to stream data to and from such processes; allowing me to easily orchestrate & provide a “grid” of processes (scripted, command line executables, Excel instances) both local and remote (with remote being either traditional servers, http servers or Hadoop Streaming grids).

While looking at the various implementations of Javascipt as a server/shell language, I came across JSDB – JavaScript for Databases. It’s a C++ wrapper around Mozilla’s SpiderMonkey, with lots of useful data related utility classes added. To make JSDB a perfect fit it simply required:

  • a few minor changes to the SQlite class (allow the loading of Virtual Table extensions, add the ability to pass in the address of already open SQLite memory structures);
  • a linker change to use the DLL version of SQLite;
  • plus a VBA-friendly DLL wrapper ( & VBA declares, to call the DLL) to replace the JSDB shell when embedding in Excel.

From my SQLite as the MP3 of Data post:  ”Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.” I can now add a distributed “fractional horsepower” processing engine for that distributed data. This transforms xLite from a micro-ETL platform into one capable of handling (or at least orchestrating) practically any ETL (Extract, Transform & Load),DI (data integration) or “Time Asset” (see this post) process.

SAX and Bugs and XBRuLe

Okay, the XBRuLe is a bit laboured, should be SAX & bugs & XBRL, but any excuse to play some Ian Dury :-)

Bugs (the programming type, not the creepy-crawlies), Simple API for XML and Extended Business Reporting Language;  these represented the trinity of my concerns for the last three weeks  or so.

First, the bugs:

Several weeks back, I decided that the C portion of xLite needed an overhaul. The codebase contained a lot of stuff that I no longer used and also contained code that I’d written when I first re-started using C (after a lapse of 20 years or so, the phrase “I’d forgotten more than I’d ever known” sums up the experience best); some of this code was memory-leaking like a sieve.

Also, the original Pivotal Solutions code was not UTF-8 enabled, instead it used the host pc’s default character set codepage, this needed to change (if you don’t know what I’m talking about see Joel Spolsky’s lecture to the developers of the world – well actually, primarily to those of us in the Anglo-Saxon “ascii-will-do-fine” world).

The bugs I introduced as a result of this upgrade were not of the logical kind but of a much nastier type, peculiar to the low-level world of C, “bad free()” bugs!

Excel was no longer leaking memory (well no more than it normally does) but it was crashing randomly (usually in a DDL called VB7, 1st upgrade to classic VB in over a decade!), a sure sign I was freeing memory that was not mine to free. Two days later, I’d tracked the bugs down, but only by a painful line by line code walk-through. If you’ve no idea what I’m talking about here, count your blessings and move on.

The other major change I added to xLite is the ability to code SQLite Virtual Tables in VBA. The Python side of xLite has always had that facility, but I look on Python as a nice-to-have add-on, not as a core component. The growing need for “core” virtual tables meant either coding them in C or in VBA, see previous paragraphs for why VBA won the day.

The immediate driver for adding both UTF-8 support and quick-to-build virtual tables was the need to better handle XML data within xLite.

… then the SAX:

For small XML/HTML datasets, I, like the rest of the world, use DOM manipulation; but for larger sets I’ve tended to go down the brute force and ignorance approach of hand coded File I/O combined with regular expressions to efficiently parse out the data required.

Last week, an email from a datasmith name Cathy prompted me to look into using Sax for loading XML. Cathy, like most datasmiths, is not a professional programmer, she has a “real job”; part of that job is analysing large datasets, she’s learned enough programming (mainly Access & Excel) to do that job more efficiently. The data she needed to parse this time was encoded in XML, but being very large and built on a schema that constantly changed, the default DOM approach overpowered both Access and Excel.

Cathy had originally contacted me looking for information on using Talend to read the data and it looked like she was about to start a new side-career as a Java programmer. I figured there must be a way for her to leverage her existing skill-set (VBA) & this led me to MSXML’s implementation of SAX2. She was delighted; although many of the concepts would have been new to her, at least they were bounded within a world she was already comfortable with (the basis of how we all manage to incrementally expand our knowledge).

The only problem was, the example code no longer existed (it would have been in VB6, but who uses that these days, other than a few million VBA para-programmers, let them eat the .NET cake). So I coded up the first example in Excel/VBA and here it is if you need a quick-start to the joys of SAX2.

Which leads me on to XBRuLe:

“One XML to rule them all, One XML to find them, One XML to bring them all and in the darkness bind them…”

A former colleague of mine when explaining his computer science studies to Meath farmers whom he regularly met while hitch-hiking home from college (mid-1970s) was usually met with the response: “Ah computers, dere de comint’ing. XBRL has been the coming thing for quite a while now.

Being in the business of Business Reporting, XBRL has always been on the radar, and of late, the radar is showing incoming fire. First the SEC, and now the UK’s HMRC, are mandating it as a filing method. Whether this is a good thing or not, is open to question. As this article puts it: “XBRL is a  case study in complexity” http://bit.ly/951oEY “the producer of the sample must have suffered a polymorphic recursive brain meltdown”.

But needs be; I’m in the business of shaping difficult data, so I’ve started to re-acquaint myself with the subject (last time I looked at XBRL in any depth was 2004). Part of that process will be to beef-up xLite’s XML capability, which, with me being on the table side of the “Tables Vs. XML; the data lingua franca debate“, will involve getting the data into a relational form at the earliest possible moment. For example, for discovery I would use a classic parent-child recursive structure, but not having something like Oracle’s Start-with, adding  a virtual table to make navigating such hierarchies easier with SQLite.

If anybody is peddling to you the concept that this brave new word of XBRL powered reporting will make your business reporting life easier, they’re either lying or don’t fully understand what they’re selling. As 19th century industrialists were wont to say: “Where there’s muck, there’s brass”; and with XBRL, you’ll be up to your oxters in muck, but with the brass all flowing to others, perhaps even to me!


xLite Beta Updated – adds Python as an Excel Scripting Language

I’ve updated the xLite Beta with bug fixes and added a new page introducing xLite’s Excel/VBA and Python extensions to SQLite.

See http://www.gobansaor.com/xlite

The u() function allows any VBA UDF (user defined functions) to be called from SQLite.

The x() function allows an inbuilt function or indeed most any formula (but not a UDF, use u() instead) to be called from SQLite.

The f() function allows for standard worksheet cascading formulas to be referenced by SQL, in effect, worksheet user defined functions.  Really useful in building & testing workbook code/models.

Finally, xLitePyScript is a UDF that allows Python to be used as an Excel scripting language.  Can either be inserted into a SQL statement wrapped by the u() function or called like a regular function from VBA or as a cell formula.

Have fun …

Palo BI Suite Community Edition

Jedox have finally published a roadmap for the Palo BI Suite Community Edition, having caused considerable confusion by pre-announcing its availability last April. See here for the details.  The headline dates are, a beta version due 1st of July with a Release Candidate due 1st September.

Although the announcement in April was essentially vapour-ware (no Worksheet Server V3, no Amazon EC2 images), one very significant actual deliverable was the addition to SourceForge of the Palo Excel Add-in sources (GPL licence).  This, at least for me, is very welcome as it now means that Palo is truly open source.  Prior to this, the server sources were GPL’d but not the main front-end tool used by the vast majority of end-users. In fact, the deep Excel integration offered by the Add-In is Palo’s main attraction to the business-focused “datasmiths” who make up the bulk of the product’s user-base.

The GPL’ing of the Add-in has removed the last barrier that had stopped me, as an independent consultant, from committing to the platform.  While the ‘freeness’ of open source is nice, it’s the source that really attracts me.  With access (and rights) to the source, I have no worries that the terms of use or indeed the product’s core functionality can be arbitrarily changed.  Having the source also means I can delve as deep or as shallow as I need to into the inners of the product, improving my understanding of the technology (both bugs and functionality) as needs dictate.

What has Palo BI Suite to offer, besides being open source?  Well, even if Jedox’s offering consisted of mediocre products, being open source as I explained above is in itself a huge advantage. Having an agnostic FOSS pivot engine that can be shared across many technologies, from Excel to Open Office to a PHP based website, is extremely useful.

However, Jedox’s BI suite is far from mediocre.  Palo is now a very polished and powerful in-memory MOLAP server with excellent integration with Excel (through the Add-In, or if you take out a support contract, via ODBO/MDX powered Pivot Tables).  The addition of a browser delivered spreadsheet (Worksheet Server V3) will add significantly to the product’s street appeal.  Version 3 differs significantly from previous WSS versions; being open source is one, but the entire product was also completely redesigned to meet the challenge posed by web-based products from the likes of Zoho, EditGrid and of course Google Docs (not to mention the ever-present threat of a MS response). Web-based spreadsheets are becoming a commodity, so Jedox’s response was to open source the product but at the same time make it more usable for real-world business analytics.

Current browser-delivered spreadsheets suffer from two shortcomings;

  • Spreadsheets with large numbers of inter-related cells (typical of business models ) tend to perform poorly, in many cases being unusable compared with Excel or Open Office.
  • Only available as hosted SaaS; not a major problem for some businesses, but for others, services outside the corporate firewall, especially for sensitive information such as what-if, budgeting and sales analysis models, are a no-no.

WWS V3 gets around both problems.  Performance is improved by the use of Palo as the spreadsheet’s pivot engine but also by the “lazy calculation” of related cells i.e. a cell that’s not visible, and itself not yet referenced by other visible cells, remains uncalculated, saving on the constant churning that can effect large models.  This combined with a DynaRange concept means templates and models react dynamically and efficiently to the ever changing datasets being presented to the sheets from the Palo OLAP server.   The look’n’feel is very similar to Excel with even array-formulae being fully supported.

The second problem of only-behind-the-firewall access is solved by the open source GPL licence and by the front-end being coded in PHP (very approachable to most in-house support staff and even the odd accountant).  The core (the bit not yet released) is, as far as I know, C++, so is likely to join Palo Server as being highly efficient and well engineered but perhaps beyond the technical skills of most.

The other elements to the BI Suite are the web-based OLAP-centric ETL Server (now, I see, with Groovy and Javascript scripting support) and the Supervision Server (only in paid Enterprise version) which offers fine-tuned access control and monitoring, plus drill-through functionality from Palo cells, back to the ETL fact tables. The Enterprise Edition also offers a multi-core version of the Palo server along with SAP and ODBO/MDX connectivity.

If multi-dimensional analysis and budgeting could benefit your business and spreadsheets are your preferred method of communicating and working with such analysis, you need to check this out.  Palo is a well kept secret (at least outside of Germany), hardly ever mentioned by the mainstream BI community, but don’t let that put you off; this is one of the best solutions out there, it’s open source but also comes with the backup of a professional company that can offer not just technical support but also implementation know-how (Jedox eats its own dog-food, being both a BI consultancy and development house).

Update July 4th 2009:

Beta Community Edition is now available.   I downloaded and installed WWS V3 and gave it a quick test-drive; looks good, Palo interface has the look’n'feel of the Excel Add-in and the general spreadsheet functionality is very Excel-like, incluing CTRL-Shift-Enter to assign array formulae.  Overall, the Palo BI suite offers a intuitive end-user-friendly interface; from download to effective use in less than 60 minutes, how many BI tools could you say that about?

Also, in two weeks time a pivot-table friendly ODBO driver will be included for free with the Palo Excel Add-in (previously only available to those with a Jedox support contract).

Why not join me on Twitter at gobansaor?

LiteBI, Heavy ETL

Although my major BI interest is in micro-BI (or is that  workgroup-BI?)  i.e. data, perhaps cleansed and packaged elsewhere, available locally on a datasmith’s PC,with most likely an in-memory OLAP as the analysis tool; the possibilities of the “cloud” as a BI platform have not escaped me.

From a micro-BI perspective, the ability to act as a backup/mirroring tool or as ETL/marshaling tool (anybody for Hadoop and SQLite?) attracts. I’ve yet to make up my mind on BI delivered as a cloud PaaS but obviously many others believe it has a future.

My main worry with PaaS is not lock-in (which exists equally for in-house proprietary solutions) but the dangers of a Coghead-like lock-out.  My other doubts are more technical; believing, as I do, that in-memory offers significant advantages over traditional ROLAP (simplicity been the main one) and multi-tenant in-memory architectures are not yet a runner.  But last week I had a demo of new Spanish BI PaaS service, LiteBI, which might just change my mind.

Javier Giménez Aznar and his team previously worked on delivering Pentaho based datawarehouses to large Spanish corporations and government agencies, so they have a deep understanding of Mondrian ROLAP and are using that knowledge to build the LiteBI service, but this time with SMBs as the target customers rather than corporates. Pricing starts at €145 per month and is based on number of concurrent users, number of analytical spaces and the data volumes, so it’s not for very small firms more for the Medium in SMB.

Impressions? The cube designer, dashboard builders and the general UI are all very good and I would think would appeal to end-user datasmiths and, as such, will be a major up-front aid to selling this product.  But it was LiteBIs approach to the thorny issue of ETL and data loading that impressed me and also helped ease some of my Coghead-induced-fears.

BI technology stacks consist of three elements:

  • The “fancy” front-end; graphs,animated dashboads and so on.
  • The pivot engine; ROLAP or MOLAP or both.
  • The ETL process.
  • (Many would say there’s an important 4th, the data-warehouse, but not every BI effort requires one, but that’s another issue)

LiteBI is continuing to build yet more functionality into their UI and this “fancy” front-end is essential as it’s their “shop window”.

Mondrian provides their pivot engine, and again they continue to work on optimisations such as column-based datastores to increase speed and automate responsiveness tuning (end-users are very unforgiving of slow pivots).

But it’s in the 3rd area, that of the ETL process, that you realise the LiteBI team has real-world BI experience.  Data is loaded into LiteBI via an API, but with the ETL process itself happening on the customer side.

“Well,so what?” you may ask. The extraction of data has to obviously happen customer-side (even though not in the case of data being sourced from the likes of SalesForce.com). Yes, but it’s the transformations and data cleansing that adds true value to the ETL process and subsequently determines the quality and usefulness (as opposed to the speed or the “prettiness” of delivery) of the solution.

Part of the process of adopting LiteBI, is an ETL consultancy stage where a LiteBI partner company will provide on-site services to build this ETL layer, handling not just transformations but initial load and automating the subsequent delta uploads.

So the cost mounts up, but in reality you can’t do BI without this investment; there’s no ETL magic bullet.  Even still, Javier says the typical go-live time for a LiteBI project would be in the order of 3-4 weeks rather than the 3-4 months of similar on-site Pentaho projects.

The end-user ‘owning’ the ETL process makes the prospect of a service lock-out slightly less worrying as, at least, one would still have a good starting point for moving to another provider or back in-house. What I would really like to see would be the option to self-host LiteBI, which I guess would involve open sourcing large parts of the service (the automated optimisation strategies could, for example, be excluded from this open source version).

The load API comes packaged as a plugin to Kettle (aka PDI) and the intention is to offer a similar add-on for Talend in the near future. LiteBI also offers a white-label offering whereby 3rd party OLTP solution providers can use the service as their product’s BI suite.

Like the Skibbereen Eagle keeping its eye on the Czar of Russia, I too will be keeping a watchful eye on LiteBI and the march of on-demand BI in general.

Why not join me on Twitter at gobansaor?

PowerPivot – XXL, Excel on Steroids

In my last post about why I use SQLite in combination with Excel for datasmithing tasks, I listed the more traditional backends (Excel itself, MS Access, RDBMs & MOLAP cubes) that one would expect to “compete” with such an idea.   But I suspect that if that same post appeared  two years or so into the future, there would be a fifth contender, PowerPivot cubes.

PowerPivot (at the time it was called ProjectGemini) is due to be delivered as a free add-in to the next version of Excel (2010) ,like the Analysis ToolPak or the Data Mining add-ins for Excel 2003.  (See this OLAP Report Project Gemini, Microsoft’s Brillaint Trojan Horse for a good overview of the tool).

Donald Farmer ,who works on the project, having seen the SQLite as the MP3 of data post and recognising that the use cases behind combining SQLite with Excel were similar to those of Project Gemini, kindly offered me a demo of the product.  Well, the phrase “Excel on steroids” has been much used in the past (in particular of add-ins such as Essbase, Palo or TM1) but this “ya gotta see”, Donald likes to call it XXL.

Millions of rows of data in-memory on a 4GB PC being “modeled” using a “user-friendly” pivot-table-like interface. And when I say, modelled, the user isn’t being confronted with concepts such as dimensions, levels, attributes, facts and so on, but a classic star schema model is nevertheless being built behind the scenes.  And it’s this model that allows PowerPivot to escape some of the inadequacies of pivot tables, e.g. allowing for rules and hierarchies to be defined.  The resulting model can then be saved and shared as a file (keeping to the document-centric ethos of Excel) but it can also be posted to and managed by SharePoint.

SharePoint will be extended to allow the IT function to manage and audit shared models to whatever degree the organisation requires, but the single file format will also allow smaller groups to share without the need for IT involvement (essential if bottom-up adoption is to be encouraged).  SharePoint will also add the “Web2.0 collaboration layer”.

How will MS make money from this if it’s free?  The first clue is the SharePoint backend, more functionality means more reasons to purchase and use MS’s server stack and the same applies to Excel itself. I, like many others, are very happy using Excel 2003 and look on Excel 2007 the same way the market in general has looked on Vista; i.e. pretty, but lacking a strong enough reason to upgrade unless forced to do so. (Excel 2007 also has the ribbon issue, not one I find a major problem myself, but others do).  But I would upgrade to a version Excel that offered Project Gemini capabilities and I’m sure others would follow (and more importantly to MS’s revenues, thousands of corporate accounts would too).

PowerPivot offers proof that MS realises, what those of us on the ground have know for years, that BI projects are in the main, Excel-centric; all the ‘hard sums’  and awkward decisions end-up back on the desktop.  MS has decided to publicly recognise that fact and profit from it. The timing is both economically and technically opportune; PC speed and cheap memory means that a huge chunk of even a large corporation’s datasets can be analysed by a PC (according to this, the median size of original data in OLAP datasets is about 5GB); and there’s obvious cost-benefits for companies facing difficult times requiring more to be done with fewer resources.

What will the effect be on tools such as Essbase, TM1, Palo etc. ?  Well, let me put it this way, if their owners are making strategic plans for 2010 onwards and they’re not taking account of the PowerPivot effect perhaps they should.  Most likely PowerPivot will help increase the overall market for OLAP tools, with the incumbents tending to specialise in their existing niches (e.g. Palo in Budgeting, with the added value of being free and open source, which has a premium over just being ‘free’).

So will I put away my Excel-SQLite fixation then? No, for two reasons:

  • PowerPivot is not here yet, and the proof of the pudding will be in the eating. Also, when it does appear it will only apply to Excel 2010 (or whatever) and as many companies are still on Office 2000 (and a few on 97!), it’ll be at least  5 years before a significant percentage of sites upgrade.
  • The SQLite addition to Excel offers not just BI capabilities but also makes a nimble ETL and data integration engine. I’m also experimenting with Amazon S3 integration to enable simple work-flows for small distributed teams (or even same-office groups where the WAN is the new LAN).

Whether you agree or not in the validity of  ”workgroup BI“, be aware that MS does and it thinks that BI is about to enter a new phase,  for proof see MS’s Nic Smith’s The History of Business Intelligence video.

UPDATE: 19th Nov 2009

Last evening I downloaded for the 1st time both Excel 2010 Beta and the PowerPivot (new name for Gemini) add-in.  First impressions; yep, in the flesh it’s just as impressive as the above demo led me to believe it would be.  As I said on Twitter last night Datasmiths of the world; download the Excel 2010 Beta and PowerPivot add-in; this ya gotta see!!!

Why not join me on Twitter at gobansaor?

SQLite as the MP3 of data

… and Excel as its “mixing desk”.

When I tell people that I use SQLite in combination with Excel (via xLite) as my datasmithing platform, many ask why SQLite? (Many others ask why Excel?  but “sin scéal eile”, that’s another discussion – Excel as the iPod of Downloaded Data.) Those that question my use of SQLite tend to cluster into four camps:

  • Pure Excel jocks.
  • MS Access fans.
  • The client server database brigade (SQL Server,Oracle; or if FOSS fans; MySQL, PostrgeSQL).
  • The MOLAP folks (Essbase, TM1, Palo).

Now while I have used and will continue to use/encounter all four ‘approaches’, I’ve come to believe over the last couple of years that SQLite brings something special to the datasmithing game. When I look back over nearly 30 years in the data handling business I keep thinking – “If only I had SQLite then, how much easier/quicker/cheaper that task would have been!”.

Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.

As operational local caches, this use of SQLite is already far advanced. SQLite is embedded in lots of every day software tools, everything from McAfee anti-virus to TweetDeck Twitter clients (best one IMHO). But my interest is more in SQLite’s potential as a micro-BI (or maybe more correctly a distributed-BI) platform. A sort of MP3 format for distributed structured data, if you like.

But why SQLite (and in particular SQLite in combination with Excel) as my datasmithing tool rather than the four other approaches?  First, what’s a datasmith?

Managing and manipulating datasets has become an integral part of many people’s job, not just accountants (the original of the species) but marketing executives, sales staff, pricing analysts, process engineers; different job titles, different roles but using a skill that they’ve likely never been formally trained in, a skill without a name; a skill I call datasmithing. I like to think of  myself as a master datasmith, or a datamith’s datasmith.

If you consider yourself a datasmith then most likely the tool you use to manage your datasets is Excel. And before you apologise, don’t. Excel is by far the best and most flexible end-user data manipulation tool out there. Everything from the current financial crisis downwards has at some stage being blamed on Excel, but you know and I know that many tasks would remain undone or under-done were it not for end-user generated spreadsheets.

Spreadsheets are not however optimal for some tasks, linked spreadsheets in particular are data disasters in waiting. While fantastic for data transformations and presentations, as books-of-record they’re rarely suitable. Other tools such as SQL based relational databases and in-memory OLAP offer much better and potentially much more cost-effective data modelling functionality, but also at a cost of extra complexity and ongoing technical support.

MS Access (which like SQLite, is a document-centric, non-client-server database; but unlike it, is also a forms/reporting development environment) would appear to be the natural local store database. My problem with MS Access has been its tendency to try to be all things to all men, ending up not fully satisfying anybody. Professional developers think it’s too limiting, non-techs find it too intimidating, even reporting, where it once showed promise left a big enough opening for Crystal Reports to evolve. It is also limited to Windows which might not seem to be a problem if combining with Excel, but, as it’s often necessary, due to scale or complexity of the data,  to use ‘proper’ ETL tools such as Talend, having an OS agnostic database format than can act as a distribution media (think MP3s again) between “mixing desks” can be very useful.

The big difference to MS Access for me is SQLite’s open source code; code that’s a pleasure to browse and with an approachable API that even I, with my very rusty C skills, can manipulate. Having access to that code allows me to tightly integrate it with Excel, so much so, that I can use Excel functions (built-in functions, VBA user-defined functions and 3rd party add-in functions) directly from SQLite’s SQL; and vice-versa, access SQL functionality via Excel “formula” calls. It is  also possible to  load most datasets into memory using SQLite’s in-memory mode enabling very fast processing  and near zero-latency when passing data to and from Excel/VBA. In the near future, cheap, large SSDs will enable non-memory databases to offer similar speed but also handle extremely large datasets (see this for a glimpse of that future).

What about the big beasts of the data world, the client-server databases? Having spent most of my professional life working with such tools I’m aware of the power of a well designed relational database. If SQLite is the MP3, then these are the master tapes, the DDD recordings. Most of the data that eventual ends up in SQLite for analysis and/or transformation will have originated in data-warehouses or be directly sourced  from OLTP systems built using relational technology. But for close-up analysis and transformation, the pure simplicity and convenience of SQLite is hard to beat. That simplicity is primarily due to its Excel-like ‘document’ nature, all code and data can be housed in a single folder (or true-crypt container for added security), ensuring that the ‘problem domain’ can be easily archived and/or shared with others without the need for professional IT resources.

And yes, I hear you, isn’t that the basis of Excel-hell? Yes it is, but over the years I’ve found that this is rarely a problem for datasmiths, they deal day-in day-out with document work-flows, they understand the risks and the benefits (mainly the simplicity) of the approach. Where the nightmare truly happens is when this approach is used as an alternative to an OLTP system i.e. using Excel and other document-like datastores as books-of-record in large multi-user environments – “there be monsters for sure”.

How about MOLAP? Wasn’t Essbase’s name derived from “extended spreadsheet database” and doesn’t Palo offer a truly excel-friendly multi-user database back-end? Again, having worked with Essbase for many years and now being a big fan of the open source Palo MOLAP tool, I fully appreciate the power that such tools brings to analysis and multi-user planning tasks. But for many situations, an Excel Pivot Table is “good enough” and even when it’s not, it is possible by utilising what I call a tOLAP cube (essentially, a fact table indexed via tags enabled by Google’s great addition to SQLite, the FTS3 virtual table) to build and access  powerful, yet simple, cube-like data structures.

By integrating SQLite with Excel, datasmiths can have the best of both worlds, familiar spreadsheet front-end combined with a fast and powerful SQL engine and datastore, in fact, everything that MS Access should have been.

Why not join me on Twitter at gobansaor?

Spending time on Excel-SQLite, C, VBA Callbacks & Twitter

Haven’t posted here in a while as my spare time has been soaked up programing, well actually refactoring would be more exact.  My xLite “SQLite empowered Excel” codebase has grown over the years and required a serious makeover to get rid of stuff I no longer use and to generally make it more robust.  I also decided to add some extra functionality to my VBA friendly C wrapper for SQLite (based on Pivotal Solutions’ pssqlite.dll) which meant I had to re-acquaint myself with my long lost C skills, so doing reminded me how much I like C. Close to the metal programing if not exactly super-productive is nevertheless super-powerful.

The new improved xLiteSQLite.dll now has a built-in CSV loader (both file based and string based – handy for loading Palo HTTP API responses into a table). It also returns a one columned variant array of CSV values for quick rendering via “text-to-columns” code (by far the quickest way of handling large dataset pasting into Excel).

I’ve also added the ability to create SQlite UDFs (user defined functions) in VBA (thanks to http://stackoverflow.com/users/4007/rpetrich).  This is a very powerful feature as it allows SQLite selects to act as a “loop controller” calling back to  Excel/VBA functions to process each row, really useful for ETL tasks. And not just scalar UDFs but aggregating (aka group-by) functions too, allowing the use of Excel’s powerful array functions in SQLite statements.

All in all, the changes to the xLite VBA code and the C wrapper makes Excel backed by SQLite a seriously good micro-ETL tool. Combined with Palo, the result in a truly wonderful micro-BI platform; a cost-effective toolset for these recessionary times.

Of course I’d be lying if I said code was the only reason I’ve been neglecting my blogging duties, I’m afraid I’ve a confession to make, Twitter has hooked yet another sucker, me! 

I’ve found I’ve settled in to the whole micro-blogging thing with ease, and have managed to make contact with people I would not have encountered otherwise, as well as reconnecting with others that I’d lost contact with.  So if you too are all-a-twitter then do please follow gobansaor-on-twitter.

I’ll give up Excel Pivot Tables when you take ‘em from my cold, dead hands

Jedox, the company behind the open source MOLAP server Palo, has just announced an MDX driver. This means that it’s now possible to access Palo cubes using Excel Pivot Tables or indeed any tool that supports ODBO.  This is excellent news, as MOLAP to most Excel users IS a Pivot Table, and somewhat like the NRA, the NPTA’s (National Pivot Table Association’s) motto is “I’ll give up Excel Pivot Tables when you take ‘em from my cold, dead hands”.

MDX/XMLA is now a de facto standard for OLAP servers, supported not just by MS SQLServer but by SAP BW, Hyperion/ESSBase and by Pentaho’s Mondrian. The new driver is not open source, nor is it for sale but instead comes free to those with Jedox support contracts. I’m sure lots of organisations will be more than willing to enter a support contract (starting at €3000 per server) to get their hands on this; think of the savings in training alone!

UPDATE: 2nd July 2009

Kristian Raue has announced on his blog that the ODBO/MDX driver will now come free with latest Palo BI Suite (both community and enterprise versions). Excellent news!


Twitter – the penny drops!

I’m a fan of most things Web2.0, not just for personal use but as business tools.  Over the last four years or so I’ve enthusiastically embraced Wikis, IM (Google Talk), RSS Readers et al. I could see the benefit and attraction of social network sites such as Facebook even if I’ve not partaken as such. Heck, I’ve even joined the ranks of “those who blog”.

But one aspect of this Web 2.0 stuff that had until now not really grabbed me as particularly useful is micro-blogging i.e. Twitter, Jaiku etc.

This morning two things I read brought home to me the benefits of this technology, particularly in a business environment; the penny had dropped!

The first was this post  ”Ambient Awareness – The Cloud Killer-App” where this caught my attention …

To me, this is the essences of situational awareness. An ability to sense and understand your environment and the actions of others in that environment. Clive goes on to explain that sociologists have found that “weak ties”, such as those created by twittering, greatly expands an individual’s ability to solve problems.

Then I read that the winner of the top prize at TechCrunch50 is Yammer, yet another Twitter look a like, but this time with a difference; it’s designed to allow communication only between those within the same organisation.

Now that could be very useful, especially for organisations with a dispersed workforce or comprised mainly of teleworkers.  Such a tool could act not just as a means of keeping people in touch and aware of the general happenings with a company but could also be used a “lite command and control” tool where messages are used as a replacement for time-sheets and progress/activity reports.

As email was (and still is) the “internet as a wide-area-network” killer-app, micro-blogging may very well be the killer-app of the “always-connected internet”.

And in the spirit of sharing that is Web2.0, here’s some other things I discovered this week…

  • OutWit, a very useful Firefox extension if you need to automate the “harvesting ” of data (tables lists, photos,mp3s) from the web.
  • xlUnit – a unit testing framework for Excel VBA, now that’s something I could do with, OK it’s not quite there yet, but you can follow this Grumpy Old Programmer as he rolls it out.
  • Reverse Snowflake Joins Online, if you have a nasty bit of SQL that you need to visualise in a graphical format, then this online version of Alexandru Toth’s open source Python tool may be just what you need.
  • Quantivo, customer behaviour analytics in the cloud. If you’ve lots of sales data, but no in-house datawarehouse.
  • And if you’ve no sales data because you’ve no sales, then check-out Sales 101.

Cloudy skies, cloudy apps…

Just back from a break in Clifden, Connemara, summer is nearly over, the kids return to school today, back to work.

Aasleagh Falls, Co. Mayo

Aasleagh Falls, Co. Mayo

Counties Galway and Mayo were like the rest of the country last week, a tad wet, but unlike the developed east of the island, flooding was not a problem; a problematic drainage area is called a lake in the west.

This August has been the wettest and dullest I’ve ever experienced but at least I saw some sunshine earlier in the month thanks to Kristian Raue CEO of Jedox who kindly invited me to visit the company’s offices in Freiburg, Germany.  Freiburg is very green in both senses of the word, surrounded as it is by the Black Forest and its well deserved “eco-city” status.  Its also know as the warmest city in Germany, a reputation it thankfully lived up for this visitor from a rain-soaked Atlantic isle.

August morning, Frieburg Im Breisgau

August morning, Freiburg im Breisgau

If Freburg left a positive impression on my mind, so too did Jedox.  The overall impression is of a company which intends to use a combination of quality, vision and the judicious use of open-source to build the Jedox brand into one associated with best-of-breed products and consultancy.  This vision can be seen in the evolution of Palo, from its “good enough” beginnings to its current near-best-of-breed 2.5 version, and from talking to some of those working on the product, best-of-breed status is not that far off.

Likewise, ETL-Server which is currently a Palo only “loader”, is to be further  developed into a true ETL tool, while continuing to offer MOLAP-centric specialisms.

I also got a glimpse of the next version of Worksheet Server. “Wow!”, is all I can say.

Existing web based spreadsheet products are fine for simple data analysis or basic data capture purposes but cannot compete with their client-based elder cousins when serious datasmithing is required.  Well, from the demo I saw of Worksheet Server in action, that’s about to change.  The look and, more importantly, the feel is similar to that of traditional spreadsheets, its interface with Palo is identical to that of the existing Excel add-in, and here’s the big one, its open source!  Game-changing or what?

But …

That might enable me to move a lot of my spreadsheet applications to the cloud, but what about those applications that are more suited to an MS Access type solution?

Then try out WaveMaker. It’s open source and built on industry standards, Hibernate,Spring and the Javascript Dojo framework but has the ease of GUI database development more usually associated with MS tools. The resulting applications are packaged as a WAR file which can be hosted by any standards based Java server (e.g. Tomcat or Jetty).  The latest version makes developing Ajax-fronted database applications even easier with the addition of layout templates.  Its existing ability to automatically bind interfaces to SOAP web services has been extended to REST web services by means of a new WSDL auto-discover tool.  And Chris Keene CEO of WaveMaker also informs me that …

We are also releasing a cloud-based IDE in October with Amazon – stay tuned…

We launched in February and will be announcing our first 7 figure deal this month. We run on Mac, Linux and Windows and are currently the #1 developer download on Apple.com (http://www.apple.com/downloads/macosx/development_tools/)

Our goal is to make it easy to build rich internet applications without complex coding – kind of a MS Access for the Web.

Jedox and Wavemaker the new breed of open-source businesses

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of  Oracle10g  Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL) ,OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem, Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these new shiny (and open source) “toys” I’ve found that rather than just been incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy, is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool, via JDBC is can access every database type on the planet, it has built-in web-service capability and access to a  multitude of APIs via its Java component for non-database data sources.  The addition of  Groovy makes the use of such Java APIs simpler and quicker and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to plublish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name,customer, sales_region, getpalodata(“SALES”,customer_id,”All Products”.”Total Sales”,”Euros”,”YTD”)  as customer_YTD, getpalodata(“SALES”,sales_region,”All Products”.”Total Sales”,”Euros”,”YTD”)  as region_total_YTD from list_of_top_customers;

… where getapalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, ( as SQLite lacks star-query (or even multi-index query) support.)  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or for ad-hoc queries. This is powerful stuff, simple, free, powerful stuff.

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with a SSH VPN (or Hamachi) to provide security).
  • At start-up, do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc. Storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then setup a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations which would return a dataset directly to the calling client or maybe just acknowledge the request, queue it up for processing later, sending the resulting dataset by EMail or RSS feed when finished.
  • At a fixed time the service would shut it self down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point, build what you need, add the levels of security (or none at all) that fits your situation, all within a open framework, with zero lock-in (okay, still using Excel, anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server, host it on an EC2 instance, (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).

Now tell me that doesn’t make sense?

OLAP Cube as a Mind Map

If you’ve worked with OLAP technologies for any length of time you’ll undoubtedly have been in the situation where you’ve had to explain the concept of an OLAP Cube to a “newbie”.  If the person in question has come across Excel pivot-tables, then you can probably short-circuit the conversation some what, explaining that a pivot table is in essence an OLAP cube, maybe highlighting the differences between it and whatever OLAP tool you’re proposing; ragged hierarchies, ability to update cells and ‘spread’ values down hierarchies etc.  Even better if you can show the user an working cube populated with hierarchies and elements that match the user’s business.

But what if the person is a complete novice to the world of analytics and you don’t have a relevant demo cube that you can demonstrate, what then?  I guess, you could start by first trying to fry the victim’s user’s brain by explaining the concepts behind multi-dimensional spaces and/or a quick intro into the world of de-normalised databases, with examples of snow and star schemas. The glazed look in the eyes of your audience may however suggest this doesn’t always work.  And it’s not just business-users who have problems getting their heads around cubes, many programmers also have difficulty the first time they’re exposed to OLAP concepts.

I’ve found that Mind Maps offer a good way to help both me and the client to visualise the domain model that a cube will eventually address; having worked through the mind mapping process it’s then easier to take the user with you as you translate this model to a physical cube or star-schema.  Well Hugo (who’s back blogging after a long absence) promises to take this method a step further, his modifications to FreeMind will allow for the export of a mindmap as a Palo cube!  Check it out …

Palo OLAP and sparse dimensions.

Last week I tried out both the latest Palo 2.5 release and its sister product, ETL-Server.  Although I’ve not done any proper benchmarks, 2.5 does appear to be faster than the previous release and the Excel add-in also behaves better when co-habiting with other add-ins and macros (the previous release’s use of, and response to, application level events meant it often caused the entire Excel session to grind to a halt when other macros were processing data).

There is however, a major memory leak problem when using cubes with sparse dimensions (such as the Biker database Invoice dimension), at one stage my Excel session had a working set size of 750M!

The failure of  MOLAP cubes to effectively handle sparse datasets has always been something that ROLAP star schema advocates (myself included) have pointed to as a major short-coming.  Over time, products such as Essbase have managed to get around this limitation (but often only with careful up front cube design by skilled staff). Palo hasn’t quite made it to that level yet.  But then, maybe it shouldn’t, perhaps simplicity of setup and use should be Palo’s goal, not the ability to handle a telco’s ‘grain-at the-level-of-call’ fact table.

Two of the new features in Palo 2.5 that I’d been looking forward to, Zero Suppression and Drill-through are both relevant to the handling of cube sparsity.

  • Drill-through; to reduce the sparseness of a dimension, by moving the ‘grain’ up a level in the consolidation hierarchy, e.g. have a base level of Invoice Type instead of Invoice Number or indeed, dropping the dimension entirely.
  • Zero Suppress; to filter out excessive elements, e.g. only show Invoice Numbers that were raised this month rather than showing every invoice number ever raised depending on a non-zero value to indicate that it belongs to this month.

It was my testing of Zero Suppression using the Biker database that appears to have caused Excel to go on a memory binge.  It doesn’t worry me that much as the right design can reduce the need for zero suppression e.g. in a customer dimension, ‘hide’  customers who trade with you infrequently under a “Others” consolidation.  Also, the use of drill-through would eliminate the need for many sparse dimensions, but, alas, the drill-through functionality is severely nobbled in Palo 2.5.

Firstly, the functionality is only available in Excel if you purchase the €8,000 Palo Supervision Server which is all fine and dandy if you have a need for all the good things that Supervision Server offers.  Having said that, I did managed to bypass this requirement by calling the ETL-Server’s SOAP API directly from Excel with the help of Simon Fell’s superb PocketSoap library, so all is not lost on that front. But …

… the other reason I’m not overly impressed with the feature, is its implementation; particularly for something Jedox is asking you to pay €8000 for!   Here’s what you get.

Within the ETL-Server’s cube ‘export’ you can specify “<info>” fields such as invoice number, that will not form part of the final cube coordinate list.  These coordinates and the info fields along with ‘value’ field are then written to a Derby (aka JavaDB) database table, where the schema is the ETL project name and the table name matches that of the cube.  So you end up with, in essence, a ‘fact table’ at a finer ‘grain’ than the cube and the role of degenerate dimensions being provided by the “info” fields.  The resulting table is not indexed, so large datasets will be a problem and for it to work, a cube can only be ‘owned’ by a single ETL project.

The other thing to note is that drill-thru only works for ‘N’ coordinates i.e. no consolidation elements.  This is unlike Excel’s Pivot table which allows you to double-click at any cell, to reveal the underlying dataset.

On the plus side, the new ETL-Server is actually very good and well put together.  Better off not thinking of it as a generic ETL tool but as a specialist Palo loader, like Oracle’s SQL*Loader.  And like SQL*Loader’s role on many Oracle projects, for a large percentage of Palo projects ETL-Server will  be the only ETL tool needed.

Palo ETL Server – Not for me …

Jedox have just released V1.0 of their Palo-centric ETL Server. I had been looking forward to this, not so much for its ETL ability (which is somewhat limited when compared to the likes of Pentaho PDI or Talend) but for the drill-through capability it would add to Palo. Alas, there’s a catch, you must purchase Palo Supervision Server (€8,000) to enable the Excel add-in to avail of this feature!

The thing that attracted me to Palo in the first place was its simplicity of approach and the primacy of Excel as the end-user view of the product, a modern day ESSBase. The fact that the Excel Add-in is closed source always worried me, as I felt that it would inhibit the thing that really sets open source apart (no not the cost) ,the formation of an active and innovative developer community. The sort of developers who have a need for, and an interest in, MOLAP tools tend to be more familiar with VBA, .NET, SQL, SAP Config etc. than with C/C++ or even Java development. The one area where such a community could add value, the Excel front-end, is closed to them. And I know there’s some non-Jedox involvement in the form of JPalo and the OO-Calc Add-in, but Excel is the key to Palo’s wide-spread adoption.

Also, the choice of .NET as the main client-side development platform was a mistake in my opinion, a VBA accessible object model would have been much more useful, it would also have removed the need for the current painful installation process.

This partial “open-source” model and the increasing complexity of the platform makes Palo, at least for me, a less attractive Micro-BI option. And remember, Excel already has a very powerful in-memory OLAP tool, the humble PivotTable, which is in most cases “good enough” for most analytical needs. So why use Palo rather than a PivotTable?

Palo’s advantages:

  • Can handle very large data sets (limited by the free memory available to the server).
  • Allows write-back and splash-down, both very useful for planning/budgeting applications.
  • Allows for ragged-hierarchies,
  • Server-side MOLAP rules e.g. [Budget],[2008] = [Actual],[2007]*0.035

Excel’s PivotTable advantages:

  • Pure Excel. with the object model available for VBA scripting.
  • Drill-through as standard.
  • Excel 2007 can now handle a 1,000,000 rows, for earlier versions use an Access/SQLite local database or an enterprise database to first “group” and summarise the data to be pivoted.
  • Can be used against SQL Server Analytical Services (SSAS) Cubes (and those of other providers such as Pentaho’s Mondrian).
  • Much easier to set-up and use.

Update: May 1st 13:30

Beta Version of Palo 2.5 (which you’ll need to use the Palo ETL Server’s drill-through functionality) has just been released ..

Python the new VBA ?

These last two weeks, Python has been on my mind. First off, last week I decided to make time to fully investigate Picalo, an open-source Python-based data analysis tool, and then, this week, Google announced their long awaited cloud-computing offering, Google Apps Engine, with the language at its core.

Python was the first of the “LAMP generation” scripting languages that I decided to learn in any detail ( I had used Perl before that but only on a per-task basis (similar to how I’d used AWK)). I then invested time in learning PHP, then Ruby and finally JavaScript. And here I am, back where I started, with Python.

But it’s not the same Python I learned three years ago, not that it has changed that much, but my appreciation of the language has, largely due to my deep dives into other languages. For example, JavaScript’s treatment of functions as first-class objects, highlighted the same functionality in Python, something I’d missed (or rather, not fully understood) the first time I encountered the language. Likewise, Ruby’s RoR introduced me to a “best of breed” approach to web application design, something that can be used as a comparison aid when approaching new web frameworks such as Django.

But of course the scripting language that continues to power most of my datasmithing activities is Excel VBA. That’s why I was so excited to see a tool such as Proto utilise VBA as its scripting language. But, Microsoft has abandoned VBA, there will be no more Protos.

Also, Excel VBA is now a Windows only language. Windows, however, is no longer the ‘only’ business client OS (see how many Apple laptops you can spot the next time you’re in a business-class airport lounge, a few years ago it would have been zero, not any more), and is currently nowhere to be seen as a cloud computing platform (but that’ll change).

I’m at heart a table-oriented programmer, and I, like Picalo’s author Conan Albrecht, believe “data analysis is best done through scripting”; but not just data analysis, the T in ETL (Extract, Transform and Load) and the I in DI (Data Integration) and SI (Systems Interfacing) also benefit from a scripting approach.

So, what to adopt as a successor/companion-in-her-old-age to VBA, will it be Ruby, JavaScript, Python, Perl, even PHP?

It looks like it’ll be Python because it’s …

The runner up is of course Ruby, but its poor integration with Windows is a major problem and the datasmithing “prior art” of Picalo and Resolver makes Python hard to beat.

UPDATE Jan 2010:

To experience the best of both worlds, VBA & Python, my xLite (Excel combined with SQLite) datasmithing platform now allows Python to be used in conjunction with VBA.  Check it out here http://www.gobansaor.com/xlite

UPDATE:

Also, as Dan pointed out in the comments below, I’d not included Jython in my list of reasons for embracing Python. I must add it to my list of things to try out particularly as both my “classic” ETL tools, Talend and Kettle are JVM based.

Another thing to add to the (ever growing) list is Mike Pitarro’s SnapLogic python-based ETL tool. They have …

…just released a 2.0 Beta version with some major architectural enhancements. The SnapLogic model is very different from traditional ETL systems. It takes an approach that’s more like the web, based on loose coupling and HTTP interactions. We model data source, sinks, and transformations as URI addressable endpoints, and have a model where than can be chained together in pipelines to build transformation logic. We use a plugin architecture to make it easy to add custom components.

Postgres Plus Cloud Edition is boring …

… and that’s good. That’s how I like my databases, boring, reliable, consistent, easy to use.

SimpleDB on the other hand is not boring, it’s an exciting new shiny thing that opens up a myriad of new possibilities; but first, I and the rest of the developer community, need to tool up and cast aside some of our cherished database design patterns (oh like, 3rd normal form, strong typing, joins, nothing major) and embrace a slightly different way of thinking, however, as much as I like a challenge, I also like to get things done.

That’s where EnterpriseDB’s new Postgres Plus Cloud Edition comes in, this is an Amazon Ec2/S3 hosted edition of their Oracle compatible PostgreSQL-based product that offers the scalability of SimpleDB but the familiarity of a traditional relational database. The “magic” is supplied by Elastra, who are also offering the same functionality against MySQL and standard PostgreSQL databases.

A Talend ETL job which I had been developing for a client, had been tested against a “normal” EnterpriseDB instance. This ETL job was part of a BI prototype trialling a Postgres Plus Cloud Edition (the new name for EnterpriseDB’s cloud offering) as the back-end database. So, I exported the job as a Java executable, fired up an EC2 instance, copied up the generated JAR files, changed the database’s hostname to that of the Postgres Plus “cloud” database, ran the ETL job and it worked. As I said, boring, nothing to report, it just worked.

Now you may be wondering what’s so special about these Elastra powered databases, surely EC2 is no different from any other Linux virtual machine, why not simply install a standard database? The problem with EC2, and it is a problem to those of us (i.e. practically every IT pro on the planet) who have come to expect highly reliable RAID backed disk storage, is the non-permanence of its disk systems.

When an EC2 instance is powered down or fails, the disk system is wiped!

That, combined with fixed (if generous) disk sizes (160GB, 850GB or 1690GB), means that often a clustered database environment is a necessity, adding considerably to the complexity. It’s this sort of complexity that SimpleDB and Elastra address.

The obvious use-case for both Elastra and SimpleDB is as data stores for OLTP applications but Elastra’s ability to handle S3-backed massive databases means the possibility of using EC2 as a data warehousing platform is also considerably strengthened. Although not obvious at first glance, SimpleDB could also act as an OLAP data store; SimpleDB massively indexed tuples as “sparse dimensions” pointing to S3 objects (SQLite databases?) that hold the fact data combined with dense/”partioning” dimensions (e.g. Time). Possible ? Yes. Fun to do? Yes. A solution that I can apply tomorrow? No, that’s why I’m glad EnterpriseDB and Elastra are delivery such a boring product!

UPDATE Ec2:

The other big EC2 missing – non-permanent IP addresses – has at last been addressed. EC2 now offers “Elastic IP Addresses”, addresses associated with an account not an instance. If the instance fails or is shut down, the IP address can either be immediately re-assigned to a new instance (no more waiting for Dynamic DNS propagation) or “reserved” for future use at a cost of USD0.01c per hour. Also, the new “multiple locations” facility puts the API changes in place to allow for location selection, hopefully a sign that we here in Europe will have “local” EC2 instances to match our European S3 buckets!

UPDATE EnterpriseDB:

It looks like IBM have invested in EnterpriseDB, possibly as a counter-weight against Sun’s acquisition of MySQL (EnterpriseDB’s targeting of Oracle’s customer base would also be an added benefit!).

Dublin Bus and PALO ETL – the connection!

Dublin buses, as is the norm with most road-based public transport systems in our increasingly car-choked cities, tend to operate on the basis of “no sign of a bus for ages, then two or three arrive at the same time”. Palo MOLAP ETL options appear to be following the same pattern; we’ve been waiting for ETL support for ages and now we see three of them heading down the road towards us. There’s Palo’s own offering, then came Stratebi‘s Kettle Plugin and now Talend Version 2.3.0RC2 is offering a Palo output component.

Mind you, the Talend offering is very basic and I’ve not managed to get the Sratebi plugin to work, leaving Palo’s ETL Server as the front runner at the moment (drill-through capability is a winner in my book).

I’ve also been busy re-factoring my VBA SQLite and Amazon S3 code with the intention of publishing them as an Excel based micro-ETL platform. While cleaning up the Amazon AWS modules I’ve been playing with SimpleDB, I’m impressed, Excel combined with SimpleDB rocks!

I’ve also wrapped the open source XySSL SHA1 HMAC C code in a VBA friendly DLL, as searching for a VBA hmac sha1 hash implementation (essential for Amazon AWS access) has proved fruitless.

Hope to release the lot the end of next month.

UPDATE:

Thanks to Javier and Jorge from Stratebi I’ve managed to get the new Kettle Palo plugin to work. It seems that the TEST facility in the Kettle database connection dialogue throws an exception for Palo connections but the connections work fine in the actual Palo input/output steps. Did a quick test and it looks very easy to use and fits in well with the Kettle “way of doing things”.