Category Archives: SQLite

Micro ETL in the PowerPivot age

Although PowerPivot has many of the characteristics of an ETL tool, i.e. the ability to connect to disparate datasources, to filter that data and to transform it, it will still hit a brick wall when confronted by the typical data spewed out by operational systems. I’m sure this is by design as a sophisticated ETL tool is both complex to design and, probably even more relevant, is difficult to use.

Mind you, a few years back we IT pros would have said the same about front-end BI cube configuration, and behold today we have tools such as PowerPivot that prove that this doesn’t always hold true. Perhaps subsequent versions of PowerPivot will do the same for ETL as it has for BI cubes. In the mean time much of the necessary ETL will have to take place prior to loading into PowerPivot.  But where?

First off, what’s ETL?

The term ETL applies to one of the trinity of activities that have, over the last two decades or so, been at the heart of reporting/decision support systems. The other two terms: DW (data warehousing)  & BI (business intelligence), are sometimes used to refer to the whole process but can also be used to refer to two distinct sub-processes. Confused? Well, so you should be; these terms have been abused and redefined by scores of vendors over the years but for our purpose here we’ll stick to their roles as acronyms for the two of the  processes involved in the preparation & presentation of reporting data.

BI is the term now most commonly associated by non-IT folks with decision support systems, as it’s role is the most obvious i.e. front-end presentation and manipulation of data; the dashboards, pivots, charts, summary lists etc…

DW, data warehousing, is the term that most IT people who’ve been in the business for a while would use to describe the techniques, best practice etc. associated with this area. The heart of traditional DW was the data warehouse itself, a mighty repository of historical data optimized for reporting purposes. When DW as a concept started it was very rare indeed for operational (OLTP) systems to hold transactional data for more that a few weeks, usually just long enough to get through month-end.  Such specially built datastores’ days may be numbered as the data capacity of operational systems grow and the data munching ability of new ETL techniques (MapReduce for example) to transform vast amounts of data continues to increase.

ETL stands for Extract,Transform and Load; sometimes also styled, ELT, extract load & transform (PowerPivot would fall into this catergory). This is the process which traditionally swallowed most of the development budget in DW/BI projects (and kept me gainfully employed for years). It was the area where the dark arts of datasmiths collided with the often frightening reality of raw untamed data, with the added venom of corporate-politics-driven “data ownership” battles.  A messy business, and continues to be, even in these days of open data and open APIs.

ETL tools vary from text-editor written SQL to hugely expensive point’n'click ETL packages. Packaged ETL vendors promised (and continue to do so) that their tools would vanquish the dark arts of datasmiths with products that were so easy to use that the CEO would chip in with a few scripts to get the project finished. The reality was that IT types  found they had to learn yet another sub-optimal “language” and more often than not had to drop-down to “proper” languages to actually drive the thing to completion. ETL was (and still largely is) the preserve of IT.

ETL is easier than herding cats but just about ...

The tools have improved a lot since those early days and open source has at least removed for some the 6 figure licensing costs from the equation. ETL, like programming in general, is hard, so get over it. Tools, basic knowledge of SQL and data modelling skills can help to make ETL approachable to non-IT types, but it still has the potential to make your head hurt.

So what’s a PowerPivot’r to do?

If your organisation already has a data warehouse in place you’re in luck as it’s quite likely a lot of the data you require will exist in the optimal PowerPivot import format, i.e. a star schema. You might be out of luck though, a significant percentage of DWs will not have used dimensional modeling and you could find yourself looking a complex OLTP like data model. In that case, and in the case of pulling the data directly from an operational system, you’re in the micro ETL business. Even if your IT infrastructure provides you with cleansed and understandable data, you’ll be faced with integrating external or shadow-IT data (probably one of the main reasons why PowerPivot appeals); again you’ll either need IT support or else you must learn how to do it yourself.

Long before the likes of PowerPivot appeared I regularly found myself  in need of a micro ETL toobox i.e. a set of tools that would enable me to quickly and cost-effectively prepare data for loading into some system or other. Nine out of ten times that system was an Excel PivotTable (the rest of the time it was usually a master data take-on task or some variation of systems’ commissioning). Although the consumers of my datasmithing services would most likely assume that I used Excel alone to perform these works of wonder, I usually had an Oracle database (along with its data loaders and superb PL/SQL language) as my secret ingredient. This combination of Excel and Oracle served me (and my clients) well, but it wasn’t the Oracle bit that gave me the edge; I could, and did, substitute SQL Server and MS Access for the SQL layer. The real trick was the combination of Excel’s flexibility/presentational strengths with SQL’s list handling power.

The problem with this approach was that the interface between the SQL engines & the spreadsheets often involved quite a number of manual steps, and the presence of database software (even MS Access) could not always be depended on. It took my discovery of SQLite to enabled me to finally combine the two worlds; xLite was born!

This combination of Excel, and an in-process SQL engine (provided by SQLite) with the added optional ability to call either VBA, JavaScript or Python scripts, has provided me with a hugely flexible and powerful micro ETL tool. Now, with the arrival of PowerPivot, I have both the micro ETL and micro BI tools to build cost-effective Excel-based decision support systems.

Being Excel based, means that the end result is delivered in a format that many business people are comfortable with; PowerPivot is designed very much with “civilian” datasmiths in mind. Likewise, xLite’s ability to use VBA, simple SQL and Excel formula to perform data transformations makes a large part (if not all) of the ETL process “civilian friendly”.

I’m not saying that everything I can do with xLite will be as easy for an non-IT datasmith; many datasources are either too difficult and/or extremely time consuming for end-users to navigate; but much of the business logic can be expressed in Excel terms with the highly technical or time-consuming tasks  handled by SQL or VBA/Python/JavaScript. xLite is not only for once-off transformations but can also be used to automate ETL, report generation and refresh tasks (including refreshing PowerPivot itself).

So, if you’re thinking about utilising PowerPivot, but need help in preparing your data and automating the tasks involved, perhaps we should talk.

SQLite XML Streaming Virtual Table via Expat

A few weeks ago I wrote a post on handling large XML datasets using MS’s SAX2 parser from within Excel. Although fast, the SAX2 parser is not as fast as the original of the species, the Expat Streaming XML parser; being written in C and with a minimalist approach Expat’s speed is to be expected and this speed can be a real time saver.

Being at heart a SQL Jockey, I also like to transfer XML data structures to relational ones at the earliest opportunity, especially when I’m in discovery mode. Small XML pretty-indented documents can be analysed by eye (wasn’t this the great selling-point, XML’s user-friendliness ;-) ), but when the dataset is large,ugly and deeply nested I prefer to  load the data into a classic parent-child table such as this…

static const char* DDL = "create table FOO("
"fileName HIDDEN text,"
"xmlString HIDDEN text,"
"filter HIDDEN text,"
"deParent text,"
"deChild text,"
"deLevel int,"
"isAttribute text,"
"ChildType text,"
"LineNo HIDDEN int,"
"Offset HIDDEN int,"
"ItemNo HIDDEN int,"
"parentRowid int"
")";

As I found myself doing this repeatability I decided to code a SQLite Virtual Table which by wrapping the Expat parser would stream out the parsed data  to populate such tables. The streaming data could either be consumed as it arrived or saved to a real (i.e. non-virtual) table for further analysis. In effect, creating a ‘relational DOM’, but unlike a DOM which is memory bound, it could exist as an in-memory table or as a disk-based one, making the handling and analysis of large datasets easier to achieve. The above snippet of code is the definition of the my VT_xmlPC virtual table. HIDDEN columns are a means of adding utility columns that don’t necessarily belong to the logical dataset, in this case fileName and xmlString are used to identify the XML to load.

If you’re interested in writing your own SQLite virtual tables this Dr. Dobb’s article Query Anything with SQLite offers a good introduction to the subject. Wrapping Expat adds some extra complexity as you have to deal with two sets of callbacks, SQLite’s row request events and Expat’s events, and also with nested Expat parsers; so if you’re new to virtual tables start with the Dr Dobb’s example before diving into the xmlPC.c code.

The resulting virtual table is capable of handling most XML ‘data documents’ as seen in the wild; as coded, it regards attributes as the equivalent of nested elements (with isAttribute = ‘Y’), and it doesn’t handle CDATA (but could easily be modified to so do).

To use it:

(a) load the xmlPC.dll extension

(b) create virtual table myTableName using VT_xmlPC

(c) select * from myTableName where fileName matches “C:\some xml file” or select * from myTableName where xmlString matches ” …”

The “matches” predicates filter prior to row creation, “=”s etc. filter after row is constructed. The “fileName”, “xmlString” and “filter” columns are “matches” only columns as they initiate the creation of the table (using “matches” rather than “=”s for such utility columns is my own convention not a requirement of SQlite virtual tables). The “filter” column will restrict loading to elements of that name and the children nodes of that element.

Find here:

Have fun.

Update:

The source code link on the Dr. Dobbs article appears to be broken, the code can be found here (Note: this is an FTP link) in the 0711.zip file.

JavaScript as an Excel scripting language via JSDB

A few years back I posted about JavaScript as an Excel scripting language via ExcelDNA. That involved using JavaScript (in the guise of JScript.NET) as an ExcelDNA scripting language. It was purely an academic exercise to prove it could be done, I continued to use C# (or increasingly VB.NET) to build .NET user defined functions. This time however, I’ve managed to embed JavaScript (in the guise of Mozilla Foundation’s SpiderMonkey) directly via a native C interface not to prove I could do it (even though there’s a definite satisfaction in simply doing it) but to use it.

Why add another scripting language to xLite, hasn’t it already got Python?

True, Python is and remains a very powerful add-on to xLite. It’s a mature and long-established language, popular amongst IT professionals and “citizen programmers” alike. But, it’s a bit of a monster and can be awkward to package, particularly on Windows. By using Py2Exe and after a lot of digging on the issue of manifest files I have managed to package and isolate xLite’s Pythonic bits so that it can be used on a PC without first installing the required Python version (I’ve only tested against V2.6, Python’s lack of a side-by-side Windows installation capability is a major pain-in-the-butt   bad news: tested against Python 2.5 & it doesn’t work; good news: side-by-side is possible; simply change the system path to reflect which-ever version you wish to run at the command line; xLite will (must) continue to use V2.6). This “version-hell” mitigates against using Python as a core-element of xLite, fine for those of us who are comfortable with and require the full power of Python, but not as the tool’s primary scripting environment.

No, what I need is:

  • a light (single EXE or DLL preferably) and an approachable popular language,
  • with native SQLite support,
  • runable as standalone executable (on both Windows & Linux),
  • embeddable (is that a word?) in Excel via VBA-friendly DLL.

Add to that essential list some nice-to-haves such as:

  • native COM-interface support (for the likes of ADO etc.),
  • native networking support, for HTTP, raw TCP sockets etc.,
  • native (and easy to use) XML and JSON parsers and emitters,
  • ability to spawn detached/attached command line processes and the ability to stream data to and from such processes; allowing me to easily orchestrate & provide a “grid” of processes (scripted, command line executables, Excel instances) both local and remote (with remote being either traditional servers, http servers or Hadoop Streaming grids).

While looking at the various implementations of Javascipt as a server/shell language, I came across JSDB – JavaScript for Databases. It’s a C++ wrapper around Mozilla’s SpiderMonkey, with lots of useful data related utility classes added. To make JSDB a perfect fit it simply required:

  • a few minor changes to the SQlite class (allow the loading of Virtual Table extensions, add the ability to pass in the address of already open SQLite memory structures);
  • a linker change to use the DLL version of SQLite;
  • plus a VBA-friendly DLL wrapper ( & VBA declares, to call the DLL) to replace the JSDB shell when embedding in Excel.

From my SQLite as the MP3 of Data post:  ”Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.” I can now add a distributed “fractional horsepower” processing engine for that distributed data. This transforms xLite from a micro-ETL platform into one capable of handling (or at least orchestrating) practically any ETL (Extract, Transform & Load),DI (data integration) or “Time Asset” (see this post) process.

SAX and Bugs and XBRuLe

Okay, the XBRuLe is a bit laboured, should be SAX & bugs & XBRL, but any excuse to play some Ian Dury :-)

Bugs (the programming type, not the creepy-crawlies), Simple API for XML and Extended Business Reporting Language;  these represented the trinity of my concerns for the last three weeks  or so.

First, the bugs:

Several weeks back, I decided that the C portion of xLite needed an overhaul. The codebase contained a lot of stuff that I no longer used and also contained code that I’d written when I first re-started using C (after a lapse of 20 years or so, the phrase “I’d forgotten more than I’d ever known” sums up the experience best); some of this code was memory-leaking like a sieve.

Also, the original Pivotal Solutions code was not UTF-8 enabled, instead it used the host pc’s default character set codepage, this needed to change (if you don’t know what I’m talking about see Joel Spolsky’s lecture to the developers of the world – well actually, primarily to those of us in the Anglo-Saxon “ascii-will-do-fine” world).

The bugs I introduced as a result of this upgrade were not of the logical kind but of a much nastier type, peculiar to the low-level world of C, “bad free()” bugs!

Excel was no longer leaking memory (well no more than it normally does) but it was crashing randomly (usually in a DDL called VB7, 1st upgrade to classic VB in over a decade!), a sure sign I was freeing memory that was not mine to free. Two days later, I’d tracked the bugs down, but only by a painful line by line code walk-through. If you’ve no idea what I’m talking about here, count your blessings and move on.

The other major change I added to xLite is the ability to code SQLite Virtual Tables in VBA. The Python side of xLite has always had that facility, but I look on Python as a nice-to-have add-on, not as a core component. The growing need for “core” virtual tables meant either coding them in C or in VBA, see previous paragraphs for why VBA won the day.

The immediate driver for adding both UTF-8 support and quick-to-build virtual tables was the need to better handle XML data within xLite.

… then the SAX:

For small XML/HTML datasets, I, like the rest of the world, use DOM manipulation; but for larger sets I’ve tended to go down the brute force and ignorance approach of hand coded File I/O combined with regular expressions to efficiently parse out the data required.

Last week, an email from a datasmith name Cathy prompted me to look into using Sax for loading XML. Cathy, like most datasmiths, is not a professional programmer, she has a “real job”; part of that job is analysing large datasets, she’s learned enough programming (mainly Access & Excel) to do that job more efficiently. The data she needed to parse this time was encoded in XML, but being very large and built on a schema that constantly changed, the default DOM approach overpowered both Access and Excel.

Cathy had originally contacted me looking for information on using Talend to read the data and it looked like she was about to start a new side-career as a Java programmer. I figured there must be a way for her to leverage her existing skill-set (VBA) & this led me to MSXML’s implementation of SAX2. She was delighted; although many of the concepts would have been new to her, at least they were bounded within a world she was already comfortable with (the basis of how we all manage to incrementally expand our knowledge).

The only problem was, the example code no longer existed (it would have been in VB6, but who uses that these days, other than a few million VBA para-programmers, let them eat the .NET cake). So I coded up the first example in Excel/VBA and here it is if you need a quick-start to the joys of SAX2.

Which leads me on to XBRuLe:

“One XML to rule them all, One XML to find them, One XML to bring them all and in the darkness bind them…”

A former colleague of mine when explaining his computer science studies to Meath farmers whom he regularly met while hitch-hiking home from college (mid-1970s) was usually met with the response: “Ah computers, dere de comint’ing. XBRL has been the coming thing for quite a while now.

Being in the business of Business Reporting, XBRL has always been on the radar, and of late, the radar is showing incoming fire. First the SEC, and now the UK’s HMRC, are mandating it as a filing method. Whether this is a good thing or not, is open to question. As this article puts it: “XBRL is a  case study in complexity” http://bit.ly/951oEY “the producer of the sample must have suffered a polymorphic recursive brain meltdown”.

But needs be; I’m in the business of shaping difficult data, so I’ve started to re-acquaint myself with the subject (last time I looked at XBRL in any depth was 2004). Part of that process will be to beef-up xLite’s XML capability, which, with me being on the table side of the “Tables Vs. XML; the data lingua franca debate“, will involve getting the data into a relational form at the earliest possible moment. For example, for discovery I would use a classic parent-child recursive structure, but not having something like Oracle’s Start-with, adding  a virtual table to make navigating such hierarchies easier with SQLite.

If anybody is peddling to you the concept that this brave new word of XBRL powered reporting will make your business reporting life easier, they’re either lying or don’t fully understand what they’re selling. As 19th century industrialists were wont to say: “Where there’s muck, there’s brass”; and with XBRL, you’ll be up to your oxters in muck, but with the brass all flowing to others, perhaps even to me!


Excel as a document-oriented NoSQL database

I’ve been a long time fan of CouchDB, one of the many NoSQL databases to appear in the last few years. CouchDB is a document-oriented database, which with solid B-tree indexing and easy replication, topped off by a MapReduce style view mechanism, puts it up there as a best-of-breed noSQL datastore.

Now it may seem strange that somebody whose SQL – does exactly what it says on the tin post clearly marks him out as an RDBMS fanboy, can also sing the praises of a noSQL database. Are they not mutually exclusive? To many, particularly in the noSQL world, this appears to be the case, with some clearly determined to re-invent the wheel, ignoring the lessons learned by relational database practitioners.

The main advantage to me of document-oriented databases, such as CouchDB, is the ease of setup and subsequent pain-free evolution of data models that comes with a schema-less database. The main disadvantage is the relative rigidity of downstream analysis built into most such databases. MapReduce, such as used by CouchDB, is fine for predefined views developed by programmers, but as we know, reporting never stops; datastores front-ended by a SQL interpreter open up the data within to a much wider audience (be that through hand-crafted SQL queries or more likley via reporting-tool generated SQL)

Of course document-oriented, noSQL, schema-less datastores have been all the rage with end-users for close on 30 years. They’re called spreadsheets. Excel has over the years added features (such as list handing & filtering) that have made the spreadsheet the database of choice for millions. Anybody who deals in corporate data is aware (sometimes painfully aware) of just how much data is stored in these Data Populi repositories.

I, as an IT professional, am aware that Excel workbooks as books-of-record, have been, and continue to be, the cause of many data quality problems. Yet, I’ve also seen, and am myself responsible for, many successful Excel ‘database implementations’. Take for example, my filing system.

I don’t have a filing cabinet, instead I use small stackable cardboard boxes to store documents. As I receive or generate documents I simply place them in the current open box. Every so often, usually prompted by a VAT or other tax return deadline looming, I record what’s in the box, and if the box is looking full or maybe it’s end-of-year, I’ll ‘close’ the box and open a new one.

Each box is represented by a separate workbook, each document by a separate worksheet. Some documents such as electronic Sales Invoices may not require a physical copy simply a link to a PDF, but I still tend to store a printed copy. Others, such as Purchase Invoices, have their details manually copied from the original paper based document, I usually also add a hyperlink to an image of the source document. (I no longer use my scanner, instead I use my phone camera to record paper documents).

Bank reconciliation involves recording the bank item ref against the appropriate document and linking back to the Bank Statement worksheet  (which as I still receive paper-statements consists simply of a link to a photo of the statement and basic info such as date of statement and whether or not I’ve reconciled it).

VAT Return documents are generated using links back to source documents and a link to an image of the completed paper return (not yet signed up for ROS). Similar documents are generated for year-end tax returns & accounts.

So my ‘filing system’ is also my ‘accounts system’. This is common practice amongst small (and not so small) businesses. The advantage of this approach rather than using a “proper accounts system”  is the simplicity and the in-depth knowledge it forces me to have of ‘my data’.

But can this type of thing scale, and what of the businesses that are using similar systems to manage thousands or indeed 10s of thousands of documents or transactions? The simple answer is no,  at least not without a semi-automated process and a cost-effective means of analysing the data; many such systems are on the road to disaster. That disaster may take the form of data quality issues or the significant (and often hidden) cost of operating such systems (often the operators are highly paid accounting staff or managers whose cost is buried in general overhead costs, unlike internal or external IT resources whose time tends to be project allocated).

But again, I and others, have managed to setup systems such as these that were  cost-effective (not just in initial construction but in ongoing running costs) and managed to maintain data quality. This usually involved building a simple work-flow process, automating to some degree but keeping the human touch as much as possible. My xLite datasmithing platform had its beginnings in such RSS (Really Simple Systems) scenarios. Many such “systems” were IT driven ETL processes or data cleansing initiatives, others, business initiatives such as sales planning/budgeting or customer surveys.

I haven’t used xLite to automate my filing system (my transactional volumes are too low and my motto when it comes to systems is, “good enough” will do), instead, relying on standard spreadsheet formulas and few bits of VBA, but if I suddenly found myself at the business end of a fire-hose of documents I could easily do so.

Much like CouchDB, I could  create ‘map’ views of my documents, but instead of MapReduce Javascript code, I’d  load the documents into SQLite tables (using a duck typing approach; if the document had the required data, e,g, Invoice No, etc. for Sales Invoices, load, otherwise ignore). The ‘reduce’ part would then be standard SUM(), Group By SQL statements.

I could also mine the documents for text and then use SQLite’s FTS full-text searching to create a free-format search index or use xLite’s TAG Cube functionality for a more formal, hierarchy supporting, tagging index.

If I needed to share the system with others in my organisation I could use a light simple distributed version control system such the SQLite based  Fossil. This would allow for many of the replication benefits that CouchDB offers.

In fact, if I wanted to backend the system with a server based database I could call in the services of CouchDB itself. Easily done as xLite has inbuilt Python support and the library that xLite uses to interact with SQLite on the Python side is APSW. And guess what, APSW now includes a virtual table implementation that lets you access CouchDB databases from SQLite. Excel as a front-end to CouchDB!

If the ‘it does exactly what it says on the tin‘  Ronseal catch-phrase epitomises SQL  then perhaps ‘ Simples‘ as Alexandr the Meerkat might say epitomises the potential of document-based databases.

Why not join me on Twitter at gobansaor?

Excel 2010 Application.Caller Bug

I’ve just released another xLite “introduction”, this time the xLiteWorkbookFunction function. I’ve had most of the now released functionality working (and in use) for quite a while but had delayed publishing until I’d installed Excel 2010 as I’d wished to test against a modern Excel version.

I’d not bothered with Excel 2007, as I couldn’t see the advantage over Excel 2003, but Excel’s new PowerPivot is one hell of a reason to upgrade to 2010. I’d preformed a quick test against 2007 by installing a trial version on an EC2 Window’s image and it had appeared to work fine; but it was a different story under 2010,strange things started to happen.

The core functionality, as tested by VBA code, worked OK but when I tested using xLite.SQL as a UDF (a user defined “formula”) things fell apart. For an explanation of what the xLite.SQL function is and why I wasn’t that surprised when it started to act up, see here. As xLite.SQL plays to the rules rather than the spirit of a UDF, I assumed it was pay back time  for my blasé ignoring of functional programming constraints and I set about tracking down the cause.

It turns out the cause is a change in behaviour (a bug) whereby in certain circumstances the cell range returned by Application.Caller is not, as one would expect, the cell hosting the called UDF but that cell more usually associated with Application.ActiveCell (i.e. most likely the cell where the cursor currently resides).

Why is this a problem and what is Application.Caller usually used for? The most common use I’ve made of Application.Caller is to determine whether a VBA function had been called from a cell as a UDF or from a menu, button or some VBA code. This is important because when called in UDF mode, a function must be side-effect free, i.e. its only affect on the workbook is the return value; attempting anything else will silently fail (or in extreme cases, abort Excel). This functionality is not affected by “the bug” as the usual method of achieving this is by …

If IsObject(Application.Caller) Then

… this will work even if Application.Caller returns .ActiveCell, as both are objects.

If however, various properties of that range need to be interrogated (such as the actual address or the formula text that xLite.SQL requires) then Application.Caller in certain circumstances returning the .ActiceCell range rather than the calling cell’s range causes problems. I’ve managed to get around these problems by adding an extra parameter (homeCell) to xLite.SQL which the function will auto populate on first entry (when Application.Caller and .ActiveCell are guaranteed to be the same object). For example, a SQL call entered in cell A2 of Sheet1 as ..

=SQL(“Select name from sqlite_master”)

is automatically rewritten as …

=SQL(“Select name from sqlite_master”,,,,,,”[test.xls]Sheet1!$A$2″)

Not ideal but it gets around the problem in the short term.  Long term I may do a version for sub-2010 reverting back to original dependence on Application.Caller. As xLite studiously avoids Excel’s UI features such as menus/ribbons I’d hope to avoid different versions for pre/post ribbon editions, but needs must.

The xLiteWorkbook example (test_call_workbook_function.xls calling test_workbook_function.xls) consistently generates the bug (there’s a SQL call against the log database at the end to test for this). Executing the same logic manually on the called workbook (test_workbook_function.xls) generally doesn’t, but it has done so occasionally!

I’ve messed around changing the .xls files to .xlsm in case of compatibility problems but it doesn’t appear to affect the outcome.

If anybody else has come across this problem or has an alternative to using Application.Caller to return a UDF’s calling cell’s range, do let me know.

UPDATE:

Peter Hoadley suggests a workaround by using Application.ThisCell, see comments below …

UPDATE: March 19th 2010

Danny Khen from Microsoft has emailed me to say …
This is a known Beta bug which has since been fixed. So you should expect it to go away when Office 2010 ships

Why not join me on Twitter at gobansaor?

TAG Cubes – SQLite Star Query Part III

It’s no secret that I’m a huge fan of SQLite and Excel, particularly when used in combination. I also greatly admire the open source BI engines, Palo and Mondrian. Mondrian appeals because of its “ROLAP with a cache” architecture and its implementation of MS’s excellent MDX language. When I say MDX is excellent I’m talking with my professional programmer’s hat on, as an end-user tool it’s a non-runner. This is where Palo comes in, building on the hypercube concepts pioneered by the likes of  TM1 and ESSbase, it presents a designview that’s approachable by a vastly greater percentage of “civilians” than is the case with ROLAP-based solutions.

The trick behind TM1, Essbase, Palo etc. is the extension of the spreadsheet metaphor from two to multiple dimensions, while still binding the interface closely to the familiar spreadsheet (which for most of the business world is still Excel).

So where does SQLite come in all this?

At first glance, SQLite lacks the sophisticated join functionality to support star-queries, but of course, if the dataset is small then a full-table scan of a fact table, or better still, loading the fact table into memory negates any such short-comings.

In fact, all traditional ROLAP engines have problems with dimensional models, particularly when you reach the point of using summary tables or query re-writes, that’s why the emerging SQL-speaking columnar-databases are such a godsend for ROLAP data warehouses.

It was SQLite combined with Excel acting as a data prep platform that was originally my main interest, so for pivoting, Excel’s own pivot table would have to do.  Nevertheless, I felt the tool was incomplete without the ability to directly pivot the underlying SQLite database.

Why not use Palo or Mondrian as a pivot tool? Well yes, where a fixed permanent “solution” is required then the extra moving parts of either approach would be justified and indeed necessary but that is to miss the essence of what I call datasmithing.

Datasmithing is not data warehousing nor is it the provision of solutions (which, for example, Palo superbly enables in multi-user budgeting situations). Datasmithing, as a skill, is of course part of the process of both, but it’s on the edges, at perhaps the planning or consumption stages.

Datasmiths deal in the unknown, in change, in disaster recovery, in systems’ commissioning, in the never-ending barely-repeatable processes thrown up by daily business life.  For that, the toolset required must be as simple as possible (but no simpler), self-contained, document-oriented, secureable (is that a word?) and easily archived and retrieved. Excel and file-based DBMSs such as MSAccess or SQLite fit the bill nicely, server-based technologies such as DBA controlled database servers or IT installed “solutions”, less so.

Jedox has made Palo relatively easy to install (and likewise, Canada’s SQLPower has made Mondrian setup a painless exercise via their excellent Wabit reporting tool), but, the zero-install, email friendly document approach that spreadsheets are famous (and infamous) for, is preferable in many situations. This is something that Microsoft have recognised in their Gemini add-in for Excel 2010, but Excel 2010 is a not here yet and it’s likely to be five years or more before it’s as common as Excel 2003 is today.

The inclusion of FTS full-text searching with SQLite triggered an ah-ah moment with regards to pivot-enabling SQLite.

The usual method that hypercube-like excel-friendly OLAP tools use to return data is via a UDF like so…

=DATA(“CubeName”,”value1″,value2″,…)

…where valueN represents dimensional elements, so…

=DATA(“SalesCube”,”Beer”,”Profit”,”Jan 09″,”Actual”)

…is the Actual Profit for Beer sales in Jan 09. The dimensional elements act as “tags” to locate a particular value, there is of course much more to tools like Palo; hierarchies, intra-cube rules etc. but in essence most OLAP tools are like www.delicious.com for number crunchers. This method of retrieving data fits well with how people use Excel and not just for pivots, but for embedding OLAP aggregated cells in lists.  For example, a CRM scenario; a Sales Rep makes a list of her ‘best’ (subjective) customers, but needs hard (objective) stats, to be placed alongside the list to convince the boss or to track actuals against expectation.

Dimensional elements as tags; FTS3 virtual tables as fact table indexes; the concept of a TAG Cube was born.

In the above example “Profit” would most likely be described as a measure (Palo, a near pure hypercube does not distinguish between Measure and other Dimensional coordinates). Dimensions, measures and attributes are in reality interchangeable (a Customer ID can act as a dimension or an attribute, but by applying a  Count Distinct to it, it’s a measure) but most OLAP solutions treat “Measure Dimensions” as different, and so do TAG Cubes.

By using the default fact table structure (a single-columned table) and querying using the default measure (which translates to the SUM() of that single value) a ‘pure’ approach can be used. But, ROLAP is tightly bound to the concept of a fact table, and since SQLite is relational, TAG Cubes offer the ability to use a wide fact table approach and I think gains considerably in flexibility by so going.

The above example of using Count Distinct, or the simple creation of calculated measures are examples of this flexibility. Another, is a measure based on SQLite’s concat_group aggregate function to provide a drill-down facility, e.g.

=DATA(“SalesCUBE”,”ROWIDList,”Beer”,”Jan 09″,”Actual”)

…where “ROWIDList” would be setup as concat_group(rowid,’,') and will return a comma separated list of the underlying fact table ROWIDs.

A major reason for rolling my own pivot engine was to add a concept of “namespaces” and to separate the implementation of these namespaces from the actual pivot.  When a tag (or a predefined hierarchy of tags) is assigned to a cube, it’s also assigned to a namespace, in many cases namespace and cube would be synonymous, but in some cases a more sophisticated approach is required:

  • Multiple cubes sharing the same set of conformed dimensions would be best served by such cubes sharing a common namespace, and so they can.
  • Different consumers of the pivot may require the use of a different language, be that a spoken language or a different ‘business language’ e.g. Manufacturing Product Codes V Consumer Product Names. Again, easily done.
  • Sometimes identifying data can’t be shared with the datasmith or the numerical analyst working on a problem; in such cases being able to replace  the actual namespace with an obfuscated one can be very useful. Or, for added security, the namespace might only be issued to approved  PCs while the tag index and fact table are stored on a shared drive.  Needs some more work to make managing such scenarios secure and easy to use but the structure is there.

As hinted on above, the three elements of a Tag Cube, the namespace, tag index and fact table can be assigned to different databases (i.e. files). Due to the wonders of SQLite’s ATTACH statement and the backup API’s ability to quickly load/unload databases in/out of memory, it’s possible, for example, to load namespace and tag index (i.e. the ‘dimensions’) into a memory database, while a very large (i.e. too big to fit to memory) fact table remains on disk. Fast and cheap SSDs will add further configuration options.

Although most of the TAG Cube functionality is available only within Excel, I’ve built a C based SQLite Virtual Table (cFact) to allow the tag index to used outside xLite. This means that SQLite drivers for ODBC (for use as a Pivot Table source, for example) or JDBC (for use in  SQLPower Wabit perhaps) can efficiently access data models built using xLite.

I had to revert to using C rather than my preferred Python (did I mention that xLite now embeds Python in Excel, no, well it does, Python the newVBA ?), having failed to get around multi-threading issues with callbacks to Python in both the ODBC and JDBC drivers. I’d make a career promise to myself many years ago, not to having anything to do with printers or threads, and I think I’ll stick with it :)

TAG Cubes are the latest addition (still WIP actually) to be added to xLite, adding to:

  • VBA coded SQLite SQL functions.
  • Worksheet Functions; call out to a ‘function’ built using Excel formula, passing a parameter list and returning a value.
  • Workbook Functions; like Worksheet Functions, but loading a new Workbook, passing in parameters, passing back a value (or tables) and closing the Workbook when finished.
  • XLiteScript; xLite exposes its functionality via VBA coded UDFs, which can be called like any other formula, but data prep activities often require sequential procedural logic, xLiteScript is a table-oriented scripting mechanism offering basic flow-control logic.
  • pyScript; I embedded Python into xLite to take advantage of Python’s speed in developing Virtual Tables, SQL Functions and extensions to SQLite and to tap in the wonderful world of Python code. I’ve also added the ability to use Python from scripts defined within Excel (to indent, tab to the next cell!).
  • Fast load/unload to/from CSV.
  • Load from any ADO source.
  • Remove xLite formulae and rename and save Workbook, very handy when used via Workbook Functions to mass produce Excel “reports”.
  • Other WIP items are; load from SAP, load/unload to/from Amazon S3, use Palo cubes as TAG Cube “facts”, slot in/out Palo for TAG Cubes, auto-generate Mondrian XML based on TAG Cubes, write-back and splash, Python & VBA TAG Cube “rules”.

I’ve started the process of releasing the beta code here …

Why not join me on Twitter at gobansaor?

Palo HTTP API via Excel/VBA

As a result of a request on the Palo support forum last week looking for a VBA tool to directly access the Palo OLAP server via its native HTTP API, I realised I had such a tool. I had built it about a year ago (to use alongside Fiddler Web Debugger and .NET Reflector) to help me understand in detail the interaction between the Excel client and Palo. (Remember this was before Jedox released the Excel add-in as open source)

I removed the SQLite dependent parts (which allowed me to load meta-data into tables and analyse using SQL) and used a pure-VBA MD5 hash routine to reduce the number of moving parts. Also added a few VBA helper routines (including an array-formula UDF to allow direct calling of the API from Excel UI).

As I said, I originally built it as a learning aid but I’ve started to look at it again in the context of a an Excel/Sqlite Palo ETL add-on.  Having the flexibility of the native API and my own meta-data cache in SQLite might prove very useful.

The code may be found here.  Enjoy!

Why not join me on Twitter at gobansaor?

PowerPivot – XXL, Excel on Steroids

In my last post about why I use SQLite in combination with Excel for datasmithing tasks, I listed the more traditional backends (Excel itself, MS Access, RDBMs & MOLAP cubes) that one would expect to “compete” with such an idea.   But I suspect that if that same post appeared  two years or so into the future, there would be a fifth contender, PowerPivot cubes.

PowerPivot (at the time it was called ProjectGemini) is due to be delivered as a free add-in to the next version of Excel (2010) ,like the Analysis ToolPak or the Data Mining add-ins for Excel 2003.  (See this OLAP Report Project Gemini, Microsoft’s Brillaint Trojan Horse for a good overview of the tool).

Donald Farmer ,who works on the project, having seen the SQLite as the MP3 of data post and recognising that the use cases behind combining SQLite with Excel were similar to those of Project Gemini, kindly offered me a demo of the product.  Well, the phrase “Excel on steroids” has been much used in the past (in particular of add-ins such as Essbase, Palo or TM1) but this “ya gotta see”, Donald likes to call it XXL.

Millions of rows of data in-memory on a 4GB PC being “modeled” using a “user-friendly” pivot-table-like interface. And when I say, modelled, the user isn’t being confronted with concepts such as dimensions, levels, attributes, facts and so on, but a classic star schema model is nevertheless being built behind the scenes.  And it’s this model that allows PowerPivot to escape some of the inadequacies of pivot tables, e.g. allowing for rules and hierarchies to be defined.  The resulting model can then be saved and shared as a file (keeping to the document-centric ethos of Excel) but it can also be posted to and managed by SharePoint.

SharePoint will be extended to allow the IT function to manage and audit shared models to whatever degree the organisation requires, but the single file format will also allow smaller groups to share without the need for IT involvement (essential if bottom-up adoption is to be encouraged).  SharePoint will also add the “Web2.0 collaboration layer”.

How will MS make money from this if it’s free?  The first clue is the SharePoint backend, more functionality means more reasons to purchase and use MS’s server stack and the same applies to Excel itself. I, like many others, are very happy using Excel 2003 and look on Excel 2007 the same way the market in general has looked on Vista; i.e. pretty, but lacking a strong enough reason to upgrade unless forced to do so. (Excel 2007 also has the ribbon issue, not one I find a major problem myself, but others do).  But I would upgrade to a version Excel that offered Project Gemini capabilities and I’m sure others would follow (and more importantly to MS’s revenues, thousands of corporate accounts would too).

PowerPivot offers proof that MS realises, what those of us on the ground have know for years, that BI projects are in the main, Excel-centric; all the ‘hard sums’  and awkward decisions end-up back on the desktop.  MS has decided to publicly recognise that fact and profit from it. The timing is both economically and technically opportune; PC speed and cheap memory means that a huge chunk of even a large corporation’s datasets can be analysed by a PC (according to this, the median size of original data in OLAP datasets is about 5GB); and there’s obvious cost-benefits for companies facing difficult times requiring more to be done with fewer resources.

What will the effect be on tools such as Essbase, TM1, Palo etc. ?  Well, let me put it this way, if their owners are making strategic plans for 2010 onwards and they’re not taking account of the PowerPivot effect perhaps they should.  Most likely PowerPivot will help increase the overall market for OLAP tools, with the incumbents tending to specialise in their existing niches (e.g. Palo in Budgeting, with the added value of being free and open source, which has a premium over just being ‘free’).

So will I put away my Excel-SQLite fixation then? No, for two reasons:

  • PowerPivot is not here yet, and the proof of the pudding will be in the eating. Also, when it does appear it will only apply to Excel 2010 (or whatever) and as many companies are still on Office 2000 (and a few on 97!), it’ll be at least  5 years before a significant percentage of sites upgrade.
  • The SQLite addition to Excel offers not just BI capabilities but also makes a nimble ETL and data integration engine. I’m also experimenting with Amazon S3 integration to enable simple work-flows for small distributed teams (or even same-office groups where the WAN is the new LAN).

Whether you agree or not in the validity of  ”workgroup BI“, be aware that MS does and it thinks that BI is about to enter a new phase,  for proof see MS’s Nic Smith’s The History of Business Intelligence video.

UPDATE: 19th Nov 2009

Last evening I downloaded for the 1st time both Excel 2010 Beta and the PowerPivot (new name for Gemini) add-in.  First impressions; yep, in the flesh it’s just as impressive as the above demo led me to believe it would be.  As I said on Twitter last night Datasmiths of the world; download the Excel 2010 Beta and PowerPivot add-in; this ya gotta see!!!

Why not join me on Twitter at gobansaor?

SQLite as the MP3 of data

… and Excel as its “mixing desk”.

When I tell people that I use SQLite in combination with Excel (via xLite) as my datasmithing platform, many ask why SQLite? (Many others ask why Excel?  but “sin scéal eile”, that’s another discussion – Excel as the iPod of Downloaded Data.) Those that question my use of SQLite tend to cluster into four camps:

  • Pure Excel jocks.
  • MS Access fans.
  • The client server database brigade (SQL Server,Oracle; or if FOSS fans; MySQL, PostrgeSQL).
  • The MOLAP folks (Essbase, TM1, Palo).

Now while I have used and will continue to use/encounter all four ‘approaches’, I’ve come to believe over the last couple of years that SQLite brings something special to the datasmithing game. When I look back over nearly 30 years in the data handling business I keep thinking – “If only I had SQLite then, how much easier/quicker/cheaper that task would have been!”.

Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.

As operational local caches, this use of SQLite is already far advanced. SQLite is embedded in lots of every day software tools, everything from McAfee anti-virus to TweetDeck Twitter clients (best one IMHO). But my interest is more in SQLite’s potential as a micro-BI (or maybe more correctly a distributed-BI) platform. A sort of MP3 format for distributed structured data, if you like.

But why SQLite (and in particular SQLite in combination with Excel) as my datasmithing tool rather than the four other approaches?  First, what’s a datasmith?

Managing and manipulating datasets has become an integral part of many people’s job, not just accountants (the original of the species) but marketing executives, sales staff, pricing analysts, process engineers; different job titles, different roles but using a skill that they’ve likely never been formally trained in, a skill without a name; a skill I call datasmithing. I like to think of  myself as a master datasmith, or a datamith’s datasmith.

If you consider yourself a datasmith then most likely the tool you use to manage your datasets is Excel. And before you apologise, don’t. Excel is by far the best and most flexible end-user data manipulation tool out there. Everything from the current financial crisis downwards has at some stage being blamed on Excel, but you know and I know that many tasks would remain undone or under-done were it not for end-user generated spreadsheets.

Spreadsheets are not however optimal for some tasks, linked spreadsheets in particular are data disasters in waiting. While fantastic for data transformations and presentations, as books-of-record they’re rarely suitable. Other tools such as SQL based relational databases and in-memory OLAP offer much better and potentially much more cost-effective data modelling functionality, but also at a cost of extra complexity and ongoing technical support.

MS Access (which like SQLite, is a document-centric, non-client-server database; but unlike it, is also a forms/reporting development environment) would appear to be the natural local store database. My problem with MS Access has been its tendency to try to be all things to all men, ending up not fully satisfying anybody. Professional developers think it’s too limiting, non-techs find it too intimidating, even reporting, where it once showed promise left a big enough opening for Crystal Reports to evolve. It is also limited to Windows which might not seem to be a problem if combining with Excel, but, as it’s often necessary, due to scale or complexity of the data,  to use ‘proper’ ETL tools such as Talend, having an OS agnostic database format than can act as a distribution media (think MP3s again) between “mixing desks” can be very useful.

The big difference to MS Access for me is SQLite’s open source code; code that’s a pleasure to browse and with an approachable API that even I, with my very rusty C skills, can manipulate. Having access to that code allows me to tightly integrate it with Excel, so much so, that I can use Excel functions (built-in functions, VBA user-defined functions and 3rd party add-in functions) directly from SQLite’s SQL; and vice-versa, access SQL functionality via Excel “formula” calls. It is  also possible to  load most datasets into memory using SQLite’s in-memory mode enabling very fast processing  and near zero-latency when passing data to and from Excel/VBA. In the near future, cheap, large SSDs will enable non-memory databases to offer similar speed but also handle extremely large datasets (see this for a glimpse of that future).

What about the big beasts of the data world, the client-server databases? Having spent most of my professional life working with such tools I’m aware of the power of a well designed relational database. If SQLite is the MP3, then these are the master tapes, the DDD recordings. Most of the data that eventual ends up in SQLite for analysis and/or transformation will have originated in data-warehouses or be directly sourced  from OLTP systems built using relational technology. But for close-up analysis and transformation, the pure simplicity and convenience of SQLite is hard to beat. That simplicity is primarily due to its Excel-like ‘document’ nature, all code and data can be housed in a single folder (or true-crypt container for added security), ensuring that the ‘problem domain’ can be easily archived and/or shared with others without the need for professional IT resources.

And yes, I hear you, isn’t that the basis of Excel-hell? Yes it is, but over the years I’ve found that this is rarely a problem for datasmiths, they deal day-in day-out with document work-flows, they understand the risks and the benefits (mainly the simplicity) of the approach. Where the nightmare truly happens is when this approach is used as an alternative to an OLTP system i.e. using Excel and other document-like datastores as books-of-record in large multi-user environments – “there be monsters for sure”.

How about MOLAP? Wasn’t Essbase’s name derived from “extended spreadsheet database” and doesn’t Palo offer a truly excel-friendly multi-user database back-end? Again, having worked with Essbase for many years and now being a big fan of the open source Palo MOLAP tool, I fully appreciate the power that such tools brings to analysis and multi-user planning tasks. But for many situations, an Excel Pivot Table is “good enough” and even when it’s not, it is possible by utilising what I call a tOLAP cube (essentially, a fact table indexed via tags enabled by Google’s great addition to SQLite, the FTS3 virtual table) to build and access  powerful, yet simple, cube-like data structures.

By integrating SQLite with Excel, datasmiths can have the best of both worlds, familiar spreadsheet front-end combined with a fast and powerful SQL engine and datastore, in fact, everything that MS Access should have been.

Why not join me on Twitter at gobansaor?

SQL – does exactly what it says on the tin

SQL how unloved it must feel sometimes, constantly being maligned, accused of being on the wrong side of the object-relational impedance mismatch,  lacking the glamour of OO programming languages that claim the moral high ground. Yet at the same time hewing and hauling most of the world’s structured data on its old but well fashioned back.

SQL is perhaps the world’s most popular DSL, a declarative language for the manipulation of tabular data, easy to learn yet capable of powerful (and sometimes complex) expressions.  And like the Ronseal ad, a SQL statement no matter how simple or complex, does exactly what it says, all the complexity of loops and iterations and the attendant errors, abstracted away, it just works!

SQL is both a programmer and an end-user tool; after Excel formulas, it’s the language most likely to be understood and used by “civilians”.  There are few enough such cross-over tools, so think twice before building a datastore that doesn’t offer a SQL API.  And I guess that’s what Amazon did. Although SimpleDB is not a relational database, they’ve decided to add a SQL API, following Google’s lead with its SQL front-end to the non relational big-table backed Google App datastore.

SQL is also the reason why I’ve integrated SQLite with Excel , leveraging SQL to manipulate tabular data with greater efficiency and fewer errors while still keeping the touchy-feely power of Excel.   I expose SQLite to Excel via UDFs rather than menu options or wizards, so that the transformation logic is visible and approachable (at least to those comfortable with excel formula “programming” and with basic SQL).

SQL is my weapon of choice because of my belief in the primacy of data. It is data that matters in the long run, not the algorithms or GUIs that temporarily use (and abuse) it.  In my time in Guinness Ireland I had the task of transferring master and historical transactional data from “legacy systems” into SAP ,Siebel and a new datawarehouse; data that had a decade and a half earlier been transferred by me  into those same legacy systems from even older systems. In fact, the data’s electronic lineage could be traced back to a 1960′s era ICL mainframe  (I have the original spec!) and I’m sure it existed in accountancy machine punch-cards  prior to that. Understand a business’s data and you’ll not just understand the business as it currently operates but also how it operated in the past and its future potential.

SQL abú.

Why not join me on Twitter at gobansaor?

Spending time on Excel-SQLite, C, VBA Callbacks & Twitter

Haven’t posted here in a while as my spare time has been soaked up programing, well actually refactoring would be more exact.  My xLite “SQLite empowered Excel” codebase has grown over the years and required a serious makeover to get rid of stuff I no longer use and to generally make it more robust.  I also decided to add some extra functionality to my VBA friendly C wrapper for SQLite (based on Pivotal Solutions’ pssqlite.dll) which meant I had to re-acquaint myself with my long lost C skills, so doing reminded me how much I like C. Close to the metal programing if not exactly super-productive is nevertheless super-powerful.

The new improved xLiteSQLite.dll now has a built-in CSV loader (both file based and string based – handy for loading Palo HTTP API responses into a table). It also returns a one columned variant array of CSV values for quick rendering via “text-to-columns” code (by far the quickest way of handling large dataset pasting into Excel).

I’ve also added the ability to create SQlite UDFs (user defined functions) in VBA (thanks to http://stackoverflow.com/users/4007/rpetrich).  This is a very powerful feature as it allows SQLite selects to act as a “loop controller” calling back to  Excel/VBA functions to process each row, really useful for ETL tasks. And not just scalar UDFs but aggregating (aka group-by) functions too, allowing the use of Excel’s powerful array functions in SQLite statements.

All in all, the changes to the xLite VBA code and the C wrapper makes Excel backed by SQLite a seriously good micro-ETL tool. Combined with Palo, the result in a truly wonderful micro-BI platform; a cost-effective toolset for these recessionary times.

Of course I’d be lying if I said code was the only reason I’ve been neglecting my blogging duties, I’m afraid I’ve a confession to make, Twitter has hooked yet another sucker, me! 

I’ve found I’ve settled in to the whole micro-blogging thing with ease, and have managed to make contact with people I would not have encountered otherwise, as well as reconnecting with others that I’d lost contact with.  So if you too are all-a-twitter then do please follow gobansaor-on-twitter.

Why Larry hates the cloud, and my data trinity.

Last week Oracle certified Amazon EC2 as a supported platform, that same week Larry Elison attacked the concept of cloud computing as pure hype. Obviously, Larry is not happy with this whole cloud thing, and I think it’s not just the threat it poses to the software industry’s traditional licensing model that worries him, rather, as Robert X. Cringely points out in his “Cloud computing will change the way we look at databases” post, it’s the likelihood that it sounds the death-knell for large-scale traditional databases.

This new database paradigm is memory rather than disk centric, with the disk-based element acting as an archive/backup/restore mechanism which can easily be stored on commodity SAN devices ( e.g. Amazon’s ESB). Using MapReduce technology Google effectively holds the whole Internet in memory, not in one big super computer but in lots of cheap commodity servers.

But it’s not just in the realm of mega datasets that RAM based databases threaten traditional models. Excel is a memory-based database engine, so too in-memory OLAP tools such as Palo. Such products’ ability to handle large volumes of data has increased over the years, with the decrease in RAM costs and the appearance of cheap 64 bit machines (which are no longer limited to 2G/3G process working sets).

That doesn’t mean that we’ll throw away SQL databases in their entirety, SQL and the relational model will continue to be useful. But perhaps of greater use in local datastores/caches that as the building blocks for large scale datastores. For such local caches, less will be more; fewer features, easier to configure, more flexibility. That’s why I like SQLite; long after the dinosaurs of the database world have disappeared, I imagine SQLite databases will continue to survive, embedded in mobile phones, browsers, wherever a local datastore is required. And more than likely operating in memory rather than off disk.

By combining Excel with an in-memory SQLite database, linked to a Palo OLAP in-memory server, it’s possible to take advantage of three powerful data-processing technologies (spreadsheets, SQL, multi-dimensional cubes) all within your PC’s RAM. You could do serious datasmithing with such a combination on a pretty mediocre laptop, with most modern machines providing an excess of CPU power, no need for super fast disks, just as much memory as you can muster. And, with Windows on EC2, these three amigos will soon be capable of being used as a cloud bursting platform.

Excel, SQLite and Palo, my data trinity.

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of  Oracle10g  Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL) ,OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem, Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these new shiny (and open source) “toys” I’ve found that rather than just been incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy, is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool, via JDBC is can access every database type on the planet, it has built-in web-service capability and access to a  multitude of APIs via its Java component for non-database data sources.  The addition of  Groovy makes the use of such Java APIs simpler and quicker and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to plublish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name,customer, sales_region, getpalodata(“SALES”,customer_id,”All Products”.”Total Sales”,”Euros”,”YTD”)  as customer_YTD, getpalodata(“SALES”,sales_region,”All Products”.”Total Sales”,”Euros”,”YTD”)  as region_total_YTD from list_of_top_customers;

… where getapalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, ( as SQLite lacks star-query (or even multi-index query) support.)  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or for ad-hoc queries. This is powerful stuff, simple, free, powerful stuff.

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with a SSH VPN (or Hamachi) to provide security).
  • At start-up, do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc. Storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then setup a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations which would return a dataset directly to the calling client or maybe just acknowledge the request, queue it up for processing later, sending the resulting dataset by EMail or RSS feed when finished.
  • At a fixed time the service would shut it self down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point, build what you need, add the levels of security (or none at all) that fits your situation, all within a open framework, with zero lock-in (okay, still using Excel, anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server, host it on an EC2 instance, (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).

Now tell me that doesn’t make sense?

New universal SQLite JDBC library.

Both Talend (Java) and Kettle distribute the Zentus.com pure-Java SQLite JDBC driver and for most purposes this run-anywhere version is fine. But, if you really need to take advantage of SQLite’s speed then connecting using the native JNI version is a must.  Doing this was easy enough, just change over to using a generic JDBC connection specifying the required native jar and placing the associated dll/so on your system path.

But now there’s an easier way, the latest version (V052, in fact from V050 on) is a universal jar, it contains native JNI libraries for Windows, Linux and MacOS alongside the pure-Java version.  It will automatically pick the correct lib for the platform and fall back to the pure-Java version if required.  You can tell if it’s picked up the native lib by calling conn.getDriverVersion(); it’ll return “native” if it has.

To upgrade to this jar in Kettle see this, this time replacing the nested jar with sqlitejdbc-v052.jar.

For Talend:

  • Either rename the new V052 jar to sqlitejdbc_v037_nested.jar, replace the existing V037 jar in the ../lib/java folder with this new renamed file.
  • Or, you could edit the Java specific XML files in the various tSQlite component folders, replacing the references to the old nested V037 jar.
  • Or, and this is what I would do, don’t use the tSQLite components, replace them with tJDBC generic components, then you can pick whatever version of the driver you require, you could even change to a different database provider!

The Talend tradition of a separate set of components for each type of database, seems to be a hangover from its Perl-generating roots. It’s true that database specific components are required for certaing tasks such as  bulk-loading, ELTs and so on, but JDBC was designed to be generic and as long as the SQL syntax is compatible, it makes switching in an out database providers very easy.  So unless there’s a good reason, stick to using tJDBC.

Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool I still had one major problem with it, its lack of a scripting tool.  Kettle (Pentaho PDI) has Javascript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl.  I could have gone (and did experiment) with calling Javascript, Jython or JRuby via JSR223, but I wasn’t happy with the level of integration afforded by this, opting instead to make command line calls to Python (using SQLite as a data carrier).

Then, I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than is now, liked it then but couldn’t see a use for it at the time and promptly forgot about it.  Then it appeared wrapped in a Talend component, prompting me to do a quick visit to the Groovy website, which turned into a deep-dive into the language; I’d found my scripting tool!

Groovy (by the way what a terrible name for a language, or is that just me?), is not really a stand-alone language but more an extension to Java itself; offering the full power of Java but with addition of closures, builders and dynamic types.  In fact, over time Groovy has become more and more Java like (the biggest missing being lack of support for anonymous inner-classes).

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between both languages; Groovy  compiles down to JVM bytecode and can use Java classes and objects, Java can likewise use Groovy generated bytecode.  This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.

Here for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube …


[sourcecode language='Java']

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;
connection = ConnectionFactory.getInstance().newConnection("localhost","7777","admin","admin")
database = connection.getDatabaseByName("Demo");
;cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements,columnElements,]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)
// fetch data set
datas=cube.getDataArray(dataSet as Element[][])
connection.disconnect();
// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data=[]
heading=[]
// first row set to the row names (i.e. "Product name" followed by the country names )
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
   }
data << heading
// Now  out each line
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
         row << datas[((i + (j*columncount)))]
         }
    data << row.flatten()
   }

//output to csv file
def csvOut= new FileOutputStream('c:/data/File.csv' )
for (lines in data) {
         lines.eachWithIndex{col,i ->
                             if (i > 0) {
                                 csvOut << ","
                             }
         csvOut << col
     }
     csvOut << "\n"
}
csvOut.close()
[/sourcecode]

This was done in the Groovy console as a proof of concept, it was then transferred to a tGroovy component where it was parametrised and instead of outputting to a CSV file, it was used to fill the globalBuffer structure (the structure used by tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set-up a Talend job as an Excel accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and the JPalo API, both SQLite UDFs and Jetty were new to me, and that’s were scripting proves it worth, giving the developer the maximum support with the minimum of background noise.  But it’s not just weird and wonderful new APIs that scripting helps expose but as a datasmithing tool, languages such as Groovy give analysts the ability to quickly de-construct and model datasets (for example, see Groovy’s SQL database support and collections’ functionality).

As a infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff“.

Palo ETL Server – Not for me …

Jedox have just released V1.0 of their Palo-centric ETL Server. I had been looking forward to this, not so much for its ETL ability (which is somewhat limited when compared to the likes of Pentaho PDI or Talend) but for the drill-through capability it would add to Palo. Alas, there’s a catch, you must purchase Palo Supervision Server (€8,000) to enable the Excel add-in to avail of this feature!

The thing that attracted me to Palo in the first place was its simplicity of approach and the primacy of Excel as the end-user view of the product, a modern day ESSBase. The fact that the Excel Add-in is closed source always worried me, as I felt that it would inhibit the thing that really sets open source apart (no not the cost) ,the formation of an active and innovative developer community. The sort of developers who have a need for, and an interest in, MOLAP tools tend to be more familiar with VBA, .NET, SQL, SAP Config etc. than with C/C++ or even Java development. The one area where such a community could add value, the Excel front-end, is closed to them. And I know there’s some non-Jedox involvement in the form of JPalo and the OO-Calc Add-in, but Excel is the key to Palo’s wide-spread adoption.

Also, the choice of .NET as the main client-side development platform was a mistake in my opinion, a VBA accessible object model would have been much more useful, it would also have removed the need for the current painful installation process.

This partial “open-source” model and the increasing complexity of the platform makes Palo, at least for me, a less attractive Micro-BI option. And remember, Excel already has a very powerful in-memory OLAP tool, the humble PivotTable, which is in most cases “good enough” for most analytical needs. So why use Palo rather than a PivotTable?

Palo’s advantages:

  • Can handle very large data sets (limited by the free memory available to the server).
  • Allows write-back and splash-down, both very useful for planning/budgeting applications.
  • Allows for ragged-hierarchies,
  • Server-side MOLAP rules e.g. [Budget],[2008] = [Actual],[2007]*0.035

Excel’s PivotTable advantages:

  • Pure Excel. with the object model available for VBA scripting.
  • Drill-through as standard.
  • Excel 2007 can now handle a 1,000,000 rows, for earlier versions use an Access/SQLite local database or an enterprise database to first “group” and summarise the data to be pivoted.
  • Can be used against SQL Server Analytical Services (SSAS) Cubes (and those of other providers such as Pentaho’s Mondrian).
  • Much easier to set-up and use.

Update: May 1st 13:30

Beta Version of Palo 2.5 (which you’ll need to use the Palo ETL Server’s drill-through functionality) has just been released ..

SQLite – the ultimate data-smithing tool!

SQLite logo as of 2007-12-15Image via Wikipedia

Although my data-smithing tool box is full to the brim with powerful tools such as Talend, Kettle PDI, Picalo and Excel, all backed by the cloud infrastructure of Amazon’s S3, SImpleDB and EC2, there’s one simple yet powerful tool that I always seem to gravitate back to, that tool is SQLite.

Now obviously being a hewer of data, I need a SQL compliant database for data manipulation and SQLite performs that task with speed and ease. But it’s not just in the hewing, it’s in the hauling of data where SQLite also shines.

I use SQLite as the container for passing tabular datasets between (and within) my various tools, that data doesn’t even need to be clean (due to SQlite’s liberal manifest typing rules) just so long as it can be expressed as a table.

For example; a Talend job could store an extracted dataset in a SQLite file, pass that file on to a Python script for some special processing (for example extracting further data from a source not directly supported by Talend such as SAP or SimpleDB), and then pass the resulting SQLite database on to Excel or a similar tool to allow a business user to view and perhaps modify the data; finally Talend picking up the file again to load it into a corporate data warehouse.

Now you could use flat files to transport the data or store the intermediate results in a corporate database, but SQLite is as easy, if not easier than, flat files and offers the SQL processing capabilities of big-iron databases, but without the hassle of getting write access to an existing server or setting one up from scratch.

And I know there are other similar file based database offerings such as MS Access and the Java only HSQLDB, but neither match SQLite’s ubiquitousness, sheer simplicity and powerful data processing ability.

Python the new VBA ?

These last two weeks, Python has been on my mind. First off, last week I decided to make time to fully investigate Picalo, an open-source Python-based data analysis tool, and then, this week, Google announced their long awaited cloud-computing offering, Google Apps Engine, with the language at its core.

Python was the first of the “LAMP generation” scripting languages that I decided to learn in any detail ( I had used Perl before that but only on a per-task basis (similar to how I’d used AWK)). I then invested time in learning PHP, then Ruby and finally JavaScript. And here I am, back where I started, with Python.

But it’s not the same Python I learned three years ago, not that it has changed that much, but my appreciation of the language has, largely due to my deep dives into other languages. For example, JavaScript’s treatment of functions as first-class objects, highlighted the same functionality in Python, something I’d missed (or rather, not fully understood) the first time I encountered the language. Likewise, Ruby’s RoR introduced me to a “best of breed” approach to web application design, something that can be used as a comparison aid when approaching new web frameworks such as Django.

But of course the scripting language that continues to power most of my datasmithing activities is Excel VBA. That’s why I was so excited to see a tool such as Proto utilise VBA as its scripting language. But, Microsoft has abandoned VBA, there will be no more Protos.

Also, Excel VBA is now a Windows only language. Windows, however, is no longer the ‘only’ business client OS (see how many Apple laptops you can spot the next time you’re in a business-class airport lounge, a few years ago it would have been zero, not any more), and is currently nowhere to be seen as a cloud computing platform (but that’ll change).

I’m at heart a table-oriented programmer, and I, like Picalo’s author Conan Albrecht, believe “data analysis is best done through scripting”; but not just data analysis, the T in ETL (Extract, Transform and Load) and the I in DI (Data Integration) and SI (Systems Interfacing) also benefit from a scripting approach.

So, what to adopt as a successor/companion-in-her-old-age to VBA, will it be Ruby, JavaScript, Python, Perl, even PHP?

It looks like it’ll be Python because it’s …

The runner up is of course Ruby, but its poor integration with Windows is a major problem and the datasmithing “prior art” of Picalo and Resolver makes Python hard to beat.

UPDATE Jan 2010:

To experience the best of both worlds, VBA & Python, my xLite (Excel combined with SQLite) datasmithing platform now allows Python to be used in conjunction with VBA.  Check it out here http://www.gobansaor.com/xlite

UPDATE:

Also, as Dan pointed out in the comments below, I’d not included Jython in my list of reasons for embracing Python. I must add it to my list of things to try out particularly as both my “classic” ETL tools, Talend and Kettle are JVM based.

Another thing to add to the (ever growing) list is Mike Pitarro’s SnapLogic python-based ETL tool. They have …

…just released a 2.0 Beta version with some major architectural enhancements. The SnapLogic model is very different from traditional ETL systems. It takes an approach that’s more like the web, based on loose coupling and HTTP interactions. We model data source, sinks, and transformations as URI addressable endpoints, and have a model where than can be chained together in pipelines to build transformation logic. We use a plugin architecture to make it easy to add custom components.

xlAWS – 100,000 downloads?

Not sure, but this morning I received my monthly AWS bill, and it was double its usual amount! When I investigated the extra cost it was due to 133GBs of downloads from my www2.gobansaor.com bucket. This is the S3 bucket in which I store the xlAWS zip file, xlAWS being a “library-of-sorts” of VBA/VB6 helper code for accessing Amazon S3 and SimpleDB.

It’s linked to from this page on my blog (which has had 200 or so hits this month) and from this AWS Community Code page. The excessive hits on the bucket started on the 28th of Feb , the day the xlAWS code was published on Amazon and continued through most of March. Talking the size of the zip file, 133GB represents approximately 100,000 downloads. I don’t have server logging enabled on the bucket, so I can’t be sure how much is due to the other public files in the bucket (all belonging to the VBA/Proto SQLite xLite project), but as that project has been available for months and is accessible only through my website (who’s stats show a consistent 5-10 downloads per week) I’m guessing the downloads are for xlAWS.

Who would have though that there would be such interest in VBA/VB6 code for accessing AWS services! I wonder was it the Excel VBA side of the house or the dispossessed (and p*ssed off) VB6 developer hoards who downloaded it the most? Leave a comment if you downloaded and used the library, I’d love to know.