Category Archives: Palo

Star Schemas – to boldly go where no Excel spreadsheet has gone before

One of the many things that delights me about PowerPivot is the central role played by the Star Schema. Those of you reading with a data-warehousing background would shrug your shoulders and say: “So what, what else would you expect to find at the core of a BI tool?”.

Those from an Excel PivotTable background would ask: “What’s a Star Schema, why do we need one,what’s wrong with a the good old-fashioned single flattened table?”.

Those from a classic MOLAP background (Essbase, TM1, Palo) might also ask: “Why do we need this extra layer? Load the cube directly from the operational data model and get on with it!”.

A quick Q&A is perhaps the best way for me to explain why star schema design is a powerful skill in a datasmith’s toolset.

First off, what’s a Star Schema?

A Star Schema (also know as the dimensional model) is a denormalised (flattened) data model used to simplify an operational (OLTP) data model to better accommodate reporting and what-if analysis.

At its simplest, it consists of a central fact table with links back to a “surrounding” set of dimensional tables, hence the star name. A variation is the snow-flake schema, where the dimensional tables are not fully denormalised (e.g. Product Category->Product->Fact instead of  Product->Fact).

The role of the fact table (besides being the table that hosts most of the measure fields) is to create linkages between dimensions (such as Customer, Product, Date) usually based on an actual transactional event (e.g. Invoice) or a proposed event (such as a Budget or Forecasted Sale). In effect, simplifying  the often complex work-flow-driving connections of a typical operational system by using a single many-to-many relationship (modern ERP/CRM systems’ data models consist of scores of configurable many-to-many relationships).

Many wrongly believe the star-schema was adopted for performance reasons and now that in-memory OLAP is becoming the norm it’s no longer necessary to use dimensional modelling techniques. In fact, in the early days of data-warehousing, RDBMs had great difficulty efficiently handling star-queries (and some such as MySQL and SQLite, still do).

The original primary purpose of the star schema was to simplify the SQL required to access reporting data; to make the model more approachable to non-technical users. Of course, even simple SQL was beyond the knowledge or interest of most end-users but a sizeable proportion were happy to do so (often helped by SQL “generators” such as MicroStratery or Business Objects). But even in situations where SQL-wielding civilians were not to be found, the simplicity of the dimensional models  proved to be a valuable aid when establishing and developing the warehouse data requirements. PowerPivot requires no SQL knowledge to manipulate the dimensional model which brings the original concept full-circle but this time opening its possibilities to a much wider audience.

But surely, concentrating on the actual reports would be a more valuable requirements gathering exercise?

A so called “bottom-up approach” is often the best way to approach a reporting request particularly if the reports are simple one-off “traditional” reports. But for self-service BI, this needs to be combined with a top-down dimensional design. The idea is not to build out each and every report or indeed cube but to build a structure that’ll support likely queries. The process of building a star schema provides both a logical model and a physical implementation of that model against which potential queries can be tested. I’ve worked on several POCs destined for implementation in Essbase where the star-schema was built and potential cubes mocked up using Excel PivotTables that subsequently never went any further (except for the star-schema ETL process). The end-users derived sufficient value from the denormalised star-schema pivoted and reported in Excel.

In traditional ROLAP data-warehouses where the cubes were built directly against star-schemas, the pure logical approach to the data model often had to take a back-seat to the necessity of fine-tuning it to make response times (be that ETL or user-pivoting) acceptable. This is why I much preferred situations where the star acted as a logical model from which MOLAP cubes were built.

With PowerPivot, ROLAP has a new champion. The column-oriented high-compression in-memory architecture means that the compromises of the past are no longer necessary. The fact table reverts back to it primary role as a many-to-many connector. In a pure hypercube, measures are just another dimension (the approach that Palo takes), this is also now true of PowerPivot models; measures can be sourced from dimension tables and dimensions from fact tables as it logically should be, but without the performance hit of old.

But what’s the advantage of a star schema over a flattened table when using PowerPivot?

It is true that the same flattened table model as used to backend a PivotTable can be used within PowerPivot. But doing so would limit the potential of the DAX language to construct measures such as average sales spread over potential customers (rather than actual customers that would typically be represented on a flattened table). Also, by creating “conformed dimensions” (single cross-business views of Customer, Product etc.) and using such tables as dimensional sources for multiple fact tables, “virtual cubes” that combine values from multiple fact tables can be built.

If you’re new to dimensional modelling I’d recommend the books & articles of Ralph Kimbal as good starting point. You do have to be aware that some of the advice regarding efficiency trade-offs, surrogate keys etc. do not  apply in a PowerPivot scenario (even though other performance issues still apply) but the logical design tips still apply.

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.

TAG Cubes – SQLite Star Query Part III

It’s no secret that I’m a huge fan of SQLite and Excel, particularly when used in combination. I also greatly admire the open source BI engines, Palo and Mondrian. Mondrian appeals because of its “ROLAP with a cache” architecture and its implementation of MS’s excellent MDX language. When I say MDX is excellent I’m talking with my professional programmer’s hat on, as an end-user tool it’s a non-runner. This is where Palo comes in, building on the hypercube concepts pioneered by the likes of  TM1 and ESSbase, it presents a designview that’s approachable by a vastly greater percentage of “civilians” than is the case with ROLAP-based solutions.

The trick behind TM1, Essbase, Palo etc. is the extension of the spreadsheet metaphor from two to multiple dimensions, while still binding the interface closely to the familiar spreadsheet (which for most of the business world is still Excel).

So where does SQLite come in all this?

At first glance, SQLite lacks the sophisticated join functionality to support star-queries, but of course, if the dataset is small then a full-table scan of a fact table, or better still, loading the fact table into memory negates any such short-comings.

In fact, all traditional ROLAP engines have problems with dimensional models, particularly when you reach the point of using summary tables or query re-writes, that’s why the emerging SQL-speaking columnar-databases are such a godsend for ROLAP data warehouses.

It was SQLite combined with Excel acting as a data prep platform that was originally my main interest, so for pivoting, Excel’s own pivot table would have to do.  Nevertheless, I felt the tool was incomplete without the ability to directly pivot the underlying SQLite database.

Why not use Palo or Mondrian as a pivot tool? Well yes, where a fixed permanent “solution” is required then the extra moving parts of either approach would be justified and indeed necessary but that is to miss the essence of what I call datasmithing.

Datasmithing is not data warehousing nor is it the provision of solutions (which, for example, Palo superbly enables in multi-user budgeting situations). Datasmithing, as a skill, is of course part of the process of both, but it’s on the edges, at perhaps the planning or consumption stages.

Datasmiths deal in the unknown, in change, in disaster recovery, in systems’ commissioning, in the never-ending barely-repeatable processes thrown up by daily business life.  For that, the toolset required must be as simple as possible (but no simpler), self-contained, document-oriented, secureable (is that a word?) and easily archived and retrieved. Excel and file-based DBMSs such as MSAccess or SQLite fit the bill nicely, server-based technologies such as DBA controlled database servers or IT installed “solutions”, less so.

Jedox has made Palo relatively easy to install (and likewise, Canada’s SQLPower has made Mondrian setup a painless exercise via their excellent Wabit reporting tool), but, the zero-install, email friendly document approach that spreadsheets are famous (and infamous) for, is preferable in many situations. This is something that Microsoft have recognised in their Gemini add-in for Excel 2010, but Excel 2010 is a not here yet and it’s likely to be five years or more before it’s as common as Excel 2003 is today.

The inclusion of FTS full-text searching with SQLite triggered an ah-ah moment with regards to pivot-enabling SQLite.

The usual method that hypercube-like excel-friendly OLAP tools use to return data is via a UDF like so…

=DATA(“CubeName”,”value1″,value2″,…)

…where valueN represents dimensional elements, so…

=DATA(“SalesCube”,”Beer”,”Profit”,”Jan 09″,”Actual”)

…is the Actual Profit for Beer sales in Jan 09. The dimensional elements act as “tags” to locate a particular value, there is of course much more to tools like Palo; hierarchies, intra-cube rules etc. but in essence most OLAP tools are like www.delicious.com for number crunchers. This method of retrieving data fits well with how people use Excel and not just for pivots, but for embedding OLAP aggregated cells in lists.  For example, a CRM scenario; a Sales Rep makes a list of her ‘best’ (subjective) customers, but needs hard (objective) stats, to be placed alongside the list to convince the boss or to track actuals against expectation.

Dimensional elements as tags; FTS3 virtual tables as fact table indexes; the concept of a TAG Cube was born.

In the above example “Profit” would most likely be described as a measure (Palo, a near pure hypercube does not distinguish between Measure and other Dimensional coordinates). Dimensions, measures and attributes are in reality interchangeable (a Customer ID can act as a dimension or an attribute, but by applying a  Count Distinct to it, it’s a measure) but most OLAP solutions treat “Measure Dimensions” as different, and so do TAG Cubes.

By using the default fact table structure (a single-columned table) and querying using the default measure (which translates to the SUM() of that single value) a ‘pure’ approach can be used. But, ROLAP is tightly bound to the concept of a fact table, and since SQLite is relational, TAG Cubes offer the ability to use a wide fact table approach and I think gains considerably in flexibility by so going.

The above example of using Count Distinct, or the simple creation of calculated measures are examples of this flexibility. Another, is a measure based on SQLite’s concat_group aggregate function to provide a drill-down facility, e.g.

=DATA(“SalesCUBE”,”ROWIDList,”Beer”,”Jan 09″,”Actual”)

…where “ROWIDList” would be setup as concat_group(rowid,’,') and will return a comma separated list of the underlying fact table ROWIDs.

A major reason for rolling my own pivot engine was to add a concept of “namespaces” and to separate the implementation of these namespaces from the actual pivot.  When a tag (or a predefined hierarchy of tags) is assigned to a cube, it’s also assigned to a namespace, in many cases namespace and cube would be synonymous, but in some cases a more sophisticated approach is required:

  • Multiple cubes sharing the same set of conformed dimensions would be best served by such cubes sharing a common namespace, and so they can.
  • Different consumers of the pivot may require the use of a different language, be that a spoken language or a different ‘business language’ e.g. Manufacturing Product Codes V Consumer Product Names. Again, easily done.
  • Sometimes identifying data can’t be shared with the datasmith or the numerical analyst working on a problem; in such cases being able to replace  the actual namespace with an obfuscated one can be very useful. Or, for added security, the namespace might only be issued to approved  PCs while the tag index and fact table are stored on a shared drive.  Needs some more work to make managing such scenarios secure and easy to use but the structure is there.

As hinted on above, the three elements of a Tag Cube, the namespace, tag index and fact table can be assigned to different databases (i.e. files). Due to the wonders of SQLite’s ATTACH statement and the backup API’s ability to quickly load/unload databases in/out of memory, it’s possible, for example, to load namespace and tag index (i.e. the ‘dimensions’) into a memory database, while a very large (i.e. too big to fit to memory) fact table remains on disk. Fast and cheap SSDs will add further configuration options.

Although most of the TAG Cube functionality is available only within Excel, I’ve built a C based SQLite Virtual Table (cFact) to allow the tag index to used outside xLite. This means that SQLite drivers for ODBC (for use as a Pivot Table source, for example) or JDBC (for use in  SQLPower Wabit perhaps) can efficiently access data models built using xLite.

I had to revert to using C rather than my preferred Python (did I mention that xLite now embeds Python in Excel, no, well it does, Python the newVBA ?), having failed to get around multi-threading issues with callbacks to Python in both the ODBC and JDBC drivers. I’d make a career promise to myself many years ago, not to having anything to do with printers or threads, and I think I’ll stick with it :)

TAG Cubes are the latest addition (still WIP actually) to be added to xLite, adding to:

  • VBA coded SQLite SQL functions.
  • Worksheet Functions; call out to a ‘function’ built using Excel formula, passing a parameter list and returning a value.
  • Workbook Functions; like Worksheet Functions, but loading a new Workbook, passing in parameters, passing back a value (or tables) and closing the Workbook when finished.
  • XLiteScript; xLite exposes its functionality via VBA coded UDFs, which can be called like any other formula, but data prep activities often require sequential procedural logic, xLiteScript is a table-oriented scripting mechanism offering basic flow-control logic.
  • pyScript; I embedded Python into xLite to take advantage of Python’s speed in developing Virtual Tables, SQL Functions and extensions to SQLite and to tap in the wonderful world of Python code. I’ve also added the ability to use Python from scripts defined within Excel (to indent, tab to the next cell!).
  • Fast load/unload to/from CSV.
  • Load from any ADO source.
  • Remove xLite formulae and rename and save Workbook, very handy when used via Workbook Functions to mass produce Excel “reports”.
  • Other WIP items are; load from SAP, load/unload to/from Amazon S3, use Palo cubes as TAG Cube “facts”, slot in/out Palo for TAG Cubes, auto-generate Mondrian XML based on TAG Cubes, write-back and splash, Python & VBA TAG Cube “rules”.

I’ve started the process of releasing the beta code here …

Why not join me on Twitter at gobansaor?

Palo HTTP API via Excel/VBA

As a result of a request on the Palo support forum last week looking for a VBA tool to directly access the Palo OLAP server via its native HTTP API, I realised I had such a tool. I had built it about a year ago (to use alongside Fiddler Web Debugger and .NET Reflector) to help me understand in detail the interaction between the Excel client and Palo. (Remember this was before Jedox released the Excel add-in as open source)

I removed the SQLite dependent parts (which allowed me to load meta-data into tables and analyse using SQL) and used a pure-VBA MD5 hash routine to reduce the number of moving parts. Also added a few VBA helper routines (including an array-formula UDF to allow direct calling of the API from Excel UI).

As I said, I originally built it as a learning aid but I’ve started to look at it again in the context of a an Excel/Sqlite Palo ETL add-on.  Having the flexibility of the native API and my own meta-data cache in SQLite might prove very useful.

The code may be found here.  Enjoy!

Why not join me on Twitter at gobansaor?

Palo BI Suite Community Edition

Jedox have finally published a roadmap for the Palo BI Suite Community Edition, having caused considerable confusion by pre-announcing its availability last April. See here for the details.  The headline dates are, a beta version due 1st of July with a Release Candidate due 1st September.

Although the announcement in April was essentially vapour-ware (no Worksheet Server V3, no Amazon EC2 images), one very significant actual deliverable was the addition to SourceForge of the Palo Excel Add-in sources (GPL licence).  This, at least for me, is very welcome as it now means that Palo is truly open source.  Prior to this, the server sources were GPL’d but not the main front-end tool used by the vast majority of end-users. In fact, the deep Excel integration offered by the Add-In is Palo’s main attraction to the business-focused “datasmiths” who make up the bulk of the product’s user-base.

The GPL’ing of the Add-in has removed the last barrier that had stopped me, as an independent consultant, from committing to the platform.  While the ‘freeness’ of open source is nice, it’s the source that really attracts me.  With access (and rights) to the source, I have no worries that the terms of use or indeed the product’s core functionality can be arbitrarily changed.  Having the source also means I can delve as deep or as shallow as I need to into the inners of the product, improving my understanding of the technology (both bugs and functionality) as needs dictate.

What has Palo BI Suite to offer, besides being open source?  Well, even if Jedox’s offering consisted of mediocre products, being open source as I explained above is in itself a huge advantage. Having an agnostic FOSS pivot engine that can be shared across many technologies, from Excel to Open Office to a PHP based website, is extremely useful.

However, Jedox’s BI suite is far from mediocre.  Palo is now a very polished and powerful in-memory MOLAP server with excellent integration with Excel (through the Add-In, or if you take out a support contract, via ODBO/MDX powered Pivot Tables).  The addition of a browser delivered spreadsheet (Worksheet Server V3) will add significantly to the product’s street appeal.  Version 3 differs significantly from previous WSS versions; being open source is one, but the entire product was also completely redesigned to meet the challenge posed by web-based products from the likes of Zoho, EditGrid and of course Google Docs (not to mention the ever-present threat of a MS response). Web-based spreadsheets are becoming a commodity, so Jedox’s response was to open source the product but at the same time make it more usable for real-world business analytics.

Current browser-delivered spreadsheets suffer from two shortcomings;

  • Spreadsheets with large numbers of inter-related cells (typical of business models ) tend to perform poorly, in many cases being unusable compared with Excel or Open Office.
  • Only available as hosted SaaS; not a major problem for some businesses, but for others, services outside the corporate firewall, especially for sensitive information such as what-if, budgeting and sales analysis models, are a no-no.

WWS V3 gets around both problems.  Performance is improved by the use of Palo as the spreadsheet’s pivot engine but also by the “lazy calculation” of related cells i.e. a cell that’s not visible, and itself not yet referenced by other visible cells, remains uncalculated, saving on the constant churning that can effect large models.  This combined with a DynaRange concept means templates and models react dynamically and efficiently to the ever changing datasets being presented to the sheets from the Palo OLAP server.   The look’n’feel is very similar to Excel with even array-formulae being fully supported.

The second problem of only-behind-the-firewall access is solved by the open source GPL licence and by the front-end being coded in PHP (very approachable to most in-house support staff and even the odd accountant).  The core (the bit not yet released) is, as far as I know, C++, so is likely to join Palo Server as being highly efficient and well engineered but perhaps beyond the technical skills of most.

The other elements to the BI Suite are the web-based OLAP-centric ETL Server (now, I see, with Groovy and Javascript scripting support) and the Supervision Server (only in paid Enterprise version) which offers fine-tuned access control and monitoring, plus drill-through functionality from Palo cells, back to the ETL fact tables. The Enterprise Edition also offers a multi-core version of the Palo server along with SAP and ODBO/MDX connectivity.

If multi-dimensional analysis and budgeting could benefit your business and spreadsheets are your preferred method of communicating and working with such analysis, you need to check this out.  Palo is a well kept secret (at least outside of Germany), hardly ever mentioned by the mainstream BI community, but don’t let that put you off; this is one of the best solutions out there, it’s open source but also comes with the backup of a professional company that can offer not just technical support but also implementation know-how (Jedox eats its own dog-food, being both a BI consultancy and development house).

Update July 4th 2009:

Beta Community Edition is now available.   I downloaded and installed WWS V3 and gave it a quick test-drive; looks good, Palo interface has the look’n'feel of the Excel Add-in and the general spreadsheet functionality is very Excel-like, incluing CTRL-Shift-Enter to assign array formulae.  Overall, the Palo BI suite offers a intuitive end-user-friendly interface; from download to effective use in less than 60 minutes, how many BI tools could you say that about?

Also, in two weeks time a pivot-table friendly ODBO driver will be included for free with the Palo Excel Add-in (previously only available to those with a Jedox support contract).

Why not join me on Twitter at gobansaor?

SQLite as the MP3 of data

… and Excel as its “mixing desk”.

When I tell people that I use SQLite in combination with Excel (via xLite) as my datasmithing platform, many ask why SQLite? (Many others ask why Excel?  but “sin scéal eile”, that’s another discussion – Excel as the iPod of Downloaded Data.) Those that question my use of SQLite tend to cluster into four camps:

  • Pure Excel jocks.
  • MS Access fans.
  • The client server database brigade (SQL Server,Oracle; or if FOSS fans; MySQL, PostrgeSQL).
  • The MOLAP folks (Essbase, TM1, Palo).

Now while I have used and will continue to use/encounter all four ‘approaches’, I’ve come to believe over the last couple of years that SQLite brings something special to the datasmithing game. When I look back over nearly 30 years in the data handling business I keep thinking – “If only I had SQLite then, how much easier/quicker/cheaper that task would have been!”.

Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.

As operational local caches, this use of SQLite is already far advanced. SQLite is embedded in lots of every day software tools, everything from McAfee anti-virus to TweetDeck Twitter clients (best one IMHO). But my interest is more in SQLite’s potential as a micro-BI (or maybe more correctly a distributed-BI) platform. A sort of MP3 format for distributed structured data, if you like.

But why SQLite (and in particular SQLite in combination with Excel) as my datasmithing tool rather than the four other approaches?  First, what’s a datasmith?

Managing and manipulating datasets has become an integral part of many people’s job, not just accountants (the original of the species) but marketing executives, sales staff, pricing analysts, process engineers; different job titles, different roles but using a skill that they’ve likely never been formally trained in, a skill without a name; a skill I call datasmithing. I like to think of  myself as a master datasmith, or a datamith’s datasmith.

If you consider yourself a datasmith then most likely the tool you use to manage your datasets is Excel. And before you apologise, don’t. Excel is by far the best and most flexible end-user data manipulation tool out there. Everything from the current financial crisis downwards has at some stage being blamed on Excel, but you know and I know that many tasks would remain undone or under-done were it not for end-user generated spreadsheets.

Spreadsheets are not however optimal for some tasks, linked spreadsheets in particular are data disasters in waiting. While fantastic for data transformations and presentations, as books-of-record they’re rarely suitable. Other tools such as SQL based relational databases and in-memory OLAP offer much better and potentially much more cost-effective data modelling functionality, but also at a cost of extra complexity and ongoing technical support.

MS Access (which like SQLite, is a document-centric, non-client-server database; but unlike it, is also a forms/reporting development environment) would appear to be the natural local store database. My problem with MS Access has been its tendency to try to be all things to all men, ending up not fully satisfying anybody. Professional developers think it’s too limiting, non-techs find it too intimidating, even reporting, where it once showed promise left a big enough opening for Crystal Reports to evolve. It is also limited to Windows which might not seem to be a problem if combining with Excel, but, as it’s often necessary, due to scale or complexity of the data,  to use ‘proper’ ETL tools such as Talend, having an OS agnostic database format than can act as a distribution media (think MP3s again) between “mixing desks” can be very useful.

The big difference to MS Access for me is SQLite’s open source code; code that’s a pleasure to browse and with an approachable API that even I, with my very rusty C skills, can manipulate. Having access to that code allows me to tightly integrate it with Excel, so much so, that I can use Excel functions (built-in functions, VBA user-defined functions and 3rd party add-in functions) directly from SQLite’s SQL; and vice-versa, access SQL functionality via Excel “formula” calls. It is  also possible to  load most datasets into memory using SQLite’s in-memory mode enabling very fast processing  and near zero-latency when passing data to and from Excel/VBA. In the near future, cheap, large SSDs will enable non-memory databases to offer similar speed but also handle extremely large datasets (see this for a glimpse of that future).

What about the big beasts of the data world, the client-server databases? Having spent most of my professional life working with such tools I’m aware of the power of a well designed relational database. If SQLite is the MP3, then these are the master tapes, the DDD recordings. Most of the data that eventual ends up in SQLite for analysis and/or transformation will have originated in data-warehouses or be directly sourced  from OLTP systems built using relational technology. But for close-up analysis and transformation, the pure simplicity and convenience of SQLite is hard to beat. That simplicity is primarily due to its Excel-like ‘document’ nature, all code and data can be housed in a single folder (or true-crypt container for added security), ensuring that the ‘problem domain’ can be easily archived and/or shared with others without the need for professional IT resources.

And yes, I hear you, isn’t that the basis of Excel-hell? Yes it is, but over the years I’ve found that this is rarely a problem for datasmiths, they deal day-in day-out with document work-flows, they understand the risks and the benefits (mainly the simplicity) of the approach. Where the nightmare truly happens is when this approach is used as an alternative to an OLTP system i.e. using Excel and other document-like datastores as books-of-record in large multi-user environments – “there be monsters for sure”.

How about MOLAP? Wasn’t Essbase’s name derived from “extended spreadsheet database” and doesn’t Palo offer a truly excel-friendly multi-user database back-end? Again, having worked with Essbase for many years and now being a big fan of the open source Palo MOLAP tool, I fully appreciate the power that such tools brings to analysis and multi-user planning tasks. But for many situations, an Excel Pivot Table is “good enough” and even when it’s not, it is possible by utilising what I call a tOLAP cube (essentially, a fact table indexed via tags enabled by Google’s great addition to SQLite, the FTS3 virtual table) to build and access  powerful, yet simple, cube-like data structures.

By integrating SQLite with Excel, datasmiths can have the best of both worlds, familiar spreadsheet front-end combined with a fast and powerful SQL engine and datastore, in fact, everything that MS Access should have been.

Why not join me on Twitter at gobansaor?

Spending time on Excel-SQLite, C, VBA Callbacks & Twitter

Haven’t posted here in a while as my spare time has been soaked up programing, well actually refactoring would be more exact.  My xLite “SQLite empowered Excel” codebase has grown over the years and required a serious makeover to get rid of stuff I no longer use and to generally make it more robust.  I also decided to add some extra functionality to my VBA friendly C wrapper for SQLite (based on Pivotal Solutions’ pssqlite.dll) which meant I had to re-acquaint myself with my long lost C skills, so doing reminded me how much I like C. Close to the metal programing if not exactly super-productive is nevertheless super-powerful.

The new improved xLiteSQLite.dll now has a built-in CSV loader (both file based and string based – handy for loading Palo HTTP API responses into a table). It also returns a one columned variant array of CSV values for quick rendering via “text-to-columns” code (by far the quickest way of handling large dataset pasting into Excel).

I’ve also added the ability to create SQlite UDFs (user defined functions) in VBA (thanks to http://stackoverflow.com/users/4007/rpetrich).  This is a very powerful feature as it allows SQLite selects to act as a “loop controller” calling back to  Excel/VBA functions to process each row, really useful for ETL tasks. And not just scalar UDFs but aggregating (aka group-by) functions too, allowing the use of Excel’s powerful array functions in SQLite statements.

All in all, the changes to the xLite VBA code and the C wrapper makes Excel backed by SQLite a seriously good micro-ETL tool. Combined with Palo, the result in a truly wonderful micro-BI platform; a cost-effective toolset for these recessionary times.

Of course I’d be lying if I said code was the only reason I’ve been neglecting my blogging duties, I’m afraid I’ve a confession to make, Twitter has hooked yet another sucker, me! 

I’ve found I’ve settled in to the whole micro-blogging thing with ease, and have managed to make contact with people I would not have encountered otherwise, as well as reconnecting with others that I’d lost contact with.  So if you too are all-a-twitter then do please follow gobansaor-on-twitter.

I’ll give up Excel Pivot Tables when you take ‘em from my cold, dead hands

Jedox, the company behind the open source MOLAP server Palo, has just announced an MDX driver. This means that it’s now possible to access Palo cubes using Excel Pivot Tables or indeed any tool that supports ODBO.  This is excellent news, as MOLAP to most Excel users IS a Pivot Table, and somewhat like the NRA, the NPTA’s (National Pivot Table Association’s) motto is “I’ll give up Excel Pivot Tables when you take ‘em from my cold, dead hands”.

MDX/XMLA is now a de facto standard for OLAP servers, supported not just by MS SQLServer but by SAP BW, Hyperion/ESSBase and by Pentaho’s Mondrian. The new driver is not open source, nor is it for sale but instead comes free to those with Jedox support contracts. I’m sure lots of organisations will be more than willing to enter a support contract (starting at €3000 per server) to get their hands on this; think of the savings in training alone!

UPDATE: 2nd July 2009

Kristian Raue has announced on his blog that the ODBO/MDX driver will now come free with latest Palo BI Suite (both community and enterprise versions). Excellent news!


Why Larry hates the cloud, and my data trinity.

Last week Oracle certified Amazon EC2 as a supported platform, that same week Larry Elison attacked the concept of cloud computing as pure hype. Obviously, Larry is not happy with this whole cloud thing, and I think it’s not just the threat it poses to the software industry’s traditional licensing model that worries him, rather, as Robert X. Cringely points out in his “Cloud computing will change the way we look at databases” post, it’s the likelihood that it sounds the death-knell for large-scale traditional databases.

This new database paradigm is memory rather than disk centric, with the disk-based element acting as an archive/backup/restore mechanism which can easily be stored on commodity SAN devices ( e.g. Amazon’s ESB). Using MapReduce technology Google effectively holds the whole Internet in memory, not in one big super computer but in lots of cheap commodity servers.

But it’s not just in the realm of mega datasets that RAM based databases threaten traditional models. Excel is a memory-based database engine, so too in-memory OLAP tools such as Palo. Such products’ ability to handle large volumes of data has increased over the years, with the decrease in RAM costs and the appearance of cheap 64 bit machines (which are no longer limited to 2G/3G process working sets).

That doesn’t mean that we’ll throw away SQL databases in their entirety, SQL and the relational model will continue to be useful. But perhaps of greater use in local datastores/caches that as the building blocks for large scale datastores. For such local caches, less will be more; fewer features, easier to configure, more flexibility. That’s why I like SQLite; long after the dinosaurs of the database world have disappeared, I imagine SQLite databases will continue to survive, embedded in mobile phones, browsers, wherever a local datastore is required. And more than likely operating in memory rather than off disk.

By combining Excel with an in-memory SQLite database, linked to a Palo OLAP in-memory server, it’s possible to take advantage of three powerful data-processing technologies (spreadsheets, SQL, multi-dimensional cubes) all within your PC’s RAM. You could do serious datasmithing with such a combination on a pretty mediocre laptop, with most modern machines providing an excess of CPU power, no need for super fast disks, just as much memory as you can muster. And, with Windows on EC2, these three amigos will soon be capable of being used as a cloud bursting platform.

Excel, SQLite and Palo, my data trinity.

Cloudy skies, cloudy apps…

Just back from a break in Clifden, Connemara, summer is nearly over, the kids return to school today, back to work.

Aasleagh Falls, Co. Mayo

Aasleagh Falls, Co. Mayo

Counties Galway and Mayo were like the rest of the country last week, a tad wet, but unlike the developed east of the island, flooding was not a problem; a problematic drainage area is called a lake in the west.

This August has been the wettest and dullest I’ve ever experienced but at least I saw some sunshine earlier in the month thanks to Kristian Raue CEO of Jedox who kindly invited me to visit the company’s offices in Freiburg, Germany.  Freiburg is very green in both senses of the word, surrounded as it is by the Black Forest and its well deserved “eco-city” status.  Its also know as the warmest city in Germany, a reputation it thankfully lived up for this visitor from a rain-soaked Atlantic isle.

August morning, Frieburg Im Breisgau

August morning, Freiburg im Breisgau

If Freburg left a positive impression on my mind, so too did Jedox.  The overall impression is of a company which intends to use a combination of quality, vision and the judicious use of open-source to build the Jedox brand into one associated with best-of-breed products and consultancy.  This vision can be seen in the evolution of Palo, from its “good enough” beginnings to its current near-best-of-breed 2.5 version, and from talking to some of those working on the product, best-of-breed status is not that far off.

Likewise, ETL-Server which is currently a Palo only “loader”, is to be further  developed into a true ETL tool, while continuing to offer MOLAP-centric specialisms.

I also got a glimpse of the next version of Worksheet Server. “Wow!”, is all I can say.

Existing web based spreadsheet products are fine for simple data analysis or basic data capture purposes but cannot compete with their client-based elder cousins when serious datasmithing is required.  Well, from the demo I saw of Worksheet Server in action, that’s about to change.  The look and, more importantly, the feel is similar to that of traditional spreadsheets, its interface with Palo is identical to that of the existing Excel add-in, and here’s the big one, its open source!  Game-changing or what?

But …

That might enable me to move a lot of my spreadsheet applications to the cloud, but what about those applications that are more suited to an MS Access type solution?

Then try out WaveMaker. It’s open source and built on industry standards, Hibernate,Spring and the Javascript Dojo framework but has the ease of GUI database development more usually associated with MS tools. The resulting applications are packaged as a WAR file which can be hosted by any standards based Java server (e.g. Tomcat or Jetty).  The latest version makes developing Ajax-fronted database applications even easier with the addition of layout templates.  Its existing ability to automatically bind interfaces to SOAP web services has been extended to REST web services by means of a new WSDL auto-discover tool.  And Chris Keene CEO of WaveMaker also informs me that …

We are also releasing a cloud-based IDE in October with Amazon – stay tuned…

We launched in February and will be announcing our first 7 figure deal this month. We run on Mac, Linux and Windows and are currently the #1 developer download on Apple.com (http://www.apple.com/downloads/macosx/development_tools/)

Our goal is to make it easy to build rich internet applications without complex coding – kind of a MS Access for the Web.

Jedox and Wavemaker the new breed of open-source businesses

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of  Oracle10g  Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL) ,OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem, Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these new shiny (and open source) “toys” I’ve found that rather than just been incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy, is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool, via JDBC is can access every database type on the planet, it has built-in web-service capability and access to a  multitude of APIs via its Java component for non-database data sources.  The addition of  Groovy makes the use of such Java APIs simpler and quicker and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to plublish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name,customer, sales_region, getpalodata(“SALES”,customer_id,”All Products”.”Total Sales”,”Euros”,”YTD”)  as customer_YTD, getpalodata(“SALES”,sales_region,”All Products”.”Total Sales”,”Euros”,”YTD”)  as region_total_YTD from list_of_top_customers;

… where getapalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, ( as SQLite lacks star-query (or even multi-index query) support.)  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or for ad-hoc queries. This is powerful stuff, simple, free, powerful stuff.

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with a SSH VPN (or Hamachi) to provide security).
  • At start-up, do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc. Storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then setup a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations which would return a dataset directly to the calling client or maybe just acknowledge the request, queue it up for processing later, sending the resulting dataset by EMail or RSS feed when finished.
  • At a fixed time the service would shut it self down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point, build what you need, add the levels of security (or none at all) that fits your situation, all within a open framework, with zero lock-in (okay, still using Excel, anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server, host it on an EC2 instance, (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).

Now tell me that doesn’t make sense?

OLAP Cube as a Mind Map

If you’ve worked with OLAP technologies for any length of time you’ll undoubtedly have been in the situation where you’ve had to explain the concept of an OLAP Cube to a “newbie”.  If the person in question has come across Excel pivot-tables, then you can probably short-circuit the conversation some what, explaining that a pivot table is in essence an OLAP cube, maybe highlighting the differences between it and whatever OLAP tool you’re proposing; ragged hierarchies, ability to update cells and ‘spread’ values down hierarchies etc.  Even better if you can show the user an working cube populated with hierarchies and elements that match the user’s business.

But what if the person is a complete novice to the world of analytics and you don’t have a relevant demo cube that you can demonstrate, what then?  I guess, you could start by first trying to fry the victim’s user’s brain by explaining the concepts behind multi-dimensional spaces and/or a quick intro into the world of de-normalised databases, with examples of snow and star schemas. The glazed look in the eyes of your audience may however suggest this doesn’t always work.  And it’s not just business-users who have problems getting their heads around cubes, many programmers also have difficulty the first time they’re exposed to OLAP concepts.

I’ve found that Mind Maps offer a good way to help both me and the client to visualise the domain model that a cube will eventually address; having worked through the mind mapping process it’s then easier to take the user with you as you translate this model to a physical cube or star-schema.  Well Hugo (who’s back blogging after a long absence) promises to take this method a step further, his modifications to FreeMind will allow for the export of a mindmap as a Palo cube!  Check it out …

Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool I still had one major problem with it, its lack of a scripting tool.  Kettle (Pentaho PDI) has Javascript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl.  I could have gone (and did experiment) with calling Javascript, Jython or JRuby via JSR223, but I wasn’t happy with the level of integration afforded by this, opting instead to make command line calls to Python (using SQLite as a data carrier).

Then, I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than is now, liked it then but couldn’t see a use for it at the time and promptly forgot about it.  Then it appeared wrapped in a Talend component, prompting me to do a quick visit to the Groovy website, which turned into a deep-dive into the language; I’d found my scripting tool!

Groovy (by the way what a terrible name for a language, or is that just me?), is not really a stand-alone language but more an extension to Java itself; offering the full power of Java but with addition of closures, builders and dynamic types.  In fact, over time Groovy has become more and more Java like (the biggest missing being lack of support for anonymous inner-classes).

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between both languages; Groovy  compiles down to JVM bytecode and can use Java classes and objects, Java can likewise use Groovy generated bytecode.  This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.

Here for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube …


[sourcecode language='Java']

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;
connection = ConnectionFactory.getInstance().newConnection("localhost","7777","admin","admin")
database = connection.getDatabaseByName("Demo");
;cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements,columnElements,]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)
// fetch data set
datas=cube.getDataArray(dataSet as Element[][])
connection.disconnect();
// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data=[]
heading=[]
// first row set to the row names (i.e. "Product name" followed by the country names )
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
   }
data << heading
// Now  out each line
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
         row << datas[((i + (j*columncount)))]
         }
    data << row.flatten()
   }

//output to csv file
def csvOut= new FileOutputStream('c:/data/File.csv' )
for (lines in data) {
         lines.eachWithIndex{col,i ->
                             if (i > 0) {
                                 csvOut << ","
                             }
         csvOut << col
     }
     csvOut << "\n"
}
csvOut.close()
[/sourcecode]

This was done in the Groovy console as a proof of concept, it was then transferred to a tGroovy component where it was parametrised and instead of outputting to a CSV file, it was used to fill the globalBuffer structure (the structure used by tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set-up a Talend job as an Excel accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and the JPalo API, both SQLite UDFs and Jetty were new to me, and that’s were scripting proves it worth, giving the developer the maximum support with the minimum of background noise.  But it’s not just weird and wonderful new APIs that scripting helps expose but as a datasmithing tool, languages such as Groovy give analysts the ability to quickly de-construct and model datasets (for example, see Groovy’s SQL database support and collections’ functionality).

As a infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff“.

Palo OLAP and sparse dimensions.

Last week I tried out both the latest Palo 2.5 release and its sister product, ETL-Server.  Although I’ve not done any proper benchmarks, 2.5 does appear to be faster than the previous release and the Excel add-in also behaves better when co-habiting with other add-ins and macros (the previous release’s use of, and response to, application level events meant it often caused the entire Excel session to grind to a halt when other macros were processing data).

There is however, a major memory leak problem when using cubes with sparse dimensions (such as the Biker database Invoice dimension), at one stage my Excel session had a working set size of 750M!

The failure of  MOLAP cubes to effectively handle sparse datasets has always been something that ROLAP star schema advocates (myself included) have pointed to as a major short-coming.  Over time, products such as Essbase have managed to get around this limitation (but often only with careful up front cube design by skilled staff). Palo hasn’t quite made it to that level yet.  But then, maybe it shouldn’t, perhaps simplicity of setup and use should be Palo’s goal, not the ability to handle a telco’s ‘grain-at the-level-of-call’ fact table.

Two of the new features in Palo 2.5 that I’d been looking forward to, Zero Suppression and Drill-through are both relevant to the handling of cube sparsity.

  • Drill-through; to reduce the sparseness of a dimension, by moving the ‘grain’ up a level in the consolidation hierarchy, e.g. have a base level of Invoice Type instead of Invoice Number or indeed, dropping the dimension entirely.
  • Zero Suppress; to filter out excessive elements, e.g. only show Invoice Numbers that were raised this month rather than showing every invoice number ever raised depending on a non-zero value to indicate that it belongs to this month.

It was my testing of Zero Suppression using the Biker database that appears to have caused Excel to go on a memory binge.  It doesn’t worry me that much as the right design can reduce the need for zero suppression e.g. in a customer dimension, ‘hide’  customers who trade with you infrequently under a “Others” consolidation.  Also, the use of drill-through would eliminate the need for many sparse dimensions, but, alas, the drill-through functionality is severely nobbled in Palo 2.5.

Firstly, the functionality is only available in Excel if you purchase the €8,000 Palo Supervision Server which is all fine and dandy if you have a need for all the good things that Supervision Server offers.  Having said that, I did managed to bypass this requirement by calling the ETL-Server’s SOAP API directly from Excel with the help of Simon Fell’s superb PocketSoap library, so all is not lost on that front. But …

… the other reason I’m not overly impressed with the feature, is its implementation; particularly for something Jedox is asking you to pay €8000 for!   Here’s what you get.

Within the ETL-Server’s cube ‘export’ you can specify “<info>” fields such as invoice number, that will not form part of the final cube coordinate list.  These coordinates and the info fields along with ‘value’ field are then written to a Derby (aka JavaDB) database table, where the schema is the ETL project name and the table name matches that of the cube.  So you end up with, in essence, a ‘fact table’ at a finer ‘grain’ than the cube and the role of degenerate dimensions being provided by the “info” fields.  The resulting table is not indexed, so large datasets will be a problem and for it to work, a cube can only be ‘owned’ by a single ETL project.

The other thing to note is that drill-thru only works for ‘N’ coordinates i.e. no consolidation elements.  This is unlike Excel’s Pivot table which allows you to double-click at any cell, to reveal the underlying dataset.

On the plus side, the new ETL-Server is actually very good and well put together.  Better off not thinking of it as a generic ETL tool but as a specialist Palo loader, like Oracle’s SQL*Loader.  And like SQL*Loader’s role on many Oracle projects, for a large percentage of Palo projects ETL-Server will  be the only ETL tool needed.

Palo ETL Server – Not for me …

Jedox have just released V1.0 of their Palo-centric ETL Server. I had been looking forward to this, not so much for its ETL ability (which is somewhat limited when compared to the likes of Pentaho PDI or Talend) but for the drill-through capability it would add to Palo. Alas, there’s a catch, you must purchase Palo Supervision Server (€8,000) to enable the Excel add-in to avail of this feature!

The thing that attracted me to Palo in the first place was its simplicity of approach and the primacy of Excel as the end-user view of the product, a modern day ESSBase. The fact that the Excel Add-in is closed source always worried me, as I felt that it would inhibit the thing that really sets open source apart (no not the cost) ,the formation of an active and innovative developer community. The sort of developers who have a need for, and an interest in, MOLAP tools tend to be more familiar with VBA, .NET, SQL, SAP Config etc. than with C/C++ or even Java development. The one area where such a community could add value, the Excel front-end, is closed to them. And I know there’s some non-Jedox involvement in the form of JPalo and the OO-Calc Add-in, but Excel is the key to Palo’s wide-spread adoption.

Also, the choice of .NET as the main client-side development platform was a mistake in my opinion, a VBA accessible object model would have been much more useful, it would also have removed the need for the current painful installation process.

This partial “open-source” model and the increasing complexity of the platform makes Palo, at least for me, a less attractive Micro-BI option. And remember, Excel already has a very powerful in-memory OLAP tool, the humble PivotTable, which is in most cases “good enough” for most analytical needs. So why use Palo rather than a PivotTable?

Palo’s advantages:

  • Can handle very large data sets (limited by the free memory available to the server).
  • Allows write-back and splash-down, both very useful for planning/budgeting applications.
  • Allows for ragged-hierarchies,
  • Server-side MOLAP rules e.g. [Budget],[2008] = [Actual],[2007]*0.035

Excel’s PivotTable advantages:

  • Pure Excel. with the object model available for VBA scripting.
  • Drill-through as standard.
  • Excel 2007 can now handle a 1,000,000 rows, for earlier versions use an Access/SQLite local database or an enterprise database to first “group” and summarise the data to be pivoted.
  • Can be used against SQL Server Analytical Services (SSAS) Cubes (and those of other providers such as Pentaho’s Mondrian).
  • Much easier to set-up and use.

Update: May 1st 13:30

Beta Version of Palo 2.5 (which you’ll need to use the Palo ETL Server’s drill-through functionality) has just been released ..

Python the new VBA ?

These last two weeks, Python has been on my mind. First off, last week I decided to make time to fully investigate Picalo, an open-source Python-based data analysis tool, and then, this week, Google announced their long awaited cloud-computing offering, Google Apps Engine, with the language at its core.

Python was the first of the “LAMP generation” scripting languages that I decided to learn in any detail ( I had used Perl before that but only on a per-task basis (similar to how I’d used AWK)). I then invested time in learning PHP, then Ruby and finally JavaScript. And here I am, back where I started, with Python.

But it’s not the same Python I learned three years ago, not that it has changed that much, but my appreciation of the language has, largely due to my deep dives into other languages. For example, JavaScript’s treatment of functions as first-class objects, highlighted the same functionality in Python, something I’d missed (or rather, not fully understood) the first time I encountered the language. Likewise, Ruby’s RoR introduced me to a “best of breed” approach to web application design, something that can be used as a comparison aid when approaching new web frameworks such as Django.

But of course the scripting language that continues to power most of my datasmithing activities is Excel VBA. That’s why I was so excited to see a tool such as Proto utilise VBA as its scripting language. But, Microsoft has abandoned VBA, there will be no more Protos.

Also, Excel VBA is now a Windows only language. Windows, however, is no longer the ‘only’ business client OS (see how many Apple laptops you can spot the next time you’re in a business-class airport lounge, a few years ago it would have been zero, not any more), and is currently nowhere to be seen as a cloud computing platform (but that’ll change).

I’m at heart a table-oriented programmer, and I, like Picalo’s author Conan Albrecht, believe “data analysis is best done through scripting”; but not just data analysis, the T in ETL (Extract, Transform and Load) and the I in DI (Data Integration) and SI (Systems Interfacing) also benefit from a scripting approach.

So, what to adopt as a successor/companion-in-her-old-age to VBA, will it be Ruby, JavaScript, Python, Perl, even PHP?

It looks like it’ll be Python because it’s …

The runner up is of course Ruby, but its poor integration with Windows is a major problem and the datasmithing “prior art” of Picalo and Resolver makes Python hard to beat.

UPDATE Jan 2010:

To experience the best of both worlds, VBA & Python, my xLite (Excel combined with SQLite) datasmithing platform now allows Python to be used in conjunction with VBA.  Check it out here http://www.gobansaor.com/xlite

UPDATE:

Also, as Dan pointed out in the comments below, I’d not included Jython in my list of reasons for embracing Python. I must add it to my list of things to try out particularly as both my “classic” ETL tools, Talend and Kettle are JVM based.

Another thing to add to the (ever growing) list is Mike Pitarro’s SnapLogic python-based ETL tool. They have …

…just released a 2.0 Beta version with some major architectural enhancements. The SnapLogic model is very different from traditional ETL systems. It takes an approach that’s more like the web, based on loose coupling and HTTP interactions. We model data source, sinks, and transformations as URI addressable endpoints, and have a model where than can be chained together in pipelines to build transformation logic. We use a plugin architecture to make it easy to add custom components.

Java – at the eye of a perfect storm

The “perfect storm” of ubiquitous broadband, powerful and cheap laptops, virtual machines, cloud-based services/infrastructure and open-source software is changing the nature of IT in a way that’s reminiscent of the revolution started by the IBM PC. Although a lot of emphasis has been put on the influence of consumer-focused services on the enterprise, the Web2.0 effect; there’s also traffic in the other direction.

Tools that were once the preserve of large multi-national or governmental organisations are now becoming available to a much larger audience at a fraction of the cost (either free via open source or pay-as-you-go via on-line services such as Amazon Ec2 or salesforce.com).

As a result of this leakage from the enterprise, I’m more and more using a skill that I though I’d left behind in the hallowed halls of “big business”, my knowledge of enterprise Java. Two of the tools that are at the centre of my datasmithing arsenal are Java to the core, Talend ETL and WaveMaker and a third Palo’s new ETL Server is built on a Java stack.

Talend

My first impressions of the “Java Project” version of Talend were, as they say in Texas, “all hat and no cattle”. I’ve stuck with it though, and have had the opportunity over the last few weeks to re-visit the product. Initially, my attitude to Talend was coloured by my experience with Kettle (aka Pentaho PDI), which under the direction of Matt Casters and the patronage of Pentaho has grown from strength to strength, but once I attuned myself to the idea that Talend is, in essence, a code generator, generating code in a language I know well, I became more comfortable with it.

What I like about Talend, is the ability to convert an ETL process into either a POJO or a WAR file representation of the solution, both stand-alone and fixed-in-time. Talend as a company or as a product could disappear in the morning, as could I, but the solution, cast in Java, will continue on regardless, a solution expressed in a standards based language, widely used and understood by a large number of IT professionals.

This is really important when you consider the ETL/BI products that have been swallowed by the rent-book collectors of the IT business, e.g. Essbase/Hyperion by Oracle, Business Objects by SAP (I might be a bit unfair to SAP, who continue to show real commitment to software R&D). The new owners of these once ground-breaking products will continue to milk the licence holders for “rent” long into the future , a situation that would be acceptable if they continue to offer value above a “protection” service more associated with that provided by your local hoodlum; but alas, most were bought to strengthen the purchaser’s control of their market, so I wouldn’t hold my breath.

I’ve always been a lazy programmer, so I’ve over the years developed numerous productivity aids to help automate development (i.e. reduce the boredom factor for me and the cost for my customers/employers) and to reduce errors (a cost to both sides). Code generation has been at the heart of many of these efforts, so I find myself at home with both Talend and Wavemaker’s approach.

WaveMaker

I’d not heard of Chris Keene’s WaveMaker until a few weeks ago. WaveMaker (previously know as ActiveGrid) belongs, along side Talend, Pentaho and Jedox, to a new breed of open-source businesses, and the product brings new life to Java web development. I was lucky to escape the worst of the J2EE nonsense and could never understand why an easy to use GUI builder like this never existed in the Java world, no wonder .NET continues to outgun J2EE in the market place.

Those of you with a background in VB6/VBA or VisualStudio will feel right at home here, but instead of desktop GUIs you’ll be building AJAX web applications. The resulting application is packaged as a WAR file which can be hosted by any standards based Java server (e.g. Tomcat or Jetty). It’s open source and built on industry standards, Hibernate,Spring and the Javascript Dojo framework.

Not only can WaveMaker act as a front-end to traditional databases, it’s designed to be equally at home with data served up by web services or POJOs. And, as Talend WAR projects and the Palo ETL Server (Jetty based) both expose Axis based web services, these three products are a match made in Java heaven.

Oracle Application Express

So, If you only came across WaveMaker recently, what did you intend to use as a Web/GUI front-end before this?

Well, for many tasks, Excel, backed by web-service aware VBA will continue to be an option, my xlAWS library and xLite code base will continue to be useful. Obviously, Excel is a natural front-end to Palo itself and VB6 is always there for quick and dirty Window’s GUI apps. I may also use Proto if circumstances warrant it.

For web front-ends, in the past I toyed with using Oracle Apex, then Rails, then JotSpot, then Zimki.

  • Rails taught me a lot about “good web app design” and introduced me to Ruby (and SQLite) but it didn’t offer me the speed of development and ease of deployment I’ve become accustomed to.
  • JotSpot got swallowed up by Google and spat out again as Google Sites minus the innovative “wiki database” capability.
  • Zimki, alas, is no more.
  • That left me with Oracle’s impressive Application Express (aka APEX aka HTML DB).

Application Express is excellent if your background is in Oracle databases and/or Oracle Forms and if you work in a Oracle shop and have not checked it out, then do so, you’ll be impressed. It’s standard with all versions since 10G and can be installed on V9 databases. It’s also the front-end for Oracle XE – the free edition of the database server.

So why jump ship to WaveMaker?

APEX is Oracle specific, closed source, costs a fortune once you outgrow Oracle XE, is a bit “odd” to configure, and, I’m not sure Oracle know what to do with the product (afraid it might cannabalise its lucrative J2EE business!).

WaveMaker embraces the world, is open source and is really easy to use, while still allowing access to the underlying code (both Java and Javascript) and CSS styling. And, it can be easily deployed.

No contest, I’m afraid.

Dublin Bus and PALO ETL – the connection!

Dublin buses, as is the norm with most road-based public transport systems in our increasingly car-choked cities, tend to operate on the basis of “no sign of a bus for ages, then two or three arrive at the same time”. Palo MOLAP ETL options appear to be following the same pattern; we’ve been waiting for ETL support for ages and now we see three of them heading down the road towards us. There’s Palo’s own offering, then came Stratebi‘s Kettle Plugin and now Talend Version 2.3.0RC2 is offering a Palo output component.

Mind you, the Talend offering is very basic and I’ve not managed to get the Sratebi plugin to work, leaving Palo’s ETL Server as the front runner at the moment (drill-through capability is a winner in my book).

I’ve also been busy re-factoring my VBA SQLite and Amazon S3 code with the intention of publishing them as an Excel based micro-ETL platform. While cleaning up the Amazon AWS modules I’ve been playing with SimpleDB, I’m impressed, Excel combined with SimpleDB rocks!

I’ve also wrapped the open source XySSL SHA1 HMAC C code in a VBA friendly DLL, as searching for a VBA hmac sha1 hash implementation (essential for Amazon AWS access) has proved fruitless.

Hope to release the lot the end of next month.

UPDATE:

Thanks to Javier and Jorge from Stratebi I’ve managed to get the new Kettle Palo plugin to work. It seems that the TEST facility in the Kettle database connection dialogue throws an exception for Palo connections but the connections work fine in the actual Palo input/output steps. Did a quick test and it looks very easy to use and fits in well with the Kettle “way of doing things”.

PALO ETL-Server and SAP

Jedox have just published a roadmap for their open-source ETL-Server, release date of March 2008, same date as the next release of the Palo OLAP Server. In a future release they intend to offer SAP RFC/BAPI and SAP-BW XMLA support, being an old SAP hand this looks very interesting.

There’s also a features page with a good overview of the ETL-Server’s technical architecture.

PALO ETL Server, more sightings …

First day back after Christmas, snow falling outside.

More additions to the PALO ETL-Server SourceForge project, new version of the core and, a new web server – built using Jetty and Apache Axis. Axis is a SOAP handler so I looked around for the WSDL file to see what services are to be exposed and found a reference to a drillThrough service which I guess is the mechanism by which we’ll soon be able to drill back from a PALO cube to its source data tables. At the moment I’m just wandering through the source code, I’ll need to fire up an EC2 instance to give it a test run as the new server code doesn’t fully support Windows.

Happy New Year!

PALO ETL-Server, first sighting …

I was wrong. I figured Jedox would build their new ETL server on one of the existing open source ETL project code-bases, either Talend or Pentaho’s Kettle. Instead, the new alpha ETL server code which has just been uploaded to SourceForge is based on neither and appears to have been developed by another German company Proclos.

Rather that a full featured all-things-to-all-men ETL tool, it’s a specialist MOLAP cube import tool, like an XML driven version of IMPPalo. Being Java based, it should be easy enough to combine with Kettle to offer the best of both worlds; let Kettle do the heavy lifting and the management of conformed dimensions and fact tables, then use Palo ETL-Server to build the hierarchies and load the cubes from these tables.

There’s no documentation as yet but there’s two demo projects; importRelDB.xml, which loads data into a cube from a HSQLDB in-memory database and a CSV file; and importOLAP.xml, which copies data from one Palo cube to another.

To run the importRelDB.xml project …

Java -jar importer.jar – p importRelDB

… each project is broken up into Jobs (such as Initdata, MasterData, CubeData, again like IMPPalo) and these can be run separately by using the -j option.

The tool is controlled via XML configuration files and lacks a GUI interface (which is fine by me, I’m more of a command-line guy). I’ve checked-out the SVN code and am slowly working my way through it, no sign as yet as to how drill-back from PALO cubes will be enabled; as this project is called Importer ETLCore, perhaps that’s yet to come.

So far, I like what I see.