Category Archives: kettle

SQLite – the ultimate data-smithing tool!

SQLite logo as of 2007-12-15Image via Wikipedia

Although my data-smithing tool box is full to the brim with powerful tools such as Talend, Kettle PDI, Picalo and Excel, all backed by the cloud infrastructure of Amazon’s S3, SImpleDB and EC2, there’s one simple yet powerful tool that I always seem to gravitate back to, that tool is SQLite.

Now obviously being a hewer of data, I need a SQL compliant database for data manipulation and SQLite performs that task with speed and ease. But it’s not just in the hewing, it’s in the hauling of data where SQLite also shines.

I use SQLite as the container for passing tabular datasets between (and within) my various tools, that data doesn’t even need to be clean (due to SQlite’s liberal manifest typing rules) just so long as it can be expressed as a table.

For example; a Talend job could store an extracted dataset in a SQLite file, pass that file on to a Python script for some special processing (for example extracting further data from a source not directly supported by Talend such as SAP or SimpleDB), and then pass the resulting SQLite database on to Excel or a similar tool to allow a business user to view and perhaps modify the data; finally Talend picking up the file again to load it into a corporate data warehouse.

Now you could use flat files to transport the data or store the intermediate results in a corporate database, but SQLite is as easy, if not easier than, flat files and offers the SQL processing capabilities of big-iron databases, but without the hassle of getting write access to an existing server or setting one up from scratch.

And I know there are other similar file based database offerings such as MS Access and the Java only HSQLDB, but neither match SQLite’s ubiquitousness, sheer simplicity and powerful data processing ability.

Dublin Bus and PALO ETL – the connection!

Dublin buses, as is the norm with most road-based public transport systems in our increasingly car-choked cities, tend to operate on the basis of “no sign of a bus for ages, then two or three arrive at the same time”. Palo MOLAP ETL options appear to be following the same pattern; we’ve been waiting for ETL support for ages and now we see three of them heading down the road towards us. There’s Palo’s own offering, then came Stratebi‘s Kettle Plugin and now Talend Version 2.3.0RC2 is offering a Palo output component.

Mind you, the Talend offering is very basic and I’ve not managed to get the Sratebi plugin to work, leaving Palo’s ETL Server as the front runner at the moment (drill-through capability is a winner in my book).

I’ve also been busy re-factoring my VBA SQLite and Amazon S3 code with the intention of publishing them as an Excel based micro-ETL platform. While cleaning up the Amazon AWS modules I’ve been playing with SimpleDB, I’m impressed, Excel combined with SimpleDB rocks!

I’ve also wrapped the open source XySSL SHA1 HMAC C code in a VBA friendly DLL, as searching for a VBA hmac sha1 hash implementation (essential for Amazon AWS access) has proved fruitless.

Hope to release the lot the end of next month.

UPDATE:

Thanks to Javier and Jorge from Stratebi I’ve managed to get the new Kettle Palo plugin to work. It seems that the TEST facility in the Kettle database connection dialogue throws an exception for Palo connections but the connections work fine in the actual Palo input/output steps. Did a quick test and it looks very easy to use and fits in well with the Kettle “way of doing things”.

PALO ETL-Server, first sighting …

I was wrong. I figured Jedox would build their new ETL server on one of the existing open source ETL project code-bases, either Talend or Pentaho’s Kettle. Instead, the new alpha ETL server code which has just been uploaded to SourceForge is based on neither and appears to have been developed by another German company Proclos.

Rather that a full featured all-things-to-all-men ETL tool, it’s a specialist MOLAP cube import tool, like an XML driven version of IMPPalo. Being Java based, it should be easy enough to combine with Kettle to offer the best of both worlds; let Kettle do the heavy lifting and the management of conformed dimensions and fact tables, then use Palo ETL-Server to build the hierarchies and load the cubes from these tables.

There’s no documentation as yet but there’s two demo projects; importRelDB.xml, which loads data into a cube from a HSQLDB in-memory database and a CSV file; and importOLAP.xml, which copies data from one Palo cube to another.

To run the importRelDB.xml project …

Java -jar importer.jar – p importRelDB

… each project is broken up into Jobs (such as Initdata, MasterData, CubeData, again like IMPPalo) and these can be run separately by using the -j option.

The tool is controlled via XML configuration files and lacks a GUI interface (which is fine by me, I’m more of a command-line guy). I’ve checked-out the SVN code and am slowly working my way through it, no sign as yet as to how drill-back from PALO cubes will be enabled; as this project is called Importer ETLCore, perhaps that’s yet to come.

So far, I like what I see.

New ETL platform for PALO OLAP

Jedox have announced that they intend to ship a Palo centric ETL open source server product early next year. This is excellent news and is on top of the new rules engine that was added to Palo this summer. Open source MOLAP has suddenly taken off the training wheels and is getting ready to mess with the big kids. The two things I really like about the new proposed Palo-ETL server are; it’s open-source and it’s designed to enable drill-down from the analysis cube back to the data source.

Drill back is the 2nd most common reason for continuing IT staff involvement in the day-to-day running of BI projects (the 1st is of course the ETL process). The “Where did that figure come from?” question is one of the reasons that the Excel pivot table function is so popular, double-click on a data cell – the rows used to generate that figure are displayed on a new sheet; simple but powerful.

As to what platform the new server is to be built on, I’m guessing Talend or Kettle. Now it is possible that Jedox have rolled their own product from scratch but with two superb open-source products already out there it would seem pointless. Talend actively seeks out technology licensing agreements with other companies and has just opened an office in Germany (Jedox is based in Freiburg) so it would be the most likely contender. But, Pentaho’s Kettle may also be in the running as there’s already prior-art and this Pentaho forum thread. Also, the one major missing from the comprehensive Pentaho stable is lack of an in-memory OLAP tool (their current OLAP offering is the ROLAP Mondrian project).

Using the latest Pure Java SQLite JDBC driver in Kettle

The bug in the pure Java SQLiteJDBC driver that caused an “out of memory” error when trying to connect to a SQLite database using standard windows drive letters (e.g. c:\kettle\mydata.db) is now fixed. The current version (V037) has also been updated to SQLite version 3.4.2. To use the latest driver within Kettle, download the file from here, go to the ../libext/JDBC folder, delete the included sqlitejdbc-v023-nested.jar and replace with sqlitejdbc-v027-nested.jar.

Update:

Matt Casters of the Pentaho Kettle team left a comment to say version 2.5.2 and version 3.0.0-RC2 will ship with the sqlitejdbc-v027-nested.jar