Category Archives: kettle

LiteBI, Heavy ETL

Although my major BI interest is in micro-BI (or is that  workgroup-BI?)  i.e. data, perhaps cleansed and packaged elsewhere, available locally on a datasmith’s PC,with most likely an in-memory OLAP as the analysis tool; the possibilities of the “cloud” as a BI platform have not escaped me.

From a micro-BI perspective, the ability to act as a backup/mirroring tool or as ETL/marshaling tool (anybody for Hadoop and SQLite?) attracts. I’ve yet to make up my mind on BI delivered as a cloud PaaS but obviously many others believe it has a future.

My main worry with PaaS is not lock-in (which exists equally for in-house proprietary solutions) but the dangers of a Coghead-like lock-out.  My other doubts are more technical; believing, as I do, that in-memory offers significant advantages over traditional ROLAP (simplicity been the main one) and multi-tenant in-memory architectures are not yet a runner.  But last week I had a demo of new Spanish BI PaaS service, LiteBI, which might just change my mind.

Javier Giménez Aznar and his team previously worked on delivering Pentaho based datawarehouses to large Spanish corporations and government agencies, so they have a deep understanding of Mondrian ROLAP and are using that knowledge to build the LiteBI service, but this time with SMBs as the target customers rather than corporates. Pricing starts at €145 per month and is based on number of concurrent users, number of analytical spaces and the data volumes, so it’s not for very small firms more for the Medium in SMB.

Impressions? The cube designer, dashboard builders and the general UI are all very good and I would think would appeal to end-user datasmiths and, as such, will be a major up-front aid to selling this product.  But it was LiteBIs approach to the thorny issue of ETL and data loading that impressed me and also helped ease some of my Coghead-induced-fears.

BI technology stacks consist of three elements:

  • The “fancy” front-end; graphs,animated dashboads and so on.
  • The pivot engine; ROLAP or MOLAP or both.
  • The ETL process.
  • (Many would say there’s an important 4th, the data-warehouse, but not every BI effort requires one, but that’s another issue)

LiteBI is continuing to build yet more functionality into their UI and this “fancy” front-end is essential as it’s their “shop window”.

Mondrian provides their pivot engine, and again they continue to work on optimisations such as column-based datastores to increase speed and automate responsiveness tuning (end-users are very unforgiving of slow pivots).

But it’s in the 3rd area, that of the ETL process, that you realise the LiteBI team has real-world BI experience.  Data is loaded into LiteBI via an API, but with the ETL process itself happening on the customer side.

“Well,so what?” you may ask. The extraction of data has to obviously happen customer-side (even though not in the case of data being sourced from the likes of SalesForce.com). Yes, but it’s the transformations and data cleansing that adds true value to the ETL process and subsequently determines the quality and usefulness (as opposed to the speed or the “prettiness” of delivery) of the solution.

Part of the process of adopting LiteBI, is an ETL consultancy stage where a LiteBI partner company will provide on-site services to build this ETL layer, handling not just transformations but initial load and automating the subsequent delta uploads.

So the cost mounts up, but in reality you can’t do BI without this investment; there’s no ETL magic bullet.  Even still, Javier says the typical go-live time for a LiteBI project would be in the order of 3-4 weeks rather than the 3-4 months of similar on-site Pentaho projects.

The end-user ‘owning’ the ETL process makes the prospect of a service lock-out slightly less worrying as, at least, one would still have a good starting point for moving to another provider or back in-house. What I would really like to see would be the option to self-host LiteBI, which I guess would involve open sourcing large parts of the service (the automated optimisation strategies could, for example, be excluded from this open source version).

The load API comes packaged as a plugin to Kettle (aka PDI) and the intention is to offer a similar add-on for Talend in the near future. LiteBI also offers a white-label offering whereby 3rd party OLTP solution providers can use the service as their product’s BI suite.

Like the Skibbereen Eagle keeping its eye on the Czar of Russia, I too will be keeping a watchful eye on LiteBI and the march of on-demand BI in general.

Why not join me on Twitter at gobansaor?

Pentaho Data Integration (Kettle) V Talend Benchmark

Pentaho’s Matt Caster has just published a benchmarking exercise comparing Kettle and Talend.  In it he admits he’s not a Talend expert and he advises that people should perform their own benchmarks where possible as requirements differ.  Nevertheless, unlike most other benchmarks we’ve seen on the subject he publishes not just the results but the actual transformation “code” used in the tests. 

For many people these benchmarks are of no real interest as long as the product does what is required within the time and resources available they’re content.  But it would be a mistake to think that benchmarks don’t matter, they do; people have and will make that final decision based on them.  Remember ETL is not life and death, the decision which tool (if any) to go with may not get the level of investigation that the developers behind such products expect of their potential clientele and this is particularly true of open source.  Busy people will use such reports to direct them down a path or to confirm their existing prejudices. So I’m really glad to see Matt responding and in particular, responding in the manner he has.

Databases vendors have for years played the benchmarking game, setting and breaking records either via real technological advances or simply gaming the process.  We as purchasers and users knew in many cases to take the results with a large dose of salt, but purchasing decisions where nevertheless made on the backs of these surveys.

Why not join me on Twitter at gobansaor?

Open Source Metrics and Benchmarks

Marc Russel’s blog links to a Manapps ELT benchmark report comparing the performance of several leading ETL tools both proprietary (DataStage and Informatica) and OS (Talend and PDI (aka Kettle)).  As would be expected each tool has their own strengths and weaknesses, but one thing stands out, the venerable Kettle ETL aka PDI 3.0 is now a serious contender for handling very large datasets.  Oops, that’s what I get for wishing for a result and (mis-)reading the report early in the morning with a cold and bad sore throat, sadly PDI is still very much slower that its OS cousin Talend. In fact, Talend continues to play on the strength that comes from a code generated sloution, i.e. raw speed.  As a pure ETL play, Talend is well capable of playing on the same pitch as the “big kids”. 

Interestingly, the report is also “open source” as it’s released under a Creative Commons License, so I can link to it here.

UPDATE:

There’s now a new version of the report available (www.manapps.com, Topic Benchmark), it seems the original was just a work-in-progress and was not meant for public release.  The main difference appears to be a significant improvement in Informatica’s ‘score’, but I’m not sure as I was really only interested in comparing the two OSS products, Talend and Pentaho PDI, in that ‘battle’ Pentaho still comes out ‘slower’.

 The original Marc Russel blog entry and a subsequent one reporting the new updated report appear to have both been removed.  

Also, I was informed of the ‘updated’ report via this email from manapps, which assures vendors that they are happy to rerun any tests and provide any information re the running of such tests … 

Dear Sir,

You referred on your web site to the report called “Benchmark ETL” by Manapps, from November 2008. This draft report was not intended to be publicly released since just a working document.
We would like you (i) publish Asap the modified version (or its related link) that supersedes the former one (on our web site (www.manapps.com, Topic Benchmark), (ii) state that Manapps had no intend to release the former report and accordingly takes no responsibility on its content, (iii) state that Manapps holds all necessary elements at the disposal of all vendors so that they can rerun some tests if wished that will then be published.

Regards,
Philippe THOMAS

Time: Thursday March 5, 2009 at 5:10 pm

 

Another analysis of OSS in the wild this time from Chris Keene, WaveMaker CEO, on OSS as a marketing tool. Bottom line, 1% conversion rate, 700 paying customers in 9 months …   

WaveMaker OSS as a marketing tool

WaveMaker OSS as a marketing tool

 

Why not join me on Twitter at gobansaor?

New universal SQLite JDBC library.

Both Talend (Java) and Kettle distribute the Zentus.com pure-Java SQLite JDBC driver and for most purposes this run-anywhere version is fine. But, if you really need to take advantage of SQLite’s speed then connecting using the native JNI version is a must.  Doing this was easy enough, just change over to using a generic JDBC connection specifying the required native jar and placing the associated dll/so on your system path.

But now there’s an easier way, the latest version (V052, in fact from V050 on) is a universal jar, it contains native JNI libraries for Windows, Linux and MacOS alongside the pure-Java version.  It will automatically pick the correct lib for the platform and fall back to the pure-Java version if required.  You can tell if it’s picked up the native lib by calling conn.getDriverVersion(); it’ll return “native” if it has.

To upgrade to this jar in Kettle see this, this time replacing the nested jar with sqlitejdbc-v052.jar.

For Talend:

  • Either rename the new V052 jar to sqlitejdbc_v037_nested.jar, replace the existing V037 jar in the ../lib/java folder with this new renamed file.
  • Or, you could edit the Java specific XML files in the various tSQlite component folders, replacing the references to the old nested V037 jar.
  • Or, and this is what I would do, don’t use the tSQLite components, replace them with tJDBC generic components, then you can pick whatever version of the driver you require, you could even change to a different database provider!

The Talend tradition of a separate set of components for each type of database, seems to be a hangover from its Perl-generating roots. It’s true that database specific components are required for certaing tasks such as  bulk-loading, ELTs and so on, but JDBC was designed to be generic and as long as the SQL syntax is compatible, it makes switching in an out database providers very easy.  So unless there’s a good reason, stick to using tJDBC.

Regular Expressions as an end-user programming tool?

“What? Have you completely lost the plot, Gleeson?”, I hear you scream.  Jamie Zawinski’s famous quote is intoned once more ..

Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.”   Now they have two problems.

Of course the above quote could be (and probably has been) changed to…

Most business people, when confronted with a problem, think
“I know, I’ll use a spreadsheet.”   Now they have two problems.

They are dense, single-line, single purpose, self contained mini-programs.  The previous statement applies to regular expressions but could equally be used to describe the single most popular end-user programming tool, spreadsheet formulae (particularly in their nested form!).

As somebody with the “programming gene” (something most, but not all, IT professionals possess, as do a significant proportion of “civilians”), such compressed logic somewhat grates compared with the power and elegance of more expressive programming languages, but that hasn’t stopped me using both spreadsheet formulae and regex to quickly and effectively solve problems when the need arose.

Those without the programming gene (the vast majority of business users), find traditional programming languages next to impossible to get their heads around yet find spreadsheet formulae approachable and useful.  It seems to be something to do with approaching problems as a series of simple problems and not loading the whole problem domain into one’s brain at one sitting (as most programmers and system designers are capable of).

In the past, non-programmers would rarely come in contact with regex as its use was possible only within the realms of professional programming or Unix sys-admin toolsets (sed,awk etc.).  But now, ETL tools such as Kettle and Talend allow end-users to use regular expressions without the need to understand the underlying programming language.  Taking this to the next step, Talend’s new data profiling product uses regular expressions as its main discovery language. They could, I guess, have invented yet another XML dialect and/or query-by-example dialogue, but instead they’ve taken the sensible (and cheaper) option and exposed the full power of raw regular expressions.

Will the great unwashed embrace regex in the same way they took to nested Excel functions, embarrassing their professional colleagues with yet more amateurish and often unmaintainable messy solutions, that just work? I think they just might…

SQLite – the ultimate data-smithing tool!

SQLite logo as of 2007-12-15Image via Wikipedia

Although my data-smithing tool box is full to the brim with powerful tools such as Talend, Kettle PDI, Picalo and Excel, all backed by the cloud infrastructure of Amazon’s S3, SImpleDB and EC2, there’s one simple yet powerful tool that I always seem to gravitate back to, that tool is SQLite.

Now obviously being a hewer of data, I need a SQL compliant database for data manipulation and SQLite performs that task with speed and ease. But it’s not just in the hewing, it’s in the hauling of data where SQLite also shines.

I use SQLite as the container for passing tabular datasets between (and within) my various tools, that data doesn’t even need to be clean (due to SQlite’s liberal manifest typing rules) just so long as it can be expressed as a table.

For example; a Talend job could store an extracted dataset in a SQLite file, pass that file on to a Python script for some special processing (for example extracting further data from a source not directly supported by Talend such as SAP or SimpleDB), and then pass the resulting SQLite database on to Excel or a similar tool to allow a business user to view and perhaps modify the data; finally Talend picking up the file again to load it into a corporate data warehouse.

Now you could use flat files to transport the data or store the intermediate results in a corporate database, but SQLite is as easy, if not easier than, flat files and offers the SQL processing capabilities of big-iron databases, but without the hassle of getting write access to an existing server or setting one up from scratch.

And I know there are other similar file based database offerings such as MS Access and the Java only HSQLDB, but neither match SQLite’s ubiquitousness, sheer simplicity and powerful data processing ability.

Dublin Bus and PALO ETL – the connection!

Dublin buses, as is the norm with most road-based public transport systems in our increasingly car-choked cities, tend to operate on the basis of “no sign of a bus for ages, then two or three arrive at the same time”. Palo MOLAP ETL options appear to be following the same pattern; we’ve been waiting for ETL support for ages and now we see three of them heading down the road towards us. There’s Palo’s own offering, then came Stratebi‘s Kettle Plugin and now Talend Version 2.3.0RC2 is offering a Palo output component.

Mind you, the Talend offering is very basic and I’ve not managed to get the Sratebi plugin to work, leaving Palo’s ETL Server as the front runner at the moment (drill-through capability is a winner in my book).

I’ve also been busy re-factoring my VBA SQLite and Amazon S3 code with the intention of publishing them as an Excel based micro-ETL platform. While cleaning up the Amazon AWS modules I’ve been playing with SimpleDB, I’m impressed, Excel combined with SimpleDB rocks!

I’ve also wrapped the open source XySSL SHA1 HMAC C code in a VBA friendly DLL, as searching for a VBA hmac sha1 hash implementation (essential for Amazon AWS access) has proved fruitless.

Hope to release the lot the end of next month.

UPDATE:

Thanks to Javier and Jorge from Stratebi I’ve managed to get the new Kettle Palo plugin to work. It seems that the TEST facility in the Kettle database connection dialogue throws an exception for Palo connections but the connections work fine in the actual Palo input/output steps. Did a quick test and it looks very easy to use and fits in well with the Kettle “way of doing things”.

PALO ETL-Server, first sighting …

I was wrong. I figured Jedox would build their new ETL server on one of the existing open source ETL project code-bases, either Talend or Pentaho’s Kettle. Instead, the new alpha ETL server code which has just been uploaded to SourceForge is based on neither and appears to have been developed by another German company Proclos.

Rather that a full featured all-things-to-all-men ETL tool, it’s a specialist MOLAP cube import tool, like an XML driven version of IMPPalo. Being Java based, it should be easy enough to combine with Kettle to offer the best of both worlds; let Kettle do the heavy lifting and the management of conformed dimensions and fact tables, then use Palo ETL-Server to build the hierarchies and load the cubes from these tables.

There’s no documentation as yet but there’s two demo projects; importRelDB.xml, which loads data into a cube from a HSQLDB in-memory database and a CSV file; and importOLAP.xml, which copies data from one Palo cube to another.

To run the importRelDB.xml project …

Java -jar importer.jar – p importRelDB

… each project is broken up into Jobs (such as Initdata, MasterData, CubeData, again like IMPPalo) and these can be run separately by using the -j option.

The tool is controlled via XML configuration files and lacks a GUI interface (which is fine by me, I’m more of a command-line guy). I’ve checked-out the SVN code and am slowly working my way through it, no sign as yet as to how drill-back from PALO cubes will be enabled; as this project is called Importer ETLCore, perhaps that’s yet to come.

So far, I like what I see.

New ETL platform for PALO OLAP

Jedox have announced that they intend to ship a Palo centric ETL open source server product early next year. This is excellent news and is on top of the new rules engine that was added to Palo this summer. Open source MOLAP has suddenly taken off the training wheels and is getting ready to mess with the big kids. The two things I really like about the new proposed Palo-ETL server are; it’s open-source and it’s designed to enable drill-down from the analysis cube back to the data source.

Drill back is the 2nd most common reason for continuing IT staff involvement in the day-to-day running of BI projects (the 1st is of course the ETL process). The “Where did that figure come from?” question is one of the reasons that the Excel pivot table function is so popular, double-click on a data cell – the rows used to generate that figure are displayed on a new sheet; simple but powerful.

As to what platform the new server is to be built on, I’m guessing Talend or Kettle. Now it is possible that Jedox have rolled their own product from scratch but with two superb open-source products already out there it would seem pointless. Talend actively seeks out technology licensing agreements with other companies and has just opened an office in Germany (Jedox is based in Freiburg) so it would be the most likely contender. But, Pentaho’s Kettle may also be in the running as there’s already prior-art and this Pentaho forum thread. Also, the one major missing from the comprehensive Pentaho stable is lack of an in-memory OLAP tool (their current OLAP offering is the ROLAP Mondrian project).

Using the latest Pure Java SQLite JDBC driver in Kettle

The bug in the pure Java SQLiteJDBC driver that caused an “out of memory” error when trying to connect to a SQLite database using standard windows drive letters (e.g. c:\kettle\mydata.db) is now fixed. The current version (V037) has also been updated to SQLite version 3.4.2. To use the latest driver within Kettle, download the file from here, go to the ../libext/JDBC folder, delete the included sqlitejdbc-v023-nested.jar and replace with sqlitejdbc-v027-nested.jar.

Update:

Matt Casters of the Pentaho Kettle team left a comment to say version 2.5.2 and version 3.0.0-RC2 will ship with the sqlitejdbc-v027-nested.jar

Google Spreadsheets – ETL tool

Although I’m a total Excel fanboy, I most admit I rarely use it any longer for personal stuff such as home budgets, tax calculations, what-ifs, to-do lists etc.; I now tend to use Google Spreadsheets. Likewise, personal notes, drafts and useful bits of code are stored using Google Docs rather than MS Word. Three main reasons for this shift to the cloud:

  • Google Docs & Spreadsheets are ‘good enough’ for most of the trivial lists and calculations I require in my personal life and indeed for most business purposes as well, at least those that don’t require a pivot table.
  • These spreadsheets and documents are important but not necessarily in the ‘state secret/I-could-tell-but-then-I’d-have-to-kill-you’ scale of things, by building them in Google Apps they are securely backed-up and easily accessible.
  • A lot of the spreadsheets are collaborative in nature, and in the collaboration field, Google Spreadsheets just gets better and better.

Today, Google announced further additions to their spreadsheet product. The AutoFill feature adds functionality I’ve come to expect from Excel, but with a twist, integration with Google Sets. But the additions that really caught my eye were the new data import functions. Now again, Excel has had web queries since Excel97, and it always amazed me why online pretenders to the throne tended to ignore the most common source of tabular data on the web, the HTML table; something to do with the great XML/Tables divide I guess!

Google now not only fixes this omission,providing access to HTML tables and comma/tab separated file, but also provides access to RSS/ATOM and generic XML sources. All that’s missing now are functions that can read other common online data files formats such as Excel, MSAccess, XBase and of course SQLite.

This addition of HTML import support and the AutoFill feature will further reduce the number of times I’ll need to fire up Excel for personal tasks, but the RSS/ATOM/XML import feature also has potential as a tool in my micro-ETL toolbox. Using Excel as my only micro-ETL tool is possible when the data is either already in Excel/CSV or accessible via a COM API or via ODBC drivers, otherwise I can call-in either Ruby, Talend, Kettle or even RSSBus. But now I’ve another option, if the data is public and published as RSS/ATOM or some other variation on XML, I can use Google Spreadsheets to fetch the data and import the resulting tabular dataset into Excel via a Web Query or via the GData API.

New Google Reader Search facilityOne other thing. While researching this post, looking up links etc. I used another new feature Google added today, Google Reader’s new Search facility. As most of my references are discovered via the blogs I subscribe to, the ability to restrict searches to that subset of the web is fantastic; I even used it to search through my own blog posts! If del.icio.us offered the same option it would make re-finding stuff even easier. I did try to use Google Co-Op to build a search engine restricted to my del.icio.us links but it didn’t seem to like the volume of links (4000 odd) I sent it.

Apatar – a few extracts short of a load

I’ve been meaning to try out the Apatar ETL/Mashup tool for sometime and today being yet another rainy day in this the worst Irish summer that I can remember (and Irish summers are not renowned for the lack of rainfall) I decided to give it a try out. Not impressed I’m afraid; comes up short when compared to either Kettle (Pentaho PDI), Proto, RSSBus or Talend. Very few database connectors (e.g. no SQLite,DB2 or Firebird support) this wouldn’t be a problem if the product offered a generic JDBC or ODBC connector. It does have one nice feature the others (other than RSSBus) haven’t, an Amazon S3 connector. But the thing that I find amazing is that a product that’s positioning itself as an Enterprise 2.0. mashup tool doesn’t have the ability to read and write Excel files! And no, CSV files don’t count.

Google Gears – SQLite Killer App

The announcement of Google Gears is of course a game changer for those working in the development of online apps; its addition to Goggle Reader alone would make it worth while for me and I’m sure we’ll see it integrated into Google Docs and GMail in the near future. If you had any plans to develop a web based app or already deploy one you need to give this technology some quality time.

But it is the use of SQLite as the client-side persistence engine that excites this datasmith’s old heart. Since first coming across SQLite (while learning Ruby on Rails) I’ve been convinced that this “good enough” micro-database on steroids was a winner. Since then, as well as using it with Ruby ,I’ve integrated it with excel to use as a micro-ETL tool, I was instrumental in getting SQLite support added to Kettle (Pentaho PDI), I wrote a SQLite custom module for Proto, in fact the first thing I now do when checking out a data manipulation tool is to check if it supports SQLite.

I’m looking forward to investigating further what Google Gears can add to my existing datasmithing tool set, and I’m certainly glad I invested the time in learning JavaScript. Exciting times.

Talend vs. Kettle (Pentaho PDI)

Over the last few weeks I’ve received a lot of traffic from Goggle searches comparing Talend and Kettle and also from Vincent McBurney’s ITtoolbox article comparing the two products, so where do I stand?

As ETL tools they take different approaches, Kettle is a meta data driven framework (which is in turn tightly integrated into an even larger BI framework, the Pentaho BI Project), while Talend is at the end of the day a code generator. The code generator nature of Talend does impart certain capabilities such as the ability to easily integrate into other BI platforms (such as JasperSoft and SpagoBI ) and as micro-ETL tool. On the other hand Kettle’s increasing integration into the Pentaho family makes it natural choice for larger BI implementations. So which one would I recommend ? It’s got to be Kettle and the main reason for that choice is the Pentaho PDI community.

My experience so far with the Talend community has not been good. Admittedly I’ve only asked two questions, one looking for more details on a statement regarding ERP and CRM connectors mentioned in a press release and the other a technical query about a new tSQLite connector. Neither were answered. OK I’m a big boy I can handle the rejection and I’m also technically savvy, I can eventually fix most of my own problems. But I’m sure if I posted similar questions on the Pentaho forums I would have received an answer, probably from Matt Casters if he though the general community would not be able to provide the answer.

So will this experience stop me from using Talend? Maybe not, the code generation capability may come in handy as a micro-ETL tool (but I do have other options); however, through an offline conversation with Samatar (an active member of the Kettle community) and Matt Casters I learned of a new XML input step and of a PDI alternative to the Talend tJavaFlex component, the “Modified Java Script Value” step. This allows for the creation of START, TRANSFORM (i.e. for-each-row) and END JavaScript scripts. As Kettle uses the Rhino scripting engine I have access to any Java API (e.g. Palo,S3, Google Spreadsheets) but with the flexibility of JavaScript; brilliant!

Not only that, since my experience with Talend tJavaFlex reintroduced me to the world of Java I though I might as well look into creating my own Kettle plugin and guess what, it looks easy enough. So I’m going to develop my first plugin to be based on the ScriptValueMod step, replacing the Rhino engine with the Java 6 ScriptEngineManager. This will give me access to the latest version of Rhino which supports E4X (will make handling XML must easier) and also opens the possibility of using JavaFX or indeed JRuby as Kettle scripting languages.

UPDATE:

I’ve received a reply to my tSQLite question!

UPDATE: July 2008

This post is now over a year old, Talend as a product and as a community has improved enormously in the last 12 months; so much so, I now use Talend  (Java) in preference to PDI for most of my datasmithing needs.

I’ve got talend and I’m going to use it…

For the last few months I’ve being looking for my ideal ETL platform. That ideal would be open source, platform independent (well at least Windows and Linux), flexible, and easily deployable. It had looked like a combination of Kettle and my micro-ETL combinations of Ruby/SSQLite and Excel/SQLite would be the eventual “winners”. That was until I discovered (or rather rediscovered) Talend, to be more precise the just released Open Studio V2.0.

Having spent a few hours this weekend getting to know the new Java Project features I’ve come away a Talend fan-boy. I think I’ve found my ETL tool, one suitable for both heavy-duty server-side processing and client-side micro-ETL tasks. The tJavaFlex component allowed me to add missing functionality with ease; for example, I managed to..

…all within two hours of first starting to evaluate the product. (OK, I had evaluated the Perl version several months ago, so I wasn’t a total noobie!).

Because Talend is a code generator (Perl or Java) I can store my work as a ‘tool-independent-snapshot’ for future use, modify by hand if need be, and use it on any JRE or Perl supporting OS. It’s not that I’ll stop using the other solutions (the Excel/SQLite xLite utility has proved to be particularly useful and flexible) but Talend looks like it will become the tool-of-first-preference (along side Excel of course!) in the Gobán Saor’s datasmithing armory.

Talend ETL – A New Contender

Talend have released a new version of their Open Studio ETL tool. Not as full featured as Pentaho Kettle; only supports a limited number of databases and file formats – no SQLite support shock-horror! The press release promises More than 100 Native Connectors and promises connectors to ERP and CRM tools but I couldn’t find them (maybe they meant the ODBC support – well I guess I managed to connect the SQLite using this ODBC driver so maybe there’s an ODBC driver for SAP!). Compared to Kettle the design GUI runs much slower (built on an Eclipse platform, say no more).

But two things impressed…

Talend Open Studio is not an ETL engine as such, it’s a code generator. When I last looked at it, it generated Perl code but this release now also generates Java and not only that, it packages the resulting code ready to be deployed on any Windows or *nix platform, no Talend installation required. This could be an alternative to my Ruby/SQLite micro ETL idea, especially as SQLite support appears to be in the works.

The XML and CSV import components are excellent, especially the XML functionality. The Kettle equivalents work but they never felt like they provided a productivity gain and for me that’s what matters. Once I have data within either a database or an Excel environment I don’t need any other tools, it’s how well and how cost effectively an ETL tool handles the parsing, scheduling, provisioning, distribution and logging of external (usually XML,CSV and Excel) data that matters to me. Talend is now definitely a contender.

New software – Pentaho Kettle 2.5 RC1 and IMP:Palo

I’ve spend a few hours trying out the latest Kettle 2.5.0 RC1 release candidate, new UI and lots of new features. Looks like the PALO code developed by 3a-strategy will not make into this release, but I see Cubeware have released IMP:PALO cube loading software, offering both a free and a premium professional version. Looks promising; I’ve just installed the software (lots of form filling, recovering Emails from GMail’s spam bucket and an activation process to go through first – I’m exhausted!) so I haven’t managed to try it out yet. I hadn’t heard of this German company until today, anybody out there using IMP:Palo or any other Cubeware products?

UPDATE:
I managed to get IMP:Palo up and running; went through the tutorial in the Help file and I’m impressed. The tool is very Kettle like, better from a UI perspective in fact, but of course it’s not open source and I’m not sure if it’s free ware or “trial ware” as its not clear what happens when the temporary activation ends next December! But it makes building MOLAP cubes very easy indeed. I’ve included a link to a finished “import definition” file as the demo MS Access data supplied is in a bit of a mess or rather the Product table is (missing GroupIDs – you’ll need an LEFT OUTER JOIN in the Products Mapping – and missing Products). First rule of a demo, make it easy, make it fool proof, otherwise you risk loosing your audience, but I’m glad I persevered, this is a good product!

Download finished IMP:Palo demo.

One other thing, to get rid of the …

PALO Error: not authorized for operation: login error (Error -44) (palo_auth_a failed.) 

…error if you’re connecting to a localhost based Palo server,  remove (or comment out by means of a #) the line

user-login

in the c:\Program Files\Jedox\Palo\data\palo.ini file. Stop and restart the server.

VBA & JavaScript – glue languages

What have Javascript and VBA in common? Not much on the surface and their respective user bases rarely if ever overlap. What they do share are their roles as the imperative (the-if-then-else-loop-etc) programming languages of the “I’m not a programmer” programmers, the great unwashed, the “normal” people out there who are just trying to get something done. VBA adds that “something special” to millions of Excel spreadsheets and MS Access databases. JavaScript does likewise for millions of web pages. Most of the code is copied from existing examples or generated by macro recorders. Often the code is not pretty, not well structured, not adequately tested but it is usually “good enough”.

Both languages can also transcend their “ugly duckling” status. If you want to take VBA development to a new level have a look at Professional Excel Development by Bullen et al. GMail, Google Docs & Spreadsheets and the other AJAX powered applications of Web2.0 show the power of JavaScript. Both languages are now starting to move beyond their original home bases, particularly in the case of JavaScript.

VBA has been used in the past as the scripting language for corporate applications such as Siebel; now in an inspired move, Proto opted to use VBA as the scripting environment in their corporate focused mashup tool.

JavaScript is however showing the real expansionist streak. Along side its traditional client-side role in web page DOM manipulation, Mozilla Firefox extensions, widgets and within Flash applications under its ActionScript implementation, it can now be found providing the imperative flourish to many client and server side declarative tools (e.g. Kettle ETL, OpenLaszlo and Grazr). You can even use JavaScript to provide the complete server-side logic for a web application/service by using Zimki.(deadpool)

My decision to spend time to improve my JavaScript skills was as a result of my experience with Zimki. I initially created a web service to sign requests for uploading files to my Amazon S3 account; allowing me to give clients a small executable (developed in VBA’s big brother VB6) with which to securely upload data to S3 without the need to expose my Amazon account credentials. This turned out to be so simple and elegant with Zimki’s Rhino based JavaScript environment that I decided this language deserved some quality time. I had I guess, like many others, confused the mess that is the browser-wars scared DOM with JavaScript (or more correctly, EMACScript) the language. Having looked further, I’ve discovered an elegant,simple yet powerful language that I’m tempted to think may become the universal glue-language.

UPDATE:

As Tom from Fotango (the Canon Europe company behind Zimki also in deadpool) points out, Zimki uses Mozilla’s C implementation of JavaScript, SpiderMonkey rather than Rhino.

UPDATE: Sep 2007

Cannon Europe are to close the Zimki service on Dec 24th, 2007

UPDATE: June 2010

See how I combined VBA and JavaScript, to make JavaScript an Excel scripting language http://blog.gobansaor.com/2010/06/04/javascript-as-an-excel-scripting-language-via-jsdb/

PALO plugin for Kettle (Pentaho) ETL

The much awaited Palo plugin for the Kettle ETL tool has been released. Oh happy days!

Palo is an open source MOLAP database developed by the German company Jedox. Although it doesn’t the match the power of established OLAP engines such as Essbase and many simple cross-tab/pivot requirements can be handled by an Excel Pivot table; it is free and it is very useful in mid-sized Budget and Planning scenarios where unlike Excel pivot tables it allows write back to a multi-user back-end database.

The plugin is the work of another German company 3a strategy. This is a great addition to both the Kettle and Palo communities.

Update:

Unfortunately it’s not a plugin but a code addition to Kettle, according to Alex on the Palo forum

>>Is this plug-in compatible with Kettle version 2.4? Or it is just for 2.3.1 version.

That’s not a plugin actually… Our _version_ was created on 3.2.1 codebase Probably we’ll make migration to 2.4 codebase but not in the nearest future

The add-in/extension (not sure what to call it) looks very comprehensive but maybe a simpler version built as plugin would suit more people, but then again 3a strategy developed this for their own purposes and were kind enough to share it with others.

Further Update:

Matt Casters over on the Palo forums has recommended not using the Palo extension ..

The folks from Tripple-A-Strategies have created a modified version of the Kettle codebase with the PALO functionality in it, rather than a plugin, but we are working with them to resolve this issue and help them achieve compliance with the LGPL license…

So there is something you can use to access PALO, but I would recommend against using it. They are a major version and hundreds of bug fixes behind on the latest development in Pentaho Data Integration a.k.a. Kettle

but it looks like the Palo code might be integrated into the next Kettle release.

Update (April 24)

Looks like no Palo support in the latest Kettle release candidate, but there’s another kid on the block: PALO:Imp, check it out, it’s the business.

Kettle and SQLite

Matt Casters has added SQLite support to Pentaho’s Kettle ETL tool in the latest development release. I’ve tested it under Windows using JRE 1.5.0_09 and it worked fine but having upgraded to JRE 1.5.0._10 I’m now getting “out of memory” errors, appears to be a problem with the “pure java” jdbc driver and not with Kettle itself.

Update:

Problem persists with JRE 1.5.0._11, but the bug appears to be in how the driver responds to window’s drive letters, so a work-around is to use unix-style folder addressing. For example, if your database is “c:\kettle\data\mydatabase.db” and Kettle is installed in a folder “c:\kettle\kettle-latest-version”, use the following relative address “../data/mydatabase.db” or the absolute form “/kettle/data/mydatabase.db” (note the use of unix style slash).

Update:

Bug now fixed.