Category Archives: kettle

Google Spreadsheets – ETL tool

Although I’m a total Excel fanboy, I most admit I rarely use it any longer for personal stuff such as home budgets, tax calculations, what-ifs, to-do lists etc.; I now tend to use Google Spreadsheets. Likewise, personal notes, drafts and useful bits of code are stored using Google Docs rather than MS Word. Three main reasons for this shift to the cloud:

  • Google Docs & Spreadsheets are ‘good enough’ for most of the trivial lists and calculations I require in my personal life and indeed for most business purposes as well, at least those that don’t require a pivot table.
  • These spreadsheets and documents are important but not necessarily in the ‘state secret/I-could-tell-but-then-I’d-have-to-kill-you’ scale of things, by building them in Google Apps they are securely backed-up and easily accessible.
  • A lot of the spreadsheets are collaborative in nature, and in the collaboration field, Google Spreadsheets just gets better and better.

Today, Google announced further additions to their spreadsheet product. The AutoFill feature adds functionality I’ve come to expect from Excel, but with a twist, integration with Google Sets. But the additions that really caught my eye were the new data import functions. Now again, Excel has had web queries since Excel97, and it always amazed me why online pretenders to the throne tended to ignore the most common source of tabular data on the web, the HTML table; something to do with the great XML/Tables divide I guess!

Google now not only fixes this omission,providing access to HTML tables and comma/tab separated file, but also provides access to RSS/ATOM and generic XML sources. All that’s missing now are functions that can read other common online data files formats such as Excel, MSAccess, XBase and of course SQLite.

This addition of HTML import support and the AutoFill feature will further reduce the number of times I’ll need to fire up Excel for personal tasks, but the RSS/ATOM/XML import feature also has potential as a tool in my micro-ETL toolbox. Using Excel as my only micro-ETL tool is possible when the data is either already in Excel/CSV or accessible via a COM API or via ODBC drivers, otherwise I can call-in either Ruby, Talend, Kettle or even RSSBus. But now I’ve another option, if the data is public and published as RSS/ATOM or some other variation on XML, I can use Google Spreadsheets to fetch the data and import the resulting tabular dataset into Excel via a Web Query or via the GData API.

New Google Reader Search facilityOne other thing. While researching this post, looking up links etc. I used another new feature Google added today, Google Reader’s new Search facility. As most of my references are discovered via the blogs I subscribe to, the ability to restrict searches to that subset of the web is fantastic; I even used it to search through my own blog posts! If del.icio.us offered the same option it would make re-finding stuff even easier. I did try to use Google Co-Op to build a search engine restricted to my del.icio.us links but it didn’t seem to like the volume of links (4000 odd) I sent it.

Apatar – a few extracts short of a load

I’ve been meaning to try out the Apatar ETL/Mashup tool for sometime and today being yet another rainy day in this the worst Irish summer that I can remember (and Irish summers are not renowned for the lack of rainfall) I decided to give it a try out. Not impressed I’m afraid; comes up short when compared to either Kettle (Pentaho PDI), Proto, RSSBus or Talend. Very few database connectors (e.g. no SQLite,DB2 or Firebird support) this wouldn’t be a problem if the product offered a generic JDBC or ODBC connector. It does have one nice feature the others (other than RSSBus) haven’t, an Amazon S3 connector. But the thing that I find amazing is that a product that’s positioning itself as an Enterprise 2.0. mashup tool doesn’t have the ability to read and write Excel files! And no, CSV files don’t count.

Google Gears – SQLite Killer App

The announcement of Google Gears is of course a game changer for those working in the development of online apps; its addition to Goggle Reader alone would make it worth while for me and I’m sure we’ll see it integrated into Google Docs and GMail in the near future. If you had any plans to develop a web based app or already deploy one you need to give this technology some quality time.

But it is the use of SQLite as the client-side persistence engine that excites this datasmith’s old heart. Since first coming across SQLite (while learning Ruby on Rails) I’ve been convinced that this “good enough” micro-database on steroids was a winner. Since then, as well as using it with Ruby ,I’ve integrated it with excel to use as a micro-ETL tool, I was instrumental in getting SQLite support added to Kettle (Pentaho PDI), I wrote a SQLite custom module for Proto, in fact the first thing I now do when checking out a data manipulation tool is to check if it supports SQLite.

I’m looking forward to investigating further what Google Gears can add to my existing datasmithing tool set, and I’m certainly glad I invested the time in learning JavaScript. Exciting times.

Talend vs. Kettle (Pentaho PDI)

Over the last few weeks I’ve received a lot of traffic from Goggle searches comparing Talend and Kettle and also from Vincent McBurney’s ITtoolbox article comparing the two products, so where do I stand?

As ETL tools they take different approaches, Kettle is a meta data driven framework (which is in turn tightly integrated into an even larger BI framework, the Pentaho BI Project), while Talend is at the end of the day a code generator. The code generator nature of Talend does impart certain capabilities such as the ability to easily integrate into other BI platforms (such as JasperSoft and SpagoBI ) and as micro-ETL tool. On the other hand Kettle’s increasing integration into the Pentaho family makes it natural choice for larger BI implementations. So which one would I recommend ? It’s got to be Kettle and the main reason for that choice is the Pentaho PDI community.

My experience so far with the Talend community has not been good. Admittedly I’ve only asked two questions, one looking for more details on a statement regarding ERP and CRM connectors mentioned in a press release and the other a technical query about a new tSQLite connector. Neither were answered. OK I’m a big boy I can handle the rejection and I’m also technically savvy, I can eventually fix most of my own problems. But I’m sure if I posted similar questions on the Pentaho forums I would have received an answer, probably from Matt Casters if he though the general community would not be able to provide the answer.

So will this experience stop me from using Talend? Maybe not, the code generation capability may come in handy as a micro-ETL tool (but I do have other options); however, through an offline conversation with Samatar (an active member of the Kettle community) and Matt Casters I learned of a new XML input step and of a PDI alternative to the Talend tJavaFlex component, the “Modified Java Script Value” step. This allows for the creation of START, TRANSFORM (i.e. for-each-row) and END JavaScript scripts. As Kettle uses the Rhino scripting engine I have access to any Java API (e.g. Palo,S3, Google Spreadsheets) but with the flexibility of JavaScript; brilliant!

Not only that, since my experience with Talend tJavaFlex reintroduced me to the world of Java I though I might as well look into creating my own Kettle plugin and guess what, it looks easy enough. So I’m going to develop my first plugin to be based on the ScriptValueMod step, replacing the Rhino engine with the Java 6 ScriptEngineManager. This will give me access to the latest version of Rhino which supports E4X (will make handling XML must easier) and also opens the possibility of using JavaFX or indeed JRuby as Kettle scripting languages.

UPDATE:

I’ve received a reply to my tSQLite question!

UPDATE: July 2008

This post is now over a year old, Talend as a product and as a community has improved enormously in the last 12 months; so much so, I now use Talend  (Java) in preference to PDI for most of my datasmithing needs.

I’ve got talend and I’m going to use it…

For the last few months I’ve being looking for my ideal ETL platform. That ideal would be open source, platform independent (well at least Windows and Linux), flexible, and easily deployable. It had looked like a combination of Kettle and my micro-ETL combinations of Ruby/SSQLite and Excel/SQLite would be the eventual “winners”. That was until I discovered (or rather rediscovered) Talend, to be more precise the just released Open Studio V2.0.

Having spent a few hours this weekend getting to know the new Java Project features I’ve come away a Talend fan-boy. I think I’ve found my ETL tool, one suitable for both heavy-duty server-side processing and client-side micro-ETL tasks. The tJavaFlex component allowed me to add missing functionality with ease; for example, I managed to..

…all within two hours of first starting to evaluate the product. (OK, I had evaluated the Perl version several months ago, so I wasn’t a total noobie!).

Because Talend is a code generator (Perl or Java) I can store my work as a ‘tool-independent-snapshot’ for future use, modify by hand if need be, and use it on any JRE or Perl supporting OS. It’s not that I’ll stop using the other solutions (the Excel/SQLite xLite utility has proved to be particularly useful and flexible) but Talend looks like it will become the tool-of-first-preference (along side Excel of course!) in the Gobán Saor’s datasmithing armory.