Category Archives: kettle

LiteBI, Heavy ETL

Although my major BI interest is in micro-BI (or is that workgroup-BI?), i.e. data, perhaps cleansed and packaged elsewhere, available locally on a datasmith’s PC, most likely with an in-memory OLAP as the analysis tool, the possibilities of the “cloud” as a BI platform have not escaped me.

From a micro-BI perspective, the cloud’s ability to act as a backup/mirroring tool or as an ETL/marshaling tool (anybody for Hadoop and SQLite?) is attractive. I’ve yet to make up my mind on BI delivered as a cloud PaaS, but obviously many others believe it has a future.

My main worry with PaaS is not lock-in (which exists equally for in-house proprietary solutions) but the dangers of a Coghead-like lock-out. My other doubts are more technical: I believe that in-memory offers significant advantages over traditional ROLAP (simplicity being the main one), and that multi-tenant in-memory architectures are not yet a runner. But last week I had a demo of a new Spanish BI PaaS service, LiteBI, which might just change my mind.

Javier Giménez Aznar and his team previously worked on delivering Pentaho-based data warehouses to large Spanish corporations and government agencies, so they have a deep understanding of Mondrian ROLAP and are using that knowledge to build the LiteBI service, but this time with SMBs as the target customers rather than corporates. Pricing starts at €145 per month and is based on the number of concurrent users, the number of analytical spaces and the data volumes, so it’s not for very small firms; it’s more for the Medium in SMB.

Impressions? The cube designer, dashboard builders and the general UI are all very good. I would think they would appeal to end-user datasmiths and, as such, will be a major up-front aid to selling this product. But it was LiteBI’s approach to the thorny issue of ETL and data loading that impressed me and also helped ease some of my Coghead-induced fears.

BI technology stacks consist of three elements:

  • The “fancy” front-end: graphs, animated dashboards and so on.
  • The pivot engine; ROLAP or MOLAP or both.
  • The ETL process.
  • (Many would say there’s an important 4th, the data warehouse. Not every BI effort requires one, but that’s another issue.)

LiteBI is continuing to build yet more functionality into their UI and this “fancy” front-end is essential as it’s their “shop window”.

Mondrian provides their pivot engine, and again they continue to work on optimisations such as column-based datastores to increase speed and automate responsiveness tuning (end-users are very unforgiving of slow pivots).

But it’s in the 3rd area, that of the ETL process, that you realise the LiteBI team has real-world BI experience.  Data is loaded into LiteBI via an API, but with the ETL process itself happening on the customer side.

“Well, so what?” you may ask. The extraction of data obviously has to happen customer-side (though not in the case of data being sourced from the likes of SalesForce.com). Yes, but it’s the transformations and data cleansing that add true value to the ETL process and subsequently determine the quality and usefulness (as opposed to the speed or the “prettiness” of delivery) of the solution.

Part of the process of adopting LiteBI is an ETL consultancy stage, where a LiteBI partner company will provide on-site services to build this ETL layer, handling not just transformations but also the initial load and the automation of subsequent delta uploads.

So the cost mounts up, but in reality you can’t do BI without this investment; there’s no ETL magic bullet. Even so, Javier says the typical go-live time for a LiteBI project would be in the order of 3-4 weeks rather than the 3-4 months of similar on-site Pentaho projects.

The end-user ‘owning’ the ETL process makes the prospect of a service lock-out slightly less worrying as, at least, one would still have a good starting point for moving to another provider or back in-house. What I would really like to see would be the option to self-host LiteBI, which I guess would involve open sourcing large parts of the service (the automated optimisation strategies could, for example, be excluded from this open source version).

The load API comes packaged as a plugin to Kettle (aka PDI) and the intention is to offer a similar add-on for Talend in the near future. LiteBI also has a white-label option whereby 3rd party OLTP solution providers can use the service as their product’s BI suite.

Like the Skibbereen Eagle keeping its eye on the Czar of Russia, I too will be keeping a watchful eye on LiteBI and the march of on-demand BI in general.

Why not join me on Twitter at gobansaor?

Pentaho Data Integration (Kettle) V Talend Benchmark

Pentaho’s Matt Casters has just published a benchmarking exercise comparing Kettle and Talend. In it he admits he’s not a Talend expert and he advises that people should perform their own benchmarks where possible, as requirements differ. Nevertheless, unlike most other benchmarks we’ve seen on the subject, he publishes not just the results but the actual transformation “code” used in the tests.

For many people these benchmarks are of no real interest; as long as the product does what is required within the time and resources available, they’re content. But it would be a mistake to think that benchmarks don’t matter. They do; people have made, and will make, that final decision based on them. Remember, ETL is not life and death, so the decision on which tool (if any) to go with may not get the level of investigation that the developers behind such products expect of their potential clientele, and this is particularly true of open source. Busy people will use such reports to direct them down a path or to confirm their existing prejudices. So I’m really glad to see Matt responding and, in particular, responding in the manner he has.

Database vendors have for years played the benchmarking game, setting and breaking records either via real technological advances or by simply gaming the process. We as purchasers and users knew in many cases to take the results with a large dose of salt, but purchasing decisions were nevertheless made on the backs of these surveys.


Open Source Metrics and Benchmarks

Marc Russel’s blog links to a Manapps ETL benchmark report comparing the performance of several leading ETL tools, both proprietary (DataStage and Informatica) and OS (Talend and PDI, aka Kettle). As would be expected, each tool has its own strengths and weaknesses, but one thing stands out: the venerable Kettle ETL, aka PDI 3.0, is now a serious contender for handling very large datasets. Oops, that’s what I get for wishing for a result and (mis-)reading the report early in the morning with a cold and a bad sore throat; sadly, PDI is still very much slower than its OS cousin Talend. In fact, Talend continues to play on the strength that comes from a code-generated solution, i.e. raw speed. As a pure ETL play, Talend is well capable of playing on the same pitch as the “big kids”.

Interestingly, the report is also “open source” as it’s released under a Creative Commons License, so I can link to it here.

UPDATE:

There’s now a new version of the report available (www.manapps.com, Topic Benchmark); it seems the original was just a work-in-progress and was not meant for public release. The main difference appears to be a significant improvement in Informatica’s ‘score’, but I’m not sure, as I was really only interested in comparing the two OSS products, Talend and Pentaho PDI. In that ‘battle’ Pentaho still comes out ‘slower’.

The original Marc Russel blog entry and a subsequent one covering the updated report both appear to have been removed.

Also, I was informed of the ‘updated’ report via this email from manapps, which assures vendors that they are happy to rerun any tests and provide any information re the running of such tests … 

Dear Sir,

You referred on your web site to the report called “Benchmark ETL” by Manapps, from November 2008. This draft report was not intended to be publicly released since just a working document.
We would like you (i) publish Asap the modified version (or its related link) that supersedes the former one (on our web site (www.manapps.com, Topic Benchmark), (ii) state that Manapps had no intend to release the former report and accordingly takes no responsibility on its content, (iii) state that Manapps holds all necessary elements at the disposal of all vendors so that they can rerun some tests if wished that will then be published.

Regards,
Philippe THOMAS

Time: Thursday March 5, 2009 at 5:10 pm

 

Another analysis of OSS in the wild, this time from Chris Keene, WaveMaker CEO, on OSS as a marketing tool. Bottom line: 1% conversion rate, 700 paying customers in 9 months …

WaveMaker OSS as a marketing tool



New universal SQLite JDBC library.

Both Talend (Java) and Kettle distribute the Zentus.com pure-Java SQLite JDBC driver, and for most purposes this run-anywhere version is fine. But if you really need to take advantage of SQLite’s speed, then connecting using the native JNI version is a must. Doing this was easy enough: just change over to a generic JDBC connection, specifying the required native jar, and place the associated dll/so on your system path.

But now there’s an easier way: the latest version (V052, in fact from V050 on) is a universal jar that contains native JNI libraries for Windows, Linux and MacOS alongside the pure-Java version. It will automatically pick the correct lib for the platform and fall back to the pure-Java version if required. You can tell if it’s picked up the native lib by calling conn.getDriverVersion(); it’ll return “native” if it has.
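The check described above might look something like the following in plain JDBC. This is only a sketch under assumptions: it assumes the driver also reports the same string through the standard DatabaseMetaData.getDriverVersion() call (the driver’s own connection class may need a cast instead), and the class and method names SqliteDriverCheck, driverVersion and isNative are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class SqliteDriverCheck {
    /** Asks the driver for its self-reported version string via standard JDBC metadata. */
    static String driverVersion(Connection conn) throws SQLException {
        // Assumption: the driver exposes "native" here as well as on its own connection class.
        return conn.getMetaData().getDriverVersion();
    }

    /** Interprets the reported string; "native" is the value the post says the JNI build returns. */
    static boolean isNative(String reportedVersion) {
        return "native".equalsIgnoreCase(reportedVersion);
    }

    public static void main(String[] args) {
        try {
            // Driver class shipped in the sqlitejdbc jar; the jar must be on the classpath.
            Class.forName("org.sqlite.JDBC");
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:")) {
                String version = driverVersion(conn);
                System.out.println("driver reports: " + version
                        + (isNative(version) ? " (JNI library in use)" : " (probably the pure-Java fallback)"));
            }
        } catch (ClassNotFoundException | SQLException e) {
            System.out.println("SQLite JDBC driver not usable: " + e);
        }
    }
}
```

Logging this once at the start of a job is a cheap way to notice when a deployment has silently fallen back to the slower pure-Java build.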

To upgrade to this jar in Kettle see this, this time replacing the nested jar with sqlitejdbc-v052.jar.

For Talend:

  • Either rename the new V052 jar to sqlitejdbc_v037_nested.jar and replace the existing V037 jar in the ../lib/java folder with this new renamed file.
  • Or, you could edit the Java specific XML files in the various tSQLite component folders, replacing the references to the old nested V037 jar.
  • Or, and this is what I would do, don’t use the tSQLite components, replace them with tJDBC generic components, then you can pick whatever version of the driver you require, you could even change to a different database provider!

The Talend tradition of a separate set of components for each type of database seems to be a hangover from its Perl-generating roots. It’s true that database specific components are required for certain tasks such as bulk-loading, ELTs and so on, but JDBC was designed to be generic and, as long as the SQL syntax is compatible, it makes switching database providers in and out very easy. So unless there’s a good reason, stick to using tJDBC.
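To illustrate why generic JDBC makes swapping providers easy, here is a minimal sketch in which the only database-specific pieces are two strings. The names GenericJdbc, Target, open and countRows are hypothetical, as are the sample.db file, the customers table and the commented-out PostgreSQL URL; this is not how the tJDBC component is implemented.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class GenericJdbc {
    /** The only database-specific settings: driver class name and JDBC URL. */
    static final class Target {
        final String driverClass;
        final String url;

        Target(String driverClass, String url) {
            this.driverClass = driverClass;
            this.url = url;
        }
    }

    /** Loads the named driver (its jar must be on the classpath) and opens a connection. */
    static Connection open(Target target) throws ClassNotFoundException, SQLException {
        Class.forName(target.driverClass);
        return DriverManager.getConnection(target.url);
    }

    /** Provider-neutral SQL; tableName is concatenated, so it must come from trusted config. */
    static long countRows(Connection conn, String tableName) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + tableName)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) {
        // Hypothetical targets: swap the first for the second to change provider.
        Target sqlite = new Target("org.sqlite.JDBC", "jdbc:sqlite:sample.db");
        // Target postgres = new Target("org.postgresql.Driver", "jdbc:postgresql://localhost/sample");
        try (Connection conn = open(sqlite)) {
            System.out.println(countRows(conn, "customers"));
        } catch (ClassNotFoundException | SQLException e) {
            System.out.println("connection failed: " + e);
        }
    }
}
```

Everything below open() is unaware of which database it is talking to, which is the property that makes a generic component attractive as long as the SQL stays portable.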

Regular Expressions as an end-user programming tool?

“What? Have you completely lost the plot, Gleeson?”, I hear you scream.  Jamie Zawinski’s famous quote is intoned once more ..

Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.”   Now they have two problems.

Of course the above quote could be (and probably has been) changed to…

Most business people, when confronted with a problem, think
“I know, I’ll use a spreadsheet.”   Now they have two problems.

They are dense, single-line, single purpose, self contained mini-programs.  The previous statement applies to regular expressions but could equally be used to describe the single most popular end-user programming tool, spreadsheet formulae (particularly in their nested form!).

As somebody with the “programming gene” (something most, but not all, IT professionals possess, as do a significant proportion of “civilians”), such compressed logic somewhat grates compared with the power and elegance of more expressive programming languages, but that hasn’t stopped me using both spreadsheet formulae and regex to quickly and effectively solve problems when the need arose.

Those without the programming gene (the vast majority of business users), find traditional programming languages next to impossible to get their heads around yet find spreadsheet formulae approachable and useful.  It seems to be something to do with approaching problems as a series of simple problems and not loading the whole problem domain into one’s brain at one sitting (as most programmers and system designers are capable of).

In the past, non-programmers would rarely come in contact with regex, as its use was possible only within the realms of professional programming or Unix sys-admin toolsets (sed, awk, etc.). But now, ETL tools such as Kettle and Talend allow end-users to use regular expressions without the need to understand the underlying programming language. Taking this to the next step, Talend’s new data profiling product uses regular expressions as its main discovery language. They could, I guess, have invented yet another XML dialect and/or query-by-example dialogue, but instead they’ve taken the sensible (and cheaper) option and exposed the full power of raw regular expressions.
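As a rough idea of what regex-driven profiling amounts to, the sketch below measures what share of a column’s values fit a pattern. It is an illustration only, not Talend’s implementation; the class RegexProfile, the method matchRatio and the sample column are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexProfile {
    /**
     * Fraction of all values that match the pattern in full.
     * Null values count as non-matches; an empty column yields 0.0.
     */
    static double matchRatio(List<String> values, Pattern pattern) {
        if (values.isEmpty()) {
            return 0.0;
        }
        long hits = values.stream()
                .filter(v -> v != null && pattern.matcher(v).matches())
                .count();
        return (double) hits / values.size();
    }

    public static void main(String[] args) {
        // Made-up sample column with mixed date formats and a placeholder value.
        List<String> column = Arrays.asList("2009-03-05", "05/03/2009", "2008-11-30", "n/a");
        Pattern isoDate = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
        System.out.println("ISO-date share: " + matchRatio(column, isoDate)); // 2 of 4 values match
    }
}
```

The end-user only has to supply the pattern itself; the surrounding loop is the part the tool hides.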

Will the great unwashed embrace regex in the same way they took to nested Excel functions, embarrassing their professional colleagues with yet more amateurish and often unmaintainable, messy solutions that just work? I think they just might…