Category Archives: Talend

LiteBI, Heavy ETL

Although my major BI interest is in micro-BI (or is that  workgroup-BI?)  i.e. data, perhaps cleansed and packaged elsewhere, available locally on a datasmith’s PC,with most likely an in-memory OLAP as the analysis tool; the possibilities of the “cloud” as a BI platform have not escaped me.

From a micro-BI perspective, the ability to act as a backup/mirroring tool or as ETL/marshaling tool (anybody for Hadoop and SQLite?) attracts. I’ve yet to make up my mind on BI delivered as a cloud PaaS but obviously many others believe it has a future.

My main worry with PaaS is not lock-in (which exists equally for in-house proprietary solutions) but the dangers of a Coghead-like lock-out.  My other doubts are more technical; believing, as I do, that in-memory offers significant advantages over traditional ROLAP (simplicity been the main one) and multi-tenant in-memory architectures are not yet a runner.  But last week I had a demo of new Spanish BI PaaS service, LiteBI, which might just change my mind.

Javier Giménez Aznar and his team previously worked on delivering Pentaho based datawarehouses to large Spanish corporations and government agencies, so they have a deep understanding of Mondrian ROLAP and are using that knowledge to build the LiteBI service, but this time with SMBs as the target customers rather than corporates. Pricing starts at €145 per month and is based on number of concurrent users, number of analytical spaces and the data volumes, so it’s not for very small firms more for the Medium in SMB.

Impressions? The cube designer, dashboard builders and the general UI are all very good and I would think would appeal to end-user datasmiths and, as such, will be a major up-front aid to selling this product.  But it was LiteBIs approach to the thorny issue of ETL and data loading that impressed me and also helped ease some of my Coghead-induced-fears.

BI technology stacks consist of three elements:

  • The “fancy” front-end; graphs,animated dashboads and so on.
  • The pivot engine; ROLAP or MOLAP or both.
  • The ETL process.
  • (Many would say there’s an important 4th, the data-warehouse, but not every BI effort requires one, but that’s another issue)

LiteBI is continuing to build yet more functionality into their UI and this “fancy” front-end is essential as it’s their “shop window”.

Mondrian provides their pivot engine, and again they continue to work on optimisations such as column-based datastores to increase speed and automate responsiveness tuning (end-users are very unforgiving of slow pivots).

But it’s in the 3rd area, that of the ETL process, that you realise the LiteBI team has real-world BI experience.  Data is loaded into LiteBI via an API, but with the ETL process itself happening on the customer side.

“Well,so what?” you may ask. The extraction of data has to obviously happen customer-side (even though not in the case of data being sourced from the likes of SalesForce.com). Yes, but it’s the transformations and data cleansing that adds true value to the ETL process and subsequently determines the quality and usefulness (as opposed to the speed or the “prettiness” of delivery) of the solution.

Part of the process of adopting LiteBI, is an ETL consultancy stage where a LiteBI partner company will provide on-site services to build this ETL layer, handling not just transformations but initial load and automating the subsequent delta uploads.

So the cost mounts up, but in reality you can’t do BI without this investment; there’s no ETL magic bullet.  Even still, Javier says the typical go-live time for a LiteBI project would be in the order of 3-4 weeks rather than the 3-4 months of similar on-site Pentaho projects.

The end-user ‘owning’ the ETL process makes the prospect of a service lock-out slightly less worrying as, at least, one would still have a good starting point for moving to another provider or back in-house. What I would really like to see would be the option to self-host LiteBI, which I guess would involve open sourcing large parts of the service (the automated optimisation strategies could, for example, be excluded from this open source version).

The load API comes packaged as a plugin to Kettle (aka PDI) and the intention is to offer a similar add-on for Talend in the near future. LiteBI also offers a white-label offering whereby 3rd party OLTP solution providers can use the service as their product’s BI suite.

Like the Skibbereen Eagle keeping its eye on the Czar of Russia, I too will be keeping a watchful eye on LiteBI and the march of on-demand BI in general.

Why not join me on Twitter at gobansaor?

Talend ETL Excel report generator

Hugo, who you may remember from his OLAP Cube as a Mind Map project, has struck again. This time something really useful, a component for the Talend ETL platform that generates Excel reports using templates and a JSP style TAG language to control the output.

I’ve in the past used the excellent Xlsgen to automate the production of Excel reports, but Hugo’s component has the benefit of being free (xlsgen now costs €390!) and open source and it also taps into the vast world of existing Talend ETL components.

Well done Hugo.

Why not join me on Twitter at gobansaor?

Pentaho Data Integration (Kettle) V Talend Benchmark

Pentaho’s Matt Caster has just published a benchmarking exercise comparing Kettle and Talend.  In it he admits he’s not a Talend expert and he advises that people should perform their own benchmarks where possible as requirements differ.  Nevertheless, unlike most other benchmarks we’ve seen on the subject he publishes not just the results but the actual transformation “code” used in the tests. 

For many people these benchmarks are of no real interest as long as the product does what is required within the time and resources available they’re content.  But it would be a mistake to think that benchmarks don’t matter, they do; people have and will make that final decision based on them.  Remember ETL is not life and death, the decision which tool (if any) to go with may not get the level of investigation that the developers behind such products expect of their potential clientele and this is particularly true of open source.  Busy people will use such reports to direct them down a path or to confirm their existing prejudices. So I’m really glad to see Matt responding and in particular, responding in the manner he has.

Databases vendors have for years played the benchmarking game, setting and breaking records either via real technological advances or simply gaming the process.  We as purchasers and users knew in many cases to take the results with a large dose of salt, but purchasing decisions where nevertheless made on the backs of these surveys.

Why not join me on Twitter at gobansaor?

Open Source Metrics and Benchmarks

Marc Russel’s blog links to a Manapps ELT benchmark report comparing the performance of several leading ETL tools both proprietary (DataStage and Informatica) and OS (Talend and PDI (aka Kettle)).  As would be expected each tool has their own strengths and weaknesses, but one thing stands out, the venerable Kettle ETL aka PDI 3.0 is now a serious contender for handling very large datasets.  Oops, that’s what I get for wishing for a result and (mis-)reading the report early in the morning with a cold and bad sore throat, sadly PDI is still very much slower that its OS cousin Talend. In fact, Talend continues to play on the strength that comes from a code generated sloution, i.e. raw speed.  As a pure ETL play, Talend is well capable of playing on the same pitch as the “big kids”. 

Interestingly, the report is also “open source” as it’s released under a Creative Commons License, so I can link to it here.

UPDATE:

There’s now a new version of the report available (www.manapps.com, Topic Benchmark), it seems the original was just a work-in-progress and was not meant for public release.  The main difference appears to be a significant improvement in Informatica’s ‘score’, but I’m not sure as I was really only interested in comparing the two OSS products, Talend and Pentaho PDI, in that ‘battle’ Pentaho still comes out ‘slower’.

 The original Marc Russel blog entry and a subsequent one reporting the new updated report appear to have both been removed.  

Also, I was informed of the ‘updated’ report via this email from manapps, which assures vendors that they are happy to rerun any tests and provide any information re the running of such tests … 

Dear Sir,

You referred on your web site to the report called “Benchmark ETL” by Manapps, from November 2008. This draft report was not intended to be publicly released since just a working document.
We would like you (i) publish Asap the modified version (or its related link) that supersedes the former one (on our web site (www.manapps.com, Topic Benchmark), (ii) state that Manapps had no intend to release the former report and accordingly takes no responsibility on its content, (iii) state that Manapps holds all necessary elements at the disposal of all vendors so that they can rerun some tests if wished that will then be published.

Regards,
Philippe THOMAS

Time: Thursday March 5, 2009 at 5:10 pm

 

Another analysis of OSS in the wild this time from Chris Keene, WaveMaker CEO, on OSS as a marketing tool. Bottom line, 1% conversion rate, 700 paying customers in 9 months …   

WaveMaker OSS as a marketing tool

WaveMaker OSS as a marketing tool

 

Why not join me on Twitter at gobansaor?

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of  Oracle10g  Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL) ,OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem, Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these new shiny (and open source) “toys” I’ve found that rather than just been incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy, is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool, via JDBC is can access every database type on the planet, it has built-in web-service capability and access to a  multitude of APIs via its Java component for non-database data sources.  The addition of  Groovy makes the use of such Java APIs simpler and quicker and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to plublish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name,customer, sales_region, getpalodata(“SALES”,customer_id,”All Products”.”Total Sales”,”Euros”,”YTD”)  as customer_YTD, getpalodata(“SALES”,sales_region,”All Products”.”Total Sales”,”Euros”,”YTD”)  as region_total_YTD from list_of_top_customers;

… where getapalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, ( as SQLite lacks star-query (or even multi-index query) support.)  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or for ad-hoc queries. This is powerful stuff, simple, free, powerful stuff.

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with a SSH VPN (or Hamachi) to provide security).
  • At start-up, do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc. Storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then setup a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations which would return a dataset directly to the calling client or maybe just acknowledge the request, queue it up for processing later, sending the resulting dataset by EMail or RSS feed when finished.
  • At a fixed time the service would shut it self down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point, build what you need, add the levels of security (or none at all) that fits your situation, all within a open framework, with zero lock-in (okay, still using Excel, anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server, host it on an EC2 instance, (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).

Now tell me that doesn’t make sense?

New universal SQLite JDBC library.

Both Talend (Java) and Kettle distribute the Zentus.com pure-Java SQLite JDBC driver and for most purposes this run-anywhere version is fine. But, if you really need to take advantage of SQLite’s speed then connecting using the native JNI version is a must.  Doing this was easy enough, just change over to using a generic JDBC connection specifying the required native jar and placing the associated dll/so on your system path.

But now there’s an easier way, the latest version (V052, in fact from V050 on) is a universal jar, it contains native JNI libraries for Windows, Linux and MacOS alongside the pure-Java version.  It will automatically pick the correct lib for the platform and fall back to the pure-Java version if required.  You can tell if it’s picked up the native lib by calling conn.getDriverVersion(); it’ll return “native” if it has.

To upgrade to this jar in Kettle see this, this time replacing the nested jar with sqlitejdbc-v052.jar.

For Talend:

  • Either rename the new V052 jar to sqlitejdbc_v037_nested.jar, replace the existing V037 jar in the ../lib/java folder with this new renamed file.
  • Or, you could edit the Java specific XML files in the various tSQlite component folders, replacing the references to the old nested V037 jar.
  • Or, and this is what I would do, don’t use the tSQLite components, replace them with tJDBC generic components, then you can pick whatever version of the driver you require, you could even change to a different database provider!

The Talend tradition of a separate set of components for each type of database, seems to be a hangover from its Perl-generating roots. It’s true that database specific components are required for certaing tasks such as  bulk-loading, ELTs and so on, but JDBC was designed to be generic and as long as the SQL syntax is compatible, it makes switching in an out database providers very easy.  So unless there’s a good reason, stick to using tJDBC.

Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool I still had one major problem with it, its lack of a scripting tool.  Kettle (Pentaho PDI) has Javascript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl.  I could have gone (and did experiment) with calling Javascript, Jython or JRuby via JSR223, but I wasn’t happy with the level of integration afforded by this, opting instead to make command line calls to Python (using SQLite as a data carrier).

Then, I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than is now, liked it then but couldn’t see a use for it at the time and promptly forgot about it.  Then it appeared wrapped in a Talend component, prompting me to do a quick visit to the Groovy website, which turned into a deep-dive into the language; I’d found my scripting tool!

Groovy (by the way what a terrible name for a language, or is that just me?), is not really a stand-alone language but more an extension to Java itself; offering the full power of Java but with addition of closures, builders and dynamic types.  In fact, over time Groovy has become more and more Java like (the biggest missing being lack of support for anonymous inner-classes).

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between both languages; Groovy  compiles down to JVM bytecode and can use Java classes and objects, Java can likewise use Groovy generated bytecode.  This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.

Here for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube …


[sourcecode language='Java']

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;
connection = ConnectionFactory.getInstance().newConnection("localhost","7777","admin","admin")
database = connection.getDatabaseByName("Demo");
;cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements,columnElements,]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)
// fetch data set
datas=cube.getDataArray(dataSet as Element[][])
connection.disconnect();
// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data=[]
heading=[]
// first row set to the row names (i.e. "Product name" followed by the country names )
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
   }
data << heading
// Now  out each line
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
         row << datas[((i + (j*columncount)))]
         }
    data << row.flatten()
   }

//output to csv file
def csvOut= new FileOutputStream('c:/data/File.csv' )
for (lines in data) {
         lines.eachWithIndex{col,i ->
                             if (i > 0) {
                                 csvOut << ","
                             }
         csvOut << col
     }
     csvOut << "\n"
}
csvOut.close()
[/sourcecode]

This was done in the Groovy console as a proof of concept, it was then transferred to a tGroovy component where it was parametrised and instead of outputting to a CSV file, it was used to fill the globalBuffer structure (the structure used by tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set-up a Talend job as an Excel accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and the JPalo API, both SQLite UDFs and Jetty were new to me, and that’s were scripting proves it worth, giving the developer the maximum support with the minimum of background noise.  But it’s not just weird and wonderful new APIs that scripting helps expose but as a datasmithing tool, languages such as Groovy give analysts the ability to quickly de-construct and model datasets (for example, see Groovy’s SQL database support and collections’ functionality).

As a infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff“.

Regular Expressions as an end-user programming tool?

“What? Have you completely lost the plot, Gleeson?”, I hear you scream.  Jamie Zawinski’s famous quote is intoned once more ..

Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.”   Now they have two problems.

Of course the above quote could be (and probably has been) changed to…

Most business people, when confronted with a problem, think
“I know, I’ll use a spreadsheet.”   Now they have two problems.

They are dense, single-line, single purpose, self contained mini-programs.  The previous statement applies to regular expressions but could equally be used to describe the single most popular end-user programming tool, spreadsheet formulae (particularly in their nested form!).

As somebody with the “programming gene” (something most, but not all, IT professionals possess, as do a significant proportion of “civilians”), such compressed logic somewhat grates compared with the power and elegance of more expressive programming languages, but that hasn’t stopped me using both spreadsheet formulae and regex to quickly and effectively solve problems when the need arose.

Those without the programming gene (the vast majority of business users), find traditional programming languages next to impossible to get their heads around yet find spreadsheet formulae approachable and useful.  It seems to be something to do with approaching problems as a series of simple problems and not loading the whole problem domain into one’s brain at one sitting (as most programmers and system designers are capable of).

In the past, non-programmers would rarely come in contact with regex as its use was possible only within the realms of professional programming or Unix sys-admin toolsets (sed,awk etc.).  But now, ETL tools such as Kettle and Talend allow end-users to use regular expressions without the need to understand the underlying programming language.  Taking this to the next step, Talend’s new data profiling product uses regular expressions as its main discovery language. They could, I guess, have invented yet another XML dialect and/or query-by-example dialogue, but instead they’ve taken the sensible (and cheaper) option and exposed the full power of raw regular expressions.

Will the great unwashed embrace regex in the same way they took to nested Excel functions, embarrassing their professional colleagues with yet more amateurish and often unmaintainable messy solutions, that just work? I think they just might…

What to do when Talend gets its knickers in a twist?

If you’ve done any significant amount of work with Talend you’ll undoubtedly have experienced situations where either the generated code/JETemitters or the GUI representation of a job become unstable like so…

Talend loses the plot!

The usual advice is to backup your projects (workspace/projectName) , delete the workspace/.Java (or .Perl) and workspace/.JETEmitters folders and restart Talend to force a rebuild of the generated code.  I’ve have varying degrees of success with this approach, from it fixing the problem to the other extreme of the project raising the dreaded “null pointer” error never to load again!  Also, in such a situation, don’t depend on the in-built backup facility (particularly if it’s showing errors on the export), make a copy of the project folder directly (in particular, the process folder).

I’ve found the most reliable method to untwist such messes is to create a brand new Talend installation (unzip download file into another folder), create a new project (don’t use the front screen’s import functionality); within this new project, use the import item facility, import the required jobs from the corrupted project and the resulting project should now work OK.

If the job wouldn’t compile due to its inability to find standard Talend classes (this applies to Java projects, switch to the code view and click on the red spots on the right hand side to see the errors), just saving the job, exiting and restarting Talend should fix this problem.

Sometimes however, as with the screenshot above, none of the above will work.  Remedying this required me to edit the job’s .item file’s XML directly (the bug that caused this was seen in a 2.4 RC release, it thankfully appears to have fixed in the production release).  Editing a job’s XML is not for the faint hearted, but it is possible and with the knowledge gained from studying the XML structure you may find such experience useful (as I did) in the mass-editing  of jobs.  In fact, I’ve used Talend’s own data integration functionality to mass generate Talend jobs, demonstrating again the power of this continually improving product!

SQLite – the ultimate data-smithing tool!

SQLite logo as of 2007-12-15Image via Wikipedia

Although my data-smithing tool box is full to the brim with powerful tools such as Talend, Kettle PDI, Picalo and Excel, all backed by the cloud infrastructure of Amazon’s S3, SImpleDB and EC2, there’s one simple yet powerful tool that I always seem to gravitate back to, that tool is SQLite.

Now obviously being a hewer of data, I need a SQL compliant database for data manipulation and SQLite performs that task with speed and ease. But it’s not just in the hewing, it’s in the hauling of data where SQLite also shines.

I use SQLite as the container for passing tabular datasets between (and within) my various tools, that data doesn’t even need to be clean (due to SQlite’s liberal manifest typing rules) just so long as it can be expressed as a table.

For example; a Talend job could store an extracted dataset in a SQLite file, pass that file on to a Python script for some special processing (for example extracting further data from a source not directly supported by Talend such as SAP or SimpleDB), and then pass the resulting SQLite database on to Excel or a similar tool to allow a business user to view and perhaps modify the data; finally Talend picking up the file again to load it into a corporate data warehouse.

Now you could use flat files to transport the data or store the intermediate results in a corporate database, but SQLite is as easy, if not easier than, flat files and offers the SQL processing capabilities of big-iron databases, but without the hassle of getting write access to an existing server or setting one up from scratch.

And I know there are other similar file based database offerings such as MS Access and the Java only HSQLDB, but neither match SQLite’s ubiquitousness, sheer simplicity and powerful data processing ability.

Dublin Bus and PALO ETL – the connection!

Dublin buses, as is the norm with most road-based public transport systems in our increasingly car-choked cities, tend to operate on the basis of “no sign of a bus for ages, then two or three arrive at the same time”. Palo MOLAP ETL options appear to be following the same pattern; we’ve been waiting for ETL support for ages and now we see three of them heading down the road towards us. There’s Palo’s own offering, then came Stratebi‘s Kettle Plugin and now Talend Version 2.3.0RC2 is offering a Palo output component.

Mind you, the Talend offering is very basic and I’ve not managed to get the Sratebi plugin to work, leaving Palo’s ETL Server as the front runner at the moment (drill-through capability is a winner in my book).

I’ve also been busy re-factoring my VBA SQLite and Amazon S3 code with the intention of publishing them as an Excel based micro-ETL platform. While cleaning up the Amazon AWS modules I’ve been playing with SimpleDB, I’m impressed, Excel combined with SimpleDB rocks!

I’ve also wrapped the open source XySSL SHA1 HMAC C code in a VBA friendly DLL, as searching for a VBA hmac sha1 hash implementation (essential for Amazon AWS access) has proved fruitless.

Hope to release the lot the end of next month.

UPDATE:

Thanks to Javier and Jorge from Stratebi I’ve managed to get the new Kettle Palo plugin to work. It seems that the TEST facility in the Kettle database connection dialogue throws an exception for Palo connections but the connections work fine in the actual Palo input/output steps. Did a quick test and it looks very easy to use and fits in well with the Kettle “way of doing things”.

PALO ETL-Server, first sighting …

I was wrong. I figured Jedox would build their new ETL server on one of the existing open source ETL project code-bases, either Talend or Pentaho’s Kettle. Instead, the new alpha ETL server code which has just been uploaded to SourceForge is based on neither and appears to have been developed by another German company Proclos.

Rather that a full featured all-things-to-all-men ETL tool, it’s a specialist MOLAP cube import tool, like an XML driven version of IMPPalo. Being Java based, it should be easy enough to combine with Kettle to offer the best of both worlds; let Kettle do the heavy lifting and the management of conformed dimensions and fact tables, then use Palo ETL-Server to build the hierarchies and load the cubes from these tables.

There’s no documentation as yet but there’s two demo projects; importRelDB.xml, which loads data into a cube from a HSQLDB in-memory database and a CSV file; and importOLAP.xml, which copies data from one Palo cube to another.

To run the importRelDB.xml project …

Java -jar importer.jar – p importRelDB

… each project is broken up into Jobs (such as Initdata, MasterData, CubeData, again like IMPPalo) and these can be run separately by using the -j option.

The tool is controlled via XML configuration files and lacks a GUI interface (which is fine by me, I’m more of a command-line guy). I’ve checked-out the SVN code and am slowly working my way through it, no sign as yet as to how drill-back from PALO cubes will be enabled; as this project is called Importer ETLCore, perhaps that’s yet to come.

So far, I like what I see.

New ETL platform for PALO OLAP

Jedox have announced that they intend to ship a Palo centric ETL open source server product early next year. This is excellent news and is on top of the new rules engine that was added to Palo this summer. Open source MOLAP has suddenly taken off the training wheels and is getting ready to mess with the big kids. The two things I really like about the new proposed Palo-ETL server are; it’s open-source and it’s designed to enable drill-down from the analysis cube back to the data source.

Drill back is the 2nd most common reason for continuing IT staff involvement in the day-to-day running of BI projects (the 1st is of course the ETL process). The “Where did that figure come from?” question is one of the reasons that the Excel pivot table function is so popular, double-click on a data cell – the rows used to generate that figure are displayed on a new sheet; simple but powerful.

As to what platform the new server is to be built on, I’m guessing Talend or Kettle. Now it is possible that Jedox have rolled their own product from scratch but with two superb open-source products already out there it would seem pointless. Talend actively seeks out technology licensing agreements with other companies and has just opened an office in Germany (Jedox is based in Freiburg) so it would be the most likely contender. But, Pentaho’s Kettle may also be in the running as there’s already prior-art and this Pentaho forum thread. Also, the one major missing from the comprehensive Pentaho stable is lack of an in-memory OLAP tool (their current OLAP offering is the ROLAP Mondrian project).

Google Spreadsheets – ETL tool

Although I’m a total Excel fanboy, I most admit I rarely use it any longer for personal stuff such as home budgets, tax calculations, what-ifs, to-do lists etc.; I now tend to use Google Spreadsheets. Likewise, personal notes, drafts and useful bits of code are stored using Google Docs rather than MS Word. Three main reasons for this shift to the cloud:

  • Google Docs & Spreadsheets are ‘good enough’ for most of the trivial lists and calculations I require in my personal life and indeed for most business purposes as well, at least those that don’t require a pivot table.
  • These spreadsheets and documents are important but not necessarily in the ‘state secret/I-could-tell-but-then-I’d-have-to-kill-you’ scale of things, by building them in Google Apps they are securely backed-up and easily accessible.
  • A lot of the spreadsheets are collaborative in nature, and in the collaboration field, Google Spreadsheets just gets better and better.

Today, Google announced further additions to their spreadsheet product. The AutoFill feature adds functionality I’ve come to expect from Excel, but with a twist, integration with Google Sets. But the additions that really caught my eye were the new data import functions. Now again, Excel has had web queries since Excel97, and it always amazed me why online pretenders to the throne tended to ignore the most common source of tabular data on the web, the HTML table; something to do with the great XML/Tables divide I guess!

Google now not only fixes this omission,providing access to HTML tables and comma/tab separated file, but also provides access to RSS/ATOM and generic XML sources. All that’s missing now are functions that can read other common online data files formats such as Excel, MSAccess, XBase and of course SQLite.

This addition of HTML import support and the AutoFill feature will further reduce the number of times I’ll need to fire up Excel for personal tasks, but the RSS/ATOM/XML import feature also has potential as a tool in my micro-ETL toolbox. Using Excel as my only micro-ETL tool is possible when the data is either already in Excel/CSV or accessible via a COM API or via ODBC drivers, otherwise I can call-in either Ruby, Talend, Kettle or even RSSBus. But now I’ve another option, if the data is public and published as RSS/ATOM or some other variation on XML, I can use Google Spreadsheets to fetch the data and import the resulting tabular dataset into Excel via a Web Query or via the GData API.

New Google Reader Search facilityOne other thing. While researching this post, looking up links etc. I used another new feature Google added today, Google Reader’s new Search facility. As most of my references are discovered via the blogs I subscribe to, the ability to restrict searches to that subset of the web is fantastic; I even used it to search through my own blog posts! If del.icio.us offered the same option it would make re-finding stuff even easier. I did try to use Google Co-Op to build a search engine restricted to my del.icio.us links but it didn’t seem to like the volume of links (4000 odd) I sent it.

Talend and Perl

I’ve downloaded V2.1 of the open source Talend ETL tool; lots of new connectors added and the Java SQLite connector no longer requires a JNI adapter. I’ve evaluated Talend in the past mainly concentrating on its Java code generating capability, this time I revisited the original Perl generator. Why? Well I know Perl, it has a very active community and I’ve successfully used it in the past as a data hacking language on Unix boxes, but I’ve always found using it on Windows a PITA.

For Perl to be a runner as a micro-ETL tool, Windows deployment had to be as simple (for the end-user) as possible. So I decided to invest some time looking at the options for deployment on Win32 and more particularly the use of the PAR module to generate Windows executables. I eventually managed to get PAR to work and now I have a way of shipping a Talend generated Perl ETL transformation to a Windows box without worrying about installing a Perl environment.

With the wealth of Perl data manipulation modules and examples out there, and the ability to create new Talend components via tFlexPerl, I’m seriously considering using Talend/Perl as my micro-ETL environment. The one thing I’m missing is a Perl wrapper for Palo, but I’ve been looking at using SWIG to generate one, looks easy enough, particularly as I would only be interested in exposing two or three Palo functions (adding data to cubes, adding elements to dimensions and issuing a database save).

I know, I know, I’ve been here,there and everywhere looking for my ideal technology. Where will I land? Which option will I finally go with? Stay tuned, as my business plan for “the country datasmith” becomes more concrete, the technology required to support the business becomes clearer, I call this “the perfect storm”. More anon.

Apatar – a few extracts short of a load

I’ve been meaning to try out the Apatar ETL/Mashup tool for sometime and today being yet another rainy day in this the worst Irish summer that I can remember (and Irish summers are not renowned for the lack of rainfall) I decided to give it a try out. Not impressed I’m afraid; comes up short when compared to either Kettle (Pentaho PDI), Proto, RSSBus or Talend. Very few database connectors (e.g. no SQLite,DB2 or Firebird support) this wouldn’t be a problem if the product offered a generic JDBC or ODBC connector. It does have one nice feature the others (other than RSSBus) haven’t, an Amazon S3 connector. But the thing that I find amazing is that a product that’s positioning itself as an Enterprise 2.0. mashup tool doesn’t have the ability to read and write Excel files! And no, CSV files don’t count.

Talend vs. Kettle (Pentaho PDI)

Over the last few weeks I’ve received a lot of traffic from Goggle searches comparing Talend and Kettle and also from Vincent McBurney’s ITtoolbox article comparing the two products, so where do I stand?

As ETL tools they take different approaches, Kettle is a meta data driven framework (which is in turn tightly integrated into an even larger BI framework, the Pentaho BI Project), while Talend is at the end of the day a code generator. The code generator nature of Talend does impart certain capabilities such as the ability to easily integrate into other BI platforms (such as JasperSoft and SpagoBI ) and as micro-ETL tool. On the other hand Kettle’s increasing integration into the Pentaho family makes it natural choice for larger BI implementations. So which one would I recommend ? It’s got to be Kettle and the main reason for that choice is the Pentaho PDI community.

My experience so far with the Talend community has not been good. Admittedly I’ve only asked two questions, one looking for more details on a statement regarding ERP and CRM connectors mentioned in a press release and the other a technical query about a new tSQLite connector. Neither were answered. OK I’m a big boy I can handle the rejection and I’m also technically savvy, I can eventually fix most of my own problems. But I’m sure if I posted similar questions on the Pentaho forums I would have received an answer, probably from Matt Casters if he though the general community would not be able to provide the answer.

So will this experience stop me from using Talend? Maybe not, the code generation capability may come in handy as a micro-ETL tool (but I do have other options); however, through an offline conversation with Samatar (an active member of the Kettle community) and Matt Casters I learned of a new XML input step and of a PDI alternative to the Talend tJavaFlex component, the “Modified Java Script Value” step. This allows for the creation of START, TRANSFORM (i.e. for-each-row) and END JavaScript scripts. As Kettle uses the Rhino scripting engine I have access to any Java API (e.g. Palo,S3, Google Spreadsheets) but with the flexibility of JavaScript; brilliant!

Not only that, since my experience with Talend tJavaFlex reintroduced me to the world of Java I though I might as well look into creating my own Kettle plugin and guess what, it looks easy enough. So I’m going to develop my first plugin to be based on the ScriptValueMod step, replacing the Rhino engine with the Java 6 ScriptEngineManager. This will give me access to the latest version of Rhino which supports E4X (will make handling XML must easier) and also opens the possibility of using JavaFX or indeed JRuby as Kettle scripting languages.

UPDATE:

I’ve received a reply to my tSQLite question!

UPDATE: July 2008

This post is now over a year old, Talend as a product and as a community has improved enormously in the last 12 months; so much so, I now use Talend  (Java) in preference to PDI for most of my datasmithing needs.

JavaFX – a GUI DSL

Having mastered JavaScript (OK master is too strong a word – having become comfortable with both its syntax and usage patterns) my next port of call is JavaFX the recently announced Flash/Silverlight competitor. What led me to JavaFX Script was not its role in this Flash/AJAX alternative platform (which unless  Sun improves the JRE download experience is dead in the water they may have a chance with the promised consumer JRE ) but its status as the 2nd scripting language to be supported by the Java 6 ScriptEngineManager – the first being JavaScript. Although the JavaFX platform is still in alpha, the important elements of the scripting language appear to be usable.

Why do I need another scripting language? Well JavaFX Script is a DSL (domain-specific language) and the domain in question is the GUI and as a GUI DSL, JavaFX is impressive.

In my quest for a micro ETL environment the lack of a fast and powerful GUI tool has been a problem. Admittedly, in my Excel/Sqlite xLite suite GUI generation is not a problem as VBA forms are both fast to develop in and feature rich, but the other two options (Java and Ruby) both lack a cost-effective GUI tool (I know there’s Swing and Tk but for somebody who’s used to the speed of development of tools such as VB forms or Oracle’s Application Express, neither appealed). JavaFX changes all that, it’s elegant, powerful, fast to develop in and portable. Further pushes me down the road of using Talend generated Java as my micro-ETL environment of choice.

It’s easy to dismiss JavaFX as hype but the technology behind the hype looks sound. The language borrows powerful features from other languages, object literals from JavaScript, list-comprehensions from Python/Erland/Haskell and first-class functions from JavaScript/Lisp all combined with the full power and glory of Java. The end result:

..is a declarative and statically typed programming language. It has first-class functions, declarative syntax, list-comprehensions, and incremental dependency-based evaluation. The JavaFX Script language makes intensive use of the Java2D swing GUI components and allows for easy creation of GUIs.

Go the community page for FAQs, tutorials and reference information. Timothy O’Brien has a good first pass at calling JavaFX from a Java program and this article by John Smart explains the basics of using ScriptEngineManager to call JavaScript.

I’ve got talend and I’m going to use it…

For the last few months I’ve being looking for my ideal ETL platform. That ideal would be open source, platform independent (well at least Windows and Linux), flexible, and easily deployable. It had looked like a combination of Kettle and my micro-ETL combinations of Ruby/SSQLite and Excel/SQLite would be the eventual “winners”. That was until I discovered (or rather rediscovered) Talend, to be more precise the just released Open Studio V2.0.

Having spent a few hours this weekend getting to know the new Java Project features I’ve come away a Talend fan-boy. I think I’ve found my ETL tool, one suitable for both heavy-duty server-side processing and client-side micro-ETL tasks. The tJavaFlex component allowed me to add missing functionality with ease; for example, I managed to..

…all within two hours of first starting to evaluate the product. (OK, I had evaluated the Perl version several months ago, so I wasn’t a total noobie!).

Because Talend is a code generator (Perl or Java) I can store my work as a ‘tool-independent-snapshot’ for future use, modify by hand if need be, and use it on any JRE or Perl supporting OS. It’s not that I’ll stop using the other solutions (the Excel/SQLite xLite utility has proved to be particularly useful and flexible) but Talend looks like it will become the tool-of-first-preference (along side Excel of course!) in the Gobán Saor’s datasmithing armory.

Talend ETL – A New Contender

Talend have released a new version of their Open Studio ETL tool. Not as full featured as Pentaho Kettle; only supports a limited number of databases and file formats – no SQLite support shock-horror! The press release promises More than 100 Native Connectors and promises connectors to ERP and CRM tools but I couldn’t find them (maybe they meant the ODBC support – well I guess I managed to connect the SQLite using this ODBC driver so maybe there’s an ODBC driver for SAP!). Compared to Kettle the design GUI runs much slower (built on an Eclipse platform, say no more).

But two things impressed…

Talend Open Studio is not an ETL engine as such, it’s a code generator. When I last looked at it, it generated Perl code but this release now also generates Java and not only that, it packages the resulting code ready to be deployed on any Windows or *nix platform, no Talend installation required. This could be an alternative to my Ruby/SQLite micro ETL idea, especially as SQLite support appears to be in the works.

The XML and CSV import components are excellent, especially the XML functionality. The Kettle equivalents work but they never felt like they provided a productivity gain and for me that’s what matters. Once I have data within either a database or an Excel environment I don’t need any other tools, it’s how well and how cost effectively an ETL tool handles the parsing, scheduling, provisioning, distribution and logging of external (usually XML,CSV and Excel) data that matters to me. Talend is now definitely a contender.