Category Archives: Talend

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of  Oracle10g  Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL) ,OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem, Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these new shiny (and open source) “toys” I’ve found that rather than just been incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy, is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool, via JDBC is can access every database type on the planet, it has built-in web-service capability and access to a  multitude of APIs via its Java component for non-database data sources.  The addition of  Groovy makes the use of such Java APIs simpler and quicker and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to plublish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name,customer, sales_region, getpalodata(“SALES”,customer_id,”All Products”.”Total Sales”,”Euros”,”YTD”)  as customer_YTD, getpalodata(“SALES”,sales_region,”All Products”.”Total Sales”,”Euros”,”YTD”)  as region_total_YTD from list_of_top_customers;

… where getapalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot, ( as SQLite lacks star-query (or even multi-index query) support.)  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or for ad-hoc queries. This is powerful stuff, simple, free, powerful stuff.

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with a SSH VPN (or Hamachi) to provide security).
  • At start-up, do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc. Storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then setup a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations which would return a dataset directly to the calling client or maybe just acknowledge the request, queue it up for processing later, sending the resulting dataset by EMail or RSS feed when finished.
  • At a fixed time the service would shut it self down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point, build what you need, add the levels of security (or none at all) that fits your situation, all within a open framework, with zero lock-in (okay, still using Excel, anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server, host it on an EC2 instance, (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).

Now tell me that doesn’t make sense?

New universal SQLite JDBC library.

Both Talend (Java) and Kettle distribute the Zentus.com pure-Java SQLite JDBC driver and for most purposes this run-anywhere version is fine. But, if you really need to take advantage of SQLite’s speed then connecting using the native JNI version is a must.  Doing this was easy enough, just change over to using a generic JDBC connection specifying the required native jar and placing the associated dll/so on your system path.

But now there’s an easier way, the latest version (V052, in fact from V050 on) is a universal jar, it contains native JNI libraries for Windows, Linux and MacOS alongside the pure-Java version.  It will automatically pick the correct lib for the platform and fall back to the pure-Java version if required.  You can tell if it’s picked up the native lib by calling conn.getDriverVersion(); it’ll return “native” if it has.

To upgrade to this jar in Kettle see this, this time replacing the nested jar with sqlitejdbc-v052.jar.

For Talend:

  • Either rename the new V052 jar to sqlitejdbc_v037_nested.jar, replace the existing V037 jar in the ../lib/java folder with this new renamed file.
  • Or, you could edit the Java specific XML files in the various tSQlite component folders, replacing the references to the old nested V037 jar.
  • Or, and this is what I would do, don’t use the tSQLite components, replace them with tJDBC generic components, then you can pick whatever version of the driver you require, you could even change to a different database provider!

The Talend tradition of a separate set of components for each type of database, seems to be a hangover from its Perl-generating roots. It’s true that database specific components are required for certaing tasks such as  bulk-loading, ELTs and so on, but JDBC was designed to be generic and as long as the SQL syntax is compatible, it makes switching in an out database providers very easy.  So unless there’s a good reason, stick to using tJDBC.

Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool I still had one major problem with it, its lack of a scripting tool.  Kettle (Pentaho PDI) has Javascript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl.  I could have gone (and did experiment) with calling Javascript, Jython or JRuby via JSR223, but I wasn’t happy with the level of integration afforded by this, opting instead to make command line calls to Python (using SQLite as a data carrier).

Then, I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than is now, liked it then but couldn’t see a use for it at the time and promptly forgot about it.  Then it appeared wrapped in a Talend component, prompting me to do a quick visit to the Groovy website, which turned into a deep-dive into the language; I’d found my scripting tool!

Groovy (by the way what a terrible name for a language, or is that just me?), is not really a stand-alone language but more an extension to Java itself; offering the full power of Java but with addition of closures, builders and dynamic types.  In fact, over time Groovy has become more and more Java like (the biggest missing being lack of support for anonymous inner-classes).

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between both languages; Groovy  compiles down to JVM bytecode and can use Java classes and objects, Java can likewise use Groovy generated bytecode.  This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.

Here for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube …


[sourcecode language='Java']

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;
connection = ConnectionFactory.getInstance().newConnection("localhost","7777","admin","admin")
database = connection.getDatabaseByName("Demo");
;cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements,columnElements,]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)
// fetch data set
datas=cube.getDataArray(dataSet as Element[][])
connection.disconnect();
// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data=[]
heading=[]
// first row set to the row names (i.e. "Product name" followed by the country names )
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
   }
data << heading
// Now  out each line
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
         row << datas[((i + (j*columncount)))]
         }
    data << row.flatten()
   }

//output to csv file
def csvOut= new FileOutputStream('c:/data/File.csv' )
for (lines in data) {
         lines.eachWithIndex{col,i ->
                             if (i > 0) {
                                 csvOut << ","
                             }
         csvOut << col
     }
     csvOut << "\n"
}
csvOut.close()
[/sourcecode]


This was done in the Groovy console as a proof of concept, it was then transferred to a tGroovy component where it was parametrised and instead of outputting to a CSV file, it was used to fill the globalBuffer structure (the structure used by tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set-up a Talend job as an Excel accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and the JPalo API, both SQLite UDFs and Jetty were new to me, and that’s were scripting proves it worth, giving the developer the maximum support with the minimum of background noise.  But it’s not just weird and wonderful new APIs that scripting helps expose but as a datasmithing tool, languages such as Groovy give analysts the ability to quickly de-construct and model datasets (for example, see Groovy’s SQL database support and collections’ functionality).

As a infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff“.

Regular Expressions as an end-user programming tool?

“What? Have you completely lost the plot, Gleeson?”, I hear you scream.  Jamie Zawinski’s famous quote is intoned once more ..

Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.”   Now they have two problems.

Of course the above quote could be (and probably has been) changed to…

Most business people, when confronted with a problem, think
“I know, I’ll use a spreadsheet.”   Now they have two problems.

They are dense, single-line, single purpose, self contained mini-programs.  The previous statement applies to regular expressions but could equally be used to describe the single most popular end-user programming tool, spreadsheet formulae (particularly in their nested form!).

As somebody with the “programming gene” (something most, but not all, IT professionals possess, as do a significant proportion of “civilians”), such compressed logic somewhat grates compared with the power and elegance of more expressive programming languages, but that hasn’t stopped me using both spreadsheet formulae and regex to quickly and effectively solve problems when the need arose.

Those without the programming gene (the vast majority of business users), find traditional programming languages next to impossible to get their heads around yet find spreadsheet formulae approachable and useful.  It seems to be something to do with approaching problems as a series of simple problems and not loading the whole problem domain into one’s brain at one sitting (as most programmers and system designers are capable of).

In the past, non-programmers would rarely come in contact with regex as its use was possible only within the realms of professional programming or Unix sys-admin toolsets (sed,awk etc.).  But now, ETL tools such as Kettle and Talend allow end-users to use regular expressions without the need to understand the underlying programming language.  Taking this to the next step, Talend’s new data profiling product uses regular expressions as its main discovery language. They could, I guess, have invented yet another XML dialect and/or query-by-example dialogue, but instead they’ve taken the sensible (and cheaper) option and exposed the full power of raw regular expressions.

Will the great unwashed embrace regex in the same way they took to nested Excel functions, embarrassing their professional colleagues with yet more amateurish and often unmaintainable messy solutions, that just work? I think they just might…

What to do when Talend gets its knickers in a twist?

If you’ve done any significant amount of work with Talend you’ll undoubtedly have experienced situations where either the generated code/JETemitters or the GUI representation of a job become unstable like so…

Talend loses the plot!

The usual advice is to backup your projects (workspace/projectName) , delete the workspace/.Java (or .Perl) and workspace/.JETEmitters folders and restart Talend to force a rebuild of the generated code.  I’ve have varying degrees of success with this approach, from it fixing the problem to the other extreme of the project raising the dreaded “null pointer” error never to load again!  Also, in such a situation, don’t depend on the in-built backup facility (particularly if it’s showing errors on the export), make a copy of the project folder directly (in particular, the process folder).

I’ve found the most reliable method to untwist such messes is to create a brand new Talend installation (unzip download file into another folder), create a new project (don’t use the front screen’s import functionality); within this new project, use the import item facility, import the required jobs from the corrupted project and the resulting project should now work OK.

If the job wouldn’t compile due to its inability to find standard Talend classes (this applies to Java projects, switch to the code view and click on the red spots on the right hand side to see the errors), just saving the job, exiting and restarting Talend should fix this problem.

Sometimes however, as with the screenshot above, none of the above will work.  Remedying this required me to edit the job’s .item file’s XML directly (the bug that caused this was seen in a 2.4 RC release, it thankfully appears to have fixed in the production release).  Editing a job’s XML is not for the faint hearted, but it is possible and with the knowledge gained from studying the XML structure you may find such experience useful (as I did) in the mass-editing  of jobs.  In fact, I’ve used Talend’s own data integration functionality to mass generate Talend jobs, demonstrating again the power of this continually improving product!