Category Archives: Java

New universal SQLite JDBC library.

Both Talend (Java) and Kettle distribute the Zentus.com pure-Java SQLite JDBC driver and for most purposes this run-anywhere version is fine. But, if you really need to take advantage of SQLite’s speed then connecting using the native JNI version is a must.  Doing this was easy enough, just change over to using a generic JDBC connection specifying the required native jar and placing the associated dll/so on your system path.

But now there’s an easier way, the latest version (V052, in fact from V050 on) is a universal jar, it contains native JNI libraries for Windows, Linux and MacOS alongside the pure-Java version.  It will automatically pick the correct lib for the platform and fall back to the pure-Java version if required.  You can tell if it’s picked up the native lib by calling conn.getDriverVersion(); it’ll return “native” if it has.

To upgrade to this jar in Kettle see this, this time replacing the nested jar with sqlitejdbc-v052.jar.

For Talend:

  • Either rename the new V052 jar to sqlitejdbc_v037_nested.jar, replace the existing V037 jar in the ../lib/java folder with this new renamed file.
  • Or, you could edit the Java specific XML files in the various tSQlite component folders, replacing the references to the old nested V037 jar.
  • Or, and this is what I would do, don’t use the tSQLite components, replace them with tJDBC generic components, then you can pick whatever version of the driver you require, you could even change to a different database provider!

The Talend tradition of a separate set of components for each type of database, seems to be a hangover from its Perl-generating roots. It’s true that database specific components are required for certaing tasks such as  bulk-loading, ELTs and so on, but JDBC was designed to be generic and as long as the SQL syntax is compatible, it makes switching in an out database providers very easy.  So unless there’s a good reason, stick to using tJDBC.

Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool I still had one major problem with it, its lack of a scripting tool.  Kettle (Pentaho PDI) has Javascript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl.  I could have gone (and did experiment) with calling Javascript, Jython or JRuby via JSR223, but I wasn’t happy with the level of integration afforded by this, opting instead to make command line calls to Python (using SQLite as a data carrier).

Then, I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than is now, liked it then but couldn’t see a use for it at the time and promptly forgot about it.  Then it appeared wrapped in a Talend component, prompting me to do a quick visit to the Groovy website, which turned into a deep-dive into the language; I’d found my scripting tool!

Groovy (by the way what a terrible name for a language, or is that just me?), is not really a stand-alone language but more an extension to Java itself; offering the full power of Java but with addition of closures, builders and dynamic types.  In fact, over time Groovy has become more and more Java like (the biggest missing being lack of support for anonymous inner-classes).

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between both languages; Groovy  compiles down to JVM bytecode and can use Java classes and objects, Java can likewise use Groovy generated bytecode.  This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.

Here for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube …


[sourcecode language='Java']

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;
connection = ConnectionFactory.getInstance().newConnection("localhost","7777","admin","admin")
database = connection.getDatabaseByName("Demo");
;cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements,columnElements,]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)
// fetch data set
datas=cube.getDataArray(dataSet as Element[][])
connection.disconnect();
// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data=[]
heading=[]
// first row set to the row names (i.e. "Product name" followed by the country names )
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
   }
data << heading
// Now  out each line
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
         row << datas[((i + (j*columncount)))]
         }
    data << row.flatten()
   }

//output to csv file
def csvOut= new FileOutputStream('c:/data/File.csv' )
for (lines in data) {
         lines.eachWithIndex{col,i ->
                             if (i > 0) {
                                 csvOut << ","
                             }
         csvOut << col
     }
     csvOut << "\n"
}
csvOut.close()
[/sourcecode]

This was done in the Groovy console as a proof of concept, it was then transferred to a tGroovy component where it was parametrised and instead of outputting to a CSV file, it was used to fill the globalBuffer structure (the structure used by tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set-up a Talend job as an Excel accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and the JPalo API, both SQLite UDFs and Jetty were new to me, and that’s were scripting proves it worth, giving the developer the maximum support with the minimum of background noise.  But it’s not just weird and wonderful new APIs that scripting helps expose but as a datasmithing tool, languages such as Groovy give analysts the ability to quickly de-construct and model datasets (for example, see Groovy’s SQL database support and collections’ functionality).

As a infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff“.

Java – at the eye of a perfect storm

The “perfect storm” of ubiquitous broadband, powerful and cheap laptops, virtual machines, cloud-based services/infrastructure and open-source software is changing the nature of IT in a way that’s reminiscent of the revolution started by the IBM PC. Although a lot of emphasis has been put on the influence of consumer-focused services on the enterprise, the Web2.0 effect; there’s also traffic in the other direction.

Tools that were once the preserve of large multi-national or governmental organisations are now becoming available to a much larger audience at a fraction of the cost (either free via open source or pay-as-you-go via on-line services such as Amazon Ec2 or salesforce.com).

As a result of this leakage from the enterprise, I’m more and more using a skill that I though I’d left behind in the hallowed halls of “big business”, my knowledge of enterprise Java. Two of the tools that are at the centre of my datasmithing arsenal are Java to the core, Talend ETL and WaveMaker and a third Palo’s new ETL Server is built on a Java stack.

Talend

My first impressions of the “Java Project” version of Talend were, as they say in Texas, “all hat and no cattle”. I’ve stuck with it though, and have had the opportunity over the last few weeks to re-visit the product. Initially, my attitude to Talend was coloured by my experience with Kettle (aka Pentaho PDI), which under the direction of Matt Casters and the patronage of Pentaho has grown from strength to strength, but once I attuned myself to the idea that Talend is, in essence, a code generator, generating code in a language I know well, I became more comfortable with it.

What I like about Talend, is the ability to convert an ETL process into either a POJO or a WAR file representation of the solution, both stand-alone and fixed-in-time. Talend as a company or as a product could disappear in the morning, as could I, but the solution, cast in Java, will continue on regardless, a solution expressed in a standards based language, widely used and understood by a large number of IT professionals.

This is really important when you consider the ETL/BI products that have been swallowed by the rent-book collectors of the IT business, e.g. Essbase/Hyperion by Oracle, Business Objects by SAP (I might be a bit unfair to SAP, who continue to show real commitment to software R&D). The new owners of these once ground-breaking products will continue to milk the licence holders for “rent” long into the future , a situation that would be acceptable if they continue to offer value above a “protection” service more associated with that provided by your local hoodlum; but alas, most were bought to strengthen the purchaser’s control of their market, so I wouldn’t hold my breath.

I’ve always been a lazy programmer, so I’ve over the years developed numerous productivity aids to help automate development (i.e. reduce the boredom factor for me and the cost for my customers/employers) and to reduce errors (a cost to both sides). Code generation has been at the heart of many of these efforts, so I find myself at home with both Talend and Wavemaker’s approach.

WaveMaker

I’d not heard of Chris Keene’s WaveMaker until a few weeks ago. WaveMaker (previously know as ActiveGrid) belongs, along side Talend, Pentaho and Jedox, to a new breed of open-source businesses, and the product brings new life to Java web development. I was lucky to escape the worst of the J2EE nonsense and could never understand why an easy to use GUI builder like this never existed in the Java world, no wonder .NET continues to outgun J2EE in the market place.

Those of you with a background in VB6/VBA or VisualStudio will feel right at home here, but instead of desktop GUIs you’ll be building AJAX web applications. The resulting application is packaged as a WAR file which can be hosted by any standards based Java server (e.g. Tomcat or Jetty). It’s open source and built on industry standards, Hibernate,Spring and the Javascript Dojo framework.

Not only can WaveMaker act as a front-end to traditional databases, it’s designed to be equally at home with data served up by web services or POJOs. And, as Talend WAR projects and the Palo ETL Server (Jetty based) both expose Axis based web services, these three products are a match made in Java heaven.

Oracle Application Express

So, If you only came across WaveMaker recently, what did you intend to use as a Web/GUI front-end before this?

Well, for many tasks, Excel, backed by web-service aware VBA will continue to be an option, my xlAWS library and xLite code base will continue to be useful. Obviously, Excel is a natural front-end to Palo itself and VB6 is always there for quick and dirty Window’s GUI apps. I may also use Proto if circumstances warrant it.

For web front-ends, in the past I toyed with using Oracle Apex, then Rails, then JotSpot, then Zimki.

  • Rails taught me a lot about “good web app design” and introduced me to Ruby (and SQLite) but it didn’t offer me the speed of development and ease of deployment I’ve become accustomed to.
  • JotSpot got swallowed up by Google and spat out again as Google Sites minus the innovative “wiki database” capability.
  • Zimki, alas, is no more.
  • That left me with Oracle’s impressive Application Express (aka APEX aka HTML DB).

Application Express is excellent if your background is in Oracle databases and/or Oracle Forms and if you work in a Oracle shop and have not checked it out, then do so, you’ll be impressed. It’s standard with all versions since 10G and can be installed on V9 databases. It’s also the front-end for Oracle XE – the free edition of the database server.

So why jump ship to WaveMaker?

APEX is Oracle specific, closed source, costs a fortune once you outgrow Oracle XE, is a bit “odd” to configure, and, I’m not sure Oracle know what to do with the product (afraid it might cannabalise its lucrative J2EE business!).

WaveMaker embraces the world, is open source and is really easy to use, while still allowing access to the underlying code (both Java and Javascript) and CSS styling. And, it can be easily deployed.

No contest, I’m afraid.

PALO ETL Server, more sightings …

First day back after Christmas, snow falling outside.

More additions to the PALO ETL-Server SourceForge project, new version of the core and, a new web server – built using Jetty and Apache Axis. Axis is a SOAP handler so I looked around for the WSDL file to see what services are to be exposed and found a reference to a drillThrough service which I guess is the mechanism by which we’ll soon be able to drill back from a PALO cube to its source data tables. At the moment I’m just wandering through the source code, I’ll need to fire up an EC2 instance to give it a test run as the new server code doesn’t fully support Windows.

Happy New Year!

Using the latest Pure Java SQLite JDBC driver in Kettle

The bug in the pure Java SQLiteJDBC driver that caused an “out of memory” error when trying to connect to a SQLite database using standard windows drive letters (e.g. c:\kettle\mydata.db) is now fixed. The current version (V037) has also been updated to SQLite version 3.4.2. To use the latest driver within Kettle, download the file from here, go to the ../libext/JDBC folder, delete the included sqlitejdbc-v023-nested.jar and replace with sqlitejdbc-v027-nested.jar.

Update:

Matt Casters of the Pentaho Kettle team left a comment to say version 2.5.2 and version 3.0.0-RC2 will ship with the sqlitejdbc-v027-nested.jar

Facebook Apps using Ruby on Rails

This is what I love about Ruby and Ruby on Rails, once you learn the basics of Ruby and how a RoR app is put together you can use this knowledge to learn about other technologies, in this case Facebook Applications. The reason for this is, as soon as a new technology hits the street somebody is bound to either build a Ruby library or a RoR plugin targeting the new (and presumably cool) platform. In two excellent articles Stuart Eccles shows how to build a Facebook app using the rFacebook Rails extension. I guess I could have looked at a PHP or Java example but I chose (as I nearly always do) the Rails route as the layers of abstraction and the standard infrastructure of RoR apps allow me to quickly get an overview of the new technology in action but also, if I so desire, allow me to easily deep dive into anything that requires more detailed investigation.

In the late nineties I learned Java (and later .NET) for a similar purpose not as a primary development tool but because of its role as the “language of account” of most new technologies at the time. And although I still come across environments where the only examples are in a Java (or .Net or PHP) I know if I wait a few weeks some bright Rubyist will eventually “document” it in either Ruby or Rails.

Using the sample application built by Stuart I’ve added one of my favourite salad recipes,
Banana & Tomato in a Mustard Dressing
..enjoy.

Talend vs. Kettle (Pentaho PDI)

Over the last few weeks I’ve received a lot of traffic from Goggle searches comparing Talend and Kettle and also from Vincent McBurney’s ITtoolbox article comparing the two products, so where do I stand?

As ETL tools they take different approaches, Kettle is a meta data driven framework (which is in turn tightly integrated into an even larger BI framework, the Pentaho BI Project), while Talend is at the end of the day a code generator. The code generator nature of Talend does impart certain capabilities such as the ability to easily integrate into other BI platforms (such as JasperSoft and SpagoBI ) and as micro-ETL tool. On the other hand Kettle’s increasing integration into the Pentaho family makes it natural choice for larger BI implementations. So which one would I recommend ? It’s got to be Kettle and the main reason for that choice is the Pentaho PDI community.

My experience so far with the Talend community has not been good. Admittedly I’ve only asked two questions, one looking for more details on a statement regarding ERP and CRM connectors mentioned in a press release and the other a technical query about a new tSQLite connector. Neither were answered. OK I’m a big boy I can handle the rejection and I’m also technically savvy, I can eventually fix most of my own problems. But I’m sure if I posted similar questions on the Pentaho forums I would have received an answer, probably from Matt Casters if he though the general community would not be able to provide the answer.

So will this experience stop me from using Talend? Maybe not, the code generation capability may come in handy as a micro-ETL tool (but I do have other options); however, through an offline conversation with Samatar (an active member of the Kettle community) and Matt Casters I learned of a new XML input step and of a PDI alternative to the Talend tJavaFlex component, the “Modified Java Script Value” step. This allows for the creation of START, TRANSFORM (i.e. for-each-row) and END JavaScript scripts. As Kettle uses the Rhino scripting engine I have access to any Java API (e.g. Palo,S3, Google Spreadsheets) but with the flexibility of JavaScript; brilliant!

Not only that, since my experience with Talend tJavaFlex reintroduced me to the world of Java I though I might as well look into creating my own Kettle plugin and guess what, it looks easy enough. So I’m going to develop my first plugin to be based on the ScriptValueMod step, replacing the Rhino engine with the Java 6 ScriptEngineManager. This will give me access to the latest version of Rhino which supports E4X (will make handling XML must easier) and also opens the possibility of using JavaFX or indeed JRuby as Kettle scripting languages.

UPDATE:

I’ve received a reply to my tSQLite question!

UPDATE: July 2008

This post is now over a year old, Talend as a product and as a community has improved enormously in the last 12 months; so much so, I now use Talend  (Java) in preference to PDI for most of my datasmithing needs.

JavaFX – a GUI DSL

Having mastered JavaScript (OK master is too strong a word – having become comfortable with both its syntax and usage patterns) my next port of call is JavaFX the recently announced Flash/Silverlight competitor. What led me to JavaFX Script was not its role in this Flash/AJAX alternative platform (which unless  Sun improves the JRE download experience is dead in the water they may have a chance with the promised consumer JRE ) but its status as the 2nd scripting language to be supported by the Java 6 ScriptEngineManager – the first being JavaScript. Although the JavaFX platform is still in alpha, the important elements of the scripting language appear to be usable.

Why do I need another scripting language? Well JavaFX Script is a DSL (domain-specific language) and the domain in question is the GUI and as a GUI DSL, JavaFX is impressive.

In my quest for a micro ETL environment the lack of a fast and powerful GUI tool has been a problem. Admittedly, in my Excel/Sqlite xLite suite GUI generation is not a problem as VBA forms are both fast to develop in and feature rich, but the other two options (Java and Ruby) both lack a cost-effective GUI tool (I know there’s Swing and Tk but for somebody who’s used to the speed of development of tools such as VB forms or Oracle’s Application Express, neither appealed). JavaFX changes all that, it’s elegant, powerful, fast to develop in and portable. Further pushes me down the road of using Talend generated Java as my micro-ETL environment of choice.

It’s easy to dismiss JavaFX as hype but the technology behind the hype looks sound. The language borrows powerful features from other languages, object literals from JavaScript, list-comprehensions from Python/Erland/Haskell and first-class functions from JavaScript/Lisp all combined with the full power and glory of Java. The end result:

..is a declarative and statically typed programming language. It has first-class functions, declarative syntax, list-comprehensions, and incremental dependency-based evaluation. The JavaFX Script language makes intensive use of the Java2D swing GUI components and allows for easy creation of GUIs.

Go the community page for FAQs, tutorials and reference information. Timothy O’Brien has a good first pass at calling JavaFX from a Java program and this article by John Smart explains the basics of using ScriptEngineManager to call JavaScript.

I’ve got talend and I’m going to use it…

For the last few months I’ve being looking for my ideal ETL platform. That ideal would be open source, platform independent (well at least Windows and Linux), flexible, and easily deployable. It had looked like a combination of Kettle and my micro-ETL combinations of Ruby/SSQLite and Excel/SQLite would be the eventual “winners”. That was until I discovered (or rather rediscovered) Talend, to be more precise the just released Open Studio V2.0.

Having spent a few hours this weekend getting to know the new Java Project features I’ve come away a Talend fan-boy. I think I’ve found my ETL tool, one suitable for both heavy-duty server-side processing and client-side micro-ETL tasks. The tJavaFlex component allowed me to add missing functionality with ease; for example, I managed to..

…all within two hours of first starting to evaluate the product. (OK, I had evaluated the Perl version several months ago, so I wasn’t a total noobie!).

Because Talend is a code generator (Perl or Java) I can store my work as a ‘tool-independent-snapshot’ for future use, modify by hand if need be, and use it on any JRE or Perl supporting OS. It’s not that I’ll stop using the other solutions (the Excel/SQLite xLite utility has proved to be particularly useful and flexible) but Talend looks like it will become the tool-of-first-preference (along side Excel of course!) in the Gobán Saor’s datasmithing armory.