Why Larry hates the cloud, and my data trinity.

Last week Oracle certified Amazon EC2 as a supported platform; that same week Larry Ellison attacked the concept of cloud computing as pure hype. Obviously, Larry is not happy with this whole cloud thing, and I think it’s not just the threat it poses to the software industry’s traditional licensing model that worries him. Rather, as Robert X. Cringely points out in his “Cloud computing will change the way we look at databases” post, it’s the likelihood that it sounds the death knell for large-scale traditional databases.

This new database paradigm is memory-centric rather than disk-centric, with the disk-based element acting as an archive/backup/restore mechanism which can easily be stored on commodity SAN devices (e.g. Amazon’s EBS). Using MapReduce technology, Google effectively holds the whole Internet in memory, not in one big supercomputer but in lots of cheap commodity servers.

But it’s not just in the realm of mega datasets that RAM-based databases threaten traditional models. Excel is a memory-based database engine, and so too are in-memory OLAP tools such as Palo. Such products’ ability to handle large volumes of data has increased over the years, with the decrease in RAM costs and the appearance of cheap 64-bit machines (which are no longer limited to 2GB/3GB process working sets).

That doesn’t mean that we’ll throw away SQL databases in their entirety; SQL and the relational model will continue to be useful, but perhaps of greater use in local datastores/caches than as the building blocks for large-scale datastores. For such local caches, less will be more: fewer features, easier to configure, more flexibility. That’s why I like SQLite; long after the dinosaurs of the database world have disappeared, I imagine SQLite databases will continue to survive, embedded in mobile phones, browsers, wherever a local datastore is required. And more than likely operating in memory rather than off disk.

By combining Excel with an in-memory SQLite database, linked to a Palo OLAP in-memory server, it’s possible to take advantage of three powerful data-processing technologies (spreadsheets, SQL, multi-dimensional cubes) all within your PC’s RAM. You could do serious datasmithing with such a combination on a pretty mediocre laptop, with most modern machines providing an excess of CPU power, no need for super fast disks, just as much memory as you can muster. And, with Windows on EC2, these three amigos will soon be capable of being used as a cloud bursting platform.

Excel, SQLite and Palo, my data trinity.

Cloudy skies, cloudy apps…

Just back from a break in Clifden, Connemara, summer is nearly over, the kids return to school today, back to work.

Aasleagh Falls, Co. Mayo

Counties Galway and Mayo were like the rest of the country last week, a tad wet, but unlike the developed east of the island, flooding was not a problem; a problematic drainage area is called a lake in the west.

This August has been the wettest and dullest I’ve ever experienced, but at least I saw some sunshine earlier in the month thanks to Kristian Raue, CEO of Jedox, who kindly invited me to visit the company’s offices in Freiburg, Germany. Freiburg is very green in both senses of the word, surrounded as it is by the Black Forest and with its well-deserved “eco-city” status. It’s also known as the warmest city in Germany, a reputation it thankfully lived up to for this visitor from a rain-soaked Atlantic isle.

August morning, Freiburg im Breisgau

If Freiburg left a positive impression on my mind, so too did Jedox.  The overall impression is of a company which intends to use a combination of quality, vision and the judicious use of open-source to build the Jedox brand into one associated with best-of-breed products and consultancy.  This vision can be seen in the evolution of Palo, from its “good enough” beginnings to its current near-best-of-breed 2.5 version, and from talking to some of those working on the product, best-of-breed status is not that far off.

Likewise, ETL-Server, which is currently a Palo-only “loader”, is to be further developed into a true ETL tool, while continuing to offer MOLAP-centric specialisms.

I also got a glimpse of the next version of Worksheet Server. “Wow!”, is all I can say.

Existing web-based spreadsheet products are fine for simple data analysis or basic data capture purposes but cannot compete with their client-based elder cousins when serious datasmithing is required. Well, from the demo I saw of Worksheet Server in action, that’s about to change. The look and, more importantly, the feel is similar to that of traditional spreadsheets, its interface with Palo is identical to that of the existing Excel add-in, and here’s the big one, it’s open source! Game-changing or what?

But …

That might enable me to move a lot of my spreadsheet applications to the cloud, but what about those applications that are more suited to an MS Access type solution?

Then try out WaveMaker. It’s open source and built on industry standards (Hibernate, Spring and the JavaScript Dojo framework) but has the ease of GUI database development more usually associated with MS tools. The resulting applications are packaged as a WAR file which can be hosted by any standards-based Java server (e.g. Tomcat or Jetty). The latest version makes developing Ajax-fronted database applications even easier with the addition of layout templates. Its existing ability to automatically bind interfaces to SOAP web services has been extended to REST web services by means of a new WSDL auto-discover tool. And Chris Keene, CEO of WaveMaker, also informs me that …

We are also releasing a cloud-based IDE in October with Amazon – stay tuned…

We launched in February and will be announcing our first 7 figure deal this month. We run on Mac, Linux and Windows and are currently the #1 developer download on Apple.com (http://www.apple.com/downloads/macosx/development_tools/)

Our goal is to make it easy to build rich internet applications without complex coding – kind of a MS Access for the Web.

Jedox and WaveMaker, the new breed of open-source businesses

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases, and now with the availability of Oracle10g Express I can even run my own Oracle instances at home or on EC2. The combination of a powerful SQL engine, expressive scripting language (PL/SQL), OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation. But there was always one major problem: Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend. But having been seduced by these new shiny (and open source) “toys”, I’ve found that rather than just being incidental add-ons they had the potential to totally replace Oracle. The combination of Talend, SQLite and Groovy is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool: via JDBC it can access every database type on the planet, it has built-in web-service capability, and it has access to a multitude of APIs via its Java component for non-database data sources. The addition of Groovy makes the use of such Java APIs simpler and quicker, and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required. And although Talend offers a built-in option to publish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself. And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name, customer, sales_region, getpalodata('SALES', customer_id, 'All Products', 'Total Sales', 'Euros', 'YTD') as customer_YTD, getpalodata('SALES', sales_region, 'All Products', 'Total Sales', 'Euros', 'YTD') as region_total_YTD from list_of_top_customers;

… where getpalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot (as SQLite lacks star-query, or even multi-index query, support). SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes, supporting finer-grained drill-throughs from cubes or ad-hoc queries. This is powerful stuff: simple, free, powerful stuff.
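As a rough illustration of the UDF idea, here is a minimal Groovy sketch. It assumes the xerial sqlite-jdbc driver (pulled in with @Grab) and its org.sqlite.Function class; the fakeCube map is a hypothetical stand-in for a real Palo lookup, and the table contents are made up for the example.

```groovy
@GrabConfig(systemClassLoader = true)
@Grab('org.xerial:sqlite-jdbc:3.45.1.0')
import groovy.sql.Sql
import org.sqlite.Function

// Hypothetical stand-in for a Palo cube: values keyed by their joined coordinates.
def fakeCube = [
    'SALES|C001|All Products|Total Sales|Euros|YTD': 1250.0d,
    'SALES|West|All Products|Total Sales|Euros|YTD': 9800.0d
]

// In-memory SQLite database; nothing touches the disk.
def sql = Sql.newInstance('jdbc:sqlite::memory:', 'org.sqlite.JDBC')

// Register getpalodata as a user-defined function on this connection.
Function.create(sql.connection, 'getpalodata', new Function() {
    @Override
    protected void xFunc() {
        // Collect the SQL arguments as text and use them as the cube coordinates.
        def coords = []
        for (int i = 0; i < args(); i++) {
            coords << value_text(i)
        }
        Double value = fakeCube[coords.join('|')]
        // Unknown coordinates return 0 in this sketch.
        result(value == null ? 0.0d : value.doubleValue())
    }
})

sql.execute('create table list_of_top_customers (customer_id text, name text, sales_region text)')
sql.execute("insert into list_of_top_customers values ('C001', 'Acme', 'West')")

def row = sql.firstRow('''
    select customer_id, name,
           getpalodata('SALES', customer_id, 'All Products', 'Total Sales', 'Euros', 'YTD') as customer_YTD,
           getpalodata('SALES', sales_region, 'All Products', 'Total Sales', 'Euros', 'YTD') as region_total_YTD
    from list_of_top_customers
''')
println "${row.name}: customer ${row.customer_YTD}, region ${row.region_total_YTD}"
sql.close()
```

In a real job the body of xFunc would call the cube (for example through JPalo) instead of reading from a map; the SQL side would not change.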

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.
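To make the request handling a little more concrete, here is a hedged Groovy sketch of the two pieces such a service needs whatever web server sits in front of it: decoding the query string into parameters and wrapping a result set in an HTML table. The helper names parseQuery and toHtmlTable and the sample rows are hypothetical; the Jetty wiring itself is left out.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical helper: turn "a=1&b=2" into a map of URL-decoded parameters.
Map<String, String> parseQuery(String query) {
    query.split('&').collectEntries { String pair ->
        def parts = pair.split('=', 2)
        def value = parts.length > 1 ? parts[1] : ''
        [(URLDecoder.decode(parts[0], 'UTF-8')): URLDecoder.decode(value, 'UTF-8')]
    }
}

// Hypothetical helper: wrap a list of row maps in an HTML table,
// using the keys of the first row as the column headings.
String toHtmlTable(List<Map> rows) {
    def writer = new StringWriter()
    def html = new MarkupBuilder(writer)
    html.table {
        if (rows) {
            tr { rows[0].keySet().each { th(it) } }
            rows.each { row -> tr { row.values().each { td(it) } } }
        }
    }
    writer.toString()
}

def params = parseQuery('Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable')
// Made-up rows standing in for whatever the job or SQL query returns.
def prospects = [[customer_id: 1, name: 'Acme'], [customer_id: 2, name: 'Globex']]
if (params.type == 'HTMLTable') {
    println toHtmlTable(prospects)
}
```

An Excel Web Query can read a table like this directly, which is why HTMLTable makes a convenient default output type.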

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN-based server (or on a remote server, with an SSH VPN (or Hamachi) to provide security).
  • At start-up, it would do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc., and storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then set up a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations, which would return a dataset directly to the calling client or maybe just acknowledge the request and queue it up for processing later, sending the resulting dataset by email or RSS feed when finished.
  • At a fixed time the service would shut itself down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point: build what you need, add the levels of security (or none at all) that fit your situation, all within an open framework, with zero lock-in (okay, still using Excel; anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server: host it on an EC2 instance (if you bring up an instance for 10/12 hours every working day, it would cost about $20/$25 a month).
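The monthly figure is simple arithmetic. The sketch below assumes an hourly rate of $0.10 and 21 working days in a month; both numbers are assumptions for illustration, not taken from a price list, so substitute current values.

```groovy
// Assumed figures for illustration only.
BigDecimal hourlyRate = 0.10   // US dollars per instance-hour (assumption)
int workingDays = 21           // working days in a month (assumption)

// Monthly cost of running one instance for a given number of hours per working day.
def monthlyCost = { int hoursPerDay -> hourlyRate * hoursPerDay * workingDays }

println "10 hours/day: \$${monthlyCost(10)}"
println "12 hours/day: \$${monthlyCost(12)}"
```

Under those assumptions the two cases come to $21.00 and $25.20, in line with the rough estimate above.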

Now tell me that doesn’t make sense?

OLAP Cube as a Mind Map

If you’ve worked with OLAP technologies for any length of time you’ll undoubtedly have been in the situation where you’ve had to explain the concept of an OLAP Cube to a “newbie”.  If the person in question has come across Excel pivot-tables, then you can probably short-circuit the conversation somewhat, explaining that a pivot table is in essence an OLAP cube, maybe highlighting the differences between it and whatever OLAP tool you’re proposing: ragged hierarchies, ability to update cells and ‘spread’ values down hierarchies etc.  Even better if you can show the user a working cube populated with hierarchies and elements that match the user’s business.
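For readers who think better in code, the core idea can be sketched in a few lines of Groovy: a cube cell is addressed by one element per dimension, and a consolidated element is just the sum of the leaves beneath it. The dimensions, hierarchy and figures below are invented for illustration and have nothing to do with how Palo stores data internally.

```groovy
// Hypothetical hierarchy: a consolidated element maps to its children.
def children = [
    'All Products': ['Desktop', 'Laptop'],
    'Europe'      : ['Germany', 'Ireland']
]

// Leaf-level facts, one row per (product, region) combination.
def facts = [
    [product: 'Desktop', region: 'Germany', sales: 120],
    [product: 'Desktop', region: 'Ireland', sales: 80],
    [product: 'Laptop',  region: 'Germany', sales: 200],
    [product: 'Laptop',  region: 'Ireland', sales: 50]
]

// Expand an element into the leaf elements beneath it (a leaf expands to itself).
def leaves
leaves = { String element ->
    children.containsKey(element) ? children[element].collectMany { leaves(it) } : [element]
}

// Value of a cell: the sum of all facts that fall under the given coordinates.
def cell = { String product, String region ->
    def products = leaves(product)
    def regions = leaves(region)
    facts.findAll { it.product in products && it.region in regions }.sum { it.sales } ?: 0
}

println cell('Laptop', 'Ireland')        // a leaf cell: 50
println cell('All Products', 'Germany')  // rolled up over products: 320
println cell('All Products', 'Europe')   // rolled up over both dimensions: 450
```

A pivot table does the same roll-up over a flat list; an OLAP server adds the named hierarchies and, in tools like Palo, the ability to write values back.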

But what if the person is a complete novice to the world of analytics and you don’t have a relevant demo cube that you can demonstrate, what then?  I guess you could start by first trying to fry the victim’s (sorry, user’s) brain by explaining the concepts behind multi-dimensional spaces and/or a quick intro into the world of de-normalised databases, with examples of snowflake and star schemas. The glazed look in the eyes of your audience may however suggest this doesn’t always work.  And it’s not just business users who have problems getting their heads around cubes; many programmers also have difficulty the first time they’re exposed to OLAP concepts.

I’ve found that Mind Maps offer a good way to help both me and the client to visualise the domain model that a cube will eventually address; having worked through the mind mapping process it’s then easier to take the user with you as you translate this model to a physical cube or star-schema.  Well, Hugo (who’s back blogging after a long absence) promises to take this method a step further: his modifications to FreeMind will allow for the export of a mindmap as a Palo cube!  Check it out …

Groovy as Talend’s scripting language

Although I had decided to use Talend (Java version) as my primary ETL tool I still had one major problem with it, its lack of a scripting tool.  Kettle (Pentaho PDI) has Javascript, Excel has VBA, Picalo has (well OK, is) Python and Talend in its Perl version has Perl.  I could have gone (and did experiment) with calling Javascript, Jython or JRuby via JSR223, but I wasn’t happy with the level of integration afforded by this, opting instead to make command line calls to Python (using SQLite as a data carrier).

Then I discovered Groovy, or I should say rediscovered it, as I’d come across it many years ago when it was far less developed than it is now, liked it then but couldn’t see a use for it at the time and promptly forgot about it.  Then it appeared wrapped in a Talend component, prompting me to do a quick visit to the Groovy website, which turned into a deep-dive into the language; I’d found my scripting tool!

Groovy (by the way, what a terrible name for a language, or is that just me?) is not really a stand-alone language but more an extension to Java itself, offering the full power of Java but with the addition of closures, builders and dynamic types.  In fact, over time Groovy has become more and more Java-like (the biggest gap being the lack of support for anonymous inner classes).

To underline this convergence, Groovy is being developed under the separate JSR 241 rather than JSR 223. There’s full interoperability between both languages; Groovy  compiles down to JVM bytecode and can use Java classes and objects, Java can likewise use Groovy generated bytecode.  This allows for fast prototyping and development without compromising access to Java’s vast collection of libraries.
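A small, hedged illustration of that interoperability, using nothing beyond the JDK: a plain Java collection class is instantiated and driven from Groovy with dynamic typing and a closure. The region names are made up for the example.

```groovy
import java.util.concurrent.ConcurrentHashMap

// A plain Java class, used from Groovy without any wrapper code.
def countsByRegion = new ConcurrentHashMap<String, Integer>()

['Ireland', 'Germany', 'Ireland'].each { region ->
    // The closure is coerced to the java.util.function.BiFunction that merge() expects.
    countsByRegion.merge(region, 1) { a, b -> a + b }
}

println countsByRegion
```

The same script could equally call any third-party Java library on the classpath, which is what makes Groovy a practical scripting layer for a Java-based tool.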

Here for example, is a piece of code to try out the JPalo library’s ability to access a Palo cube …


[sourcecode language='Java']

import org.palo.api.Connection;
import org.palo.api.ConnectionFactory;
import org.palo.api.Cube;
import org.palo.api.Database;
import org.palo.api.Element;
connection = ConnectionFactory.getInstance().newConnection("localhost","7777","admin","admin")
database = connection.getDatabaseByName("Demo");
cube = database.getCubeByName("Sales");
rowElements = cube.getDimensionAt(0).getElementsInOrder();
columnElements = cube.getDimensionAt(1).getElementsInOrder();
dataSet = [rowElements,columnElements,]
dataSet << cube.getDimensionAt(2).getElementAt(0)
dataSet << cube.getDimensionAt(3).getElementAt(0)
dataSet << cube.getDimensionAt(4).getElementAt(0)
dataSet << cube.getDimensionAt(5).getElementAt(0)
// fetch data set
datas=cube.getDataArray(dataSet as Element[][])
connection.disconnect();
// parse the return string
rowcount = rowElements.length;
columncount = columnElements.length;
data=[]
heading=[]
// first row holds the column headings (i.e. "Product" followed by the country names)
heading << "Product"
for (i in 0..columncount-1) {
    heading << columnElements[i].getName()
   }
data << heading
// Now build each output row
for (i in 0..rowcount-1) {
    row = []
    row << rowElements[i].getName()
    for (j in 0..columncount-1) {
         // assumes the flattened result has the first (row) dimension varying fastest
         row << datas[i + (j * rowcount)]
         }
    data << row.flatten()
   }

//output to csv file
def csvOut= new FileOutputStream('c:/data/File.csv' )
for (lines in data) {
         lines.eachWithIndex{col,i ->
                             if (i > 0) {
                                 csvOut << ","
                             }
         csvOut << col
     }
     csvOut << "\n"
}
csvOut.close()
[/sourcecode]


This was done in the Groovy console as a proof of concept; it was then transferred to a tGroovy component where it was parametrised and, instead of outputting to a CSV file, it was used to fill the globalBuffer structure (the structure used by the tBufferOutput component).

Other things I managed to do with Talend tGroovy over a few days:

  • Extended SQLite with my own user-defined Palo functions.
  • Set-up a Talend job as an Excel accessible RESTful web service using Jetty.
  • Interfaced with Amazon S3.

Although I was very familiar with the S3 and JPalo APIs, both SQLite UDFs and Jetty were new to me, and that’s where scripting proves its worth, giving the developer the maximum support with the minimum of background noise.  But it’s not just weird and wonderful new APIs that scripting helps expose; as a datasmithing tool, languages such as Groovy give analysts the ability to quickly de-construct and model datasets (for example, see Groovy’s SQL database support and collections’ functionality).

As an infamous Irish farming-pharma TV ad of my youth put it, “It’s a queer name but great stuff”.