Category Archives: Ruby

LightSwitch & Hobo – the return of the 4GL?

Those of us of a certain age have fond memories of the golden era of 4GLs. These simple, but at the time revolutionary, tools enabled business-aware programmers (usually termed analyst-programmers) to quickly build & deploy line-of-business apps. They (both tools and devs) were primarily data-driven, data begat screens, screens begat more data and so on. The resulting apps where server delivered, using either green-screen terminals or client-side delivery apps (a bit like current-day client-side RIAs). The resulting UIs could best be described as “plain but functional”.

These halcyon days were replaced by a combination of  2-tier Windows apps and by the 3-tier enterprise platforms. The increasing sophistication & complexity of the new platforms forced programmers to become highly specialised, often losing their once close links with the business, and even losing sight of the value of business data as a resource in itself (still amazes me when I come across business application programmers with poor or non-existent SQL/RDBMS skills). The analyst-programmer (AP) was no more.

Many of those APs, like myself, managed to stay close to the business by either becoming business/data analysts or datawarehousing/BI specialists. ERP platforms such as SAP, with their complex configuration requirements, also created a welcoming home for business-focused IT refugees.

The need for quick’n'dirty line-of-business apps has not disappeared, but this service is now often being provided by tools such as MS Access and above all, by Excel. This is both good and bad; the good is the expansion of development skills outside of IT; the bad is the effective dis-arming of a large proportion of professional IT folks. To paraphrase, Marvin the Paranoid Android: “Brain the size of a planet, and all they give me is Excel!”

Most of us managed to make the best of the situation, learning to respect, if not love, Excel; becoming SQL wizards; MDX magicians and dimensional modellers par excellence. But the call of the 4GLs remained and we veterans continue to keep a watchful eye for something that will match and hopefully surpass them.

Oracle’s Application Express (aka HTML DB) offered those with Oracle skills hope (very similar in concept to SQL*Forms V3) and the open source tool WaveMaker is also excellent, styling itself, with good reason, the Powerbuilder for Web Enterprise.

In the last month I’ve come across two modern-day descendants of  these 4GL data-driven tools. Hobo and LightSwitch.

Hobo, is an open source extension to  the Ruby on Rails platform. It uses the same MVC architecture but takes the concept further in not just allowing you to build a relationship model for your data, but also enabling the  easy specification of  a lifecycle model and, here’s the biggy, automatically building (and re-building on change) a very respectable UI to present to the outside world. It also offers a starter authentication framework and lots of other useful helpers. All without writing a line of Ruby code, or having any idea of how RoR works!  The end result is a tool that can not only quickly, and iteratively,  build CRUD type applications, but can also handle simple workflow apps out of the box.

LightSwitch, is similar but yet very different. The same, in that it follows the traditional 4GL data to screen approach; presenting the user with a graphical tool to build data tables, to link to existing data sources and to create relationships between entities. Screens can then be generated that, like Hobo, are professional looking and easy to use. Again like Hobo, new ‘skins’ can be applied to change the look and feel.

If a more sophisticated solution is required, code can be added (VB.NET or C#) at predefined events and indeed the resulting project, being fully VS 2010 compliant, can be opened in VS Professional and built out from there. (A similar ability to get under the bonnet exists for Hobo as it is essentially Rails).

Where LightSwitch differs is the deployment methods used. The end result is a Sliverlight app which can either run client-side (with full access to the client’s environment e.g. interact with Office etc.) or as a sandboxed IIS browser app or via the Azure cloud. Same code, same project can easily migrate back & forth between all three options. (Hobo, being a Rails app could also be sort-of-localised using RubyScript2Exe, but it could very easily be cloud deployed using the EC2-based, dead-simple to use, http://heroku.com/)

The LightSwitch data modeller also allows for relationships between local databases, network databases, SQL Azure cloud databases and web service datasets to be built and maintained within the application. The need for mashups between local & central/remote data is a constant requirement for LOB developers and LightSwitch appears to made it very easy to implement.

This data mash-up ability and the option to interact with the client will be a major attraction, at least for corporate devs working largely with MS tools. I say will, as alas the current beta is  “molasses-in-January” slow. I thought initially it was just my 5yr laptop hitting the wall, but others with more modern & powerful hardware also found it so.

So do tools like Hobo and LightSwitch herald the return of the IT analyst/programmer? Probably not, different times; outsourcing, SaaS and packaged software have and will continue to reduce the number of business-facing IT staff. But their places are been taken by IT-aware business folks, citizen programmers, creators of time-assets and it is they that will likely be the beneficiaries of such tools.

Python the new VBA ?

These last two weeks, Python has been on my mind. First off, last week I decided to make time to fully investigate Picalo, an open-source Python-based data analysis tool, and then, this week, Google announced their long awaited cloud-computing offering, Google Apps Engine, with the language at its core.

Python was the first of the “LAMP generation” scripting languages that I decided to learn in any detail ( I had used Perl before that but only on a per-task basis (similar to how I’d used AWK)). I then invested time in learning PHP, then Ruby and finally JavaScript. And here I am, back where I started, with Python.

But it’s not the same Python I learned three years ago, not that it has changed that much, but my appreciation of the language has, largely due to my deep dives into other languages. For example, JavaScript’s treatment of functions as first-class objects, highlighted the same functionality in Python, something I’d missed (or rather, not fully understood) the first time I encountered the language. Likewise, Ruby’s RoR introduced me to a “best of breed” approach to web application design, something that can be used as a comparison aid when approaching new web frameworks such as Django.

But of course the scripting language that continues to power most of my datasmithing activities is Excel VBA. That’s why I was so excited to see a tool such as Proto utilise VBA as its scripting language. But, Microsoft has abandoned VBA, there will be no more Protos.

Also, Excel VBA is now a Windows only language. Windows, however, is no longer the ‘only’ business client OS (see how many Apple laptops you can spot the next time you’re in a business-class airport lounge, a few years ago it would have been zero, not any more), and is currently nowhere to be seen as a cloud computing platform (but that’ll change).

I’m at heart a table-oriented programmer, and I, like Picalo’s author Conan Albrecht, believe “data analysis is best done through scripting”; but not just data analysis, the T in ETL (Extract, Transform and Load) and the I in DI (Data Integration) and SI (Systems Interfacing) also benefit from a scripting approach.

So, what to adopt as a successor/companion-in-her-old-age to VBA, will it be Ruby, JavaScript, Python, Perl, even PHP?

It looks like it’ll be Python because it’s …

The runner up is of course Ruby, but its poor integration with Windows is a major problem and the datasmithing “prior art” of Picalo and Resolver makes Python hard to beat.

UPDATE Jan 2010:

To experience the best of both worlds, VBA & Python, my xLite (Excel combined with SQLite) datasmithing platform now allows Python to be used in conjunction with VBA.  Check it out here http://www.gobansaor.com/xlite

UPDATE:

Also, as Dan pointed out in the comments below, I’d not included Jython in my list of reasons for embracing Python. I must add it to my list of things to try out particularly as both my “classic” ETL tools, Talend and Kettle are JVM based.

Another thing to add to the (ever growing) list is Mike Pitarro’s SnapLogic python-based ETL tool. They have …

…just released a 2.0 Beta version with some major architectural enhancements. The SnapLogic model is very different from traditional ETL systems. It takes an approach that’s more like the web, based on loose coupling and HTTP interactions. We model data source, sinks, and transformations as URI addressable endpoints, and have a model where than can be chained together in pipelines to build transformation logic. We use a plugin architecture to make it easy to add custom components.

Zimki – the spirt lives on …

Although Zimki is to shut down on Christmas Eve, the ideas behind the service live on. Two new offerings, Horuku and AppJet, offer variations on the idea of hosted application development/deployment.

AppJet, funded by Paul Graham‘s Y-Combinator, is very similar to Zimki, being a server-side JavaScript platform. No details yet as to what sort of paid options will be offered (all accounts are free at the moment). Unlike Zimki there’s no plans to create an open-source version. I like the easy “build a Facebook app” feature; and I guess this is the sort of light-weight applications that they hope to attract.

Although Heroku uses Ruby-on-Rails technology, rather than JavaScript, it is closer to the original Zimki idea; but rather than take the hard (and ultimately unsuccessful in Zimki’s case) road of building an open-source platform from scratch, Heroku takes an already popular open-source project and offers it wrapped in a full on-line development and deployment environment. Again, being in beta, there’s no indication as to what pricing model it will operate under, but I would think that it will attract more “serious” projects than AppJet since anything developed under Heroku is pure Rails which means it can be migrated to any other Rails hosting environment; so no lock-in. The online editor is excellent and whatever about its merits as a hosting service it’s by far the easiest way to learn and explore Ruby and Rails, even easier than this…

If Facebook apps are your goal but you wish to use Ruby rather than AppJet’s JavaScript then not to panic, as being Ruby some bright young spark (no, not me I’m afraid) will already have done a lot of the hard graft for you…

Ruby plus Amazon S3 – Document Centric Database

I’ve said it before and I’m going to repeat myself; learning Ruby has proven to be a great investment, not so much for the language itself but for the insights it gives into other technologies. As soon as a new ‘cool’ technology or idea hits the street some smart Rubyist is bound to attack it, dice it up and serve it back up as easy to digest Ruby code.

Today, it’s the turn of Document Centric Databases done in the style of CouchDB, but replacing JavaScript/Erlang with Ruby and the bespoke data store with Amazon’s S3 service.

Anthony Eden‘s RDDB project is still very much alpha, but looking through the code it looks like it has lots of good ideas, including using EC2 instances as “map reduce workers” listening on Amazon SQS Queues; so the whole Amazon AWS stack might yet get staring roles. The actual data store can be varied, with both partitioned file system and RAM based options currently available alongside S3.

Other Amazon AWS related news, was the announcement today of an option to use European data centres to store S3 data (with a slightly higher charge than using North American locations and with the transfer of data between EU based S3 buckets and US based EC2 instances being no longer free). I’m guessing that the option to fire up European based EC2 servers can’t be far behind. Also, one piece of news I’d missed was that EC2 is now in unlimited beta i.e. it’s now open to all developers. So developers everywhere can, for less that the cost of a mobile text message, fire up their own dedicated and powerful Linux server. The day of a production ready, SLA backed, EC2 service is around the corner.

Google Spreadsheets – ETL tool

Although I’m a total Excel fanboy, I most admit I rarely use it any longer for personal stuff such as home budgets, tax calculations, what-ifs, to-do lists etc.; I now tend to use Google Spreadsheets. Likewise, personal notes, drafts and useful bits of code are stored using Google Docs rather than MS Word. Three main reasons for this shift to the cloud:

  • Google Docs & Spreadsheets are ‘good enough’ for most of the trivial lists and calculations I require in my personal life and indeed for most business purposes as well, at least those that don’t require a pivot table.
  • These spreadsheets and documents are important but not necessarily in the ‘state secret/I-could-tell-but-then-I’d-have-to-kill-you’ scale of things, by building them in Google Apps they are securely backed-up and easily accessible.
  • A lot of the spreadsheets are collaborative in nature, and in the collaboration field, Google Spreadsheets just gets better and better.

Today, Google announced further additions to their spreadsheet product. The AutoFill feature adds functionality I’ve come to expect from Excel, but with a twist, integration with Google Sets. But the additions that really caught my eye were the new data import functions. Now again, Excel has had web queries since Excel97, and it always amazed me why online pretenders to the throne tended to ignore the most common source of tabular data on the web, the HTML table; something to do with the great XML/Tables divide I guess!

Google now not only fixes this omission,providing access to HTML tables and comma/tab separated file, but also provides access to RSS/ATOM and generic XML sources. All that’s missing now are functions that can read other common online data files formats such as Excel, MSAccess, XBase and of course SQLite.

This addition of HTML import support and the AutoFill feature will further reduce the number of times I’ll need to fire up Excel for personal tasks, but the RSS/ATOM/XML import feature also has potential as a tool in my micro-ETL toolbox. Using Excel as my only micro-ETL tool is possible when the data is either already in Excel/CSV or accessible via a COM API or via ODBC drivers, otherwise I can call-in either Ruby, Talend, Kettle or even RSSBus. But now I’ve another option, if the data is public and published as RSS/ATOM or some other variation on XML, I can use Google Spreadsheets to fetch the data and import the resulting tabular dataset into Excel via a Web Query or via the GData API.

New Google Reader Search facilityOne other thing. While researching this post, looking up links etc. I used another new feature Google added today, Google Reader’s new Search facility. As most of my references are discovered via the blogs I subscribe to, the ability to restrict searches to that subset of the web is fantastic; I even used it to search through my own blog posts! If del.icio.us offered the same option it would make re-finding stuff even easier. I did try to use Google Co-Op to build a search engine restricted to my del.icio.us links but it didn’t seem to like the volume of links (4000 odd) I sent it.

Facebook Apps using Ruby on Rails

This is what I love about Ruby and Ruby on Rails, once you learn the basics of Ruby and how a RoR app is put together you can use this knowledge to learn about other technologies, in this case Facebook Applications. The reason for this is, as soon as a new technology hits the street somebody is bound to either build a Ruby library or a RoR plugin targeting the new (and presumably cool) platform. In two excellent articles Stuart Eccles shows how to build a Facebook app using the rFacebook Rails extension. I guess I could have looked at a PHP or Java example but I chose (as I nearly always do) the Rails route as the layers of abstraction and the standard infrastructure of RoR apps allow me to quickly get an overview of the new technology in action but also, if I so desire, allow me to easily deep dive into anything that requires more detailed investigation.

In the late nineties I learned Java (and later .NET) for a similar purpose not as a primary development tool but because of its role as the “language of account” of most new technologies at the time. And although I still come across environments where the only examples are in a Java (or .Net or PHP) I know if I wait a few weeks some bright Rubyist will eventually “document” it in either Ruby or Rails.

Using the sample application built by Stuart I’ve added one of my favourite salad recipes,
Banana & Tomato in a Mustard Dressing
..enjoy.

Like Excel macros? You’ll love this..

One of the most used (and abused) features of Excel is its macro recording facility. How many mundane and repetitive actions have been automated using this feature? How many people found the courage to program in VBA by using the recorder as their training-wheels? Well now iMacros (from German company iOpus GmpH) a Firefox extension and an IE add-on brings macro recording to the web browser. Although Greasemonkey already enables JavaScript programmers to automate Firefox, iMacros offers the same power to non-programmers.

The macros can be saved as bookmarks and can also be shared with others (although I couldn’t get the sharing feature to work). I’ve already automated a number of tedious tasks and had hoped to use it with Google Spreadsheets, alas either the nature of Goggle’s JavaScript or inbuilt protection within Goggle Apps stopped the recorder for working. Pity, as there’s three things missing from Google Spreadsheets as it stands, pivot table support, offline ability and macro support. Google Gears will undoubtedly solve the offline problem, charts are essentially a graphical pivot so I guess a table pivot must be a possibility, iMacros or something like it could perform the duties that VBA provides to Excel. As with Office macros, security issues may well dampen the parade – iMacros allows access to the PC’s file system and a macro can be invoked from a bookmarklet camouflaged as a standard link – but I’ll not worry until (or if) the product goes mainstream.

From an ETL point of view iMacros can act as a powerful web scraping tool and as a automated form-filler. There’s also a commercial version of the product ($499.00) that exposes the tool via an ActiveX API which means Excel/VBA can be used as a web scraping/ form filling environment. If the price tag is too steep then the excellent scRUBTt is both free /open source and is ideal if you’re scraping a lot of data on a frequent basis while for small or once off tasks this equally free and open source Firefox extension is good enough.

RoR Data Warehouse on EC2

If you’ve been putting off evaluating Ruby on Rails and you’re lucky enough to have an Amazon EC2 beta account then it’s your lucky day. Paul Dowman has just made a public AMI (think of it like a virtual machine spec from which you can create a running EC2 instance) with various Ruby on Rails goodies preloaded.

Features:

  • Automatic backup of MySQL database to S3 every 10 minutes.
  • Mongrel_cluster behind Apache 2.2, configured according to Coda Hale’s excellent guide, with /etc/init.d startup script
  • Ruby on Rails 1.2.3
  • Ruby 1.8.5
  • MySQL 5
  • Ubuntu 7.04 Feisty with Xen versions of standard libs (libc6-xen package).
  • All EC2 command-line tools installed
  • MySQL and Apache configured to write logs to /mnt/log so you don’t fill up EC2’s small root filesystem
  • Hostname set correctly to public hostname
  • NTP
  • A script to re-bundle, save and register your own copy of this image in one step (if you want to).

I’ve been meaning to try out Anthony Eden’s RoR based data warehousing tool for some time; no more excuses as I now can fire up an EC2 instance based on Paul’s AMI , install the ActiveWarehouse plugin and away I go. As ActiveWarehouse primarily uses techniques described in The Data Warehouse Toolkit it’s also a good learning tool for those new to data warehousing. All I need now is a sizeable publicly accessible dataset to populate the warehouse to get a true fell for its capabilities. There’s only so much you can do with the venerable Northwind database. Does anybody know of a ‘beefier’ alternative?

Talend vs. Kettle (Pentaho PDI)

Over the last few weeks I’ve received a lot of traffic from Goggle searches comparing Talend and Kettle and also from Vincent McBurney’s ITtoolbox article comparing the two products, so where do I stand?

As ETL tools they take different approaches, Kettle is a meta data driven framework (which is in turn tightly integrated into an even larger BI framework, the Pentaho BI Project), while Talend is at the end of the day a code generator. The code generator nature of Talend does impart certain capabilities such as the ability to easily integrate into other BI platforms (such as JasperSoft and SpagoBI ) and as micro-ETL tool. On the other hand Kettle’s increasing integration into the Pentaho family makes it natural choice for larger BI implementations. So which one would I recommend ? It’s got to be Kettle and the main reason for that choice is the Pentaho PDI community.

My experience so far with the Talend community has not been good. Admittedly I’ve only asked two questions, one looking for more details on a statement regarding ERP and CRM connectors mentioned in a press release and the other a technical query about a new tSQLite connector. Neither were answered. OK I’m a big boy I can handle the rejection and I’m also technically savvy, I can eventually fix most of my own problems. But I’m sure if I posted similar questions on the Pentaho forums I would have received an answer, probably from Matt Casters if he though the general community would not be able to provide the answer.

So will this experience stop me from using Talend? Maybe not, the code generation capability may come in handy as a micro-ETL tool (but I do have other options); however, through an offline conversation with Samatar (an active member of the Kettle community) and Matt Casters I learned of a new XML input step and of a PDI alternative to the Talend tJavaFlex component, the “Modified Java Script Value” step. This allows for the creation of START, TRANSFORM (i.e. for-each-row) and END JavaScript scripts. As Kettle uses the Rhino scripting engine I have access to any Java API (e.g. Palo,S3, Google Spreadsheets) but with the flexibility of JavaScript; brilliant!

Not only that, since my experience with Talend tJavaFlex reintroduced me to the world of Java I though I might as well look into creating my own Kettle plugin and guess what, it looks easy enough. So I’m going to develop my first plugin to be based on the ScriptValueMod step, replacing the Rhino engine with the Java 6 ScriptEngineManager. This will give me access to the latest version of Rhino which supports E4X (will make handling XML must easier) and also opens the possibility of using JavaFX or indeed JRuby as Kettle scripting languages.

UPDATE:

I’ve received a reply to my tSQLite question!

UPDATE: July 2008

This post is now over a year old, Talend as a product and as a community has improved enormously in the last 12 months; so much so, I now use Talend  (Java) in preference to PDI for most of my datasmithing needs.

JavaFX – a GUI DSL

Having mastered JavaScript (OK master is too strong a word – having become comfortable with both its syntax and usage patterns) my next port of call is JavaFX the recently announced Flash/Silverlight competitor. What led me to JavaFX Script was not its role in this Flash/AJAX alternative platform (which unless  Sun improves the JRE download experience is dead in the water they may have a chance with the promised consumer JRE ) but its status as the 2nd scripting language to be supported by the Java 6 ScriptEngineManager – the first being JavaScript. Although the JavaFX platform is still in alpha, the important elements of the scripting language appear to be usable.

Why do I need another scripting language? Well JavaFX Script is a DSL (domain-specific language) and the domain in question is the GUI and as a GUI DSL, JavaFX is impressive.

In my quest for a micro ETL environment the lack of a fast and powerful GUI tool has been a problem. Admittedly, in my Excel/Sqlite xLite suite GUI generation is not a problem as VBA forms are both fast to develop in and feature rich, but the other two options (Java and Ruby) both lack a cost-effective GUI tool (I know there’s Swing and Tk but for somebody who’s used to the speed of development of tools such as VB forms or Oracle’s Application Express, neither appealed). JavaFX changes all that, it’s elegant, powerful, fast to develop in and portable. Further pushes me down the road of using Talend generated Java as my micro-ETL environment of choice.

It’s easy to dismiss JavaFX as hype but the technology behind the hype looks sound. The language borrows powerful features from other languages, object literals from JavaScript, list-comprehensions from Python/Erland/Haskell and first-class functions from JavaScript/Lisp all combined with the full power and glory of Java. The end result:

..is a declarative and statically typed programming language. It has first-class functions, declarative syntax, list-comprehensions, and incremental dependency-based evaluation. The JavaFX Script language makes intensive use of the Java2D swing GUI components and allows for easy creation of GUIs.

Go the community page for FAQs, tutorials and reference information. Timothy O’Brien has a good first pass at calling JavaFX from a Java program and this article by John Smart explains the basics of using ScriptEngineManager to call JavaScript.

VBx – the future VBA?

With the future of VBA being a concern for many Office professionals, some of the  MIX07 announcements around dynamic language support in Silverlight may shed some light on a potential replacement for the VBA, VBx.

There’s life in the old dog yet, it may yet get to share the limelight with its cooler cousins Ruby and Python!

HacketyHack from WhyTheLuckyStiff

No it’s not a tip for the 5.50 at Punchestown it’s the latest project from _why (a legend in the world of Ruby, if a language as young as Ruby can have legends). HacketyHack is a framework to teach kids how to program, built using Ruby and the gekco browser engine, it’s free and it’s an ideal next step when your kids get a bit bored with MIT’s Scratch. It’s not as much “fun” as Scratch but it will guide kids into the “real world” of programming and as the code is pure Ruby, the sky’s the limit. If fact if any “real programmers” out there have been meaning to learn some Ruby to see what all the fuss is about, HackeyHack is an excellent way to start.

Want your kids to have the programming itch, then Scratch it!

Having looked in the past for a suitable introduction to programming for my 10 year old son I had come to the conclusion that the existing options (such as KPL) where too ‘wordy’ and not able to compete with the point and click powered online/gaming worlds that youngsters now inhabit. That was until I discovered Scratch, an education oriented programming language developed by MIT.

As an introduction to programing it is superb and it’s also fun. I could not drag my son away from it and not only that, within two days he had come in contact with (and understood) most of the important structures of any programming language, including loops, if-then-else statements, variables, encapsulation, events, messages (the tool is built in Smalltalk and it shows).

When he outgrows Scratch, he would then be capable of moving on to other more powerful education oriented languages such as Phrogram (the next version of KPL) or Squekland (built using another Smalltalk variation) or indeed a ‘real’ language such as Ruby.

I could have done with this 15 years or so ago when I was trying to get my head around OOP concepts (when I started coding, procedural languages such as Cobol, PL1, Fortran, MUMPS and various types of BASIC where the weapons of choice for commercial development).

I also think end-user focused language designers could also learn a lot from the design of the scratch IDE; perhaps if Microsoft were ever to retire the venerable VBA, they should replace it with a Scratch-like macro language?

Check out the video and definitely download it if you have kids in the house…

Build a Data Warehouse using Ruby On Rails

Nikola pointed out ActiveWarehouse , a new RoR Data Warehousing project. Haven’t tried it out myself but the author has posted an excellent tutorial (post deleted but can be found here on Wayback Machine) on using the plug-in to create an example warehouse. Along with its ETL and SQL Views sub-projects ActiveWarehouse provides a great foundation for Ruby based BI systems. Being a early stage RoR project the use of MySQL as the database layer is assumed but as it’s based on ActiveRecord adding support for other relational databases should be trivial.

Along side ROLAP support, the ability to load and retrieve data from MOLAP sources such as the open source Palo server would be a nice addition; while the ability to combine both MOLAP and ROLAP together under one platform (as Oracle now supports) would be fantastic.

What’s so good about SQLite?

…continuing my posts on Ruby and SQLite as a micro ETL environment.

I’ve written before that my most important take-away from Ruby On Rails was the language Ruby, but RoR also introduced me to SQLite. Although its typical use in Rails is as a development database I quickly realised this open source ,very fast and zero-configuration SQL database has a role to play in the real world of data analytics; a cross-platform and free alternative to the MS Access Jet database engine. Like MS Access, SQLite is not client-server based, so no separate services or jobs required on the host machine; a full database is held in a single file, making it easy to transfer from one machine to another.

Will SQLite replace MySQL, Oracle, SQLServer or indeed the venerable MS Access? No, but as an embedded database or as a light-weight local data-store it has many advantages:

  • SQLite supports most SQL92 syntax, including database triggers, makes skipping back and forth between MySQL, Oracle and SQLite much easier.
  • SQLite runs on all major OS and programming language platforms, e.g. a database populated on Windows by VBA can be read or modified under Linux by a Ruby program.
  • It’s free, in future the Jet engine will be no longer be distributed with the latest Microsoft Data Access Components, you’ll have to purchase Access 2007.
  • SQLite supports in memory tables, making it possible to create a local “ROLAP star” with the fact table file-based and the dimension tables loaded within memory.
  • Several free GUI administration tools.
  • SQLite does not enforce data type constraints. Any data can be inserted into any column.

Now, the last point may seem more like a bug than a bonus, but in the world of ETL, dirty data is a fact of life; those working at this end of the pipe must accept what’s thrown at them. A forgiving weakly-typed data store (like a late-binding and weakly-typed scripting language) can be just what the doctor ordered in such an environment. In my early career as an OLTP developer I would have baulked at the lack of strong-typing but now I side with Bruce Eckel’s essay on strong-typing versus strong-testing , productivity is what matters and the ability to test is the most important skill that a developer should possess.

See also SQLite as the MP3 of data

Why Ruby?

.…continuing my posts on Ruby and SQLite as a micro ETL environment.

Like many people my first introduction to Ruby was via Ruby On Rails and like others, RoR introduced me to a best of breed approach to developing scalable MVC based web sites. But as building scalable web sites is not what I do, the real take-away from Rails was the language Ruby.

My major skill is as a ‘hacker and hauler’ of data, what I call data smithing. In the past I would have used a combination of readily available ‘corporate tools’; a SQL database (usually Oracle/Access), procedural languages (usually PL/SQL, VBA and various Unix scripting shells (bash, awk, Perl)) and last but not least Excel. Usually ‘hacking’ away at the database level, occasionally using APIs but even those APIs tending to match the underlying data schema. But now with the explosion in web service based APIs, a new tool is required. This is where , at least for me, Ruby comes in.

Ruby is an open source, multi-paradigm, interpreted and increasing popular language, and all four attributes combined to make Ruby a natural choice for inclusion in my data smithing tool-bag.

Open Source:

Being open source, the language has been ported to all major OSs including Windows, has a vast array of 3rd party libraries (many ported from Perl) and very powerful supporting tool-sets such as a Rake and Gem. There is also a tool RubyScript2Exe which can transform a Ruby application into a standalone, compressed Windows, Linux or Mac OS X executable, i.e. no need to have Ruby installed on the host machine. Ruby runs not just under its own VM, but has also been ported to the .NET (Ruby.NET) and Java (JRuby) VMs and indeed with the support for embedded scripting languages within Java 5 (starting with Javascript) this trend will continue.

Multi-paradigm:

Ruby can be programmed as if it was a shell scripting language; one procedural command followed by another (ideal for quick scripts) or used as most of use use VBA; a scripting language to manipulate existing sophisticated and professional object libraries or using its powerful OOP capabilities the language can be (and is ) used to develop robust and easy to use object libraries.

Interpreted:

Data analysis and the associated ETL process is, at least in its discovery phase, a ‘hacking environment’, i.e. a lot of suck-it-and-see, both the inputs (raw data) and the outputs (reports and dashboards) tend to be fixed outside the control of the data analyst, in such situations a forgiving and flexible development language is essential. VBA provides that flexibility and so does interpreted Ruby.

Popular:

Ruby’s popularity has increased over the last two years attracting enthusiastic and bright programmers to play with it and to produce some really useful code. If a new service appear, such as Amazon’s S3, you can be guaranteed within a short period several high quality helpful Ruby libraries will also appear.

My next post I’ll explain why I like SQLite….