Category Archives: AmazonAWS

SQL – does exactly what it says on the tin

SQL, how unloved it must feel sometimes: constantly maligned, accused of being on the wrong side of the object-relational impedance mismatch, and lacking the glamour of the OO programming languages that claim the moral high ground. Yet all the while it hews and hauls most of the world’s structured data on its old but well-fashioned back.

SQL is perhaps the world’s most popular DSL: a declarative language for the manipulation of tabular data, easy to learn yet capable of powerful (and sometimes complex) expressions.  And like the Ronseal ad, a SQL statement, no matter how simple or complex, does exactly what it says. All the complexity of loops and iterations, and the attendant errors, is abstracted away; it just works!
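To make the "no loops" point concrete, here is a minimal sketch in Python using the standard sqlite3 module. The sales table and its rows are invented purely for illustration; the point is only the contrast between writing the iteration yourself and stating the result you want.

```python
import sqlite3

# Hypothetical sample data: (region, amount) rows invented for this illustration.
rows = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("east", 55.5)]

# Imperative version: the loop, the accumulator and the missing-key handling
# are all ours to get wrong.
totals_loop = {}
for region, amount in rows:
    totals_loop[region] = totals_loop.get(region, 0.0) + amount

# Declarative version: say what is wanted and let the engine do the iterating.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
totals_sql = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(totals_sql)
```

Both versions produce the same totals; the SQL one simply has fewer places for a bug to hide.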

SQL is both a programmer and an end-user tool; after Excel formulas, it’s the language most likely to be understood and used by “civilians”.  There are few enough such cross-over tools, so think twice before building a datastore that doesn’t offer a SQL API.  And I guess that’s what Amazon did. Although SimpleDB is not a relational database, they’ve decided to add a SQL API, following Google’s lead with its SQL front-end to the non-relational, Bigtable-backed Google App Engine datastore.

SQL is also the reason why I’ve integrated SQLite with Excel, leveraging SQL to manipulate tabular data with greater efficiency and fewer errors while still keeping the touchy-feely power of Excel.  I expose SQLite to Excel via UDFs rather than menu options or wizards, so that the transformation logic is visible and approachable (at least to those comfortable with Excel formula “programming” and with basic SQL).

SQL is my weapon of choice because of my belief in the primacy of data. It is data that matters in the long run, not the algorithms or GUIs that temporarily use (and abuse) it.  In my time in Guinness Ireland I had the task of transferring master and historical transactional data from “legacy systems” into SAP, Siebel and a new data warehouse; data that I had myself transferred, a decade and a half earlier, into those same legacy systems from even older systems. In fact, the data’s electronic lineage could be traced back to a 1960s-era ICL mainframe (I have the original spec!) and I’m sure it existed in accountancy machine punch-cards prior to that. Understand a business’s data and you’ll not just understand the business as it currently operates but also how it operated in the past and its future potential.

SQL abú.

Why not join me on Twitter at gobansaor?

Windows on EC2 = SMEs on EC2

The announcement that Win2003 is now an option on EC2 is very significant. That, and EC2’s exit from beta status with an SLA in tow, means that AWS is now very much more appealing to the great unwashed: the SMEs, i.e. the businesses that form the backbone of most of our economies.

Large companies and start-ups are comfortable in the world of Linux servers but most small companies are Windows to the core.  This may not be “right”, this may not be how it “should be”, but it is so.  Even within large companies, departmental computing is largely a Windows-only enclave, with MS Office (and Excel in particular) as the backbone and MS SQL Server as the database of choice (or is that, no choice).

The other interesting thing is that my fear that EC2 SQL Server Standard instances would be licensed as per Oracle has not come to pass (Oracle, while making a “big thing” of their recent EC2 cloud conversion, still insist on traditional licensing for EC2 database instances). SQL Server Standard is available on a pay-as-you-go model, brilliant!

Even if running Win2003 as a server doesn’t catch your fancy, EC2 may still be of use. Say you would much rather replace your existing Windows laptop with a cool new Apple Mac, but you still need to run Windows-only software: why not use EC2 as your on-demand, pay-as-you-go Windows desktop replacement?  Simply configure a Windows AMI with your required software (you may have to use something like this, if software is only available on CD); you could then use Jungle Disk to easily share data (via S3) between your shiny new Mac and the AMI.  Power up and down as required; it’s easier than using VMware or Parallels and, at 12.5c per hour, probably cheaper too.
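Whether "probably cheaper" holds depends on how many hours the instance actually runs. A rough back-of-envelope sketch follows; the usage patterns are assumptions picked for illustration, and the figure covers instance-hours only, not S3 storage or data transfer.

```python
HOURLY_RATE_USD = 0.125  # the 12.5c per hour quoted above


def monthly_cost(hours_per_day, days_per_month, rate=HOURLY_RATE_USD):
    """Instance-hours cost only; storage and data transfer are not included."""
    return hours_per_day * days_per_month * rate


# Hypothetical usage patterns, chosen only to show the spread.
part_time = monthly_cost(2, 20)    # a couple of hours on working days
always_on = monthly_cost(24, 30)   # never powered down

print(f"2 hours a day, 20 days: ${part_time:.2f}")
print(f"left running all month: ${always_on:.2f}")
```

The pay-as-you-go desktop only stays cheap if you remember to power it down.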

Why Larry hates the cloud, and my data trinity.

Last week Oracle certified Amazon EC2 as a supported platform; that same week Larry Ellison attacked the concept of cloud computing as pure hype. Obviously, Larry is not happy with this whole cloud thing, and I think it’s not just the threat it poses to the software industry’s traditional licensing model that worries him. Rather, as Robert X. Cringely points out in his “Cloud computing will change the way we look at databases” post, it’s the likelihood that it sounds the death-knell for large-scale traditional databases.

This new database paradigm is memory-centric rather than disk-centric, with the disk-based element acting as an archive/backup/restore mechanism which can easily be stored on commodity SAN devices (e.g. Amazon’s EBS). Using MapReduce technology Google effectively holds the whole Internet in memory, not in one big super computer but in lots of cheap commodity servers.

But it’s not just in the realm of mega datasets that RAM-based databases threaten traditional models. Excel is a memory-based database engine, and so too are in-memory OLAP tools such as Palo. Such products’ ability to handle large volumes of data has increased over the years, with the decrease in RAM costs and the appearance of cheap 64-bit machines (which are no longer limited to 2GB/3GB process working sets).

That doesn’t mean that we’ll throw away SQL databases in their entirety; SQL and the relational model will continue to be useful, but perhaps of greater use in local datastores/caches than as the building blocks for large-scale datastores. For such local caches, less will be more; fewer features, easier to configure, more flexibility. That’s why I like SQLite; long after the dinosaurs of the database world have disappeared, I imagine SQLite databases will continue to survive, embedded in mobile phones, browsers, wherever a local datastore is required. And more than likely operating in memory rather than off disk.
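As a small illustration of the memory-first, disk-as-archive pattern described above, the following sketch uses Python's standard sqlite3 module (the `Connection.backup` call needs Python 3.7 or later). The table name and rows are made up; the working copy lives in RAM and the disk file is touched only for archive and restore.

```python
import os
import sqlite3
import tempfile

# Work entirely in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
mem.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.0)],  # invented sample rows
)
mem.commit()

# Archive: copy the in-memory database to a file on disk.
archive_path = os.path.join(tempfile.mkdtemp(), "archive.db")
disk = sqlite3.connect(archive_path)
mem.backup(disk)
disk.close()
mem.close()

# Restore: load the archive back into a fresh in-memory database.
source = sqlite3.connect(archive_path)
restored = sqlite3.connect(":memory:")
source.backup(restored)
source.close()

row_count = restored.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(row_count)
```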

By combining Excel with an in-memory SQLite database, linked to a Palo OLAP in-memory server, it’s possible to take advantage of three powerful data-processing technologies (spreadsheets, SQL, multi-dimensional cubes) all within your PC’s RAM. You could do serious datasmithing with such a combination on a pretty mediocre laptop, with most modern machines providing an excess of CPU power, no need for super fast disks, just as much memory as you can muster. And, with Windows on EC2, these three amigos will soon be capable of being used as a cloud bursting platform.

Excel, SQLite and Palo, my data trinity.

Clouds no longer pass by Windows.

Amazon today announced that later this year, Windows Server would be available on EC2. No details on cost and licensing etc. but this is major.  Up until now, that portion of the business world who are pure MS shops (a very large percentage, especially amongst SMEs) were excluded from taking advantage of Amazon’s amazing (and getting more amazing every day) EC2 platform.

From my point of view, as with Oracle’s announcement last week, this releases yet more of my “legacy” skillset for deployment in the clouds. Although I’ve been involved with  *nix servers for 20 years or so, as corporate servers became more locked-down (and removed to the control of 3rd party data centres) I lost day-to-day experience of using them; in latter years my main ‘hands-on’ platform was Windows, either my own PC or local departmental NT servers. Windows on EC2 will allow me to use a whole new set of Windows only software (e.g. RSSBus or XLsgen) and of course SQLServer.

The lack of SQLServer on EC2 has been a major problem for me as a datasmith; there’s an awful lot of data out there sitting in SQLServer databases, but currently if I need to “cloud burst” such datasets I would have to first extract the data to, say, csv files and then load the data on to a Linux compatible database. But with a SQLServer instance running in the cloud, I could simply use SQLServer’s native backup/replication tools.  No more need to download data to my “ground-based” PCs resulting in quicker turnaround and fewer data security risks.

On the licensing front, I’m presuming that the OS licence will be on a pay-as-you-go basis, but what about SQLServer and other server products?  Will MS do an Oracle on it, i.e. require a traditional upfront use-it-or-lose-it payment, or will they go the radical (but I think inevitable) path of a licence-by-the-hour?

First RedHat, then Sun, then Oracle and now Microsoft; the mighty beasts of our industry have acknowledged there’s a new mighty beast on the prowl, dressed as a humble bookseller no less!

Oracle embraces the cloud.

 

In a previous post I had wished for Oracle to clarify its position as regards the use of their databases on a cloud platform. Well, it looks like they have!

They have officially certified Amazon EC2 as a supported platform on which to run their software, not only that, they appear to be embracing the cloud big time, providing pre-configured AMIs and management tools.

For someone like me who has Oracle in the blood (since Version 5 in the 1980s) this is very good news. As I’ve said before…

As for using Oracle on EC2, yes please. Most of my datasmithing career has been spent behind the wheel of an Oracle database, the front-ends might have been Excel or some BI package, the end results might have been SAP master data take-ons or an Essbase cube, but the blood and guts were always Oracle. And this was before Oracle Apex – think what wonders could have been achieved if I had access to such a product in the past.

Although the licensing is not a pay-as-you-go model, it’s a start. Who knows, some enterprising firm of DBAs might purchase enterprise licences and repackage access for those wishing to use it for “cloud bursting” (adding utility resources to scale-out / scale-up).  Also, there’s Oracle’s free XE edition for low-volume datasets, and for developers who need access to the enterprise editions the usual “free to develop on” OTN licenses apply. Except now there’s no need to first source a suitable spare machine or download a multi-gigabyte install package, and of course no more installation headaches; just fire up an Amazon EC2 AMI, easy peasy.

Oracle is also providing an Oracle Secure Backup Cloud tool which brings the power of Oracle backup and restore technology to S3.  This, combined with Amazon’s Elastic Block Store, makes the EC2 platform an ideal home for many Oracle database applications.

The major attractions to me of Oracle as a datasmithing tool (besides my 20+ years experience of using same) are…

  • Oracle Application Express (aka APEX, previously known as HTML DB).  For fast, robust data-centric web apps for deployment within the firewall (or via VPN), it’s hard to beat (but also see WaveMaker). In a micro ETL environment, it provides a quick and easy means of distributing data cleansing tasks such as adding additional attributes or assigning hierarchies to dimensional data.
  • Oracle SQL engine/optimizer technology is fast, powerful and can handle anything you throw at it (as long as it’s valid SQL).
  • PL/SQL, the best DSL for data handling and data cleansing.
  • Oracle’s market position as a “safe and respectable” home for corporate data.

While I still have reservations about Oracle’s commitment to further develop (and patch) XE, at least its appearance at the heart of their cloud initiative reassures me that they are unlikely to abandon it totally.

Amazon’s SAN in the cloud is a mirage…

This morning I got very excited.  While quickly scanning the headlines of the 1000+ unread feeds that had accumulated in my Google Reader this week, one heading in particular caught my attention, “Amazon Elastic Block Store goes live!“.

The post from the RightScale folks gives a detailed overview of the new Amazon ‘SAN storage in the cloud’ service, aka Elastic Block Store, aka EBS.  Alas, this particular cloud offering was a mirage: the post was subsequently removed (but can still be viewed on Robert Scoble’s Shared Items). It seems the post was a work-in-progress and not intended for publishing, yet!

Why was I so excited?  Amazon EC2 had two major shortcomings when it launched 2 or so years ago; the first, ephemeral IP addresses, was solved by the new Elastic IP feature; the second, ephemeral storage volumes (when you shut down an instance the disks are wiped!), is due to be solved by EBS.  With both of these problems solved, EC2, already near perfect, would be perfect.

The article does a good job of explaining the new service…

EBS starts out really simple: you create a volume from 1GB to 1TB in size and then you mount it on a device on an instance, format it, and off you go. Later you can detach it, let it sit for a while, and then reattach it to a different instance. You can also snapshot the volume at anytime to S3, and if you want to restore your snapshot you can create a fresh volume from the snapshot.

The thing that caught my eye in the above paragraph was the snapshot facility.  Snapshots are to be stored on S3 via an EC2-specific incremental-snapshot API.  This means the volumes will come with a built-in back-up facility. This is important as EBS drives reside in one availability zone (that of the instance that they are mounted against) and do not have the data replication security offered by S3.  It also means that disk systems can be restored quickly and simply from snapshots without the overhead  (and bugs!) of writing an S3 specific incremental backup and restore utility.
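To show why an incremental snapshot is cheaper than a full copy, here is a toy model in Python. It is emphatically not how EBS is implemented (the post gives no such detail); content-addressed blocks, the tiny block size and the in-memory `store` dict standing in for S3 are all assumptions made for the sake of a runnable illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def split_blocks(volume: bytes):
    """Cut a volume image into fixed-size blocks."""
    return [volume[i:i + BLOCK_SIZE] for i in range(0, len(volume), BLOCK_SIZE)]


def take_snapshot(volume: bytes, store: dict):
    """Upload only blocks not already in `store`.

    Returns (manifest, blocks_uploaded), where the manifest is the ordered
    list of block digests needed to rebuild the volume.
    """
    manifest, uploaded = [], 0
    for block in split_blocks(volume):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block
            uploaded += 1
        manifest.append(digest)
    return manifest, uploaded


def restore(manifest, store):
    """Rebuild a volume image from a snapshot manifest."""
    return b"".join(store[d] for d in manifest)


store = {}  # stand-in for the snapshot storage
v1 = b"AAAABBBBCCCCDDDD"
snap1, uploaded1 = take_snapshot(v1, store)   # first snapshot: every block is new

v2 = b"AAAAXXXXCCCCDDDD"                      # one block changed since snap1
snap2, uploaded2 = take_snapshot(v2, store)   # second snapshot: only the changed block

print(uploaded1, uploaded2)
```

Either snapshot can still be restored in full, because each manifest lists every block it needs, old or new.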

Back to waiting…

UPDATE: 20th August

Wait over…

Amazon S3; there’s a holdup on the buckets, Dear Liza…

Amazon’s S3 service has been down since 9.00am PDT but I only noticed an hour ago (2.30pm PDT) when an EC2 instance launch failed.

Am I worried? No, but as I become more and more dependent on such services, perhaps I will be; then again, at least I’ll not be alone.  WordPress.com and countless others will be using the same excuse to their customers and, unlike Reginald Perrin, who had a different excuse every day for his train’s late arrival…

Ep.1   “Eleven minutes late, staff difficulties, Hampton Wick.”
Ep.1   “Eleven minutes late, signal failure at Vauxhall.”
Ep.1   “Eleven minutes late, staff shortages, Nine Elms.”
Ep.1   “Eleven minutes late, derailment of container truck, Raynes Park.”
Ep.1   “Eleven minutes late, seasonal manpower shortages, Clapham Junction.”
Ep.2   “Eleven minutes late, defective junction box, New Malden.”
Ep.4   “Eleven minutes late, overheated axle at Berrylands.”
Ep.4   “Eleven minutes late, defective axle at Wandsworth.”
Ep.5   “Eleven minutes late, somebody had stolen the lines at Surbiton.”

a whole industry will shout in unison “6 hours late (and counting), overheated axle on US Buckets…”

NX rather than VNC for EC2 Desktop

The various Amazon EC2 AMIs that I’ve built over the last few years are getting a bit long in the tooth. Most are based on Fedora 4 and nearly all are over-burdened with software I no longer use nor require. Time for some rationalisation.

I figure I need two ‘template’ AMIs, one containing the bare minimum of software, EC2 tools, Python, Perl and Java; the second loaded with the likes of Kettle, Talend, Hamachi VPN, Oracle XE, Palo MOLAP Server and Palo ETL Server and a Gnome desktop accessible via VNC.

I’m deciding whether to use CentOS or Ubuntu as the basis for one or both templates. I’m more familiar with CentOS’s Red Hat heritage but Ubuntu’s design goals of ease-of-use and ease-of-update appeal.  Since I was in the process of re-evaluating my EC2 builds I decided to also check out NX as an alternative to VNC. I had tried to install NX Server on a Fedora 4 instance a few years back, but had abandoned the effort having spent the best part of a day on it, reverting back to my VNC comfort zone.

This time I was able to use one of Eric Hammond’s Ubuntu AMIs with NX pre-installed.  Wow, what a difference! It’s much more responsive, even over my temperamental fixed wireless broadband connection. I also tried it using my backup ISDN line, again a huge improvement compared to using VNC. If you’re still using VNC to remotely access EC2 or any other remote server, you’ve got to check out NX.

Oracle in the cloud …

… not yet, but Bill Hodak from Oracle has just opened a thread over on the Amazon AWS developer forums, looking for feedback on the use of Oracle in AWS projects. First there was Red Hat, then this week’s announcement from Sun and now Oracle; has Amazon managed to turn itself into the cloud provisioner not just for the hungry masses of start-ups and independent developers but for the technology elites?

As for using Oracle on EC2, yes please. Most of my datasmithing career has been spent behind the wheel of an Oracle database, the front-ends might have been Excel or some BI package, the end results might have been SAP master data take-ons or an Essbase cube, but the blood and guts were always Oracle. And this was before Oracle Apex – think what wonders could have been achieved if I had access to such a product in the past.

When EC2 first appeared I enthusiastically installed Oracle 10g Express, using a Hamachi VPN to tunnel the Apex front-end back to my PC (don’t ever expose an Oracle 10g server to the public internet, its architects assumed it would be used solely within the corporate firewall). I even used the power of Oracle’s redo logs to partially protect against the ephemeral nature of EC2′s disk storage.

It looked to me back then that EC2 could be an ideal hosting environment for Oracle Application Express (aka Apex, aka HTML DB), but for a few wee problems:

  • It’s not absolutely clear whether the Oracle 10g Express database licence covers its use in a virtual environment (sometimes the restriction of one database per server is stated as one per machine); a few attempts to look for a definitive yea or nay on the product’s support forums elicited no response. I’m guessing it’s fair usage, but confirmation would be nice.
  • Oracle doesn’t appear to know what to do with Apex, you get the impression they’re afraid it’ll cannibalise its lucrative J2EE business.
  • 10g Express is severely hobbled as a database, not just the 4GB per server (or is that machine), it’s lacking any sort of updating service, serious security flaws remain unpatched and username/passwords are sent in plain text; making it suitable (and then only barely) for use within a firewall or VPN.
  • Once you outgrow Express, you’re into big money and even worse you might have to talk to a sales rep!

So what would I like to see Oracle offering on EC2? A paid AMI, preloaded with a variation of Express, minus the 4GB limit, with a “hardened” public internet facade, along with regular patches automatically applied. Optional add-ons…

  • Various levels of support, fixed monthly charge perhaps.
  • Ability to upgrade to the full Enterprise Editions, but again paid for via a combination of AMI hourly charges and optional month-to-month support charges.
  • Ability to purchase once-off consultancy, both from Oracle and third-party suppliers.

I’m not holding my breath though…

Oh, if you’re confused over the various “Express” terms used in the above, don’t blame me, blame Oracle. I think the poor branding profile (constant name changes, copy-cat names) is an indication of Oracle’s lack of commitment to both products.

UPDATE Sept. 22nd 2008

Looks like the Oracle Cloud has arrived..

SQLite – the ultimate data-smithing tool!

Although my data-smithing tool box is full to the brim with powerful tools such as Talend, Kettle PDI, Picalo and Excel, all backed by the cloud infrastructure of Amazon’s S3, SimpleDB and EC2, there’s one simple yet powerful tool that I always seem to gravitate back to: SQLite.

Now obviously being a hewer of data, I need a SQL compliant database for data manipulation and SQLite performs that task with speed and ease. But it’s not just in the hewing, it’s in the hauling of data where SQLite also shines.

I use SQLite as the container for passing tabular datasets between (and within) my various tools. That data doesn’t even need to be clean (thanks to SQLite’s liberal manifest typing rules), just so long as it can be expressed as a table.
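Manifest typing is easiest to appreciate by example. The sketch below, using Python's standard sqlite3 module, loads deliberately messy values into a column declared without a type; the table and values are invented for illustration. SQLite stores each value with its own type, and the cleaning can be deferred to query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No column types declared: each stored value keeps whatever type it arrives with.
conn.execute("CREATE TABLE raw_import (customer, amount)")
conn.executemany(
    "INSERT INTO raw_import VALUES (?, ?)",
    [("Acme", 100), ("Bolt", 99.5), ("Cog", "n/a"), ("Dyn", None)],  # dirty on purpose
)

# typeof() reports the type of each individual value, not of the column.
types = [
    row[0]
    for row in conn.execute("SELECT typeof(amount) FROM raw_import ORDER BY rowid")
]

# Clean at query time: total only the values that really are numbers.
total = conn.execute(
    "SELECT SUM(amount) FROM raw_import WHERE typeof(amount) IN ('integer', 'real')"
).fetchone()[0]

print(types, total)
```

A strongly typed database would have rejected the "n/a" row at load time; here it travels along with the rest and can be dealt with when convenient.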

For example; a Talend job could store an extracted dataset in a SQLite file, pass that file on to a Python script for some special processing (for example extracting further data from a source not directly supported by Talend such as SAP or SimpleDB), and then pass the resulting SQLite database on to Excel or a similar tool to allow a business user to view and perhaps modify the data; finally Talend picking up the file again to load it into a corporate data warehouse.
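The hand-off described above can be sketched in a few lines of Python. The three functions below are hypothetical stand-ins for the real steps (an ETL extract, a scripted enrichment, a final load); the table, columns and file name are invented. What matters is that the only thing passed between steps is the path to one SQLite file.

```python
import os
import sqlite3
import tempfile


def extract_step(path):
    """Stand-in for the ETL extract: write the raw dataset into the file."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE staging (sku TEXT, qty INTEGER)")
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?)", [("A1", 5), ("B2", 0), ("C3", 12)]
    )
    conn.commit()
    conn.close()


def enrich_step(path):
    """Stand-in for the scripted step: add a derived column in place."""
    conn = sqlite3.connect(path)
    conn.execute("ALTER TABLE staging ADD COLUMN in_stock INTEGER")
    conn.execute("UPDATE staging SET in_stock = (qty > 0)")
    conn.commit()
    conn.close()


def load_step(path):
    """Stand-in for the final load: read the finished rows back out."""
    conn = sqlite3.connect(path)
    rows = conn.execute(
        "SELECT sku, qty, in_stock FROM staging ORDER BY sku"
    ).fetchall()
    conn.close()
    return rows


handoff = os.path.join(tempfile.mkdtemp(), "handoff.db")
extract_step(handoff)
enrich_step(handoff)
loaded = load_step(handoff)
print(loaded)
```

Each step could just as easily be a different tool or a different machine; the file is the interface.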

Now you could use flat files to transport the data or store the intermediate results in a corporate database, but SQLite is as easy, if not easier than, flat files and offers the SQL processing capabilities of big-iron databases, but without the hassle of getting write access to an existing server or setting one up from scratch.

And I know there are other similar file-based database offerings such as MS Access and the Java-only HSQLDB, but neither matches SQLite’s ubiquity, sheer simplicity and powerful data processing ability.

xlAWS – 100,000 downloads?

Not sure, but this morning I received my monthly AWS bill, and it was double its usual amount! When I investigated, the extra cost was due to 133GB of downloads from my www2.gobansaor.com bucket. This is the S3 bucket in which I store the xlAWS zip file, xlAWS being a “library-of-sorts” of VBA/VB6 helper code for accessing Amazon S3 and SimpleDB.

It’s linked to from this page on my blog (which has had 200 or so hits this month) and from this AWS Community Code page. The excessive hits on the bucket started on the 28th of Feb, the day the xlAWS code was published on Amazon, and continued through most of March. Taking the size of the zip file, 133GB represents approximately 100,000 downloads. I don’t have server logging enabled on the bucket, so I can’t be sure how much is due to the other public files in the bucket (all belonging to the VBA/Proto SQLite xLite project), but as that project has been available for months and is accessible only through my website (whose stats show a consistent 5-10 downloads per week) I’m guessing the downloads are for xlAWS.
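The post doesn't state the size of the zip file, but the two figures it does give pin it down. Working backwards (a rough check only, and assuming decimal units where 1 GB = 1000 MB):

```python
total_transfer_gb = 133       # extra download volume on the bill
estimated_downloads = 100_000  # the post's own estimate

# Implied size of a single download, in MB (decimal units assumed).
implied_zip_mb = total_transfer_gb * 1000 / estimated_downloads

print(f"implied zip size: {implied_zip_mb:.2f} MB")
```

So the estimate is consistent with a zip file of a little over 1 MB.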

Who would have thought that there would be such interest in VBA/VB6 code for accessing AWS services! I wonder was it the Excel VBA side of the house or the dispossessed (and p*ssed off) VB6 developer hordes who downloaded it the most? Leave a comment if you downloaded and used the library, I’d love to know.

Postgres Plus Cloud Edition is boring …

… and that’s good. That’s how I like my databases, boring, reliable, consistent, easy to use.

SimpleDB, on the other hand, is not boring. It’s an exciting new shiny thing that opens up a myriad of new possibilities; but first I, and the rest of the developer community, need to tool up, cast aside some of our cherished database design patterns (oh, like 3rd normal form, strong typing, joins, nothing major) and embrace a slightly different way of thinking. However, as much as I like a challenge, I also like to get things done.

That’s where EnterpriseDB’s new Postgres Plus Cloud Edition comes in. This is an Amazon EC2/S3 hosted edition of their Oracle-compatible PostgreSQL-based product that offers the scalability of SimpleDB but the familiarity of a traditional relational database. The “magic” is supplied by Elastra, who are also offering the same functionality against MySQL and standard PostgreSQL databases.

A Talend ETL job which I had been developing for a client, had been tested against a “normal” EnterpriseDB instance. This ETL job was part of a BI prototype trialling a Postgres Plus Cloud Edition (the new name for EnterpriseDB’s cloud offering) as the back-end database. So, I exported the job as a Java executable, fired up an EC2 instance, copied up the generated JAR files, changed the database’s hostname to that of the Postgres Plus “cloud” database, ran the ETL job and it worked. As I said, boring, nothing to report, it just worked.

Now you may be wondering what’s so special about these Elastra powered databases, surely EC2 is no different from any other Linux virtual machine, why not simply install a standard database? The problem with EC2, and it is a problem to those of us (i.e. practically every IT pro on the planet) who have come to expect highly reliable RAID backed disk storage, is the non-permanence of its disk systems.

When an EC2 instance is powered down or fails, the disk system is wiped!

That, combined with fixed (if generous) disk sizes (160GB, 850GB or 1690GB), means that often a clustered database environment is a necessity, adding considerably to the complexity. It’s this sort of complexity that SimpleDB and Elastra address.

The obvious use-case for both Elastra and SimpleDB is as data stores for OLTP applications, but Elastra’s ability to handle S3-backed massive databases means the possibility of using EC2 as a data warehousing platform is also considerably strengthened. Although not obvious at first glance, SimpleDB could also act as an OLAP data store: SimpleDB’s massively indexed tuples as “sparse dimensions” pointing to S3 objects (SQLite databases?) that hold the fact data combined with dense/“partitioning” dimensions (e.g. Time). Possible? Yes. Fun to do? Yes. A solution that I can apply tomorrow? No, that’s why I’m glad EnterpriseDB and Elastra are delivering such a boring product!

UPDATE EC2:

The other big EC2 omission, non-permanent IP addresses, has at last been addressed. EC2 now offers “Elastic IP Addresses”, addresses associated with an account not an instance. If the instance fails or is shut down, the IP address can either be immediately re-assigned to a new instance (no more waiting for Dynamic DNS propagation) or “reserved” for future use at a cost of USD 0.01 per hour. Also, the new “multiple locations” facility puts the API changes in place to allow for location selection, hopefully a sign that we here in Europe will have “local” EC2 instances to match our European S3 buckets!

UPDATE EnterpriseDB:

It looks like IBM have invested in EnterpriseDB, possibly as a counter-weight against Sun’s acquisition of MySQL (EnterpriseDB’s targeting of Oracle’s customer base would also be an added benefit!).

xlAWS – Excel VBA Code for accessing Amazon’s S3 and SimpleDB

I’ve been using Amazon’s S3 service from within Excel for some time now and, as there are no libraries or examples for calling AWS services from VBA (or VB6), I had to roll my own. As with most things Excel, getting the job done always triumphs over elegance and industrial-strength implementations; in other words, it was all a bit of a “dog’s dinner”. To remedy this and to share my experience of using S3 from within a VBA/VB6 environment, I decided to re-factor the code and to assemble it into a more re-usable form; the end result is xlAWS.

It was going to be called xlS3, but while doing the exercise SimpleDB appeared on the scene, so I decided to try accessing it from Excel, particularly as both products have a lot in common; both “simple”, both “schema-less” data stores. Like the S3Helper code, the simpleDBHelper module is less of a comprehensive library, more a collection of useful functions which (hopefully) make working with AWS a bit easier.

To use this code library, you’ll need to have a good grasp of the S3 and SimpleDB APIs and be reasonably proficient with VBA. This is not an end-user tool, it’s for VBA (or VB6) developers. There’s a README and some basic examples within the Excel VBA project to help you get started. Code is released “in the spirit” of LGPL, you can use it how you wish, but if you add something new to the “library” (or find/fix a bug) do let the rest of us know.

As I’ve not been able to find a pure VBA implementation of the HMAC-SHA1 hash algorithm (and I couldn’t see an implementation within the standard “Microsoft Enhanced Cryptographic Provider” ) I’ve wrapped the open source XySSL SHA1 HMAC C code in a VBA friendly DLL. This DLL (and the source, under LGPL) is included in the zip file as AWS authentication requires SHA1 HMAC signatures.
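For readers unfamiliar with what the DLL has to compute, here is the same calculation sketched in Python rather than VBA, using only the standard library. It follows the S3 REST signing scheme of the time (Base64 of an HMAC-SHA1 over a "string to sign"); the access key, secret key, bucket and object names are placeholders, and this is an illustration of the algorithm, not the xlAWS code itself.

```python
import base64
import hashlib
import hmac


def hmac_sha1_hex(key: bytes, message: bytes) -> str:
    """HMAC-SHA1 as a hex string, handy for checking against published vectors."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def sign_s3_request(secret_key: str, string_to_sign: str) -> str:
    """Base64-encoded HMAC-SHA1 of the string to sign, as the S3 REST API expected."""
    digest = hmac.new(
        secret_key.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


# Sanity check against RFC 2202, test case 2.
rfc_check = hmac_sha1_hex(b"Jefe", b"what do ya want for nothing?")

# String to sign for a simple GET: verb, Content-MD5, Content-Type, Date, resource.
# All values below are placeholders invented for this example.
string_to_sign = (
    "GET\n"
    "\n"
    "\n"
    "Tue, 27 Mar 2007 19:36:42 +0000\n"
    "/example-bucket/example.zip"
)
signature = sign_s3_request("EXAMPLE-SECRET-KEY", string_to_sign)
auth_header = "AWS EXAMPLE-ACCESS-KEY:" + signature

print(rfc_check)
print(auth_header)
```

Any VBA or DLL implementation can be validated the same way: feed it the RFC 2202 test inputs and compare the digest before pointing it at AWS.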

You’ll also obviously require an AWS account. Credentials are stored within the workbook’s custom properties and can be encrypted via a “key file” if required. If you intend to use this code within VB6 (or Proto) you’ll need to provide your own implementation of the AWSKeyData class in order to use a non-Excel persistence store.

You can download the project ZIP file from here.

Have fun.

UPDATE

Another alternative for calculating HMAC-SHA1 signatures in VBA/VB6 is a Google Checkout supplied COM DLL see http://bit.ly/9CIKtM

There’s the bones of a pure VBA HMAC-SHA1 implementation here http://www.eggheadcafe.com/software/aspnet/32187540/hmac-sha1-challenge.aspx

Dublin Bus and PALO ETL – the connection!

Dublin buses, as is the norm with most road-based public transport systems in our increasingly car-choked cities, tend to operate on the basis of “no sign of a bus for ages, then two or three arrive at the same time”. Palo MOLAP ETL options appear to be following the same pattern; we’ve been waiting for ETL support for ages and now we see three of them heading down the road towards us. There’s Palo’s own offering, then came Stratebi’s Kettle Plugin and now Talend Version 2.3.0RC2 is offering a Palo output component.

Mind you, the Talend offering is very basic and I’ve not managed to get the Stratebi plugin to work, leaving Palo’s ETL Server as the front runner at the moment (drill-through capability is a winner in my book).

I’ve also been busy re-factoring my VBA SQLite and Amazon S3 code with the intention of publishing them as an Excel-based micro-ETL platform. While cleaning up the Amazon AWS modules I’ve been playing with SimpleDB. I’m impressed; Excel combined with SimpleDB rocks!

I’ve also wrapped the open source XySSL SHA1 HMAC C code in a VBA-friendly DLL, as searching for a VBA HMAC-SHA1 hash implementation (essential for Amazon AWS access) has proved fruitless.

I hope to release the lot at the end of next month.

UPDATE:

Thanks to Javier and Jorge from Stratebi I’ve managed to get the new Kettle Palo plugin to work. It seems that the TEST facility in the Kettle database connection dialogue throws an exception for Palo connections but the connections work fine in the actual Palo input/output steps. Did a quick test and it looks very easy to use and fits in well with the Kettle “way of doing things”.

CouchDB = IBM’s SimpleDB and S3 ?

What if you’re a major player in the IT world and suddenly the internet’s equivalent of your local bookshop releases a mould-breaking cloud-based database service, SimpleDB? This is on top of Amazon’s highly acclaimed document data store service, S3!

Well, if you’re IBM you hire Damien Katz, the person behind CouchDB. I think 2008 could be the year that cloud-based database services really take off.

SimpleDB + S3 = distributed document-centric database

I’m a database man. I’ve worked on or about most variations on the theme, from roll-your-own flat files, to hierarchical, to CODASYL network databases, to the current crop of relational and MOLAP platforms. Of late, I’ve been investigating what I think will be the future of database technology, the distributed document-centric database. Today, the future arrived in the form of Amazon’s new SimpleDB service.

Up until now Amazon’s S3 service offered one half of the future platform, the “distributed document-centric” bit, but it lacked the indexed structure part to make it a true database; in combination with SimpleDB it’s now complete.

SimpleDB stores data in a Domain/Attribute schema-less and type-less structure having more in common with a spreadsheet than a traditional relational table. If you’ve worked with the likes of SQLite (manifest typing) or Excel (no predefined schema and manifest typing) then you’ll appreciate this is no hardship, quite the opposite in fact (I find the strong typing nature of most databases a real pain having worked recently on a SQLite combined with Excel project).
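A toy model helps show what "schema-less and type-less" means in practice. The class below is an in-memory stand-in written for illustration only; its name and methods are invented and it is not the SimpleDB API. It captures three traits: items in the same domain need not share attributes, an attribute can hold several values, and every value is text (so numbers are zero-padded to make comparisons sort correctly).

```python
from collections import defaultdict


class ToyDomain:
    """Toy in-memory stand-in for a schema-less domain; not the SimpleDB API."""

    def __init__(self):
        # item name -> attribute name -> set of string values
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item_name, **attributes):
        """Add values to an item; existing values for an attribute are kept."""
        for name, value in attributes.items():
            self.items[item_name][name].add(str(value))

    def select(self, name, predicate):
        """Names of items where any value of attribute `name` satisfies predicate."""
        return sorted(
            item
            for item, attrs in self.items.items()
            if any(predicate(v) for v in attrs.get(name, ()))
        )


domain = ToyDomain()
domain.put("item1", colour="red", size="0042")
domain.put("item1", colour="blue")              # second value for the same attribute
domain.put("item2", size="0007", owner="tom")   # different attributes, same domain

multi = sorted(domain.items["item1"]["colour"])
# Values are text, so comparisons are lexicographic: zero-padding keeps them sane.
big = domain.select("size", lambda v: v >= "0010")

print(multi, big)
```

Anyone who has kept a loosely structured list in a spreadsheet will recognise the shape.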

The distributed nature of SimpleDB may however pose some difficulty to those of us (i.e. almost everybody) raised in the world of ACID compliant databases. Because of the Brewer’s Conjecture effect, SimpleDB sacrifices consistency for availability and partition tolerance i.e. when you write something to the database, an immediate query may not return the updated value, subsequent queries will eventually return the new data, exactly when depends on the load and the availability of resources. Those of you already using S3 will already be living with this “feature”, and in practice you rarely notice it (most updates seem to appear immediately) but it will still pose design challenges to handle the edge cases.
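The read-after-write edge case can be simulated without any network at all. The sketch below is a deliberately simplified model: `LaggyStore` is an invented class in which a write only becomes visible after a fixed number of reads, standing in for replication delay. The usual client-side defence is to re-read until the expected value shows up (real code would pause or back off between attempts and give up after a deadline).

```python
class LaggyStore:
    """Toy store where a write becomes visible only after `lag_reads` stale reads."""

    def __init__(self, lag_reads):
        self.lag_reads = lag_reads
        self.visible = {}   # what readers can currently see
        self.pending = {}   # key -> [value, stale reads remaining]

    def write(self, key, value):
        self.pending[key] = [value, self.lag_reads]

    def read(self, key):
        if key in self.pending:
            entry = self.pending[key]
            if entry[1] == 0:
                # "Replication" has caught up: the write becomes visible.
                self.visible[key] = entry[0]
                del self.pending[key]
            else:
                entry[1] -= 1
        return self.visible.get(key)


def read_until(store, key, expected, max_attempts):
    """Re-read until the expected value appears; return the attempt count or None."""
    for attempt in range(1, max_attempts + 1):
        if store.read(key) == expected:
            return attempt
    return None


store = LaggyStore(lag_reads=2)
store.write("balance", "100")

first = store.read("balance")   # immediately after the write: still stale
attempts = read_until(store, "balance", "100", max_attempts=5)

print(first, attempts)
```

Designing for this means deciding, per use case, whether a stale read is harmless or whether the caller must wait for confirmation.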

The service is still in limited Beta, but the documentation is available and if you’ve already used any other AWS product you’ll immediately feel at home. The pricing is again based on usage; the cost of storage is much higher than S3, being $1.50 per GB-month, but a GB of structured data is an awful lot of data (and the larger document-style storage would be provided by S3).

If you’ve not yet tried out either S3 or EC2, now might be a good time to start. Cloud computing has come down to earth, all thanks to an online book store, Amazon!