Category Archives: cloud

S3 as an Excel hub

In my Steam Powered PowerPivot post I explained the concept of powered-spoke based Excel systems. One of the hub options was Amazon’s Simple Storage Service, S3. I’ve been using S3 both as a simple backup service and as a “systems hub” since the service began in 2006. Back then, it was not necessarily the easiest to use or the most suitable, but it was cheap and reliable. The code I used to access S3 from Excel can be found here http://blog.gobansaor.com/2008/02/22/xlaws-excel-vba-code-for-accessing-amazons-s3-and-simpledb/

Over the years, lots of new features have been added to S3 that have made the service more useful and easier. The final big missing, for me at least, was the lack of multiple  authorisation credentials  for a single account i.e. the ability to set-up multiple users with their own access policies. Last year, with the launch of IAM (Identity & Access Management) Service, that last major problem was removed.

Although currently IAM is accessible only by the API that’s not a problem as my favourite tool for managing S3 (and AWS in general) is Python’s boto package, and it has that, and everything else you might wish to do S3 -wise, covered.

Although Boto does most of my heavy lifting (particularly on the server-side), I still have need for a VBA based S3 utility. Boto is fine for situations where I’m using microETL with Python enabled, but in many situations this is not ideal or even possible, particularly on the client-side of systems. Here, even a single file add-in may not be possible which often means a macro-enabled workbook is the only option.

MicroETL has a noSQL compile option which enables a lot of the code to operate without the SQLite libraries, thus allowing it to be used as a single-file VBA add-in or embedded in a workbook.

Up until now my S3 code depended on the xliteSha1Hmac.dll to provide SHA1-HMAC hashing (necessary to authorise access to S3) meaning that only pre-signed S3 URLS could be used in single-file deliverables. Even if this was not the case, prior to IAM, pre-signed URLs (usually with time limits) were often the only option as sharing an S3 account’s single authorisation credentials with all and sundry was not ideal from a security point of view. But post IAM, being able to set-up sub-accounts with specific access policies means the full power of S3 can now be used at client-level.

I’m in the process of adding a pure VBA HMAC-SHA1 facility (32 bit only at the moment, 64 bit continues to require the xliteSha1Hmac.dll). I’m also adding a JSON encode/decode module to allow tables to be more easily shared with non-Excel programs, currently I default to using tables encoded using ADO’s export XML schema to share via S3, but the JSON option will make it easier to share with a pure Python server (such as a PiClound based service, see my Expand Excel’s Horizons – look to the cloud post).

Cloud-shared tables can not only pass datasets back and forth but can pass SQLScripts (which are, after all, simply tables) for processing elsewhere.

For example, client workbooks not having Python  (or SQLite ) functionality installed could write a script, post it to S3; that script table could then be picked up by a “server” microETL workbook (or by a pure Python server), executed, and the results posted back to S3 to be picked up by the client (possibly waiting a response using a TIMER).

Server processes could also do the opposite; passing down SQLScripts for client-side execution.

When people think of utilising the cloud, this sort of scenario is not what comes to mind, thinking that Excel is a desktop product so it cannot take advantage of the enormous cost-advantages of cloud-based solutions. Not so, it’s now very easy to integrate the likes of S3 into existing processes, adding that cloud magic but keeping those existing works-so-don’t-fix-it processes firmly on the ground.

If you need help figuring out how to take advantage of S3 (or other AWS services) and how to integrate them with your existing technology base do contact me. I’ve been working with this technology for over 5 years (and have 30 odd years of experience with good-old reliable “ground-based systems”,

What’s with the bucket picture? S3 calls its primary data storage unit a bucket.

UPDATE:

IAM is now available from the AWS Management Console http://aws.typepad.com/aws/2011/05/identity-and-access-management-console-support.html


Expand Excel’s horizons – look to the cloud

http://www.flickr.com/photos/donabelandewen/470780785/I’ve added a new facility to microETL’s SQLScript  – the TIMER command.

 

 

 

TIMER takes up to 4 arguments:

  • 1st, the SQLScript to call.
  • 2nd, an optional SQLScript to call when the primary script finishes or is cancelled.
  • 3rd, the number of  milliseconds secondsto wait between each call to the primary script, defaults to 1.
  • 4th, the mode of the controlling user form; MODELESS (the default) users may continue to navigate about the workbook or MODAL, the user must exit the form before any access is allowed to the workbook. Note: in MODELESS mode, if the user attempts to change any workbook cells, the TIMER (and any other TIMERs running) will stop.

So how to demonstrate TIMER in action?

I thought I’d set my sights to a further horizon, leave the humble laptop behind; I’d go all modern and use the cloud.

A conversation with Giles Thomas of  Resovler One and now Dirigible fame (the first a desktop Python-power spreadsheet, the other a cloud-based one), lead to my discovery of PiCloud. This is an Amazon EC2 based service that allows the execution of Python code remotely; handling all the messy business of provisioning of servers and marshalling of code and results.

It is not only easier than using EC2 directly but can also be much cheaper to use, as PiCloud charge by the second (rather than EC2s “use any portion of an hour, pay for the full hour” payment terms).

I thought I’d give it a go.

I used the same data and web site as Daniel Ferry’s VBA ‘threading’  example; a list of properties to check against the realestateabc.com “What’s your home worth?” site; returning various bits of information such as estimated value, size, last sale date and price achieved etc.

Being Python I used the wonderful Beautiful Soup package to extract the needed data from the site’s HTML. I also submitted a separate request for each property, but if this were for real, I would most likely submit fewer jobs to handle multiple properties as most of the cost of each job in this example was probably due to set-up and tear down rather than the url fetching and parsing. Not to mention the time to submit 100 jobs in the first place.

Above you can see the two SQLScripts involved. The LaunchCloud script submits the jobs and makes a call to TIMER to run the CloudChecker script every second. The CloudChecker script checks for results from PiCloud and pastes them back to the source table. When no more results due, it issues a STOP command which forces the TIMER to stop.

Above is the pyGetDetails script which is passed to PiCloud (via the cloud.call() statement) to do the actual fetching and parsing of the URL associated with each property.

Download latest version of microETL here. The workbook is called PiCloud (2007/2010 format), requires Python 2.7 and a subscription to PiCloud.

LightSwitch & Hobo – the return of the 4GL?

Those of us of a certain age have fond memories of the golden era of 4GLs. These simple, but at the time revolutionary, tools enabled business-aware programmers (usually termed analyst-programmers) to quickly build & deploy line-of-business apps. They (both tools and devs) were primarily data-driven, data begat screens, screens begat more data and so on. The resulting apps where server delivered, using either green-screen terminals or client-side delivery apps (a bit like current-day client-side RIAs). The resulting UIs could best be described as “plain but functional”.

These halcyon days were replaced by a combination of  2-tier Windows apps and by the 3-tier enterprise platforms. The increasing sophistication & complexity of the new platforms forced programmers to become highly specialised, often losing their once close links with the business, and even losing sight of the value of business data as a resource in itself (still amazes me when I come across business application programmers with poor or non-existent SQL/RDBMS skills). The analyst-programmer (AP) was no more.

Many of those APs, like myself, managed to stay close to the business by either becoming business/data analysts or datawarehousing/BI specialists. ERP platforms such as SAP, with their complex configuration requirements, also created a welcoming home for business-focused IT refugees.

The need for quick’n'dirty line-of-business apps has not disappeared, but this service is now often being provided by tools such as MS Access and above all, by Excel. This is both good and bad; the good is the expansion of development skills outside of IT; the bad is the effective dis-arming of a large proportion of professional IT folks. To paraphrase, Marvin the Paranoid Android: “Brain the size of a planet, and all they give me is Excel!”

Most of us managed to make the best of the situation, learning to respect, if not love, Excel; becoming SQL wizards; MDX magicians and dimensional modellers par excellence. But the call of the 4GLs remained and we veterans continue to keep a watchful eye for something that will match and hopefully surpass them.

Oracle’s Application Express (aka HTML DB) offered those with Oracle skills hope (very similar in concept to SQL*Forms V3) and the open source tool WaveMaker is also excellent, styling itself, with good reason, the Powerbuilder for Web Enterprise.

In the last month I’ve come across two modern-day descendants of  these 4GL data-driven tools. Hobo and LightSwitch.

Hobo, is an open source extension to  the Ruby on Rails platform. It uses the same MVC architecture but takes the concept further in not just allowing you to build a relationship model for your data, but also enabling the  easy specification of  a lifecycle model and, here’s the biggy, automatically building (and re-building on change) a very respectable UI to present to the outside world. It also offers a starter authentication framework and lots of other useful helpers. All without writing a line of Ruby code, or having any idea of how RoR works!  The end result is a tool that can not only quickly, and iteratively,  build CRUD type applications, but can also handle simple workflow apps out of the box.

LightSwitch, is similar but yet very different. The same, in that it follows the traditional 4GL data to screen approach; presenting the user with a graphical tool to build data tables, to link to existing data sources and to create relationships between entities. Screens can then be generated that, like Hobo, are professional looking and easy to use. Again like Hobo, new ‘skins’ can be applied to change the look and feel.

If a more sophisticated solution is required, code can be added (VB.NET or C#) at predefined events and indeed the resulting project, being fully VS 2010 compliant, can be opened in VS Professional and built out from there. (A similar ability to get under the bonnet exists for Hobo as it is essentially Rails).

Where LightSwitch differs is the deployment methods used. The end result is a Sliverlight app which can either run client-side (with full access to the client’s environment e.g. interact with Office etc.) or as a sandboxed IIS browser app or via the Azure cloud. Same code, same project can easily migrate back & forth between all three options. (Hobo, being a Rails app could also be sort-of-localised using RubyScript2Exe, but it could very easily be cloud deployed using the EC2-based, dead-simple to use, http://heroku.com/)

The LightSwitch data modeller also allows for relationships between local databases, network databases, SQL Azure cloud databases and web service datasets to be built and maintained within the application. The need for mashups between local & central/remote data is a constant requirement for LOB developers and LightSwitch appears to made it very easy to implement.

This data mash-up ability and the option to interact with the client will be a major attraction, at least for corporate devs working largely with MS tools. I say will, as alas the current beta is  “molasses-in-January” slow. I thought initially it was just my 5yr laptop hitting the wall, but others with more modern & powerful hardware also found it so.

So do tools like Hobo and LightSwitch herald the return of the IT analyst/programmer? Probably not, different times; outsourcing, SaaS and packaged software have and will continue to reduce the number of business-facing IT staff. But their places are been taken by IT-aware business folks, citizen programmers, creators of time-assets and it is they that will likely be the beneficiaries of such tools.

LiteBI, Heavy ETL

Although my major BI interest is in micro-BI (or is that  workgroup-BI?)  i.e. data, perhaps cleansed and packaged elsewhere, available locally on a datasmith’s PC,with most likely an in-memory OLAP as the analysis tool; the possibilities of the “cloud” as a BI platform have not escaped me.

From a micro-BI perspective, the ability to act as a backup/mirroring tool or as ETL/marshaling tool (anybody for Hadoop and SQLite?) attracts. I’ve yet to make up my mind on BI delivered as a cloud PaaS but obviously many others believe it has a future.

My main worry with PaaS is not lock-in (which exists equally for in-house proprietary solutions) but the dangers of a Coghead-like lock-out.  My other doubts are more technical; believing, as I do, that in-memory offers significant advantages over traditional ROLAP (simplicity been the main one) and multi-tenant in-memory architectures are not yet a runner.  But last week I had a demo of new Spanish BI PaaS service, LiteBI, which might just change my mind.

Javier Giménez Aznar and his team previously worked on delivering Pentaho based datawarehouses to large Spanish corporations and government agencies, so they have a deep understanding of Mondrian ROLAP and are using that knowledge to build the LiteBI service, but this time with SMBs as the target customers rather than corporates. Pricing starts at €145 per month and is based on number of concurrent users, number of analytical spaces and the data volumes, so it’s not for very small firms more for the Medium in SMB.

Impressions? The cube designer, dashboard builders and the general UI are all very good and I would think would appeal to end-user datasmiths and, as such, will be a major up-front aid to selling this product.  But it was LiteBIs approach to the thorny issue of ETL and data loading that impressed me and also helped ease some of my Coghead-induced-fears.

BI technology stacks consist of three elements:

  • The “fancy” front-end; graphs,animated dashboads and so on.
  • The pivot engine; ROLAP or MOLAP or both.
  • The ETL process.
  • (Many would say there’s an important 4th, the data-warehouse, but not every BI effort requires one, but that’s another issue)

LiteBI is continuing to build yet more functionality into their UI and this “fancy” front-end is essential as it’s their “shop window”.

Mondrian provides their pivot engine, and again they continue to work on optimisations such as column-based datastores to increase speed and automate responsiveness tuning (end-users are very unforgiving of slow pivots).

But it’s in the 3rd area, that of the ETL process, that you realise the LiteBI team has real-world BI experience.  Data is loaded into LiteBI via an API, but with the ETL process itself happening on the customer side.

“Well,so what?” you may ask. The extraction of data has to obviously happen customer-side (even though not in the case of data being sourced from the likes of SalesForce.com). Yes, but it’s the transformations and data cleansing that adds true value to the ETL process and subsequently determines the quality and usefulness (as opposed to the speed or the “prettiness” of delivery) of the solution.

Part of the process of adopting LiteBI, is an ETL consultancy stage where a LiteBI partner company will provide on-site services to build this ETL layer, handling not just transformations but initial load and automating the subsequent delta uploads.

So the cost mounts up, but in reality you can’t do BI without this investment; there’s no ETL magic bullet.  Even still, Javier says the typical go-live time for a LiteBI project would be in the order of 3-4 weeks rather than the 3-4 months of similar on-site Pentaho projects.

The end-user ‘owning’ the ETL process makes the prospect of a service lock-out slightly less worrying as, at least, one would still have a good starting point for moving to another provider or back in-house. What I would really like to see would be the option to self-host LiteBI, which I guess would involve open sourcing large parts of the service (the automated optimisation strategies could, for example, be excluded from this open source version).

The load API comes packaged as a plugin to Kettle (aka PDI) and the intention is to offer a similar add-on for Talend in the near future. LiteBI also offers a white-label offering whereby 3rd party OLTP solution providers can use the service as their product’s BI suite.

Like the Skibbereen Eagle keeping its eye on the Czar of Russia, I too will be keeping a watchful eye on LiteBI and the march of on-demand BI in general.

Why not join me on Twitter at gobansaor?

Why Larry hates the cloud, and my data trinity.

Last week Oracle certified Amazon EC2 as a supported platform, that same week Larry Elison attacked the concept of cloud computing as pure hype. Obviously, Larry is not happy with this whole cloud thing, and I think it’s not just the threat it poses to the software industry’s traditional licensing model that worries him, rather, as Robert X. Cringely points out in his “Cloud computing will change the way we look at databases” post, it’s the likelihood that it sounds the death-knell for large-scale traditional databases.

This new database paradigm is memory rather than disk centric, with the disk-based element acting as an archive/backup/restore mechanism which can easily be stored on commodity SAN devices ( e.g. Amazon’s ESB). Using MapReduce technology Google effectively holds the whole Internet in memory, not in one big super computer but in lots of cheap commodity servers.

But it’s not just in the realm of mega datasets that RAM based databases threaten traditional models. Excel is a memory-based database engine, so too in-memory OLAP tools such as Palo. Such products’ ability to handle large volumes of data has increased over the years, with the decrease in RAM costs and the appearance of cheap 64 bit machines (which are no longer limited to 2G/3G process working sets).

That doesn’t mean that we’ll throw away SQL databases in their entirety, SQL and the relational model will continue to be useful. But perhaps of greater use in local datastores/caches that as the building blocks for large scale datastores. For such local caches, less will be more; fewer features, easier to configure, more flexibility. That’s why I like SQLite; long after the dinosaurs of the database world have disappeared, I imagine SQLite databases will continue to survive, embedded in mobile phones, browsers, wherever a local datastore is required. And more than likely operating in memory rather than off disk.

By combining Excel with an in-memory SQLite database, linked to a Palo OLAP in-memory server, it’s possible to take advantage of three powerful data-processing technologies (spreadsheets, SQL, multi-dimensional cubes) all within your PC’s RAM. You could do serious datasmithing with such a combination on a pretty mediocre laptop, with most modern machines providing an excess of CPU power, no need for super fast disks, just as much memory as you can muster. And, with Windows on EC2, these three amigos will soon be capable of being used as a cloud bursting platform.

Excel, SQLite and Palo, my data trinity.