Category Archives: news

Python the new VBA ?

These last two weeks, Python has been on my mind. First off, last week I decided to make time to fully investigate Picalo, an open-source Python-based data analysis tool, and then, this week, Google announced their long awaited cloud-computing offering, Google Apps Engine, with the language at its core.

Python was the first of the “LAMP generation” scripting languages that I decided to learn in any detail ( I had used Perl before that but only on a per-task basis (similar to how I’d used AWK)). I then invested time in learning PHP, then Ruby and finally JavaScript. And here I am, back where I started, with Python.

But it’s not the same Python I learned three years ago, not that it has changed that much, but my appreciation of the language has, largely due to my deep dives into other languages. For example, JavaScript’s treatment of functions as first-class objects, highlighted the same functionality in Python, something I’d missed (or rather, not fully understood) the first time I encountered the language. Likewise, Ruby’s RoR introduced me to a “best of breed” approach to web application design, something that can be used as a comparison aid when approaching new web frameworks such as Django.

But of course the scripting language that continues to power most of my datasmithing activities is Excel VBA. That’s why I was so excited to see a tool such as Proto utilise VBA as its scripting language. But, Microsoft has abandoned VBA, there will be no more Protos.

Also, Excel VBA is now a Windows only language. Windows, however, is no longer the ‘only’ business client OS (see how many Apple laptops you can spot the next time you’re in a business-class airport lounge, a few years ago it would have been zero, not any more), and is currently nowhere to be seen as a cloud computing platform (but that’ll change).

I’m at heart a table-oriented programmer, and I, like Picalo’s author Conan Albrecht, believe “data analysis is best done through scripting”; but not just data analysis, the T in ETL (Extract, Transform and Load) and the I in DI (Data Integration) and SI (Systems Interfacing) also benefit from a scripting approach.

So, what to adopt as a successor/companion-in-her-old-age to VBA, will it be Ruby, JavaScript, Python, Perl, even PHP?

It looks like it’ll be Python because it’s …

The runner up is of course Ruby, but its poor integration with Windows is a major problem and the datasmithing “prior art” of Picalo and Resolver makes Python hard to beat.

UPDATE Jan 2010:

To experience the best of both worlds, VBA & Python, my xLite (Excel combined with SQLite) datasmithing platform now allows Python to be used in conjunction with VBA.  Check it out here

 

UPDATE: July 2011:

For another method of integrating Python (this time .NET’s IronPython) with Excel/VBA see http://blog.gobansaor.com/2011/07/18/vba-multithreading-net-integration-via-hammer/
UPDATE:

Also, as Dan pointed out in the comments below, I’d not included Jython in my list of reasons for embracing Python. I must add it to my list of things to try out particularly as both my “classic” ETL tools, Talend and Kettle are JVM based.

Another thing to add to the (ever growing) list is Mike Pitarro’s SnapLogic python-based ETL tool. They have …

…just released a 2.0 Beta version with some major architectural enhancements. The SnapLogic model is very different from traditional ETL systems. It takes an approach that’s more like the web, based on loose coupling and HTTP interactions. We model data source, sinks, and transformations as URI addressable endpoints, and have a model where than can be chained together in pipelines to build transformation logic. We use a plugin architecture to make it easy to add custom components.

SimpleDB + S3 = distributed document-centric database

I’m a database man. I’ve worked on or about most variations on the theme, from roll-your-own flat files, to hierarchical, to CODASYL network databases, to the current crop of relational and MOLAP platforms. Of late, I’ve being investigating what I think will be the future of database technology, the distributed document-centric database. Today, the future arrived in the form of Amazon’s new SimpleDB service.

Up until now Amazon’s S3 service offered one half of the future platform the “distributed document-centric” bit but it lacked the indexed structure part to make it a true database; but in combination with SimpleDB it’s now complete.

SimpleDB stores data in a Domain/Attribute schema-less and type-less structure having more in common with a spreadsheet than a traditional relational table. If you’ve worked with the likes of SQLite (manifest typing) or Excel (no predefined schema and manifest typing) then you’ll appreciate this is no hardship, quite the opposite in fact (I find the strong typing nature of most databases a real pain having worked recently on a SQLite combined with Excel project).

The distributed nature of SimpleDB may however pose some difficulty to those of us (i.e. almost everybody) raised in the world of ACID compliant databases. Because of the Brewer’s Conjecture effect, SimpleDB sacrifices consistency for availability and partition tolerance i.e. when you write something to the database, an immediate query may not return the updated value, subsequent queries will eventually return the new data, exactly when depends on the load and the availability of resources. Those of you already using S3 will already be living with this “feature”, and in practice you rarely notice it (most updates seem to appear immediately) but it will still pose design challenges to handle the edge cases.

The service is still in limited Beta, but the documentation is available and if you already used any other AWS product you’ll immediately feel at home. The pricing is again based on usage, the cost of storage is much higher than S3, being $1.50 per GB-month, but a GB of structured data is an awful lot of data (and the larger document style storage would be provided by S3).

If you’ve not yet tried out either S3 or EC2, now might be a good time to start, cloud computing has come down to earth, all thanks to an online book store, Amazon!

Zimki – the spirt lives on …

Although Zimki is to shut down on Christmas Eve, the ideas behind the service live on. Two new offerings, Horuku and AppJet, offer variations on the idea of hosted application development/deployment.

AppJet, funded by Paul Graham‘s Y-Combinator, is very similar to Zimki, being a server-side JavaScript platform. No details yet as to what sort of paid options will be offered (all accounts are free at the moment). Unlike Zimki there’s no plans to create an open-source version. I like the easy “build a Facebook app” feature; and I guess this is the sort of light-weight applications that they hope to attract.

Although Heroku uses Ruby-on-Rails technology, rather than JavaScript, it is closer to the original Zimki idea; but rather than take the hard (and ultimately unsuccessful in Zimki’s case) road of building an open-source platform from scratch, Heroku takes an already popular open-source project and offers it wrapped in a full on-line development and deployment environment. Again, being in beta, there’s no indication as to what pricing model it will operate under, but I would think that it will attract more “serious” projects than AppJet since anything developed under Heroku is pure Rails which means it can be migrated to any other Rails hosting environment; so no lock-in. The online editor is excellent and whatever about its merits as a hosting service it’s by far the easiest way to learn and explore Ruby and Rails, even easier than this…

If Facebook apps are your goal but you wish to use Ruby rather than AppJet’s JavaScript then not to panic, as being Ruby some bright young spark (no, not me I’m afraid) will already have done a lot of the hard graft for you…

You say 100,000 I say 65,535! Let’s call the whole thing off!

According to this Google groups thread, Excel 2007 has a serious bug. Certain calculations (e.g. =850*77.1) that should yield 65535 are being rendered by Excel 2007 as 100,000. Brilliant, bloody brilliant!

I’ve been a fan of 2007 especially the new table handling features and the ability to handle more than 65536 rows, these are particularly useful for someone like myself who uses Excel as an ETL and data cleansing tool. Unlike many others, the new ribbon UI doesn’t bother me, it’s a slight annoyance, but within a day or so I’d mastered it. In fact, my wife who’s trained first time Excel users in both 2003 and 2007 reported back that novices found the new UI much easier to master.

But returning the wrong answer! “Well that beats Banagher”. I’ll not be recommending any client of mine to upgrade to Office 2007, until this is fixed.

If you don’t have a copy of Excel 2007, you can try it out using the Office 2007 online “test drive”, bugs and all.

Below is a screen shot of the bug in action using the online test drive version. Column A is set to =850*77.1, Column B uses =5.1*12850 and Column C is set to =100000. The first two should yield 65535 but all three display as 100,000! If you SUM() column A you get 196605 (see A7, which is the correct answer for 65535+65535+65535). But if you AVG() the affected cells you get 100,000, see B7. Also, see the various results when an affected cell is used in other calculations (E5 to H5).

Nirvanix targets Amazon S3 shortcomings

Let there be no doubt about it, Amazon’s S3 online storage system is wonderful; it’s secure (both from an technology point of view and from Amazon’s status as one of the web’s most trusted sites i.e. one you wouldn’t worry about giving your credit card to), it’s cheap, it’s pay-as-you-go and it has first mover advantage, but (there’s always a but) it has until now lacked competition. And because it lacked competition the various shortcomings (such as no support for HTTP POST file upload, no SLAs etc.) that S3 users complain about are handled by Amazon in what can best be described as ..

..we hear what you’re saying, we have it on a list; no, we’ll not tell if/when we’ll remedy this problem (or explain why it’s not possible to do so); and anyway if you don’t like it, who else provides anything comparable?

Okay, I’m being unfair here, I’m sure Amazon has very good reasons for how they do things and scalability and “keeping it simple” seem to be their development mantra; and this is a good thing for an online 24/7 storage infrastructure. But, as in all things in life, competition would help not just disillusioned users by offering another comparable service but would help Amazon prioritise items on its S3 roadmap.

Most would have assumed that when that competitor arrived it would either be Google or Microsoft, instead the first up to bat is Nirvanix, a San Diego startup which appears to be associated with another online storage player, MediaMax. Pricing is similar to S3, but with the option of purchasing extra SLA backed support packages, something that has been top of the list for many actual and potential S3 users. Other “missings” that Nirvanix addresses are;

  • File upload via HTTP POST, S3 restricts upload to HTTP PUTs which requires the use of a proxy server or the installation of client software.
  • File rename and move, S3 requires that a file is first deleted and then reloaded.
  • In-built support for media processing such as image resize/rotate for thumbnails.
  • Multi-tenant accounts, each S3 account supports only a single ‘user view’.
  • Files are indexed via tags and name, not just by name as is the case with S3.
  • Granular control of usage limits and reporting, S3 only offers ‘after-the-fact’ reporting.
  • Maximum file size of 256Gb compared to Amazon’s 5Gb.

The Nirvanix authentication method uses a much simpler and more traditional username/password over SLL approach than S3′s key-pair based URL signing method. This can be seen as either a weakness or a strength, but combined with Nirvanix’s support for POST file uploads, multi-tenant accounts and granular usage controls it makes building browser based clients much simpler.

S3′s industrial grade authentication is all fine and dandy but if the key becomes compromised, all’s lost, you could expose not just your data but your wallet if somebody used the compromised key to maliciously upload Terabytes of data. This single point of failure is perhaps my main complaint of S3′s current set-up.

So, am I getting ready to jump ship, no, at least not yet, as;

  • Amazon is still Amazon, they may be lacking SLAs but they have my trust.
  • S3′s role as a back-end to Amazon Ec2.
  • Friendly and effective forums offering excellent support provided by both the developer community and Amazon’s own staff.
  • CNAME support. (e.g. http://www2.gobansaor.com/)
  • Did I mention Ec2?

Should Amazon be worried? No, this is not a zero-sum game, in fact competition will help grow awareness and expand the market for all “cloud” based services.