Category Archives: EC2

Windows on EC2 = SMEs on EC2

The announcement that Win2003 is now an option on EC2 is very significant. That, and EC2’s exit from beta status with an SLA in tow, means that AWS is now very much more appealing to the great unwashed, the SMEs, i.e. the businesses who form the backbone of most of our economies.

Large companies and start-ups are comfortable in the world of Linux servers but most small companies are Windows to the core.  This may not be “right”, this may not be how it “should be”, but it is so.   Even within large companies, departmental computing is largely a Windows only enclave, with MS Office (and Excel in particular) as the backbone and MS SQL Server as the database of choice (or is that, no choice).

The other interesting thing is that my fear that EC2 SQL Server Standard instances would be licensed as per Oracle has not come to pass (Oracle, while making a “big thing” of their recent EC2 cloud conversion, still insist on traditional licensing for EC2 database instances). SQL Server Standard is available on a pay-as-you-go model, brilliant!

Even if running Win2003 as a server doesn’t catch your fancy, suppose you would much rather replace your existing Windows laptop with a cool new Apple Mac but still need the ability to run Windows-only software. Why not use EC2 as your on-demand, pay-as-you-go Windows desktop replacement?  Simply configure a Windows AMI with your required software (you may have to use something like this, if software is only available on CD); you could then use Jungle Disk to easily share data (via S3) between your shiny new Mac and the AMI.  Power up and down as required; it’s easier than using VMWare or Parallels and, at 12.5c per hour, probably cheaper too.

Clouds no longer pass by Windows.

Amazon today announced that later this year, Windows Server would be available on EC2. No details on cost, licensing etc., but this is major.  Up until now, that portion of the business world who are pure MS shops (a very large percentage, especially amongst SMEs) were excluded from taking advantage of Amazon’s amazing (and getting more amazing every day) EC2 platform.

From my point of view, as with Oracle’s announcement last week, this releases yet more of my “legacy” skillset for deployment in the clouds. Although I’ve been involved with  *nix servers for 20 years or so, as corporate servers became more locked-down (and removed to the control of 3rd party data centres) I lost day-to-day experience of using them; in latter years my main ‘hands-on’ platform was Windows, either my own PC or local departmental NT servers. Windows on EC2 will allow me to use a whole new set of Windows only software (e.g. RSSBus or XLsgen) and of course SQLServer.

The lack of SQLServer on EC2 has been a major problem for me as a datasmith; there’s an awful lot of data out there sitting in SQLServer databases, but currently if I need to “cloud burst” such datasets I would have to first extract the data to, say, csv files and then load the data on to a Linux compatible database. But with a SQLServer instance running in the cloud, I could simply use SQLServer’s native backup/replication tools.  No more need to download data to my “ground-based” PCs resulting in quicker turnaround and fewer data security risks.

On the licensing front, I’m presuming that the OS licence will be on a pay-as-you-go basis, but what about SQLServer and other server products?  Will MS do an Oracle on it, i.e. require a traditional upfront use-it-or-lose-it payment, or will they go the radical (but I think inevitable) path of a licence-by-the-hour?

First RedHat, then Sun, then Oracle and now Microsoft; the mighty beasts of our industry have acknowledged there’s a new mighty beast on the prowl, dressed as a humble bookseller no less!

Amazon’s SAN in the cloud is a mirage…

This morning I got very excited.  While quickly scanning the headlines of the 1000+ unread feeds that had accumulated in my Google Reader this week, one heading in particular caught my attention, “Amazon Elastic Block Store goes live!”.

The post from the RightScale folks gave a detailed overview of the new Amazon ‘SAN storage in the cloud’ service, aka Elastic Block Store, aka EBS.  Alas, this particular cloud offering was a mirage: the post was subsequently removed (but can still be viewed on Robert Scoble’s Shared Items). It seems the post was a work-in-progress and not intended for publishing, yet!

Why was I so excited?  Amazon EC2 had two major shortcomings when it launched 2 or so years ago; the first, ephemeral IP addresses, was solved by the new Elastic IP feature; the second, ephemeral storage volumes (when you shut down an instance the disks are wiped!), is due to be solved by EBS.  With both of these problems solved, EC2, already near perfect, would be perfect.

The article does a good job of explaining the new service…

EBS starts out really simple: you create a volume from 1GB to 1TB in size and then you mount it on a device on an instance, format it, and off you go. Later you can detach it, let it sit for a while, and then reattach it to a different instance. You can also snapshot the volume at anytime to S3, and if you want to restore your snapshot you can create a fresh volume from the snapshot.

The thing that caught my eye in the above paragraph was the snapshot facility.  Snapshots are to be stored on S3 via an EC2-specific incremental-snapshot API.  This means the volumes will come with a built-in back-up facility. This is important as EBS drives reside in one availability zone (that of the instance that they are mounted against) and do not have the data replication security offered by S3.  It also means that disk systems can be restored quickly and simply from snapshots without the overhead  (and bugs!) of writing an S3 specific incremental backup and restore utility.
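To make the “incremental” part concrete, here is a toy sketch in Python. It is emphatically not Amazon’s implementation or API; the function names and the idea of a volume as a list of byte blocks are my own assumptions for illustration. Each snapshot stores only the blocks that changed since the previous one, and a fresh volume is rebuilt by replaying the snapshots oldest-first.

```python
def restore(snapshots, volume_size):
    """Rebuild a fresh volume by replaying snapshots oldest-first."""
    blocks = [b""] * volume_size
    for snapshot in snapshots:
        for index, data in snapshot.items():
            blocks[index] = data
    return blocks


def take_snapshot(volume_blocks, previous_snapshots):
    """Return a snapshot holding only the blocks that differ from the prior state."""
    baseline = restore(previous_snapshots, len(volume_blocks))
    return {
        index: block
        for index, block in enumerate(volume_blocks)
        if block != baseline[index]
    }


def demo():
    """Take a full snapshot, change one block, then take an incremental one."""
    volume = [b"boot", b"data-v1", b"logs"]
    snapshots = [take_snapshot(volume, [])]        # first snapshot: every block
    volume[1] = b"data-v2"                         # one block changes
    snapshots.append(take_snapshot(volume, snapshots))
    return volume, snapshots
```

The second snapshot in `demo()` holds a single block, which is the whole attraction: the backup cost follows the amount of change, not the size of the volume.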

Back to waiting…

UPDATE: 20th August

Wait over…

Talend + SQLite + Groovy the new Oracle …

… well, at least for me.  Let me explain.

For most of my datasmithing career, I’ve had access to corporate Oracle databases and now with the availability of Oracle10g Express I can even run my own Oracle instances at home or on EC2.  The combination of a powerful SQL engine, expressive scripting language (PL/SQL), OS independence, web front-end (App Express) and the ability to communicate with Excel (via OO4O) made Oracle a natural fit for heavy-duty data manipulation.   But there was always one major problem: Oracle doesn’t play well with other data sources, necessitating a separate ETL bolt-on, which led me to play around with the likes of Kettle and Talend.  But having been seduced by these shiny new (and open source) “toys” I’ve found that rather than just being incidental add-ons they had the potential to totally replace Oracle.  The combination of Talend, SQLite and Groovy is proving to be particularly magic.

So how will these three tools enable you to leave behind your Oracle past?

Talend (in its Java form) is a superb ETL tool: via JDBC it can access every database type on the planet, it has built-in web-service capability, and it has access to a multitude of APIs via its Java component for non-database data sources.  The addition of Groovy makes the use of such Java APIs simpler and quicker, and the same Groovy acts as a replacement for PL/SQL when a bit of “if-then-else” logic is required.  And although Talend offers a built-in option to publish an ETL job as a WAR file exposing a SOAP web service, Java/Groovy also allows for the integration of the powerful, yet simple, Jetty API to embed a web server within Talend itself.  And all this for free, and better than free, open source.

So where does SQLite come in? And, didn’t you say that Excel integration was important, how will Excel communicate with Talend?

As very little corporate data is held in SQLite format, and Talend allows access to every major commercial/free database, the usefulness of SQLite might not be at first obvious.  But if you think of SQLite as a data cache, a fast and efficient local tabular datastore, with a powerful but well understood DSL (i.e. SQL) and a drop-dead-simple setup and backup regime (basically copying and creating files), maybe then you can see its attraction. The ability to extend the DSL by easily creating SQLite user defined functions (UDFs) within Talend using either Java or Groovy is also another powerful feature.

For example…

select customer_id, name, customer, sales_region, getpalodata('SALES',customer_id,'All Products','Total Sales','Euros','YTD') as customer_YTD, getpalodata('SALES',sales_region,'All Products','Total Sales','Euros','YTD') as region_total_YTD from list_of_top_customers;

… where getpalodata is a UDF that wraps calls to a Palo cube.

With this type of setup I can easily mix and match list/tabular data with multidimensional data points using SQL (something that Oracle also supports but only if you hand over a large wad of currency). In fact I can create a mini data warehouse, with Palo providing the pivot (as SQLite lacks star-query, or even multi-index query, support).  SQLite would still host the conformed dimensions and the fact tables, but with the fact tables acting as feeds to Palo cubes and supporting finer-grained drill-throughs from cubes or ad-hoc queries. This is powerful stuff; simple, free, powerful stuff.
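For anyone who wants to try the UDF idea without Talend to hand, here is a minimal sketch using Python’s standard sqlite3 module rather than the Java/Groovy route described above. The `FAKE_CUBE` dictionary is a hypothetical stand-in for a real Palo cube, and the table contents are invented sample rows; only the function name and argument order follow the query above.

```python
import sqlite3

# Hypothetical stand-in for a Palo cube: cell coordinates -> value.
FAKE_CUBE = {
    ("SALES", "C001", "All Products", "Total Sales", "Euros", "YTD"): 1200.0,
    ("SALES", "C002", "All Products", "Total Sales", "Euros", "YTD"): 800.0,
    ("SALES", "West", "All Products", "Total Sales", "Euros", "YTD"): 2000.0,
}


def getpalodata(cube, element, product, measure, currency, period):
    """Return one cell value, or None (SQL NULL) if the cell is empty."""
    return FAKE_CUBE.get((cube, element, product, measure, currency, period))


def top_customers_with_cube_data():
    conn = sqlite3.connect(":memory:")
    # Register the Python function as a six-argument SQL function.
    conn.create_function("getpalodata", 6, getpalodata)
    conn.execute(
        "create table list_of_top_customers "
        "(customer_id text, name text, sales_region text)"
    )
    conn.executemany(
        "insert into list_of_top_customers values (?, ?, ?)",
        [("C001", "Acme", "West"), ("C002", "Bolt", "West")],
    )
    rows = conn.execute(
        """
        select customer_id, name,
               getpalodata('SALES', customer_id, 'All Products',
                           'Total Sales', 'Euros', 'YTD') as customer_YTD,
               getpalodata('SALES', sales_region, 'All Products',
                           'Total Sales', 'Euros', 'YTD') as region_total_YTD
        from list_of_top_customers
        order by customer_id
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling `top_customers_with_cube_data()` returns one row per customer, with the customer and region figures pulled from the pretend cube alongside the ordinary table columns.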

… and the spreadsheet access?

A Talend sub-job such as this…

Talend Groovy Jetty web server

Example of Groovy code calling Jetty API

…would provide a simple RESTful (rather than SOAP) web service which could be accessed either with an Excel Web Query or via a VBA macro which would parse the result and allow for more control.  For example …

http://localhost:1234/sqlgateway?sql=select customer_id,name from all_customers&type=HTMLTable

… this would return a list of customers wrapped in an HTML table, or …

http://localhost:1234/job/extractProspects?Rep=JonesTom&Month=JAN&SourceCompany=AXA&type=HTMLTable

…this might call a Talend job called extractProspects, passing in JonesTom, JAN and AXA as context parameters, which would then return a list of prospects extracted from a feed supplied by AXA’s system.
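As a rough idea of what sits behind the first of those URLs, here is a sketch of a SQL gateway using only the Python standard library in place of Groovy and Jetty. The `/sqlgateway` path, the `sql` parameter and port 1234 come from the URL above; everything else (function names, the demo table and its rows) is assumed for illustration. It ignores the `type` parameter and always returns an HTML table, and since it runs whatever SQL it is handed it is bound to localhost only.

```python
import sqlite3
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def make_demo_db():
    """Create an in-memory database with a hypothetical all_customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table all_customers (customer_id text, name text)")
    conn.executemany(
        "insert into all_customers values (?, ?)",
        [("C001", "Acme"), ("C002", "Bolt & Co")],
    )
    return conn


def run_sql_as_html_table(conn, sql):
    """Run a query and wrap the result set in an HTML table."""
    cursor = conn.execute(sql)
    headers = [col[0] for col in cursor.description or []]
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>")
    for row in cursor:
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def handle_request_path(conn, path):
    """Map a request path such as /sqlgateway?sql=... to (status, body)."""
    parsed = urlparse(path)
    params = parse_qs(parsed.query)
    if parsed.path != "/sqlgateway" or "sql" not in params:
        return 404, "not found"
    try:
        return 200, run_sql_as_html_table(conn, params["sql"][0])
    except sqlite3.Error as exc:
        return 400, f"SQL error: {escape(str(exc))}"


class GatewayHandler(BaseHTTPRequestHandler):
    conn = None  # set by serve() before the server starts

    def do_GET(self):
        status, body = handle_request_path(self.conn, self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=1234):
    """Serve the gateway on localhost until interrupted."""
    GatewayHandler.conn = make_demo_db()
    HTTPServer(("127.0.0.1", port), GatewayHandler).serve_forever()
```

Running `serve()` and pointing an Excel Web Query at the URL should pull the table straight into a sheet. Note that the spaces in the SQL need to be URL-encoded (`%20`) by the time the request reaches the server; browsers and Excel normally take care of that.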

What would the Talend job look like?

The job might operate something like this:

  • It would run either on the client as a service or on a LAN based server (or on a remote server, with an SSH VPN (or Hamachi) to provide security).
  • At start-up, it would do a bunch of ETL tasks, pulling data from remote sources and databases, transforming and aggregating data etc., and storing the resulting data in local SQLite databases.  It might also build Palo cubes or update larger enterprise databases.
  • The job would then set up a Jetty web server and await requests for data.
  • The requests might be a mixture of raw SQL or requests to run specific Talend transformations, which would return a dataset directly to the calling client or maybe just acknowledge the request and queue it up for processing later, sending the resulting dataset by email or RSS feed when finished.
  • At a fixed time the service would shut itself down and requeue itself for the next day’s workload.

… or nothing at all like that, and that’s the point: build what you need, add the levels of security (or none at all) that fit your situation, all within an open framework, with zero lock-in (okay, still using Excel; anyone for OpenOffice, Google Apps or Zoho?).  You don’t even need your own server; host it on an EC2 instance (if you bring up an instance for 10 to 12 hours every working day, it would cost about $20 to $25 a month).

Now tell me that doesn’t make sense?

Amazon S3; there’s a holdup on the buckets, Dear Liza…

Amazon’s S3 service has been down since 9.00am PDT but I only noticed an hour ago (2.30pm PDT) when an EC2 instance launch failed.

Am I worried? No, but as I become more and more dependent on such services, perhaps I will be. Then again, at least I’ll not be alone.  WordPress.com and countless others will be using the same excuse to their customers, and unlike Reginald Perrin, who had a different excuse every day for his train’s late arrival…

Ep.1   “Eleven minutes late, staff difficulties, Hampton Wick.”
Ep.1   “Eleven minutes late, signal failure at Vauxhall.”
Ep.1   “Eleven minutes late, staff shortages, Nine Elms.”
Ep.1   “Eleven minutes late, derailment of container truck, Raynes Park.”
Ep.1   “Eleven minutes late, seasonal manpower shortages, Clapham Junction.”
Ep.2   “Eleven minutes late, defective junction box, New Malden.”
Ep.4   “Eleven minutes late, overheated axle at Berrylands.”
Ep.4   “Eleven minutes late, defective axle at Wandsworth.”
Ep.5   “Eleven minutes late, somebody had stolen the lines at Surbiton.”

a whole industry will shout in unison “6 hours late (and counting), overheated axle on US Buckets…”

NX rather than VNC for EC2 Desktop

The various Amazon EC2 AMIs that I’ve built over the last few years are getting a bit long in the tooth. Most are based on Fedora 4 and nearly all are over-burdened with software I no longer use nor require. Time for some rationalisation.

I figure I need two ‘template’ AMIs: one containing the bare minimum of software, EC2 tools, Python, Perl and Java; the second loaded with the likes of Kettle, Talend, Hamachi VPN, OracleXE, Palo MOLAP Server and Palo ETL Server and a Gnome desktop accessible via VNC.

I’m deciding whether to use Centos or Ubuntu as the basis for one or both templates. I’m more familiar with Centos’s RedHat heritage but Ubuntu’s design goals of ease-of-use and ease-of-update appeal.  Since I was in the process of re-evaluating my EC2 builds I decided to also check out NX as an alternative to VNC. I had tried to install NX Server on a Fedora 4 instance a few years back, but had abandoned the effort having spent the best part of a day on it, reverting back to my VNC comfort zone.

This time I was able to use one of Eric Hammond’s Ubuntu AMIs with NX pre-installed.  Wow, what a difference! It’s much more responsive, even over my temperamental fixed wireless broadband connection. I also tried it using my backup ISDN line; again, a huge improvement compared to using VNC. If you’re still using VNC to remotely access EC2 or any other remote server, you’ve got to check out NX.

Oracle in the cloud …

… not yet, but Bill Hodak from Oracle has just opened a thread over on the Amazon AWS developer forums, looking for feedback on the use of Oracle in AWS projects. First there was Red Hat, then this week’s announcement from Sun and now Oracle; has Amazon managed to turn itself into the cloud provisioner not just for the hungry masses of start-ups and independent developers but for the technology elites?

As for using Oracle on EC2, yes please. Most of my datasmithing career has been spent behind the wheel of an Oracle database, the front-ends might have been Excel or some BI package, the end results might have been SAP master data take-ons or an Essbase cube, but the blood and guts were always Oracle. And this was before Oracle Apex – think what wonders could have been achieved if I had access to such a product in the past.

When EC2 first appeared I enthusiastically installed Oracle 10g Express, using a Hamachi VPN to tunnel the Apex front-end back to my PC (don’t ever expose an Oracle 10g server to the public internet, its architects assumed it would be used solely within the corporate firewall). I even used the power of Oracle’s redo logs to partially protect against the ephemeral nature of EC2’s disk storage.

It looked to me back then that EC2 could be an ideal hosting environment for Oracle Application Express (aka Apex, aka HTML DB), but for a few wee problems:

  • It’s not absolutely clear whether the Oracle 10G Express database licence covers its use in a virtual environment (sometimes the restriction of one database per server is stated as one per machine); a few attempts to look for a definitive yea or nay on the product’s support forums elicited no response. I’m guessing it’s fair-usage, but confirmation would be nice.
  • Oracle doesn’t appear to know what to do with Apex, you get the impression they’re afraid it’ll cannibalise its lucrative J2EE business.
  • 10g Express is severely hobbled as a database, not just the 4GB per server (or is that machine), it’s lacking any sort of updating service, serious security flaws remain unpatched and username/passwords are sent in plain text; making it suitable (and then only barely) for use within a firewall or VPN.
  • Once you outgrow Express, you’re into big money and even worse you might have to talk to a sales rep!

So what would I like to see Oracle offering on EC2? A paid AMI, preloaded with a variation of Express, minus the 4GB limit, with a “hardened” public internet facade, along with regular patches automatically applied. Optional add-ons…

  • Various levels of support, fixed monthly charge perhaps.
  • Ability to upgrade to the full Enterprise Editions, but again paid for via a combination of AMI hourly charges and optional month-to-month support charges.
  • Ability to purchase once-off consultancy, both from Oracle and third-party suppliers.

I’m not holding my breath though…

Oh, if you’re confused over the various “Express” terms used in the above, don’t blame me, blame Oracle. I think the poor branding profile (constant name changes, copy cat names) is an indication of Oracle’s lack of commitment to both products.

UPDATE Sept. 22nd 2008

Looks like the Oracle Cloud has arrived…

Postgres Plus Cloud Edition is boring …

… and that’s good. That’s how I like my databases, boring, reliable, consistent, easy to use.

SimpleDB on the other hand is not boring, it’s an exciting new shiny thing that opens up a myriad of new possibilities; but first, I and the rest of the developer community, need to tool up and cast aside some of our cherished database design patterns (oh like, 3rd normal form, strong typing, joins, nothing major) and embrace a slightly different way of thinking, however, as much as I like a challenge, I also like to get things done.

That’s where EnterpriseDB’s new Postgres Plus Cloud Edition comes in. This is an Amazon EC2/S3 hosted edition of their Oracle compatible PostgreSQL-based product that offers the scalability of SimpleDB but the familiarity of a traditional relational database. The “magic” is supplied by Elastra, who are also offering the same functionality against MySQL and standard PostgreSQL databases.

A Talend ETL job, which I had been developing for a client, had been tested against a “normal” EnterpriseDB instance. This ETL job was part of a BI prototype trialling Postgres Plus Cloud Edition (the new name for EnterpriseDB’s cloud offering) as the back-end database. So, I exported the job as a Java executable, fired up an EC2 instance, copied up the generated JAR files, changed the database’s hostname to that of the Postgres Plus “cloud” database, ran the ETL job and it worked. As I said, boring, nothing to report, it just worked.

Now you may be wondering what’s so special about these Elastra powered databases, surely EC2 is no different from any other Linux virtual machine, why not simply install a standard database? The problem with EC2, and it is a problem to those of us (i.e. practically every IT pro on the planet) who have come to expect highly reliable RAID backed disk storage, is the non-permanence of its disk systems.

When an EC2 instance is powered down or fails, the disk system is wiped!

That, combined with fixed (if generous) disk sizes (160GB, 850GB or 1690GB), means that often a clustered database environment is a necessity, adding considerably to the complexity. It’s this sort of complexity that SimpleDB and Elastra address.

The obvious use-case for both Elastra and SimpleDB is as data stores for OLTP applications, but Elastra’s ability to handle S3-backed massive databases means the possibility of using EC2 as a data warehousing platform is also considerably strengthened. Although not obvious at first glance, SimpleDB could also act as an OLAP data store: SimpleDB’s massively indexed tuples as “sparse dimensions” pointing to S3 objects (SQLite databases?) that hold the fact data combined with dense/“partitioning” dimensions (e.g. Time). Possible? Yes. Fun to do? Yes. A solution that I can apply tomorrow? No, that’s why I’m glad EnterpriseDB and Elastra are delivering such a boring product!

UPDATE EC2:

The other big EC2 omission – non-permanent IP addresses – has at last been addressed. EC2 now offers “Elastic IP Addresses”, addresses associated with an account, not an instance. If the instance fails or is shut down, the IP address can either be immediately re-assigned to a new instance (no more waiting for Dynamic DNS propagation) or “reserved” for future use at a cost of USD 0.01 per hour. Also, the new “multiple locations” facility puts the API changes in place to allow for location selection, hopefully a sign that we here in Europe will have “local” EC2 instances to match our European S3 buckets!

UPDATE EnterpriseDB:

It looks like IBM have invested in EnterpriseDB, possibly as a counter-weight against Sun’s acquisition of MySQL (EnterpriseDB’s targeting of Oracle’s customer base would also be an added benefit!).

The WAN is the new LAN

While discussing SimpleDB, Nick Carr points to the polar opposite views that the two computing behemoths, Google and Microsoft, hold as to the future direction of cloud computing. Google’s Schmidt sees an eventual 90/10 split, with the cloud being the home to most data and processes, while, as expected, Microsoft’s Raikes points to the current reality and insists that the trend will continue to favour a PC-centric view.

I’m not sure who’s right, but my instinct (or is that my prejudice) would be towards the Google view. But one thing I am sure of is that, as the cloud (aka the Internet) and “personal computing devices” (aka desktops, laptops, PDAs, mobile phones) fight it out for dominance, the future of the business LAN as the prime computing backbone is looking increasingly untenable. For SMEs and consumers at least, the WAN (in the form of the Internet) is the new LAN.

Not that LANs will disappear totally; the necessity to provide local wireless access and the address limitations of IPv4, plus the need to share printers etc., will see to that (at least in the short-term, but mobile 3G networks, IPv6 and services such as PrinterAnywhere may eventually address these issues). Also, the ability to act as a local cache for backups and data access will ensure the LAN’s continued existence, at least until Korean levels of broadband speed/availability become the norm in the rest of the developed world.

But what about shared private data, email/calendar, backups, security and last but not least, business applications; the big five “business” reasons that lie behind the justification for most organisations’ (and some families’) LAN setups?

Shared Private Data

Fast ubiquitous broadband and online data stores such as S3, SimpleDB, Microsoft Live Workspace and eventually GDrive will mean that for many small and medium companies the cost of maintaining in-house data servers will no longer make economic sense. Even large organisations, who have in many cases already out-sourced their data centres to the likes of IBM and are already operating VPNs over private and public WANs, may also move parts of their data infrastructure to the internet cloud. Added value online storage services such as those provided by Google’s Docs and Spreadsheets will also drive individuals and organisations in this direction.

Email / Shared Calendars

One word: Google Apps. Okay, that’s 2 words and a bit simplistic, but GMail and Google Calendar, and particularly the premium Google Apps versions, represent the future shape of business communication systems. Add in Wiki-like collaborative tools such as Google Docs and Spreadsheets (and the long awaited Googlified JotSpot) and suddenly the idea of any SME running its own Exchange servers becomes harder to justify.

Data Backups

Even in current setups, an effective backup policy requires that data be moved off-site, so online backup services are a natural progression. In essence the LAN is working as a local cache to quickly assemble the backup and prepare it for transportation to another location (the boss’s home study most likely!). Online backup will probably be the first cloud service that businesses adopt. But as transactional data increasingly gets recorded off-site, most of an organisation’s data will already be “backed up”; so, future backup services will be of the intra-cloud, belt’n’braces type, e.g. a service that makes encrypted copies of your data stored on one service and either stores them in another online location or maybe burns the data to DVD and deposits it in a physical secure store.

Security

LANs are seen as the modern data equivalent of a medieval town, with its firewall playing the role of the town fortifications. But just as increased mobility, collaboration and newer technology put an end to the justification and utility of walled towns, a similar fate awaits the firewalled LAN.

The explosion in the number of workers (especially knowledge workers, free agents and senior executives) operating outside the local network means that companies must already address data security in the context of public networks. VPNs can of course bring the LAN environment to the mobile worker (even a home/tiny business can use something like Hamachi VPN). But VPNs will not extend the LAN but replace it; increasingly to be used as “private pipes” between trusted peers and cloud servers.

For example, I use Hamachi to communicate with my EC2 instances and to transfer data between my laptop and my main desktop PC; something I can do securely and effortlessly from my laptop using any private or public network. As such, the firewall that really keeps my data secure is the one on my laptop not the one built into my LAN router.

You might look at the recent spate of data losses as evidence that companies should batten down the hatches and throw away the key, but I’d argue that it’s a failure to face up to and manage the risks (and opportunities) of mobile data that has caused most if not all of these breaches. The first step is to focus on the “Wifi-enabled, easily-stolen laptop connected to a dodgy airport public network” as the “standard” against which your firm’s (and family’s) data security will be judged and eventually tested.

Applications

For many small businesses the business applications they use tend to be either single user packaged apps or, even more likely, Excel. Having a shareable cloud-based data store is all they require to abandon their LAN. But for those businesses that rely on sophisticated multi-user systems, replacing in-house servers will be more difficult. There are three options as I see it:

  • Keep servers in-house but purchase or lease them as pre-configured “black boxes”. When a new version or bug fix is required, the vendor remotely updates the software; no on-site technical expertise required. Likewise, the vendor remotely monitors the hardware and slots in a new pre-configured box as required. You may argue that the LAN remains and yes it does, but this sort of setup would only be required where high-speed and reliable broadband is not yet available or where any interruption in server connection is not an option.
  • Use remote pay-as-you-go, invoke-as-you-need virtual servers such as Amazon’s EC2 or Scotland’s Flexiscale. Again, using pre-configured virtual machines that can be either purchased or leased from software vendors removing the need to have in-house server or application expertise.
  • And finally, the ideal for most companies, SaaS, Software as a Service, pioneered by Salesforce.com and now starting to gain traction across not just CRM, but accounting, and even full scale ERP. Even the mighty Sage is starting to feel the winds of change! Very small businesses are also well catered for, e.g. FreeAgentCentral for UK based freelancers.

Times they are a-changin’, migration of some or all data to the internet cloud is inevitable, large organisations will most likely build their own cloud, smaller businesses will need to adapt to the cloud-as-a-service model. Organisations need to start thinking about it now as all future IT investments need to factor this phenomenon in, even if the reaction is to reject it!

SimpleDB + S3 = distributed document-centric database

I’m a database man. I’ve worked on or about most variations on the theme, from roll-your-own flat files, to hierarchical, to CODASYL network databases, to the current crop of relational and MOLAP platforms. Of late, I’ve been investigating what I think will be the future of database technology, the distributed document-centric database. Today, the future arrived in the form of Amazon’s new SimpleDB service.

Up until now Amazon’s S3 service offered one half of the future platform, the “distributed document-centric” bit, but it lacked the indexed structure part to make it a true database; in combination with SimpleDB it’s now complete.

SimpleDB stores data in a Domain/Attribute schema-less and type-less structure having more in common with a spreadsheet than a traditional relational table. If you’ve worked with the likes of SQLite (manifest typing) or Excel (no predefined schema and manifest typing) then you’ll appreciate this is no hardship, quite the opposite in fact (I find the strong typing nature of most databases a real pain having worked recently on a SQLite combined with Excel project).
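To give a feel for what “schema-less and type-less” means in practice, here is a toy in-memory model in Python. This is not the SimpleDB API; the `ToyDomain` class and its method names are invented for illustration. Items are just named bags of attributes, each attribute can hold several values, nothing is declared up front, and every value is kept as text.

```python
from collections import defaultdict


class ToyDomain:
    """A schema-less, type-less store: item name -> attribute -> set of strings."""

    def __init__(self):
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item_name, **attributes):
        """Add attribute values to an item; attributes need not be pre-declared."""
        for name, value in attributes.items():
            values = value if isinstance(value, (list, tuple, set)) else [value]
            # Everything is stored as text; there are no column types.
            self.items[item_name][name].update(str(v) for v in values)

    def query(self, attribute, value):
        """Return the sorted names of items whose attribute holds the value."""
        return sorted(
            name
            for name, attrs in self.items.items()
            if str(value) in attrs.get(attribute, ())
        )


def demo():
    domain = ToyDomain()
    domain.put("doc1", author="tom", tags=["ec2", "s3"])
    domain.put("doc2", author="ann", tags=["s3"], pages=12)  # extra attribute, no schema change
    return domain
```

Notice that `doc2` gains a `pages` attribute that `doc1` simply doesn’t have, much as one row of a spreadsheet can have a value in a column that the row above leaves blank.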

The distributed nature of SimpleDB may however pose some difficulty to those of us (i.e. almost everybody) raised in the world of ACID compliant databases. Because of the Brewer’s Conjecture effect, SimpleDB sacrifices consistency for availability and partition tolerance i.e. when you write something to the database, an immediate query may not return the updated value, subsequent queries will eventually return the new data, exactly when depends on the load and the availability of resources. Those of you already using S3 will already be living with this “feature”, and in practice you rarely notice it (most updates seem to appear immediately) but it will still pose design challenges to handle the edge cases.
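The read-your-own-write surprise is easier to see in a toy model. The Python sketch below is purely illustrative and makes no claim about how SimpleDB or S3 replicate data internally; the class, its three replicas and the explicit `propagate()` step are all assumptions. A write lands on one replica straight away and reaches the others only later, so a read served by a lagging replica returns the old value.

```python
class ToyEventuallyConsistentStore:
    """Toy replicated key/value store where writes reach replicas lazily."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        # Replica 0 accepts the write immediately; the others lag behind.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        """Read from one specific replica, which may be stale."""
        return self.replicas[replica_index].get(key)

    def propagate(self):
        """Apply all outstanding updates, in order, to the lagging replicas."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


def demo():
    store = ToyEventuallyConsistentStore()
    store.write("price", "10")
    store.propagate()
    store.write("price", "12")
    stale = store.read("price", replica_index=2)   # still the old value
    store.propagate()
    fresh = store.read("price", replica_index=2)   # caught up
    return stale, fresh
```

The design challenge mentioned above is exactly the window between the second `write` and the second `propagate`: code that writes and then immediately reads back needs to tolerate, or retry around, the stale answer.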

The service is still in limited Beta, but the documentation is available and if you’ve already used any other AWS product you’ll immediately feel at home. The pricing is again based on usage; the cost of storage is much higher than S3, being $1.50 per GB-month, but a GB of structured data is an awful lot of data (and the larger document style storage would be provided by S3).

If you’ve not yet tried out either S3 or EC2, now might be a good time to start, cloud computing has come down to earth, all thanks to an online book store, Amazon!

Firefox tune up time again …..

This morning Firefox just got slower and slower; clicking on a link or a text box took ages to respond; using online WYSIWYG editors became next to impossible; I was also getting an error when attempting to connect to Google Sync.

I checked the usual suspects; internet connection OK; did a quick HijackThis scan and analysis to check if anything nasty was on the PC, nope, again OK; fired up IE7, it worked fine; launched Firefox in safe mode (disables add-ons and other extensions) but the problem persisted. All signs that the culprit was my Firefox profile.

This has happened before so I knew what to do.

Firefox Profile Dialog

From the command line I launched Firefox with the “-p” option which brings up the profile dialog, created a new profile and relaunched; everything back to normal, except of course all my bookmarks and my browser extensions were gone.

Reinstalling my extensions is easy enough and offers an opportunity to do some much needed spring cleaning. The first extension I always re-install is Google Sync, for when it’s back in business I can then restore my old bookmarks and passwords (not my highly sensitive passwords I hasten to add, I use KeePass to manage those – never store financial passwords and the like in your browser’s profile).

The extensions I regard as must have are:

  • Google Browser Sync – keeps a backup of my bookmarks, remembers what tabs I had open last time and restores them if required, means I can easily flick between my laptop and my desktop. (And of course, it’s very useful when it comes to rebuilding a new PC or profile!). UPDATE: Google Sync is to be discontinued.
  • Del.icio.us add-on – tag and search using my del.icio.us account. Why use both del.icio.us and Google Sync’d bookmarks? Well, I use bookmarks for my day-to-day commonly used links, while I use http://del.icio.us as my long-term KM memory bank.
  • S3Fox – for managing my backups and other file storage needs on my Amazon S3 account.
  • FlashGot – download manager used in conjunction with Free Download Manager. UPDATE: I’m now using DownThemAll (a Firefox plugin) rather than FDM, mainly because of FDM’s inability to handle certain ASP and PHP redirects, the prime example being downloads from SourceForge.
  • Google Toolbar – for searching blogs, quick link to Gmail, spell checker, page rank checker.
  • British English Dictionary – to use Firefox’s built-in spell checker (using this now, rather than Google Toolbar’s spell checker).
  • PDF Download – gives me control over how I access PDF links.
  • NoScript – allows me to control what JS/Java/Flash scripts run, and also provides excellent XSS protection. Can be annoying sometimes, but I stick with it. To make it less annoying (but not as secure) go to Options and allow top-level sites by default (including 2nd level domains).
  • EC2 UI – for controlling my Amazon EC2 images.
  • I also install, but do not auto-enable, several other add-ons such as Firebug (understand/debug the structure of a web page), iMacros (web browsing macro recorder/screen scraper) and SQLiteManager (manages my SQLite databases).

Ruby plus Amazon S3 – Document Centric Database

I’ve said it before and I’m going to repeat myself; learning Ruby has proven to be a great investment, not so much for the language itself but for the insights it gives into other technologies. As soon as a new ‘cool’ technology or idea hits the street some smart Rubyist is bound to attack it, dice it up and serve it back up as easy to digest Ruby code.

Today, it’s the turn of Document Centric Databases done in the style of CouchDB, but replacing JavaScript/Erlang with Ruby and the bespoke data store with Amazon’s S3 service.

Anthony Eden’s RDDB project is still very much alpha, but looking through the code it seems to have lots of good ideas, including using EC2 instances as “map reduce workers” listening on Amazon SQS Queues; so the whole Amazon AWS stack might yet get starring roles. The actual data store can be varied, with both partitioned file system and RAM based options currently available alongside S3.
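To show the shape of the “workers listening on a queue” idea, here is a rough Python sketch. It is not RDDB’s code (RDDB is Ruby) and it doesn’t touch AWS at all: in-process queues stand in for SQS, threads stand in for EC2 instances, and the word-count map function and sample documents are made up for the example.

```python
import queue
import threading
from collections import Counter


def map_document(doc):
    """Map step: emit a word count for one document."""
    return Counter(doc["body"].lower().split())


def worker(tasks, results):
    """Stand-in for an EC2 worker: pull documents off a queue until told to stop."""
    while True:
        doc = tasks.get()
        if doc is None:          # sentinel: no more work for this worker
            break
        results.put(map_document(doc))


def run_map_reduce(docs, n_workers=2):
    # Local queues play the role that SQS queues would play between machines.
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for doc in docs:
        tasks.put(doc)
    for _ in threads:
        tasks.put(None)          # one sentinel per worker
    for t in threads:
        t.join()

    # Reduce step: merge the partial counts produced by the workers.
    total = Counter()
    while not results.empty():
        total += results.get()
    return total


docs = [
    {"id": 1, "body": "ruby on s3"},
    {"id": 2, "body": "ruby on ec2"},
]
totals = run_map_reduce(docs)
```

Because the workers only ever talk to the queues, adding capacity is a matter of starting more of them, which is presumably the attraction of pairing this style of processing with on-demand EC2 instances.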

The other Amazon AWS related news was today’s announcement of an option to use European data centres to store S3 data (with a slightly higher charge than using North American locations and with the transfer of data between EU based S3 buckets and US based EC2 instances being no longer free). I’m guessing that the option to fire up European based EC2 servers can’t be far behind. Also, one piece of news I’d missed was that EC2 is now in unlimited beta, i.e. it’s now open to all developers. So developers everywhere can, for less than the cost of a mobile text message, fire up their own dedicated and powerful Linux server. The day of a production ready, SLA backed, EC2 service is around the corner.

Amazon EC2: S, L and XL – now we’re sucking diesel..

As of today, Amazon EC2 now supports two new Instance Types..

… a “Large” and an “Extra Large” instance type to complement the original instance type and provide more flexibility for EC2 users. The new instance types provide more memory, CPU, and instance storage, and are based on 64bit technology. EC2 users can now utilize these different instance sizes to support an even broader set of applications and use cases.

The Large Instance is equivalent to roughly four Small Instances (our original instance), and the Extra Large Instance is roughly equivalent to eight Small instances.

This increases the attractiveness of EC2 as a platform for micro ETL/BI activities; the extra memory accessible under the new 64bit instances makes the commissioning of pure in-memory, on-demand, open source PALO OLAP instances a real alternative. And it’s not just micro BI activities that could utilise this sort of service; many of the large BI implementations I’ve worked on in the past could easily be handled by this type of kit.

Also this week, /n Software announced the private beta of a Java version of their RSSBUS Server Engine; this could be a very useful on-demand micro ETL tool especially now that it will be capable of running under Linux (the current version requires a Windows IIS Server).

Nirvanix targets Amazon S3 shortcomings

Let there be no doubt about it, Amazon’s S3 online storage system is wonderful; it’s secure (both from a technology point of view and from Amazon’s status as one of the web’s most trusted sites i.e. one you wouldn’t worry about giving your credit card to), it’s cheap, it’s pay-as-you-go and it has first mover advantage, but (there’s always a but) it has until now lacked competition. And because it lacked competition the various shortcomings (such as no support for HTTP POST file upload, no SLAs etc.) that S3 users complain about are handled by Amazon in what can best be described as ..

..we hear what you’re saying, we have it on a list; no, we’ll not tell if/when we’ll remedy this problem (or explain why it’s not possible to do so); and anyway if you don’t like it, who else provides anything comparable?

Okay, I’m being unfair here, I’m sure Amazon has very good reasons for how they do things and scalability and “keeping it simple” seem to be their development mantra; and this is a good thing for an online 24/7 storage infrastructure. But, as in all things in life, competition would help not just disillusioned users by offering another comparable service but would help Amazon prioritise items on its S3 roadmap.

Most would have assumed that when that competitor arrived it would be either Google or Microsoft; instead the first up to bat is Nirvanix, a San Diego startup which appears to be associated with another online storage player, MediaMax. Pricing is similar to S3, but with the option of purchasing extra SLA backed support packages, something that has been top of the list for many actual and potential S3 users. Other “missings” that Nirvanix addresses are:

  • File upload via HTTP POST; S3 restricts uploads to HTTP PUTs, which require the use of a proxy server or the installation of client software.
  • File rename and move; S3 requires that a file is first deleted and then reloaded.
  • In-built support for media processing such as image resize/rotate for thumbnails.
  • Multi-tenant accounts, each S3 account supports only a single ‘user view’.
  • Files are indexed via tags and name, not just by name as is the case with S3.
  • Granular control of usage limits and reporting, S3 only offers ‘after-the-fact’ reporting.
  • Maximum file size of 256GB compared to Amazon’s 5GB.

The Nirvanix authentication method uses a much simpler and more traditional username/password over SSL approach than S3’s key-pair based URL signing method. This can be seen as either a weakness or a strength, but combined with Nirvanix’s support for POST file uploads, multi-tenant accounts and granular usage controls it makes building browser based clients much simpler.
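For anyone who hasn’t met S3’s URL signing, the following Python sketch shows roughly how a time-limited GET link is built under the HMAC-SHA1 query-string scheme, as I understand it. The access key, secret, bucket and file name are placeholders, and the sketch assumes an object key that needs no URL escaping; treat it as an illustration rather than a drop-in client.

```python
import base64
import hashlib
import hmac
import urllib.parse


def sign_s3_url(access_key, secret_key, bucket, key, expires):
    """Build a time-limited GET URL using HMAC-SHA1 query-string signing.

    `expires` is a Unix timestamp after which the link stops working.
    """
    resource = "/%s/%s" % (bucket, key)
    # Fields: verb, Content-MD5, Content-Type, expiry, resource.
    # The two unused fields are left as empty lines.
    string_to_sign = "GET\n\n\n%d\n%s" % (expires, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    query = urllib.parse.urlencode({
        "AWSAccessKeyId": access_key,
        "Expires": expires,
        "Signature": signature,
    })
    return "https://s3.amazonaws.com%s?%s" % (resource, query)


# Placeholder credentials and object names, purely for illustration.
url = sign_s3_url("EXAMPLE-ACCESS-KEY", "example-secret",
                  "my-bucket", "backup.zip", 1200000000)
```

Notice that everything hinges on the secret key: whoever holds it can mint valid signatures for any request against the account, which is why a browser based client can’t simply be handed the key.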

S3’s industrial grade authentication is all fine and dandy but if the key becomes compromised, all’s lost; you could expose not just your data but your wallet if somebody used the compromised key to maliciously upload Terabytes of data. This single point of failure is perhaps my main complaint about S3’s current set-up.

So, am I getting ready to jump ship? No, at least not yet, as:

  • Amazon is still Amazon, they may be lacking SLAs but they have my trust.
  • S3’s role as a back-end to Amazon EC2.
  • Friendly and effective forums offering excellent support provided by both the developer community and Amazon’s own staff.
  • CNAME support. (e.g. http://www2.gobansaor.com/)
  • Did I mention EC2?

Should Amazon be worried? No, this is not a zero-sum game, in fact competition will help grow awareness and expand the market for all “cloud” based services.

RoR Data Warehouse on EC2

If you’ve been putting off evaluating Ruby on Rails and you’re lucky enough to have an Amazon EC2 beta account then it’s your lucky day. Paul Dowman has just made a public AMI (think of it like a virtual machine spec from which you can create a running EC2 instance) with various Ruby on Rails goodies preloaded.

Features:

  • Automatic backup of MySQL database to S3 every 10 minutes.
  • Mongrel_cluster behind Apache 2.2, configured according to Coda Hale’s excellent guide, with /etc/init.d startup script
  • Ruby on Rails 1.2.3
  • Ruby 1.8.5
  • MySQL 5
  • Ubuntu 7.04 Feisty with Xen versions of standard libs (libc6-xen package).
  • All EC2 command-line tools installed
  • MySQL and Apache configured to write logs to /mnt/log so you don’t fill up EC2’s small root filesystem
  • Hostname set correctly to public hostname
  • NTP
  • A script to re-bundle, save and register your own copy of this image in one step (if you want to).

I’ve been meaning to try out Anthony Eden’s RoR based data warehousing tool for some time; no more excuses, as I can now fire up an EC2 instance based on Paul’s AMI, install the ActiveWarehouse plugin and away I go. As ActiveWarehouse primarily uses techniques described in The Data Warehouse Toolkit it’s also a good learning tool for those new to data warehousing. All I need now is a sizeable publicly accessible dataset to populate the warehouse to get a true feel for its capabilities. There’s only so much you can do with the venerable Northwind database. Does anybody know of a ‘beefier’ alternative?