Category Archives: AmazonAWS

Excel as a book of record.

In the past I’ve talked about Excel as a tool to develop Really Simple Systems. Such “systems” usually occupy the middle ground between continuing to do a task by hand or  investing time/money in using a packaged/bespoke “proper system”.

When such systems are primarily reporting in emphasis, the justification for using Excel is usually straight forward and compelling (and getting even more compelling with the appearance on the scene of Excel 2010′s  PowerPivot.) But, alarm bells sound across the world of professional IT when Excel is proposed as a “book of record” i.e. when it’s to be used to store and maintain a business critical dataset. And, with some considerable justification, the nightmare that is linked (or even worse, shared) workbooks is very real indeed. But yet, businesses continue to do so, and do so quite successfully.

I myself record my business as a series of Excel documents (Invoices, Timesheets, Expenses) in a set of folders (Financial Years subdivided into major customers).  Essentially a document-oriented database.

In the past I simply then used a VBA powered workbook to open the required “documents” and extracted the data necessary for whatever report I required (VAT, year-end etc.).  To better automate (i.e. less bespoke VBA) this task I’ve have made changes to HAMMER to help with this and with similar tasks for clients.

The following list of commands will be added to the next release of HAMMER. (In the meantime these new functions can be previewed here …)

LISTOFFILES

This command takes a single argument, the folder to search, and will return a table of files in that folder and in any sub-folders. The result can then be used to select a list of files for further processing.

Example:

=HAMMER(“C:\a\rss”,”LISTOFFILES”)

_XLTOJSONDOC

This command takes a list of workbooks, opens each one, checks for a list of named ranges and generates a JSON document. The command is intended to be called from within a VBA macro (as opening and closing workbook breaks the “no side effects” rule of UDFs). Most “_” commands such as “_MD5″ etc. are likewise intended for “programming use”, but any command beginning with “_XL” must be restricted to macro (i.e. non-UDF) use.

See the example workbook FetchInventory for an example of this function in action. The function takes one argument (the name of the document to load) and expects a table where the last column is the full name of the workbook to open. Any columns in the source table will be copied to the new “JSON document” table with an additional column called “Document” which will hold a JSON document representing the key-name pairs and table(s) extracted from the workbook.

On opening a workbook, it is searched for a named range with the name of document concatenated with “_fields” (e.g. INVENTORY_fields). The value of this range is expected to be a CSV list of fields and tables to load. A single Excel “document” could contain multiple logical documents (each specified by its own “_fields” list) .

See the PartsInventory_bin4 for an example of a multi-document workbook (INVENTORY and EXAMPLE). The EXAMPLE document in this workbook also demonstrates the various types of tables handled.

Example:

lJSONObjects = oHammer.HAMMER(“C:\a\rss\StockTake1″,”LISTOFFILES”,”Select name,fullname from table2 limit 1″,”SQL”,”INVENTORY”,”_XLTOJSONDOC”)

lReturn = oHammer.HAMMERToRange(lJSONObjects,”Sheet2!A27″)

… will output

JSONDOCVIEW

This command is where the previous commands are leading to, i.e. extracting some real information value from your documents. It converts JSON documents into Excel friendly tables. It is, in essence, a Map function as in MapReduce. In a previous example I used a Python Map and a SQL Reduce, here, both Map and Reduce are via SQL (the command uses a series of SQL commands to perform its task).

Before I describe the function let me explain why I use an intermediate JSON format. I could just extract the data directly from each document and either store directly in Excel or create tables in SQLite of Access to hold this data. And in fact, that’s what I would have done in the past (seeExcel as a document-oriented NoSQL database). Now , however, I tend to favour using a free-format (i.e. no need for a fixed database schema) structure like a JSON document, so as the source documents evolve over time (which tends to happen not just during design stages but as the system matures) this will not break older documents.

So, for example, original Invoice workbooks might not have a backing time-sheet while newer Invoices do. As long as new and old documents share a core sub-set of data fields they can continue to be analysed together.

The command takes 5 arguments and a driving table (a record so far for HAMMER commands, most have a max of two arguments). The driving table’s last column is assumed to contain the JSON document to process, columns prior to this (if any) will be output unchanged for each resulting row.

The first argument specifies the name of the “inner” table to fetch (if any). Most real life documents consist of header details (the “outer”document) and one or more tables (“inner” details). Invoices, time-sheets,  stock-takes, all tend to follow this pattern. This command will effectively join each document’s outer details to a single inner table (if more than 1 inner table, a call for each one is required).

The second (field list in SQL format) and third (SQL where predicate format) arguments specify what inner fields to extract (if blank, then all) and what restrictions to impose (if any). So “InvNo, Date”,”InvNo > 12″ would only fetch documents where the InvNo > 12 and only include the InvNo and Date fields.

The fourth and fifth arguments do the same for the outer table (i.e. Header data).

If any of the columns specified  (inner or outer) can not be found, or if the predicates (inner or outer) result in no selection, no error is returned, the document simply returns no rows. Likewise if an inner table is specified and no such table exists, then no rows are returned for that document – in other words this is not an outer join, which is not usually a problem as in most cases a “header” without detail lines is meaningless. If an outer join is required, then extract the headers (outers) and details (inner tables(s)) separately and join using SQL.

Example:

=HAMMER(“Select Name,FullName,Document from invoice_docs”,”SQL”,”table_2″, “[PART NUMBER],QTY”, “QTY >30″, “Bin_Number”, “Bin_Number > 1″, “JSONDOCVIEW”)

would result in:

For more complex JSON objects use the JSON command to incrementally parse the text or use the VBA JSON module within microETL. But for most situations (especially if you control the expected format) JSONDOCVIEW should handle it.

As JSON is fast becoming the preferred transport format for web and mobile applications having the ability to parse and produce JSON form within Excel is very useful. It is possible, for example, to use a simple web technology such as http://robla.net/jsonwidget/ to craft another type of Really Simple System. This time with the collection happening on the web (most likely using AWS S3 pre-signed forms, so no HTML server required – keep it simple) but with the control and reporting remaining within Excel (a variation on my Steam Powered Server idea).

For an example of a really simple system  download this.

Latest version of HAMMER including the above commands now released …

SQL – does exactly what it says on the tin

SQL how unloved it must feel sometimes, constantly being maligned, accused of being on the wrong side of the object-relational impedance mismatch,  lacking the glamour of OO programming languages that claim the moral high ground. Yet at the same time hewing and hauling most of the world’s structured data on its old but well fashioned back.

SQL is perhaps the world’s most popular DSL, a declarative language for the manipulation of tabular data, easy to learn yet capable of powerful (and sometimes complex) expressions.  And like the Ronseal ad, a SQL statement no matter how simple or complex, does exactly what it says, all the complexity of loops and iterations and the attendant errors, abstracted away, it just works!

SQL is both a programmer and an end-user tool; after Excel formulas, it’s the language most likely to be understood and used by “civilians”.  There are few enough such cross-over tools, so think twice before building a datastore that doesn’t offer a SQL API.  And I guess that’s what Amazon did. Although SimpleDB is not a relational database, they’ve decided to add a SQL API, following Google’s lead with its SQL front-end to the non relational big-table backed Google App datastore.

SQL is also the reason why I’ve integrated SQLite with Excel , leveraging SQL to manipulate tabular data with greater efficiency and fewer errors while still keeping the touchy-feely power of Excel.   I expose SQLite to Excel via UDFs rather than menu options or wizards, so that the transformation logic is visible and approachable (at least to those comfortable with excel formula “programming” and with basic SQL).

SQL is my weapon of choice because of my belief in the primacy of data. It is data that matters in the long run, not the algorithms or GUIs that temporarily use (and abuse) it.  In my time in Guinness Ireland I had the task of transferring master and historical transactional data from “legacy systems” into SAP ,Siebel and a new datawarehouse; data that had a decade and a half earlier been transferred by me  into those same legacy systems from even older systems. In fact, the data’s electronic lineage could be traced back to a 1960′s era ICL mainframe  (I have the original spec!) and I’m sure it existed in accountancy machine punch-cards  prior to that. Understand a business’s data and you’ll not just understand the business as it currently operates but also how it operated in the past and its future potential.

SQL abú.

Why not join me on Twitter at gobansaor?

Windows on EC2 = SMEs on EC2

The announcement that Win2003 is now an an option on EC2, is very significant, that and EC2′s exit from beta status with an SLA in tow, means that AWS is now very much more appealing to the great unwashed, the SMEs. i.e. the businesses who form the backbone of most of our economies.

Large companies and start-ups are comfortable in the world of Linux servers but most small companies are Windows to the core.  This may not be “right”, this may not be how it “should be”, but it is so.   Even within large companies, departmental computing is largely a Windows only enclave, with MS Office (and Excel in particular) as the backbone and MS SQL Server as the database of choice (or is that, no choice).

The other interesting thing is that my fear that EC2 SQL Server Standard instances would be licensed as per Oracle has not come to pass (Oracle while making a “big thing” of their recent EC2 cloud conversion, still insist on traditional licensing for EC2 database instances). SQL Server Standard is available on a pay-as-you-go model, brilliant!.

Even if running Win2003 as a server doesn’t catch your fancy and in fact you would much rather get rid of your existing Window’s laptop to be replaced by a cool new Apple Mac. Unfortunately you still need the ability to run Windows-only software, why not use EC2 as your on-demand pay-as-you-go Window’s desktop replacement?  Simply configure a Windows AMI with your required software (you may have to use something like this, if software is only available on CD); you could then use Jungle Disk to easily share data (via S3) between your new shiny Mac and the AMI.  Power up and down as required, easier than using VMWare or Parallels and @ 12.5c per hour, probably cheaper too.

Why Larry hates the cloud, and my data trinity.

Last week Oracle certified Amazon EC2 as a supported platform, that same week Larry Elison attacked the concept of cloud computing as pure hype. Obviously, Larry is not happy with this whole cloud thing, and I think it’s not just the threat it poses to the software industry’s traditional licensing model that worries him, rather, as Robert X. Cringely points out in his “Cloud computing will change the way we look at databases” post, it’s the likelihood that it sounds the death-knell for large-scale traditional databases.

This new database paradigm is memory rather than disk centric, with the disk-based element acting as an archive/backup/restore mechanism which can easily be stored on commodity SAN devices ( e.g. Amazon’s ESB). Using MapReduce technology Google effectively holds the whole Internet in memory, not in one big super computer but in lots of cheap commodity servers.

But it’s not just in the realm of mega datasets that RAM based databases threaten traditional models. Excel is a memory-based database engine, so too in-memory OLAP tools such as Palo. Such products’ ability to handle large volumes of data has increased over the years, with the decrease in RAM costs and the appearance of cheap 64 bit machines (which are no longer limited to 2G/3G process working sets).

That doesn’t mean that we’ll throw away SQL databases in their entirety, SQL and the relational model will continue to be useful. But perhaps of greater use in local datastores/caches that as the building blocks for large scale datastores. For such local caches, less will be more; fewer features, easier to configure, more flexibility. That’s why I like SQLite; long after the dinosaurs of the database world have disappeared, I imagine SQLite databases will continue to survive, embedded in mobile phones, browsers, wherever a local datastore is required. And more than likely operating in memory rather than off disk.

By combining Excel with an in-memory SQLite database, linked to a Palo OLAP in-memory server, it’s possible to take advantage of three powerful data-processing technologies (spreadsheets, SQL, multi-dimensional cubes) all within your PC’s RAM. You could do serious datasmithing with such a combination on a pretty mediocre laptop, with most modern machines providing an excess of CPU power, no need for super fast disks, just as much memory as you can muster. And, with Windows on EC2, these three amigos will soon be capable of being used as a cloud bursting platform.

Excel, SQLite and Palo, my data trinity.

Clouds no longer pass by Windows.

Amazon today announced that later this year, Windows Server woud be available on EC2. No details on cost and licensing etc. but this is major.  Up until now, that portion of the business world who are pure MS shops (a very large percentage especially amongst SMEs) were excluded from taking advantage of Amazon’s amazing (and getting more amazing everyday) EC2 platform

From my point of view, as with Oracle’s announcement last week, this releases yet more of my “legacy” skillset for deployment in the clouds. Although I’ve been involved with  *nix servers for 20 years or so, as corporate servers became more locked-down (and removed to the control of 3rd party data centres) I lost day-to-day experience of using them; in latter years my main ‘hands-on’ platform was Windows, either my own PC or local departmental NT servers. Windows on EC2 will allow me to use a whole new set of Windows only software (e.g. RSSBus or XLsgen) and of course SQLServer.

The lack of SQLServer on EC2 has been a major problem for me as a datasmith; there’s an awful lot of data out there sitting in SQLServer databases, but currently if I need to “cloud burst” such datasets I would have to first extract the data to, say, csv files and then load the data on to a Linux compatible database. But with a SQLServer instance running in the cloud, I could simply use SQLServer’s native backup/replication tools.  No more need to download data to my “ground-based” PCs resulting in quicker turnaround and fewer data security risks.

On the licensing front,  I’m presuming that the OS licence will be on a pay-as-you-go basis, but what about SQLServer and other server products?  Will MS do an Oracle on it, i.e. require a traditional upfront use-it-or-lose-it payment or will they the go the radical (but I thing inevitable) path of a licence-by-the-hour. 

First RedHat, then Sun, then Oracle and now Microsoft; the mighty beasts of our industry have acknowledged there’s a new mighty beast on the prowl, dressed as a humble bookseller no less!