Category Archives: S3

Excel as a book of record.

In the past I’ve talked about Excel as a tool to develop Really Simple Systems. Such “systems” usually occupy the middle ground between continuing to do a task by hand or  investing time/money in using a packaged/bespoke “proper system”.

When such systems are primarily reporting in emphasis, the justification for using Excel is usually straight forward and compelling (and getting even more compelling with the appearance on the scene of Excel 2010′s  PowerPivot.) But, alarm bells sound across the world of professional IT when Excel is proposed as a “book of record” i.e. when it’s to be used to store and maintain a business critical dataset. And, with some considerable justification, the nightmare that is linked (or even worse, shared) workbooks is very real indeed. But yet, businesses continue to do so, and do so quite successfully.

I myself record my business as a series of Excel documents (Invoices, Timesheets, Expenses) in a set of folders (Financial Years subdivided into major customers).  Essentially a document-oriented database.

In the past I simply then used a VBA powered workbook to open the required “documents” and extracted the data necessary for whatever report I required (VAT, year-end etc.).  To better automate (i.e. less bespoke VBA) this task I’ve have made changes to HAMMER to help with this and with similar tasks for clients.

The following list of commands will be added to the next release of HAMMER. (In the meantime these new functions can be previewed here …)

LISTOFFILES

This command takes a single argument, the folder to search, and will return a table of files in that folder and in any sub-folders. The result can then be used to select a list of files for further processing.

Example:

=HAMMER(“C:\a\rss”,”LISTOFFILES”)

_XLTOJSONDOC

This command takes a list of workbooks, opens each one, checks for a list of named ranges and generates a JSON document. The command is intended to be called from within a VBA macro (as opening and closing workbook breaks the “no side effects” rule of UDFs). Most “_” commands such as “_MD5″ etc. are likewise intended for “programming use”, but any command beginning with “_XL” must be restricted to macro (i.e. non-UDF) use.

See the example workbook FetchInventory for an example of this function in action. The function takes one argument (the name of the document to load) and expects a table where the last column is the full name of the workbook to open. Any columns in the source table will be copied to the new “JSON document” table with an additional column called “Document” which will hold a JSON document representing the key-name pairs and table(s) extracted from the workbook.

On opening a workbook, it is searched for a named range with the name of document concatenated with “_fields” (e.g. INVENTORY_fields). The value of this range is expected to be a CSV list of fields and tables to load. A single Excel “document” could contain multiple logical documents (each specified by its own “_fields” list) .

See the PartsInventory_bin4 for an example of a multi-document workbook (INVENTORY and EXAMPLE). The EXAMPLE document in this workbook also demonstrates the various types of tables handled.

Example:

lJSONObjects = oHammer.HAMMER(“C:\a\rss\StockTake1″,”LISTOFFILES”,”Select name,fullname from table2 limit 1″,”SQL”,”INVENTORY”,”_XLTOJSONDOC”)

lReturn = oHammer.HAMMERToRange(lJSONObjects,”Sheet2!A27″)

… will output

JSONDOCVIEW

This command is where the previous commands are leading to, i.e. extracting some real information value from your documents. It converts JSON documents into Excel friendly tables. It is, in essence, a Map function as in MapReduce. In a previous example I used a Python Map and a SQL Reduce, here, both Map and Reduce are via SQL (the command uses a series of SQL commands to perform its task).

Before I describe the function let me explain why I use an intermediate JSON format. I could just extract the data directly from each document and either store directly in Excel or create tables in SQLite of Access to hold this data. And in fact, that’s what I would have done in the past (seeExcel as a document-oriented NoSQL database). Now , however, I tend to favour using a free-format (i.e. no need for a fixed database schema) structure like a JSON document, so as the source documents evolve over time (which tends to happen not just during design stages but as the system matures) this will not break older documents.

So, for example, original Invoice workbooks might not have a backing time-sheet while newer Invoices do. As long as new and old documents share a core sub-set of data fields they can continue to be analysed together.

The command takes 5 arguments and a driving table (a record so far for HAMMER commands, most have a max of two arguments). The driving table’s last column is assumed to contain the JSON document to process, columns prior to this (if any) will be output unchanged for each resulting row.

The first argument specifies the name of the “inner” table to fetch (if any). Most real life documents consist of header details (the “outer”document) and one or more tables (“inner” details). Invoices, time-sheets,  stock-takes, all tend to follow this pattern. This command will effectively join each document’s outer details to a single inner table (if more than 1 inner table, a call for each one is required).

The second (field list in SQL format) and third (SQL where predicate format) arguments specify what inner fields to extract (if blank, then all) and what restrictions to impose (if any). So “InvNo, Date”,”InvNo > 12″ would only fetch documents where the InvNo > 12 and only include the InvNo and Date fields.

The fourth and fifth arguments do the same for the outer table (i.e. Header data).

If any of the columns specified  (inner or outer) can not be found, or if the predicates (inner or outer) result in no selection, no error is returned, the document simply returns no rows. Likewise if an inner table is specified and no such table exists, then no rows are returned for that document – in other words this is not an outer join, which is not usually a problem as in most cases a “header” without detail lines is meaningless. If an outer join is required, then extract the headers (outers) and details (inner tables(s)) separately and join using SQL.

Example:

=HAMMER(“Select Name,FullName,Document from invoice_docs”,”SQL”,”table_2″, “[PART NUMBER],QTY”, “QTY >30″, “Bin_Number”, “Bin_Number > 1″, “JSONDOCVIEW”)

would result in:

For more complex JSON objects use the JSON command to incrementally parse the text or use the VBA JSON module within microETL. But for most situations (especially if you control the expected format) JSONDOCVIEW should handle it.

As JSON is fast becoming the preferred transport format for web and mobile applications having the ability to parse and produce JSON form within Excel is very useful. It is possible, for example, to use a simple web technology such as http://robla.net/jsonwidget/ to craft another type of Really Simple System. This time with the collection happening on the web (most likely using AWS S3 pre-signed forms, so no HTML server required – keep it simple) but with the control and reporting remaining within Excel (a variation on my Steam Powered Server idea).

For an example of a really simple system  download this.

Latest version of HAMMER including the above commands now released …

PowerPivot mini-me server!

I’ve been extending microETL’s functionality to make using it as a “Steam Powered” server easier and more automated. The major addition this week is an Excel friendly JSON parser. This will allow me to use JSON objects for interchange messages and for configuration files. I was going to use XML as there’s already an element of support within Excel for the format via XML Maps and the like, but I decided that JSON was too common a format to ignore, so I modified this existing VB6 example for my purposes. The major change, other than VBA-enabling it, is the use of detached  ADO Recordsets to handle list-of-lists structures (aka tables).

So what’s the likely use for my brand new shiny JSON objects? (By the way don’t be put off by the talk of objects and the like, JSON is essentially a simple text based format see here for a description.)

I’ve demonstrated that Excel with the help of microETL can function as a Web Service endpoint; whereby, a web request is send to the TCP/IP port Excel is listening on; Excel then processes that request (in the meantime blocking any further requests as Excel is single threaded); and then issues a response with the necessary data and/or error messages.

This is not ideal. Excel is not a server product and while it can be used as one in very low-traffic environments it has the potential to be a serious bottle-neck. But, if you approach this problem using Excel as responder to a queue rather than as a always-on TCP/IP port listener, it becomes surprisingly fit-for-purpose as a server.

What do I mean by a queue-responder? An example will best explain.

Say, you’ve developed a very useful PowerPivot model that you wish to share with your co-workers. Unfortunately they’re a mixed bunch, some have access to Excel 2010 and potentially PowerPivot but most are still on Excel 2003, and another set are outside contractors who use a web gateway to access your internal systems.

Now when I say model, I mean the data relationships, cleansed data and measures you’ve developed not the imagery created on the workbook rendered via pivot tables, charts or the like. If you need to share at this level you can do so by sending the workbook to those colleagues with Excel 2010 (0r 2007 for non-dynamic view) or by investing in a SharePoint farm. Neither option is a runner in this case, as you don’t have the money nor technical resources to set-up SharePoint and the detail data behind the model is not suitable for sharing. So what to do?

Well if you had  microETL “server edition”  installed (remember,this is vapour-ware but most of the building blocks are in place) you would:

  1. Create one or more flattened PivotTables based on your super-duper model. Static elements would be “column” fields while “client selectable” elements would be assigned to either “pages” or “slicers“.
  2. Run the “service discovery” microETL function which would create one or more “service manifests” based on the flattened pivots and associated pagers and slicers. This manifest would be published as a JSON object and would be “discoverable” via a standard “discovery service”. The function would auto-build the SQLScripts necessary to handle the discovery service and the “services” associated with each flattened pivottable.
  3. Start a TIMER on the now microETL-server-enabled workbook. It would check a “resource” for any “requestMD5hash.request” JSON messages. If either an equivalent “requestMD5hash.busy” ( & not older than a certain time) or “requestMD5hash.response” exists then ignore; if not, create a “requestMD5hash.busy” file; open the .response file; process the required service using the data parameters supplied and write back a .response file.
  4. Repeat until  the TIMER is disabled.

So what’s this “resource” and what’s the requestMD5hash stuff? The “resource”, in its simplest implementation, would be a “client accessible” folder; so it could be a Windows or Samba shared folder, or a folder on the same machine as the Excel “server” with the JSON messages being push/pulled to/from that folder by a simple web server. But it could also be an AWS S3 bucket or an AWS SQS queue. In fact, the .request->.busy->.response cycle is a simple queue implementation.

The requestMD5hash stands for the MD5 hash of the .request JSON message. This will uniquely indentify the message (i.e. the hash of the the service, the parameters and optionally any security IDs or time stamps) and will be used to both track the request and to provide a simple type of request caching.

Also as the MD5 hash can be calculated on the client-side so those situations where the “resource” is a “dumb folder” can be handled, as it provides a server-free means of generating an Unique ID (rather than the more usual server powered “here’s my request and a server sends back an ID” cycle).

So what about the clients? Those with Excel, would use an cut-down version of the microETL add-in (or macro-enabled workbook) to handle: The Request (encode in JSON) -> Wait for a response or come back later -> Response (decode JSON) cycle.

Those who access via a web gateway would delegate this cycle to a web app. The app might repackage the JSON response as a pretty web page or simply as a basic HTML table or XML Map (for import into un-macro’d Excel or other spreadsheets) or pass on the JSON as is.

The web app could also handle the wait period by doing some AJAXy “please wait” magic with a timeout message like “Excel servers not responding – Please resubmit at a later stage”. In this case if the server managed to catch up before the resubmit the response would be waiting and recognised via its “.request” MD5 hash.

What about security?

The “discovery” service would have the option of linking in with a simple User/Role facility, which would:

  1. Filter the “slicer” elements at the discovery stage (a manifest would default to sending a list of single column tables each one associated with a slicer, listings all the possible elements that can be filtered through that slicer). A security filter could restrict such lists.
  2. Validate the filters requested to ensure that the “allowable” slicer elements where not doctored to include blocked elements. (The slicer lists provided by the discovery service would ideally be used by the client workbooks or web apps to populate lookup lists to provide an element of client-side validation.)

Would it scale?

If you mean would it scale to handle your growth as you went from a small firm to a mega world-wide empire, no it wouldn’t. But it could scale to handle linear growth by simply making a copy of the workbook and starting it on another Excel process, or on another machine if front-ended by a simply round-robin assigner. The MD5 identifiers would also enable a simple caching mechanism enabling the handling of traffic spikes. Any client who contracts to interact with such servers would operate in a two message cycle (send the request, then send a separate series of one or more “is it ready” requests). This overhead is handled by the client and/or dumb-hubs and/or intermediate web-services, allowing Excel to concentrate on churning the model.

Anyway enough, it’s 20C outside which is very unusual for early April in Ireland, I’m off to act as a “server” of different sort, on barbecue duty …

S3 as an Excel hub

In my Steam Powered PowerPivot post I explained the concept of powered-spoke based Excel systems. One of the hub options was Amazon’s Simple Storage Service, S3. I’ve been using S3 both as a simple backup service and as a “systems hub” since the service began in 2006. Back then, it was not necessarily the easiest to use or the most suitable, but it was cheap and reliable. The code I used to access S3 from Excel can be found here http://blog.gobansaor.com/2008/02/22/xlaws-excel-vba-code-for-accessing-amazons-s3-and-simpledb/

Over the years, lots of new features have been added to S3 that have made the service more useful and easier. The final big missing, for me at least, was the lack of multiple  authorisation credentials  for a single account i.e. the ability to set-up multiple users with their own access policies. Last year, with the launch of IAM (Identity & Access Management) Service, that last major problem was removed.

Although currently IAM is accessible only by the API that’s not a problem as my favourite tool for managing S3 (and AWS in general) is Python’s boto package, and it has that, and everything else you might wish to do S3 -wise, covered.

Although Boto does most of my heavy lifting (particularly on the server-side), I still have need for a VBA based S3 utility. Boto is fine for situations where I’m using microETL with Python enabled, but in many situations this is not ideal or even possible, particularly on the client-side of systems. Here, even a single file add-in may not be possible which often means a macro-enabled workbook is the only option.

MicroETL has a noSQL compile option which enables a lot of the code to operate without the SQLite libraries, thus allowing it to be used as a single-file VBA add-in or embedded in a workbook.

Up until now my S3 code depended on the xliteSha1Hmac.dll to provide SHA1-HMAC hashing (necessary to authorise access to S3) meaning that only pre-signed S3 URLS could be used in single-file deliverables. Even if this was not the case, prior to IAM, pre-signed URLs (usually with time limits) were often the only option as sharing an S3 account’s single authorisation credentials with all and sundry was not ideal from a security point of view. But post IAM, being able to set-up sub-accounts with specific access policies means the full power of S3 can now be used at client-level.

I’m in the process of adding a pure VBA HMAC-SHA1 facility (32 bit only at the moment, 64 bit continues to require the xliteSha1Hmac.dll). I’m also adding a JSON encode/decode module to allow tables to be more easily shared with non-Excel programs, currently I default to using tables encoded using ADO’s export XML schema to share via S3, but the JSON option will make it easier to share with a pure Python server (such as a PiClound based service, see my Expand Excel’s Horizons – look to the cloud post).

Cloud-shared tables can not only pass datasets back and forth but can pass SQLScripts (which are, after all, simply tables) for processing elsewhere.

For example, client workbooks not having Python  (or SQLite ) functionality installed could write a script, post it to S3; that script table could then be picked up by a “server” microETL workbook (or by a pure Python server), executed, and the results posted back to S3 to be picked up by the client (possibly waiting a response using a TIMER).

Server processes could also do the opposite; passing down SQLScripts for client-side execution.

When people think of utilising the cloud, this sort of scenario is not what comes to mind, thinking that Excel is a desktop product so it cannot take advantage of the enormous cost-advantages of cloud-based solutions. Not so, it’s now very easy to integrate the likes of S3 into existing processes, adding that cloud magic but keeping those existing works-so-don’t-fix-it processes firmly on the ground.

If you need help figuring out how to take advantage of S3 (or other AWS services) and how to integrate them with your existing technology base do contact me. I’ve been working with this technology for over 5 years (and have 30 odd years of experience with good-old reliable “ground-based systems”,

What’s with the bucket picture? S3 calls its primary data storage unit a bucket.

UPDATE:

IAM is now available from the AWS Management Console http://aws.typepad.com/aws/2011/05/identity-and-access-management-console-support.html


Steam Powered PowerPivot

In previous posts I described various options for enabling PowerPivot models (or indeed any Excel model) as the hub in a hub’n'spoke application. This thin’n'core approach is the norm for web apps as it was for the client server applications that preceded them.

But as we know, Excel “applications” rarely follow this pattern.

They tend to follow a more peer-to-peer topology, often with each player being an “equal”; collaboration being the primary aim rather than service provision.

But even in such scenarios there usually is a hub at the centre providing the communication linkages. Most likely the company’s email system or a “shared drive” of some sort. In fact, on many corporate teams or projects, you know you’ve been accepted into the fold when the project manager says the magic words “I’ll get you access to the shared drive”.

But sometimes, hub-like service provision or a single controller role is required and in such situations one of the spokes takes over that responsibility. For example, in a budgeting scenario one spoke may be the source of historical stats, in effect, performing a service role.

If classic client-server uses the powered-hub-driving-the-client-spokes like the driven wheels of a car as an analogy, then many Excel systems could be compared to a steam engine’s driven wheels; where the hub provides a centre to rotate about but it’s a particular spoke that does the driving due to its linkages to the piston arm.

Such systems built around the sending back and forth of workbooks or CSV files via email or shared drives are common and most of us have seen them in action. Most tend to be manual, in that they require user action on both client and server side, but with a bit of VBA magic one or both can be automated. Options for doing so on the server side are:

  • A VBA loop that sleeps and checks for change, process any changes and sleeps …
  • Trigger a microETL.SQLScript using a COM automation call from a external program like that used in this scenario.
  • Trigger a script using an RTD Server that checks for new activity.
  • Use a “Javascript” timer as per Excel Hero’s neat trick.

RTD servers are rarely encountered in the wild as they only appeared on the scene in Excel XP and were not the easiest things to code or understand but now new tools such as ExcelDNA makes coding RTDs much more approachable and useful.

Daniel Ferry‘s JSTimer idea is a simpler VBA alternative to an RTD and is non-blocking unlike a pure VBA sleep-check-sleep loop. It also has the advantage that no add-in is required so the client-side could be a single-file macro-enabled workbook.

Using a shared folder or email as the conduit for passing data back and forth and using Excel files or CSVs as the data artefacts may be the obvious choices but other “more modern” alternatives exist.

For example, for situations where a shared folder isn’t an option why not use something like Amazon’s S3? S3 is secure (much more secure than external email), reliable, simple to use and extremely cheap. A modern-day alternative to an FTP server. Ideal for big corporates,SMEs and sole-traders.

Orchestrating the processing of files on S3 is very simple using tools such as Python’s Boto (another good reason for having Python in-built in microETL). For the client-side, pure VBA could  be used, see my xlAWS code.

Likewise, using workbooks as the medium of data exchange is usually the first option, with CSV files a close second. But since Excel 2003 Professional, Excel has had excellent support for the exchange of XML files via XML Maps. Although not useful for highly complex schemas or very large datasets (both of which are anyway the work of the devil ;) ), XML Maps are really good at handling relatively simple dialog and configuration inter-change requirements.

The process of importing/exporting data via XML can be easily automated using basic VBA but can also be safely left under manual control. XML files can also be generated and read by non-Excel components.

So when you’re confronted with a PowerPivot model that is not suitable for direct sharing, due to:

  • technical constraints e.g. non-2010 clients
  • or business reasons e.g. detail data behind model, or the model itself, needing to be kept secret
  • or model complexity reasons i.e. too complex for a single PowerPivot model …

… investing in a top-of-the-range SharePoint farm is not your only option. By utilising some of the enormous power that comes as standard with Excel such as VBA, XML Maps, RTDs etc. further empowered by an Excel-focused ETL tool such as microETL a lot can be achieved on a relatively small budget.

If you need help designing or configuring such a scenario or need help in understanding where PowerPivot can fit in to your decision making and information distribution requirements, contact me. My job in life is to provide than help …

Download latest version of microETL from http://www.gobansaor.com/microetl

Windows on EC2 = SMEs on EC2

The announcement that Win2003 is now an an option on EC2, is very significant, that and EC2′s exit from beta status with an SLA in tow, means that AWS is now very much more appealing to the great unwashed, the SMEs. i.e. the businesses who form the backbone of most of our economies.

Large companies and start-ups are comfortable in the world of Linux servers but most small companies are Windows to the core.  This may not be “right”, this may not be how it “should be”, but it is so.   Even within large companies, departmental computing is largely a Windows only enclave, with MS Office (and Excel in particular) as the backbone and MS SQL Server as the database of choice (or is that, no choice).

The other interesting thing is that my fear that EC2 SQL Server Standard instances would be licensed as per Oracle has not come to pass (Oracle while making a “big thing” of their recent EC2 cloud conversion, still insist on traditional licensing for EC2 database instances). SQL Server Standard is available on a pay-as-you-go model, brilliant!.

Even if running Win2003 as a server doesn’t catch your fancy and in fact you would much rather get rid of your existing Window’s laptop to be replaced by a cool new Apple Mac. Unfortunately you still need the ability to run Windows-only software, why not use EC2 as your on-demand pay-as-you-go Window’s desktop replacement?  Simply configure a Windows AMI with your required software (you may have to use something like this, if software is only available on CD); you could then use Jungle Disk to easily share data (via S3) between your new shiny Mac and the AMI.  Power up and down as required, easier than using VMWare or Parallels and @ 12.5c per hour, probably cheaper too.