Death of the Star Schema?

With the release of the next version of PowerPivot around the corner (mid March I think), I’ve been re-acquainting myself with its new features. Most of the current version’s annoyances have been remedied (no drill-thru, no hierarchy support for example); and the additional enhancements to the DAX language (crossjoins, alternate relationships etc) make modelling m0st any problem possible (and generally easy).

The more I come to know PowerPivot, the more I believe that modelled data warehouses‘ days are numbered. I didn’t say data warehouses per se, rather those that attempt to centrally model end user reporting structures (usually as star-schemas).

There will continue to be a need for centrally controlled data warehouses (or at least simplified data views (and/or copies) of operational datasets, either provided by system vendors of by in-house IT) to bridge the raw-to-actionable data gap. But I suspect the emphasis will change from providing finished goods to providing semi-processed raw materials.

So, will the star-schema become redundant? No, as it still a valid method of modelling a reporting requirement in order to make many queries simpler to phrase (this obviously applies to SQL , but also to DAX queries). But, those who build them will be doing so closer to the problem at hand, and specific to that problem (I’ve discussed this before in http://blog.gobansaor.com/2011/01/08/slowly-changing-dimensions-time-to-stop-worrying/).

For many reports the barely modified operational data model will be all that’s required (for example, DAX doesn’t require “fact” header/detail tables to be flattened to detail level, as would be the case with a classic star).

“Good Enough” models will become the norm; classic “Everything You Ever Wanted to Know” centralised models a luxury for most (especially as such models tend to “age” very quickly).

If you’re about to invest or re-furbish your data warehouse or your reporting data sub-systems, don’t do so without first taking a serious look at PowerPivot. This is a game-changer, not just for full-stack Microsoft BI shops, but for any business that finds that their reporting datasets invariably end-up in Excel.

If you need any help evaluating PowerPivot or modelling your reporting needs in PowerPivot,  I’m for hire.

Update:

Just in case you think I’m an dimensional-model un-believer or likely to abandon my star-schema roots read this….

Star Schemas: to explore strange new conformed dimensions, to seek out new measures, to boldly go where no Excel spreadsheet has gone before.

PowerPivot VBA Refresh Code – Bug Fix

Just a quick post to alert those of you using my PowerPivot Refresh code to a bug in its “refresh a single table” logic. Under certain circumstances, linked tables (i.e. those on the “many” side of a relationship) will fail to refresh if specified individually (fine when part of a refresh all). See here for the details behind the bug, and thanks again to Rob Parker for bringing it to my attention. I’ve updated the sample code with the fix, download it here.

I’ve also tested the code against the just released SQLServer 2010 RC0 version of PowerPivot and it appears to work.

I’ve updated the equivalent PPREFRESH code in HAMMER and this is now part of the tool’s latest release (V1.3.4(Beta)). My previous post, Excel as a Book of Record, previewed the most important new commands available in this release. Alongside those, I’ve also added the following:

  • ISTABLE, if the previous COMMAND’s result  or the previous argument is not a table, this will abend the command sequence.
  • ISARG, as above, but this time checks for an argument (a HAMMER parameter may either be a table, a command or an argument- aka, an ARG).
  • ISOK, previous argument must be the string value “OK”.
  • TABLESARETHESAME, will fail if the last two tables are not identical. Intended mainly for automated regression testing.
  • ARGSARETHESAME, as above. but this time for ARGS.
  • _GUID, will return a globally unique identifier.

Download the latest version of HAMMER from here …

Excel as a book of record.

In the past I’ve talked about Excel as a tool to develop Really Simple Systems. Such “systems” usually occupy the middle ground between continuing to do a task by hand or  investing time/money in using a packaged/bespoke “proper system”.

When such systems are primarily reporting in emphasis, the justification for using Excel is usually straight forward and compelling (and getting even more compelling with the appearance on the scene of Excel 2010′s  PowerPivot.) But, alarm bells sound across the world of professional IT when Excel is proposed as a “book of record” i.e. when it’s to be used to store and maintain a business critical dataset. And, with some considerable justification, the nightmare that is linked (or even worse, shared) workbooks is very real indeed. But yet, businesses continue to do so, and do so quite successfully.

I myself record my business as a series of Excel documents (Invoices, Timesheets, Expenses) in a set of folders (Financial Years subdivided into major customers).  Essentially a document-oriented database.

In the past I simply then used a VBA powered workbook to open the required “documents” and extracted the data necessary for whatever report I required (VAT, year-end etc.).  To better automate (i.e. less bespoke VBA) this task I’ve have made changes to HAMMER to help with this and with similar tasks for clients.

The following list of commands will be added to the next release of HAMMER. (In the meantime these new functions can be previewed here …)

LISTOFFILES

This command takes a single argument, the folder to search, and will return a table of files in that folder and in any sub-folders. The result can then be used to select a list of files for further processing.

Example:

=HAMMER(“C:\a\rss”,”LISTOFFILES”)

_XLTOJSONDOC

This command takes a list of workbooks, opens each one, checks for a list of named ranges and generates a JSON document. The command is intended to be called from within a VBA macro (as opening and closing workbook breaks the “no side effects” rule of UDFs). Most “_” commands such as “_MD5″ etc. are likewise intended for “programming use”, but any command beginning with “_XL” must be restricted to macro (i.e. non-UDF) use.

See the example workbook FetchInventory for an example of this function in action. The function takes one argument (the name of the document to load) and expects a table where the last column is the full name of the workbook to open. Any columns in the source table will be copied to the new “JSON document” table with an additional column called “Document” which will hold a JSON document representing the key-name pairs and table(s) extracted from the workbook.

On opening a workbook, it is searched for a named range with the name of document concatenated with “_fields” (e.g. INVENTORY_fields). The value of this range is expected to be a CSV list of fields and tables to load. A single Excel “document” could contain multiple logical documents (each specified by its own “_fields” list) .

See the PartsInventory_bin4 for an example of a multi-document workbook (INVENTORY and EXAMPLE). The EXAMPLE document in this workbook also demonstrates the various types of tables handled.

Example:

lJSONObjects = oHammer.HAMMER(“C:\a\rss\StockTake1″,”LISTOFFILES”,”Select name,fullname from table2 limit 1″,”SQL”,”INVENTORY”,”_XLTOJSONDOC”)

lReturn = oHammer.HAMMERToRange(lJSONObjects,”Sheet2!A27″)

… will output

JSONDOCVIEW

This command is where the previous commands are leading to, i.e. extracting some real information value from your documents. It converts JSON documents into Excel friendly tables. It is, in essence, a Map function as in MapReduce. In a previous example I used a Python Map and a SQL Reduce, here, both Map and Reduce are via SQL (the command uses a series of SQL commands to perform its task).

Before I describe the function let me explain why I use an intermediate JSON format. I could just extract the data directly from each document and either store directly in Excel or create tables in SQLite of Access to hold this data. And in fact, that’s what I would have done in the past (seeExcel as a document-oriented NoSQL database). Now , however, I tend to favour using a free-format (i.e. no need for a fixed database schema) structure like a JSON document, so as the source documents evolve over time (which tends to happen not just during design stages but as the system matures) this will not break older documents.

So, for example, original Invoice workbooks might not have a backing time-sheet while newer Invoices do. As long as new and old documents share a core sub-set of data fields they can continue to be analysed together.

The command takes 5 arguments and a driving table (a record so far for HAMMER commands, most have a max of two arguments). The driving table’s last column is assumed to contain the JSON document to process, columns prior to this (if any) will be output unchanged for each resulting row.

The first argument specifies the name of the “inner” table to fetch (if any). Most real life documents consist of header details (the “outer”document) and one or more tables (“inner” details). Invoices, time-sheets,  stock-takes, all tend to follow this pattern. This command will effectively join each document’s outer details to a single inner table (if more than 1 inner table, a call for each one is required).

The second (field list in SQL format) and third (SQL where predicate format) arguments specify what inner fields to extract (if blank, then all) and what restrictions to impose (if any). So “InvNo, Date”,”InvNo > 12″ would only fetch documents where the InvNo > 12 and only include the InvNo and Date fields.

The fourth and fifth arguments do the same for the outer table (i.e. Header data).

If any of the columns specified  (inner or outer) can not be found, or if the predicates (inner or outer) result in no selection, no error is returned, the document simply returns no rows. Likewise if an inner table is specified and no such table exists, then no rows are returned for that document – in other words this is not an outer join, which is not usually a problem as in most cases a “header” without detail lines is meaningless. If an outer join is required, then extract the headers (outers) and details (inner tables(s)) separately and join using SQL.

Example:

=HAMMER(“Select Name,FullName,Document from invoice_docs”,”SQL”,”table_2″, “[PART NUMBER],QTY”, “QTY >30″, “Bin_Number”, “Bin_Number > 1″, “JSONDOCVIEW”)

would result in:

For more complex JSON objects use the JSON command to incrementally parse the text or use the VBA JSON module within microETL. But for most situations (especially if you control the expected format) JSONDOCVIEW should handle it.

As JSON is fast becoming the preferred transport format for web and mobile applications having the ability to parse and produce JSON form within Excel is very useful. It is possible, for example, to use a simple web technology such as http://robla.net/jsonwidget/ to craft another type of Really Simple System. This time with the collection happening on the web (most likely using AWS S3 pre-signed forms, so no HTML server required – keep it simple) but with the control and reporting remaining within Excel (a variation on my Steam Powered Server idea).

For an example of a really simple system  download this.

Latest version of HAMMER including the above commands now released …

HAMMER Alongside, as a COM Server

Although it has always been possible to call HAMMER from within VBA via the Application.Run method, this is a somewhat clunky way of doing so and it can also be very inefficient, particularly for tight loops. But now, with this release (V 1.2.0 (Beta) ) core functionality is exposed as a COM Server, which means easier and more efficient interfacing between VBA and the .NET multi-threaded enabled world of the datasmith’s HAMMER.

I call this “HAMMER Alongside”, to differentiate it from HAMMER Inside where I use HAMMER internals to craft stand alone XLLs. With this COM Server method, the standard HAMMER add-in (installed or just-in-time registered) is needed alongside either a VBA add-in or a VBA enabled workbook. The extra “moving parts” are easily justified by the extra flexibility that this method allows (particularly to those with a reasonable grasp of VBA, but perhaps lacking any familiarity with the .NET world).

By learning a small amount of IronPython (here’s a good starting place) it’s relatively easy to hook up VBA to any .NET library through HAMMER, without having to invest time and money in learning heavy duty development environments such as Visual Studio. Using “Internal Threads, it’s also possible to take advantage of .NETs multi-threading capability from VBA. (Note: both these options require .NET4).

So with VBA becoming a first-class language with regards to HAMMER, why offer Python as an alternative scripting choice? Well, one reason is to offer a means to access .NET’s power as per the last paragraph. But, the primary reason, is related to the ability of HAMMER transformations to be “detached” from Excel and run via HAMMER’s command line tools. The .NET 4 version of the command line tool supports Python, so it is possible to initially build out a micro-ETL transformation within Excel (using Python to perform the business logic alongside SQL) and then transfer that logic to the command-line tools with a minimum of modification.

This could be useful , for example, when a PowerPivot model moves to the server, its related HAMMER-powered micro-ETL processing could do likewise (most enterprise ETL tools support call-outs to command-line processes).

The COM server can be accessed from VBA only via late-binding like so:

Set comServ = CreateObject(“hammerCOMServerV1″)

The server’s methods are:

  • hammer(…) – works the same way as the UDF version.
  • hammerVersion() – returns HAMMER version, again the same as UDF version.
  • hammerVersionOK(version) – e.g.  isOK= comServ.hammerVersionOK(120) will return TRUE if the current version is >= V1.2.0
  • arrayResize(anArray) enables the creation of “toFit” UDFs.
  • arrayToSheet(anArray) likewise for “toSheet” UDFs.
  • arrayToRange(anArray,pasteToWhereString) enables “toRange” UDFs.
  • hammer_ppRefresh_inline(optional table,optional timeout) – refreshes a PowerPivot model, again the same as its UDF equivalent.

Download the latest version of HAMMER from here …

HAMMER on the Range

The HAMMER function is at heart an array formula. For those of you familiar with Excel array processing (likely to be a minority) this makes perfect sense, as HAMMER’s main purpose in life is to process and return tabular data, and prior to 2007′s Excel Tables (actually 2003′s somewhat similar Lists), arrays were Excel’s only nod to the existence of tables as units of data.

The functionality introduced by Excel Tables is a huge improvement and if you haven’t checked it out I recommend you do; it and 2010′s PowerPivot are two of the most important enhancements to Excel since the PivotTable.

Nevertheless, if you are to write UDFs to handle tabular data you’re still talking arrays (although you can reference a table within a UDF, returning tabular data still requires an array). Array formulas also continue to a very powerful skill for those who wish to master Excel and again if you’ve not done so, do check them out. But for those of you who have an aversion to  CTRL+SHIFT+ENTER, HAMMER (and HAMMER Inside UDFs) offers a number of array helper functions:

  • HAMMERtoFit – will resize the selected destination range to fit the returned array (result remains an array). Has the disadvantage that the underlying function call will be called twice if the array area requires resizing.
  • HAMMERtoSheet – will output the array to a new sheet as a non-array table. This (and a properly sized array call to HAMMER) is the fastest method of returning variable sized datasets to Excel.
  • HAMMERtoRange – will paste the array to the range address specified (as a non-array table), will also run an optional VBA macro before/after the paste.

The first two functions have been available for some time. The toRange helper is new to this release (V1.1.1). Before I describe the new function in detail, it’s useful to understand how HAMMER (and  any C based XLL ) sees and serves-back array data. An array range passed to such a function will be converted into a two dimensional multi-type array. The two types of data that will be passed over are: Doubles  (all numerics and dates), and Strings (everything else). For those of you familiar with the Excel Object Model, that’s the equivalent of the VALUE2 property of the Range object. The apparent date conversion to a Double (e.g. 01-JAN-2001 passed as 36892) is, in fact, not a conversion, as dates in Excel are always simply numbers with date formatting applied.

Likewise,when data is returned to Excel, the data comes back in the same “raw” format, requiring date or currency formatting to be applied to the relevant cells.

The HAMMER function and the two helper functions toSheet and toFit operate to these conventions. The third helper function HAMMERtoRange (along with the XLRANGE set of commands) can work around this restriction (at a certain performance cost).

HAMMERtoRange:

The function is called in the exact same way as the parent HAMMER function but with an additional argument. This argument is a comma separated list of 1,2 or 3 elements.

If only one element supplied it’s assumed to be the address to paste the table to; if two elements, first is the address, 2nd is the name of a VBA macro to run after the data is pasted; if 3 elements, the 1st is a macro to run prior to the paste , 2nd is the address, and 3rd is  the post-paste macro.  Examples:

  • =HAMMERtoRange(…,”Sheet2!a4″)
  • =HAMMERtoRange(…,”MyNamedRange”)
  • =HAMMERtoRange(…,”Sheet2!a4,myCleanupMacro”)
  • =HAMMERtoRange(…,”,Sheet2!a4,”)
  • =HAMMERtoRange(…,”mySetupMacro,Sheet2!a4,myCleanupMacro”)
  • =HAMMERtoRange(…,”onlyASetupMacro,Sheet2!a4,”)

The address is specified as a “string” i.e. doesn’t reference a range object (otherwise the result would be an infinite loop). If you do mange to initiate a loop, exit it using the Esc key, if you don’t, the loop will eventually end in one of two ways, either with a nice exception or a, not so nice, failure of Excel. The toFit helper will never loop as it is using Excel native array functionality and Excel will protect against this; the toSheet has the potential to loop but less likely than the toRange.

You may find the HAMMERtoRange function much slower that the other helpers (due to its dependence on COM Automation rather than the C API) but only for large datasets. Making use of the pre/post macros doesn’t require much VBA skills as the outputs of “record macro” sessions are ideal for this type of processing.

The address examples above,such as “Sheet1!A1″, are the “top left-hand corner” of the resulting table. The function will clear any data within the “.CurrentRegion” of that cell, but will first check if an Excel Table already occupies that region, and if so, reconstitute a table of the same name after the new dataset is pasted.

If a cell is sourced from a SQLite date column (actually a date “cell” as SQLite, like Excel, uses manifest typing) the data will be formatted as a date (this is unlike the behaviour of toFit and toSheet). This will not normally be obvious if the source data is loaded into HAMMER via a range (see discussion above) as the data will have been delivered to SQLite as a Double. To get around this either:

  • Update the affected column within HAMMER using the SQLiteDate function via SQL e.g. “Update table1 set datejoined = SQLiteDate(datejoined)”
  • or, use the pre/post macros to format the date columns as dates
  • or, load the dataset using these newly added COMMANDS: XLRANGE or EXCELRANGEASTEXT.

XLRANGE: This command expects an address in “string” format (i.e. not a referenced range) from which it will load a table using Excel Range’s Value property. Using Value rather than Value2 means the internal function logic will know if a cell is a date or not and store that data in SQLite date format. The address can either be a “top left-hand corner” type single-cell address or a full range; if a single-cell address, then the cell’s “.CurrentRegion”  is taken to be the extent of the table to be loaded.

XLRANGEASTEXT: will load data using the Cell object’s Text property, so what you see is what you get. FOr example, if a cell has a percentage format, a value of 1 will be loaded as the text “100%” not as the numeric 1.

XLRANGEASVALUE2: is like XLRANGE but uses the .VALUE2 property i.e. simulates a “normal” range load.

All of these are much slower than a straight range load (ASTEXT is particularly slow). Given the potential performance hit you might well ask why offer a ASVALUE2 option (the others at least provide the potential useful service of preserving type and structure information)? The answer lies in a feature that ‘non-referencing string addresses’ offer, i.e. changes to datasets ‘pointed at’ by such addresses will not trigger a recalculate (as Excel is not aware of any relationship). This can be useful when building up long lists or tables for processing by HAMMER without any (potentially, long running) processing being triggered by each cell change (saves having to set the workbook to manual calculation and allows for the build out of several source tables prior to activating a re-calc).

Download latest version of HAMMER from here: