Category Archives: SQLite

microETL’s SQL function

At the heart of microETL is the SQL() command (or more correctly microETL.SQL()). The primary purpose of microETL is to allow the power of SQL to be used within Excel and SQL() is how that is delivered. The command can either be issued from a cell, like a standard excel formula, or by using the microETL SQL Menu option or as a command in a SQLScript. It can be called from a VBA macro (using Application.Run(“microETL.SQL” … ) but not if that macro is to be used as a User Defined Function (aka user defined formula). It can also be called from within another SQL statement using the equivalent SQLite user defined function also called SQL().

Example: click for a larger image (animated GIF)

=SQL() in action

SQL(sqlStatement,selectDestination(OR arg1), arg2 …. argn)

The 1st argument is the SQL statement(s) to execute. These can be any valid SQLite statements (see SQL as understood by SQLite); exceptions are the ATTACH statement and in some circumstances the DROP TABLE statement. The statement will be issued against the default microETL in-memory database i.e. an empty database that is instantiated in memory at workbook start-up and any tables or data created during the workbook session are discarded at shutdown.

So, if the database is ephemeral where will my data be stored? In the spreadsheet most likely. The other alternatives  are to use the SQLite UDFs load_CSV() and write_CSV() to load/save CSV files, or see http://blog.gobansaor.com/2011/05/11/accessing-sqlite-databases-from-excel-via-microetl/ for other options.

In order to make transferring data between Excel and the SQL environment easier and simpler a pre-processor has been added to the function. There are three types of pre-processor commands:

  • The TABLE(…) command tells SQL() to go fetch the relevant Excel range, loads the “table” into SQLite; when the SQL command finishes it writes back the data (if changed) to the originating Excel address, and cleans up the temporary SQLite-side table. Nearly all SQL commands can be issued against the resulting tables, including UPDATE, DELETE, DROP and CREATE TRIGGERS etc.
  • Two other pre-processor commands are CELL(…) and CELLS(…). Similar to TABLE(…), data is loaded from the Excel addresses specified and placed in the SQL  statement. CELL takes a single cell reference while CELLS will iterate through a multi-cell range and insert a comma-separated list into the SQL statement (in a format suitable for IN (…) expressions).
  • Positional substitution tokens in the form :1 to :n

Note these are pre-processor commands not SQLite commands and can only be accessed using the SQL() function. The pre-processor “functions” may appear to work like a normal SQLite function call but are less flexible in some ways e.g. no spaces allowed, parser expects “TABLE(” not “TABLE  ( ” but more flexible in others so a function requiring a table name can use the pre-processor to provide it e.g. =SQL(“Select load_CSV(‘TABLE(Sheet!A1)’,'c:\data\test.csv’);)”).

The sequence of events when a TABLE() pre-processor command is detected in a SQL statement is as follows:

  • SQLite SAVEPOINT is set if the SQL()  statement is not already running in the context of a previous SQL() function call. This wrapping in a transaction is the reason for ATTACH commands not been allowed. The purpose of the transaction is two-fold; 1st it will clean up any temporary tables and other updates in the case of a function failure, 2nd, SQLite inserts/updates are much faster when executed within a transaction, especially if writing to a disk-based database.
  • If the text between “TABLE(” and the next “)” is a valid range representation (or a named range pointing at such) a check is carried out to see whether that same range has been referenced before in this context (same context being within the same set of SQL statements or those of a prior “wrapping” SQL() function call). If already “handled” replace the TABLE(…) with the name of the temporary table associated with this range. If not, load the table data (as long as the range is an Excel table, or points to the top left-hand corner of a potential table) into a new temporary SQLite table and replace the TABLE(…) with its name.
  • When the last SQL statement within the current context is executed and if the temporary table has been modified (INSERT’ed, UPDATEd,DELETEd or DROPped ) the table is written back to the source range (or range cleared if a DROP statement).
  • The SAVEPOINT is released.

The remaining arguments to the SQL() function are also processed by the pre-processor, the arguments passed replace positional substitution tokens in the form :1 to :n where :1 is replaced by 2nd argument and :n is replaced by the n+1th argument.

The 2nd argument (that is the arg corresponding to the :1 token) is special. If no “:1″ token is found within the SQL Statement, the argument is assumed to be a destination for a SQL SELECT statement (think of it as a SELECT INTO). It can be either a string representation of a workbook range or a named range pointing at one. The range can be either a full range, top left-hand corner range or an Excel Table. (If none of the above, the SELECT will populate an ADO detached RecordSet and use the argument as the name of that recordset).

A number of helper functions have been added to SQLite within microETL. Note: these are not preprocessor commands and so can be used by any SQL issued within microETL. (There are a number of SQL issuing functions other than SQL() and there’s also the SQL calls issued within Python functions.)

Using the Menu option SQL

The main SQL helper functions are …

x(formula,p1,p2,p3…)

Where formula is a spreadsheet formula or built-in function, followed by a variable list of parameters. The formula can take advantage of microETL “placeholders” e.g. x(“20+ :1 + :2″,20,40) will return 80, x(“upper(:1:)”,”abc”) returns ABC.

The formula’s text, with any placeholders replaced, is passed to Excel’s Application object’s Evaluate method, so not withstanding some of the limitations associated with this function (see here) it’s possible to use most Excel formulas within a SQL statement.

Only built-in functions (i.e not User Defined Functions) can be called in this manner.  To call a UDF use the udf() (or u() /mdf())  function.

When specifying the formula do not include the leading “=”, this will ensure formula is evaluated inline along with the rest of the SQL statement. If preceded by “=”, the formula is passed back as text. This can sometimes be useful either for testing purposes or to get around Application.Evaluate limitations.

Example: click for larger view:

udf(functionName,p1,p2,p3…)

This allows any VBA public function (i.e. a UDF) in the active workbook  to be called from within a SQLite SQL statement. The u() and mdf() functions are similar but the mdf() function will only run functions in the microETL project while u() will default to the microETL project but will also take references to UDFs in other workbooks.

Py(address,p1,p2,p3…)

The address can be either a string representation of an Excel range e.g. “Sheet2!A3:Sheet2!D6″ or a named range. If the range is a single cell the function will expect to find the full Python script within, if a multi-cell range then the script will utilise cells as Python indents.

The scripts are in fact anonymous parameter-less Python functions. Parameters are passed via xLite “placeholders” which are replaced before the script is passed to the Python interpreter e.g.

If in  this case 2nd parameter is less than 40, return 1st parameter increased by 10% otherwise return unchanged.

SQLScript(address,p1,p2,p3…)

See previous post http://blog.gobansaor.com/2011/03/04/sqlscript-microetls-sql-sequencer-utility/

SQL(sqlStatement,p1,p2,p3…)

This allows the microETL.SQL() function (along with its pre-processor goodness) to be called from an environment where the pre-processor is not supported e.g. within Python. It’s also useful to process a series of SQL statements held in a table.

Various Hierarchy functions

See this Handling Flat, Parent-Child and Nested Set Hierarchies and this.

write_CSV(table,filename)

This will write out the contents of the table in classic CSV format (UTF-8 encoded). Useful for external SQLite tables that may be too big to fit in memory but need transformation before loading as a CSV into the likes of PowerPivot.

load_CSV(table,filename)

Will load UFT-8 encoded CSV files into a table. Again, if a CSV source is too big to load into memory (thinking fact tables here) but needs some SQL love’n'care can be loaded into an external table (i.e. in a disk-based database) bypassing the need to fit in Excel’s working memory.

sqliteDate(dateTimeString)

Will attempt to convert a string to an ISO DateTime string ( the format used internally by SQLite i.e. “YYYY-MM-DD hh:mm:ss”). Usually microETL will handle date conversions between Excel and SQlite and back again (but when a date is pasted back it’ll be in Excel numeric representation, the column may need to be formatted to the required date/time representation). Sometimes however the date format of imported “string dates” may not be recognised (for example, Excel loaded CSVs often fail to recognise dates correctly). This function will usually solve the problem.

ISODate(Year,Month,Day)

Will take a date passed as 3 numbers (e.g. ISODate(2010,12,31)) and convert to SQLite date format.

REGEXP

I’ve also added REGEXP (regular expression) support so you can do stuff like so…

SELECT * FROM Foo WHERE account_id REGEXP ‘[0][0-9]*[x]‘

or

=SQL(“SELECT regexp(‘[0][0-9]*[x]‘,’0987f85x’);”) -> returns 0 for false (should be all digits between leading zero and final ‘x’).

PowerPivot Powered Budgets & Targets

Last week I used PowerPivot for the first time in a classic budgeting scenario. The existing, Excel based, system was straining at the seams due to recent changes in the organisation (merger, followed by lots of changes, resulting in the replacement of long-established reporting and budgeting hierarchies).

The budget process followed the, more or less, classic lines of using adjusted previous years figures to create high-level targets for coming years, agreeing those targets with various interests and then “driving” these high-level targets down various hierarchies to inform front-line staff what would be expected of them in the years to come.

PowerPivot did a marvellous job of providing the necessary figures required to set the initial budgets and to help inform those who must agree them. This involved lots of moving-annual-totals, percentage-increases and so on. A dream compared to the purely excel methods employed in the past.

The “driving down” logic was however kept outside PowerPivot, as the hierarchies involved were unbalanced in the main and the rules were complex but already existed and worked (and probably more importantly were understood and agreed by the various “interested parties”) in Excel. All that was required was the creation of the numerous “flattened cross-join tables” to support the existing logic. The various SQLite “hierarchy helpers” I detailed in my previous Handling Flat, Parent-Child and Nested Set Hierarchies post did most of the heavy lifting. As the process involved the “cross-join” of  hierarchies at various levels I used a great deal of “nested set” SQL to achieve the required result; simple enough, but did become tedious and made the resulting logic somewhat un-approachable for those with limited SQL. What I needed was another “helper function”. The CROSS_JOIN_HIER function was born!

FUNCTION: CROSS_JOIN_HIER

First argument is the hierarchy table name (see previous post for description of this table). The 2nd is the output table name. This table must already exist and have at least the same number of columns as the number of subsequent arguments.

The remaining arguments specify the source of the tables (single column lists) to cross-join in order to populate the output table. The arguments can be of three types.

  • The name of an existing table. This table must consist of single column named “Name”. This allows for complex (or perhaps, simple lists in a different sort order) to be generated outside the function call.
  • A request for a list of nodes from some level within a hierarchy. Such requests can consist of one of the following:
    • Integer between 0 and 99. If the column name associated with this argument (e.g. arg 1 implies column 1 of output table, arg 2 column 2 etc…) is the same name as a valid hierarchy, fetch all nodes at that level (e.g. if 0, then fetch top level nodes, if 1, 2nd level etc..). Allows for up to 99 levels (i.e max value  98). Level 99 is special, it’ll return all “leaf nodes” i.e. those nodes with no children.
    • aNode,(+ or -)integerValue e.g. “Beer,+2″, in this case fetch all nodes 2 levels below Beer in the hierarchy corresponding with column name. The value before the comma must be a valid node name (assumes names are unique within hierarchies). A value after the comma of “-1″ will fetch parent node, a value of “-2″ will fetch grandparent. In both “+” and “-” if the level to navigate down/up is greater than the levels available, the last available level is returned.
    • In both cases above, the name of the column can be over-ridden by prepending the hierarchy name such that: “Product,Beer,+1″ will drive down 1 level from “Beer” in the Product hierarchy. And, “Product,99″ will return a leaf nodes of the Product hierarchy.
  • A comma separated list to create a “manual” level. For example, “Budget,Actual”, “2010,2011,2012″ or “Beer”

The resulting single column tables are then cross-joined with the resulting multi-column table (a column for each “source” argument) which is used to populate the output table.

This is a “stored procedure” like function, so should be called using “Select function(arg1,arg2 …);” syntax.

Example: click for larger version.

Download microETL from http://www.gobansaor.com/microetl and unpack to a folder, locate alberto_hier workbook (2007/2010 format) and go to the makeBudgetTable sheet.

The microETL project  is password protected; if you need access to the code just email me I’ll send you on the password.

Handling Flat, Parent-Child and Nested Set Hierarchies

Alberto Ferrari describes a method of handling un-balanced hierarchies in PowerPivot. As usual Alberto comes up with an elegant solution to something that looks near impossible. For more of the same, check out  his book, co-authored with Marco Russo, Microsoft PowerPivot for Excel 2010: Give Your Data Meaning. It’s the best book out there for those from a data modeling background trying to make sense of this new form of ROLAP or indeed for those from a spreadsheet background who wish to gain a better insight into PowerPivot-oriented data modelling.

PowerPivot has problems with un-balanced hierarchies (aka variable-depth parent-child hierarchies) as it’s essentially a relational engine. Relational databases have traditionally sucked at handling such hierarchies which are usually implemented using an Adjacency List Model.

When I moved from network CODASYL databases to relational databases all those years ago it was my biggest complaint about the ‘new order’ of the RDBMs. I guess it’s my network database heritage that makes working with MOLAP engines such as Palo seem so natural. Such databases are kind to hierarchies and to the users of hierarchies.

Most of the major commercial databases have now extended their SQL offering to handle hierarchy navigation, you can see the SQL Server’s way in Alberto’s post. But SQLite offers no help, so in order to make handling for what are in fact very common requirements for reporting and analysis datasets, I’ve extended the SQLite SQL available within microETL with 4  useful helper functions.

All 4 work with tables of the following structure:

Create table whatEver (HierarchyName text, Parent text,Child text,Name text,Level int,setID int,setLeft in,setRight int);

Where:

  • HierarchyName allows for multiple hierarchies to be stored in the one table.
  • Parent is the parent node for “this Child”. Can be text or a numeric surrogate key. Top level nodes will be set to null (or = “”).
  • Child is the node identifier.
  • Name is the textual identifier. If not populated, uses Child value.
  • Level is the depth from top of the current path. Levels = 0 are top level nodes. Not strictly required, but so useful it’s worth populating.
  • setID like Level is not required to be populated but can be useful in Nested Set queries.
  • setLeft is the left-hand side of the Nested Set range associated with this node.
  • setRight is the right-hand side of the Nested Set range.

What’s all this Nested Set stuff? As I pointed out above, SQL engines tend not to agree on a method of hierarchy navigation, if indeed they offer any method. Nested Sets is the classic data modelling method to get around this problem (if you look at hierarchy tables in cross-RDBMs packaged applications you’ll often find a variation on this method).

See this post Managing Hierarchical Data in MySQL for a good description of the technique.

There’s no need to populate these ‘set columns’ unless required they’re there just in case.

Function: make_pc_hier

This function is for handling the opposite problem to that which Alberto faced. It takes an existing flattened hierarchy and turns it into a “proper” parent-child one (hence the pc element of the name). Why would you do that? Well, MOLAP tools such as Palo not only handle unbalanced hierarchies, they usually store hierarchies using an adjacency list model. Also, if you wish to use Nested Set SQL, you’ll first need the model in parent-child format.

The function’s 1st argument is the name of the hierarchy table; if it exists it must be in the format specified above, otherwise a table of that name with the required format will be created by the function.

The 2nd argument is the hierarchy’s name.

The remaining arguments are the source columns for the hierarchy in the sequence of level 0 to level n.

The function is in fact a aggregation function (like SUM()) so can be used to group the required columns from a table where they are repeated (such as a fact table on which the hierarchy has been denormalised).

Example (click on it to see larger version):

Function: make_pc_hier_from_tree

This is similar to make_pc_hier() except the source table must be in a ‘tree format’. It is also an aggregate function but the use of group by would, in this case, not make sense. The use case for this is the easy creation of hierarchy dimensions from a tree structure for use in the likes of Palo.

Example:

Function: flatten_pc_hier

This takes four mandatory and an optional fifth argument. First two are the usual hierarchy table name and hierarchy name. The 3rd is the output table to receive the flattened table. The first column of this output table will be populated with the base element Child IDs (i.e. the deepest level in the nodes’ paths). The remaining columns must match the number of the next argument. This 4th argument specifies the maximum depth of the flattened hierarchy, so if the hierarchy is to be, for example, Sector,Brand,Product, then this would be 3.

The optional 5th argument if set to Y will not fill unused columns in an unbalanced hierarchy, otherwise the final nodes are filled out to the right with the last non-blank Name. This is a “stored procedure” like function, so should be called using “Select function(arg1,arg2 …);” syntax.

Example:

Function: nested_set_hier

This takes the two usual arguments, hierarchy table and hierarchy name. Its function is to populate the “nested set” fields of the supplied table to make a valid Nested Set. Again, it’s a “stored procedure” function so should be called using the “Select function(arg1,arg2 …);” syntax.

Example

Download Example:

Download  microETL from http://www.gobansaor.com/microetl and locate the hier97-2003.xls file.

The microETL project  is password protected; if you need access to the code just email me I’ll send you on the password.

Update:

There’s now an Alberto_hier workbook (in 2007/2010 format) with a worked example of building Alberto’s modified dimension. This workbook also includes a PowerPivot cube demonstrating the required measure (plus an equivalent Nested Set SQL Query).

For more on handling nested sets in PowerPivot see http://blog.gobansaor.com/2011/03/10/powerpivot-nested-set-measures/

Python Powered PowerPivot

In my previous post I described a method of exposing PowerPivot (or indeed any Excel model) as a simple Web Service. As it goes, it’s elegant enough, but the requirement to have two processes to handle the web service (Excel itself and a JSDB.exe sub process) adds a degree of fragility, So, for example, not “playing” with the serving Excel process (as it will likely lock the service) or having to clean up the sub process window after shut down. It can also be slow, especially noticeable for quick requests (such as a AJAX validating request), as the process flows consists of:

  • JSDB.exe stores the URL supplied parameters into a SQLite WebService.db’s table called paraTable and commits the data to disk.
  • A COM request from JSDB to Excel to set the _Status cell to “Waiting”
  • Another COM request to set the <servicename> cell to a value (any value will do, I usually send the URL supplied parameters in JSON format).
  • The  microETL.SQLScript function associated with the service is activated by the cell’s change event.
  • The SQLScript runs and drops if exists, the WebService.db’s outTable; it then creates and fills and new outTable to suit the service’s tabular output and commits the data to disk.
  • The SQLScript writes back a value (any value other than “Waiting”) to the _Status cell. The JSDB.exe has, in the meantime, been looping making COM calls to check the value of this cell and when eventually set to not “Waiting” will read the values stored in the outTable and serve the data back to the waiting requesting agent.

As you can see, a lot of inter-process COM calls (slow) and writing/data from disk (SQLite is fast, but you’re still limited by the speed of your PC’s disk). Also, although JSDB’s built-in web framework is very good and compact, it doesn’t offer the simplicity and speed of development offered by the compact web frameworks available in languages such as Ruby (e.g. Sinatra) or Python (e.g Bottle). But, as you may or may not be aware, microETL has the ability to use Python as a scripting language, which opens up the possibility of using something like Bottle to web enable Excel. Which is exactly what I’ve done!

But first; people often ask why embed a Python interpreter into microETL? Can I not just use VBA as I quite obviously know it inside out ;) Well of course for all things Excel-focused that is what I do, and for most other transformation or calculations that require some programming I would also opt for VBA. No reason not to, it works and works quickly. However, for some situations, VBA, even if it can handle them, is tough going. A good example would be the loading of a large XML feed; a common enough problem for users of PowerPivot as PowerPivot itself doesn’t offer a XML loader (except for oData feeds) so most folks fall back on Excel to do the heavy lifting.

For small XML documents Excel’s XML Map feature is ideal and again no reason not to use it (unless you’re using <”Excel 2003 Pro”, in which case some MSXML2.DOMDocument VBA code would be required). If however, the document is large (a few tens of thousands of XML “rows” for example) XML Map (and the VBA DOM object) will start to strain. Often you’ll get a result but only after a very long time, other times you or Excel will give up the ghost and quit.

You could then turn to MSXML’s implementation of SAX2 and use VBA to do “just-in-time stream” parsing of the XML document (will save on memory usage and should also result in less CPU abuse). See here for an example to get you started. But for real speed of handling you need a no nonsense non-validating streaming parser such as the venerable Expat. (I’ve posted before of using Expat to create a SQLite XML parsing Virtual Table.).

Yeah right, I hear you say, I’m going to start writing C code to parse my XML?

No, this is where Python comes in. C and Python are first cousins, C is the silent, unapproachable but very powerful one and Python is the friendly one who helps you get things done (and saves you from having to engage with the fearsome C). If there’s something useful out there written in C somebody will have wrapped it in Python and so it is with Expat (in fact it’s part of standard Python).

Download microETL from http://www.gobansaor.com/microetl for an example of a microETL powered Python script to download, parse, and insert into SQLite’s in-memory database, a large XML document. The resulting table is then pasted to an Excel range (Data!A1).

The XML document in the Expat example is in turn provided by an example of Excel Web Service using Python’s Bottle as the server.

To try out both examples:

  1. Unzip download into a folder; 1st run the PyWebService97-2003.xls workbook (not the WebService97-2003.xls workbook, it’s the JSDB example server). Locate the WebServiceControl sheet and click the Start Server button; the Excel process will go into “Not Responding” mode while listening on port 8080 on 127.0.0.1. You may be asked by your firewall to unblock Excel, you can safely do so as the port will only be accessible by processes running on the same machine.
  2. Open an other Excel process and load the Expat.xls workbook; locate ExpatControl sheet and click the Process XML  button. The button macro will issue a http://127.0.0.1:8080/xml call. Having downloaded and parsed the XML the resulting table is pasted to the Data sheet.
  3. If you wish, you could try using XML Map Import (find it under Developer Tab in 2007/2010) using the same end point to compare and contrast. (Warning: depending on the PC, this could take anything from a while to forever). For another comparison try a Web Query using the /table end-point (which returns the Currency table in  HTML table fromat) or loading a text file from the /CSV/Currency end-point which serves up a comma-separated file.
  4. To close down the PyWebService server, issue a http://127.0.0.1:8080/exit via your browser  (or click the Close Server button in the Expat.xls workbook).

Note:

You must have Python 2.7 (note the 2.7) installed on you PC. You can safely install different versions of Python side-by-side. It is possible to package the required python dlls and modules along with the rest of the code, but I haven’t gotten around to automating this yet and I keep forgetting how to do it, so I haven’t done it, so I’m lazy, so sue me ;)

If you’re running Excel 64bit this will not work as the 64bit bindings are not included but it will work if you’re running Excel 32bit  on a 64bit OS. I’m in the process of merging the 64bit and 32 bit versions of microETL and will republish the link when it’s 64bit enabled. But even then, you must have the 64bit version of Python 2.7 installed otherwise it’ll get very messy.

Python standard error and log files can be found in the /Python folder, each spreadsheet gets its own set. If the server on port 8080 doesn’t start, check the error log, if you see lots of errors to do with “sockets” then most likely some other service on your machine is using port 8080. In this case, go to the Python sheet in the PyWebService97-2003.xls workbook, find the mention of 8080 and change to 8081 or whatever.

Like the previous example, this, being Excel, is single threaded. Bottle may offer you the option of using the likes of Apache as your server, but you must stick with the standard Python Built-in HTTP development server. Each call to the server will/must block. Yes I know Excel 2010 now has multi-threaded  capabilities but not within VBA macros, which is what ultimately drives all this.

UPDATE:

For another variation on using Python (this time IronPython) within Excel/VBA see http://blog.gobansaor.com/2011/07/18/vba-multithreading-net-integration-via-hammer/

UPDATE:

For another variation on a Python powered Excel Web Service (this time IronPython) see http://blog.gobansaor.com/2011/09/21/exposing-an-excel-powerpivot-model-as-a-web-service/

 

 

 

Tag Index – SQLite Star Query Part IV

In a previous post I put forward an approach to using SQLite as the basis for a Tag Cube i.e an alternative pivoting mechanism using a ‘tag’ index to  enable multi-column star-queries against a central fact table. Although the appearance on the scene of PowerPivot has negated the raison d’être for “Tag Cubes”, the concept of a “Tag Index” still has relevance.

A tag index differs from a normal b-tree index in that it is optimised for multi-columnar access (OLAP type queries) rather than single-column primary/foreign key access (OLTP type queries). In essence it’s somewhat similar to columnar indexes over relational databases such as will be available in the next version of SQLServer (see here, for this and other exciting news re the future of PowerPivot in the enterprise).

SQLite doesn’t offer  columnar indexes but it does have something similar; FTS3 – Full Text Search virtual tables; and this is what I use to implement TAG indexes.

The basic idea is to create an FTS3 virtual table that indexes the primary search columns of a fact table (or very large dimensional table). This virtual table is then used as a first level filter ‘index’ for the fact table. To make this simpler to use (for example, to make it possible to be used as a fact table in a SQLPower Wabit Mondrian schema) I wrap the fact table and FTS3 ‘index’ table using a virtual table of my own. An example will best explain it:

If you have a fact table such as this:

CREATE TABLE orderFacts (CustomerID, ProductID, CustomerType, InvoiceDate, Qty,Value);

To create a Tag index:

CREATE VIRTUAL TABLE orderFactsFTS using FTS3();

Populate this ‘index’ with the rowid and a space delimited concatenation of the relevant ‘key’ columns (docid & content are the default columns of an FTS3 virtual table):

INSERT into orderFactsFTS (docid,content) select rowid,CustomerID||’ ‘||ProductID||’  ’||InvoiceDate from orderFacts;

Finally, create a ‘wrapping’ virtual table using my TAGindex virtual table function:

CREATE VIRTUAL TABLE  orderFactsTAGs using TAGindex (orderFacts, orderFactsFTS,3,CustomerID,ProductID,InvoiceDate, CustomerType,Qty,Value);

The TAGIndex parameters are:

  • The fact table name.
  • The FTS3 ‘index’
  • The number of columns that the TAGIndex virtual table will use as part of the tag index. In this case 3, as we’ve didn’t index customerType (too low a cardinality to bother, perhaps) nor the two ‘measure’ columns of Qty and Value.
  • The column list from the fact table to expose via the TAGIndex virtual table. Columns that are part of the index must appear at the beginning of the list and must match in number the value specified by the 3rd argument above.

We can then access the resulting table like so:

SELECT InvoiceDate, sum(Value)
from orderFactsTAGs
where CustomerId=2098 and ProductID=’FX-010-23′ and Qty >10
group by InvoiceDate;

The same result could be achieved without the TAGindex virtual table using :

SELECT InvoiceDate, sum(Value)
from orderFacts
where CustomerId=2098 and ProductID=’FX-010-23′ and Qty >10
and rowid in
(SELECT docid from orderFactsFTS
where orderFactsFTS match “2098 FX-010-23″)
group by InvoiceDate;

There are a number of potential problems with this approach:

  • What happens if there’s a product and a customer with the same 2098 key? The FTS3 filter will be less effective but the result will still be correct due to the ‘filter twice’ provided by the standard SQL WHERE predicates. Obviously it would be better in that case to add new columns to the table to ensure uniqueness of keys across key domains but it’s not essential.
  • If a key value begins with a “-” the FTS3 engine will treat that as “not =”  resulting in an incorrect result. Unlikely, but if likely, same solution as above, new column to doctor the key to suit.
  • High cardinality amongst key values, e.g. Invoice number, causing a very large, potentially slow, index. This is the same problem that columnar-store indexes face. Best not index such columns if index bloat occurs but it most cases (e.g. such columns being a minority) this shouldn’t be a problem.

So what are the potential use cases for TAGindex tables?

Here’s the code of the TAGindex virtual table function (implemented as a SQLite extension):

TAGindex.c

And, here’s a already built dll (in SQLite3.exe issue “.load TAGindex.dll” command before using):

TAGindex.dll

For  another example of a SQLite virtual table (this time combining Expat XML streaming parser with SQLite) see http://blog.gobansaor.com/2010/06/16/sqlite-xml-streaming-virtual-table-via-expat/