While the potential of column-oriented DBMSs within BI projects is obvious given the popularity of MOLAP (a form of column-oriented data store), the potential for the other new kid on the block, the document-oriented database, is less so. One such DBMS, CouchDB, is the latest wunderkind to bubble to the surface, helped by its RESTful interface, its abandonment of XML in favour of JSON, the use of JavaScript (replacing a bespoke language) as its “view” language, and its use of Erlang and MapReduce algorithms. (A CouchDB view is, as far as I can tell, like a combination of a function-based index and a materialized view.)
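To make the "index plus materialized view" comparison concrete, here is a minimal in-memory sketch in Python. It is not CouchDB's API (real views are written in JavaScript and stored by the server); the documents, field names and the functions `map_doc`, `build_view` and `reduce_view` are all made up for illustration. It only shows the general map/reduce idea: a map step emits key/value rows that are kept sorted by key (the index-like part), and a reduce step aggregates them per key (the materialized-view-like part).

```python
from collections import defaultdict

# Hypothetical source documents, as they might arrive from spreadsheets or SaaS APIs.
DOCS = [
    {"_id": "a1", "source": "crm", "customer": "ACME", "amount": 120.0},
    {"_id": "a2", "source": "erp", "customer": "ACME", "amount": 80.0},
    {"_id": "a3", "source": "crm", "customer": "Globex", "amount": 50.0},
    {"_id": "a4", "source": "survey", "notes": "no amount field here"},
]


def map_doc(doc):
    """Map step: emit (key, value) pairs; documents lacking the fields emit nothing."""
    if "customer" in doc and "amount" in doc:
        yield doc["customer"], doc["amount"]


def build_view(docs, map_fn):
    """Build rows of (key, value, doc_id) sorted by key: the index-like part."""
    rows = [(key, value, doc["_id"]) for doc in docs for key, value in map_fn(doc)]
    return sorted(rows, key=lambda row: (row[0], row[2]))


def reduce_view(rows):
    """Reduce step: sum the values per key: the precomputed-aggregate part."""
    totals = defaultdict(float)
    for key, value, _doc_id in rows:
        totals[key] += value
    return dict(totals)


VIEW = build_view(DOCS, map_doc)
TOTALS = reduce_view(VIEW)
```

Note that the schema-less "survey" document is simply skipped by the map step rather than causing an error, which is much of the appeal when the sources are messy.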
Where I see CouchDB’s place in a BI project is at the messy end (or should I say start) of the ETL pipe: the operational data store (ODS). Not an ODS in the high-church Inmon sense, i.e. not a normalised logical-data-model-made-real, but more an easily explorable source-data archive/audit facility. If all your data comes from one or two operational systems (e.g. ERP and CRM) the need for an ODS may not arise; simply use the operational systems themselves (or direct copies in a separate database), using conformed dimensions to provide the necessary glue. If, however, a large amount of your data comes not from traditional OLTP systems but from ‘document sources’, then something like CouchDB might come in useful.
Typical document sources might be: Excel spreadsheets, XML/JSON/CSV responses from SaaS APIs, scraped web pages, PDF/MS Word forms, MS Access or SQLite databases; even audio/video content (e.g. market research interviews with customers which are then “codified” and stored as customer dimension attributes).
You could of course use a traditional RDBMS to hold this information, especially if the database supports full-text search or has native support for semi-structured data; however, due to the huge amount of storage space that unstructured data can soak up, CouchDB’s open-source, Google-inspired MapReduce architecture, with its ability to scale out cheaply, might be more suitable. Given its alpha-level status, CouchDB is currently only suitable for testing or evaluation, but if you have a pressing need for such a scalable document store you could use Amazon’s S3. Although S3 is essentially just a key/value pair store, the value can be any blob of data you wish; it is in effect a massively scalable and keenly priced document-oriented data store.
Because S3 stores plain key/value pairs, the only indexing option is the key; although metadata tags can be associated with each pair, this data is not indexed for fast retrieval. The use of a local database to provide metadata-based filters/indices is the obvious solution; another, less obvious, approach would be to use an online tagging service such as del.icio.us. The use of del.icio.us would of course raise privacy/security issues, but these could be mitigated by using the privacy option in del.icio.us and by using behind-the-firewall URLs, which could then be redirected to the correctly signed S3 URL via a LAN proxy.
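The "local database as metadata index" option might look something like the Python sketch below. Everything in it is an assumption for illustration: `BlobStore` is just an in-memory stand-in for a key/value store such as S3 (no real S3 calls are made), and `TagIndex`, the table layout and the example keys and tags are all hypothetical. The point is only that the blobs stay in the cheap key/value store while a small SQLite table answers "which keys carry these tags?".

```python
import sqlite3


class BlobStore:
    """In-memory stand-in for a key/value blob store such as S3 (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class TagIndex:
    """Local SQLite index mapping metadata tags to blob keys."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        # The (tag, key) primary key doubles as the lookup index on tag.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS tags ("
            "tag TEXT NOT NULL, key TEXT NOT NULL, PRIMARY KEY (tag, key))"
        )

    def tag(self, key, *tags):
        self._db.executemany(
            "INSERT OR IGNORE INTO tags (tag, key) VALUES (?, ?)",
            [(t, key) for t in tags],
        )
        self._db.commit()

    def keys_for(self, *tags):
        """Return the sorted keys that carry every one of the given tags."""
        wanted = sorted(set(tags))
        if not wanted:
            return []
        placeholders = ",".join("?" * len(wanted))
        sql = (
            f"SELECT key FROM tags WHERE tag IN ({placeholders}) "
            "GROUP BY key HAVING COUNT(DISTINCT tag) = ? ORDER BY key"
        )
        return [row[0] for row in self._db.execute(sql, (*wanted, len(wanted)))]


store = BlobStore()
index = TagIndex()

store.put("2008/q1/sales.xls", b"spreadsheet bytes")
index.tag("2008/q1/sales.xls", "excel", "sales")
store.put("2008/q1/survey.pdf", b"pdf bytes")
index.tag("2008/q1/survey.pdf", "pdf", "customer")
store.put("2008/q2/sales.csv", b"csv bytes")
index.tag("2008/q2/sales.csv", "csv", "sales")

sales_keys = index.keys_for("sales")
excel_sales = [store.get(k) for k in index.keys_for("sales", "excel")]
```

The index can always be rebuilt by re-reading the metadata attached to each pair, so losing the local database would be an inconvenience rather than a loss of data.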