Category Archives: JavaScript

Data Wrangler

A few weeks ago I came across (thanks to @lismissData Wrangler; a very promising data cleansing tool from the Stanford Visualization Group. Not only is Data Wrangler a web-service (which the group intend to open source) but it also allows transformations to be “recorded” in either Python or JavaScript (see here). It was this Python scripting feature that really caught my attention; would be very useful to be able to hack away at a dataset using the service, then transfer the script to microETL’s PyScript to adjust and integrate with Excel and SQL.

The demo video and test datasets give a good overview of the tool but the proof of the pudding is in trying out some real world dirty data; I chose a fine example of the art of Freedom Of Information datasets, issued by a Republic of Ireland government department. As an example of how not to do something (unless your intention is to make the recipient regret asking for the FOI in the first place) this is excellent. (I suppose we should be grateful it’s in Excel not Word or PDF or even PowerPoint). You can download it here http://bit.ly/Ireland_FOI_example (the data as released is in the FOI sheet).

As I said, Data Wrangler is promising, but needs some more work (to be fair, the group warns it’s a work in progress). The tool choked on the FOI dataset, too many columns I think, so not ready for the real world yet but I’ll be keeping an eye on its progress. Don’t let my experience put you off, it looks more than capable of handling smaller but still quite messy datasets.

If you’ve downloaded my example workbook, you’ll see how I managed to cleanse the data using microETL’s Python & SQL scripting functionality (the PyScript is in the Python sheet, with the SQLScript in the Control sheet). I could have cleansed the data using pure Excel and some VBA  and perhaps I would have if this was a format requiring parsing on a regular basis; I could then save the transformation as a single file macro-enabled workbook, ideal for sharing, no need for add-ins etc. But it was a once-off, and even if it wasn’t, it’s quite likely the format supplied in answer to a subsequent FOI request would be different. This is the sort of work that microETL’s Python & SQL scripting is designed for; quick and dirty data wrangling, but with the ability ro persist, and modify the resulting transformations if so required.

If you wish to try out this example, there’s a new version of microETL (Alpha1.08) available for download. You’ll notice a new folder structure (the usual sub-folders are now under a single sub-folder call microETL) to make installation of the add-in somewhat neater; and there’s also a setup.xls that’ll do all the hard work of installing (and un-installing) the microETL add-in. Note: you still need to manually install Python 2.7 to enable the PyScript’ing functionality.

If you need help with your Excel, ETL or  data cleansing tasks, I can help.

JavaScript as an Excel scripting language via JSDB

A few years back I posted about JavaScript as an Excel scripting language via ExcelDNA. That involved using JavaScript (in the guise of JScript.NET) as an ExcelDNA scripting language. It was purely an academic exercise to prove it could be done, I continued to use C# (or increasingly VB.NET) to build .NET user defined functions. This time however, I’ve managed to embed JavaScript (in the guise of Mozilla Foundation’s SpiderMonkey) directly via a native C interface not to prove I could do it (even though there’s a definite satisfaction in simply doing it) but to use it.

Why add another scripting language to xLite, hasn’t it already got Python?

True, Python is and remains a very powerful add-on to xLite. It’s a mature and long-established language, popular amongst IT professionals and “citizen programmers” alike. But, it’s a bit of a monster and can be awkward to package, particularly on Windows. By using Py2Exe and after a lot of digging on the issue of manifest files I have managed to package and isolate xLite’s Pythonic bits so that it can be used on a PC without first installing the required Python version (I’ve only tested against V2.6, Python’s lack of a side-by-side Windows installation capability is a major pain-in-the-butt bad news: tested against Python 2.5 & it doesn’t work; good news: side-by-side is possible; simply change the system path to reflect which-ever version you wish to run at the command line; xLite will (must) continue to use V2.6). This “version-hell” mitigates against using Python as a core-element of xLite, fine for those of us who are comfortable with and require the full power of Python, but not as the tool’s primary scripting environment.

No, what I need is:

  • a light (single EXE or DLL preferably) and an approachable popular language,
  • with native SQLite support,
  • runable as standalone executable (on both Windows & Linux),
  • embeddable (is that a word?) in Excel via VBA-friendly DLL.

Add to that essential list some nice-to-haves such as:

  • native COM-interface support (for the likes of ADO etc.),
  • native networking support, for HTTP, raw TCP sockets etc.,
  • native (and easy to use) XML and JSON parsers and emitters,
  • ability to spawn detached/attached command line processes and the ability to stream data to and from such processes; allowing me to easily orchestrate & provide a “grid” of processes (scripted, command line executables, Excel instances) both local and remote (with remote being either traditional servers, http servers or Hadoop Streaming grids).

While looking at the various implementations of Javascipt as a server/shell language, I came across JSDB – JavaScript for Databases. It’s a C++ wrapper around Mozilla’s SpiderMonkey, with lots of useful data related utility classes added. To make JSDB a perfect fit it simply required:

  • a few minor changes to the SQlite class (allow the loading of Virtual Table extensions, add the ability to pass in the address of already open SQLite memory structures);
  • a linker change to use the DLL version of SQLite;
  • plus a VBA-friendly DLL wrapper ( & VBA declares, to call the DLL) to replace the JSDB shell when embedding in Excel.

From my SQLite as the MP3 of Data post:  ”Just as “fractional horsepower” electrical motors revolutionised manufacturing and eventually all our lives (car starter-motors, fridge motors, washing machines etc.), “fractional horsepower” databases can do the same for data. Distributing data to where it is needed.” I can now add a distributed “fractional horsepower” processing engine for that distributed data. This transforms xLite from a micro-ETL platform into one capable of handling (or at least orchestrating) practically any ETL (Extract, Transform & Load),DI (data integration) or “Time Asset” (see this post) process.

UPDATE:

Here’s a VBA source with examples of using the embedded JSDB engine:

http://www2.gobansaor.com/share/example_jsdb_vb_embed.bas

… and the source code for the DLL wrapper:

http://www2.gobansaor.com/share/wrap_sqlite.cpp

… plus the VBA Declares and helper Windows APIs:

http://www2.gobansaor.com/share/jsdb.bas

UPDATE:

See JSDB being used to front-end Excel as a web service http://blog.gobansaor.com/2011/02/06/excel-as-a-diy-web-service/

Time Assets


This Stephen Hawkins article on “How to build a time machine” (all that’s needed is a wormhole, the Large Hadron Collider or a rocket that goes really,really fast) is well worth a read.

The concept of time travel was, for most of my life, simply science fiction, but it’s now looking more & more like science fact. Most science-fiction plots involving time-travel tend to involve travelling to the past; this, however, is not part of the emerging time travelling theory, moving forward in time seems the only option.

We may not be able to go back in time but we humans have become adept at “capturing time” and packaging it for reuse later on; our early ancestors spent valuable time crafting  tools and honing skills that they figured would repay any time spent many times over; they were in fact investing in time assets.

Software is perhaps human kind’s greatest time asset generator, similar in concept to the tools and machines we’ve always built, but nearly totally frictionless and with the potential of immense returns on the asset once the initial upfront cost has been met. Yet many are leaving our formal education systems with no idea of the power of software to harness time, to save it, shape it and reuse it again & again. They have not been taught to program.

I’m not suggesting here that every student be forced to study computer science, no, just for them to be introduced to the practical everyday uses of programming (with some formal theory as a foundation) – Applied Computing, if you like. In fact, if hardcore geeks consider the course to be rubbish and refuse to take it, then you know you’re hitting the right note.

At a minimum, everybody should be taught the basics and the possibilities of spreadsheets.  Although using Excel for this purpose would be more “saleable” once students hit the streets and join the work force, I would think that Google Docs Spreadsheets would be a better option as a teaching tool, because:

  • Firstly, it would be cheaper, no licences, minimum hardware requirement (anything with a browser) and the collaboration features of Goggle Docs in general are ideally suited for use in education.
  • Secondly, such training should not be primarily vocational, it should be about learning the possibilities of end-user programming.
  • Excel’s macro language is VBA, a noble language with a long distinguished history, but a language that its owners have abandoned. Google Doc’s scripting language is JavaScript, like VB a language that has often been much maligned, but unlike VB, it’s a language with a future, it’s the magic behind the browser. So students would not only learn the fundamentals of spreadsheets but would through the courses’ scripting modules learn a language that lies at the heart of their everyday computing experience.
  • Google Docs can also be manipulated via a web-based API and can be embedded in web pages. So again students would learn the fundamentals of REST and basic HTML markup, the underlying architecture of the WWW .

Studying such a course, would not only teach a useful life skill (the manipulation of numbers and lists and the automation of such tasks to create time assets) but would also provide an understanding of the building blocks of modern IT.

We need more, and better prepared (dare I say, trained) citizen programmers; there’ll never be enough professional programmers to go around and even if there were, the cost will continue to be prohibitive in many situations (both the financial cost and the time cost of keeping professional programmers aligned with (or even aware of) the business needs of multitudes of organisations).

Just like the right to bear arms was regarded as a necessity in the frontier society of 18th century America, the right (and the basic skills) to program is a necessity on our modern IT frontier. Not everybody will use (or indeed even be capable of using, or allowed to use) that right, but for millions of others, having the power to build time assets for themselves or their businesses will be one of their most prized skills.

Zimki – the spirt lives on …

Although Zimki is to shut down on Christmas Eve, the ideas behind the service live on. Two new offerings, Horuku and AppJet, offer variations on the idea of hosted application development/deployment.

AppJet, funded by Paul Graham‘s Y-Combinator, is very similar to Zimki, being a server-side JavaScript platform. No details yet as to what sort of paid options will be offered (all accounts are free at the moment). Unlike Zimki there’s no plans to create an open-source version. I like the easy “build a Facebook app” feature; and I guess this is the sort of light-weight applications that they hope to attract.

Although Heroku uses Ruby-on-Rails technology, rather than JavaScript, it is closer to the original Zimki idea; but rather than take the hard (and ultimately unsuccessful in Zimki’s case) road of building an open-source platform from scratch, Heroku takes an already popular open-source project and offers it wrapped in a full on-line development and deployment environment. Again, being in beta, there’s no indication as to what pricing model it will operate under, but I would think that it will attract more “serious” projects than AppJet since anything developed under Heroku is pure Rails which means it can be migrated to any other Rails hosting environment; so no lock-in. The online editor is excellent and whatever about its merits as a hosting service it’s by far the easiest way to learn and explore Ruby and Rails, even easier than this…

If Facebook apps are your goal but you wish to use Ruby rather than AppJet’s JavaScript then not to panic, as being Ruby some bright young spark (no, not me I’m afraid) will already have done a lot of the hard graft for you…

JavaScript as an Excel scripting language via ExcelDNA

Update June 2010: Also, see JavaScript as an Excel scripting language via JSDB

Developing .NET DLLs that are to be used within an Excel VBA add-in is relatively easy to do. But the overhead of the COM managed interfaces can be a serious performance bottleneck if the .NET managed functions are called from within a tight loop. The alternative is to create a C++ XLL and then to call out to .NET, and that’s exactly what the ExcelDNA add-in does.

ExcelDNA allows you to expose any .NET DLL’s public static methods as Excel UDFs, but not only that, by inserting code into the add-in’s associated .dna file you can ‘script’ a UDF without compiling it into a DLL (via the magic of CodeDomProviders). Both C# and VB.net are supported as ‘scripting’ languages but in fact any .NET language that exposes a CodeDomProvider class can be used. I thought to myself, how cool would it be to use JavaScript as an Excel scripting language!

To use JavaScript (in the guise of JScript.NET) with ExcelDNA you’ll need the AssemblyQualifiedName of the Microsoft.JScript.JScriptCodeProvider type. Set the Project Language=”Microsoft.JScript.JScriptCodeProvider, Microsoft.JScript, Version=1.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a” within the .dna file and that’s it, code away in JavaScript. The code snippet below also includes a C# UDF to check if a Type is a sub-class of CodeDomProvider.


<DnaLibrary Name="My Test AddIn">
<Project Language="Microsoft.JScript.JScriptCodeProvider, Microsoft.JScript, Version=1.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a">
<![CDATA[

import ExcelDna.Integration;
public class myFunctions
{
ExcelFunctionAttribute(Description="Checks if a 1 to 3 digit number", Category="JScript REGEX example", IsMacroType=true)
public static function  ValidateStuff(str)
{
var regex=/^\\d{1,3}$/;
if (regex.test(str))
{
return "VERY GOOD: it's a 1 to 3 digit number";
}
else
{
return "NOT GOOD: Must be a number with between 1 and 3 digits";
}
}
}

]]>
</Project>
<Project Language="CS">
<![CDATA[
using System;
using System.CodeDom.Compiler;
using System.Collections.Generic;
using System.Diagnostics;
using System.Reflection;
using System.Reflection.Emit;
using System.Text;
using System.Xml.Serialization;

using ExcelDna.Integration;

public class Testing123
{
[ExcelFunction(Description="Check if type inherits CodeDomProvider", Category="Useful functions")]
public static string Test11(string str)
{
string str2;
str2="Testing if ...";
try
{
Type t = Type.GetType(str,true,true);

if (t.IsSubclassOf(typeof(CodeDomProvider)))
{
ConstructorInfo ci = t.GetConstructor(new Type[] {} );
CodeDomProvider p = (CodeDomProvider)ci.Invoke(new object[] { });
str2 = p.GetType().AssemblyQualifiedName;
}

}
catch (Exception e)
{
str2=e.ToString();
}
return str2;
}
[ExcelFunction(Description="Joins a string to a number", Category="Useful functions")]
public static string DoJoinThem(string str, double val)
{
return str + val;
}

}

]]>
</Project>
</DnaLibrary>