Posts in category 'hacking'
Haskell + Data Analysis -> Good Times
So, as part of my ongoing obsession with toying with unusual programming languages, Haskell has periodically popped on and off my radar. The problem is, it’s rare that I find a problem where I feel like sitting down and figuring out how to solve it in Haskell, particularly since Haskell’s strengths and weaknesses don’t often mesh with the kinds of ad-hoc programming I tend to do (for example, Haskell sucks for text parsing, primarily due to performance constraints, and I find much of the random coding I do involves high-volume text processing).
But all that has changed due to an interesting problem we’ve been fighting with at work. You see, on one of our production servers, we’re having performance problems. And so the first thing we did was find a way to collect telemetry. Of course, the first cut dumped out raw CSV files, which are a pain in the butt to manipulate in interesting ways, and as a result, I found myself writing a lot of Perl to deal with the data we received. Not fun.
Finally, after days of this, I decided to write a new tool that collects telemetry as we were doing before, but rather than using CSV, stores the data in an SQLite database, thus making the information a hell of a lot easier to manipulate. “But now you need to analyze that database!”, you say. Ahh yes, you’re quite right, and normally I might turn to Perl to do just that. However, it turns out, Haskell is more or less perfect for that very job.
See, Haskell just so happens to have HDBC, which is really the Haskell equivalent to Perl’s DBI. And there just happens to be an SQLite HDBC driver available, which provides a nice functional interface to the underlying database. With this combination, querying the database and manipulating its contents becomes exceedingly easy. And in particular, because of Haskell’s laziness, we can do much of our processing in a streaming fashion, rather than bulk loading large amounts of data for processing.
For example, suppose we have a table as follows:
ID Date Value
Where you may have multiple rows for a given date. Now say you want to take that table, and group it so that all the rows for the same date are collected together. Well, in Perl, you’d probably set up a loop, track the previous and next rows, build a list in memory, and output the results as you go, and that would work out just fine. But it’s tedious. Haskell, on the other hand, makes this all remarkably easy.
First, let’s back up. What we really want to do is take a list of items, and then group them together based on some kind of splitting function. It may be a list of integers, a list of strings, or a list of database rows. But in the end, it’s really all the same thing. Well, you could define a function like that as follows:
splitWhen :: (a -> a -> Bool) -> [a] -> ([a], [a])
splitWhen func [] = ([], [])
splitWhen func (head:[]) = ([head], [])
splitWhen func (first:second:rest)
    | func first second = (first:result, remainder)
    | otherwise = ([first], second:rest)
    where (result, remainder) = splitWhen func (second:rest)

splitList :: (a -> a -> Bool) -> [a] -> [[a]]
splitList func [] = []
splitList func lst = group:(splitList func remainder)
    where (group, remainder) = splitWhen func lst
So, first we define splitWhen, which is a function that takes:
- A test function.
- A list.
The test function is applied to each pair of items in the list, starting at the beginning, and the list is split at the point where the function returns false. splitList then uses splitWhen to break a whole list into groups. So, for example:
splitList (\x y -> x < y) [ 1, 2, 1, 3 ]
[ [1, 2], [1, 3] ]
But this code has another interesting property that may not be obvious to someone unused to Haskell: these functions are lazy. That means they only do work as elements are requested from the list. For example, given this code:
take 5 $ splitList (\x y -> x < y) [ sin x | x <- [ 1 .. ] ]
The second part of this expression generates an infinite list of the sin() values of the whole numbers starting from 1, and splitList operates on that list. If this weren’t Haskell, this code would run forever, but because Haskell evaluates expressions lazily, it returns only the first 5 groups, as follows:
[ [0.8414709848078965, 0.9092974268256817], [0.1411200080598672], [-0.7568024953079282], [-0.9589242746631385, -0.27941549819892586, 0.6569865987187891, 0.9893582466233818], [0.4121184852417566] ]
Nice! As an aside, this is one of the more interesting aspects of Haskell: it encourages you to write reusable functions like this.
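To make the behaviour concrete, here is a small self-contained sketch that restates the two functions so the file compiles on its own, then checks two things: that the test function is applied to adjacent elements, and that only the requested groups are built from an infinite input. The comparison with Data.List.groupBy is my own addition rather than something from the post; groupBy compares each element against the first element of the current group, so the two can disagree.

```haskell
import Data.List (groupBy)

-- Restated from the post so this file stands alone.
splitWhen :: (a -> a -> Bool) -> [a] -> ([a], [a])
splitWhen _ [] = ([], [])
splitWhen _ [x] = ([x], [])
splitWhen func (first:second:rest)
  | func first second = (first : result, remainder)
  | otherwise         = ([first], second : rest)
  where (result, remainder) = splitWhen func (second : rest)

splitList :: (a -> a -> Bool) -> [a] -> [[a]]
splitList _ [] = []
splitList func lst = group : splitList func remainder
  where (group, remainder) = splitWhen func lst

main :: IO ()
main = do
  -- Adjacent comparison: 3 < 2 fails, so a new group starts at 2.
  print (splitList (<) [1, 3, 2 :: Int])   -- [[1,3],[2]]
  -- groupBy compares against the group's first element (1 < 2 holds).
  print (groupBy (<) [1, 3, 2 :: Int])     -- [[1,3,2]]
  -- Laziness: only the first three groups of an infinite list are built.
  print (take 3 (splitList (<) (cycle [1, 2, 3 :: Int])))
```

When the test is plain equality, as in the date grouping discussed next, the two approaches give the same groups on sorted input.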
So, let’s apply this to a database query. Well, it turns out, that’s dead simple. You’d just do something like:
conn <- connectSqlite3 "database.db"
stmt <- prepare conn "SELECT Date, Value FROM theTable ORDER BY Date"
execute stmt []
groups <- (splitList (\(adate:_) (bdate:_) -> adate == bdate)) `liftM` (fetchAllRows stmt)
print $ take 5 groups
Yeah, okay, this is a little dense. The first few lines prepare and run our query. No big deal there. It’s the line that binds groups where the magic really happens. First, let’s start on the far right. Here we see the function fetchAllRows being called. That function returns the rows generated from the query, but it does so lazily. So rows are only retrieved from the database as they’re needed. We then apply the splitList function to the results (ignore the liftM, that has to do with Monads, and you probably don’t want to know…). And then we take 5 groups from the result and print them. Voila! In a surprisingly small amount of code, a huge chunk of which is nicely generic and reusable, we can do what, in Perl, would likely take dozens of lines of code. Pretty nice!
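The grouping step can also be tried without a database. The sketch below is only an illustration: sampleRows and sameDate are hypothetical names, the dates and values are made up, and plain Strings stand in for the SqlValue cells a real query would return. It assumes the rows are already sorted by date, as the ORDER BY clause guarantees in the query.

```haskell
-- Restated from the post so this file stands alone.
splitWhen :: (a -> a -> Bool) -> [a] -> ([a], [a])
splitWhen _ [] = ([], [])
splitWhen _ [x] = ([x], [])
splitWhen func (first:second:rest)
  | func first second = (first : result, remainder)
  | otherwise         = ([first], second : rest)
  where (result, remainder) = splitWhen func (second : rest)

splitList :: (a -> a -> Bool) -> [a] -> [[a]]
splitList _ [] = []
splitList func lst = group : splitList func remainder
  where (group, remainder) = splitWhen func lst

-- Hypothetical sample data shaped like [Date, Value] rows, sorted by date.
sampleRows :: [[String]]
sampleRows =
  [ ["2010-01-01", "10"]
  , ["2010-01-01", "12"]
  , ["2010-01-02", "7"]
  , ["2010-01-03", "3"]
  , ["2010-01-03", "4"]
  ]

-- Two rows belong together when their first column (the date) matches.
-- Rows with no columns never match, which avoids a pattern-match failure.
sameDate :: [String] -> [String] -> Bool
sameDate (a:_) (b:_) = a == b
sameDate _ _ = False

main :: IO ()
main = do
  let groups = splitList sameDate sampleRows
  -- One line per date: the date and how many rows it has.
  mapM_ print [ (d, length g) | g@((d:_):_) <- groups ]
```

Writing sameDate as a total function is a small departure from the lambda in the post, which would fail on an empty row.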
Dangers of Abstraction
One of the more impressive things about Pharo/Squeak is the level of depth in the core libraries, and how those libraries build upon each other to create larger, complex structures. One need only look at the Collection hierarchy for an example of this, where myriad collection types are supported in a deep hierarchy that allows for powerful language constructs like:
aCollection select: aPredicateBlock thenCollect: aMappingBlock
to work across essentially every type of collection available. Unfortunately, building these large software constructs can have negative consequences when one attempts to analyze performance or complexity, and in this post I’ll outline one particular case that bit me a few weeks back.
My problems all started while I was still experimenting with Magma. Magma, as you may or may not recall (depending on if you’ve read anything else I’ve posted… which you probably haven’t) is a pure-Smalltalk object-oriented database whose end goal is to provide the Smalltalk world with a free, powerful, transparent object store.
Now, among Magma’s features is a powerful set of collections, which implement the aforementioned collection protocols, while also providing a much-needed feature: querying. In order to make use of this facility, any column that you wish to generate queries over must have an index defined over it, which is really a glorified hash table on the column. Whenever you create one of these indexes on a collection, the index itself is squirreled away in a file on disk alongside the database. And that’s where the problems come in.
In my application, a Go game repository, I had a fairly large number of collections sitting around holding references to Game objects (one per individual user, plus one per Go player), and I needed to be able to query each of these collections across a number of features (not the least of which, the tags applied to each game). That meant potentially many thousands of indexes in the system, at least. And that meant thousands of files on disk for each of those indexes.
Well, when I first hit the site, I found something rather peculiar: initially accessing an individual collection took a very long time. On the order of a few seconds, at least. Naturally this dismayed me, and so I started profiling the code, in order to pin down the performance issues. And I was, frankly, a little shocked at the outcome.
It turns out that, deep in the bowels of the Magma index code, Magma makes use of the FileDirectory class to find the index file name for the index itself. Makes sense so far, right? As part of that, it uses some features of the FileDirectory class to identify files with a specific naming convention. And that code reads the entire directory, in order to identify the desired files.
On the face of it, this should be fine.
However, internally, that code does a bunch of work to translate those file names from Unicode to internal Squeak character/strings. And it turns out that little bit of code isn’t exactly snappy. Multiply that by thousands of files, and voila, you get horrible performance.
So believe it or not, the index performance issues had nothing to do with Magma. It was all due to inefficiencies deep in the bowels of Squeak. And hence the subject of this article. Deep abstraction and code reuse is a very good thing, don’t get me wrong. But any time you build up what I think of as a “cathedral” of code, it’s possible for rotting foundations to bite you later.
Where Pharo Falls Short
Well, as you might imagine from the title, I figured I’d take a bit of a break from my continual gushing about Seaside to examine some of the areas where I think Pharo/Squeak unfortunately falls behind as a development environment. Of course, keep in mind, I wouldn’t be using these tools if I didn’t think they were an overall win, despite their shortcomings. But perspective is always a good thing, and it’s important to see the bad as well as the good.
So, with that said, where to begin… well, as I’ve mentioned previously, in general, Smalltalk implementations make use of an image paradigm for storing and managing code. In this world, from the outside, the image is a monolithic blob of binary data, but contained within is essentially a snapshot of the entire Smalltalk environment. Open that image with a VM, and you’re presented with a completely self-contained world, including a windowing system, editors, file managers, and so forth. And in the case of Pharo or Squeak, that entire world is open to unlimited poking and prodding, as the deepest bowels of the system are themselves written in Smalltalk and available for inspection.
However, this metaphor has influenced Smalltalk in ways that, I think, have proven a detriment to its adoption. For example, in general, there is no other way to edit Smalltalk code, save through the editor(s) provided by the environment. Now, granted, because that editor is deeply tied into the system, it’s capable of browsing code in some very impressive ways. But it means the user is given absolutely no flexibility to select a tool of his or her choice. And in the case of Pharo/Squeak, those tools can be a bit primitive (and occasionally buggy), providing little in the way of customizability (well, unless you want to hack the code, which you are, of course, free to do), while failing to provide facilities that one often takes for granted (macro facilities, multiple copy/paste buffers, regexp-based search/replace… the list goes on). In fact, the Pharo/Squeak editor is little more than a basic Notepad-style editor with syntax highlighting and some primitive auto-indentation capability.
Additionally, much as there is no option for editors, the choice of code management tools is extremely limited. The current tool of choice for version control in Pharo/Squeak is Monticello, a form of distributed version control system. Unfortunately, compared to, say, git, Monticello is decidedly primitive. Now, granted, there are those attempting to implement Git for Pharo/Squeak, but those projects are only just beginning, and one will still be limited to working in the Pharo/Squeak environment, and the tools available there.
Lastly, the Pharo/Squeak VM itself can sometimes be rather… frustrating. The VM is single-threaded, which means that any long-running piece of code, if not invoked in a background process (implemented as green threads), will hang the VM. Fortunately, the Alt-. hotkey exists to interrupt such operations so they can be terminated, but if the operation is sufficiently nasty (say, accidentally looping and inspecting items in a collection, creating a very large number of windows), it can be very unpleasant to clean up. Moreover, the image itself is only saved upon demand (i.e., there are no automatic saves to a separate backup file, like in many editors), and so if something catastrophic does happen to take down the VM, one’s work can be lost.
Unfortunately, in the end, despite all these shortcomings, the damn language and libraries are so good, I just can’t help but work with it. While I’m sure much time is wasted dealing with the aforementioned issues, so much is gained from sheer productivity that the win is clearly there. Plus, I must admit, Smalltalk is just plain fun to write. In fact, I haven’t had this much fun in years!
AJAX in Seaside
So, in yet another post in a series about Pharo and Seaside, I thought I’d highlight a great strength in Seaside: its incredibly powerful support for building rich, AJAX-enabled web applications.
As any web developer today knows, if you’re building rich web apps with complex user interactions, you’d be remiss not to look at AJAX for facilitating some of those interactions. AJAX makes it possible for a rendered web page, in a browser, to interact with the server and perform partial updates of the web page, in situ. This means that full page loads aren’t necessary to, say, update a list of information on the screen, and results in a cleaner, more seamless user experience (Gmail was really an early champion of this technique).
Of course, this post wouldn’t exist if Seaside didn’t somehow make this situation a whole lot simpler, and boy does it ever. To illustrate this, I’m going to demonstrate an AJAX-enabled version of the counter program mentioned in my first post on Seaside. So, instead of doing a full page refresh to display the updated counter value, we’re simply going to update the heading each time the value changes. Now, again, imagine what it would take to do this in a more traditional web framework. Then compare it to this:
renderContentOn: html
    | id counter |
    counter := 0.
    id := html nextId.
    html heading id: id; with: counter.
    html anchor
        onClick: (
            html scriptaculous updater
                id: id;
                callback: [ :ajaxHtml |
                    counter := counter + 1.
                    ajaxHtml text: counter.
                ]
        );
        url: '#';
        with: 'Increase'.
    html space.
    html anchor
        onClick: (
            html scriptaculous updater
                id: id;
                callback: [ :ajaxHtml |
                    counter := counter - 1.
                    ajaxHtml text: counter.
                ]
        );
        url: '#';
        with: 'Decrease'.
That’s it. The full script.
Now, a little explanation. The script begins with a little preamble, initializing our counter, and allocating an ID, which we then associate with the heading when we first render it. Pretty standard fare so far. The really interesting bit comes in the anchor definition, and in particular the definition of the onClick handler. Of course, this bit bears a little explanation.
So in this particular case, we have two anchor tags, each of which has an onClick event registered which, when invoked, updates the counter value and then updates the heading on the page.
By the way, there’s also a little bit of extra magic going on here. You’ll notice the ‘counter’ variable is local, while in the original example it was an instance variable. But this works, here, because those callbacks are actually lexical closures, and so the ‘counter’ variable sticks around, referenced by those closures, even though the function itself has returned, and the variable technically has gone out of scope.
If you want to see the above application running live, you can find it here.
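The closure behaviour behind the two callbacks is not specific to Smalltalk. As a rough analogue in Haskell (not Seaside code; makeCounter is a hypothetical helper invented for this sketch), two actions can share a counter that was created inside a function which has already returned:

```haskell
import Data.IORef (newIORef, modifyIORef', readIORef)

-- Hypothetical helper: builds an "increase" and a "decrease" callback
-- that both close over the same private counter.
makeCounter :: IO (IO Int, IO Int)
makeCounter = do
  ref <- newIORef (0 :: Int)
  -- Each callback updates the shared counter and returns the new value.
  let step delta = modifyIORef' ref (+ delta) >> readIORef ref
  return (step 1, step (-1))

main :: IO ()
main = do
  (increase, decrease) <- makeCounter
  -- makeCounter has returned, yet both callbacks still see the same counter.
  a <- increase
  b <- increase
  c <- decrease
  print [a, b, c]  -- prints [1,2,1]
```

As with the local counter variable in the Seaside method, nothing outside the two callbacks can reach the counter; it lives on only because the closures reference it.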