Review: Carpe Jugulum
Review of Carpe Jugulum (Discworld #23.0) by Terry Pratchett (9780613277617)★★★★★(https://b-ark.ca/gKWiYy)When Uberwald's undead population, the Magpyrs, begins to invade Lancre, a priest forges an tentative allience with the local witches to prevent the kingdom from being overrun.
Granny Weatherwax is one of those characters that, prior to this point, I would describe as a quiet mystery. Old, crotchety, wise, powerful, and wickedly intelligent, her role in her coven was always important but never a centerpiece, at least in my mind, and she was never a character toward which I would ascribe feelings of sympathy.
But in this book we see Pratchett really explore what makes Granny Granny, and in doing so, shows us why he loves her character so much. Like Captain Vimes, Granny is a creature of honour and duty, willing to do those things others can’t or won’t do because they are necessary. But as we’ve seen Vimes, here we see Granny bending beneath the weight of that duty, and in those moments we see the humanity and vulnerability of a character who seems so unbreakable.
To me, this book stacks up with Guards! Guards! as one of Pratchett’s best… I’m a guy who loves relating to great characters, and this book delivers in spades.
Introducing the Trie
So in the kickoff post of my series on data structures and algorithms I’d like to begin with a relatively simple but handy little data structure: the trie. If you want to jump ahead and look at a very simplistic implementation of a trie data structure (only the insert and dump operations have been completed), I’ve put my experimental code up on GitHub here.
A clever little play on the word retrieval (though I, and many others, insist on pronouncing it “try”… suck it etymology), a trie is a key-value store represented as an n-ary tree, except that unlike a typical key-value store, no one node stores the key itself. Instead, the path to the value within the tree is defined by the characters/bits/what-have-you that define the key itself. Yeah, that’s pretty abstract, why don’t we just look at an example:
In this construction I’ve chosen the following set of keys:
- allow
- alter
- alloy
- ant
- bad
- bar
As you can see, each character in the key is used to label an edge in the tree, while the nodes store the values associated with that key (note, in this example I’ve chosen to use the keys as values as well… this entirely artificial, and a bit confusing. Just remember, those values could be absolutely anything.) 1 Typically these keys are strings, as depicted here, although it’s entirely possible to build a bit-wise trie that can be keyed off of arbitrary strings of bits. To find the value for a key, you take each character and, starting with the root node, transition through the graph until the target node is found. Or, as pseudo-code:
find(root_node, key): current_node = root_node current_key = key while (current_key.length > 0): character = current_key[0] if current_node.has_edge_for(character): current_node = current_node.get_get_for(character).endpoint else throw "ERMAGERD" return current_node.value
Strangely, a very similar algorithm can be used for both inserts and deletes.
Some Interesting Properties
The trie offers a number of interesting advantages over traditional key-value stores such as hash tables and binary search trees:
- As mentioned previously, they have the peculiar feature that inserts, deletes, and lookups use very similar codepaths, and thus have very similar performance characteristics. As such, in applications where these operations are performed with equal frequency, the trie can provide better overall performance than other more traditional key-value stores.
- Lookup performance is a factor of key length as opposed to key distribution or dataset size. As such, for lookups they often outperform both hash tables and BSTs.
- They are quite space efficient for short keys, as key prefixes are shared between edges, resulting in compression of the graph.
- They enable longest-prefix matching. Given a candidate key, a trie can be used to perform a closest fit search with the same performance as an exact search.
- Pre-order traversal of the graph generates an ordered list of the keys (in fact, this implementation is a form of radix sort).
- Unlike hashes, there’s no need to design a hash function, and collisions can only occur if identical keys are inserted multiple times.
Applications
Because tries are well-suited to fuzzy matching algorithms, they often see use in spell checking implementations or other areas involving fuzzy matching against a dictionary. In addition, the trie forms the core of Radix/PATRICIA and Suffix Trees, both of which are interesting enough to warrant separate posts of their own. Stay tuned!
References
-
Interestingly, if you looked at this example graph, you’d be forgiven for assuming it was an illustration of a finite state machine, with the characters in the key triggering transitions to deeper levels of the graph. ↩
On Algorithms
So, generally speaking, I’ve typically adhered to the rule that those who develop software should be aware of various classes of algorithms and data structures, but should avoid implementing them if at all possible. The reasoning here is pretty simple, and I think pretty common:
- You’re reinventing the wheel. Stop that, we have enough wheels.
- You’re probably reinventing it badly.
So just go find yourself the appropriate wheel to solve your problem and move on.
Ah, but there’s a gotcha, here: Speaking for myself, I never truly understand an algorithm or data structure, both theoretically (ie, how it works in the abstract, complexity, etc) and practically (ie, how you’d actually implement the thing) until I try to implement it. After all, these things in the abstract can be tricky to grok, and when actually implemented you discover there’s all kinds of details and edge cases that you need to deal with.
Now, I’ve spent a lot of my free time learning about programming languages (the tools of our trade that we use to express our ideas), and about software architecture and design, the “blueprints”, if you will. But if languages are the tools and the architecture and design are the blueprints, algorithms and data structures are akin to the templates carpenters use for building doors, windows, etc. That is, they provide a general framework for solving various classes of problems that we as developers encounter day-to-day.
And, like a framer, day-to-day we may very well make use of various prefabbed components to get our jobs done more quickly and efficiently. But without understanding how and why those components are built the way they are, it can be very easy to misuse or abuse them. Plus, it can’t hurt if, when someone comes along and asks you to show off your mad skillz, you can demonstrate your ability to build one of those components from scratch.
Consequently, I plan to kick off a round of posts wherein I explore various interesting algorithms and data structures that happen to catch my attention. So far I have a couple on the list that look interesting, either because I don’t know them, or because it’s been so long that I’ve forgotten them…
Data Structures
- Skip list
- Fibonacci heap
- Red-Black tree
- Tries
- Radix/PATRICIA Tries
- Suffix Tries
- Bloom filter
Algorithms
- Various streaming algorithms (computations over read-once streams of data):
- Heavy hitters (finding elements that appear more often than a proscribed freqency)
- Counting distinct elements
- Computing entropy
- Topological sort
And I guarantee there’s more that belong on this list, but this is just an initial roadmap… assuming I follow through, anyway.
Hosting Git on Windows
Using Git to push changes upstream to servers is incredibly handy. In essence, you set up a bare repository on the target server, configure git to use the production application path as the git working directory, and then set up hooks to automatically update the working directory when changes are pushed into the repository. The result is dead easy code deployment, as you can simply push from your repository to the remote on the server.
But making this work when the Git repository is being hosted on Windows is a bit tricky. Normally ssh is the default transport for git, but making that work on Windows is an enormous pain. As such, this little writeup assumes the use of HTTP as the transport protocol.
Installation
So, first up we need to install a couple components:
- msysgit
- Apache
Note: When installing msysgit, make sure to select the option that installs git in your path! After installation the system path should include the following1:
C:\Program Files\Git\cmd;C:\Program Files\Git\bin;C:\Program Files\Git\libexec\git-core
Now, in addition, we’ll be using git-http-backend to serve up our repository, and it turns out the msysgit installation of this tool is broken such that one of its required DLLs is not in the directory where it’s installed. As such, you need to copy:
C:\Program Files\Git\bin\libiconv-2.dll
to
C:\Program Files\Git\libexec\git-core\
Repository Initialization
Once you have the software installed, create your bare repository by firing up Git Bash and running something like:
$ mkdir -p /c/git/project.git $ cd /c/git/project.git $ git init --bare $ git config core.worktree c:/path/to/webroot $ git config http.receivepack true $ touch git-daemon-export-ok
Those last three commands are vital and will ensure that we can push to the repository, and that the repository uses our web root as the working tree.
Configuring Apache
Next up, add the following lines to your httpd.conf:
SetEnv GIT_PROJECT_ROOT c:*git* ScriptAlias *git* "C:/Program Files/Git/libexec/git-core/git-http-backend.exe/" <Directory "C:/Program Files/Git/libexec/git-core/"> Options +ExecCGI FollowSymLinks Allow From All </Directory>
Note, I’ve omitted any security, here. You’ll probably want to enable some form of HTTP authentication.
In addition, in order to make hooks work, you need to reconfigure the Apache daemon to run as a normal user. Obviously this user should have permissions to read from/write to the git repository folder and web root.
Oh, and last but not least, don’t forget to restart Apache at this point.
Pushing the Base Repository
So, we now have our repository exposed, let’s try to push to it. Assuming you have an already established repository ready to go and it’s our master branch we want to publish, we just need to do a:
git remote add server http://myserver/git/project.git git push server master
In theory, anyway.
Note: After the initial push, in at least one instance I’ve found that “logs/refs” wasn’t present in the server bare repository. This breaks, among other things, git stash. To remedy this I simply created that folder manually.
Lastly, you can pop over to your server, fire up Git Bash, and:
$ cd /c/git/project.git $ git checkout master
Our Hooks
So, about those hooks. I use two, one that triggers before a new update comes to stash any local changes, and then another after a pack is applied to update the working tree and then unstash those local changes. The first is a pre-receive hook:
#!/bin/sh export GIT_DIR=`pwd` cd `git config --get core.worktree` git stash save --include-untracked
The second is a post-update hook:
#!/bin/sh export GIT_DIR=`pwd` cd `git config --get core.worktree` git checkout -f git reset --hard HEAD git stash pop
Obviously you can do whatever you want, here. This is just something I slapped together for a test server I was working with.
-
Obviously any paths, here, would need to be tweaked on a 64-bit server with a 32-bit Git. ↩