• Review: Monstrous Regiment

    Review of Monstrous Regiment (Discworld #31) by Terry Pratchett (ISBN 9780060013165) ★★★★★
    (https://b-ark.ca/M8UaEY)

    War has come to Discworld ... again.

    And, to no one's great surprise, the conflict centers around the small, arrogantly fundamentalist duchy of Borogravia, which has long prided itself on its unrelenting aggressiveness. A year ago, Polly Perks's brother marched off to battle, and Polly's willing to resort to drastic measures to find him. So she cuts off her hair, dons masculine garb, and -- aided by a well-placed pair of socks -- sets out to join this man's army. Since a nation in such dire need of cannon fodder can't afford to be too picky, Polly is eagerly welcomed into the fighting fold -- along with a vampire, a troll, an Igor, a religious fanatic, and two uncommonly close "friends." It would appear that Polly "Ozzer" Perks isn't the only grunt with a secret. But duty calls, the battlefield beckons. And now is the time for all good ... er ... "men" to come to the aid of their country.

    “We are a proud country.” “What are you proud of?” It came swiftly, like a blow, and Polly realized how wars happened. … We have our pride. And that’s what we’re proud of. We’re proud of being proud…

    In a few words, Terry Pratchett shows us why fiction and satire are so vital and powerful.

    Next to Night Watch and Small Gods, Monstrous Regiment is now one of my favourite Discworld novels. Tackling issues of gender equality, the insanity of war, and the dangers of blind nationalism, here Pratchett is, in my opinion, at his most powerful and most poignant.

    Continue reading...
  • Fun with Puppeteer

    In the past, web scraping involved a lot of offline scripting and parsing of HTML, either through a library or, for quick and dirty work, manual string transformations. The work was always painful, and as the web has become more dynamic, this offline approach has gone from painful to essentially impossible… you simply cannot scrape the contents of a modern website without a Javascript engine and a DOM implementation.

    The next generation of web scraping came in the form of tools like Selenium. Selenium uses a scripting language, along with a browser-side driver, to automate browser interactions. The primary use case for this particular stack is actually web testing, but because it drives a full browser, it can load dynamic content and simulate human interactions with a site, enabling scraping of even the most dynamic sites out there.

    Then came PhantomJS. PhantomJS took browser automation to the next level by wrapping a headless browser engine in a Javascript API. Using Javascript, you could then instantiate a browser, load a site, and interact with the page using standard DOM APIs. No longer did you need a secondary scripting language or a browser driver… in fact, you didn’t even need a GUI! Again, one of the primary use cases for this kind of technology is testing, but site automation in general, and scraping in particular, are excellent use cases for Phantom.

    And then the Chrome guys came along and gave us Puppeteer.

    Puppeteer is essentially PhantomJS built on the Chromium browser engine, delivered as an npm package you run atop Node. Current benchmarks indicate Puppeteer is faster and uses less memory than PhantomJS, all while using a more up-to-date browser engine.

    You might wonder why I started playing with Puppeteer.

    Well, it turns out Google Groups is sitting on a pretty extensive archive of old Usenet posts, some of which I wrote, dating back as early as ‘94. I wanted to archive those posts for myself, but discovered Groups provides no mechanism or API for pulling bulk content from its archive.

    For shame!

    Fortunately, Puppeteer made this a pretty easy nut to crack: the problem was just challenging enough to be fun, but easy enough to be done in a day. And thus I had the perfect one-day project for my holiday! The resulting script is roughly 100 lines of Javascript that is mostly reliable (unless Groups takes an unusually long time loading some of its content):

    const puppeteer = require('puppeteer')
    const fs = require('fs')
    
    async function run() {
      var browser = await puppeteer.launch({ headless: true });
    
      async function processPage(url) {
        const page = await browser.newPage();
    
        await page.goto(url);
        await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
        await page.waitForFunction('$(".F0XO1GC-nb-Y").find("[dir=\'ltr\']").length > 0');
        await page.waitForFunction('$(".F0XO1GC-nb-Y").find("._username").text().length > 0');
    
        await page.exposeFunction('escape', async () => {
          page.keyboard.press('Escape');
        });
    
        await page.exposeFunction('log', async (message) => {
          console.log(message);
        });
    
        var messages = await page.evaluate(async () => {
          function sleep(ms) {
            return new Promise(resolve => setTimeout(resolve, ms));
          }
    
          var res = []
    
          await sleep(5000);
    
          var messages = $(".F0XO1GC-nb-Y");
          var texts = messages.find("[dir='ltr']").filter("div");
    
          for (let msg of messages.get()) {
            // Open the message menu
            $(msg).find(".F0XO1GC-k-b").first().click();
    
            await sleep(100);
    
            // Find the link button
            $(":contains('Link')").filter("span").click();
    
            await sleep(100);
    
            // Grab the URL
            var msgurl = $(".F0XO1GC-Cc-b").filter("input").val().replace(
              "https://groups.google.com/d/", 
              "https://groups.google.com/forum/message/raw?"
            ).replace("msg/", "msg=");
    
            await sleep(100);
    
            // Now close the thing
            window.escape();       
    
            var text;
    
            await $.get(msgurl, (data) => text = data);
    
            res.push({
              'username': $(msg).find("._username").text(),
              'date': $(msg).find(".F0XO1GC-nb-Q").text(),
              'url': msgurl,
              'message': text
            });
    
            window.log("Message: " + res.length);
          };
    
          return JSON.stringify({
            'group': $(".F0XO1GC-mb-x").find("a").first().text(),
            'count': res.length,
            'subject': $(".F0XO1GC-mb-Y").text(),
            'messages': res
          }, null, 4);
        });
    
        await page.close();
    
        return messages;
      }
    
      // Thread URLs to scrape. The original post never shows where this list
      // comes from; reading them from the command line is one reasonable choice.
      const urls = process.argv.slice(2);

      for (let url of urls) {
        var parts = url.split("/");
        var id = parts[parts.length - 1];
    
        console.log("Loading URL: " + url);
    
        fs.writeFile("messages/" + id + ".json", await processPage(url), function(err) {
          if (err) {
            return console.log(err);
          }
    
          console.log("Done");
        });
      }
    
      await browser.close();
    }
    
    run()
    

    The interactions here are actually fairly complex. Each Google Groups message has a drop-down menu that you can use to get a link to the message itself. Some minor transformations to that URL then get you a link to the raw message contents. So this script loads the URL containing the thread, and then one-by-one, opens the menu, activates the popup to get the link, performs an Ajax call to get the message content, then scrapes out some relevant metadata and adds the result to a collection. The collection is then serialized out to JSON.
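    That URL transformation is simple enough to capture in a standalone helper. A sketch (the function name `toRawUrl` and the example permalink are mine, not from the original script; the two string replacements mirror the ones in the scraper above):

```javascript
// Convert a Google Groups message permalink into a raw-message URL by
// swapping the "/d/" prefix for the raw-message endpoint and turning the
// "msg/" path segment into a "msg=" query parameter.
function toRawUrl(msgurl) {
  return msgurl
    .replace("https://groups.google.com/d/",
             "https://groups.google.com/forum/message/raw?")
    .replace("msg/", "msg=");
}

// Example with a made-up permalink of the shape the menu popup exposes
console.log(toRawUrl("https://groups.google.com/d/msg/comp.lang.c/abc123/def456"));
// → https://groups.google.com/forum/message/raw?msg=comp.lang.c/abc123/def456
```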

    It works remarkably well for a complete hack job!
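    The fixed sleep() delays are the script's main source of flakiness when Groups loads slowly. One way to harden it, sketched here with a hypothetical waitFor helper that is not part of the original script, is to poll for a condition instead of sleeping a fixed interval:

```javascript
// Poll a predicate until it returns true or a timeout elapses, rather than
// sleeping for a fixed interval and hoping the page has caught up.
function waitFor(predicate, { timeout = 10000, interval = 100 } = {}) {
  const start = Date.now();
  return new Promise((resolve, reject) => {
    const tick = () => {
      if (predicate()) return resolve();
      if (Date.now() - start > timeout) {
        return reject(new Error("waitFor: timed out"));
      }
      setTimeout(tick, interval);
    };
    tick();
  });
}

// Example: resolve once a flag flips, instead of a blind 5000ms sleep
let ready = false;
setTimeout(() => { ready = true; }, 50);
waitFor(() => ready).then(() => console.log("ready"));
```

    Inside page.evaluate() the same idea could replace each sleep(100) with a wait for the menu or popup element to actually appear.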