Posts in category 'personalarchiving'
I kinda feel like we’ve become too inured to the modern miracle that is SQLite. That a fully featured relational database supporting much of the SQL standard can be packaged up in less than 1MB of portable code is incredible. And the developers guarantee support for the current (open, portable) file format through 2050, which is why it’s specifically recommended by the Library of Congress!
I don’t agree with all the advice here, but this is another reason to have my own blog, authored in Markdown, converted to static HTML and CSS.
Personal Archiving and the IndieWeb
The indieweb is about controlling your identity. But it can also be a great way to claw back all that content I’ve been scattering across the web so I can get better at archiving!
Our data, scattered
A while back I started to take an interest in the topic of personal data archiving, and in particular how the topic intersects with the various social media platforms that so many of us interact with. The simple fact is that so much of who we are–the things we write, the photos and videos we take, the people we interact with, our very memories, as Facebook likes to remind us–are locked up in a bunch of different walled gardens that are difficult to escape, both technically and due to the powerful social pressures that keep us on these platforms.
I like to think of the traditional photo album as an interesting contrast.
It used to be that we collected memories in these books and stored those books on a shelf. There were some real downsides to this approach! It’s a pain to add stuff to them (I have to “print” photos??). They’re difficult to share and enjoy. They’re single points of failure (think: house fires). They require intentional acts to ensure preservation. The list goes on.
But, they were ours. We owned them. We could take those photos and easily copy them, share them, rearrange them, archive them, and so forth.
Now imagine that you collected all your photos in a photo album that you could only store and access in a vault run by a private company. The company would ensure the photos were protected and stored properly, and it would provide a really nice, simple mechanism to add photos to your album right from your phone! That’s really convenient! But if you wanted to look at those photos, you’d have to go to the vault, enter your passcode, and then you could only look at them while you were in the vault. And if you wanted to get a copy of all of those photos for yourself, well, you could, but the process would be ugly and complicated, designed to make it minimally possible and maximally difficult.
Next, imagine the corporation changed their policies in a way you didn’t like. Or imagine that corporation went bankrupt. Or experienced a fire. Or you lost the passcode for that vault. Or a loved one passed away and didn’t store the passcode in a safe place.
What then?
Today, we don’t just lock those photos in one vault run by one private company. We lock those photos in many vaults, spread out all over the place. In doing so, we dramatically increase these risks, because instead of just one company failing or one account that we might lose access to or one set of terms of service we need to worry about, it’s many.
All the while we fragment our identity, spreading ourselves thin across the internet, which makes it extremely difficult to preserve all of those memories.
So what can we do about it?
HTML JSON Data Archiving
My Google Groups web scraping exercise left me with an archive of over 2400 messages, of which 336 were written by yours truly. These messages were laid down in a set of files, each containing JSON payloads of messages and associated metadata.
But… what do I do with it now?
Obviously the goal is to be able to explore the messages easily, but that requires a user interface of some kind.
Well, the obvious user interface for a large blob of JSON-encoded data is, of course, HTML, and so began my next mini-project.
First, I took the individual message group files and concatenated them into a single large JSON structure containing all the messages. Total file size: 4.88MB.
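In case it’s useful, the merge step looks roughly like this (a minimal sketch, not the exact script I used; the archive.js output name and the archive variable are placeholders of mine, though the messages/ directory matches the scraper output described later):

const fs = require('fs');
const path = require('path');

// Each file in messages/ holds one thread's worth of scraped JSON.
const dir = 'messages';
const threads = fs.readdirSync(dir)
  .filter(name => name.endsWith('.json'))
  .map(name => JSON.parse(fs.readFileSync(path.join(dir, name), 'utf8')));

// Emit the combined structure as a variable assignment so an HTML shell can
// pull it in with a plain <script> tag (no server or XHR required).
fs.writeFileSync('archive.js', 'var archive = ' + JSON.stringify(threads) + ';\n');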
Next, I created an empty shell HTML file, loaded in jQuery and the JSON data as individual scripts, and then wrote some code to walk through the messages and build up a DOM representation that I could format with CSS. The result is simple but effective! Feel free to take a look at my Usenet Archive here. But be warned, a lot of this is stuff I posted when I was as young as 14 years old…
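The shell itself is about as simple as it sounds. Here’s a rough sketch of the approach (not my actual page; the markup, class names, and the archive variable are placeholders):

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <link rel="stylesheet" href="archive.css">
    <script src="jquery-3.2.1.min.js"></script>
    <script src="archive.js"></script> <!-- defines: var archive = [ ...threads... ] -->
  </head>
  <body>
    <div id="threads"></div>
    <script>
      // Walk the combined JSON and build a DOM representation for CSS to style.
      for (let thread of archive) {
        let el = $('<div class="thread">').append($('<h2>').text(thread.subject));
        for (let msg of thread.messages) {
          el.append($('<div class="message">')
            .append($('<div class="meta">').text(msg.username + ' on ' + msg.date))
            .append($('<pre class="body">').text(msg.message)));
        }
        $('#threads').append(el);
      }
    </script>
  </body>
</html>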
Usage is explained in the document, so hopefully it should be pretty self-explanatory.
Anyway, this got me thinking about the possibilities of JSON as an archival format for data, and HTML as the front-end rendering interface. The fact that I can ship a data payload and an interactive UI in a single package is very interesting!
Update: I also used this project as an opportunity to experiment with ES6 generators as a method for browser timeslicing. If you look at the code, it makes use of a combination of setTimeout and a generator to populate the page while keeping the browser responsive. This, in effect, provides re-entrant, cooperative multitasking by allowing you to pause the computation and hand control back to the browser periodically. Handy! Of course, it requires a semi-modern browser, but lucky for me, I don’t much care about backward compatibility for this little experiment!
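The pattern looks roughly like this (a simplified sketch of the idea rather than the actual code from the archive page; the function names and chunk size are illustrative):

// Do the DOM-building work in slices, yielding every so often so the
// browser can repaint and handle input between chunks.
function* renderMessages(messages) {
  for (let i = 0; i < messages.length; i++) {
    addMessageToDOM(messages[i]);
    if (i % 50 === 0) {
      yield;  // pause point: hand control back to the event loop
    }
  }
}

// Drive the generator: run one slice, then schedule the next with setTimeout.
function runSliced(generator) {
  if (!generator.next().done) {
    setTimeout(() => runSliced(generator), 0);
  }
}

// Illustrative stand-in for whatever DOM work each message needs.
function addMessageToDOM(msg) {
  const div = document.createElement('div');
  div.textContent = msg.username + ': ' + msg.message;
  document.body.appendChild(div);
}

// e.g. runSliced(renderMessages(archive[0].messages));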
Fun with Puppeteer
In the past, web scraping involved a lot of offline scripting and parsing of HTML, either through a library or, for quick and dirty work, manual string transformations. The work was always painful, and as the web has become more dynamic, this offline approach has gone from painful to essentially impossible… you simply cannot scrape the contents of many modern websites without a Javascript engine and a DOM implementation.
The next generation of web scraping came in the form of tools like Selenium. Selenium uses a scripting language, along with a browser-side driver, to automate browser interactions. The primary use case for this particular stack is actually web testing, but because it drives a full browser that loads dynamic content and can simulate human interactions with a site, it enables scraping of even the most dynamic sites out there.
Then came PhantomJS. PhantomJS took browser automation to the next level by wrapping a headless browser engine in a Javascript API. Using Javascript, you could then instantiate a browser, load a site, and interact with the page using standard DOM APIs. No longer did you need a secondary scripting language or a browser driver… in fact, you didn’t even need a GUI! Again, one of the primary use cases for this kind of technology is testing, but site automation in general, and scraping in particular, are excellent use cases for Phantom.
And then the Chrome guys came along and gave us Puppeteer.
Puppeteer is essentially PhantomJS, but built on the Chromium browser engine and delivered as an npm package you run atop Node. Current benchmarks indicate Puppeteer is faster and uses less memory, all while running a more up-to-date browser engine.
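If you haven’t seen it before, the basic shape of a Puppeteer script is about this simple (a minimal sketch against a stand-in URL):

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chromium instance and open a new tab.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Load a page and run code inside it, just as if it were in the console.
  await page.goto('https://example.com');
  const title = await page.evaluate(() => document.title);
  console.log(title);

  await browser.close();
})();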
You might wonder why I started playing with Puppeteer.
Well, it turns out Google Groups is sitting on a pretty extensive archive of old Usenet posts, some of which I’ve written, dating back as early as ’94. I wanted to archive those posts for myself, but discovered Groups provides no mechanism or API for pulling bulk content from its archive.
For shame!
Fortunately, Puppeteer made this a pretty easy nut to crack: just challenging enough to be fun, but easy enough to finish in a day. And thus I had the perfect one-day project for my holiday! The resulting script is roughly 100 lines of Javascript and is mostly reliable (unless Groups takes an unusually long time loading some of its content):
const puppeteer = require('puppeteer')
const fs = require('fs')

// The list of Google Groups thread URLs to scrape (fill in before running).
const urls = []

async function run() {
  var browser = await puppeteer.launch({ headless: true });

  async function processPage(url) {
    const page = await browser.newPage();
    await page.goto(url);
    await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
    await page.waitForFunction('$(".F0XO1GC-nb-Y").find("[dir=\'ltr\']").length > 0');
    await page.waitForFunction('$(".F0XO1GC-nb-Y").find("._username").text().length > 0');
    await page.exposeFunction('escape', async () => {
      page.keyboard.press('Escape');
    });
    await page.exposeFunction('log', async (message) => {
      console.log(message);
    });

    var messages = await page.evaluate(async () => {
      function sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
      }

      var res = []
      await sleep(5000);

      var messages = $(".F0XO1GC-nb-Y");
      var texts = messages.find("[dir='ltr']").filter("div");

      for (let msg of messages.get()) {
        // Open the message menu
        $(msg).find(".F0XO1GC-k-b").first().click();
        await sleep(100);

        // Find the link button
        $(":contains('Link')").filter("span").click();
        await sleep(100);

        // Grab the URL
        var msgurl = $(".F0XO1GC-Cc-b").filter("input").val().replace(
          "https://groups.google.com/d/",
          "https://groups.google.com/forum/message/raw?"
        ).replace("msg/", "msg=");
        await sleep(100);

        // Now close the thing
        window.escape();

        var text;
        await $.get(msgurl, (data) => text = data);

        res.push({
          'username': $(msg).find("._username").text(),
          'date': $(msg).find(".F0XO1GC-nb-Q").text(),
          'url': msgurl,
          'message': text
        });

        window.log("Message: " + res.length);
      };

      return JSON.stringify({
        'group': $(".F0XO1GC-mb-x").find("a").first().text(),
        'count': res.length,
        'subject': $(".F0XO1GC-mb-Y").text(),
        'messages': res
      }, null, 4);
    });

    await page.close();
    return messages;
  }

  for (let url of urls) {
    var parts = url.split("/");
    var id = parts[parts.length - 1];
    console.log("Loading URL: " + url);
    fs.writeFile("messages/" + id + ".json", await processPage(url), function(err) {
      if (err) {
        return console.log(err);
      }
      console.log("Done");
    });
  }

  browser.close();
}

run()
The interactions here are actually fairly complex. Each Google Groups message has a drop-down menu that you can use to get a link to the message itself. Some minor transformations to that URL then get you a link to the raw message contents. So this script loads the URL containing the thread and then, one by one, opens each message’s menu, activates the popup to get the link, performs an Ajax call to fetch the raw message content, scrapes out some relevant metadata, and adds the result to a collection. The collection is then serialized out to JSON.
It works remarkably well for a complete hack job!