Building Spooler

by Darius Kazemi, Jul 13, 2017

Note: This is a blog post about design decisions. There is some code in it but you don't have to know how to code to follow along. Special thanks to my Patreon backers, who are making it possible for me to write stuff like this far more regularly.

Yesterday I released Spooler, a tool that turns Twitter threads into blog posts. It was an unusual project for me—outside of professional work I do for clients, I normally don't spend much longer than a day or two working on a project. Spooler took me a whole month of (sporadic) effort.

In the course of making Spooler, I had to make a number of different design decisions. I often make, uh, idiosyncratic design decisions, so I thought I'd talk about the process here.

Initial goals

I wanted Spooler to convert Twitter threads into something resembling a blog post. So: paragraphs of text, embedded media, a single author. Specifically I wanted to target numbered Twitter threads, the kind of stuff journalists and uh, journalist-adjacent people often do. Here's a typical example:

1. Among Very Serious People the real argument for American global hegemony is not freedom or democracy but stability.
— Jeet Heer (@HeerJeet) June 19, 2017

If you click through you'll see a thread of numbered replies, also from the original poster, meant to be read in sequence.

So the first thing we need is actual sentences rather than tweets.

Turning tweets into sentences

Individual tweets aren't like normal sentences, even normal internet sentences. They're full of strange conventions and t.co links and all sorts of other weirdness. I pulled a tweet from the middle of what I thought was a representative thread, since it has sequence numbers, embedded media, URLs, and @-replies. More specifically, I pulled the data that a computer program would get if it asked Twitter for the contents of the tweet (the statuses/show API call). And I started off with exactly the following code snippet.

let tweet = `@IndivisibleTeam @Ron_Pollack 19/20 So get to work with your Indivisible group. Let @indivisibleteam know how we can help. Together, we will win. https://t.co/XbYq0WuoHQ`;

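function clean(tweet) {
  // Matches a single @-reply (like "@IndivisibleTeam ") at the very start.
  const replyAtStart = /^@\w+ /;
  // Keep stripping until the tweet no longer starts with an @-reply.
  while (replyAtStart.test(tweet)) {
    tweet = tweet.replace(replyAtStart, '');
  }
  return tweet;
}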

console.log(clean(tweet));

(Programming aside: at first I was like, "Oh no, a while loop! That's never good." But then I thought about it for a few seconds and realized this loop as constructed would literally never cause an infinite blocking loop, so, it's probably okay and I could like take a deep breath and calm down.)

Anyhow, the output of the above program is

19/20 So get to work with your Indivisible group. Let @indivisibleteam know how we can help. Together, we will win. https://t.co/XbYq0WuoHQ

Just a simple regular expression to detect and then get rid of an arbitrary number of @-replies at the start of a tweet.

From here I started adding more and more regular expressions:

const tweetNumberStart = /^\d+\/\d+\b/; // starts with 2/15
const tweetNumberEnd = /\d+\/\d+$/;     // ends with 2/15
const tweetParenStart = /^\(\d+\)/;     // starts with (2)
const tweetDotStart = /^\d+\./;         // starts with 2.
const tweetSlashTrailStart = /^\d+\//;  // starts with 2/
const endsWithThreeDots = /\.\.\.$/;    // ends with ...
const numberParenStart = /^\d+\)/;      // starts with 2)

That cleared up most of the common sequential numbering formats I was seeing. That's it, btw, those are the rules for catching and removing numbers from threads.
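
To make that concrete, here is a minimal sketch of how patterns like these might be applied. The stripNumbering helper and the "at most one number from the front, one from the back" policy are assumptions for illustration, not Spooler's actual code, and the "ends with ..." rule is left out because it isn't a number.

```javascript
// Hypothetical helper, not Spooler's actual code.
// Order matters: "2/15" has to be tried before the shorter "2/" pattern.
const startPatterns = [
  /^\d+\/\d+\b/, // starts with 2/15
  /^\(\d+\)/,    // starts with (2)
  /^\d+\./,      // starts with 2.
  /^\d+\//,      // starts with 2/
  /^\d+\)/,      // starts with 2)
];
const endPattern = /\d+\/\d+$/; // ends with 2/15

function stripNumbering(text) {
  let result = text.trim();
  // Remove only the first leading pattern that matches, so the rest of the
  // tweet (say, "3/4 cup") is left alone.
  const match = startPatterns.find((pattern) => pattern.test(result));
  if (match) {
    result = result.replace(match, '');
  }
  return result.replace(endPattern, '').trim();
}

console.log(stripNumbering('19/20 So get to work with your Indivisible group.'));
console.log(stripNumbering('(2) Second point'));
console.log(stripNumbering('Last point 20/20'));
```

Rules this simple do have blind spots: a tweet that genuinely starts with "1.5 million people" would lose its "1." to the "starts with 2." rule.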

Then I added some more rules that I won't show the code for, but they do things like:

turn "@tinysubversions" into the hyperlinked "@tinysubversions"
turn "https://t.co/blah" into an unfurled, parenthetical hyperlink "(https://example-display-url...)"
turn "#whnbm" into the hyperlinked "#whnbm"

Those last two bits are information that Twitter helpfully gives you when you ask it for everything it knows about a tweet.

Design decision — I decided to turn most links into parenthetical asides. I did this because I had no way of guessing which term (if any) in a given tweet the URL might be related to. So instead I treat it kind of like an MLA in-text citation. But I do also create real hyperlinks where possible, with @-replies and hashtags. This is the internet after all!
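
Here is a rough sketch of what that kind of rewriting could look like. It assumes a tweet object shaped like the one Twitter's API returns, with a text field and an entities object whose urls, user_mentions, and hashtags each carry an indices pair. The linkify helper, the sample values, and the exact HTML are made up for illustration; this is not Spooler's code, and it skips HTML escaping entirely.

```javascript
// Hypothetical helper, not Spooler's actual code.
function linkify(tweet) {
  const { urls = [], user_mentions = [], hashtags = [] } = tweet.entities;

  const replacements = [
    ...urls.map((u) => ({
      indices: u.indices,
      // Parenthetical aside showing the readable URL instead of the t.co one.
      html: `(<a href="${u.expanded_url}">${u.display_url}</a>)`,
    })),
    ...user_mentions.map((m) => ({
      indices: m.indices,
      html: `<a href="https://twitter.com/${m.screen_name}">@${m.screen_name}</a>`,
    })),
    ...hashtags.map((h) => ({
      indices: h.indices,
      html: `<a href="https://twitter.com/hashtag/${h.text}">#${h.text}</a>`,
    })),
  ];

  // Assumption: Twitter counts indices in Unicode code points, so split the
  // text that way, then replace from the end so earlier indices stay valid.
  const chars = Array.from(tweet.text);
  replacements
    .sort((a, b) => b.indices[0] - a.indices[0])
    .forEach(({ indices: [start, end], html }) => {
      chars.splice(start, end - start, html);
    });
  return chars.join('');
}

// Made-up sample data in the shape described above.
const sample = {
  text: 'Thanks @tinysubversions #whnbm https://t.co/blah',
  entities: {
    urls: [{
      url: 'https://t.co/blah',
      expanded_url: 'https://example.com/article',
      display_url: 'example.com/article',
      indices: [31, 48],
    }],
    user_mentions: [{ screen_name: 'tinysubversions', indices: [7, 23] }],
    hashtags: [{ text: 'whnbm', indices: [24, 30] }],
  },
};

console.log(linkify(sample));
```

Working from the indices rather than searching for the strings sidesteps problems like a mention typed in a different case than the account's real screen name.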

Traversing the thread

The next question I had is: what would the user input be? It seemed like the easiest thing was to ask people to copy/paste any tweet from a thread, and then automatically figure out the entirety of the thread for the user, rendering it as a complete blog post.

Unfortunately, Twitter does not tell you the children of a tweet, only its parents. So if you imagine a Twitter thread as a family tree, full of branches, if I start with one tweet, I can tell you its parent and its grandparent, but I can't tell you its cousins or siblings or, crucially, its children. This makes some sense, as tweets can have hundreds of children but only one parent, and I imagine it would be a pain in the ass for Twitter to provide all that information on their API.

What this means is you can't start with the first tweet of a thread and then work your way down through it all. You have to start from the bottom and work your way up.
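
In code, working your way up might look something like the sketch below. It leans on the in_reply_to_status_id_str field that Twitter includes on each tweet (null when the tweet isn't a reply). The collectThread function and the fetchTweet argument are hypothetical stand-ins: fetchTweet is whatever function asks Twitter for a single tweet, and here it is faked with an in-memory object.

```javascript
// Hypothetical sketch, not Spooler's actual code.
async function collectThread(lastTweetId, fetchTweet) {
  const thread = [];
  let id = lastTweetId;
  while (id) {
    const tweet = await fetchTweet(id);
    thread.unshift(tweet);                // parents go in front of their children
    id = tweet.in_reply_to_status_id_str; // null once we reach the top of the thread
  }
  return thread;
}

// Fake data standing in for Twitter, so the sketch runs on its own.
const fakeTweets = {
  '1': { id_str: '1', text: 'first', in_reply_to_status_id_str: null },
  '2': { id_str: '2', text: 'second', in_reply_to_status_id_str: '1' },
  '3': { id_str: '3', text: 'third', in_reply_to_status_id_str: '2' },
};
const fakeFetch = async (id) => fakeTweets[id];

collectThread('3', fakeFetch).then((thread) => {
  console.log(thread.map((tweet) => tweet.text));
});
```

Start it from the middle of a thread and you only get the top half, which is exactly the limitation described above.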

I considered a whole bunch of tricks for getting around this problem. John Holdun suggested the software could start with a given tweet, then search for tweets from that user that happened close by in time, then check to see their lineage and sort them out into their appropriate family tree. Unfortunately, this wouldn't catch one of my favorite types of thread: topical threads that users update over the course of many months or years. (So like, every time I tweet about karaoke, I put it in this thread for easy reference.)

Design decision — In the end, I thought this issue would surely make this a useless tool. Who would go out of their way to scroll to the bottom of a thread just so they wouldn't have to, you know, scroll through a whole thread? It would have been very easy to get caught up in this problem and never release Spooler at all! But I also realized that I could just ignore this problem completely and move on with the project. In the end I would have a flawed tool, but the tool would exist, and at least I would use it.

Collecting data

At this point I needed more threads to test on, so I asked my Twitter followers for examples of long-ass numbered Twitter threads. They gave me so, so many threads. It was great.

It was at this point that I realized that a lot of threads use video and images. Some threads are just huge repositories of animated GIFs. I needed to render this stuff in a nice way. Fortunately, for native embedded media, Twitter gives you the URL of the source images/videos, so I could easily drop that into a web page in an <img> or <video> tag and call it a day.

Design decision — Images were easy enough to include, but for video the big question was: to autoplay or not to autoplay. In the end I decided to autoplay animated GIFs (which on Twitter are actually converted to short, silent videos), but to require the user to press play on a "real" video with sound. I made this decision by rendering a bunch of multimedia threads and going with my gut, figuring out what I felt were the least annoying default settings for media playback.
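
As a sketch of how those defaults could translate to markup: the helper below assumes media items shaped like the entries in a tweet's extended_entities.media array, where type is "photo", "animated_gif", or "video" and video_info.variants lists the available files. The renderMedia name, the sample URLs, and the exact attributes are illustrative guesses, not Spooler's actual output.

```javascript
// Hypothetical helper, not Spooler's actual code.
function renderMedia(media) {
  if (media.type === 'photo') {
    return `<img src="${media.media_url_https}">`;
  }
  // GIFs and videos both arrive as video files; look for an MP4 variant.
  const mp4 = media.video_info.variants.find(
    (variant) => variant.content_type === 'video/mp4'
  );
  if (!mp4) {
    return ''; // nothing playable to embed
  }
  if (media.type === 'animated_gif') {
    // Short and silent, so autoplay on a loop the way a GIF would.
    return `<video src="${mp4.url}" autoplay loop muted playsinline></video>`;
  }
  // A "real" video with sound: show controls and wait for the reader to press play.
  return `<video src="${mp4.url}" controls poster="${media.media_url_https}"></video>`;
}

// Made-up sample item.
const gif = {
  type: 'animated_gif',
  media_url_https: 'https://example.com/thumb.jpg',
  video_info: {
    variants: [{ content_type: 'video/mp4', url: 'https://example.com/clip.mp4' }],
  },
};
console.log(renderMedia(gif));
```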

Paragraphs

A question I get very often is: how do you decide where to put line breaks? Well, that was definitely more of an art than a science. At the moment, these are the rules:

if a tweet has a line break inside it, of course render that line break
if a tweet has embedded media, break to a new paragraph
if the current tweet's parent starts with an uppercase letter, and the current tweet ends with a terminating punctuation mark, break to a new paragraph

I can't provide a rational explanation for the last rule. I was fiddling around with different conditional parameters, testing it with different threads, and that one seemed to do a decent job of guessing where a "new thought" began in a long thread.
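
For the code-inclined, the last two rules could be sketched roughly like this. Everything here is an assumption for illustration: the needsParagraphBreak name, the text and hasMedia fields, and the choice of ".", "!", and "?" as the terminating punctuation marks. It also assumes the numbering and @-replies have already been cleaned off the text.

```javascript
// Hypothetical sketch, not Spooler's actual code.
const startsUppercase = (text) => /^[A-Z]/.test(text);
const endsWithTerminator = (text) => /[.!?]$/.test(text);

function needsParagraphBreak(current, parent) {
  // Rule: embedded media always gets its own paragraph.
  if (current.hasMedia) {
    return true;
  }
  // Rule: parent starts with an uppercase letter AND the current tweet
  // ends with terminating punctuation.
  return Boolean(parent) &&
    startsUppercase(parent.text) &&
    endsWithTerminator(current.text);
}

console.log(needsParagraphBreak(
  { text: 'Together, we will win.', hasMedia: false },
  { text: 'So get to work with your group', hasMedia: false }
));
```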

Design decision — Uhhhh, it looks nice? I got nothing.

Performance and scaling and privacy

Twitter heavily rate limits their API. In practice, this means you can only ask for about one tweet per second. So a 120-tweet thread would take 120 seconds, aka two minutes, to render. There's not really a way around this that doesn't violate Twitter's terms of use. (Well, I could pay Twitter lots of money to give the app a bigger pipeline of data, but I don't have lots of money.)
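
One simple way to stay under a limit like that is to space requests out yourself. The sketch below is an illustration under assumptions, not Spooler's actual code: rateLimited is a hypothetical wrapper that makes sure calls to any function (for example, the one that asks Twitter for a tweet) start at least delayMs apart, and estimateSeconds is just the arithmetic from the paragraph above.

```javascript
// Hypothetical sketch, not Spooler's actual code.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function rateLimited(fn, delayMs = 1000) {
  let nextAllowed = 0; // timestamp before which the next call must not start
  return async (...args) => {
    const wait = Math.max(0, nextAllowed - Date.now());
    nextAllowed = Date.now() + wait + delayMs;
    if (wait > 0) {
      await sleep(wait);
    }
    return fn(...args);
  };
}

// Back-of-the-envelope render time for a thread, in seconds.
const estimateSeconds = (tweetCount, delayMs = 1000) => (tweetCount * delayMs) / 1000;

console.log(estimateSeconds(120)); // 120
```

Wrapping the fetch function once means the rest of the program can keep asking for tweets as fast as it likes, and the waiting happens in one place.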

One obvious solution was to cache the results. That way if I spool thread X it might take a minute, but then the result is saved on the server, so if someone else spools thread X, it happens instantly because it doesn't need to even talk to Twitter to get the result.

But this opens up a huge can of worms: people can set their accounts to private/locked. People can delete tweets. If I save tweets, I'd need to check in with Twitter every single time someone accessed a saved tweet, to see if the user still had permission to see it. Which would end up costing me 1 second per tweet anyway. The other option would be to ignore the privacy of users and continue caching their deleted or private tweets. That would be an enormous dick move, so I opted to not do that.

Design decision — I had to accept the one second per tweet limit. And it turns out the lazy programming decision (query Twitter about every tweet every time) is the most compliant with the Twitter terms of service and probably the best thing for user privacy. Hooray laziness!

A slow and flawed tool that actually exists

There were of course other design decisions I had to make. What choice of CSS to use, what the thing would look like on mobile, which websites to embed rich media from, and of course tons of technical design decisions around application architecture. But I hope that by outlining some of the more high-level design decisions I've maybe helped shed some light on why this app behaves the way it does.

Mostly I wanted to make a useful and thought-provoking thing. It has to work, but it doesn't have to be fast. It has to be convenient, but not at the cost of other people's privacy. And it has to actually exist, even if it's a little flawed.