Tiny Subversions
by Darius Kazemi, Mar 28, 2018
time for my Professional wrestlers from Manitoba bracket pic.twitter.com/5UmF2VAYt7 — Bracket Meme Bot (@BracketMemeBot), March 28, 2018
A lot of people have been asking me how Bracket Meme Bot works, so I decided to write up how I came up with the algorithm and some of the important decisions I had to make along the way.
Yesterday Jake Rodkin tweeted the following at me:
@tinysubversions I definitely just searched twitter for “bracket from:tinysubversions” because it seems inevitable. — Jake Rodkin (@ja2ke), March 27, 2018
What Jake was referring to is the "bracket meme" that is going around Twitter. It's March, so basketball tournament brackets are a thing, but somebody posted a bracket of 16 different Disney movies and of course, tons of people started arguing about which would win and eventually making their own brackets for other categories. Jake later said he was specifically inspired by this hilarious tweet from Claire Hummel:
can I make a bracket so boring that no one will @ me about it pic.twitter.com/leo839vXaP — Claire Hummel🏜🐍 (@shoomlah), March 27, 2018
Anyway! I thought about the format of the meme. Well, the part where you draw text on an empty bracket image is easy enough. The challenge is in figuring out how to get a list of 16 things that are all related in some way where it would make sense to pit one against another. You could just pick random stuff and put it in there and it might be funny, but it wouldn't prompt really interesting conversations, and wouldn't be that similar to the original meme.
When it comes to "things that are related to other things", there are two sources that I usually go to: ConceptNet and Wikipedia.
ConceptNet is very wide-ranging and pulls from a bunch of different data sources including Wikipedia itself. You might think this would be a natural choice, but ConceptNet, as you might expect, is more about general concepts like "gloom" and "razorfish" and "pencil" than it is about hyper-specific pop culture stuff. Whereas Wikipedia is replete with pop culture, and that's kind of key to this whole thing.
So, Wikipedia it is!
In order to get the data, we need to define the algorithm that identifies it.
My first thought was to look at Wikipedia's famous "Lists". Like, there is a List of Walt Disney Animation Studios films, and there are tables of data in there that I could presumably grab stuff from. But if we look at other lists, they're not formatted in a consistent way at all. Some lists are tables, some are bullet points, some are lists of lists. And if there's no consistent semantic structure to the data, then there's no way for us to get it.
Fortunately, Wikipedia has another, better defined way of categorizing things: via the aptly-named Category. The Category system is basically just tagging: if you have an article on Marie Curie, you might file her in the 20th century physicists category and the women physicists category. (Curie is actually in 3 or 4 dozen different categories!)
So the first step of our algorithm is: grab a random Wikipedia Category.
But wait... Marie Curie herself is a Category!! Which makes some sense. Her category contains movies about her life, places named after her, her relatives, and so on. But it wouldn't make sense to have a "Marie Curie" bracket where you pit her children against movies against the Curie Institute in Warsaw.
What we want is lists of things like: Disney films, soccer players, buildings in NYC, and cat breeds. What do all those things have in common? Well: they all have a plural noun in them. We can use a part-of-speech tagger, which is a kind of program where you give it a word and it tells you its best guess at the part of speech for that word. We'll get to the technical bit in the next section, but for now, our algorithm looks like this: grab a random Wikipedia Category, and if its name doesn't contain a plural noun, throw it away and grab another one.
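As a sketch of that check, here's a toy version in Python. The suffix heuristic below is a stand-in for a real part-of-speech tagger (which is what the bot actually uses), so treat `looks_plural` as a placeholder:

```python
import re

def looks_plural(word):
    # Toy stand-in for a part-of-speech tagger: a real tagger would
    # return a tag like "nns" for plural nouns. Here we just use a
    # crude suffix heuristic for illustration.
    word = word.lower()
    return word.endswith("s") and not word.endswith("ss")

def is_bracketable(category_title):
    # Keep a category only if some word in its title looks like a
    # plural noun, e.g. "Walt Disney Animation Studios films".
    words = re.findall(r"[A-Za-z']+", category_title)
    return any(looks_plural(w) for w in words)

print(is_bracketable("Walt Disney Animation Studios films"))  # True
print(is_bracketable("Marie Curie"))                          # False
```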
And of course, we only want categories with at least 16 things in them. If you look at a Category page, you'll see it contains both "Subcategories" and "Pages". We only care about "Pages", so: throw away any category with fewer than 16 Pages.
Well, now we have to actually get those pages and put them on an image, so: grab 16 Pages from the category and draw their titles onto an empty bracket image.
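Setting aside the image drawing itself, the pairing step can be sketched like this (the function name and the shuffle-then-pair approach are my own illustration, not necessarily how the bot seeds its brackets):

```python
import random

def first_round(items, seed=None):
    # Shuffle 16 entries and pair them off into 8 first-round
    # matchups; drawing them onto the bracket image is a separate step.
    if len(items) != 16:
        raise ValueError("a bracket needs exactly 16 entries")
    rng = random.Random(seed)
    entries = items[:]
    rng.shuffle(entries)
    return [(entries[i], entries[i + 1]) for i in range(0, 16, 2)]

matchups = first_round([f"Movie {n}" for n in range(1, 17)], seed=1)
print(len(matchups))  # 8
```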
This is a good start and probably the best we can do before we sit down to actually code the thing.
So, technically speaking, how do we tell a computer to get this information?
I have written in the past about how to query Wikipedia like a database, and I originally tried to use that method for this bot (since I never rewrite code if I can just copy-paste it). So I started playing around with DBPedia, but it turns out that DBPedia doesn't have a way to just... grab a random category.
I then looked at the other major option for getting data from Wikipedia, which is using the MediaWiki API. (MediaWiki is the software that powers Wikipedia, and you can use this API on any wiki that runs on MediaWiki, not just Wikipedia.)
The API Query documentation has this useful table:
| | Categories |
|---|---|
| Used in the given page(s) | prop=categories |
| Which pages have it | list=categorymembers |
| List all in the wiki | list=allcategories |
Okay, maybe it could be a little clearer. I had to kind of squint at it for five minutes to understand what it means. Eventually I figured out that if we want to query for "all Categories in the wiki" we need to use the list=allcategories query, and if we want to know what pages are in a Category, we use the list=categorymembers query.
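Both kinds of query are plain GET requests against the same endpoint; only the parameters change. Here's a rough sketch of what the URLs look like (the parameter names are the API's real ones; the little helper function is mine):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def api_url(**params):
    # Every MediaWiki API call shares the same endpoint; only the
    # query-string parameters change.
    base = {"action": "query", "format": "json"}
    base.update(params)
    return API + "?" + urlencode(base)

# All categories in the wiki (paged, up to 500 at a time):
print(api_url(list="allcategories", aclimit=500))

# Pages inside one specific category:
print(api_url(list="categorymembers",
              cmtitle="Category:Walt Disney Animation Studios films",
              cmlimit=16))
```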
So how does this stack up to our algorithm? The first thing we want to do is "grab a random Wikipedia Category", so let's see what that list=allcategories thing can do.
If we go to the allcategories documentation, it becomes clear pretty quickly that there is no way to ask for a random category: you can only page through categories alphabetically, or ask for categories that start with a given sequence of characters.
So our proposed algorithm is incompatible with the software we have at hand. At this point we have two options: write new software that works the way we originally envisioned, or change our algorithm to fit the tools that already exist.
HERE IS WHERE THE VAST MAJORITY OF ENGINEERS COMPLETELY SCREW UP!
See, as engineers, we are trained to think that if something doesn't work as we intend it to, well by golly we can fix the tech so that it does work the way we intend it to. So our instinct tells us that the software not working the way we want it to is a bug. We are inclined to write new tech that matches the algorithm we wrote in our head.
This is entitlement. "The design vision I had in my head for this system is correct, therefore I will implement the software to my vision." When people like me talk about technology built with a "colonialist mindset", most techies' eyes will glaze over, but this is a big part of what we mean when we say it. It is the same mindset where a colonizer lands on foreign soil and says, "My god, look at these savages, living in a way completely different from my own mental model. This is a bug and I am here to fix it."
What if, instead of bending the world to our will, we work with and around the world as it is? This seems like a high-minded thing to consider when the decision in front of us is "do I write new code or figure out another way around the problem" but hear me out: all of your moral values as a person come into play in literally every action you take. So you might as well be aware of them.
So back to the screw-up.
The screw-up decision here would be to write some kind of Wikipedia parser or scraper that maybe downloads a Wikipedia dump and then can pre-filter everything by our requirements and then do everything exactly the way we envisioned it originally.
The right decision here, and I mean "correct" or "just", is to simply change our brilliant initial design and move on. It will probably change the outcome of the project and what it looks like. This is okay.
So we refine our algorithm. Since we can't grab a "random" category, but we can search for categories that start with a sequence of characters, maybe we can pick a random letter and then search for categories that start with that letter.
Unfortunately this means if we ask for 500 categories (the max the API returns) that start with "B" and have at least 16 members, we end up with an alphabetical list that starts at "B'z album covers" and ends at "B-Class Indian districts articles". You can see the result of the query here.
So asking for the first letter will just give us a small subset of categories that start with that letter. What if, instead, we picked 3 random letters and searched for those? Well, the problem here is that most combinations would be like "qqz" and not give us any results.
The solution I ended up picking was to grab a random English word and then search for the first three letters of that word. That way we'd be guaranteed valid sets of letter triplets, and we'd even be biased in favor of the more common ones.
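In code, that step is tiny. Here's a sketch assuming some English word list is at hand (the four-word list below is obviously a placeholder): it picks a word, keeps the first three letters, and builds the corresponding allcategories query, using the API's acprefix and acmin parameters to ask for categories with that prefix and at least 16 members:

```python
import random
from urllib.parse import urlencode

# Placeholder word list for illustration; the bot presumably draws
# from a full English dictionary.
WORDS = ["horse", "gloom", "pencil", "razorfish"]

def prefix_query(rng=random):
    # Pick a random English word and keep its first three letters, so
    # the prefix is guaranteed to start at least one real word. Then
    # ask for up to 500 categories with that prefix and >= 16 members.
    prefix = rng.choice(WORDS)[:3].lower()
    params = {"action": "query", "format": "json",
              "list": "allcategories", "acprefix": prefix,
              "acmin": 16, "aclimit": 500}
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(prefix_query())
```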
Our new algorithm:
1. Grab a random English word.
2. Ask the API for up to 500 categories that start with that word's first three letters.
3. Throw out any category without a plural noun in its name, or with fewer than 16 Pages.
4. Pick one of the surviving categories at random, grab 16 of its Pages, and draw their titles on the bracket image.
So if I grab the word "horse" and then do the second step of the algorithm with "hor", we get this result, 500 categories starting with "Horace Mann School alumni" and ending in "Horticulturists and gardeners".
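Pulling the category names out of that response is straightforward. As far as I remember, the default JSON format puts each name under a "*" key; the trimmed-down response below assumes that shape, so treat it as illustrative rather than definitive:

```python
import json

# A trimmed-down allcategories response; the real one is larger, but
# this is the shape the default JSON format uses (each category name
# sits under the "*" key).
RESPONSE = json.loads("""
{"query": {"allcategories": [
    {"*": "Horace Mann School alumni"},
    {"*": "Horror films"},
    {"*": "Horticulturists and gardeners"}
]}}
""")

def category_names(response):
    # Pull just the names out of an allcategories query result.
    return [c["*"] for c in response["query"]["allcategories"]]

print(category_names(RESPONSE))
```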
Here are the technical steps: run a part-of-speech tagger over every word in a category's name, and keep the category only if at least one word is tagged nns (a plural noun).
For very specific details, you can find the complete source code on GitHub.