bracket-meme-bot

How I made Bracket Meme Bot

by Darius Kazemi, Mar 28, 2018

time for my Professional wrestlers from Manitoba bracket pic.twitter.com/5UmF2VAYt7
— Bracket Meme Bot (@BracketMemeBot) March 28, 2018

A lot of people have been asking me how Bracket Meme Bot works, so I decided to write up how I came up with the algorithm and some of the important decisions I had to make along the way.

Inspiration

Yesterday Jake Rodkin tweeted the following at me:

@tinysubversions I definitely just searched twitter for “bracket from:tinysubversions” because it seems inevitable.
— Jake Rodkin (@ja2ke) March 27, 2018

What Jake was referring to is the "bracket meme" that is going around Twitter. It's March, so basketball tournament brackets are a thing, but somebody posted a bracket of 16 different Disney movies and of course, tons of people started arguing about which would win and eventually making their own brackets for other categories. Jake later said he was specifically inspired by this hilarious tweet from Claire Hummel:

can I make a bracket so boring that no one will @ me about it pic.twitter.com/leo839vXaP
— Claire Hummel🏜🐍 (@shoomlah) March 27, 2018

Anyway! I thought about the format of the meme. Well, the part where you draw text on an empty bracket image is easy enough. The challenge is in figuring out how to get a list of 16 things that are all related in some way where it would make sense to pit one against another. You could just pick random stuff and put it in there and it might be funny, but it wouldn't prompt really interesting conversations, and wouldn't be that similar to the original meme.

Where to get the data

When it comes to "things that are related to other things", there are two sources that I usually go to: ConceptNet and Wikipedia.

ConceptNet is very wide-ranging and pulls from a bunch of different data sources including Wikipedia itself. You might think this would be a natural choice, but ConceptNet, as you might expect, is more about general concepts like "gloom" and "razorfish" and "pencil" than it is about hyper-specific pop culture stuff. Whereas Wikipedia is replete with pop culture, and that's kind of key to this whole thing.

So, Wikipedia it is!

Defining our algorithm

In order to get the data, we need to define the algorithm that identifies it.

My first thought was to look at Wikipedia's famous "Lists". Like, there is a List of Walt Disney Animation Studios films, and there are tables of data in there that I could presumably grab stuff from. But if we look at other lists, they're not formatted in a consistent way at all. Some lists are tables, some are bullet points, some are lists of lists. And if there's no consistent semantic structure to the data, then there's no way for us to get it.

Fortunately, Wikipedia has another, better defined way of categorizing things: via the aptly-named Category. The Category system is basically just tagging: if you have an article on Marie Curie, you might file her in the 20th century physicists category and the women physicists category. (Curie is actually in 3 or 4 dozen different categories!)

So the first step of our algorithm is:

grab a random Wikipedia Category

But wait... Marie Curie herself is a Category!! Which makes some sense. Her category contains movies about her life, places named after her, her relatives, and so on. But it wouldn't make sense to have a "Marie Curie" bracket where you pit her children against movies against the Curie Institute in Warsaw.

What we want is lists of things like: Disney films, soccer players, buildings in NYC, and cat breeds. What do all those things have in common? Well: they all have a plural noun in them. We can use a part of speech tagger, which is a kind of program where you give it a word and it tells you its best guess at the part of speech for that word. We'll get to the technical bit in the next section, but for now, our algorithm looks like this:

grab a random Wikipedia Category
- ...that has a plural noun in its title
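
To make the plural-noun check concrete, here is a minimal Python sketch (not the bot's actual code). The `toy_tagger` function and its `TOY_TAGS` lookup table are hypothetical stand-ins for a real part of speech tagger, and the tags follow the Penn Treebank convention, where `nns` marks a plural noun.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real part-of-speech tagger: a tiny lookup table
# with Penn Treebank-style tags ("nns" = plural noun, "nnp" = proper noun).
TOY_TAGS = {
    "disney": "nnp",
    "films": "nns",
    "cat": "nn",
    "breeds": "nns",
    "marie": "nnp",
    "curie": "nnp",
}

Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def toy_tagger(words: List[str]) -> List[Tuple[str, str]]:
    """Tag each word from the lookup table; unknown words default to 'nn'."""
    return [(word, TOY_TAGS.get(word.lower(), "nn")) for word in words]


def has_plural_noun(title: str, tagger: Tagger = toy_tagger) -> bool:
    """Return True if any word in the category title is tagged as a plural noun."""
    return any(tag == "nns" for _, tag in tagger(title.split()))
```

With this toy table, "Disney films" and "Cat breeds" pass the check and "Marie Curie" does not. A real tagger would be passed in through the `tagger` argument.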

And of course, we only want categories with at least 16 things in them. If you look at a Category page, you'll see it contains both "Subcategories" and "Pages". We only care about "Pages" so:

grab a random Wikipedia Category
- ...that has a plural noun in its title
- ...that has at least 16 "Pages"

Well, now we have to actually get those pages and put them on an image, so:

grab a random Wikipedia Category
- ...that has a plural noun in its title
- ...that has at least 16 "Pages"
get the list of Pages in the Category
pick 16 of them at random and draw them on the bracket
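
The last step can be sketched in a few lines of Python. This is an illustration only: the `first_round` name is made up, and pairing the picks into eight matchups is an assumption about how the names might be laid out on the bracket image, which this sketch does not draw.

```python
import random
from typing import List, Optional, Tuple


def first_round(pages: List[str], rng: Optional[random.Random] = None) -> List[Tuple[str, str]]:
    """Pick 16 pages at random and pair them into 8 first-round matchups."""
    if len(pages) < 16:
        raise ValueError("need at least 16 pages to fill the bracket")
    rng = rng or random.Random()
    # sample() picks without replacement, so no page appears twice
    picks = rng.sample(pages, 16)
    return [(picks[i], picks[i + 1]) for i in range(0, 16, 2)]
```

Passing a seeded `random.Random` makes the output repeatable, which is handy when debugging the drawing code.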

This is a good start and probably the best we can do before we sit down to actually code the thing.

Talking to Wikipedia

So, technically speaking, how do we tell a computer to get this information?

I have written in the past about how to query Wikipedia like a database, and I originally tried to use that method for this bot (since I never rewrite code if I can just copy-paste it). So I started playing around with DBPedia, but it turns out that DBPedia doesn't have a way to just... grab a random category.

I then looked at the other major option for getting data from Wikipedia, which is using the MediaWiki API. (MediaWiki is the software that powers Wikipedia, and you can use this API on any wiki that runs on MediaWiki, not just Wikipedia.)

The API Query documentation has this useful table:

Page type	Example	Used in the given page(s)	Which pages have it	List all in the wiki
Page link	[[Page]]	prop=links	list=backlinks	list=alllinks
Template transclusion	{{Template}}	prop=templates	list=embeddedin	list=alltransclusions
Categories	[[category:Cat]]	prop=categories	list=categorymembers	list=allcategories
Images	[[file:image.png]]	prop=images	list=imageusage	list=allimages
Language links	[[ru:Page]]	prop=langlinks	list=langbacklinks
Interwiki links	[[meta:Page]]	prop=iwlinks	list=iwbacklinks
URLs	https://mediawiki.org	prop=extlinks	list=exturlusage

Okay, maybe it could be a little clearer. I had to kind of squint at it for five minutes to understand what it means. Eventually I figured out that if we want to query for "all Categories in the wiki" we need to use this list=allcategories query, and if we want to know what pages are in a Category, we use the list=categorymembers query.
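
Here is a rough Python sketch of what those two queries look like as URLs. The `query_url` helper is hypothetical, and the English Wikipedia endpoint is an assumption. The parameter names (`action`, `list`, `format`, `aclimit`, `cmtitle`, `cmtype`) come from the MediaWiki API. Nothing is sent over the network here; the code only builds the URLs.

```python
from urllib.parse import urlencode

# Assumption: we are querying English Wikipedia's MediaWiki API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def query_url(list_name: str, **params) -> str:
    """Build a MediaWiki 'action=query' URL for the given list module."""
    query = {"action": "query", "list": list_name, "format": "json"}
    query.update(params)
    return f"{API_URL}?{urlencode(query)}"


# "All Categories in the wiki", 500 at a time
all_categories = query_url("allcategories", aclimit=500)

# "What pages are in a Category" (cmtype=page leaves out subcategories)
members = query_url("categorymembers", cmtitle="Category:Women physicists", cmtype="page")
```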

Refining the algorithm

So how does this stack up to our algorithm? The first thing we want to do is "grab a random Wikipedia Category", so let's see what that allcategories thing can do.

If we go to the allcategories documentation, it becomes clear pretty quickly that:

there is no built-in way to grab a "random" category, but you can say "starting at this alphabetical index, give me the next 500 categories"
it will tell us how many Pages are in a category but it won't let us filter by that
it will let us filter by number of members, which is different from number of Pages -- members is the combined total of Pages and Subcategories. Is this documented anywhere? Not that I could find. I had to just perform the query manually and then see what came back. Technology is terrible.
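
Since the API can filter on members but not on Pages, the Pages check has to happen on our side after the results come back. A minimal Python sketch, assuming each returned record carries a `pages` count next to its total `size`. The sample records are invented for illustration and the exact field names are an assumption about the response shape.

```python
from typing import Dict, List


def with_enough_pages(categories: List[Dict], min_pages: int = 16) -> List[Dict]:
    """Keep only categories whose Pages count (not total members) is high enough."""
    return [c for c in categories if c.get("pages", 0) >= min_pages]


# Invented sample data: both have 16+ members, but only one has 16+ Pages.
sample = [
    {"category": "Example A", "size": 20, "pages": 4, "subcats": 16},
    {"category": "Example B", "size": 18, "pages": 18, "subcats": 0},
]

kept = with_enough_pages(sample)
```

"Example A" would pass a members filter of 16 on the API side but is dropped here because only 4 of its 20 members are Pages.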

So our proposed algorithm is incompatible with the software we have at hand. At this point we have two options:

adjust the algorithm to fit the tech
write new tech that does what we want

HERE IS WHERE A VAST MAJORITY OF ENGINEERS COMPLETELY SCREW UP!

See, as engineers, we are trained to think that if something doesn't work as we intend it to, well by golly we can fix the tech so that it does work the way we intend it to. So our instinct tells us that the software not working the way we want it to is a bug. We are inclined to write new tech that matches the algorithm we wrote in our head.

This is entitlement. "The design vision I had in my head for this system is correct, therefore I will implement the software to my vision." When people like me talk about technology built with a "colonialist mindset", most techies' eyes will glaze over, but this is a big part of what we mean when we say it. It is the same mindset where a colonizer lands on foreign soil and says, "My god, look at these savages, living in a way completely different from my own mental model. This is a bug and I am here to fix it."

What if, instead of bending the world to our will, we work with and around the world as it is? This seems like a high-minded thing to consider when the decision in front of us is "do I write new code or figure out another way around the problem" but hear me out: all of your moral values as a person come into play in literally every action you take. So you might as well be aware of them.

So back to the screw-up.

The screw-up decision here would be to write some kind of Wikipedia parser or scraper that maybe downloads a Wikipedia dump, pre-filters everything by our requirements, and then does everything exactly the way we originally envisioned it.

The right decision here, and I mean "correct" or "just", is to simply change our brilliant initial design and move on. It will probably change the outcome of the project and what it looks like. This is okay.

So we refine our algorithm. Since we can't grab a "random" category, but we can search for categories that start with a sequence of characters, maybe we can pick a random letter and then search for categories that start with that letter.

Unfortunately this means if we ask for 500 categories (the max the API returns) that start with "B" with at least 16 members, we end up with an alphabetical list that starts at "B'z album covers" and ends at "B-Class Indian districts articles". You can see the result of the query here.

So asking for the first letter will just give us a small subset of categories that start with that letter. What if, instead, we picked 3 random letters and searched for those? Well, the problem here is that most combinations would be like "qqz" and not give us any results.

The solution I ended up picking was to grab a random English word and then search for the first three letters of that word. That way we'd be guaranteed valid sets of letter triplets, and we'd even be biased in favor of the more common ones.
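
A small Python sketch of that idea. The `WORDS` list is a hypothetical stand-in for a real dictionary or word generator, and `random_prefix` is a made-up name.

```python
import random
from typing import Optional, Sequence

# Hypothetical stand-in for a real source of random English words.
WORDS = ["horse", "bracket", "physicist", "garden", "album"]


def random_prefix(words: Sequence[str], rng: Optional[random.Random] = None, length: int = 3) -> str:
    """Return the first `length` letters of a randomly chosen word."""
    rng = rng or random.Random()
    # Skip words too short to supply a full prefix
    candidates = [w for w in words if len(w) >= length]
    if not candidates:
        raise ValueError("no word is long enough to supply a prefix")
    return rng.choice(candidates)[:length]
```

Because the prefix comes from a real word, triplets like "qqz" can't come up, and prefixes shared by many words are picked more often.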

Our new algorithm:

get the first 3 letters of a random dictionary word
grab 500 Wikipedia Categories with 16+ members in alphabetical order starting with those 3 letters
- ...use a part of speech tagger to only keep ones that have a plural noun in their title
- ...and that have at least 16 "Pages"
randomly pick one of these narrowed-down categories
get the list of Pages in the Category
pick 16 of them at random and draw them on the bracket

So if I grab the word "horse" and then do the second step of the algorithm with "hor", we get this result, 500 categories starting with "Horace Mann School alumni" and ending in "Horticulturists and gardeners".
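
For reference, the second step might translate into query parameters like the ones in this Python sketch. Treat it as an assumption rather than the bot's exact request: it uses `acprefix` to match the three letters, though `acfrom` (start listing at a given title) would fit the "starting at this alphabetical index" behavior too. Capitalizing the first letter is also an assumption, based on category titles starting with a capital.

```python
from typing import Dict, Union


def category_search_params(prefix: str, min_members: int = 16, limit: int = 500) -> Dict[str, Union[str, int]]:
    """Build allcategories parameters for one three-letter prefix."""
    return {
        "action": "query",
        "list": "allcategories",
        # Assumption: match on prefix, with the first letter capitalized
        "acprefix": prefix[:1].upper() + prefix[1:],
        "acmin": min_members,  # filters on members (Pages + Subcategories)
        "aclimit": limit,      # 500 is the most the API returns per request
        "acprop": "size",      # ask for the size information for each category
        "format": "json",
    }


params = category_search_params("hor")
```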

Nitty-gritty and source code

Here are the technical steps:

ask RitaJS for a random word.
take the first 3 letters of that word and do a GET request on this endpoint
use Rita again, this time to get the part of speech tags for each category
throw away anything that doesn't have a word that is tagged nns (a plural noun)
throw away anything with fewer than 16 Pages
throw away a few other "banned terms" that gave consistently boring results (for example, anything with "articles" in it since that is usually a category that is a list of Wikipedia articles)
from the categories that remain, pick one at random
make a GET request to this endpoint to get all the Pages in the category
pick 16 and draw them on the image
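
The "banned terms" filter and the random pick are simple enough to sketch in Python. This is an illustration, not the bot's source: only "articles" is named above as a banned term, so the rest of the real list is unknown, and the function names are made up.

```python
import random
from typing import List, Optional, Sequence

# Only "articles" is mentioned in this post; the full list is not shown here.
BANNED_TERMS = ("articles",)


def drop_banned(titles: Sequence[str], banned: Sequence[str] = BANNED_TERMS) -> List[str]:
    """Remove category titles that contain any banned term (case-insensitive)."""
    return [t for t in titles if not any(term in t.lower() for term in banned)]


def pick_category(titles: Sequence[str], rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick one remaining category at random, or None if nothing survives."""
    remaining = drop_banned(titles)
    if not remaining:
        return None
    rng = rng or random.Random()
    return rng.choice(remaining)
```

Returning `None` when every category is filtered out lets the caller try again with a different random word.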

For very specific details, you can find the complete source code on Github.