by Darius Kazemi, Apr 13, 2015
Last week I released @SortingBot, a Twitter Sorting Hat bot. If you follow it on Twitter, it adds you to a queue, and when it's your turn it gives you a short rhyming couplet sorting you into a Hogwarts house from the Harry Potter books.
If you're unfamiliar with the Sorting Hat or have only seen the movies, the Hat usually introduces the Sorting with a song—though the melody is never specified and they read to me more like sing-songy poems than song lyrics. The songs are written in a somewhat loose iambic tetrameter, aka 'da DUM da DUM da DUM da DUM', arranged into ABCB rhyming quatrains. Here's an example from the first Harry Potter book:
You might belong in Gryffindor,
Where dwell the brave at heart,
Their daring, nerve and chivalry,
Set Gryffindors apart;
Just a few months ago I created @YearlyAwards, a bot that used a "follow to get customized absurd content" model, which is a model I picked up from @irondavy's excellent @robotuaries a couple years ago.
A bot that sorts people into Harry Potter houses isn't that intrinsically interesting to me, but the prospect of making it generate short quatrains in a Sorting Hat style was too good to pass up. I had just discovered RiTa, an all-in-one text processing and generation tool that came highly recommended to me from other botmakers. I've done poetry generation projects before, but never anything that attempted strict meter, so this was a pretty fun challenge.
Note: this is a loooooong article. If you'd like to skip straight ahead to the source code for the full bot, it's here.
The first thing I did was re-read all the Sorting Hat songs several times to make sure I had the right idea and tone. That's where I picked up on the exact meter and rhyme scheme, and also where I noticed that the Hat likes to refer to itself a lot! Another example from the first book:
For I'm the Hogwarts Sorting Hat
And I can cap them all.
So I thought up an example of my own:
The body of an antelope, the wisdom of an owl,
I'm putting you in Slytherin for you are really foul
I decided to make them into rhyming AA octameter couplets instead of strict quatrains by removing two of the line breaks. Mostly this was based on a hunch that 2 lines of ~60 chars each reads "easier" on Twitter than 4 lines of ~30 characters each.
So how would I generalize the above animal-based couplet? My first attempt was something like this:
The body of [an] [animal #1, with a 1/0/1 stress], the wisdom of [an] [animal #2, with a 1 stress],
I'm putting you in Slytherin for you are really [adjective with 1 stress rhyming with animal #2]
Well okay! When I'm talking about "stress" and using weird numbers up there, what I mean is the way syllables are stressed in a word. When you say a stress is 1/0/1, that means it's a word like KAN-ga-ROO or CHRIST-mas-TIME. A word like "a-QUAR-i-UM" is 0/1/0/1, etc. A single syllable word is always 1.
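To make the notation concrete, here is a minimal sketch that turns a hand-annotated word into a stress string. The stressPattern() helper and the "UPPERCASE means stressed" convention are assumptions made for illustration; they are not part of RiTa, which looks stresses up in its lexicon instead.

```javascript
// Convert a hand-annotated word such as "KAN-ga-ROO" into a stress string.
// Convention (an assumption for this sketch): UPPERCASE syllables are stressed.
function stressPattern(annotated) {
  return annotated
    .split('-')
    .map(function (syllable) {
      return syllable === syllable.toUpperCase() ? '1' : '0';
    })
    .join('/');
}

console.log(stressPattern('KAN-ga-ROO'));  // 1/0/1
console.log(stressPattern('a-QUAR-i-UM')); // 0/1/0/1
console.log(stressPattern('DOG'));         // 1
```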
This means we need to do a few things:
For the stress and the rhyming, these are both things that the aforementioned RiTa is really good at. The getStresses() function can be fed a word or an array of words and return their stresses as "0/1/0" style strings, and rhymes() returns a list of words that rhyme with a given word.
Let's build a simple version of the @SortingBot generator, just using that one animal-based couplet. The first thing we do is install our modules for RiTa, and also Underscore (a set of utilities that makes manipulating data in JavaScript really easy).
$ npm install underscore --save
$ npm install rita --save
The list of animals part is pretty easy. I run a project called Corpora, which is a repository of lists of stuff. One of those lists is a list of common animals, so we download that and save it to a file called animals.json like so:
$ wget https://raw.githubusercontent.com/dariusk/corpora/master/data/animals/common.json -O animals.json
Now we open a new file, index.js, and include all the modules and the data (along with a little helper function, Array.prototype.pick()) in our program:
var _ = require('underscore');
var RiTa = require('rita');
var lexicon = new RiTa.RiLexicon();
var rita = RiTa.RiTa;
var animals = require('./animals.json').animals;
// a helper function that we add to all arrays
// this picks a random element from an array
Array.prototype.pick = function() {
  return this[Math.floor(Math.random()*this.length)];
};
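Adding a method to Array.prototype changes every array in the program, which is fine for a small bot but can clash with other code. If that is a concern, a standalone function does the same job. This is a sketch, not what the bot uses: pickFrom() and its optional random parameter are hypothetical names, and the parameter exists only so the randomness can be swapped out when testing.

```javascript
// Pick a random element without extending Array.prototype.
// `random` is an optional function returning a number in [0, 1);
// it defaults to Math.random.
function pickFrom(array, random) {
  var rand = random || Math.random;
  return array[Math.floor(rand() * array.length)];
}

console.log(pickFrom(['antelope', 'hog', 'fox']));
// An empty array yields undefined, same as the prototype version.
console.log(pickFrom([]));
```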
If we want to build out our couplet, a good order would be: first pick an animal with a 1/0/1 stress, then pick an animal with a 1 stress.
So we want to be able to say "give me an animal that has a stress of X". We can write a single function to do this!
function getAnimalsByStress(stress) {
  // filter our list of animals
  return _.filter(animals, function(animal) {
    // only include animals with the stress we're looking for
    return rita.getStresses(animal) === stress;
  });
}
console.log(getAnimalsByStress('1/0/1'));
Here we are passing "1/0/1" to the function, and then it's using Underscore's filter to whittle down the array of animals to just those animals who match our stress. (The filter function is available in vanilla Node.js, but I like to use Underscore's chaining and other nice features.) The end result we print out is:
[ 'antelope',
'buffalo',
'crocodile',
'kangaroo',
'ocelot',
'parakeet',
'porcupine',
'wolverine' ]
Now you might notice this runs very slowly. We'll mostly ignore performance for this article, but what I ended up doing to streamline performance for the final bot was precomputing the stresses for all the animals and including them in the JSON file alongside each animal name.
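The precomputing idea can be sketched like this. The data shape below (objects with name and stress fields) and the function names are assumptions for illustration; the bot's actual JSON layout may differ. The point is that the stress lookup happens once, ahead of time, and the run-time query becomes a plain object lookup.

```javascript
// Hypothetical precomputed data: each animal carries its stress string,
// so no lexicon lookup is needed while the bot is running.
var animalsWithStress = [
  { name: 'antelope', stress: '1/0/1' },
  { name: 'kangaroo', stress: '1/0/1' },
  { name: 'hog', stress: '1' },
  { name: 'fox', stress: '1' }
];

// Group the animal names by stress pattern, once, at startup.
function buildStressIndex(entries) {
  var index = Object.create(null);
  entries.forEach(function (entry) {
    if (!index[entry.stress]) {
      index[entry.stress] = [];
    }
    index[entry.stress].push(entry.name);
  });
  return index;
}

var stressIndex = buildStressIndex(animalsWithStress);

// Same question as getAnimalsByStress(), answered from the index.
function lookupAnimalsByStress(stress) {
  return stressIndex[stress] || [];
}

console.log(lookupAnimalsByStress('1/0/1')); // [ 'antelope', 'kangaroo' ]
console.log(lookupAnimalsByStress('0/1'));   // []
```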
Armed with getAnimalsByStress(), we can now build out our first line:
var couplet = 'The body of a ' + getAnimalsByStress('1/0/1').pick() + ', the wisdom of a ' + getAnimalsByStress('1').pick();
console.log(couplet);
Sample results:
The body of a antelope, the wisdom of a hog
The body of a wolverine, the wisdom of a fox
The body of a crocodile, the wisdom of a dog
That looks okay, although we now have a problem where the indefinite article "a/an" doesn't always match up. We have to compute that somehow. We COULD write a simple function like this, which checks the first letter of a word to see if it starts with a vowel:
function a(word) {
  var result = 'a';
  var first = word[0];
  if (first === 'a'
    || first === 'e'
    || first === 'i'
    || first === 'o'
    || first === 'u') {
    result = 'an';
  }
  return result + ' ' + word;
}
...but that wouldn't catch situations like "an hour". Fortunately, RiTa comes to the rescue here with its getPhonemes() function. This function takes a word like "elephant" and returns its pronunciation: 'eh-l-ax-f-ax-n-t'. You can read the full list of phonemes if you like. So instead of checking to see if the word itself starts with a vowel, we can check if the pronunciation of the word starts with a vowel:
function a(word) {
  var result = 'a';
  // first character of the phoneme string, e.g. 'e' for 'eh-l-ax-f-ax-n-t'
  var first = rita.getPhonemes(word)[0];
  if (first === 'a'
    || first === 'e'
    || first === 'i'
    || first === 'o'
    || first === 'u') {
    result = 'an';
  }
  return result + ' ' + word;
}
You'll also note that I'm naming the function a. Normally I recommend against using one-letter function names, but I like how this ends up being more readable in the context of what we're doing. Look at the code now:
var animal_101 = getAnimalsByStress('1/0/1').pick();
var animal_1 = getAnimalsByStress('1').pick();
var couplet = 'The body of ' + a(animal_101) + ', the wisdom of ' + a(animal_1);
console.log(couplet);
When I read that third line there, it looks mostly like English! So I make an exception to my usual "no single-letter function names" rule here. Anyway, now we're correctly handling indefinite articles:
The body of an antelope, the wisdom of a hog
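To see why checking the pronunciation matters, here is a self-contained sketch that takes the phoneme string as an argument instead of calling RiTa. The articleFor() name is hypothetical, and apart from the "elephant" string quoted earlier, the phoneme strings are hand-written approximations in RiTa's style, so treat them as assumptions.

```javascript
// Choose "a" or "an" from a phoneme string such as 'eh-l-ax-f-ax-n-t'.
// Assumes vowel phonemes start with a, e, i, o or u and consonant
// phonemes do not, which is the same assumption the a() function makes.
function articleFor(word, phonemes) {
  var startsWithVowelSound = /^[aeiou]/.test(phonemes);
  return (startsWithVowelSound ? 'an' : 'a') + ' ' + word;
}

console.log(articleFor('elephant', 'eh-l-ax-f-ax-n-t'));  // an elephant
// Hand-written phoneme strings (assumed, not looked up):
console.log(articleFor('hour', 'aw-er'));                 // an hour
console.log(articleFor('unicorn', 'y-uw-n-ih-k-ao-r-n')); // a unicorn
```

A spelling-based check would get both "hour" and "unicorn" wrong, since the first letter and the first sound disagree in each case.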
The next thing we need to do is take the last word of the first line ("hog" in the above case) and get a list of adjectives that rhyme with it. RiTa provides the rhymes() function, which returns an array of rhymes for a given word. That function is part of RiTa's lexicon object, which we defined way back up at the top of our file. Getting a list of rhymes for "hog" is simple:
var lexicon = new RiTa.RiLexicon();
console.log(lexicon.rhymes('hog'));
// [ 'backlog', 'bog', 'clog', 'fog', 'frog', 'jog', 'slog', 'smog' ]
RiTa also provides a getPosTags() function which returns an array of part of speech tags for a word:
var rita = RiTa.RiTa;
console.log(rita.getPosTags('apple'));
// [ 'nn' ]
It's a little messy (for example, it thinks "bear" is always a verb), but it gets the job done simply.
We can now generalize this to a function that grabs rhymes and then filters the results to find any adjectives in the set:
function getAdjRhymes(word) {
  var rhymes = lexicon.rhymes(word);
  return _.filter(rhymes, function(rhyme) {
    // only return rhymes that are tagged 'jj' (adjectives)
    return rita.getPosTags(rhyme)[0] === 'jj';
  });
}
What happens if we try this with "money"?
console.log(getAdjRhymes('money'));
// [ 'funny', 'sunny' ]
Excellent! We get two adjectives that rhyme with "money". Let's try this on "hog":
console.log(getAdjRhymes('hog'));
// []
Wait what happened? Well it turns out that RiTa is unaware of any adjectives that rhyme with hog. Imagine if we tried to get a rhyming adjective for "orange"—we'd get the same result, an empty array.
So how do we handle this error? Well, ideally we'd like to try again. But if there are no rhyming adjectives for "hog", and we need a rhyme, then what we really need to do is pick a new one-syllable animal and test THAT to see if it has a rhyming adjective.
At this point, we could write code that does that, but we're going to need this many times in the future and there's an easier solution: just redo the whole couplet if we hit any invalid cases! This isn't great for performance, but I'm not engineering this for performance.
This is the part where programming becomes not-fun. I'm sorry. It's boring, awful work, and you have to do stuff like this if you're going to spend your life trying to tell computers what to do.
Let's build a broken version first, with no error handling:
function makeCouplet() {
  var animal_101 = getAnimalsByStress('1/0/1').pick();
  var animal_1 = getAnimalsByStress('1').pick();
  var couplet = 'The body of ' + a(animal_101) + ', the wisdom of ' + a(animal_1);
  couplet += '\nToday you join with Gryffindor since you are ' + getAdjRhymes(animal_1).pick();
  return couplet;
}
Sample output:
The body of a porcupine, the wisdom of a boar
Today you join with Gryffindor since you are postwar
The body of a buffalo, the wisdom of an ape
Today you join with Gryffindor since you are shipshape
The body of a parakeet, the wisdom of a lynx
Today you join with Gryffindor since you are undefined
Two problems with this output. First, the problem we already know about: if there's no rhyme (like for "lynx"), then it spits out "undefined" because we're calling pick() on an empty array. The second problem, which we hadn't foreseen, is that the meter is still off. I don't just need an adjective that rhymes with a word. I need an adjective of a certain meter that rhymes with a word.
So I have a function that returns an array of rhyming adjectives. Now I need a function that filters the list of adjectives so they only have a certain meter. Ugh. Here we go, it's similar to our getAnimalsByStress() function:
function getWordsByStress(words, stress) {
  return _.filter(words, function(word) {
    // only include words with the stress we're looking for
    return rita.getStresses(word) === stress;
  });
}
So now we're rewriting our makeCouplet() function to check the rhyming words for a particular stress, in this case "1/0/1":
function makeCouplet() {
  var animal_101 = getAnimalsByStress('1/0/1').pick();
  var animal_1 = getAnimalsByStress('1').pick();
  var couplet = 'The body of ' + a(animal_101) + ', the wisdom of ' + a(animal_1);
  // keep only the rhyming adjectives that fit the 1/0/1 meter
  var rhymes = getWordsByStress(getAdjRhymes(animal_1), '1/0/1');
  couplet += '\nToday you join with Gryffindor since you are ' + rhymes.pick();
  return couplet;
}
Results:
The body of an antelope, the wisdom of a boar
Today you join with Gryffindor since you are antiwar
The body of a kangaroo, the wisdom of a mink
Today you join with Gryffindor since you are undefined
We got our first good result with that boar/antiwar rhyme! We're still seeing an error for "mink" though, because even though "pink" is a rhyme for it, "pink" is not a "1/0/1" stress, it's a "1" stress.
Now let's put in that error handling. What we're going to do is test separately to see if getAdjRhymes() or getWordsByStress() comes back empty. If so, we'll just call makeCouplet() and try again. Is this performant? Not really. Does it work? Yes:
function makeCouplet() {
  var animal_101 = getAnimalsByStress('1/0/1').pick();
  var animal_1 = getAnimalsByStress('1').pick();
  var couplet = 'The body of ' + a(animal_101) + ', the wisdom of ' + a(animal_1);
  var adjRhymes = getAdjRhymes(animal_1);
  // no rhyming adjectives at all: start over with new animals
  if (adjRhymes.length === 0) {
    return makeCouplet();
  }
  var rhymes = getWordsByStress(adjRhymes, '1/0/1');
  // rhymes exist but none fit the meter: start over
  if (rhymes.length === 0) {
    return makeCouplet();
  }
  couplet += '\nToday you join with Gryffindor since you are ' + rhymes.pick();
  return couplet;
}
This takes about 5 seconds to run on average, which is kind of awful, but it works:
The body of an antelope, the wisdom of a newt
Today you join with Gryffindor since you are absolute
The body of an ocelot, the wisdom of an ox
Today you join with Gryffindor since you are orthodox
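Retrying by recursion is simple, but if the word lists ever made success impossible it would recurse until the stack overflowed. A hedged alternative is a loop with an attempt cap. In this sketch, retry() and the attempt callback are hypothetical: the callback stands in for a version of makeCouplet() that returns null instead of calling itself when it hits a dead end.

```javascript
// Call `attempt` until it returns something other than null,
// giving up after `maxAttempts` tries.
function retry(attempt, maxAttempts) {
  for (var i = 0; i < maxAttempts; i++) {
    var result = attempt();
    if (result !== null) {
      return result;
    }
  }
  return null; // caller decides what to do when every attempt failed
}

// Stand-in for a couplet builder that fails twice before succeeding.
var calls = 0;
function flakyCouplet() {
  calls += 1;
  return calls < 3 ? null : 'The body of an ocelot, the wisdom of an ox';
}

console.log(retry(flakyCouplet, 10));
// The body of an ocelot, the wisdom of an ox
```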
One thing you'll notice if you run it a bunch of times is that only a small number of rhymes happen: ox/orthodox, fox/orthodox, newt/absolute, newt/resolute, etc. This is because not a lot of "1/0/1" adjectives rhyme with "1" animal names. It's not very exciting. One thing we can do is manually add new single-syllable animals to our list ("trout", "bug", etc). Another thing we can do is vary what we return for the rhyming adjective. What we really want is a "1/0/1" meter that we can fit in to the end of our rhyme. So it could be:
you are absolute
you are so astute
you are very cute
So we can generalize this further and write a get101Phrase() function that will provide either a "1/0/1" rhyme, "so" followed by a "0/1" rhyme, or "very" followed by a "1" rhyme:
// accepts a list of words and returns a list of 1/0/1 formatted phrases with "very" and "so" as padding
function get101Phrase(words) {
  var results = _.chain(words)
    .map(function(el) {
      var stress = rita.getStresses(el);
      if (stress === '1') {
        el = 'very ' + el;
        stress = '1/0/1';
      }
      else if (stress === '0/1') {
        el = 'so ' + el;
        stress = '1/0/1';
      }
      return {
        word: el,
        stress: stress
      };
    })
    // just get the meter we want
    .filter(function(el) {
      return el.stress === '1/0/1';
    })
    // just return the word, not the stress
    .map(function(el) {
      return el.word;
    })
    .value();
  return results;
}
And our makeCouplet() now looks like:
function makeCouplet() {
  var animal_101 = getAnimalsByStress('1/0/1').pick();
  var animal_1 = getAnimalsByStress('1').pick();
  var couplet = 'The body of ' + a(animal_101) + ', the wisdom of ' + a(animal_1);
  var adjRhymes = getAdjRhymes(animal_1);
  if (adjRhymes.length === 0) {
    return makeCouplet();
  }
  // pad shorter adjectives with "very" or "so" to reach the 1/0/1 meter
  var rhymes = get101Phrase(adjRhymes);
  if (rhymes.length === 0) {
    return makeCouplet();
  }
  couplet += '\nToday you join with Gryffindor since you are ' + rhymes.pick();
  return couplet;
}
Here's the output:
The body of a crocodile, the wisdom of a mink
Today you join with Gryffindor since you are very pink
The body of an ocelot, the wisdom of a whale
Today you join with Gryffindor since you are very pale
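The padding rules in get101Phrase() can also be read as a small lookup table from stress pattern to filler word. The sketch below is a hypothetical variant, not the bot's code: it takes words already paired with their stress strings (so it runs without RiTa), and the names PADDING and padTo101() are made up for illustration.

```javascript
// Filler that brings each stress pattern up to 1/0/1.
var PADDING = {
  '1/0/1': '',   // already fits: "absolute"
  '0/1': 'so ',  // "so astute"
  '1': 'very '   // "very cute"
};

// entries: [{ word: 'cute', stress: '1' }, ...]
// Returns phrases with a 1/0/1 meter; other patterns are dropped.
function padTo101(entries) {
  return entries
    .filter(function (entry) {
      return Object.prototype.hasOwnProperty.call(PADDING, entry.stress);
    })
    .map(function (entry) {
      return PADDING[entry.stress] + entry.word;
    });
}

console.log(padTo101([
  { word: 'absolute', stress: '1/0/1' },
  { word: 'astute', stress: '0/1' },
  { word: 'cute', stress: '1' },
  { word: 'beautiful', stress: '1/0/0' }
]));
// [ 'absolute', 'so astute', 'very cute' ]
```

Adding another padding rule then means adding one line to the table rather than another else-if branch.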
We can of course vary it up further to make it even more diverse:
var couplet = 'The ' + ['body','prowess','ethic'].pick() + ' of ' + a(animal_101) + ', the ' + ['wisdom','instinct'].pick() + ' of ' + a(animal_1);
So that whole process wasn't actually very easy but we have reasonable, funny output now. The whole bot works more or less the way I've described above. You can download the source code for this exercise.
You can also see the entire source code for @SortingBot here.
Oh right. That. The actual sorting part is totally random.
var house = ['Gryffindor', 'Ravenclaw', 'Hufflepuff', 'Slytherin'].pick();
¯\_(ツ)_/¯
The whole thing was made like 30% easier because each house has a "1/0/1" meter, which I never noticed until building this bot!
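Because every house name shares that meter, the house can be dropped into the same slot of the second line without disturbing the rhythm. A minimal sketch, where secondLine() is a hypothetical helper and the phrase argument is whatever 1/0/1 phrase was picked earlier:

```javascript
var houses = ['Gryffindor', 'Ravenclaw', 'Hufflepuff', 'Slytherin'];

// Build the second line of the couplet for any house.
function secondLine(house, phrase) {
  return 'Today you join with ' + house + ' since you are ' + phrase;
}

// Print the same line once per house to compare the rhythm.
houses.forEach(function (house) {
  console.log(secondLine(house, 'very pale'));
});
```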