by Darius Kazemi, Jan 19, 2015
About a month ago, I got an email from Paul Ford asking a small group of creative programmers: How would you extract aphorisms from a large body of text? I dove right into the challenge since it's up my alley and ended up putting together a little bit of code that I used recently to build Hottest Startups, my latest Twitter bot.
Let's start out by being incredibly naive about aphorisms. I always like to start with the easiest, most naive solution possible, because sometimes you get lucky and the easy solution is Good Enough and you can implement it and go do something else with your life.
First question: what is an aphorism? A good start would be this list of proverbs that I scraped two months ago. It's part of my Corpora project, which is a collection of small lists of stuff that you can use for creative coding.
A very cursory glance at the aphorism list tells me that an aphorism is a sentence less than 50 characters long. Let's start with that as our (probably bad) theory: an aphorism is a short sentence. How would we test it?
Well there's a text file of about 40 top Project Gutenberg ebooks that I happen to have lying around so let's suck that text in, apply our naive ideas, and see what we get.
I'm using NodeJS to do the text manipulation, not because it's especially good at this stuff, but because it's what I'm used to using. In addition to vanilla NodeJS, we're going to use Underscore to do data manipulation.
var _ = require('underscore');
var fs = require('fs');
fs.readFile('../gutencorpus/lib/corpus.txt', 'utf8', function(error, data) {
  // Bail out early if the corpus file could not be read.
  if (error) throw error;
  // Break our text file up into individual sentences (approximate).
  var sentences = data.match( /[^\.!\?]+[\.!\?]+/g );
  console.log('Number of sentences:', sentences.length);
  // Filter our list of sentences so we only keep ones
  // matching certain criteria
  sentences = _.filter(sentences, function(el) {
    // trim the whitespace around a sentence
    el = el.trim();
    // must be less than 50 chars in length
    return el.length < 50;
  });
  console.log('Number of sentences < 50 chars:', sentences.length);
  // print a sampling of ten results
  console.log(_.sample(sentences, 10));
});
And the output:
Number of sentences: 192969
Number of sentences < 50 chars: 82147
[ ' To say nothing of the fact that she is my cousin.',
' "My dear fellow, I am not quite serious.',
'" "I told you the real reason.',
' Kerrigan, will you take Miss Power?',
' W.',
' _Why I left the church of Rome?',
' Know what I mean?',
' Make haste, make haste.',
' Jonathan sleeping.',
' We ought to have hats modelled on our heads.' ]
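The stray ' W.' in that sample comes from the sentence splitter itself, which treats every period, exclamation mark, or question mark as the end of a sentence. Here is a minimal sketch of the same pattern run on a made-up string (the `splitSentences` helper and the sample text are mine, just for illustration):

```javascript
// Same sentence-splitting pattern as in the program above, wrapped in a
// hypothetical helper so it is easy to poke at.
function splitSentences(text) {
  // String.prototype.match returns null when nothing matches,
  // so fall back to an empty array.
  return text.match(/[^\.!\?]+[\.!\?]+/g) || [];
}

// Made-up sample text containing an abbreviation and an initial.
var sample = 'Mr. W. Smith left. Know what I mean? Make haste, make haste!';

// Abbreviations and initials get cut off as their own "sentences",
// and each piece after the first keeps its leading space.
console.log(splitSentences(sample));
```

So "approximate" really does mean approximate: initials, abbreviations, and numbered lists all get chopped into tiny fragments.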
Hmmmm. Maybe we should introduce a minimum sentence length too. Most sentences in the proverb list are greater than 20 chars in length.
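If you'd rather check a claim like that than eyeball it, something along these lines would do. This is only a sketch: `lengthSummary` is a hypothetical helper, and `sampleProverbs` is a three-item stand-in for the real proverb list, not the list itself.

```javascript
// Summarise how long the entries in a list of strings are, and what share
// of them falls strictly between `min` and `max` characters.
function lengthSummary(list, min, max) {
  var lengths = list.map(function (s) { return s.trim().length; });
  var inRange = lengths.filter(function (n) { return n > min && n < max; });
  return {
    shortest: Math.min.apply(null, lengths),
    longest: Math.max.apply(null, lengths),
    shareInRange: inRange.length / lengths.length
  };
}

// Hypothetical stand-in data; swap in the real proverb list to test the theory.
var sampleProverbs = [
  'Every dog has its day.',
  'Absence makes the heart grow fonder.',
  'Haste makes waste.'
];

console.log(lengthSummary(sampleProverbs, 20, 50));
```

Run against the full list, `shareInRange` would tell you how many real proverbs a 20-to-50 character window keeps.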
Our filter looks like this now:
sentences = _.filter(sentences, function(el) {
  el = el.trim();
  return el.length < 50
    && el.length > 20;
});
Output:
Number of sentences: 192969
Number after filter: 43256
[ ' Miss Lucas is married and settled.',
'88 | | 212 | 1.',
' Yes, but men often propose for practice.',
' "Whatever cannot ye keep yourself for, then?',
' You could hear them in Paris and New York.',
' She must go to your house, friend John.',
' A brave stave that--who calls?',
' Tell him I\'m Boylan with impatience.',
'" A pause and a whisper was followed by-- "Eh!',
' LYNCH: _(Lifting Kitty from the table)_ Come!' ]
Okay maybe we should just kill anything with double quotations, since those tend to be lines of dialogue which refer to things like other characters. And we're not looking for character names. We want Universal Truths! Since a lot of books use single quotes instead of double quotes for conversation, let's also get rid of sentences that contain more than one single-quote (so we still allow for some contraction use).
sentences = _.filter(sentences, function(el) {
  el = el.trim();
  return el.length < 50
    && el.length > 20
    && el.indexOf('"') === -1
    && (el.split("'").length -1 < 2);
});
Output:
Number of sentences: 192969
Number after filter: 32917
[ ' What is the matter, Tom?',
' --He knew what money was, Mr Deasy said.',
' Jingle jingle jaunted jingling.',
' The family themselves ate in the kitchen.',
'39 | | 58 | 26.',
' The scene is on the sea-shore.',
' But a long threatening comes at last, they say.',
' And yet she had some secret sorrow, this woman.',
' Nothing to cut a feeling or sting a passion?',
' I am going to be better.' ]
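A quick aside on the single-quote test in that filter: splitting a string on a character gives you one more piece than there are occurrences of that character, so `split("'").length - 1` is a cheap way to count apostrophes. A small sketch (the `countChar` helper name is mine):

```javascript
// Count how many times a single character occurs in a string.
// Splitting on the character yields (occurrences + 1) pieces.
function countChar(str, ch) {
  return str.split(ch).length - 1;
}

// One apostrophe: a contraction, which the filter lets through.
console.log(countChar("Tell him I'm Boylan with impatience.", "'"));

// Two apostrophes: probably single-quoted dialogue, which the filter rejects.
console.log(countChar("'I told you,' he said.", "'"));
```

The trade-off is that a sentence with two legitimate contractions gets thrown out too, which seems like an acceptable loss for a filter this crude.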
Hrrrm okay. So I've been trying to avoid doing any linguistics-based stuff here, just pattern matching and that kind of thing. And our results are kind of okay? "The scene is on the sea-shore" sounds like an aphorism. "A long threatening comes at last" is an aphorism. The rest... not so much.
Linguistics-based tools can seem daunting at first, but they don't have to be, especially if you're like me and just stick to the simpler tools. I manage a NodeJS project called pos. It's a pretty okay, and very fast, part-of-speech tagger. It tells you whether a word is a noun, or a verb, or whatever else.
So I'm going to add the pos library to the program:
var pos = require('pos');
...along with a new function I've written that, given a sentence, will return a string of parts of speech for that sentence:
function getPos(sentence) {
  var words = new pos.Lexer().lex(sentence);
  var taggedWords = new pos.Tagger().tag(words);
  taggedWords = _.map(taggedWords, function(tag) {
    return tag[1];
  });
  return _.flatten(taggedWords).join(' ');
}
This means that if I feed "The cat purrs softly." into getPos(), I end up with "DT NN VBZ RB ."
If we look at the tag listing for pos, we see this translates to: "Determiner, singular noun, present tense verb, adverb, period."
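To make that translation step concrete, here is a rough sketch that turns a tag string back into words. The `TAG_NAMES` table and `describeTags` function are hypothetical and cover only the handful of tags mentioned in this post; the real tag listing is much longer.

```javascript
// Hypothetical lookup covering only the tags discussed in this post.
var TAG_NAMES = {
  'DT': 'determiner',
  'NN': 'singular noun',
  'NNS': 'plural noun',
  'VBZ': 'present tense verb (third person singular)',
  'VBP': 'present tense verb (not third person singular)',
  'RB': 'adverb',
  '.': 'sentence-final punctuation'
};

// Translate a space-separated tag string into a readable description.
function describeTags(tagString) {
  return tagString.split(' ').map(function (tag) {
    return TAG_NAMES.hasOwnProperty(tag)
      ? TAG_NAMES[tag]
      : 'unknown (' + tag + ')';
  }).join(', ');
}

console.log(describeTags('DT NN VBZ RB .'));
```

Anything not in the table comes back as "unknown", which is a handy reminder of which tags you haven't looked up yet.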
One common aphorism pattern is "[noun] [verb]". Every dog has its day. Absence makes the heart grow fonder. So let's write some code that grabs noun-verb patterns. I'm putting this after our existing set of filters, because that way we have a much smaller set of sentences to run the part-of-speech analysis on.
sentences = _.filter(sentences, function(el) {
  // true when a noun tag (NN, NNS, NNP) is directly followed by VBZ or VBP
  var nounVerbs = !_.isNull(getPos(el).match(/NN.? VB[ZP]/));
  return nounVerbs;
});
This code is a little complicated since it uses regular expressions. What we're doing is calling getPos() on each sentence to get our little sentence diagrams back. Then we look at the sentence diagram and run a regular expression search on it. The regular expression is /NN.? VB[ZP]/. This is searching for two things next to each other: first it wants any word whose category starts with "NN" (so all nouns). Then it looks to see if the next word in the sentence is of type "VBZ" (singular present tense verb, like "dog eats") or "VBP" (plural present tense verb like "dogs eat").
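Here is the /NN.? VB[ZP]/ pattern run against a few hand-written tag strings, so you can see what it accepts without needing the tagger. The tag strings and the `hasNounVerb` helper are made up for illustration.

```javascript
// The noun-followed-by-present-tense-verb pattern described above.
var NOUN_VERB = /NN.? VB[ZP]/;

function hasNounVerb(tagString) {
  return NOUN_VERB.test(tagString);
}

// "The cat purrs softly." : singular noun + VBZ
console.log(hasNounVerb('DT NN VBZ RB .'));   // true

// Plural noun + VBP, as in "dogs eat"
console.log(hasNounVerb('NNS VBP DT NN .'));  // true

// Pronoun + verb, no noun involved
console.log(hasNounVerb('PRP VBP RB JJ .'));  // false

// Noun followed by a modal rather than a present tense verb
console.log(hasNounVerb('DT NN MD VB .'));    // false

// ".?" allows at most one extra character after "NN", so the
// four-letter tag NNPS slips through the net.
console.log(hasNounVerb('NNPS VBP'));         // false
```

That last case is a small gap in the pattern. It rarely matters in practice, but widening `.?` to something like `\w*` would close it if you cared.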
Number of sentences: 192969
Number after filter: 3308
[ ' The great, the good Patroclus is no more!',
' And when I say that it means a deal, Jim.',
' I wish I had put his eyes out!',
' Its specific gravity is approximately 1.',
' No; I am not so selfish.',
' The field follows, a bunch of bucking mounts.',
'--they say he killed himself.',
' Shop closes early on Thursday.',
' I am telling you the truth.',
'\' My Robert believes he was a wine-merchant.' ]
Hmm. Let's get rid of anything that contains "I". Those aren't very aphorism-like. And let's get rid of my/me/he/she/you/his/her as well.
sentences = _.filter(sentences, function(el) {
  el = el.trim();
  return el.length < 50
    && el.length > 20
    && el.indexOf('"') === -1
    && (el.split("'").length -1 < 2)
    && el.indexOf('I ') === -1
    && _.isNull(el.match(/\bmy\b/i))
    && _.isNull(el.match(/\bme\b/i))
    && _.isNull(el.match(/\bhe\b/i))
    && _.isNull(el.match(/\bshe\b/i))
    && _.isNull(el.match(/\byou\b/i))
    && _.isNull(el.match(/\bhis\b/i))
    && _.isNull(el.match(/\bher\b/i));
});
Number of sentences: 192969
Number after filter: 1293
[ ' Live axle drives are souped.',
' Yellow poison streaks are on the drawn face.',
' The dog is always with him.',
' (The action takes place in Helmer\'s house.',
' Argal, one hat is one hat.',
' The same day continues.',
' That land hath store of such.',
' Malachi Mulligan is coming too.',
' These mouths are fitted with heavy doors of iron.' ]
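As an aside, that long chain of pronoun checks could be folded into a single regular expression. This is just one way to sketch it, not the code the results above came from; `PRONOUNS` and `mentionsPerson` are names I made up.

```javascript
// One case-insensitive pattern for the pronouns filtered out above.
// \b keeps it from firing inside longer words ("The", "same", "Homer").
var PRONOUNS = /\b(my|me|he|she|you|his|her)\b/i;

// True when a sentence contains "I " or any of the listed pronouns.
function mentionsPerson(sentence) {
  return sentence.indexOf('I ') !== -1 || PRONOUNS.test(sentence);
}

console.log(mentionsPerson('She must go to your house, friend John.')); // true
console.log(mentionsPerson('I am going to be better.'));                // true
console.log(mentionsPerson('The same day continues.'));                 // false
// "him" is not on the list, which is why this one survives the filter.
console.log(mentionsPerson('The dog is always with him.'));             // false
```

You'd then keep a sentence only when `mentionsPerson` returns false. It also makes the gaps in the list easier to spot: him, we, they, and friends are all still allowed through.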
So! This is pretty reasonable. We're now getting aphorism-like things. Here's the source code.
So I built this essentially on a dare, but then last week I was telling jokes of this form on Twitter:
Startup idea: [famous Marxist phrase]
At which point @deathbearbrown reminded me that I was doing very formulaic jokes. Well, what good is a formulaic joke if you can't automate it with code, right?
So I grabbed some classic Marxist texts from the archive at Marxists.org. I didn't just grab them at random: I picked texts that I thought would lend themselves to humor juxtaposed alongside the phrase "Startup idea." I ended up deploying a modified version of the code you saw above. (You can see the source code for the bot if you like, it's pretty much identical, just with some rules tweaks.) The end result is @HottestStartups, a joke-bot that makes me laugh pretty consistently!
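The joke template itself is about as simple as code gets. Here is a hypothetical sketch of that last step; it is not taken from the bot's source, and `toStartupIdea` is a name I made up for illustration.

```javascript
// Turn an extracted sentence into a joke of the form shown above.
// Sentences come out of the splitter with leading whitespace, so trim first.
function toStartupIdea(sentence) {
  return 'Startup idea: ' + sentence.trim();
}

console.log(toStartupIdea(' Workers of the world, unite!'));
```

All of the actual work is in choosing the source texts and tuning the filters. The formatting is a one-liner.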