by Darius Kazemi, Apr 13, 2015
On Sunday night I built a Twitter bot called Empire Plots that comes up with weird but plausible soap opera type plots for the TV show Empire (inspired by Eve Ewing).
In order to build this, I wanted a list of living African-American actors. Normally I would go to Wikipedia and just copy/paste a list of actors, but while they have categories for male and female African-American actors, there isn't a list of specifically ones who are alive.
Because that would be creepy.
Wikipedia, however, does have an understanding of which actors are alive and dead. It's just normally on the profile page for an individual actor. If a person in a Wikipedia entry is dead, there's usually a box on the right hand side of their page with a Died field that has a date in it.
So how do we take advantage of this? How do we ask Wikipedia for a list of articles that contain (or don't contain) death dates for actors?
Enter DBPedia. DBPedia is kind of hard to describe, but in non-technical terms, it's a static backup of Wikipedia. "Static" means it's not updated regularly, only occasionally (something like once a year), so the information in it is not cutting-edge. Instead of being a normal backup, it's a semantic backup, meaning that every article on Wikipedia is stored like an item in a database and you can query it using a query language.
This lets us compose queries like: "Give me every country in the world with a population of more than 100 million people."
We'd compose that query by writing it in SPARQL. Please don't click on that link. SPARQL is very difficult to use and I've never seen anyone teach it well. (That's uh, kind of why I'm writing this article.)
SPARQL isn't just a DBPedia thing. It applies to all sorts of other sites and resources. Technically it's powerful enough to let you make queries across different websites, which is super cool and powerful but nobody with a mere human brain can figure out how to actually do it.
We're just talking about DBPedia here, which makes things a little easier.
Let's use our earlier example and say that our query is "Give me every country in the world with a population of more than 100 million people." First we find the Wikipedia page for an example country, like this page for Saudi Arabia. Then we copy the part of the URL after the wiki/ part, so https://en.wikipedia.org/wiki/Saudi_Arabia becomes just Saudi_Arabia.
In front of Saudi_Arabia we put http://dbpedia.org/page/, so we end up with http://dbpedia.org/page/Saudi_Arabia as our final URL. We go to that URL and see a huge list of properties, which are pieces of data that DBPedia understands about the Wikipedia page for Saudi Arabia.
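If you end up doing that copy/paste dance a lot, it's easy to script. Here's a minimal Python sketch of the same URL swap. The function name wikipedia_to_dbpedia is something I made up for illustration, and it assumes an ordinary English Wikipedia article URL:

```python
from urllib.parse import urlparse


def wikipedia_to_dbpedia(wikipedia_url: str) -> str:
    """Swap a Wikipedia article URL for the matching DBPedia page URL."""
    path = urlparse(wikipedia_url).path  # e.g. "/wiki/Saudi_Arabia"
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {wikipedia_url}")
    title = path[len(prefix):]  # everything after "wiki/"
    return "http://dbpedia.org/page/" + title


print(wikipedia_to_dbpedia("https://en.wikipedia.org/wiki/Saudi_Arabia"))
# prints http://dbpedia.org/page/Saudi_Arabia
```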
First we want to understand how DBPedia knows this is even a country, and not like, an album called "Saudi Arabia". If we ctrl+F to search the page for the word "Country", we'll eventually come across a dbo:Country listing in a section labeled rdf:type. This means that the "type" of the entry is a Country (it's also a Thing, a Place, a Location, etc.).
The dbo: and rdf: parts are "namespaces" but that doesn't matter right now. I'm not even going to cover namespaces in this article because they are intensely confusing to the newcomer and are, in the end, just a convenient shortcut for power users.
dbo:Country itself is a link to this URL: http://dbpedia.org/ontology/Country. For now we don't need to go to that page. We just need to make a note of that URL.
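If you'd rather not hover over links to find the full URL, the shortcut can be expanded mechanically. This is only a sketch: expand is a hypothetical helper name, and the table holds just the two prefixes we've met so far (the rdf: URL is the standard W3C one, which isn't shown on the page above).

```python
# Only the two prefixes seen so far; DBPedia pages use many more.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dbo": "http://dbpedia.org/ontology/",
}


def expand(short_name: str) -> str:
    """Turn a shortcut like "dbo:Country" into its full URL."""
    prefix, _, local_name = short_name.partition(":")
    return PREFIXES[prefix] + local_name


print(expand("dbo:Country"))
# prints http://dbpedia.org/ontology/Country
```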
Next we open up a SPARQL query tool. This one is pretty simple, just open it in a new tab: http://dbpedia.org/snorql/
In the text field it'll say something like
SELECT * WHERE {
...
}
Let's change that to the following simple query:
SELECT ?country WHERE {
?country rdf:type <http://dbpedia.org/ontology/Country>
}
The first line says "select every country that meets the following criteria in {curly brackets}". (The ?country bit is actually just a variable name. We could call it ?cheese and it'll still work. It's just looking for "stuff" that matches our criteria.)
The second line is our criteria. It's "every country where its TYPE property (rdf:type) field contains <http://dbpedia.org/ontology/Country> (triangle brackets are required) in its list of values".
The third line just closes out the curly brackets and lets the computer know that we're done asking for things.
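If the "variable matches stuff" idea feels slippery, here's a toy Python model of it. To be clear, this is not how DBPedia works internally and the match function is my own invention. It just treats the data as a pile of (thing, property, value) facts and fills in anything that starts with a question mark. The four sample facts are hand-copied from entries mentioned in this article.

```python
# A tiny hand-written sample of facts: (thing, property, value).
TRIPLES = [
    ("Saudi_Arabia", "rdf:type", "dbo:Country"),
    ("Saudi_Arabia", "rdf:type", "dbo:Place"),
    ("Talossa", "rdf:type", "dbo:Country"),
    ("Felicia_Pearson", "rdf:type", "dbo:Person"),
]


def match(pattern, triples):
    """Return one {variable: value} dict for every fact that fits the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for wanted, actual in zip(pattern, triple):
            if wanted.startswith("?"):
                # A variable accepts anything, but must stay consistent
                # if it appears more than once in the pattern.
                if binding.setdefault(wanted, actual) != actual:
                    break
            elif wanted != actual:
                break  # a fixed part of the pattern did not match
        else:
            results.append(binding)
    return results


print(match(("?country", "rdf:type", "dbo:Country"), TRIPLES))
# prints [{'?country': 'Saudi_Arabia'}, {'?country': 'Talossa'}]
```

Swap ?country for ?cheese and you get the same two entries back, just filed under a different name.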
If you press "Go!" now, it might churn for a minute and then it'll give you a biiiiiig list of entries that starts like this:
country
:Finland_national_cricket_team
:Great_Britain_men's_national_basketball_team
:Indiana_Democratic_Party
:Pakistan_men's_national_field_hockey_team
:Republic_of_Molossia
:Talossa
:United_States_Ambassador_to_Denmark
Ummmm. Those don't look very much like countries? What happened?
Well it turns out that Wikipedia has a pretty broad definition of what counts as a "Country". Fortunately we can refine this further. Let's ask it for Country type entries that have a population field.
If we go back to our Saudi Arabia entry we'll see there's a dbo:populationTotal field that maps to http://dbpedia.org/ontology/populationTotal. So now we want to add an additional criterion: we only want it to return things that have a dbo:populationTotal property. We do this like so:
SELECT ?country WHERE {
?country rdf:type <http://dbpedia.org/ontology/Country> .
?country <http://dbpedia.org/ontology/populationTotal> ?population
}
The first line is the same as before. The second line is the same as before except we've added a period (.) to the end as a notification that there's more information to come on the next line. The third line is new: it says "also give us countries that have a population".
Let's hit "Go!" again and see what our results are. They start off like this:
country
:Talossa
:Nagorno-Karabakh
:Sark
:Somaliland
:Sovereign_Military_Order_of_Malta
:Ross_Dependency
:Cocos_(Keeling)_Islands
:French_Southern_and_Antarctic_Lands
:Cherokee_Nation
:Adjara
:Kerguelen_Islands
:Macedonia_(region)
:Principality_of_Sealand
That's great! There are only a couple hundred results, and these are all actual places, even though some of them might be politically disputed territory. If we scroll down more we start seeing entries like Egypt and Italy so we know we've got a pretty good list here.
Since we defined a new variable in that third line called ?population, one cool thing we can do now is add it to our first line. This will return a second column which contains the population data for every entry:
SELECT ?country ?population WHERE {
?country rdf:type <http://dbpedia.org/ontology/Country> .
?country <http://dbpedia.org/ontology/populationTotal> ?population
}
And the output is:
country population
:Talossa 239
:Nagorno-Karabakh 138000
:Sark 600
:Somaliland 3500000
:Sovereign_Military_Order_of_Malta 3
:Ross_Dependency 10
:Cocos_(Keeling)_Islands 596
:French_Southern_and_Antarctic_Lands 140
:Cherokee_Nation 299862
:Adjara 393700
:Kerguelen_Islands 45
:Macedonia_(region) 4760000
:Principality_of_Sealand 50
:Savoy 1125119
Now for the population filter part. We want just countries with 100 million people or more. Here's how we do that:
SELECT ?country ?population WHERE {
?country rdf:type <http://dbpedia.org/ontology/Country> .
?country <http://dbpedia.org/ontology/populationTotal> ?population .
FILTER ( ?population > 100000000 )
}
All we've done is add a period (.) to the end of our third line and a FILTER to the end that only accepts results with a ?population of greater than one hundred million:
country population
:Bangladesh 156594962
:Brazil 202768562
:China 1357380000
:Indonesia 252164800
:Japan 126434964
:Mexico 118395054
:Nigeria 174507539
:Pakistan 196174380
:Philippines 100617630
:Russia 143975923
:United_States 320206000
Pretty snazzy. Another really cool thing is that once you have the query down, there are a million tools out there in whatever programming language you want that will query DBPedia and return the data however you like. Or if you're into it, you can simply make an HTTP GET request to the JSON endpoint for the query.
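Here's roughly what the plain HTTP route can look like in Python, using only the standard library. Treat it as a sketch under a few assumptions: that the public endpoint lives at https://dbpedia.org/sparql, that it accepts query and format parameters, and that it answers in the standard SPARQL JSON results layout. The function names are made up for this example. I've also spelled out the rdf: prefix at the top of the query, since outside the Snorql page you can't count on it being defined for you.

```python
import json
from urllib.parse import urlencode

# Assumption: DBPedia's public SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?country ?population WHERE {
  ?country rdf:type <http://dbpedia.org/ontology/Country> .
  ?country <http://dbpedia.org/ontology/populationTotal> ?population .
  FILTER ( ?population > 100000000 )
}
"""


def build_request_url(query: str) -> str:
    """Build the GET URL that asks the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


def rows_from_json(text: str) -> list:
    """Flatten a SPARQL JSON response into a list of {column: value} dicts."""
    data = json.loads(text)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


# To actually run the query (needs network access):
# from urllib.request import urlopen
# with urlopen(build_request_url(QUERY)) as response:
#     rows = rows_from_json(response.read().decode("utf-8"))
```

Every value comes back as a string, so you'd convert the population column with int() yourself if you want to do math on it.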
Well let's start with one. Let's pick African-American actress Felicia Pearson, and go to her DBPedia entry (remember it's just her Wikipedia entry but with dbpedia.org/page/ instead of en.wikipedia.org/wiki/). We'll see an entry on there with the property dct:subject that has the value dbc:African-American_actresses. That's the Wikipedia category page we linked at the top of this article for African-American actresses.
Again we go through the exercise where we expand things out to their URLs, so dct:subject becomes <http://purl.org/dc/terms/subject> and dbc:African-American_actresses becomes <http://dbpedia.org/resource/Category:African-American_actresses>. Then we do this simple query:
SELECT ?actor WHERE {
?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses>
}
"Give us every subject listed in the category of African-American actresses." We end up with this:
actor
:Robin_Givens
:Jonelle_Allen
:Barbara_McNair
:Brandi_Chavonne_Massey
:Christy_Knowings
:Felicia_Pearson
:Femi_Emiola
:Lillias_White
:Luenell
:Meagan_Tandy
Which looks great! Except wayyyy down the list we get this:
:List_of_black_Academy_Award_winners_and_nominees
This is NOT what we want. Turns out that list is a member of that category. But we just want people! Fortunately we can filter people like we filtered countries above. Again we look at the DBPedia page for a specific actress, browse around, and see that she has an rdf:type of <http://dbpedia.org/ontology/Person>. AKA she's a person! Let's add that to the query:
SELECT ?actor WHERE {
?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
?actor rdf:type <http://dbpedia.org/ontology/Person>
}
(Again, don't forget the period to join the two statements.)
I won't reprint it here but this gives us a full list of African-American actresses... both living and dead.
So how to filter out dead actresses? Well, it's kind of like with the population thing. Let's go to the DBPedia page of an actress we know to be dead, like... Aaliyah (R.I.P.!!!!). Right near the top we see she has a dbo:deathDate field that links to http://dbpedia.org/ontology/deathDate. So we can add this to the query:
SELECT ?actor ?deathdate WHERE {
?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
?actor rdf:type <http://dbpedia.org/ontology/Person> .
?actor <http://dbpedia.org/ontology/deathDate> ?deathdate
}
This returns a much shorter list of actresses along with the date they died:
actor deathdate
:Barbara_McNair "2007-02-04"^^xsd:date
:Vinnette_Justine_Carroll "2002-11-05"^^xsd:date
:Hattie_McDaniel "1952-10-26"^^xsd:date
:Dorothy_Dandridge "1965-09-08"^^xsd:date
:Ethel_Waters "1977-09-01"^^xsd:date
:Ruth_Brown "2006-11-17"^^xsd:date
:Roxie_Roker "1995-12-02"^^xsd:date
...etc
Unfortunately this is the OPPOSITE of what we want. We really want living actresses. Fortunately all we have to do is mark our "deathdate" query as "OPTIONAL":
SELECT ?actor ?deathdate WHERE {
?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
?actor rdf:type <http://dbpedia.org/ontology/Person> .
OPTIONAL {?actor <http://dbpedia.org/ontology/deathDate> ?deathdate}
}
And now we get a list that starts exactly like above, but eventually looks like this:
:Nina_Mae_McKinney "1967-05-03"^^xsd:date
:Damita_Jo_DeBlanc "1998-12-25"^^xsd:date
:Alaina_Reed_Hall "2009-12-17"^^xsd:date
:Simone_Battle "2014-09-05"^^xsd:date
:Robin_Givens -
:Jonelle_Allen -
:Brandi_Chavonne_Massey -
:Christy_Knowings -
The second half of our list is actresses who do not have a known death date in Wikipedia as of 2014 (when DBPedia was last populated with data).
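If you want only that second half, there are two ways I'd try; neither is necessarily how Empire Plots does it. One is to add FILTER ( !bound(?deathdate) ) to the query, which uses SPARQL's built-in bound() test to keep only rows where the optional death date came back empty. The other is to run the OPTIONAL query as-is and throw away the dated rows in your own code. The Python sketch below shows both. The helper name without_death_date and the three sample rows (copied from the output above) are just for illustration.

```python
# Option 1: let DBPedia do the filtering with bound().
LIVING_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?actor WHERE {
  ?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
  ?actor rdf:type <http://dbpedia.org/ontology/Person> .
  OPTIONAL { ?actor <http://dbpedia.org/ontology/deathDate> ?deathdate }
  FILTER ( !bound(?deathdate) )
}
"""


# Option 2: filter the rows of the OPTIONAL query after the fact.
def without_death_date(rows):
    """Keep only rows that have no deathdate value."""
    return [row for row in rows if not row.get("deathdate")]


# Sample rows shaped like {column: value}, taken from the results above.
rows = [
    {"actor": "Simone_Battle", "deathdate": "2014-09-05"},
    {"actor": "Robin_Givens"},
    {"actor": "Jonelle_Allen"},
]
print([row["actor"] for row in without_death_date(rows)])
# prints ['Robin_Givens', 'Jonelle_Allen']
```

Either way, remember this only tells you who has no death date recorded in DBPedia, which is not quite the same thing as who is alive.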
DBPedia is confusing and intimidating and poorly documented and POWERFUL AS ALL HECK. I hope you enjoy using it in your future projects.