how-to-query-wikipedia

How to query Wikipedia like a database

by Darius Kazemi, Apr 13, 2015

On Sunday night I built a Twitter bot called Empire Plots that comes up with weird but plausible soap opera type plots for the TV show Empire (inspired by Eve Ewing).

In order to build this, I wanted a list of living African-American actors. Normally I would go to Wikipedia and just copy/paste a list of actors, but while they have categories for male and female African-American actors, there isn't a list of specifically ones who are alive.

Because that would be creepy.

Wikipedia, however, does have an understanding of which actors are alive and dead. It's just normally on the profile page for an individual actor. If a person in a Wikipedia entry is dead, there's usually a box on the right hand side of their page with a Died field that has a date in it.

So how do we take advantage of this? How do we ask Wikipedia for a list of articles that contain (or don't contain) death dates for actors?

DBPedia: your confusing but useful friend

Enter DBPedia. DBPedia is kind of hard to describe, but in non-technical terms, it's a static backup of Wikipedia. "Static" means it's not updated regularly, only occasionally (something like once a year), so the information in it is not cutting-edge. Instead of being a normal backup, it's a semantic backup, meaning that every article on Wikipedia is stored like an item in a database and you can query it using a query language.

This lets us compose queries like: "Give me every country in the world with a population of more than 100 million people."

We'd compose that query by writing it in SPARQL. Please don't click on that link. SPARQL is very difficult to use and I've never seen anyone teach it well. (That's uh, kind of why I'm writing this article.)

SPARQL isn't just a DBPedia thing. It applies to all sorts of other sites and resources. Technically it's powerful enough to let you make queries across different websites, which is super cool and powerful but nobody with a mere human brain can figure out how to actually do it.

We're just talking about DBPedia here, which makes things a little easier.

How to compose a DBPedia query

Let's use our earlier example and say that our query is "Give me every country in the world with a population of more than 10 million people." First we find the Wikipedia page for an example country, like this page for Saudi Arabia. Then we copy the part of the url after the wiki/ part, so https://en.wikipedia.org/wiki/Saudi_Arabia becomes just Saudi_Arabia.

In front of Saudi_Arabia we put http://dbpedia.org/page, so we end up with http://dbpedia.org/page/Saudi_Arabia as our final URL. We go to the website described by that page and we see a listing of a huge list of properties, which are pieces of data that DBPedia understands about the Wikipedia page for Saudi Arabia.

First we want to understand how DBPedia knows this is even a country, and not like, an album called "Saudi Arabia". If we ctrl+F to search the page for the word "Country", we'll eventually come across a dbo:Country listing in a section labeled rdf:type. This means that the "type" of the entry is a Country (it's also a Thing, a Place, a Location, etc etc).

The dbo: and rdf: parts are "namespaces" but that doesn't matter right now. I'm not even going to cover namespaces in this article because they are intensely confusing to the newcomer and are, in the end, just a convenient shortcut for power users.

dbo:Country itself is a link to this url: http://dbpedia.org/ontology/Country. For now we don't need to go to that page. We just need to make a note of that URL.

Next we open up a SPARQL query tool. This one is pretty simple, just open it in a new tab: http://dbpedia.org/snorql/

In the text field it'll say something like

SELECT * WHERE {
...
}

Let's change that to the following simple query:

SELECT ?country WHERE {
  ?country rdf:type <http://dbpedia.org/ontology/Country>
}

The first line says "select every country that meets the following criteria in {curly brackets}". (The ?country bit is actually just a variable name. We could call it ?cheese and it'll still work. It's just looking for "stuff" that matches our criteria.)

The second line is our criteria. It's "every country where its TYPE property (rdf:type) field contains <http://dbpedia.org/ontology/Country> (triangle brackets are required) in its list of values".

The third line just closes out the curly brackets and lets the computer know that we're done asking for things.

If you press "Go!" now, it might churn for a minute and then it'll give you a biiiiiig list of entries that starts like this:

country
:Finland_national_cricket_team
:Great_Britain_men's_national_basketball_team
:Indiana_Democratic_Party
:Pakistan_men's_national_field_hockey_team
:Republic_of_Molossia
:Talossa
:United_States_Ambassador_to_Denmark

Ummmm. Those don't look very much like countries? What happened?

Well it turns out that Wikipedia has a pretty broad definition of what counts as a "Country". Fortunately we can refine this further. Let's ask it for Country type entries that have a population field.

If we go back to our Saudi Arabia entry we'll see there's a dbo:populationTotal field that maps to http://dbpedia.org/ontology/populationTotal. So now we want to add an additonal criterion: we only want it to return things that have a dbo:populationTotal property. We do this like so:

SELECT ?country WHERE {
  ?country rdf:type <http://dbpedia.org/ontology/Country> .
  ?country  <http://dbpedia.org/ontology/populationTotal> ?population 
}

The first line is the same as before. The second line is the same as before except we've added a period . to the end as a notification that there're more information to come on the next line. The third line is new: it says "also give us countries that have a population".

Let's hit "Go!" again and see what our results are. They start off like this:

country
:Talossa
:Nagorno-Karabakh
:Sark
:Somaliland
:Sovereign_Military_Order_of_Malta
:Ross_Dependency
:Cocos_(Keeling)_Islands
:French_Southern_and_Antarctic_Lands
:Cherokee_Nation
:Adjara
:Kerguelen_Islands
:Macedonia_(region)
:Principality_of_Sealand

That's great! There are only a couple hundred results, and these are all actual places, even though some of them might be politically disputed territory. If we scroll down more we start seeing entries like Egypt and Italy so we know we've got a pretty good list here.

Since we defined a new variable in that third line called ?population, one cool thing we can do now is add it to our first line. This will return a second column which contains the population data for every entry:

SELECT ?country, ?population WHERE {
  ?country rdf:type <http://dbpedia.org/ontology/Country> .
  ?country  <http://dbpedia.org/ontology/populationTotal> ?population 
}

And the output is:

country                                population
:Talossa                               239
:Nagorno-Karabakh                      138000
:Sark                                  600
:Somaliland                            3500000
:Sovereign_Military_Order_of_Malta     3
:Ross_Dependency                       10
:Cocos_(Keeling)_Islands               596
:French_Southern_and_Antarctic_Lands   140
:Cherokee_Nation                       299862
:Adjara                                393700
:Kerguelen_Islands                     45
:Macedonia_(region)                    4760000
:Principality_of_Sealand               50
:Savoy                                 1125119

Now for the population filter part. We want just countries with 100 million people or more. Here's how we do that:

SELECT ?country, ?population WHERE {
?country rdf:type <http://dbpedia.org/ontology/Country> .
?country  <http://dbpedia.org/ontology/populationTotal> ?population .
FILTER ( ?population > 100000000 )
}

All we've done is add a . to the end of our third line and a FILTER to the end that only accepts results with a ?population of greater than one hundred million:

country         population
:Bangladesh     156594962
:Brazil         202768562
:China          1357380000
:Indonesia      252164800
:Japan          126434964
:Mexico         118395054
:Nigeria        174507539
:Pakistan       196174380
:Philippines    100617630
:Russia         143975923
:United_States  320206000

Pretty snazzy. Another really cool thing is that once you have the query down, there are a million tools out there in whatever programming language you want that will query DBPedia and return the data however you like. Or if you're into it, you can simply make an HTTP GET request to the JSON endpoint for the query.

Okay but how do we get living African-American actors??

Well let's start with one. Let's pick African-American actress Felicia Pearson, and go to her DBPedia entry (remember it's just her Wikipedia entry but with dbpedia.org/page/ instead of en.wikipedia.org/wiki/). We'll see an entry on there with the property dct:subject that has the value dbc:African-American_actresses. That's the Wikipedia category page we linked at the top of this article for African-American actresses.

Again we go through the exercise where we expand things out to their URLs, so dct:subject becomes <http://purl.org/dc/terms/subject> and dbc:African-American_actresses becomes <http://dbpedia.org/resource/Category:African-American_actresses>. Then we do this simple query:

SELECT ?actor WHERE { 
 ?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses>
}

"Give us every subject listed in the category of African-American actresses." We end up with this:

actor
:Robin_Givens
:Jonelle_Allen
:Barbara_McNair
:Brandi_Chavonne_Massey
:Christy_Knowings
:Felicia_Pearson
:Femi_Emiola
:Lillias_White
:Luenell
:Meagan_Tandy

Which looks great! Except wayyyy down the list we get this:

:List_of_black_Academy_Award_winners_and_nominees

This is NOT what we want. Turns out that list is a member of that category. But we just want people! Fortunately we can filter people like we filtered countries above. Again we look at the DBPedia page for a specific actress, browse around, and see that she has a rdf:type of <http://dbpedia.org/ontology/Person>. AKA she's a person! Let's add that to the query:

SELECT ?actor WHERE { 
 ?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
 ?actor rdf:type <http://dbpedia.org/ontology/Person> 
}

(Again, don't forget the . to join the two statements.)

I won't reprint it here but this gives us a full list of African-American actresses... both living and dead.

So how to filter out dead actresses? Well, it's kind of like with the population thing. Let's go to the Wikipedia page of an actress we know to be dead, like... Aaliyah (R.I.P.!!!!). Right near the top we see she has a dbo:deathDate field that links to http://dbpedia.org/ontology/deathDate. So we can add this to the query:

SELECT ?actor, ?deathdate WHERE { 
 ?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
 ?actor rdf:type <http://dbpedia.org/ontology/Person> .
 ?actor <http://dbpedia.org/ontology/deathDate> ?deathdate
}

This returns a much shorter list of actresses along with the date they died:

actor                      deathdate
:Barbara_McNair            "2007-02-04"^^xsd:date
:Vinnette_Justine_Carroll  "2002-11-05"^^xsd:date
:Hattie_McDaniel           "1952-10-26"^^xsd:date
:Dorothy_Dandridge         "1965-09-08"^^xsd:date
:Ethel_Waters              "1977-09-01"^^xsd:date
:Ruth_Brown                "2006-11-17"^^xsd:date
:Roxie_Roker               "1995-12-02"^^xsd:date
...etc

Unfortunately this is the OPPOSITE of what we want. We really want living actresses. Fortunately all we have to do is mark our "deathdate" query as "OPTIONAL":

SELECT ?actor, ?deathdate WHERE { 
 ?actor <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:African-American_actresses> .
 ?actor rdf:type <http://dbpedia.org/ontology/Person> .
 OPTIONAL {?actor <http://dbpedia.org/ontology/deathDate> ?deathdate}
}

And now we get a list that starts exactly like above, but eventually looks like this:

:Nina_Mae_McKinney      "1967-05-03"^^xsd:date
:Damita_Jo_DeBlanc      "1998-12-25"^^xsd:date
:Alaina_Reed_Hall       "2009-12-17"^^xsd:date
:Simone_Battle          "2014-09-05"^^xsd:date
:Robin_Givens           -
:Jonelle_Allen          -
:Brandi_Chavonne_Massey -
:Christy_Knowings       -

The second half of our list is actresses who do not have a known death date in wikipedia as of 2014 (when DBPedia was last populated with data).

Conclusion

DBPedia is confusing and intimidating and poorly documented and POWERFUL AS ALL HECK. I hope you enjoy using it in your future projects.