Behind the Scenes
Dec 17, 2014
On Dec 16 I released a little project called Content, Forever. It generates a meandering essay from a seed topic by crawling Wikipedia, bouncing around from article to article with a short attention span.
How it Works
It works, loosely, like so:
- Go to the Wikipedia article for the seed topic.
- Get the first 5 paragraphs that contain a link to another WP article.
- Grab a random paragraph from those 5, then grab a random link from that paragraph.
- Print out the sentence we got the link from, then go to the linked article.
- If there are images on the page, there's a 20% chance we grab an image at random along with its caption and print that out too.
- (If we hit a snag anywhere in this process, grab a random link from anywhere in the article and print that sentence then go to that article.)
- Start over at step 2. Stop if we hit our content limit.
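The loop above can be sketched roughly like this. This is an illustrative reconstruction, not the project's actual code: the `article` shape and function names are hypothetical stand-ins, and fetching/parsing Wikipedia is stubbed out entirely.

```javascript
// Sketch of one crawl step. `article` is a hypothetical pre-parsed
// structure: { paragraphs: [{ text, links }], images: [{ src, caption }] }.

// Pick a random element, with random() injectable so the logic is testable.
function choose(arr, random) {
  return arr[Math.floor(random() * arr.length)];
}

// Steps 2-3: from the first 5 link-bearing paragraphs, pick a random
// paragraph, then a random link from it. Returns the next article to visit.
function pickNextLink(article, random) {
  const candidates = article.paragraphs
    .filter(p => p.links.length > 0)
    .slice(0, 5);
  if (candidates.length === 0) {
    // Snag case: the real project falls back to a random link from
    // anywhere in the article; here we just signal failure.
    return null;
  }
  return choose(choose(candidates, random).links, random);
}

// The image step: a 20% chance of grabbing a random image with its caption.
function maybeGrabImage(article, random) {
  if (article.images.length > 0 && random() < 0.2) {
    return choose(article.images, random);
  }
  return null;
}
```

Passing in the random source makes the branching easy to test; in the browser you'd just default it to `Math.random`.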
Drafts (Process Notes)
I thought it might be fun for people to see all the draft stages of the project. I wrote most of this in JSFiddle, an online tool that lets you write JS/HTML/CSS and test it very quickly, then save out drafts and share them with people. A nice side effect is that all my drafts are saved this way.
Here are a few notable drafts. For all of these, you'll need to reload the page manually to see new content.
- Version 0. This isn't on JSFiddle. It's not even code. I always prototype ideas by hand whenever possible. I ran through my idea for the algorithm manually and compiled it into this little document. I liked it, so I decided to start developing it. (My original idea, as written in a chat channel with some friends, was: "start at a random wikipedia page, extract the first N sentences until one sentence contains a link to another article (and not a meta-page). print those sentences, then click the link, repeat." It was actually an early NaNoGenMo concept I threw away.)
- Version 1. Grabs the first sentence from an article and goes to a random Wikipedia link in that sentence, repeats until it hits an error (which happens frequently).
- Version 4. Only clicks on the first-ish link it finds. This tends to loop, repeating articles on Greek word roots and linguistics, since most articles start with the phonetic roots of a given word. It's very similar to Wikipedia's "Getting to Philosophy" phenomenon. I also now strip out markup, which makes it feel a little more aesthetically coherent.
- Version 8. This version doesn't just grab the first sentence it finds with links: it grabs a random sentence from anywhere in the article that has links. The result is very close to the final version's behavior. It's a lot more meandering and doesn't always end up talking about Greek linguistic roots.
- Version 11. However, that version was a little too meandering. I decided to limit it to picking at random from the first five sentences with links in the article. That way we're probably grabbing stuff from the overview section, so we continue to discuss generalities about topics rather than really detailed stuff that requires prior context. This lets us have more "atomic" responses that make sense as part of a chain of output.
- Version 18. This doesn't look much different, but there's a lot more code in here to handle random errors and edge cases. By this point, when I hit an error it was more likely a problem with a Wikipedia article than a problem with my code! For example, it choked on articles with zero outgoing links (which Wikipedia discourages). This version generates extremely long articles, and it can handle redirects, disambiguation pages, pages that are stubs or collections of links, etc.
- Version 19. At this point, I showed the project to my wife Courtney Stanton and she suggested I let people set how much content and pick a starting topic. So I did that. She has good ideas.
- Version 21. I wanted to play with styling more, and I realized that these articles are sort of like what you'd find on a site like Medium. The most important thing you can do with generative content is give it some kind of context, no matter how subtle. Humans rely on context as a crutch all the time: by publishing on Medium, for example, even crappy things have a kind of gravitas to them. If a human can use a contextual crutch, so can an algorithm!
- Version 26. My first attempt at including images. This was a last-minute addition that I thought of just a few hours before the project went live. I decided to look at real articles on Medium, and most of them have inline images with captions. This version has only images, no captions.
- Version 30. I finally figured out how to get captions (by scraping the whole image container div with captions and not just the image itself). Images are still thumbnail sized, but basically this is the final version before I went live.
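The caption fix can be sketched like this. This is a hypothetical reconstruction, not the project's code: the selectors reflect Wikipedia's thumbnail markup of the era (a `div.thumb` container wrapping the image and a `div.thumbcaption`), which is my assumption about what was being scraped.

```javascript
// Hypothetical sketch of caption scraping: instead of collecting bare
// <img> elements, walk the thumbnail container divs so each image comes
// paired with its caption text.
function grabImagesWithCaptions(doc) {
  return Array.from(doc.querySelectorAll("div.thumb")).map(thumb => {
    const img = thumb.querySelector("img");
    const cap = thumb.querySelector(".thumbcaption");
    return {
      src: img ? img.src : null,
      caption: cap ? cap.textContent.trim() : ""
    };
  });
}
```

The point is the unit of scraping: grabbing the container gets you the image and its caption together, where grabbing `img` tags alone loses the pairing.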