Saturday, March 15, 2008

A thousand monkeys on a thousand typewriters

The idea of artificially creating or recreating literature has been taken on by a couple of interesting authors with varying points of view. Jorge Luis Borges mentions the idea in an essay on the literary experience and explores some of the (volume) implications of the idea in his short story “The Library of Babel.” Enjoying a lazy Saturday afternoon in Maseru, Lesotho, I thought I would think about this idea a little on my own.

Take for example a very basic and limited form like the 5-7-5 haiku. Assume that 'thought' is the longest possible one-syllable word; it may not be, but we will use it as a placeholder. The longest string of characters forming one of the 5-syllable lines would thus be 44 character spaces in length: five 7-letter words, a potential space for a punctuation mark after each 'thought', and a single space after each non-terminal word. Applying the same rules, the 7-syllable line would have 62 character spaces, giving us a total of 150 potential character spaces for the entire poem.
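For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function name is just a label invented for this example. Under the counting rules above, a 5-syllable line comes to 44 character spaces and a 7-syllable line to 62, which together give the 150 total.

```python
# Longest possible line under the post's counting rules: every word is the
# placeholder 'thought', each word may carry one punctuation mark, and every
# word except the last is followed by a single space.
PLACEHOLDER = "thought"  # assumed longest one-syllable word, as in the post


def max_line_length(syllables: int, word: str = PLACEHOLDER) -> int:
    per_word = len(word) + 1   # letters plus one punctuation slot
    spaces = syllables - 1     # one space after each non-terminal word
    return syllables * per_word + spaces


line_lengths = [max_line_length(n) for n in (5, 7, 5)]
total_spaces = sum(line_lengths)
print(line_lengths, total_spaces)
```

Swapping in a longer one-syllable word only changes `PLACEHOLDER`; the rest of the count follows from it.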

Given the standard keyboard, there are a limited number of options to fill those spaces. While someone can always insert non-Roman characters or symbols into poetry, using the standard entry available on an English keyboard, the breakdown of possible basic options when filling each character space is: Alphabet (lower case) – 26, Alphabet (upper case) – 26, numbers – 10, space – 1, other keyboard options – 33, for a grand total of 96.

Given those two facts, 150 spaces and 96 options for each space, the total number of potential haiku fitting into the 5-7-5 scheme and the punctuation assumptions I have set forward is equal to 96^150. In other words, our set of potential 5-7-5 haiku has 96^150 members, within which will be captured all of the 5-7-5 haiku ever written or that can be written. Every master poet’s output, every mediocre effort by uninterested middle schoolers, haiku greater (and worse) than any that have ever been written. Not just in English either. Every potential haiku in Spanish, in French, in Chinese (pinyin anyway), they would all be in that set.
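To get a feel for the size of 96^150, it can be computed exactly. The sketch below is only an illustration; the variable names are my own, and it relies on nothing beyond Python's built-in arbitrary-precision integers.

```python
# Size of the set: one of 96 keyboard symbols in each of 150 character spaces.
SYMBOLS = 26 + 26 + 10 + 1 + 33   # lower case, upper case, digits, space, other keys
SPACES = 150

total_haiku = SYMBOLS ** SPACES   # exact value; Python integers do not overflow
digit_count = len(str(total_haiku))

print(SYMBOLS)
print(digit_count)
print(f"{total_haiku:.2e}")       # rounded scientific notation
```

The result is a 298-digit number, roughly 2.19 followed by 297 more places.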

How probable is this approach? While not a computer scientist, and should any computer scientists read this and care to comment I would be very interested, 96^150 is a very large number. 2.19*1^297 to be exact. Returning to Borges conceit (a library full of books containing every permutation of 410 page book containing 1.3 million letters), the library containing every one of our haiku permutations would have about 7.3*1^294 books (assuming 300 haiku per book). At 1 book a day, that would mean reading for 2.0*1^292 years. While scientific notation is nifty, in plain numbers that means you would be reading for 2,011,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 years.
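The library figures follow from two divisions, sketched here under the same assumptions as the text (300 haiku per book, one book read per day). I have used a 365-day year and ignored leap years; the constant names are mine.

```python
# How many books hold every permutation, and how long reading them would take.
total_haiku = 96 ** 150

HAIKU_PER_BOOK = 300   # the post's assumption
DAYS_PER_YEAR = 365    # simplification: leap years ignored

books = total_haiku // HAIKU_PER_BOOK
years = books // DAYS_PER_YEAR      # reading one book per day

print(f"books: {books:.2e}")
print(f"years: {years:.2e}")
print(f"digits in the year count: {len(str(years))}")
```

The year count has 293 digits, which is what the written-out number is meant to convey.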

In the computer age the paper analogy is interesting but not necessarily topical to the discussion. Assuming you use computers to generate the set, you could also layer in spelling and grammatical checks to weed out the vast majority of haiku that are completely or partially gibberish. Assuming that you could eliminate ~95% of the set as non-viable haiku (a low estimate), you would still have about 1.1*10^296 members to sift through. Any way you slice it, that is a number of astronomical size. How about our monkeys on their anachronistic typewriters? Assuming a thousand monkeys each churning out a hundred distinct and legible haiku a day, it would take them 3.00*10^286 years to produce 1% of the set of viable haiku (which in turn was only 5% of the total possible).
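The monkey estimate can be reproduced the same way. This is a rough sketch using the figures from the paragraph above (5% of the set viable, a thousand monkeys, a hundred haiku each per day) and a 365-day year; the variable names are illustrative only.

```python
# Time for the monkeys to produce 1% of the viable haiku.
total_haiku = 96 ** 150

viable = total_haiku * 5 // 100    # the 5% that survive spelling and grammar checks
target = viable // 100             # 1% of the viable set

MONKEYS = 1000
HAIKU_PER_MONKEY_PER_DAY = 100
per_year = MONKEYS * HAIKU_PER_MONKEY_PER_DAY * 365

years_needed = target // per_year

print(f"viable haiku: {viable:.2e}")
print(f"years needed: {years_needed:.2e}")
```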

Without a better knowledge of the relevant computer options I cannot say whether this is something a reasonably powerful supercomputer could handle, but the set, even in reduced form, is obviously far too large for any significant portion to be produced. And that is just for a simple haiku. The set for longer forms like the play or novel, thousands of times longer than the haiku, would be exponentially larger, since every additional character space multiplies the set by another 96. Could a thousand monkeys on a thousand typewriters reproduce the works of Shakespeare? Yes. We could even calculate the size of the potential set and the corresponding probability of them hitting one specific permutation like Macbeth within a given timeframe, but without some mechanism driving them towards more productive avenues, the time frame for getting our simian Shakespeare would probably far outstrip the expected lifespan of the sun.
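As a rough illustration of how quickly the set grows with length, the sketch below works in powers of ten rather than exact integers. The text length is a made-up placeholder, not the real character count of Macbeth, so the result only shows the scale of the problem.

```python
import math

SYMBOLS = 96
TEXT_LENGTH = 100_000   # hypothetical character count, for illustration only

# log10 of the number of possible texts of each length
log10_haiku_set = 150 * math.log10(SYMBOLS)
log10_long_set = TEXT_LENGTH * math.log10(SYMBOLS)

print(f"haiku set:     about 10^{log10_haiku_set:.0f}")
print(f"long-form set: about 10^{log10_long_set:,.0f}")
# A single random attempt matches one specific text with probability 1 / set size.
print(f"chance per attempt: about 10^-{log10_long_set:,.0f}")
```

With that placeholder length the exponent itself runs to six digits, compared with 297 for the haiku.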

Update: I’ve been discussing this with my brother (a sociologist with impressive large data set skills) and will revisit the topic in the future with more background research.
