Saturday, March 15, 2008

A thousand monkeys on a thousand typewriters

The idea of artificially creating or recreating literature has been taken on by a couple of interesting authors with varying points of view. Jorge Luis Borges mentions the idea in an essay on the literary experience and explores some of its implications (the sheer volume involved, for one) in his short story “The Library of Babel.” Enjoying a lazy Saturday afternoon in Maseru, Lesotho, I thought I would think about this idea a little on my own.

Take for example a very basic and limited form like the 5-7-5 haiku. Assume that 'thought' is the longest possible one-syllable word; it may not be, but we will use it as a placeholder. The longest string of characters forming one of the 5-syllable lines would thus be 44 character spaces in length: 35 letters, 5 spaces for a potential punctuation mark after each 'thought', and 4 single spaces after the non-terminal words. Applying the same rules, the 7-syllable line would have 62 character spaces, giving us a total of 150 potential character spaces for the entire poem.
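The counting rule above can be checked with a few lines of Python. This is only a sketch of the tally as described; the function name max_line_length is made up for illustration, and 'thought' remains the placeholder longest word.

```python
# Character budget for the longest possible 5-7-5 haiku.
# Assumption from the post: 'thought' stands in for the longest one-syllable word.
LONGEST_WORD = "thought"


def max_line_length(syllables: int, word: str = LONGEST_WORD) -> int:
    """Longest line: one word per syllable, one punctuation slot after
    every word, and a single space after each non-terminal word."""
    letters = syllables * len(word)
    punctuation = syllables
    spaces = syllables - 1
    return letters + punctuation + spaces


line_lengths = [max_line_length(s) for s in (5, 7, 5)]
total_spaces = sum(line_lengths)
print(line_lengths, total_spaces)  # [44, 62, 44] 150
```

Line breaks between the three lines are not counted here, which matches the 150 total used in the rest of the post.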

Given the standard keyboard, there are a limited number of options to fill those spaces. Someone can always insert non-Roman characters or symbols into poetry, but sticking to the standard entry available on an English keyboard, the basic options for filling each character space break down as follows: Alphabet (lower case) – 26, Alphabet (upper case) – 26, numbers – 10, space – 1, other keyboard options – 33, for a grand total of 96.

Given those two facts, 150 spaces and 96 options for each space, the total number of potential haiku fitting into the 5-7-5 scheme and the punctuation assumptions I have set forward is equal to 96^150. In other words, our set of potential 5-7-5 haiku has 96^150 members, within which will be captured all of the 5-7-5 haiku ever written or that can be written. Every master poet’s output, every mediocre effort by uninterested middle schoolers, haiku greater (and worse) than any that have ever been written. Not just in English either. Every potential haiku in Spanish, in French, in Chinese (pinyin anyway), they would all be in that set.

How probable is this approach? While not a computer scientist, and should any computer scientists read this and care to comment I would be very interested, 96^150 is a very large number. 2.19*1^297 to be exact. Returning to Borges conceit (a library full of books containing every permutation of 410 page book containing 1.3 million letters), the library containing every one of our haiku permutations would have about 7.3*1^294 books (assuming 300 haiku per book). At 1 book a day, that would mean reading for 2.0*1^292 years. While scientific notation is nifty, in plain numbers that means you would be reading for 2,011,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 years.
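For anyone who wants to redo the arithmetic, here is a rough Python sketch. Python integers are arbitrary precision, so 96^150 can be computed exactly. The helper sci and the 365-day year are my own assumptions for illustration; the 300 haiku per book and one book per day come from the paragraph above.

```python
ALPHABET_SIZE = 96     # options per character space
TOTAL_SPACES = 150     # character spaces in the longest haiku
HAIKU_PER_BOOK = 300   # assumption stated in the post
DAYS_PER_YEAR = 365    # assumption: leap years ignored

# Exact size of the set, then books in the library and years of reading
# at one book per day (integer division drops the fractional remainder).
total_haiku = ALPHABET_SIZE ** TOTAL_SPACES
books = total_haiku // HAIKU_PER_BOOK
years = books // DAYS_PER_YEAR


def sci(n: int) -> str:
    """Render a large positive integer in rough scientific notation."""
    exponent = len(str(n)) - 1
    mantissa = n / 10 ** exponent  # int / int division copes with huge values
    return f"{mantissa:.2f}e{exponent}"


print("haiku:", sci(total_haiku))  # on the order of 2.19e297
print("books:", sci(books))        # on the order of 7.3e294
print("years:", sci(years))        # on the order of 2.0e292
```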

In the computer age the paper analogy is interesting but not necessarily topical to the discussion. Assuming you use computers to generate the set, you could also layer in spelling and grammatical checks to weed out the vast majority of haiku that are completely or partially gibberish. Assuming that you could eliminate ~95% of the set as non-viable haiku (a low estimate), you would still have roughly 1.1*10^296 members to sift through. Any way you slice it, a number of astronomical size. How about our monkeys on their anachronistic typewriters? Assuming a thousand monkeys each churning out a hundred distinct and legible haiku a day, it would take them 3.00*10^286 years to produce 1% of the set of viable haiku (which in turn was only 5% of the total possible).
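The monkey estimate can be reproduced the same way. Again this is only a sketch: the 5% viable share, the 1% target, and the monkeys' output are the figures assumed in the paragraph above, and the 365-day year is my own simplification.

```python
TOTAL_HAIKU = 96 ** 150   # every possible 150-space haiku

# Assumptions taken from the post.
viable = TOTAL_HAIKU // 20        # 5% survive the spelling and grammar checks
target = viable // 100            # aim for 1% of the viable set
MONKEYS = 1000
HAIKU_PER_MONKEY_PER_DAY = 100
DAYS_PER_YEAR = 365               # simplification: leap years ignored

haiku_per_year = MONKEYS * HAIKU_PER_MONKEY_PER_DAY * DAYS_PER_YEAR
years_needed = target // haiku_per_year

# The number of digits gives the order of magnitude: 287 digits means ~10^286 years.
print(len(str(viable)), "digits in the viable set")
print(len(str(years_needed)), "digits in the years needed")
```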

Without a better knowledge of the relevant computer options I cannot say whether this is something a reasonably powerful supercomputer could handle, but the set, even in reduced form, is obviously far too large for any significant portion to be produced. And that is just for a simple haiku. The set for longer forms like the play or novel, thousands of times longer than the haiku, would be correspondingly larger. Could a thousand monkeys on a thousand typewriters reproduce the works of Shakespeare? Yes. We could even calculate the size of the potential set and the corresponding probability of them hitting one specific permutation like Macbeth within a given timeframe, but without some mechanism driving them towards more productive avenues, the time frame for getting our simian Shakespeare would probably far outstrip the expected lifespan of the sun.

Update: I’ve been discussing this with my brother (a sociologist with impressive large data set skills) and will revisit the topic in the future with more background research.

My head is too small

While individual fields of study go in and out of vogue, the net effort devoted to expanding the sphere of human knowledge has never been greater. Simultaneously, technological innovations have kept pace by making information more accessible than ever before. In response to changes in the amount and accessibility of information, people have developed coping mechanisms with profound consequences for their professional and personal lives.


Professionally, we are moving towards increasing specialization. Achieving expert status in anything other than closely circumscribed areas has been made impossible by the sheer volume of available information. Case in point: the venerable general practice physician has now morphed into a diagnostic expert, without peer at quickly discerning patients’ general problems and then sending them to the appropriate specialist.


Personally, we are becoming more comfortable relying on external information. Constantly checking and relying upon Wikipedia, other people, or any other form of external data storage is how we have adapted to a world where the volume of data necessary to function far outstrips the limits of our memory. While my grandfather found it feasible to keep most of the knowledge he needed on a daily basis inside his head, rising generations are characterized by the ability to quickly find and internalize necessary information from external sources.


While these changes are not necessarily good or bad, they represent a fundamental shift that should not be accepted without consideration of its potential long-term societal and personal consequences.