Graph, Maps, and Trees/Project Idea

In the spirit of generating discussion I’d like to put an idea I had out there. I am sure it needs a lot of refinement, but I am curious whether other people see this as a feasible or informative line of work.

One of my observations on Moretti was that he seemed to start with a detailed understanding of the material (via close reading), and then generate his higher-level views based on patterns that he recognized. This may be a kind of distant reading, but it still requires a large amount of knowledge and insight into the very particulars of the subject matter (who here can name the 30-odd categories of Victorian literature?). How could we go the other direction, and generate our insights without presupposing prior detailed knowledge?

My idea was to use word frequencies to date books. We could reasonably expect some words or phrases to occur more in certain time periods than in others, whether idioms, references to historical events, or new words. Instead of having an expert in a particular period generate a list of key words, we could take a sampling of writing (inside and outside the canon) from a particular period, run it through a computer program, and build a statistical model of the frequencies with which words occur. Then, given a writing sample of an unknown date, we could determine how similar its word frequencies are to our known samples and say how likely it is that the new sample was written in the same time period. If we took known samples from multiple time periods, we could generate a frequency model for each period. Then, given a writing sample from an unknown time period, we could determine which model it most resembles, and conclude that the writing is likely from that time period.
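To make the idea concrete, here is a minimal sketch in Python of one way it could work. It is only an illustration under stated assumptions: each period gets a simple word-count model, unseen words are handled with add-one smoothing, and the "best" period is the one whose model gives the sample the highest log-likelihood. The function names and the two tiny sample texts are invented for this example; a real experiment would need many texts per period.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(samples_by_period):
    """Build one word-count table per period from lists of sample texts."""
    return {
        period: Counter(tok for text in texts for tok in tokenize(text))
        for period, texts in samples_by_period.items()
    }

def log_likelihood(tokens, counts, vocab_size):
    """Log-probability of the tokens under a word-frequency model.

    Add-one smoothing keeps unseen words from getting probability zero.
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[tok] + 1) / (total + vocab_size))
        for tok in tokens
    )

def guess_period(text, models):
    """Return the best-scoring period and the score for every period."""
    vocab = set().union(*models.values())
    tokens = tokenize(text)
    scores = {
        # +1 reserves one vocabulary slot for words never seen in training
        period: log_likelihood(tokens, counts, len(vocab) + 1)
        for period, counts in models.items()
    }
    return max(scores, key=scores.get), scores

# Invented toy samples, one per period (assumption: real use needs far more).
samples = {
    "1850s": ["the carriage waited by the gaslight whilst the governess wrote a letter"],
    "1990s": ["she checked her email on the computer and called a taxi on her phone"],
}
models = train(samples)

modern_guess, modern_scores = guess_period("he sent an email from his computer", models)
older_guess, older_scores = guess_period("the governess waited by the carriage", models)
print(modern_guess, older_guess)
```

With samples this small the scores are close together, so the output only shows the mechanics; whether the approach separates real periods is exactly the open question.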

If this approach actually works, would it be possible to look at other quantitative aspects of writing, and to apply them to groupings other than date written? There are a number of statistical models that look at the patterns of words themselves (which words are more or less likely to follow others). Could we now generate a model for something as specific as a writer, and determine whether a given sample is the work of a specific writer?
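As a rough sketch of the "which words follow which" idea, the Python below builds a first-order Markov (bigram) model per writer and asks which model finds a new sample least surprising. Everything here is an assumption made for illustration: the class and function names, the add-one smoothing, the use of average log-probability per word pair as the score, and the two invented "authors" with one made-up sentence each.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class BigramModel:
    """First-order Markov model: counts of each word given the previous word."""

    def __init__(self, texts):
        self.following = defaultdict(Counter)
        self.vocab = set()
        for text in texts:
            tokens = tokenize(text)
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.following[prev][word] += 1

    def avg_log_prob(self, text, vocab_size):
        """Average smoothed log-probability per word pair in the text."""
        tokens = tokenize(text)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return float("-inf")
        total = 0.0
        for prev, word in pairs:
            # .get avoids adding empty entries for unseen previous words
            counts = self.following.get(prev, Counter())
            total += math.log(
                (counts[word] + 1) / (sum(counts.values()) + vocab_size)
            )
        return total / len(pairs)

def attribute(text, models):
    """Return the best-scoring writer and the score for every writer."""
    shared_vocab = set().union(*(m.vocab for m in models.values()))
    vocab_size = len(shared_vocab) + 1  # one extra slot for unseen words
    scores = {
        writer: model.avg_log_prob(text, vocab_size)
        for writer, model in models.items()
    }
    return max(scores, key=scores.get), scores

# Two hypothetical writers with invented sentences (not real authors).
models = {
    "writer_a": BigramModel(["the sea was grey and the sky was grey and the wind was cold"]),
    "writer_b": BigramModel(["i think that we should go now i think that it is late"]),
}

guess_a, scores_a = attribute("the wind was grey", models)
guess_b, scores_b = attribute("i think that it is late", models)
print(guess_a, guess_b)
```

Note that this only picks the closest of the writers it was trained on; it cannot say "none of the above" without some extra threshold, which is a design question the sketch leaves open.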

While this may seem far-fetched, these techniques have been extremely successful in creating spam filters for email. You take an unknown sample and try to determine if it is similar to known “spam” or to “real” email. This is why spammers first started misspelling things like “viaagra”, then putting those lists of random words at the bottom of emails, and now finally including complete sentences taken from actual books. I will leave out the math for now, but two of the well-studied methods are Bayesian filtering and Markov modeling.

Hopefully I haven’t mortally offended anyone by proposing we treat the canon of western literature like spam . . .

4 Responses to Graph, Maps, and Trees/Project Idea

  1. I wonder how this might speak to assumptions about certain words and phrases, or how the written word differs from trends in spoken language–i.e. would you find many instances of the word “dude” in literature of the late 20th century (as it was commonly used/spoken), or not?

    This also makes me think about the history of the Oxford English Dictionary and etymology (tracing when a word has been used and how)–is that part of this project, or not? Or can you perform this project without the history of words?

  2. Dan says:

    In relation to the music you mentioned, if one can identify which patterns represent certain kinds of music or certain composers, one could theoretically then generate music in a particular style. One of the most incredible things I have ever heard is the work of someone who did just that (David Cope); samples of his work are linked here: http://artsites.ucsc.edu/faculty/cope/mp3page.htm.

    Computers already generate fairly boring writing on things like sports, weather, and financial markets. Is the next step computers that can generate writing in a specific style or in the manner of a specific author?

  3. On a funny side note, Viagra can’t be posted here according to some error message I received. Ha! There we go recognizing unsavory patterns.

  4. No offense taken here, but then again I’m only speaking for myself. Daniel, I think you make a very good point here, not only in your example of searching word patterns based on specific time periods and tracking them to specific writers, but also in pointing out a practical use for such patterns, as in spam filters. The guys doing the spamming are pretty smart to recognize the patterns that would allow their email to reach us, using Viaagara instead of Viagra.

    The idea of patterns based on words and genres is being applied in similar ways in numerous other places, but not for such foul reasons. I believe someone mentioned how Amazon uses tracking methods to connect consumers to things they may like based on what they have already bought, such as books or shoes; these are useful patterns for increasing profit. These patterns also remind me of the popular music website/application Pandora, which also goes by the name the Music Genome Project. The name of their project seems to have its root in a genealogical idea, as in songs are connected by similar characteristics that go back to other songs, music, or musicians from different time periods, past and present. I’m sure some sort of statistical research has been done on the likelihood that someone who listens to, say, Miles Davis also listens to John Coltrane, right?

    How about the popular application named Shazam, where a handheld device can recognize a song based on its sounds? Now imagine if we could do research on the sounds in a song to find out how many times a beat or song was sampled, how many times a certain arrangement was used across a broad array of genres, or what musical time period is most often rehashed into contemporary music.

    Whether literature or music, I feel as though there must be some close reading (or listening) as a start, and then it can branch off from there into the distance, and thus new research is developed. One would have to begin the search with the popular or traditional canon, such as the well-known books or music, before finding the connection to the less popular writer who was published locally or the musician who cut an album at some remote indie label, but either way this would change the research and patterns.

    I hope I’m not too off base with my examples but I think your comment made sense and I wanted to respond.
