In the spirit of generating discussion I’d like to put an idea I had out there. I am sure it needs a lot of refinement, but I am curious whether other people see this as a feasible or informative line of work.
One of my observations on Moretti was that he seemed to start with a detailed understanding of the material (via close reading), and then generate his higher-level views based on patterns that he recognized. This may be a kind of distant reading, but it still requires a large amount of knowledge and insight into the very particulars of the subject matter (who here can name the 30-odd categories of Victorian literature?). How could we go the other direction, and generate our insights without presupposing prior detailed knowledge?
My idea was to use word frequencies to date books. We could reasonably expect some words or phrases to occur more often in certain time periods than in others, whether idioms, references to historical events, or new words. Instead of having an expert in a particular period generate a list of key words, we could take a sampling of writing (inside and outside the canon) from that period, run it through a computer program, and build a statistical model of the frequencies with which words occur. Then, given a writing sample of unknown date, we could measure how similar its word frequencies are to those of our known samples and say how likely it is that the new sample was written in the same period. If we took known samples from multiple time periods, we could generate a frequency model for each period. Given a writing sample from an unknown period, we could then determine which model it most resembles, and conclude that the writing is likely from that period.
If this approach actually works, would it be possible to look at other quantitative aspects of writing, and to apply them to groupings other than date written? There are a number of statistical models that look at the patterns of words themselves (which words are more or less likely to follow others). Could we now generate a model for something as specific as a writer, and determine whether a given sample is the work of a specific writer?
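As a rough illustration of the "which words follow which" idea, the sketch below builds a first-order Markov (bigram) model per writer and asks which writer's model finds a sample least surprising. Everything here is an assumption for the sake of illustration: the function names, the `<s>` start marker, the add-one smoothing, and the `vocab_size` parameter that the caller must supply.

```python
import math
from collections import Counter, defaultdict

def bigrams(text):
    # Pair each word with the word before it; "<s>" marks the start.
    words = ["<s>"] + text.lower().split()
    return list(zip(words, words[1:]))

def train_bigram_model(texts):
    # For each word, count which words follow it in this writer's texts.
    follow = defaultdict(Counter)
    for text in texts:
        for prev, word in bigrams(text):
            follow[prev][word] += 1
    return follow

def avg_log_prob(text, model, vocab_size):
    # Average smoothed log probability of each word given the previous one.
    pairs = bigrams(text)
    if not pairs:
        return float("-inf")
    total = 0.0
    for prev, word in pairs:
        counts = model.get(prev, Counter())
        total += math.log((counts[word] + 1) / (sum(counts.values()) + vocab_size))
    return total / len(pairs)

def most_likely_author(text, models, vocab_size):
    # Choose the writer whose word-transition model best fits the sample.
    return max(models, key=lambda a: avg_log_prob(text, models[a], vocab_size))
```

Averaging over the number of word pairs keeps long and short samples comparable. As with the period model, the answer is only the best match among the writers you trained on, not proof of authorship.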
While this may seem far-fetched, these techniques have been extremely successful in creating spam filters for email. You take an unknown sample and try to determine whether it is more similar to known “spam” or to “real” email. This is why spam first started misspelling things like “viaagra”, then putting those lists of random words at the bottom of emails, and now finally including complete sentences taken from actual books. I will leave out the math for now, but two of the well-studied methods are Bayesian filtering and Markov modeling.
Hopefully I haven’t mortally offended anyone by proposing we treat the canon of western literature like spam . . .