Computational Intelligence for the Humanities

I’m interested in how techniques from machine learning can be used in traditional studies of literature and history. I’ve published work confirming established theories about the authorship of the Pauline epistles, the Documentary Hypothesis of the Hebrew Bible, and related questions in stylistics. Right now I’m particularly interested in designing materials to bridge the gap between traditional and computational studies (e.g. course curricula, software), and am working on machine learning models to help scholars collect, augment, and explore their specialized data sets.

The current state of my research collaborations can be explored here (when it’s not under construction).

Previous work

My earlier research focused on unsupervised approaches to building lexical resources: syntactic verb inventories similar to VerbNet, morphological analyses of languages with a range of typological properties, and extensions to distributional-semantic models like Latent Dirichlet Allocation.

Bayesian modeling

As a graduate student my research focused on unsupervised learning of verb syntax and semantics, particularly in the rapidly growing domain of biomedical research articles; the primary software artifact from this work was graphmod. As a post-doc, I developed unsupervised models of morphology and applied the results to improving speech recognition systems under low-resource conditions, work which led to a best paper award at COLING 2016. At the moment I’m interested (alongside keeping up with the neural world!) in generalizable improvements to MCMC, such as dynamic block sampling, and in situations where data or interpretability needs call for linguistically-informed model design. For example, here’s in-progress Haskell code for unsupervised, non-parametric morphological learning with type-based sampling, drawing on ideas from this paper.
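To make the type-based idea concrete, here is a minimal sketch, not the graphmod code and much simpler than the in-progress implementation: a toy model where each word type gets a single stem/suffix split, so one sampled decision reanalyzes every token of that type at once (the smoothing constant and base-measure size are arbitrary assumptions):

```haskell
-- Minimal sketch of type-based Gibbs sampling for suffix segmentation.
-- Assumed toy model: each word type has one split point, and every token
-- of the type shares that analysis, so a single sampling move updates
-- the whole block of tokens at once.
import qualified Data.Map.Strict as M
import System.Random (StdGen, randomR)

type Counts = M.Map String Double

-- Smoothed probability of a stem or suffix under the current counts
-- (the base-measure size 1000 is an arbitrary assumption).
prob :: Double -> Counts -> String -> Double
prob alpha counts u =
  (M.findWithDefault 0 u counts + alpha)
    / (sum (M.elems counts) + alpha * 1000)

-- Score every possible split of a word type as P(stem) * P(suffix).
-- A full sampler would first remove the type's own counts.
splitScores :: Double -> Counts -> Counts -> String -> [((String, String), Double)]
splitScores alpha stems sufs w =
  [ ((s, f), prob alpha stems s * prob alpha sufs f)
  | i <- [1 .. length w]
  , let (s, f) = splitAt i w ]

-- Draw one split for an entire type, proportional to score; this single
-- draw reanalyzes all tokens of the type in one block move.
sampleSplit :: StdGen -> Double -> Counts -> Counts -> String
            -> ((String, String), StdGen)
sampleSplit g alpha stems sufs w =
  let cands   = splitScores alpha stems sufs w
      total   = sum (map snd cands)
      (r, g') = randomR (0, total) g
      pick _ [] = error "no candidate splits"
      pick acc ((sf, p) : cs)
        | acc + p >= r || null cs = sf
        | otherwise               = pick (acc + p) cs
  in (pick 0 cands, g')
```

The point of the block move is mixing: resampling a whole type in one step lets the chain escape configurations that token-by-token Gibbs updates would leave essentially frozen.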

Social media analysis

Social media analysis is a high priority for all sorts of downstream tasks in industry, government, and academic research. Many of these tasks can be roughly characterized as “attaching labels to things”, particularly documents and the users who write them: determining the language of a document, the age of an author, or the disposition of a review. To facilitate broad, fair comparison of techniques across these tasks, I maintain SteamRoller.
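As a concrete (and deliberately simplistic) instance of the pattern, here is a sketch of a word-unigram naive Bayes labeler; it is not part of SteamRoller, and the smoothing vocabulary size is an arbitrary assumption:

```haskell
-- Minimal sketch of the "attach a label to a document" task: a
-- word-unigram naive Bayes classifier with add-one smoothing.
import qualified Data.Map.Strict as M
import Data.List (maximumBy)
import Data.Ord (comparing)

type Label = String
-- Per label: word counts and number of training documents.
type Model = M.Map Label (M.Map String Double, Double)

train :: [(Label, String)] -> Model
train docs = M.fromListWith merge
  [ (l, (M.fromListWith (+) [ (w, 1) | w <- words d ], 1))
  | (l, d) <- docs ]
  where merge (c1, n1) (c2, n2) = (M.unionWith (+) c1 c2, n1 + n2)

-- Log-probability of a document under one label (the vocabulary size
-- 1e4 used for smoothing is an assumed constant).
score :: Model -> String -> Label -> Double
score m doc l =
  let (counts, n) = M.findWithDefault (M.empty, 0) l m
      total       = sum (M.elems counts)
      totalDocs   = sum [ nd | (_, nd) <- M.elems m ]
      wordLP w    = log ((M.findWithDefault 0 w counts + 1) / (total + 1e4))
      prior       = log (n / totalDocs)
  in prior + sum (map wordLP (words doc))

classify :: Model -> String -> Label
classify m doc = maximumBy (comparing (score m doc)) (M.keys m)
```

The same shape, train on (label, document) pairs, then score each candidate label and take the argmax, covers language, age, and sentiment tasks alike, which is exactly why a shared evaluation harness pays off.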

Domain variation

Another axis of my graduate work was motivated by domain variation within scientific literature, which I referred to as subdomain variation. It’s a particularly acute problem for specialized research fields because author and audience are assumed to share (often highly productive) vocabulary and style. Several broad surveys confirmed that subdomain variation is significant along many dimensions, and that resources built for “scientific language” are still too general.
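One quick way to see the lexical side of this, a sketch rather than the survey methodology itself, is to estimate unigram distributions for two subcorpora and compute their Jensen-Shannon divergence:

```haskell
-- Sketch: quantify lexical subdomain variation as the Jensen-Shannon
-- divergence (in bits) between two subcorpora's unigram distributions.
import qualified Data.Map.Strict as M

type Dist = M.Map String Double

-- Relative frequencies of whitespace-separated tokens.
unigrams :: String -> Dist
unigrams text =
  let counts = M.fromListWith (+) [ (w, 1) | w <- words text ]
      n = sum (M.elems counts)
  in M.map (/ n) counts

-- KL divergence over the support of p; the default is never hit below,
-- because jsd only compares p against the mixture m, which covers p.
kl :: Dist -> Dist -> Double
kl p q = sum [ pw * logBase 2 (pw / M.findWithDefault 1e-300 w q)
             | (w, pw) <- M.toList p ]

-- Jensen-Shannon divergence: 0 for identical corpora, 1 bit for disjoint.
jsd :: Dist -> Dist -> Double
jsd p q = 0.5 * kl p m + 0.5 * kl q m
  where m = M.unionWith (+) (M.map (/ 2) p) (M.map (/ 2) q)
```

Comparing jsd (unigrams corpusA) (unigrams corpusB) for pairs of subdomains, versus a subdomain against general scientific text, gives a rough sense of how much a one-size-fits-all resource is missing.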

Language identification

An important first step in text processing is determining what language a document is written in: this is called language identification (LID). It is fairly easy for longer documents in well-attested languages, but with micro-blogs now used for everyday communication by many underserved language communities, those conditions increasingly fail to hold. I’m currently weighing the advantages of different modeling and annotation choices. Here’s a Haskell implementation of one of our better algorithms, using a technique called Prediction by Partial Matching.
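The linked code is the real implementation; what follows is only a stripped-down sketch of the idea, using a method-C-style escape, no exclusions, and an assumed uniform byte distribution at the bottom of the backoff. Each language gets a character model, and a document is assigned to whichever model encodes it in the fewest bits:

```haskell
-- Sketch of PPM-style language identification: per-language character
-- models with escape-based backoff; the winning language is the one
-- whose model assigns the document the shortest code length.
import qualified Data.Map.Strict as M
import Data.List (minimumBy, tails)
import Data.Ord (comparing)

type Model = M.Map String (M.Map Char Double)

-- Count each character after every context of length 0..order.
trainPPM :: Int -> String -> Model
trainPPM order text = M.fromListWith (M.unionWith (+))
  [ (ctx, M.singleton c 1)
  | chunk <- tails text
  , k <- [0 .. order]
  , length chunk > k
  , let (ctx, c : _) = splitAt k chunk ]

-- P(c | ctx), escaping to the next-shorter context when c is unseen
-- (escape mass as in PPM method C); order -1 is uniform over bytes.
ppmProb :: Model -> String -> Char -> Double
ppmProb m ctx c = case M.lookup ctx m of
  Nothing     -> shorter
  Just counts ->
    let total    = sum (M.elems counts)
        distinct = fromIntegral (M.size counts)
    in case M.lookup c counts of
         Just n  -> n / (total + distinct)
         Nothing -> (distinct / (total + distinct)) * shorter
  where shorter | null ctx  = 1 / 256
                | otherwise = ppmProb m (tail ctx) c

-- Code length of a document in bits under one language's model.
codeLength :: Int -> Model -> String -> Double
codeLength order m text =
  sum [ negate (logBase 2 (ppmProb m ctx c))
      | i <- [0 .. length text - 1]
      , let ctx = drop (max 0 (i - order)) (take i text)
      , let c   = text !! i ]

identify :: Int -> [(String, Model)] -> String -> String
identify order models doc =
  fst (minimumBy (comparing (\(_, m) -> codeLength order m doc)) models)
```

Usage looks like identify 3 [("en", trainPPM 3 enSample), ("fr", trainPPM 3 frSample)] doc, where enSample and frSample are hypothetical training strings; the compression view is part of why such methods hold up comparatively well on very short, noisy inputs.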