If you are reading this post, I will, until proven otherwise, assume that you are human, and that you belong to the subset of humans who are very interested in natural language processing. As a human, when you comprehend these sentences, we can reasonably say that you are engaging in a complex act of communication that is, in some way, shaped by some universal set of processes that underlie human cognition. As a formally trained linguist who branched out into AI through my master’s program, I’m fascinated by a subset of natural language processing that uses computational models and vast amounts of data to look for universal human characteristics in how we process language. This field, formally known as computational psycholinguistics, deals with questions such as the following:
- Is there a connection between processing effort when reading a text, and how unexpected a word/sentence being read is?
- Is there a connection between the length of production of particular sounds, or phones, and how frequently they appear in the phonetic inventory of a language?
- How universal are such tendencies across languages?
This line of research is, sadly, under-represented in AI research communication, but I have found two papers presented in the Linguistic Theory section of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), which I will summarize here. Each deals with exactly those questions with a slightly different emphasis: the first one, “Revisiting the Uniform Information Density Hypothesis”, deals with human text processing, and the other, “A surprisal–duration trade-off across and within the world’s languages”, concentrates on human speech production. Both papers center on the very interesting but slightly esoteric-sounding information-theoretic concept known as the Uniform Information Density Hypothesis (UID). The UID hypothesizes that language users, on some level of language structure, prefer information to be distributed evenly across a linguistic signal. If that sounds vague or dense, just stick with me and it will all make sense soon enough. If you’re already familiar with the UID, feel free to skip the next section of this post, in which I’ll explain the intuitions behind UID and how it relates to language processing. After that, I will detail the experiments in each of the research papers and their results and make some brief concluding remarks.
A Gentle Overview of the Uniform Information Density Hypothesis (UID)
As promised, let’s first dig into the origins and assumptions of the Uniform Information Density Hypothesis. The origins of UID are to be found within information theory, a field that had its beginnings in mathematics and communications and has subsequently been applied in multiple disciplines spanning from engineering to cognitive science. The underlying assumption in information theory, first proposed by Claude Shannon (1948), is that communication can be understood as information transmitted at a certain rate, limited by the upper bound of a metaphorical noisy communication channel. With respect to language, this means that there exists some maximum capacity at which information content can be communicated reliably while minimizing the error across a linguistic signal. Thus, according to most interpretations, what UID claims is that since there is an upper bound on the information rate that humans can reliably handle, they must modulate their flow of information so that it makes the fullest use of that channel capacity, i.e. information density doesn’t ever stray too far from some global mean value of information rate, avoiding an overflow or under-use of the channel at any given point.
So, how is this abstract notion of information density actually measured? Through a concept known as surprisal, which is simply defined as the negative log probability of some linguistic unit conditioned on its prior context. What this means is that, given some kind of linguistic signal, which for now can be anything from a sentence to a sequence of speech sounds, the higher the probability of some individual linguistic unit (e.g. a word or a sound) occurring in that context, the lower its surprisal, i.e. its informational content, becomes. This idea that high-surprisal items carry more informational content reflects the linguistic intuition that, for instance, unpredictable words carry more information than predictable ones in a given sentence, or, as we’ll see in the second paper, that people spend more time producing less predictable speech sounds.
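To make the definition concrete, here is a minimal sketch of surprisal as a function of a probability. The choice of log base 2 (so that surprisal is measured in bits) and the example probabilities are my own assumptions for illustration; they do not come from either paper.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: the negative log (base 2) of a unit's probability in context."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

# Hypothetical probabilities that some language model might assign
# to two candidate next words in the same context.
predictable = surprisal(0.5)      # a likely continuation
unpredictable = surprisal(0.001)  # an unlikely continuation

print(f"predictable word:   {predictable:.2f} bits")
print(f"unpredictable word: {unpredictable:.2f} bits")
```

The less probable the unit, the larger the number that comes out, which is all that "carrying more information" means here.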
Okay, I did say a gentle overview, so how about an analogy: Imagine the flow of information coming at you in this post is a river. Most of the words and concepts are easy to comprehend, which are nice smooth spots on our river of information. Inevitably, you will encounter words or concepts that are more difficult to comprehend, which are rocks and rapids in the river; these are the surprisals. The UID states that, to make things as easy as possible to float down this river of information, the surprisals shouldn’t all be loaded in one part of this blog post like a giant impassable waterfall, but should be spread out so we can make it through the rapids (even if there is still some difficulty!).
Now, another very relevant question that may arise is: how do you estimate surprisal, given that there is no way to measure the ground-truth probability of a unit in context? This is where natural language processing finally comes in! Surprisal can simply be estimated using the output of a computational language model. Traditionally, n-gram models have been used, although recently, as we will see in the first paper, large pre-trained language models based on transformer architectures have been shown to have superior psychometric predictive power. For speech data, a recurrent architecture such as an LSTM is used instead, as we will see in the section on the second paper.
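As a rough illustration of the traditional approach, the sketch below estimates surprisal from a tiny bigram model. The toy corpus, the add-one smoothing, and the function names are all assumptions made for this example; the papers discussed here rely on far larger models and datasets.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count context words and bigrams from a list of tokens."""
    contexts = Counter(tokens[:-1])                  # how often each word is followed by something
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # how often each word pair occurs
    vocab = set(tokens)
    return contexts, bigrams, vocab

def bigram_surprisal(prev, word, contexts, bigrams, vocab):
    """Surprisal (bits) of `word` given `prev`, using add-one smoothing."""
    prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(prob)

# Toy corpus, invented purely for illustration.
corpus = "the dog chased the cat and the dog barked".split()
contexts, bigrams, vocab = train_bigram(corpus)

for word in ["dog", "cat", "barked"]:
    s = bigram_surprisal("the", word, contexts, bigrams, vocab)
    print(f"surprisal of '{word}' after 'the': {s:.2f} bits")
```

In this toy corpus "dog" follows "the" most often, so it gets the lowest surprisal, while "barked" never follows "the" and gets the highest. A neural language model plays the same role, only with a much longer context and better probability estimates.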
The take-away here is that it’s possible, through the use of computational models and corpora that contain various kinds of psychometric data, e.g. reading times or running speech, to test the intuitions of UID more thoroughly in both language production and language processing.
In the remainder of this post, I’ll introduce each of the two aforementioned papers, which deal with various effects of UID on language comprehension and production, and explain their main empirical contributions and findings.
Revisiting the Uniform Information Density Hypothesis
Let’s first explore a paper that investigates the implications of UID for language comprehension and linguistic acceptability, a somewhat less explored area than its implications for language production. Specifically, the authors explore the relationship between sentence-level processing effort and informational content, i.e. the assumption that the more unexpected a word is in a sentence, the heavier the cognitive load for the reader. They also dig into perceived linguistic acceptability, which in linguistics is simply the extent to which a sentence is judged permissible by the users of a language, as defined by rules of grammaticality. They propose that a sentence that is deemed more acceptable is also easier to process, and vice versa.
To make this more concrete, they provide two example sentences that illustrate how uniform information density might manifest:
- “How big is the family that you cook for?”
- “How big is the family you cook for?”
Intuitively, they argue, most English-speakers would prefer the 1st variant of the sentence where the relative clause marker, “that”, is included. The information theoretical explanation for this is simply that if “that” is removed, “you” then carries both the marking of a relative clause, as well as the signal of 2nd person singular/plural. Hence, including the relative clause marker spreads information more evenly across a sentence and avoids rapid switching between dense and less dense information.
For data, the authors use multi-modal reading times (self-paced reading times and eye movements) from four different corpora, as well as human judgements of linguistic acceptability from another two corpora for English and Dutch. The motivation behind their experiments is to fit different mathematical functions of surprisal to the reading time and acceptability datasets to see which one best predicts the psychometric data. In very basic terms, verifying UID would imply that this mathematical function is super-linear. For the reading time data, they first ask whether the processing effort of a linguistic signal is simply a function of the sum of individual surprisals, which would imply that the processing effort increases linearly with the informational content. However, this seems very counter-intuitive, because it would imply that the distribution of information across a sentence does not matter for processing at all, which multiple psycholinguistic experiments suggest isn’t the case. If the function is instead super-linear, high-surprisal units would require a disproportionately high processing effort, which would motivate smoothing information across a linguistic signal and, as such, confirm the UID hypothesis. The same logic then holds for the case of linguistic acceptability.
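The difference between the linear and the super-linear hypothesis can be sketched in a few lines. Here I assume, for illustration only, that processing cost is the sum of each surprisal raised to a power k, with k = 1 as the linear case and k > 1 as the super-linear case; the surprisal values are invented.

```python
def processing_cost(surprisals, k=1.0):
    """Hypothetical processing cost: the sum of surprisals, each raised to the power k."""
    return sum(s ** k for s in surprisals)

# Two made-up sentences with the same total surprisal (8 bits),
# one spread evenly and one with a single large spike.
uniform = [2.0, 2.0, 2.0, 2.0]
spiky = [0.5, 0.5, 0.5, 6.5]

for k in (1.0, 2.0):
    print(f"k={k}: uniform={processing_cost(uniform, k)}, spiky={processing_cost(spiky, k)}")
```

With k = 1 the two sentences cost exactly the same, so how the information is distributed makes no difference. With k = 2 the spiky sentence becomes much more costly than the uniform one, which is precisely the pressure towards even distribution that UID predicts.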
They explore these different hypotheses by fitting different regression models to the data. To estimate surprisal, they use three pre-trained neural language models, namely BERT, GPT-2, and Transformer-XL. The predictive power is then given by the log probabilities of the data under each of these regression models (linear and logistic regression), with or without the sum-of-surprisals term. The results are, for the most part, consistent with UID. However, the super-linear effects are far more consistently observed in the linguistic acceptability data, suggesting that uniform information density is much more strongly correlated with linguistic acceptability than with reading times. This might suggest that when judging the grammaticality of a sentence, language users have a stronger preference for uniform distributions of information, although these results are merely correlational and don’t support strong causal claims.
This brings up another interesting question, namely how to determine the scope of uniformity across a given signal. Assuming that UID implies some kind of regression towards a mean information rate, does this apply more globally, within a language in general, or more locally, within a particular sentence? If a global interpretation of UID is more predictive of the given data, then uniformity is a kind of smoothing effect towards a global mean rate that information must not heavily deviate from. However, if a more local interpretation holds, uniformity should be perceived less as a smoothing effect and more as a pressure to avoid shifting rapidly between content of different information densities, as exemplified by the two sentences earlier in this section. The authors investigate this, among other things, by exploring the effect of varying the context window size on the variability of the computed information rate, and conclude strongly that the global (smoothing) interpretation wins over the local one. This suggests that the initial explanation of UID in this post, as maximizing the use of a metaphorical noisy channel, is indeed the most fitting one. As such, language users have a preference for smoothing over the entire language, rather than on a sentence or phrase level.
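One way to picture the two interpretations is as two different measures of non-uniformity. The sketch below is my own simplification, not the authors' exact formulation: the global measure is the mean squared deviation from a language-wide mean surprisal, and the local measure is the mean squared change between neighbouring units. The numbers, including the language-wide mean, are invented.

```python
def global_nonuniformity(surprisals, language_mean):
    """Mean squared deviation of each surprisal from a language-wide mean."""
    return sum((s - language_mean) ** 2 for s in surprisals) / len(surprisals)

def local_nonuniformity(surprisals):
    """Mean squared change in surprisal between neighbouring units (needs at least two units)."""
    diffs = [(b - a) ** 2 for a, b in zip(surprisals, surprisals[1:])]
    return sum(diffs) / len(diffs)

LANGUAGE_MEAN = 3.0  # hypothetical language-wide mean surprisal, in bits

flat_but_high = [6.0, 6.0, 6.0, 6.0]  # perfectly steady, but far above the mean
jagged = [1.0, 5.0, 1.0, 5.0]         # centred on the mean, but jumps around

for name, sent in [("flat_but_high", flat_but_high), ("jagged", jagged)]:
    g = global_nonuniformity(sent, LANGUAGE_MEAN)
    l = local_nonuniformity(sent)
    print(f"{name}: global={g}, local={l}")
```

The two measures disagree on these examples: the steady but dense sentence looks perfectly uniform locally and poor globally, while the jagged one is the other way around. That disagreement is what allows the two interpretations to be tested against each other on real data.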
A Surprisal-Duration Trade-Off Across and Within the World’s Languages
So far we’ve delved into text processing and explored some interesting findings about reading time and linguistic acceptability data. We have also seen that language users appear to prefer smoothing information across the entire language, as opposed to on a phrase, sentence, or document level, supporting the most popular interpretation of UID as maximizing the use of a metaphorical noisy communication channel. The second paper summarized in this post investigates UID from a slightly different angle, speech production, and addresses an important question about the linguistic universality of UID by extending the findings to a corpus of 600 different languages. Basically, this paper explores the relationship between informational content and speech production rate, and defines information content, i.e. surprisal, as the negative log probability of one individual sound, a so-called phone, given the preceding sequence of speech sounds.
The hypothesis is that, assuming that the channel capacity of human information processing is roughly the same across languages, it can be expected that languages place various cognitive constraints on how humans smooth the signal over the phonetic inventory of a language in speech production. Namely, the authors demonstrate strong evidence for a trade-off between the surprisal of a phone and the time it takes to produce it; in other words, speakers slow down significantly when producing highly surprising speech sounds, and speed up on more predictable ones. They demonstrate that this condition holds within 319 languages, in which more surprising phones are pronounced with longer duration. Furthermore, they demonstrate that word-initial phones, i.e. sounds occurring at the beginning of a word, are on average more surprising.
For data, the authors use a phone-aligned corpus containing readings of the Bible in 600 languages spread over 70 language families. The data contains 20 hours of spoken word on average per language, making their findings the “most representative evidence for UID to date”. The surprisal estimates were computed by a phone-level LSTM model, and the phone durations were given by automatically generated alignments in the running text.
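To give a feel for what a surprisal–duration trade-off looks like in data, here is a minimal sketch that fits a straight line to phone durations as a function of phone surprisal. The numbers are invented toy values, and plain least squares is my own simplification; it leaves out the controls for confounding variables that the authors apply.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Invented toy data: surprisal (bits) and duration (ms) for four phones.
phone_surprisals = [1.0, 2.0, 3.0, 4.0]
phone_durations_ms = [60.0, 70.0, 80.0, 90.0]

slope = ols_slope(phone_surprisals, phone_durations_ms)
print(f"extra milliseconds per bit of surprisal: {slope:.1f}")
```

A positive slope means that more surprising phones take longer to produce, which is the direction of the trade-off the paper reports; a negative slope would indicate the opposite.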
The most interesting finding of this paper is that this condition holds not only within a given language, but also across languages. Understanding what this means requires a bit of background on language universals. Among the world’s languages, there are some features that are shared by many, although not all, languages; for instance, with respect to phonetics, most languages contain syllables constructed of vowels and consonants where the vowel is the nucleus of the syllable, and most languages contain nasal consonants, [m], [n], or [ng]. Thus, their finding implies that languages containing more universally surprising phones compensate by lengthening the duration of their utterances. Furthermore, the authors didn’t find a single instance in which the opposite effect was true, i.e. where shorter phone durations were linked to higher surprisal, and they demonstrate through controlling for several potential confounding variables that this analysis holds remarkably well across languages.
Each of the summarized papers provides evidence of varying strength for the existence of UID and, as such, makes a very important contribution to computational psycholinguistics. An area of research that might be interesting to expand upon is how well the findings for language comprehension and linguistic acceptability hold across non-Indo-European languages, in order to avoid Anglocentric biases. However, this might be challenging because these languages are generally under-represented when it comes to resources and language models.
I hope that through this post, I have given you a better understanding of computational psycholinguistics and how natural language processing is used to further research into the nature of linguistic data. I hope that I have convinced you that this is interesting not only to a particular subset of academics dealing with linguistics and cognitive science, which one might assume given that this topic does not have immediate industrial applications, but also more generally in the NLP space. Finally, I hope that the journey along this admittedly somewhat dense river of information has been mostly gentle, smooth sailing, and that you at least enjoyed the rock maneuvering along the way.