Law students quickly learn that the interpretation of legal texts is an important component of legal practice. Legal disputes frequently turn on the meaning of a contract, will, rule, regulation, statute, or constitutional provision. How do we determine the meaning of legal texts? One possibility is that judges could consult their linguistic intuitions. Another possibility is the use of dictionaries. Recently, however, lawyers, judges, and legal scholars have discovered a data-driven approach to ascertaining the semantic meaning of disputed language. This technique, called "corpus linguistics," has already been used by courts and plays an increasingly prominent role in legal scholarship. This entry in the Legal Theory Lexicon provides a basic introduction to corpus linguistics. As always, the Lexicon is aimed at law students with an interest in legal theory.
Situating Corpus Linguistics
Why has corpus linguistics become important in contemporary legal theory and practice? The answer to that question is complicated. One important impetus is rooted in the revival of formalism in general legal theory: that revival is reflected in developments in the law and theory of both statutory and constitutional interpretation. Statutory interpretation in the 1960s and 1970s was dominated by approaches that emphasized legislative intent and statutory purpose, but in the last three decades, textualism (or "plain meaning textualism") has been in the ascendant. Similarly, living constitutionalism once held hegemonic sway over the realm of constitutional interpretation, but in recent years, originalism has become increasingly important in both the academy and the courts.
The turn to textualism and originalism is based in part on a recognition of the importance of two theoretical distinctions. The first distinction is between "communicative content" and "legal content." Legal texts communicate content to readers: the communicative content of a text is roughly what we call the "linguistic meaning" of the text. But operative legal texts also create "legal content." For example, constitutional provisions give rise to doctrines of constitutional law. These legal rules may be direct translations of the linguistic meaning of the text, but sometimes the legal content can be significantly different from the communicative content: the First Amendment to the United States Constitution begins "Congress shall make no law," but the legal doctrines that implement the freedoms of speech and press apply to judicial and executive action.
Closely related to the distinction between communicative content and legal content is the interpretation-construction distinction. When this distinction is made, "interpretation" means the discovery of communicative content, whereas "construction" means the determination of legal effect. One important component of communicative content is "conventional semantic meaning"--the meaning that is assigned to words and phrases by patterns of usage. Dictionary definitions, if they are accurate, report conventional semantic meanings.
During the period when living constitutionalism and purposivism were the dominant approaches to constitutional and statutory interpretation and construction, the precise linguistic meaning of statutory and constitutional provisions was relatively unimportant. Because courts did not consider themselves bound by the meaning of the words and phrases, fine distinctions about meaning were much less important than the identification of the purposes and values that would determine the outcome of constitutional and statutory disputes. But with the turn to formalist approaches like originalism and textualism, questions of meaning became significantly more important.
One approach to conventional semantic meanings relies on linguistic intuitions and dictionary definitions. But this method has important limitations. Linguistic intuitions are not infallible, and they may be affected by motivated reasoning. Dictionary definitions are based on limited data collection and subjective judgments by the lexicographers who compile the dictionaries. This raises the question whether there are better, more accurate, and more objective approaches.
The gradual ascent in the importance of the linguistic meaning of legal texts occurred at roughly the same time as another important development in the legal academy--the rise of interdisciplinary approaches in general and of empirical legal studies in particular. This focus on empirical and interdisciplinary methods led legal scholars (especially those with training in linguistics and the philosophy of language) to corpus linguistics--a data-driven approach to linguistic meaning.
In sum, the turn to corpus linguistics in law is (at least in part) a result of the new emphasis on the meaning of legal texts (formalism) and the turn to interdisciplinary methods (empirical legal studies and linguistics).
How Does Corpus Linguistics Work?
Corpus linguistics begins with datasets, each of which is called a "corpus" (plural "corpora"). These datasets can be very large--with millions or even billions of words. For example, the Corpus of Contemporary American English (COCA) consists of approximately 520 million words. News on the Web (NOW) consists of more than 5.21 billion words.
Corpus lexicography uses these datasets to investigate the meaning of words and phrases. Whereas traditional dictionary lexicography relied on researchers compiling instances of usage by reading various sources, the corpus approach allows random sampling from large databases with blind coding by multiple coders.
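For readers who want to see the mechanics, the sampling-and-coding workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the placeholder concordance lines, the sense labels "A" and "B," and the two coders' codes are all invented for the example, and the simple agreement rate shown here is just one of several ways that researchers might check whether coders are applying the same categories.

```python
import random
from collections import Counter

def draw_sample(concordance_lines, k, seed=0):
    """Draw k lines at random without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(concordance_lines, k)

def agreement_rate(codes_a, codes_b):
    """Share of lines to which two independent coders assigned the same sense."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same lines")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def sense_frequencies(codes_a, codes_b):
    """Relative frequency of each sense among the lines on which the coders agree."""
    agreed = [a for a, b in zip(codes_a, codes_b) if a == b]
    if not agreed:
        return {}
    counts = Counter(agreed)
    return {sense: n / len(agreed) for sense, n in counts.items()}

# Hypothetical data: 1,000 placeholder lines standing in for search results.
lines = [f"line {i}" for i in range(1000)]
sample = draw_sample(lines, k=10)

# Invented codes: each coder labels the same ten sampled lines without
# seeing the other coder's labels ("blind" coding).
coder_a = ["A", "A", "B", "A", "A", "A", "B", "A", "A", "A"]
coder_b = ["A", "A", "B", "A", "B", "A", "B", "A", "A", "A"]

print(agreement_rate(coder_a, coder_b))
print(sense_frequencies(coder_a, coder_b))
```

In this sketch the frequencies are computed only over the lines on which the coders agree; a real study would need a stated rule for resolving disagreements.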
A complete description of the methods of corpus lexicography is beyond the scope of this brief Lexicon entry, but there are two search techniques that can be described briefly. The first of these is the Key-word-in-context (or KWIC) search. This method is simple: a corpus is searched for occurrences of a string (a word or phrase), and the search reports back the context in which each occurrence appears. The individual instances can then be coded for meaning. The result will be a set of meanings and data about the frequency of the meanings within the sample. The second method involves a search for the collocates of a word or phrase, that is, the words that tend to appear near it: for example, the word "bank" might have collocates like "river," "shady," "deposit," and "ATM." Collocates may help to disambiguate a word like "bank" that has multiple meanings.
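Both search techniques can be illustrated with a toy example in Python. What follows is a minimal sketch, not the interface of COCA or any other real corpus tool: the four-sentence corpus, the stopword list, the function names, and the window sizes are all assumptions made for the illustration.

```python
import re
from collections import Counter

# Hypothetical four-sentence toy corpus; a real corpus is vastly larger.
TOY_CORPUS = [
    "She sat on the shady bank of the river and watched the water.",
    "He used the ATM at the bank to deposit his paycheck.",
    "The bank recorded the deposit the next morning.",
    "Fishermen lined the river bank at dawn.",
]

# A small, invented list of function words to ignore when counting collocates.
STOPWORDS = {"the", "of", "to", "at", "and", "on", "his", "he", "she", "a"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(corpus, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for doc in corpus:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, tok, right))
    return hits

def collocates(corpus, keyword, window=4):
    """Count non-stopword tokens that appear within `window` tokens of the keyword."""
    counts = Counter()
    for doc in corpus:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in neighbors if t not in STOPWORDS)
    return counts

# Print each concordance line with the keyword set off in brackets.
for left, word, right in kwic(TOY_CORPUS, "bank"):
    print(f"{left:>20} [{word}] {right}")

print(collocates(TOY_CORPUS, "bank").most_common(4))
```

In this toy corpus, "river" and "deposit" each occur twice near "bank," which hints at the two senses of the word. The coding of each concordance line for meaning remains a human task in this sketch.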
Application of Corpus Linguistics to Legal Interpretation
How can the techniques of corpus lexicography be applied to the interpretation of legal texts? The primary role of the corpus approach is the identification of conventional semantic meanings for words and phrases. This use of corpus linguistics was pioneered by Associate Chief Justice Thomas Lee of the Utah Supreme Court. In State v. Rasabout, the defendant was convicted of violating a Utah statute that made it a crime to “discharge any kind of dangerous weapon or firearm . . . from an automobile . . . ; from, upon, or across any highway; . . . or . . . within 600 feet of . . . a house.” The word "discharge" has two meanings relevant to firearms: one meaning is roughly "to shoot" and another is "to unload." The former meaning would result in a violation for each shot fired, but the latter would result in only one violation for emptying all of the bullets contained in the firearm. In a concurring opinion, Justice Lee used a COCA search to demonstrate that the sense of "discharge" that applies to a single shot is much more common than the alternative sense. Justice Lee reasoned that this frequency data supported an inference that the ordinary or plain meaning of the statute supported a conviction for multiple violations of the statute.
The use of corpus lexicography may be even more important in the case of constitutional or statutory provisions that were drafted long ago. For example, the provisions of the United States Constitution drafted at the Philadelphia Convention were written using the linguistic conventions of the late eighteenth century--well more than two centuries ago. Because of linguistic drift, the meaning of some of the words and phrases may have changed over time. For example, the phrase "domestic violence" now refers to violence within a family such as spouse abuse, but in the late eighteenth century it referred to activities like riots and insurrections within the boundaries of a state. By using date-restricted searches of corpora that include usage from the late eighteenth century, corpus linguistics can be used to identify the range of semantic meanings during the time the unamended Constitution was drafted.
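A date-restricted search can be sketched in the same spirit. The example below is a hypothetical illustration in Python: the `Document` record, the function name, and the year range are assumptions, and the four short texts are invented placeholders rather than quotations from any historical or contemporary source.

```python
from dataclasses import dataclass

@dataclass
class Document:
    year: int   # year in which the text was written
    text: str   # the text itself

# Invented placeholder records, not quotations from any real source.
DOCS = [
    Document(1787, "the militia was called out to suppress domestic violence in the state"),
    Document(1790, "protection against invasion and domestic violence"),
    Document(1995, "a shelter for victims of domestic violence"),
    Document(2010, "new funding for domestic violence prevention programs"),
]

def date_restricted_hits(docs, phrase, start_year, end_year):
    """Return documents dated within [start_year, end_year] that contain the phrase."""
    phrase = phrase.lower()
    return [
        d for d in docs
        if start_year <= d.year <= end_year and phrase in d.text.lower()
    ]

# Restrict the search to an assumed window around the framing period.
for doc in date_restricted_hits(DOCS, "domestic violence", 1760, 1799):
    print(doc.year, doc.text)
```

The hits returned from the restricted period could then be coded for meaning in the same way as any other sample of usage.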
Limitations on Corpus Linguistics
Corpus lexicography can identify the set of conventional semantic meanings that were available to the drafters of a contract, will, rule, regulation, statute or constitutional provision, but there are important limitations, including the following:
- Technical meanings: Many legal texts employ "terms of art" or technical language, including, of course, the specialized usages of lawyers. Coding a random sample of usages from a general-purpose corpus is not a good technique for sorting out technical usages, but using a corpus that is composed of legal texts from the relevant community of lawyers would enable identification of the relevant range of technical meanings.
- Limits on the Probative Value of Frequency Data: Frequency data may be useful in identifying the "ordinary" or "plain" meaning of a legal text--especially if one sense of an ambiguous word or phrase is overwhelmingly predominant. But where there are multiple senses of a word or phrase, frequency data, although relevant, should be supplemented by context, which often will reveal which sense was communicated to the intended readership.
- The Special Problem of Modulation: Corpus approaches may not be suited to the identification of what are called "modulations"--the use of a word in a new sense. For example, the Recess Appointments Clause may have used the word "recess" in a new modulated sense, in which the recess of the Senate is defined in contrast to the "session" of the Senate. This modulated sense of the word recess would apply to "intersession recesses" but would not apply to other breaks (including short lunch breaks) that are within the literal meaning of the word "recess."
- Semantics versus Pragmatics (especially contextual enrichment): Corpus lexicography has an important role to play in determining the semantic meaning of legal texts, but the bare semantic meaning of a text is not necessarily equivalent to the text's full communicative content. One of the reasons that communicative content is richer than semantic content (literal meaning) is that authors can convey additional meaning through what are called "contextual enrichments." For example, the great philosopher of language Paul Grice identified the phenomenon of "implicature," whereby an author can communicate content without stating it. Grice's famous example is a letter of recommendation: the letter states that the candidate was punctual and attended class regularly. The semantic content is mildly positive, but in the context of a recommendation, this is "damning with faint praise" and communicates the message that the candidate is not qualified for the position.
Because of these limitations, corpus linguistics does not provide a complete method of statutory or constitutional interpretation; it needs to be supplemented by other methods. For example, in the case of the United States Constitution, the method of corpus linguistics could be combined with study of the constitutional record and immersion in the linguistic world of the period in which a given constitutional provision was written.
The introduction of a new methodology to legal theory is a rare event, but corpus linguistics is one of the black swans. It is still early days, but the use of corpus methods has already begun in earnest--both in the courts and the academy. The Bibliography provides many of the key sources in a literature that still can easily be read in just a few days.
Related Lexicon Entries
- Legal Theory Lexicon 019: Originalism
- Legal Theory Lexicon 030: Textualism
- Legal Theory Lexicon 043: Formalism and Instrumentalism
- Legal Theory Lexicon 051: Vagueness and Ambiguity
- Legal Theory Lexicon 071: The New Originalism
- Legal Theory Lexicon 074: Restraint and Constraint in Constitutional Theory
- Legal Theory Lexicon 078: Theories of Statutory Interpretation and Construction
- Legal Theory Lexicon 079: Communicative Content and Legal Content
Bibliography
- Wayne Davis, Implicature, Stan. Encyc. Phil. (last revised Sept. 22, 2010), https://plato.stanford.edu/entries/implicature/
- Clarissa Hessick, Corpus Linguistics and the Criminal Law, 2018 B.Y.U. L. Rev. (2018), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3031987
- Thomas R. Lee & Stephen C. Mouritsen, Judging Ordinary Meaning, 127 Yale L. J. (forthcoming 2018), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2937468
- D. Carolina Núñez, War of the Words: Aliens, Immigrants, Citizens, and the Language of Exclusion, 2013 B.Y.U. L. Rev. 1517 (2013)
- Vincent B. Y. Ooi, Computer Corpus Lexicography (1998)
- Daniel Ortner, The Merciful Corpus: The Rule of Lenity, Ambiguity and Corpus Linguistics, 25 B.U. Pub. Int. L.J. 101 (2016)
- James C. Phillips, Daniel M. Ortner & Thomas R. Lee, Corpus Linguistics & Original Public Meaning: A New Tool to Make Originalism More Empirical, 126 Yale L.J. F. 21 (2016)
- Lawrence M. Solan, Can Corpus Linguistics Help Make Originalism Scientific?, 126 Yale L.J. F. 57 (2016)
- Lawrence B. Solum, The Interpretation-Construction Distinction, 27 Const. Comment. 95 (2010)
- Lawrence B. Solum, Triangulating Public Meaning: Corpus Linguistics, Immersion, and the Constitutional Record, 2018 B.Y.U. L. Rev. (forthcoming 2018), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3019494
- Lee J. Strang, How Big Data Can Increase Originalism’s Methodological Rigor: Using Corpus Linguistics to Reveal Original Language Conventions, 50 U.C. Davis L. Rev. 1181 (2017)
- State v. Rasabout, 356 P.3d 1258 (Utah 2015)
- People v. Harris, 885 N.W.2d 832, 838–39 (Mich. 2016)
(This entry was originally posted on October 22, 2017.)