How Copyright Law Could Make or Break the Future for Digital Humanities.
This case raises many legal, technical, and epistemological issues related to the future of higher education, research, and scholarship – especially for efforts that seek to take advantage of “big data” analytics and methodologies. Advances in computer technology and the availability of digital texts give humanities scholars a chance to do what biologists, physicists, and economists have been doing for decades: analyze massive amounts of data. Large-scale quantitative projects like those being undertaken at the Stanford Literary Lab are unearthing previously unknowable information about individual works and entire genres of literature.
Researchers working in Information Retrieval frequently use text mining and computer-aided classification to identify and retrieve relevant documents. Using similar techniques, researchers in the Digital Humanities are able to identify and retrieve relevant texts, often from unlikely places. Humanities researchers can thereby expand their traditional study of a few canonical works to a study of any of the several million books in the larger archive of literary history, an archive that has hitherto remained hidden because of the limits of human reading capacity.
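As a rough illustration of the retrieval techniques mentioned above, the following minimal sketch ranks a handful of texts against a query using TF-IDF weighting and cosine similarity, a common information-retrieval approach. It is not drawn from the brief or from any particular project: the three-entry corpus, the query, and the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `rank`) are all hypothetical, and the smoothed IDF formula is one of several conventions in use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(docs):
    """Build a TF-IDF weighted term vector for each document.

    Returns (vectors, idf), where vectors maps document name -> {term: weight}.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        name: {t: c * idf[t] for t, c in term_counts.items()}
        for name, term_counts in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    vectors, idf = tf_idf_vectors(docs)
    q_counts = Counter(tokenize(query))
    # Query terms that never occur in the corpus carry no weight.
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scores = [(name, cosine(q_vec, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus; a real study would run over millions of volumes.
CORPUS = {
    "gothic_novel": "The castle stood dark upon the moor and a ghost walked the ruined abbey at night",
    "sea_tale": "The ship sailed from the harbour and the captain watched the storm upon the sea",
    "pastoral": "The shepherd walked the green meadow and sang to the sheep in spring",
}

ranking = rank("ghost in a ruined castle", CORPUS)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Note that the ranking is computed from word counts and weights alone; nothing in the output reproduces the prose of the underlying texts.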
In this amicus brief, scholars from disciplines including law, computer science, linguistics, history, and literature ask the court to consider the impact on this vital area of research when it rules on the legality of mass digitization. Specifically, the brief addresses whether United States copyright law should stand as an obstacle to statistical and computational analysis of the millions of books owned by the nation’s great university libraries.
The brief argues that, just as copyright law has long recognized the distinction between protection for an author’s original expression (e.g., the narrative prose describing the plot) and the public’s right to access the facts and ideas contained within that expression (e.g., a list of characters or the places they visit), the law must also recognize the distinction between copying books for expressive purposes (e.g., reading) and copying them for nonexpressive purposes, such as extracting metadata and conducting macroanalyses. The amici urge the court to follow established precedent on Internet search engines, software reverse engineering, and plagiarism detection software, and to hold that the digitization of books for text-mining purposes is a form of incidental or intermediate copying that should be regarded as fair use as long as the end product is also nonexpressive or otherwise non-infringing.