Chapter Title

Big Data, Machine Learning, and the Credibility Revolution in Empirical Legal Studies

Document Type

Book Chapter


The so-called credibility revolution changed empirical research (see Angrist and Pischke 2010). Before the revolution, researchers frequently relied on attempts to statistically model the world to make causal inferences from observational data. They would control for confounders, make functional form assumptions about the relationships between variables, and read regression coefficients on variables of interest as causal estimates. In essence, they would rely heavily on ex post statistical analysis to make causal inferences. The revolution centered around the idea that the only way to truly account for possible sources of bias is to remove the influence of all confounders ex ante through better research design. Thus, since the revolution, researchers have attempted to design studies around sources of random or as-if random variation, either with experiments or what have become known as “quasi-experimental” designs. This credibility revolution has increasingly brought quantitative researchers into agreement that, in the words of Donald Rubin, “design trumps analysis” (Rubin 2008).

However, the research landscape has changed dramatically in recent years. We are now in an era of “big data.” At the same time as the internet vastly expanded the number of available data sources, sophisticated computational resources became widely accessible. This has opened up a whole new frontier for social scientists and empirical legal scholars: textual data. Indeed, most of the information we have about law, politics, and society is contained in texts of one kind or another, almost all of which are now digitized and available online. For example, in the 1990s, federal courts began to adopt online case records management, known as CM/ECF, where attorneys, clerks, and judges file and access documents related to each case. Using the federal government’s PACER database, researchers (both academic and professional) can now easily access the dockets and filings for each case filed in federal court. LexisNexis, Westlaw, and other companies have further improved access by providing raw text versions of a wide range of legal documents, along with expert-coded metadata to help researchers more easily find what they are looking for. And yet, despite the potential of these newly available resources, their sheer volume presents challenges for researchers. A core problem is how to draw substantively important inferences from a mountain of often unstructured digitized text. To deal with this challenge, researchers are turning their attention back toward the tools of statistical analysis. As many of the essays in this volume demonstrate, there is now a surging interest among researchers in one particularly powerful tool of statistical analysis: machine learning.

This chapter addresses the place of machine learning in a post–“credibility revolution” landscape. We begin with an overview of machine learning and then make four main points. First, design still trumps analysis. The lessons of the credibility revolution should not be forgotten in the excitement around machine learning; machine learning does nothing to address the problem of omitted variable bias. Nonetheless, machine learning can improve a researcher’s data analysis. Indeed, with growing concerns about the reliability of even design-based research, perhaps we should be aiming for triangulation rather than design purism. Further, for some questions, we do not have the luxury of waiting for a strong design, and we need a best approximation of an answer in the meantime. Second, even design-committed researchers should not ignore machine learning: it can be used in service of design-based studies to make causal estimates less variable, less biased, and more heterogeneous. Third, there are important policy-relevant prediction problems for which machine learning is particularly valuable (e.g., predicting recidivism in the criminal justice system). Yet even with research questions centered around prediction, a focus on design is still essential. As with causal inference, researchers cannot simply rely on statistical models but must also carefully consider threats to the validity of predictions. We briefly review some of these threats: GIGO (“garbage in, garbage out”), selective labels, and Campbell’s law. Fourth, the predictive power of machine learning can be leveraged for descriptive research. Where possible, we illustrate these points using examples drawn from real-world research.

Publication Date


Book Title

Law as Data