AI thinks you should go to jail, even if you didn't do the crime

Evidence from using ChatGPT as a prosecutor

By Rory Pulvino (Justice Innovation Lab), Dan Sutton (Stanford Law School), and JJ Naddeo (Justice Innovation Lab & University of Michigan Law School)

Data description

Police reports

To create an audit dataset of police reports, we submitted public records requests to multiple police agencies across the United States for samples of police reports from 2022 or 2023 for cases involving the following, generally non-violent crimes: shoplifting, petty larceny, drug possession, and drug sales. Three departments responded and provided samples based on these criteria. From these police reports, we identified the reports involving an arrest and randomly sampled 20 of them for the experiment; the cost and time required for the experiment limited us to 20 reports. Only reports involving an arrest were included because such reports better document the alleged offense and contain sufficient facts for police to have made a legal assessment and determined that an individual should be arrested.

After identifying all reports with an arrest, each report was reviewed by a researcher, and any references to an individual’s name or role were replaced with generic identifiers such as “Suspect 1” and “Officer 1.” The generic suspect identifier was then replaced to create 15 versions of each report in which the arrestee is White and 15 versions in which the arrestee is Black, using randomly drawn names that research has shown are predominantly associated with each respective racial group. Names were sourced from Rosenman et al., 2023. The names of other individuals in the reports (e.g., police officers, victims, and witnesses) were simply replaced at random. For example, below are both versions of a report where, in the Replaced Report, 'SUSPECT 1' is replaced with the randomly generated name 'MARK WEAVER'.
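The substitution step itself is mechanically simple. The Python sketch below shows one way it could be scripted; the name pools and file name here are illustrative placeholders, not the study's actual lists, which were drawn from Rosenman et al. (2023).

```python
import random

# Illustrative name pools; placeholders, not the study's actual lists,
# which were drawn from Rosenman et al. (2023).
WHITE_ASSOCIATED_NAMES = ["MARK WEAVER", "COLE MEYER", "HUNTER SCHMIDT"]
BLACK_ASSOCIATED_NAMES = ["DARNELL WASHINGTON", "TYRONE JACKSON", "JAMAL ROBINSON"]

def make_versions(template: str, name_pool: list[str], n: int) -> list[str]:
    """Swap the generic arrestee identifier for a randomly drawn name."""
    return [template.replace("SUSPECT 1", random.choice(name_pool))
            for _ in range(n)]

# "redacted_report.txt" is a hypothetical file holding one redacted report.
with open("redacted_report.txt") as f:
    template = f.read()

white_versions = make_versions(template, WHITE_ASSOCIATED_NAMES, 15)
black_versions = make_versions(template, BLACK_ASSOCIATED_NAMES, 15)
```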

Introducing flaws

To better measure ChatGPT’s ability to carry out basic legal assessments, we created a “flawed” template for each police incident report. Flawed templates set a baseline for the chatbot’s ability to recognize cases with legal deficiencies. To create these templates, the original 20 incident narratives were altered to include facts that negated necessary elements of the crime, introduced constitutional policing violations, or otherwise created legal ambiguity about the strength of the case. These flawed templates are paired with all four prompts in the same manner as the original templates.

In evaluating differences between the original and flawed templates, the flawed templates are ranked from level one to level three according to the severity of the introduced legal issue. Level one flawed templates include a minor legal issue, such as a small inconsistency or minor procedural error that, while noteworthy, does not significantly undermine the core evidence. Level two flawed templates include problems with evidence that could affect its reliability or admissibility; these issues might create reasonable doubt about some aspects of the case, making prosecution more challenging but not necessarily impossible. Finally, level three flawed templates include a critical evidentiary or constitutional flaw, meaning major issues that fundamentally undermine the integrity of the evidence or the case as a whole. Each original template was modified once, so some templates were altered to level one, some to level two, and some to level three. The uneven distribution may drive some differences in outcomes or affect the variance of some outputs.
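For bookkeeping, the full set of experimental conditions can be thought of as every template (original or flawed, with its flaw level) crossed with every prompt. The sketch below illustrates one way to represent this; the record fields and prompt labels are our own assumptions for illustration, not the study's actual code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

@dataclass
class Template:
    report_id: int             # 1-20, one per sampled police report
    flawed: bool               # False = original narrative, True = altered
    flaw_level: Optional[int]  # 1-3 for flawed templates, None for originals
    text: str

# Two versions of each of the 20 reports: 40 templates in total.
templates = [
    Template(1, flawed=False, flaw_level=None, text="..."),
    Template(1, flawed=True, flaw_level=2, text="..."),
    # ... remaining report pairs
]

# Hypothetical labels for the four prompts described in the text.
PROMPTS = ["prosecutor_low_context", "prosecutor_high_context",
           "defense_low_context", "defense_high_context"]

# Every template, original and flawed, is paired with all four prompts.
conditions = list(product(templates, PROMPTS))
```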

ChatGPT model

Users can control which ChatGPT model is used and maintain instructions across new requests through the ChatGPT application programming interface (API). ChatGPT uses “assistants,” which are persistent identities that “remember” their instructions and can build on prior prompts within a conversation to shape future output. Using the API, we control which assistants are used and what instructions they are given.
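As a minimal sketch of what this looks like with the OpenAI Python SDK's Assistants API, where the assistant name and instruction text below are illustrative rather than the study's actual prompts:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One assistant per prompt/jurisdiction pairing; the instructions persist
# across every request routed to this assistant.
assistant = client.beta.assistants.create(
    name="prosecutor-low-context-dept-1",  # hypothetical naming scheme
    model="gpt-3.5-turbo",
    instructions=(
        "You are a prosecutor reviewing police reports under the "
        "criminal code of State X..."      # abbreviated, illustrative
    ),
)
```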

To control for whether the assistant would be affected by the jurisdiction, we created an assistant for each prompt (prosecutor low context, defense counsel low context, etc.) and each jurisdiction that provided police reports. Across our four prompts and three police departments, this yields twelve assistants. Using a separate assistant for each police department controls for location-specific features, since the instructions included information on which state criminal code to apply. All data was compiled using the GPT-3.5-Turbo model, initially released in March 2023, at the default temperature of 1. Finally, to ensure that the assistant did not learn to recognize previously provided templates, each new request was made as a new “conversation” with the assistant.
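Starting each request in a fresh thread guarantees the assistant sees only its standing instructions and the current report. A sketch of one such request, again using the OpenAI Python SDK; the helper name and polling details are our own:

```python
import time

def query_once(client, assistant_id: str, report_text: str) -> str:
    """Submit one report in a brand-new thread so no prior template is visible."""
    thread = client.beta.threads.create()  # fresh "conversation" each time
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=report_text
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=assistant_id
    )
    # Poll until the run reaches a terminal state.
    while run.status not in ("completed", "failed", "cancelled", "expired"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id, run_id=run.id
        )
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value  # newest message first
```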

Final dataset

The final dataset consisted of over 144,000 ChatGPT responses generated from the 40 original and flawed police report templates. For each report, we submitted the same prompt-and-report pairing 30 times so that we could aggregate the responses and estimate quantitative measures such as the prosecution recommendation score. Repeatedly submitting the exact same prompt to capture variance in output is standard practice in research on large language models.
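Once the raw responses are parsed into a table, the per-report aggregation is straightforward. The sketch below assumes a hypothetical CSV schema in which each row is one response with its extracted recommendation score; the file and column names are illustrative.

```python
import pandas as pd

# Hypothetical schema: one row per response, with the prosecution
# recommendation score already parsed out of the raw ChatGPT text.
df = pd.read_csv("responses.csv")  # columns: report_id, prompt, race, score

summary = (
    df.groupby(["report_id", "prompt", "race"])["score"]
      .agg(["mean", "std", "count"])  # spread across repeated responses per cell
      .reset_index()
)
print(summary.head())
```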