AI thinks you should go to jail, even if you didn't do the crime
Evidence from using ChatGPT as a prosecutor
By Rory Pulvino (Justice Innovation Lab), Dan Sutton (Stanford Law School), and JJ Naddeo (Justice Innovation Lab & University of Michigan Law School)
Published July 29, 2025
Introduction
While algorithms and machine learning have long been used in the criminal justice space, the use of generative artificial intelligence (AI) is transforming how the system works. Large language models (LLMs) are especially suited to legal practice where prosecutors—the justice system’s most powerful actors—spend much of their time drafting memos, motions, and charging documents that shape every stage of a criminal case. While all industries adopting AI should be cautious about whether a given tool exhibits racial or gender bias, this is of particular concern in a criminal legal system with well-documented disparities.
To understand these risks for prosecutor offices, we tested ChatGPT (model 3.5-Turbo) on a common legal task: writing a memo about how to handle a criminal case. We evaluated this task from both prosecutor and defense attorney viewpoints to see if the results held across different roles. While our testing was designed to evaluate racial bias, what we discovered was unexpected: the AI model revealed a consistent and arguably problematic preference for prosecution, regardless of case specifics.
Using different AI prompts and a variety of real-world cases involving common non-violent offenses, we consistently found that the model recommended prosecution over dismissal or diversion. This was true even after the facts of the arrest were changed to introduce legal issues and facts that undermined the case against the arrestee. Though newer AI models may perform differently on the task, this analysis demonstrates that there are likely unforeseen risks in using a generative AI system without knowing the biases of the model.
The preference for prosecution is particularly concerning because it’s hidden from users and was only revealed through rigorous testing. These risks are compounded by our limited knowledge of exactly which language cues or features most influence a response. As research on image classification has shown, models can rely heavily on patterns or details that are not salient—or even recognizable—to human observers, sometimes giving disproportionate weight to subtle or seemingly irrelevant features. Similarly, generative AI models may be influenced by linguistic signals in ways that are opaque to developers and users, leading to unpredictable or unintended outcomes.
The LLMs that underpin AI chatbots produce responses based on complex predictions of what word or string of words should come next. Responses differ from one another because there is variation in a model’s predictions of the next word. Throughout this report, when we refer to the biases, values, and preferences of a model, we mean patterns in those word-choice predictions, which are shaped by the people and choices behind the model. AI developers can influence, though not completely control, a model’s responses by choosing what data the model is trained on, what safety testing and reinforcement learning are done, and what safeguards are in place.
This analysis demonstrates a number of issues with the rapid adoption of generative AI in the criminal justice space, regardless of the model:
- There is a large gap in our understanding of how AI "reasons," given its underlying biases and values. A model’s biases and values are manifestations of the interaction between the underlying algorithm and a user prompt. Model developers are still learning how to shape and influence model responses, given that they cannot completely control them. Some companies refer to this shaping as "Constitutional AI" or "character" training, an acknowledgement by developers that any identified biases or values of a model are the result of choices made during the model's development. Importantly, there are no U.S. federal regulations to minimize harmful impacts. Furthermore, companies—including those developing and selling AI tools to prosecutors—are not necessarily disclosing the values they are imbuing into their models.
- Tools like ChatGPT do not, and likely cannot, include safeguards that prevent all misuses of the tool. The flexibility of these tools and models means that users will find new ways to use the tool without necessarily recognizing that there are unforeseen risks.
- Certain biases or values of a model may be determinative of its response, even when there is significant information that goes against the model’s biases or values. Knowing what these orientations are and how they are triggered is important to safely using AI.
- Prompting is very important in shaping not only the usefulness of the generative AI response but also, crucially, the biases and values that the model exhibits. In this testing, changing the perspective of the prompter—prosecutor versus defense counsel—altered the AI responses. Additionally, the details of the prompt influenced how punitive and legally accurate those responses were.
This analysis examines ChatGPT 3.5-Turbo, which was a leading and widely used AI model during our testing period. While newer, more powerful AI models have been released since we conducted this analysis, some likely with improved legal reasoning capabilities, this evolutionary progress actually reinforces rather than undermines our results. Each new generation of AI systems will likely introduce its own sets of biases and orientations—characteristics that will remain undetected until the models undergo rigorous, domain-specific testing like the analysis presented here.
In this report, we share results from ongoing research being prepared for academic publication to address urgent questions facing prosecutor offices as they adopt AI tools. We discuss current uses and risks of generative AI in prosecution, describe both our testing and findings, and conclude with a discussion on why continued research in this area is needed.
Prosecutors are using generative AI
Before OpenAI’s public release of ChatGPT, earlier versions of large language models generated responses that exhibited well-documented problems including racism and sexism. These widely reported issues contributed to concerns about the technology’s readiness for professional use, likely shaping the legal field’s cautious approach to adoption. OpenAI and other AI developers quickly adapted in an attempt to mitigate such toxic behaviors while improving their models’ overall responses.
ChatGPT model 3.5 represented a significant improvement, generating more coherent responses and avoiding the most demonstrable flaws of earlier versions. In addition to these advances, the models became available to developers to use in tools such as chatbots. Tech start-ups like CoCounsel and ProsecutionAI are building tools for lawyers and targeting prosecutors.
It is not surprising, then, to find that lawyers, despite initial caution, began using generative AI tools. Along the way, lawyers have become familiar with other issues with AI, notably hallucinations. There have been well-reported mishaps, such as when former Trump lawyer Michael Cohen provided his criminal defense attorneys with AI-invented case citations, which the lawyers then submitted in court filings. Though such instances serve as cautionary tales for lawyers, the AI industry continues to adapt to curtail problems like hallucinations, and lawyers continue to experiment. With the advent of retrieval-augmented generation, or RAG (a technique that restricts a model to a specified set of documents), and changes to user interfaces, AI tools can now be configured so that chatbots draw only on the information a user provides, which may help mitigate hallucinations and data privacy issues.
At the most recent Association of Prosecuting Attorneys Data Summit, a conference featuring many of the most technologically advanced prosecutor offices in the country, several offices presented on their use of AI. Based on these presentations, most offices seem to be using AI tools tailored for specific purposes, such as enhancing photos or video, transcribing and summarizing conversations, or drafting correspondence with defense counsel. But at least two offices are using open-ended chatbots like ChatGPT in ways similar to our testing: providing the chatbot with case information, asking it questions about the case, and having it perform tasks like drafting correspondence and legal documents. This type of use will only grow as AI tools become more powerful and lawyers learn to use agentic AI—systems that can independently complete multi-step tasks.
What are the risks of these uses?
Lawyers using a chatbot may believe that since they maintain control of the output and the tool is not making a final, consequential decision, there is little risk of harm. Our analysis demonstrates that this is likely not true. Generative AI models, even those that do not hallucinate and do not provide responses that exhibit racism or sexism, may be biased in unexpected ways that put their output at odds with how a human would reasonably act.
Not all bias is inherently problematic. Many biases reflect deliberate policy choices or organizational values. A prosecutor’s office that emphasizes prosecution over diversion operates with a particular orientation toward justice that may be intentional and transparent. The real concern arises when AI systems exhibit hidden or unintended biases that users are unaware of, especially when these biases could influence human decision-making.
A few of the ways that models could exhibit bias in the criminal legal context that would not necessarily be obvious to a user are:
- Generative AI could generate subtly biased text. For instance, when describing the same criminal incident, the model may describe a defendant’s actions differently depending on race. Where the description makes the defendant seem more violent or aggressive, such descriptions may then influence readers, including judges, and ultimately decisions in the case. We describe testing for this issue in our forthcoming paper, and it is an issue that some developers are openly concerned about.
- Recent empirical research has shown that some prosecutors, rather than compounding racial disparities, may actually use their discretion to offset or “reverse” these disparities when they are aware of upstream bias in the cases they receive. By introducing AI into this process, this important corrective effect could be dampened or lost entirely, since AI models are typically unable to recognize or respond to the social and historical context that informs human prosecutorial discretion.
- Generative AI may currently lack the capacity for the nuanced judgment that is central to prosecutorial decision-making. When tasked with decisions that require both factual determination and value judgment—such as whether a defendant deserves a diversion opportunity—models may exhibit a form of strict adherence to legal elements, focusing primarily on whether the facts support the crime's technical requirements rather than weighing broader considerations.
Prosecutors routinely balance public safety, fairness, societal impact, and individual circumstances when deciding whether to pursue, divert, or dismiss a case. By contrast, generative AI tools appear limited to assessing whether the strict elements of an offense are met, without considering factors such as the defendant's circumstances, the potential impact on public trust, or whether prosecution serves justice in the particular case. This limitation helps explain the model's default bias toward recommending prosecution—akin to a police officer who stops every motorist for minor infractions, ignoring the complex social context in which those laws operate.
- Models might also hold “default biases” that are less obvious and are more similar to implicit biases in humans. These default biases are less readily apparent because unlike a tendency to strictly adhere to a set of rules or laws, a model’s output may vary sufficiently for a user to not recognize that its output is overwhelmingly skewed in the aggregate.
This report demonstrates that ChatGPT exhibits an orientation towards prosecution, regardless of race, for the most minor of offenses. As described in the next section, our analysis uses a variety of prompts to have ChatGPT draft a legal memo when provided with a real-world police report for common, non-violent offenses. While ChatGPT did not always recommend prosecution, it overwhelmingly did, even when the offense at issue was the theft of $13 of goods or the possession of a marijuana vape by a teenager. For such minor offenses, human judgment and the exercise of prosecutorial discretion are pivotal in assessing how to handle a case.
How we tested ChatGPT
This analysis was initially designed as a racial bias audit, similar to studies of the effects of race in job applications. We conceived of this approach after seeing lawyers use tools like ChatGPT to draft legal memos. We were concerned that AI models that have exhibited racial biases in some contexts would be used within a criminal legal system with well-documented racial disproportionalities, ultimately perpetuating racial bias in the system. Based on these concerns, we designed a test to identify whether ChatGPT provides racially biased responses when prompted with a real-world criminal legal task.
The task
When someone is arrested or cited for a criminal violation, the arresting officer will typically write a police report describing the incident and submit the report to a local prosecutor’s office. Prosecutors then review the report, additional evidence such as drug lab reports, victim or witness statements, and the arrestee’s criminal history before deciding how to proceed. Generally, prosecutors can do one of three things with a case—dismiss the case (which may be required if there is not sufficient evidence), divert the case to an alternative program, or proceed to prosecute the individual. Deciding among these options becomes harder as cases grow more complex or severe.
Less serious offenses, such as simple drug possession or theft of small amounts, are generally "easier" for a prosecutor to decide what to do with. The elements of the offense are fairly straightforward, there is no violent conduct, and there are often diversion programs for such offenses because those arrested are generally considered to be at low risk of committing new, more serious offenses. These are the types of cases that many prosecutors handle early in their careers. Offenses like these make up the largest portion of arrests in the U.S.
Yet even for these seemingly straightforward cases, prosecutors must exercise complex judgment—often relying on their experience to weigh not only the facts of the case, but also the potential long-term impact of prosecution versus diversion or dismissal. Recent research shows that prosecuting low-level, nonviolent misdemeanors can actually increase an individual's future involvement in the criminal legal system, while declining to prosecute these cases can significantly reduce subsequent criminal complaints.
Our analysis focuses on these cases because of their prevalence and because it is likely that generative AI will be adopted to help prosecutors manage large caseloads of these matters. In our testing, we simulate a prosecutor who provides ChatGPT with a police report and asks it to draft an initial charging memo specifying a recommendation to dismiss, divert, or prosecute the case. We also asked ChatGPT to provide a numerical score reflecting the strength of its recommendation, with higher scores indicating stronger support for prosecution.
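For readers who want a concrete picture of this setup, the sketch below shows one way such a request could be issued programmatically. It is purely illustrative: the prompt wording is hypothetical (the actual prompts used in the study are reproduced later in this report), and the use of the OpenAI Python client is an assumption about tooling, not a description of the study's code.

```python
# Illustrative sketch only: hypothetical prompt wording and an assumed use of
# the OpenAI Python client; not the study's actual code or prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_charging_memo(police_report: str) -> str:
    """Ask the model for a charging memo with a recommendation and a 0-100 score."""
    prompt = (
        "You are assisting a prosecutor with an initial case review.\n"
        "Read the police report below and draft a short charging memo.\n"
        "Recommend one of: dismiss, divert, or prosecute, and include a score\n"
        "from 0 (strongly recommend dismissal) to 100 (strongly recommend prosecution).\n\n"
        f"POLICE REPORT:\n{police_report}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the model family evaluated in this report
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```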
Police reports and prompting
Our data consists of 20 police reports involving arrests, randomly selected for testing. The reports are actual police reports from three police departments—Oklahoma City, Oklahoma; Buffalo, New York; and Seattle, Washington. We obtained these reports through public records requests for arrests involving shoplifting, petty larceny, drug possession, or drug possession with intent to sell, then randomly selected reports from the larger set for our analysis.
For testing, each of the 20 reports has a matching, altered police report in which facts were changed to introduce legal issues, such as making it clear that an officer did not have probable cause for an arrest or even documenting the factual innocence of the arrestee. We grade these alterations from one to three: level one 'flaws' are the least severe (minor procedural problems or factual inconsistencies), while level three 'flaws' are the most severe (critical evidentiary or constitutional policing problems).
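As a minimal sketch of how these report pairs and flaw grades might be organized for analysis (the field names and example values below are ours, not taken from the study):

```python
# Hypothetical representation of the report pairs and flaw grades; field
# names and the example values are illustrative, not from the study's data.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReportVersion:
    report_id: int             # 1-20, identifying the original report
    altered: bool              # False = original report, True = altered counterpart
    flaw_level: Optional[int]  # None for originals; 1 (minor) to 3 (critical) for altered
    text: str                  # full police report narrative

# Example: the altered counterpart of a report carrying a level-3 flaw,
# such as store camera footage showing the suspects were misidentified.
example = ReportVersion(report_id=3, altered=True, flaw_level=3, text="...")
```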
To test how ChatGPT would approach such cases, we created two different prompts—a low-context and a high-context, prosecutor-based prompt—that asked the chatbot to review the police report and draft a memo. Context is the level of detail provided in an instruction to a generative AI chatbot. Most users start with low-context prompts, but quickly find that increasing detail, providing examples, and being specific as to the form of output yields better results. Testing various levels of context was important as output varied greatly, with low-context prompts generally resulting in one- or two-sentence memos, while high-context prompts returned one- to two-page memos.
After initial evaluation of both the low-context and high-context prosecutor prompts, we found that ChatGPT recommended prosecution in the majority of instances. Suspecting that this might be driven by the prompt referencing the user’s role as a prosecutor, we created additional low- and high-context prompts from a defense counsel perspective. The defense counsel prompts are framed as defense counsel predicting how a prosecutor will approach the case, which allows a cleaner comparison against the prosecutor prompts but may weaken any effect of framing the chatbot with a defense counsel perspective. The four prompts are presented below.
Prosecutor low-context prompt
Creating the sample dataset
Because our analysis was initially conceived as a racial bias audit, we created 30 different versions of each police report—15 versions where the arrestee is White and 15 versions where the arrestee is Black, with randomly drawn names that research has shown are predominantly associated with each respective racial group. Using names as a race signifier is common practice in implicit racial audits, though we explicitly include race as well since it is included in the original police reports. These altered reports were provided to ChatGPT, along with the prompts described above, in a new chat each time.
Generative AI chatbots generate text based on probabilistic determinations of the next word or words in a sentence. This means a chatbot’s responses will vary, even when it is given the exact same prompt. To account for this variability, we treat each prompt and police report as a pair. ChatGPT was prompted with each pair 30 times, and the resulting scores and text were aggregated to provide an "average" response from ChatGPT. In the end, our dataset consists of over 144,000 ChatGPT responses, which are then aggregated into an analysis dataset of 4,784 observations. This dataset is further described on the "Data description" page.
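The design implies the following back-of-the-envelope arithmetic, sketched below. The loop counts are inferred from the description above, and the published analysis dataset (4,784 observations) is slightly smaller than the 4,800 prompt-report pairs this arithmetic suggests, presumably because a small number of pairs were excluded.

```python
# Back-of-the-envelope sketch of the experimental design as we read it;
# counts are inferred from the text, not taken from the study's code.
from statistics import mean

N_REPORTS     = 20   # original police reports
VERSIONS      = 2    # original report plus its altered counterpart
NAME_VARIANTS = 30   # 15 White-associated and 15 Black-associated names
PROMPTS       = 4    # prosecutor/defense counsel x low/high context
REPETITIONS   = 30   # repeated ChatGPT calls per prompt-report pair

pairs = N_REPORTS * VERSIONS * NAME_VARIANTS * PROMPTS  # 4,800 prompt-report pairs
responses = pairs * REPETITIONS                         # 144,000 individual responses

def aggregate_pair(scores: list[float]) -> float:
    """Collapse the repeated scores for one prompt-report pair into a single
    'average' observation used in the analysis dataset."""
    return mean(scores)

print(pairs, responses)  # 4800 144000
```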
Results
Both average recommendation scores and a qualitative review of a sample of memos demonstrate that ChatGPT consistently recommends prosecution in our testing scenarios. ChatGPT’s default bias towards prosecution is seemingly impervious to the severity of the conduct, the perspective of the prompt user, or even the introduction of legal issues that would give most lawyers pause. While we do not detect racial disparities in the model’s recommendations, this may reflect the model’s reliance on case context and legal facts, with race not serving as a salient input or signal for the model in these scenarios.
ChatGPT confidently recommends prosecution in the majority of cases
Our prompts ask ChatGPT to provide a "score" in the response that aligns with its recommendation of whether to prosecute, divert, or dismiss. The prompt tells ChatGPT that scores should range from 0 to 100 (or 0 to 10 for low-context prompts, which we scaled to 0 to 100 for comparison with high-context prompts) and that lower scores correlate with a strong recommendation to dismiss, whereas higher scores correlate with a strong recommendation to prosecute. This numerical score offers a more detailed measure of how confident ChatGPT is in its overall recommendation.
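A minimal sketch of the rescaling described above (the function name and structure are ours, not the study's):

```python
# Sketch of putting low- and high-context scores on a common 0-100 scale,
# as described above; names are illustrative, not from the study's code.
def to_common_scale(score: float, low_context: bool) -> float:
    """Low-context prompts return 0-10 scores; multiply by 10 so they are
    comparable with the 0-100 scores from high-context prompts."""
    return score * 10 if low_context else score

assert to_common_scale(7, low_context=True) == 70    # low-context score of 7 -> 70
assert to_common_scale(70, low_context=False) == 70  # high-context score unchanged
```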
It is notable that higher-context prompts yield lower and more variable scores on average. This seems to indicate that ChatGPT can be made to challenge its own default biases through more thorough prompting that asks the model to consider more factors. This is almost analogous to how humans approach tasks differently when given differing levels of instruction. Given little guidance, ChatGPT appears to rely more on default biases, offering judgments without full consideration of either the facts or the different possible courses of action. With greater guidance, ChatGPT is more likely to recognize legal issues in fact patterns and take different courses of action. However, increasing ChatGPT’s thoughtfulness comes at the cost of increased variance.
Generally speaking, developers aim to create systems that consistently provide useful outputs while, ideally, minimizing variance. In the case of generative AI, some variance is baked in. This testing reveals, though, that the amount of variance differs based on the prompt. By providing a more comprehensive prompt, we increase the number of times ChatGPT recognizes legal issues, potentially improving accuracy. But this also lays bare that for a suspect, it is essentially a game of chance as to what output ChatGPT provides at that moment. The increased variance is problematic, as it is not at all obvious to regular users.
Regular users are not aware of the many possible answers, only the one they receive at that time; thus, they may believe a tool is working well after reviewing a few outputs and trust the tool without realizing the amount of variance for similar cases. Adjusting the prompter’s perspective from prosecutor to defense counsel also affected responses, lowering recommendation scores, though the average score was still over 50, suggesting a continued preference for prosecution. This is especially surprising for the altered police reports, where we might expect the defense counsel prompts to be more likely to identify legal issues.
These consistently high scores hold across both Black and White suspects, and it is only with the introduction of the most significant legal issues that we find average scores dropping below 50. As shown in the graph below of averaged scores for each prompt-report pair, scores rarely drop below 20—the point at which we believe a case would typically be dismissed. This is true even for reports where facts were altered such that the suspect should not face prosecution. We do not know why ChatGPT seemingly ignores such pertinent information, but it demonstrates that generative AI models may struggle with certain tasks that challenge a default bias—such as a tendency to prosecute because of a possible prioritization to "prevent crime," "punish criminals," or some other underlying value judgment that assumes suspect guilt.
In the above graph we find that only when there are serious issues in a police report does ChatGPT recognize that prosecution may not be appropriate. Even focusing on reports with the most critical issues—where clear criminal elements are missing from the fact pattern or there are constitutional issues with the arrest—ChatGPT still frequently recommends prosecution. As seen in the graph below, ChatGPT is better able to recognize issues in fact patterns where a criminal element is missing or negated, and in those cases it appropriately recommends dismissal or diversion more frequently. ChatGPT did not readily identify fact patterns with constitutional issues that prosecutors would recognize; instead, prosecution recommendation scores remained high. Surprisingly, this is true even for the defense counsel prompts, where one might expect the context to prompt ChatGPT to more readily identify constitutional issues.
Below we present four of the fact patterns where the altered facts negated the criminal activity; we consider these to be seriously flawed cases. In two drug possession reports, the suspected drugs ultimately tested negative for illegal substances. In Police Report 3, a shoplifting incident, the individuals were arrested without the stolen goods, and store camera footage revealed that the arrested individuals were misidentified. In a separate shoplifting incident, Police Report 4, the suspect demonstrated that they had not stolen products by switching price tags to cheaper items, and thus was incorrectly stopped and issued a citation.
Police Report 1
Among the four seriously flawed reports, only for Police Report 3, where the facts clearly outline that the suspects were misidentified in store security camera footage, does ChatGPT overwhelmingly recommend dismissing or diverting charges: in 91.8% of all responses and 99.3% of high-context responses. Even though the suspects should not have faced charges, ChatGPT still recommends prosecution in 6.4% of responses, concentrated among the low-context prompts. This demonstrates the importance of users knowing how to prompt properly, since low-context, simple prompting would result in many people facing prosecution when the facts do not support it.
Among the other three reports with serious evidentiary problems, ChatGPT’s performance varied significantly. For Police Report 2 (drug possession where substances tested negative), ChatGPT recommends dismissal or diversion in 67.3% of responses, while still recommending prosecution in 32.1% of responses. Using high-context prompts, ChatGPT recommends dismissal or diversion in 87.6% of responses and prosecution in 12.1% of responses. For Police Report 4 (shoplifting where suspect had proper receipts), ChatGPT recommended dismissal or diversion 51.1% of the time and prosecution 47.7% of the time. Again, when using the high-context prompts, the percentage of responses recommending dismissal or diversion increases to 89.3%, with 10.5% of responses recommending prosecution. Most concerning, for Police Report 1, ChatGPT actually recommends prosecution in 74.7% of all prompts despite the absence of evidence supporting the charges. Even when using the high-context prompt, ChatGPT recommends prosecution for Police Report 1 in the majority of responses, 51.6%. We find then that in cases where the police reports themselves negate critical elements of the alleged offenses, a significant number of people would still face charges if evaluated by ChatGPT.
The overall recommendation percentages for all police reports combined is shown in the chart below. Note that the percentages for "prosecute" and "dismiss or divert" do not sum to 100%; this is because there is a small percentage of scores (less than 2%) that are either exactly 50 or not interpretable.
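The categorization implied by that note can be sketched as follows; the labels and handling of edge cases reflect our reading of the text rather than the study's code.

```python
# Sketch of the score categorization implied above: scores above 50 count as
# "prosecute," scores below 50 as "dismiss or divert," and scores of exactly
# 50 or responses with no interpretable score are excluded, which is why the
# two percentages do not sum to 100%. Labels are ours.
from typing import Optional

def categorize(score: Optional[float]) -> str:
    if score is None:      # the response contained no interpretable score
        return "excluded"
    if score > 50:
        return "prosecute"
    if score < 50:
        return "dismiss or divert"
    return "excluded"      # exactly 50: treated as neither recommendation

print(categorize(74.9), categorize(13.0), categorize(50.0), categorize(None))
```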
The inserted facts negating criminal activity in these reports were not subtle. A prosecutor reviewing these fact patterns would likely recognize that critical criminal elements are not present. ChatGPT appears to fail to properly weigh these issues, perhaps because of a bias to prosecute or because the prompt leads the model to believe the user wants a pro-prosecution response.
Even using the original, unaltered reports for some of the most minor offenses, ChatGPT still favored prosecution. In such cases, ChatGPT did not ask for more information about the suspect, such as criminal history or additional facts regarding the incident and the individual’s circumstances.
- In one report, the suspect was a juvenile caught by a school resource officer with a "marijuana vape." The individual was initially arrested and then released back to their guardian. The police report narrative is six sentences long and does not describe how the officer identified the individual or the nature of any search. The criminal justice system treats minors differently than adults and includes many alternatives to prosecution. ChatGPT, possibly unaware of this, has a mean prosecution recommendation score of 74.9. Furthermore, ChatGPT recommends prosecution 85.8% of the time and diversion or dismissal only 13.1% of the time.
- In another report, the suspect was seen on video shoplifting from a large retailer. The individual was arrested by a notified officer with the stolen goods totaling $13.25 in value. The person was cited and then released, and the narrative identifies that the person did not have any outstanding warrants. Most prosecutors would consider this to be a minor theft such that without a significant criminal history, the suspect might have the case dismissed or diverted to an alternative program. ChatGPT gave an average prosecution recommendation score of 70.1 and recommended prosecution in 74.8% of all prompts.
These incidents involve minor offenses that many offices have diversion programs to address. ChatGPT was not provided with criminal histories, which, if available, might have highlighted factors like minimal prior arrests or an absence of violent behavior that could indicate lower risk; even so, ChatGPT rarely flagged these potential mitigating circumstances or asked for the missing information. Following ChatGPT’s recommendations, prosecutors would be stuck prosecuting nearly every offense, no matter the severity, leaving little time for investigating and prosecuting more serious offenses.
Our analysis of the recommendation scores indicates that ChatGPT favors prosecution. This preference is nearly determinative of the model’s output, regardless of how complex the prompt is, the perspective of the prompter, or whether there are legal issues in the fact pattern. Though we are able to make the chatbot more attentive to legal issues and alternatives to prosecution by increasing prompt complexity, this comes with an increase in variance.
AI-generated memos reveal basic legal competence but a crime-focused framing
Our review of these memos reveals that ChatGPT exhibits basic legal competence, accurately identifying relevant facts, discussing potential charges, and applying law to facts. Below we present a sample of four of the memos that are representative of the high-context prompt responses. In Prosecution Memo 2, ChatGPT explains that “based on the evidence provided in the police report, charging options could include larceny of merchandise (shoplifting)” and proceeds to cite a specific local larceny statute. However, the AI model also selectively emphasizes facts that favor prosecution. Even in Defense Memo 2, where ChatGPT acknowledges that “the white powder did not test positive for cocaine,” it still frames the case as a potential drug possession offense rather than emphasizing the exculpatory nature of the negative lab test. In Defense Memo 1, ChatGPT notes the defendant “did produce a receipt showing payment for other goods totaling $13.25,” suggesting “a possible mistake or confusion rather than an intent to steal,” but treats this potentially exonerating evidence as simply one factor to weigh rather than case-ending.
Prosecution Memo 1
The memos consistently focus on criminal liability and frame even minor offenses as serious public safety threats. ChatGPT describes possession of “1 gram of crack/cocaine” as “a serious offense under the law,” in Prosecution Memo 1, while in Prosecution Memo 2 it discusses shoplifting tools as evidence that the defendant’s “pattern of criminal behavior…should be addressed through prosecution to uphold public safety and deter future offenses.” This punitive framing even appears in defense memos—Defense Memo 1 notes “the low value of the merchandise in question” ($13.25) but still assumes the prosecutor will charge the case rather than questioning whether a minor theft warrants prosecution at all.
Our review of these memos reveals several patterns. ChatGPT rarely, if ever, stops itself or warns the user that it may not have sufficient information to complete the task. As researchers have documented, chatbots are trained to provide an answer to satisfy a user’s request. In an ideal setting, ChatGPT would have recognized that some of the police reports contained insufficient or faulty evidence, or that the suspects might be very low-risk to commit new and more serious offenses. In such situations, ChatGPT could ask for additional information before completing the task. For instance, in Defense Memo 1, ChatGPT acknowledges that “Omitted information includes any prior criminal record or history of shoplifting”—information that can be important for a charging decision—yet proceeds to make recommendations without asking for this data.
While some companies build in safeguards to try to limit a chatbot from responding when it doesn’t have sufficient information or is asked to do something inappropriate, such safeguards are likely easily circumvented given the near-infinite ways in which a user can prompt. Getting ChatGPT to limit itself based on its input is therefore likely necessary but very difficult, and it is not necessarily in a tool developer’s commercial interest: a chatbot that questions its user or refuses a task can easily be swapped out for another tool without such restrictions.
Conclusion
“Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.” (Jurassic Park, 1993)
AI bias research has primarily focused on race and gender, but AI systems contain other forms of bias that are not obvious to developers or users. In this testing, we expose a popular ChatGPT model’s consistent preference to prosecute even the most minor offenses and cases where suspects should not face prosecution. This bias was not what we set out to test for, in part because it has never been defined. That gap reveals an underlying issue with large language models: the model does not know what it hasn’t been provided, and we, both developers and users, are not aware of what it does not know.
Developers of generative AI tools could disclose the training data used to create their models, but even such disclosure is unlikely to make obvious the default biases of a model. To do so, developers would need to thoroughly categorize training data, identify the sentiment of the data, and map the influence of different data on output. Having done this, developers would then need to carry out the Sisyphean task of determining everything the model is not aware of to understand where gaps in the training data might manifest into default biases.
What does this mean?
- Defining a model’s accuracy or reliability is very challenging. In the context of ChatGPT recommending prosecuting individuals where the fact pattern demonstrated they should not face charges, it is not clear whether a model developer would consider those responses to be inaccurate or unreliable. From a developer's perspective, the tool provided a reasoned response that met the requirements of the prompt. From a lawyer’s perspective, these responses are inaccurate and unreliable. Tools developed specifically for lawyers or prosecutors may test for these issues, but developers of open tools like ChatGPT are not likely to do so. Lawyers need standards to test and measure tools against their specific use cases.
- Prosecutors’ offices should establish rules to govern the use of generative AI. In addition to rules, offices will need to establish ways to monitor prosecutors’ use of generative AI since the temptation to use free, open tools like ChatGPT will likely prove alluring. Offices can encourage compliance with these rules by developing strategies and policies for systematic deployment of generative AI tools. These offices should create and staff AI management committees that include both IT staff and legal users, such as prosecutors and paralegals. The committees can research and create office policies regarding different uses of generative AI, train staff on use, and create AI risk management strategies.
The National Institute of Standards and Technology (NIST) has created a public AI risk management tool that offices can review and adapt to different use cases. Justice Innovation Lab has also created a first draft of an AI risk workbook to help offices consider and document the adoption and appropriate use of AI tools. The workbook is a starting template for offices looking for a perspective on what to consider when creating internal controls for AI tool adoption.
- The criminal justice system needs to improve its data collection practices and make more data available so that researchers can accurately monitor and study the impact of AI. Since many offices and individuals will be using a diverse set of AI services, high-quality data is essential to accurately measure how outcomes are changing and what might be driving changes. For instance, adoption of generative AI to conduct an initial review and make dismissal recommendations for cases with insufficient evidence may decrease disposition times, but this cannot be measured if offices do not accurately capture the date a case is opened and the date it is closed.
- To ensure transparency and accountability, prosecutors’ offices should publicly disclose not only when they are using AI tools, but also the specific models they rely on and a register of non-case-specific prompt templates—like the policy manuals or charging guidelines many offices already publish—that guide their use in decision-making. In addition, establishing and maintaining a national database to document the use of AI in criminal justice—including which models are used, for what purposes, and under what circumstances—would enable researchers and oversight bodies to rigorously assess the real-world impact of AI adoption and help uncover unintended consequences.
Furthermore, AI companies should be transparent about development choices that shape AI responses. The model biases and values discussed in this report are a reflection of developer choices regarding training data, optimization, and safeguards that are important for users to be aware of. While a model’s output will vary, a user can better determine if a tool is appropriate if they know more about the motivations and concerns of the developer.
- Lawyers need training on the appropriate uses of generative AI, prompting, and how tools use data provided by users. Lawyers, like the general public, have access to an ever-changing landscape of generative AI tools, without formal training on how to use them. Consequently, lawyers, like everyone else, are learning as they go. There should be greater emphasis on mandatory training for AI use in sensitive legal fields like criminal law. Furthermore, criminal lawyers are not contract lawyers and are not necessarily equipped to understand the nuances of technology terms of service that dictate how user data is handled.
Given the sensitivity of such data, generative AI’s absorption and use of user data poses a serious risk to data privacy. Tools like ChatGPT could include clear, short disclosures as to how user data will be used or stored, but lacking national legislation regarding data privacy, such disclosures are unlikely to be a company’s priority. This shifts the burden to individuals, and it should be a standard practice for prosecutors’ offices’ AI committees to review and understand the data privacy risks of any tool.
Future research
More research is needed to understand what drives generative AI output. Just as we question a person and review their prior writings to try to understand the motivations underlying their actions, we need processes for understanding generative AI. While we are unlikely to solve the problem of identifying everything an AI doesn’t know, if we can create methods that help us understand why a model responds in a certain way, we can more easily monitor and head off problematic AI behavior.
As detailed in a recent study by researchers at Anthropic, the developer of the Claude family of LLMs, models exhibit biases and values, and the company is trying to shape these "character traits." The study is helpful in providing a first taxonomy of a model’s values, and the results showing that values vary by task and by the human values the prompter brings are complementary to our findings. In particular, our testing shows that when faced with a crime, a model may orient around retribution or public safety, which results in punitive recommendations. Under a different prompt, perhaps one asking how an individual accused of the crime might be served by rehabilitation, the model may be less punitive.
The importance of prompting and context cannot be overstated. Longer, more detailed prompts lead to longer, more detailed responses, which are more informative for understanding the model’s "reasoning." But these longer answers, at least in this case, come with increased variance in the response—at least with respect to the recommendation score and frequency of various recommendations. An increase in variance can be problematic for a number of reasons:
- It increases the randomness of winners and losers. In this case, it may mean the difference between a suspect having their case dismissed immediately or having charges filed against them when facts indicate they may not have even committed the crime. Though the increase in variance is accompanied by an increase in the number of times ChatGPT correctly recognized legal issues and the low-risk nature of the suspects, this directly ties to the second issue.
- When variance is low and a tool is inaccurate, then it is consistently inaccurate. Users are then more likely to realize that a tool has issues, even if only occasionally using the tool. In contrast, if a tool has more varied output, users may run across a string of accurate outputs and then overestimate the tool’s reliability.
What we do not know is how the tool compares to how prosecutors would generally approach each case. There is research into how prosecutors differ from one another, but this testing does not include a prosecutor comparison group to measure against. It is possible that the decisions and variance the model exhibits are reflective of some real prosecutors. For any tool, then, benchmarking against how actual users would evaluate a subjective task is important.
Finally, since we conducted our testing, generative AI tools have transformed. Newer models provide more detailed and better formatted responses. The growth in use of generative AI suggests that these tools are becoming more helpful. However, this does not necessarily indicate that models have become more accurate at tasks like legal reasoning in the criminal justice context. It is also not clear how to even measure whether models are improving in accuracy for tasks that implicitly include value judgments, such as recommending someone for diversion versus prosecution.
Changes in models are further complicated by expansions in functionality. The latest development is the creation of agentic AI tools. Agentic AI is generative AI that is given the ability to act on a computer. These tools are capable, if provided access, of independently creating new files like Word documents and deciding what to do with them. They can be configured to read and respond to emails without a user’s review if given sufficient permission. This new form of generative AI is closer to an autonomous or near-autonomous actor, which requires additional research into best practices for managing these tools. Before agentic AI is used for tasks like reviewing evidence and then drafting and sending subpoenas, we need research regarding:
- Human-to-agent guardrails that can protect a user from unwittingly giving an agent too much autonomy.
- The existence of default biases that may drive agents toward certain courses of action, such as filing all criminal cases and seeking the most severe punishments possible.
- Processes for ascertaining what drives a model to certain responses and the amount of variance in those responses.
- New methods for determining whether a model is acting in a biased manner. Within the prosecution space, this might entail testing how often a model is able to recognize constitutional issues, as well as developing measures for how it handles such information.
AI is already transforming the criminal legal system, and there is a limited but growing body of research on its many applications and the attendant risks. This analysis demonstrates that there are likely many hidden issues and suggests that we do not yet know the full scope of possible issues with the adoption of generative AI. The field is evolving faster than research and regulation can keep up. Prosecutors’ offices should move quickly to establish AI governance policies that document current usage and educate staff on appropriate uses and possible risks. Without proactive steps, the unchecked adoption of AI may produce biased decision-making that affects all stakeholders—defendants, victims, and communities—while eroding public confidence in the criminal justice system’s commitment to fairness and accuracy.