How often, and why, do judges in Title VII harassment cases erroneously conclude that no reasonable jury could find the harassment the plaintiff suffered “severe or pervasive” enough for Title VII liability? These questions are not easy to answer. No one can directly observe the counterfactual, i.e., how a jury would have ruled had the case gone to trial. And if deciding what a “reasonable” jury might do requires inferring what most juries, or a jury under ideal conditions, would do, then judges could still be good forecasters even if any particular jury would have gone the other way.
Enter Tippett and Williams with a study that provides serious leverage for answering these questions. They first sampled Title VII harassment opinions in Westlaw from 1995 to 2019 (n = 81, mostly summary judgment motions) in which the court decided whether there was “an issue of fact on whether the conduct qualified as ‘severe or pervasive’” enough for a Title VII violation. In 53 of the 81 cases (65%), the court found that no reasonable jury could find the conduct severe or pervasive enough. (P. 19.)
Next, Tippett and Williams deployed an online Qualtrics survey in 2019, 2020, and 2022, recruiting respondents from Amazon’s Mechanical Turk platform (n = 699). The survey randomly assigned each MTurk respondent an excerpt from the fact section of one court opinion in their case sample. The excerpt (500 words or fewer) described only the allegations and evidence relevant to the alleged harassment, not how the court weighed them. All respondents also received a “jury instruction” on a Title VII harassment claim (adjusted for whether their case concerned harassment due to race, sex, or both). That instruction stated, among other things, that the conduct had to be “sufficiently severe or pervasive that a reasonable person in the plaintiff’s position would find the plaintiff’s work environment to be hostile or abusive.” The survey then asked each respondent to rate how severe or pervasive the conduct was (from 0 (“not at all”) to 100 (“extremely”)); to indicate whether the conduct was severe or pervasive enough to satisfy the jury instruction (“Yes” or “No”); and to “[p]lease explain why” via an open text box. (Pp. 21-22, 33.) In this way, Tippett and Williams got multiple MTurk respondent reactions to the same case fact pattern.
With their survey results in hand, Tippett and Williams compared what courts had done with how individual survey respondents reacted to the same fact patterns. Figures 1, 2, and 3 depict the distribution of MTurk respondents’ numerical ratings, stratified by whether the rated fact pattern came from a case in which the court concluded that a reasonable jury could, or could not, find the harassing conduct severe or pervasive enough for Title VII liability. (The authors provided me with the underlying data; a brief sketch of how such plots can be generated follows the figures.) Figure 1 also indicates, by point color, each respondent’s decision as to whether their randomly assigned fact pattern was severe or pervasive enough for Title VII liability.

Figure 1: Boxplot of Survey Respondent Ratings.

Figure 2: Histogram of Survey Respondent Ratings.

Figure 3: Density Plot of Survey Respondent Ratings.
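For readers who want to reproduce plots like these from respondent-level data, here is a minimal sketch in Python. The file name and the column names (rating, court_outcome, respondent_verdict) are hypothetical placeholders of my own, not the authors’ actual variable names, and the real dataset may be structured differently.

```python
# Minimal sketch of how Figures 1-3 might be generated from respondent-level
# survey data. The file name and column names are hypothetical placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("tippett_williams_survey.csv")  # hypothetical file name

# Figure 1: boxplot of ratings by court outcome, with individual respondent
# ratings overlaid and colored by each respondent's own yes/no verdict.
fig, ax = plt.subplots()
sns.boxplot(data=df, x="court_outcome", y="rating",
            showfliers=False, color="white", ax=ax)
sns.stripplot(data=df, x="court_outcome", y="rating",
              hue="respondent_verdict", jitter=True, ax=ax)
ax.set_ylabel("severe-or-pervasive rating (0-100)")

# Figure 2: histograms of ratings, one panel per court outcome.
sns.displot(data=df, x="rating", col="court_outcome", bins=20)

# Figure 3: kernel density estimates of ratings, overlaid by court outcome.
sns.displot(data=df, x="rating", hue="court_outcome", kind="kde",
            common_norm=False)
plt.show()
```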
However you visualize it, the key takeaway is the same: most survey respondents assigned high “severe or pervasive” ratings to fact descriptions from the “reasonable jury could” cases, but many respondents also assigned high ratings in the cases where the court had concluded that a reasonable jury could not find the conduct severe or pervasive enough. And in those cases, when asked simply whether their randomly assigned fact pattern was severe or pervasive enough, over sixty percent of respondents said yes. If courts were good predictors of what a reasonable jury would do in those cases, we should expect far less spread in the ratings – perhaps something like a mirror image of the spread of ratings in the “reasonable jury could” cases (the green left-side plots in Figure 2 and Figure 3). In fact, the ratings in the “reasonable jury could not” cases (the red right-side plots in Figure 2 and Figure 3) exhibit a great deal of spread. Assuming that the survey respondents’ ratings and judgments are, in the aggregate, a valid proxy for what a “reasonable jury” would do, Tippett and Williams infer that judges are “far too aggressive in dismissing cases on the basis of the ‘severe or pervasive’ element” of the Title VII harassment claim. (P. 26.)
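The quantitative comparison in the preceding paragraph reduces to a simple group-wise summary. A sketch, again under the hypothetical file and column names assumed above:

```python
# Sketch of the aggregate comparison behind the takeaway, using the same
# hypothetical file and column names as in the plotting sketch above.
import pandas as pd

df = pd.read_csv("tippett_williams_survey.csv")  # hypothetical file name

# Spread of the 0-100 ratings within each set of cases.
print(df.groupby("court_outcome")["rating"].describe())

# Share of respondents answering "Yes" on fact patterns from cases where the
# court held that no reasonable jury could find the conduct severe or
# pervasive enough (the "over sixty percent" figure in the text).
could_not = df[df["court_outcome"] == "reasonable jury could not"]
yes_share = (could_not["respondent_verdict"] == "Yes").mean()
print(f"Share answering Yes in 'could not' cases: {yes_share:.0%}")
```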
If so, why? To advance the ball here, Tippett and Williams compared the court opinions in their case sample with what the survey respondents wrote in the survey’s open text box (median length: 25 words) to explain why they believed the conduct they rated was or was not severe or pervasive enough to satisfy the Title VII jury instruction. In general, they found that judges assigned far less weight than the survey respondents to certain kinds of evidence bearing on the “severe or pervasive” issue, including whether other employees suffered the same harassment, whether the harasser continued the harassment even after a company warning, and whether the company was complicit in failing to stop the harassment. (P. 58.) This result, they conclude, is consistent with what Zimmer (2000) called “slicing and dicing”[1] – the theory that, in Title VII cases, judges tend to assess the probative value of each item of evidence (usually offered to prove discriminatory motive) in isolation, whereas laypeople tend to weigh the evidence as a whole (as the law requires). On this account, judges slice and dice even though, on summary judgment, they are supposed to draw all available inferences from the evidence in favor of the non-moving party.
As usual, answers beget more questions. If slicing and dicing accurately describes how judges reason (and not just how they write opinions strategically), what is it about Title VII harassment cases that leads judges to think this way? How are those causes related to other factors, including judge and juror demographics, political attitudes, or whatever might lead judges to effectively conflate what a “reasonable” jury could find with what they think a real jury should find? And would this study’s inferences hold if it were run with actual mock juries, i.e., laypeople who could discuss the facts with each other before deciding? No one study is an island, and this study’s design – like any study’s – carries limits on what we can validly infer from it. Still, if you care about Title VII harassment cases (or employment discrimination litigation generally), Tippett and Williams’ paper deserves your time and attention.
Editor’s note: Reviewers choose what to review without input from Section Editors. Jotwell Worklaw Section Editor Elizabeth C. Tippett had no role in the editing of this article.
[1] Michael J. Zimmer, Slicing & Dicing of Individual Disparate Treatment Law, 61 La. L. Rev. 577 (2000), https://lawecommons.luc.edu/facpubs/290/.