Every year, teachers in the U.S. spend over 400 million hours grading student work. As this workload grows, many educators are understandably turning to AI solutions like ChatGPT and EnlightenAI to streamline the process. However, given the significant impact that grading and feedback have on student learning, it's crucial to choose a tool you can trust. Recent research has shown that ChatGPT, especially the earlier, cheaper models that many AI grading tools are built on, is not yet reliable enough for grading. At EnlightenAI, we understand these concerns, which is why we set out to rigorously test the accuracy and reliability of our AI grading assistant against both seasoned human graders and other AI technologies. We'll be releasing a white paper in the next few weeks sharing more detail on the findings we preview here.

 

The Study Setup

In our study, we selected 437 student work samples that had previously been evaluated by educators at DREAM Charter Schools in New York. To explore what grading might have looked like if DREAM had used EnlightenAI from the start, we simulated that scenario: we input the context of each assignment into EnlightenAI, graded five papers to calibrate the tool, and then generated scores and feedback for the remaining submissions. The whole grading process took less than an hour and used the exact same technology we offer to our users for free.

 

The Result: EnlightenAI met or exceeded the accuracy benchmarks for well-trained human scorers 

Yes, you read that right. Researchers have taken teams of human scorers, put them through a 3-hour calibration training on scoring essays with a holistic rubric, and then measured their consistency with one another. EnlightenAI met or beat these human benchmarks while exceeding ChatGPT's performance by a wide margin. For the first time ever, teachers and school leaders have access to a personalized scoring and feedback tool that competes with well-calibrated human graders, as well as with state-of-the-art automated essay scoring tools.

 

How often did EnlightenAI give the exact same score as DREAM graders?

How effective was EnlightenAI at producing a 'perfect match' with its human scorers? In assessments using a 5-point New York State constructed response rubric (scoring range 0 to 4), EnlightenAI matched the exact score assigned by DREAM educators in 53% of cases. By comparison, in studies using a 6-point rubric (scoring range 1 to 6), well-trained human graders, after receiving 3 hours of training and undergoing continuous monitoring, agreed on the exact same score 51% of the time. ChatGPT performed worse than both, matching human scorers between 20% and 42% of the time on the same 6-point rubric. You can view the distribution of errors by EnlightenAI below. An error of 0 equates to a perfect match, while errors of +1 or -1 signal that EnlightenAI missed the mark by one point in either direction.

Figure 1. Distribution of Score Differences (EnlightenAI - DREAM)

The figure above shows the results of comparing EnlightenAI's scores with those actually given by DREAM graders for each work sample. For example, if EnlightenAI assigned a score of 3 on the holistic rubric and DREAM also gave the submission a 3, this counts as a perfect match. If, instead, EnlightenAI assigned a score of 4 while DREAM assigned a score of 3, this would count as a difference of +1.
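For the technically curious, here is a minimal sketch of how a difference distribution like the one in Figure 1 can be computed. The score pairs below are made up for illustration; in the study, each pair would be one of the 437 work samples.

```python
from collections import Counter

# Hypothetical (EnlightenAI, DREAM) score pairs, one per work sample.
score_pairs = [(3, 3), (4, 3), (2, 2), (1, 2), (3, 3), (0, 0), (4, 4), (2, 3)]

# Signed difference per paper: 0 is a perfect match, +1 means EnlightenAI
# scored one point above DREAM, -1 one point below, and so on.
differences = Counter(ai - dream for ai, dream in score_pairs)
for diff in sorted(differences):
    print(f"{diff:+d}: {differences[diff] / len(score_pairs):.0%}")
```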

 

How often did EnlightenAI assign scores within 1 point of DREAM graders?

Grading is an imperfect science, and on many rubrics score differences of a single point are largely subjective. A highly reliable and consistent grading system not only produces exact matches with calibrated human scorers, but also minimizes the size of the error when it fails to produce an exact match. On this measure, EnlightenAI really shines. EnlightenAI assigned scores within one point of the human scores in 98% of assessments, and across the full sample of 437 papers it never missed by more than 2 points. Notably, this is an area where human scoring consistency lags behind: well-trained human graders assign scores within 1 point of each other just 74% of the time. Interestingly, ChatGPT outperforms humans in this regard, assigning scores within one point of human scorers 76-89% of the time, though its performance varies widely depending on the sample of student work and the task assigned. On these measures, EnlightenAI outperformed both humans and ChatGPT.

Figure 2. Performance Metrics of EnlightenAI Grading System Based on DREAM Dataset

Accuracy Statistic          Percentage/Value
Score miss by >1 point      1.83%
Score within 1 point        98.17%
Cohen's Kappa               0.33
Quadratic Weighted Kappa    0.67
Pearson Correlation         0.69
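If you'd like to reproduce metrics like these on your own grading data, all of them are available in standard Python libraries. Here's a minimal sketch using hypothetical scores on the 0-4 rubric (in the study, each array would hold the 437 real scores):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores on the 0-4 rubric; in the study, these arrays would
# hold the scores given by DREAM educators and by EnlightenAI.
dream = np.array([3, 2, 4, 3, 1, 0, 3, 2, 3, 4])
ai = np.array([3, 2, 3, 3, 2, 0, 3, 2, 4, 4])

diff = np.abs(ai - dream)
print("Exact match:         ", np.mean(diff == 0))
print("Within 1 point:      ", np.mean(diff <= 1))
print("Missed by >1 point:  ", np.mean(diff > 1))
print("Cohen's kappa:       ", cohen_kappa_score(dream, ai))
print("Quadratic wtd kappa: ", cohen_kappa_score(dream, ai, weights="quadratic"))
print("Pearson correlation: ", pearsonr(dream, ai)[0])
```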

 

How can we compare EnlightenAI directly to other benchmarks for grading accuracy? 

You might have noticed that EnlightenAI’s collaborative study with DREAM was done using a 5-point scoring rubric, not a 6-point scale like the one used to benchmark human scorers and ChatGPT. We corrected for this difference by calculating two metrics that enable a direct, apples-to-apples comparison. 

  • Cohen's Kappa: This statistic adjusts for the rubric scale to measure true agreement beyond chance, where 1 indicates perfect agreement and 0 indicates agreement no better than chance. Put more simply, Cohen's Kappa tells you how good your scoring system really is at producing an exact match to a certain standard after adjusting for random chance and the scale of your rubric. EnlightenAI scored a Kappa of 0.33, extremely close to the 0.36 "fair agreement" benchmark set by well-trained humans. ChatGPT achieves a Cohen's Kappa of between 0.01 and 0.22 depending on the sample, lagging humans and EnlightenAI by a significant margin. All in all, even two well-calibrated human graders are not great at producing exactly the same score when presented with the same piece of work, but EnlightenAI is competitive with even the best human scoring.

  • Quadratic Weighted Kappa: This goes further, assessing not just whether scores match, but how severe discrepancies are when they don't, again with scores closer to 1 signaling stronger consistency. In other words, scores closer to 1 signal that not only is your grading system good at producing exact matches with the established standard, but when it does make errors, they're not huge (see the sketch below for how this weighting works). EnlightenAI's score here was 0.674, showing that even when it doesn't hit the mark exactly, it's not far off. To put this in perspective, state-of-the-art automated essay scoring (AES) systems, which are narrowly trained on hundreds of samples of work for highly specific tasks, typically score between 0.57 and 0.8. AES systems are the best in the world at producing reliable scores, but they don't generate feedback and are notoriously narrow, since they are built for highly specific tasks. By contrast, EnlightenAI is made to be flexible and adapt to just about any written task. When viewing summary statistics of EnlightenAI's predicted scores compared to those that DREAM educators actually gave, it's easy to see why EnlightenAI rated so well across different measures of reliability.
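To make the weighting concrete, here is a from-scratch sketch of quadratic weighted kappa. This is the standard textbook formulation, not EnlightenAI's internal evaluation code, and the example scores are made up. Notice how a 2-point miss is penalized four times as heavily as a 1-point miss:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score=0, max_score=4):
    """QWK: 1 = perfect agreement, 0 = agreement no better than chance."""
    n = max_score - min_score + 1
    # Observed matrix: entry [i, j] counts papers scored i by rater A
    # and j by rater B (shifted so scores start at 0).
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score, b - min_score] += 1
    # Expected matrix under chance, built from each rater's marginal totals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic penalty: a 2-point miss costs four times a 1-point miss.
    i, j = np.indices((n, n))
    weights = (i - j) ** 2 / (n - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Toy example: one 1-point miss and one 2-point... no, two 1-point misses.
print(quadratic_weighted_kappa([3, 2, 4, 0, 3], [3, 2, 3, 0, 4]))
```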

Figure 3. Comparative Statistical Analysis of Score Distributions Between DREAM and EnlightenAI

Statistic            DREAM    EnlightenAI
Mean                 2.48     2.47
Median               3        3
Standard Deviation   0.80     0.98
Minimum              0        0
25th Percentile      2        2
50th Percentile      3        3
75th Percentile      3        3
Maximum              4        4
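Summary statistics like those in Figure 3 take one line to produce with pandas. A quick sketch, again with made-up scores:

```python
import pandas as pd

# Hypothetical scores; in the study, each column would hold 437 values.
scores = pd.DataFrame({
    "DREAM": [3, 2, 4, 3, 1, 0, 3, 2],
    "EnlightenAI": [3, 2, 3, 3, 2, 0, 4, 2],
})
# describe() reports the Figure 3 statistics: mean, standard deviation,
# minimum, the 25th/50th/75th percentiles (the 50th is the median), and maximum.
print(scores.describe())
```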

 

What This All Means: The Real Impact of EnlightenAI on K-12 Schools

A team of graders at your fingertips

Every teacher and school now has a well-trained team of subject matter experts who just went through a 3-hour calibration training for the purposes of grading. Well, almost. When it comes to grading, we've demonstrated that EnlightenAI can grade 437 essays as well as or better than such a well-trained team of humans, and we did it in less than an hour. This doesn't mean that the entire grading process for 437 papers should take only an hour, but it does mean that using EnlightenAI will be a massive time saver. As humans, we read about six times faster than we type, so shifting the work of grading from typing to reading cuts the time required to roughly one-sixth, a savings of about 80%: tens of millions of hours of teacher time each year, all while boosting the quality of feedback teachers can deliver to students. This efficiency is invaluable, potentially unlocking more frequent student writing and revision and reducing teacher burnout. Yes, this is a big deal.

Teachers should demand more of their AI grading services 

For teachers and school systems considering AI assistance for grading, it's crucial to investigate the reliability and accuracy measures of the tool you choose. If an AI grading tool lacks transparent performance metrics, consider it a red flag. While it's relatively straightforward to dress up a tool like ChatGPT for essay grading, research has shown that ChatGPT often falls short of human grading standards in both feedback quality and numerical accuracy. Demand transparency from your AI tools: ensure they are not only effective but also reliable and consistent in their grading.

EnlightenAI met an unrealistically high bar, and this is just the beginning 

Remember, the benchmarks we're comparing EnlightenAI to come from a team of subject-matter experts who underwent explicit training on applying a holistic rubric. This is, to put it mildly, atypical in K-12 schools. If we were to compare EnlightenAI's scoring accuracy to the grading consistency of teachers across the hall from one another, or even of a given teacher grading their 5th versus their 50th paper on a given assignment, we suspect EnlightenAI's performance would look even more impressive. This is not a knock on teachers, by the way: grading is subjective, complex, and exhausting. It's remarkable we've done it the way we have for as long as we have.

Looking to the future, we've already identified areas for enhancement that we expect will further elevate the accuracy, reliability, and consistency of EnlightenAI; by the time you next use EnlightenAI, it will likely be even more reliable. We're also exploring innovative ways to leverage this accuracy in service of empowering teachers. Imagine setting up an AI assistant that not only delivers immediate feedback but does so in a manner that is accurate and aligned with your instructional voice. Not an alien AI, but an AI TA that you can count on. Or consider the potential for using reliable data to drive targeted reteaching, addressing true learning gaps effectively. Stay tuned; more is on the way.