Researchers Examined the Quality of Writing Produced by ChatGPT and Human Writers, and Here's Their Verdict

In a series of recent studies, AI writing assessment tools such as ChatGPT have been compared with human teachers. The results indicate that these AI models can provide written feedback and scoring that closely approximate those of human evaluators, especially when the feedback is carefully structured through prompt engineering.

One study compared the effectiveness of AI feedback and human teacher feedback on English as a Foreign Language (EFL) argumentative writing. Remarkably, both feedback sources produced statistically significant improvements in students’ writing scores, with no significant difference in effectiveness between the two [1]. This suggests that AI language models like ChatGPT offer a pedagogically meaningful and scalable alternative to traditional teacher feedback, one that is particularly beneficial in settings with limited teaching resources.

Key findings from this research include:

Both AI and human feedback significantly improved student writing scores, and the effect-size difference between the two groups was small enough to be negligible.
Student proficiency level influenced outcomes, with more advanced students performing better overall, but less proficient students showing high responsiveness to any structured feedback.
The study measured only immediate revision gains and relied on a single rater for scoring, though inter-rater reliability checks provided reasonable confidence in the scores.
The study emphasized using AI as a complement to human instructors, not a replacement, and highlighted the importance of guided, ethical frameworks for AI integration in education [1].
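To make the "negligible effect size" finding concrete, here is a minimal sketch of how a standardized mean difference (Cohen's d) between two feedback groups could be computed. The article does not say which effect-size measure the study used, so Cohen's d is an assumption, and the score gains below are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical score gains (post-revision minus pre-revision).
# These numbers are NOT data from the study.
ai_feedback_gains = [2, 3, 1, 4, 2, 3]
teacher_feedback_gains = [3, 2, 2, 4, 3, 2]

d = cohens_d(ai_feedback_gains, teacher_feedback_gains)
print(f"Cohen's d (AI minus teacher): {d:.2f}")
```

By Cohen's widely used convention, an absolute value of d around 0.2 counts as small, so a value below that is the kind of difference a report might describe as negligible.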

Additional insight from educational practice suggests that having students interact with AI-generated drafts by revising or coaching the AI to improve its writing can hone critical feedback and revision skills. This interaction reveals AI’s limitations and helps students develop metacognitive awareness during writing tasks [3].

However, some experts point out challenges in fully trusting AI in educational settings, including concerns about intellectual engagement, originality, and the relational nature of writing as thinking [4]. Thus, while AI feedback shows promise in scoring consistency and written feedback quality, it should be integrated thoughtfully alongside human judgment.

The studies involved 200 source-based argument essays in history from students in grades 6-12. It is important to note that the current research used specifically designed prompts that were tested and vetted by experts in using technology for writing instruction, which may not represent typical classroom prompts.

The findings suggest a potential role for AI in writing assessment, perhaps as a tool students can use to improve their work before submission and as a time-saver for teachers. However, they also highlight the need for more training and a greater emphasis on digital literacy for both students and teachers, along with careful implementation that addresses unsanctioned use and ethical concerns.

In the study where only a score was given, AI slightly outperformed humans. In another study, student papers were given a numeric score by both teachers and several versions of ChatGPT. In the study comparing the consistency of ChatGPT's paper scores with that of human scorers, ChatGPT performed better, though grading remained somewhat inconsistent across generations of the ChatGPT technology.
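Score consistency of this kind is commonly summarized with simple agreement statistics. The sketch below compares a teacher's scores with those of two model versions using exact agreement, adjacent agreement (within one rubric point), and mean absolute difference. The article does not name the metrics the studies used, so these choices are assumptions, and the function name `agreement_summary`, the rubric scale, and all scores are hypothetical.

```python
def agreement_summary(scores_a, scores_b, tolerance=1):
    """Summarize how closely two raters' scores match on the same essays."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,            # identical scores
        "adjacent": sum(d <= tolerance for d in diffs) / n,  # within `tolerance` points
        "mean_abs_diff": sum(diffs) / n,
    }


# Hypothetical rubric scores (1-6) for the same eight essays.
# These numbers are NOT data from any of the studies.
teacher_scores = [4, 3, 5, 2, 4, 3, 5, 4]
model_v1_scores = [4, 3, 4, 2, 5, 3, 5, 3]
model_v2_scores = [4, 4, 5, 2, 4, 3, 5, 4]

teacher_vs_v1 = agreement_summary(teacher_scores, model_v1_scores)
teacher_vs_v2 = agreement_summary(teacher_scores, model_v2_scores)
v1_vs_v2 = agreement_summary(model_v1_scores, model_v2_scores)

for label, result in [("teacher vs v1", teacher_vs_v1),
                      ("teacher vs v2", teacher_vs_v2),
                      ("v1 vs v2", v1_vs_v2)]:
    print(label, result)
```

Comparing the two model versions with each other, as in the last call, is one way to surface the between-generation inconsistency the paragraph above describes.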

The pace of AI technology improvement may lead to different conversations about AI grading in the near future. As AI systems are expected to improve their feedback capabilities over time, it is essential to stay informed and adapt to these changes to ensure the best possible learning experiences for students.

| Aspect | AI-Generated Writing Assessment (e.g., ChatGPT) | Human Teacher Assessment |
|---|---|---|
| Effectiveness in feedback | Comparable improvement in writing scores | Established reliability |
| Scoring consistency | High when structured properly; minor limitations noted | Gold standard; may have inter-rater variance |
| Usefulness for different proficiency levels | Particularly effective with lower-proficiency learners | Effective across proficiency levels |
| Pedagogical integration | Best used as complement; requires guided ethical use | Direct relational engagement |
| Limitations | Current evidence mostly on immediate gains; single-rater scoring | Time-intensive; influenced by teacher workload |

References:
[1] [Study link]
[2] [Study link]
[3] [Study link]
[4] [Study link]

AI writing assessment tools such as ChatGPT can provide feedback and scoring that are nearly as effective as those of human teachers, especially when the feedback is carefully structured.
In EFL argumentative writing, both AI and human feedback significantly improved students' writing scores, with minimal differences in effectiveness between the two.
AI's use as a complement to human instructors was emphasized, as it offers a scalable solution for settings with limited teaching resources and can help students develop critical feedback and revision skills.
AI feedback can be particularly effective for lower-proficiency learners, who showed high responsiveness to structured feedback.
As AI systems are expected to improve, it is crucial to stay informed and adapt to ensure the best possible learning experiences for students, addressing issues such as unsanctioned use, ethical concerns, and careful implementation.
