Does the AI agree? Inter-rater agreement in learning diary evaluation

Abstract


Theory

Learning diaries used as formative assessments are a promising way to support the learning process and stimulate reflection. In an open learning diary, learners apply learning strategies to reflect on their learning and deepen their knowledge. To guide learners, learning diaries can be structured according to different learning strategies. However, grading learning diaries and providing feedback on them is effortful for teachers. Artificial intelligence (AI) may assist teachers in the evaluation process. A prerequisite is that AI and teachers show high inter-rater agreement.

Research Questions

The current study aimed to analyze the agreement between teachers and ChatGPT-4o by examining four separately assessed learning strategy categories of a learning diary in adult education (organization, in-depth elaboration, transfer-supporting elaboration, and metacognition). The two research questions were:

• Q1: How high is the overall agreement between teachers and ChatGPT-4o, and are there differences across the learning strategy categories?

• Q2: Does the inter-rater agreement differ between teachers across the four learning strategy categories?

Method

Seven adult education teachers and ChatGPT-4o evaluated a total of 540 learning diary entries; each teacher assessed approximately 65 entries. The teachers were trained in criteria-based evaluation for each learning strategy category, while ChatGPT-4o was guided by an engineered prompt.
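
The engineered prompt itself is not reproduced in the abstract. Purely as an illustration, a criteria-based evaluation call to ChatGPT-4o via the OpenAI Python SDK could look like the following sketch; the prompt wording, the 0-3 rating scale, and the requested JSON output format are assumptions, not the materials actually used in the study.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical system prompt; the study's engineered prompt is not reported.
SYSTEM_PROMPT = (
    "You rate learning diary entries on four learning strategy categories: "
    "organization, in-depth elaboration, transfer-supporting elaboration, and "
    "metacognition. For each category, return an integer from 0 (not present) "
    "to 3 (fully realized) as a JSON object with those four keys."
)

def rate_entry(entry_text: str) -> str:
    """Send one diary entry to GPT-4o and return the raw model output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep ratings as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": entry_text},
        ],
    )
    return response.choices[0].message.content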

Teacher ratings served as the reference for inter-rater agreement. As accuracy measures, absolute accuracy and under-/overestimation (bias) were calculated for each learning strategy category. In addition, overall values for absolute accuracy and bias were calculated across the four categories.
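
The abstract does not define the two measures formally. In judgment-accuracy research, absolute accuracy is commonly operationalized as the mean absolute deviation between the two ratings and bias as the mean signed deviation, with positive values indicating overestimation by the AI. The following sketch assumes this operationalization and a long-format table with hypothetical column names.

import pandas as pd

# Hypothetical long-format data: one row per (entry, category) pair with the
# teacher rating as reference and the ChatGPT-4o rating; column names are assumed.
ratings = pd.DataFrame({
    "entry_id":       [1, 1, 2, 2],
    "category":       ["organization", "metacognition", "organization", "metacognition"],
    "teacher_rating": [2, 1, 3, 0],
    "ai_rating":      [2, 2, 2, 0],
})

ratings["abs_accuracy"] = (ratings["ai_rating"] - ratings["teacher_rating"]).abs()
ratings["bias"] = ratings["ai_rating"] - ratings["teacher_rating"]  # > 0: AI overestimates

# Accuracy measures per learning strategy category ...
per_category = ratings.groupby("category")[["abs_accuracy", "bias"]].mean()
# ... and overall values across the categories.
overall = ratings[["abs_accuracy", "bias"]].mean()
print(per_category, overall, sep="\n\n")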

To test for accuracy differences between the learning strategy categories (Q1), a doubly multivariate repeated-measures ANOVA was conducted with the four learning strategy categories as the within-subjects (repeated-measures) factor and absolute accuracy and bias as the two dependent measures. Teacher was additionally included as a between-subjects factor, so that the interaction of learning strategy category and rating teacher with respect to accuracy could be tested statistically (Q2).
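
Doubly multivariate repeated-measures ANOVAs are typically fitted in SPSS or R, and the abstract does not report the software used. As a rough Python approximation only, the sketch below runs a separate mixed ANOVA per accuracy measure with pingouin (category as within-subjects factor, teacher as between-subjects factor), which drops the joint "doubly multivariate" modelling of the two measures. All data are synthetic and the column names are assumptions.

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
categories = ["organization", "in_depth_elaboration",
              "transfer_supporting_elaboration", "metacognition"]

# Synthetic per-entry accuracy scores, purely for illustration: seven teachers,
# 70 entries each, every entry scored on all four categories.
rows = [
    {"teacher": f"T{t}", "entry_id": f"T{t}_{e}", "category": cat,
     "abs_accuracy": int(rng.integers(0, 3)), "bias": int(rng.integers(-2, 3))}
    for t in range(1, 8) for e in range(70) for cat in categories
]
data = pd.DataFrame(rows)

# One mixed ANOVA per measure as an approximation of the doubly multivariate model:
# the category main effect addresses Q1, the category x teacher interaction Q2.
for measure in ["abs_accuracy", "bias"]:
    aov = pg.mixed_anova(data=data, dv=measure, within="category",
                         between="teacher", subject="entry_id")
    print(measure)
    print(aov.round(3))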

More about this title

Title: Does the AI agree? Inter-rater agreement in learning diary evaluation
Venue: 14th Conference of the Media Psychology Division (DGPs) in Duisburg
Authors: Dr. Nick Naujoks-Schober, Lhea Reinhold, Prof. Dr. habil. Marion Händel
Publication date: 11 September 2025
Citation: Naujoks-Schober, Nick; Reinhold, Lhea; Händel, Marion (2025): Does the AI agree? Inter-rater agreement in learning diary evaluation. 14th Conference of the Media Psychology Division (DGPs) in Duisburg.