About the Benchmark

This page describes the benchmark, its licensing, and planned updates.

Abstract

Large Language Models (LLMs) have transformed educational technology, enhancing personalized learning and curriculum delivery. However, their capabilities in tasks such as relationship verification, multi-hop reasoning, and relationship extraction within educational knowledge graphs (KGs) remain under-explored. These tasks are crucial for building intelligent educational systems that accurately model and navigate structured knowledge components (KCs) and their prerequisite relationships. This paper presents a comprehensive evaluation of pre-trained LLMs on these tasks: we assess their performance on one-hop relationship verification, multi-hop chain reasoning, and relationship extraction across diverse datasets. Our findings reveal substantial variability in LLM performance, particularly as task complexity increases. While models such as GPT-4 and Claude-3 are robust on simpler tasks, their accuracy declines on more complex multi-hop reasoning. Our contributions include an evaluation framework for these educational tasks, detailed analyses of LLM performance, and recommendations for improving model training and application. This work advances our understanding of LLM capabilities in education and offers practical insights for their effective integration, with the aim of improving learning efficiency and accessibility.
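To make the two graph-based tasks concrete, here is a minimal sketch (not the benchmark's actual schema or evaluation code) of how a prerequisite KG over knowledge components might be represented, and what the one-hop verification and multi-hop chain reasoning questions reduce to. The graph contents and function names below are hypothetical and purely illustrative.

```python
# Illustrative sketch only: knowledge components (KCs) and prerequisite edges
# stored as an adjacency mapping, plus the two graph queries that the
# verification and multi-hop reasoning tasks ask an LLM to answer.
from collections import deque

# Hypothetical prerequisite graph: an edge u -> v means "u is a prerequisite of v".
PREREQUISITES = {
    "arithmetic": ["algebra"],
    "algebra": ["calculus", "linear algebra"],
    "linear algebra": ["machine learning"],
    "calculus": ["machine learning"],
}

def is_direct_prerequisite(graph, a, b):
    """One-hop relationship verification: does the edge a -> b exist?"""
    return b in graph.get(a, [])

def has_prerequisite_path(graph, a, b):
    """Multi-hop chain reasoning: is there any prerequisite path from a to b?"""
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(is_direct_prerequisite(PREREQUISITES, "algebra", "calculus"))            # True
print(has_prerequisite_path(PREREQUISITES, "arithmetic", "machine learning"))  # True
```

In the benchmark, an LLM answers such questions from natural-language prompts rather than by traversing an explicit graph; the ground-truth labels correspond to the kinds of graph checks sketched above.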

License

This work is licensed under a CC BY 4.0 license.

Acknowledgement

We will update this section before NeurIPS 2024.

Update Plan

  • We will benchmark more LLMs on the dataset provided in this work.
  • Ongoing work: benchmarking Llama-3 and Qwen2-72B-Instruct.