Benchmarking LLMs on KGs with KCs and prerequisite relationships.

The benchmarking results are presented below.

Benchmarking on One-hop Relationship Verification

The tables below report the accuracy (in %) of the investigated LLMs on the task of one-hop relationship verification.
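As a minimal sketch of how the per-split numbers relate, assuming the task is posed as a yes/no verification judged against gold labels (the sample structure below is illustrative, not the benchmark's actual code), the positive-, negative-, and overall-sample accuracies are simply the same metric computed over different subsets:

```python
# Sketch: scoring yes/no verification answers on positive and negative samples.
# The sample dicts and toy predictions are illustrative assumptions.

def accuracy(samples):
    """Percentage of samples where the model's yes/no answer matches the label."""
    correct = sum(1 for s in samples if s["prediction"] == s["label"])
    return 100.0 * correct / len(samples)

positives = [{"label": True,  "prediction": True},
             {"label": True,  "prediction": False}]
negatives = [{"label": False, "prediction": False},
             {"label": False, "prediction": False}]

acc_pos = accuracy(positives)               # accuracy on positive samples
acc_neg = accuracy(negatives)               # accuracy on negative samples
acc_all = accuracy(positives + negatives)   # accuracy on all samples
print(acc_pos, acc_neg, acc_all)  # 50.0 100.0 75.0
```

Note that the overall accuracy is a sample-count-weighted mean of the per-split accuracies, so a model that answers "no" indiscriminately can still score well overall when negatives dominate.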

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Q. R. Q. R. Q. R. Q. R.
GPT-4 28.6 46.6 54.4 56.6 20.2 8.6 58.9 46.7
Qwen-turbo 50.2 64.6 49.9 68.6 42.1 62.7 72.2 76.8
Moonshot-v1-128k 46.5 66.0 56.7 92.4 38.1 30.4 80.5 68.5
Claude-3-haiku-20240307 56.3 81.9 82.9 86.5 65.9 52.9 80.9 76.8
Yi-34b-chat-0205 26.4 26.3 54.6 76.7 56.2 34.7 62.5 50.1
Gemini-1.5-pro 23.9 60.4 22.1 64.5 23.9 8.4 24.8 64.2

2) Accuracy (in %) on all negative samples.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Q. R. Q. R. Q. R. Q. R.
GPT-4 96.8 86.8 80.6 58.8 82.3 86.7 88.8 90.9
Qwen-turbo 84.4 86.8 86.1 56.7 74.3 72.9 94.4 97.0
Moonshot-v1-128k 88.7 80.2 64.9 24.6 76.3 80.6 70.7 88.7
Claude-3-haiku-20240307 84.5 54.1 45.0 50.7 52.1 66.9 54.7 72.3
Yi-34b-chat-0205 94.6 96.5 86.9 38.7 74.1 80.9 78.7 88.3
Gemini-1.5-pro 98.1 88.6 88.3 76.7 66.1 88.6 93.0 76.4

3) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Q. R. Q. R. Q. R. Q. R.
GPT-4 62.7 66.7 67.5 57.7 47.3 67.7 73.9 68.8
Qwen-turbo 75.3 83.7 68.0 62.7 58.2 67.8 79.3 68.9
Moonshot-v1-128k 67.6 73.1 60.8 58.5 57.2 55.5 75.6 78.6
Claude-3-haiku-20240307 70.4 68.0 63.9 68.6 59.0 59.9 82.9 76.9
Yi-34b-chat-0205 60.5 61.4 70.7 57.8 65.3 57.8 70.6 69.2
Gemini-1.5-pro 61.0 74.5 55.2 70.6 45.0 48.5 58.9 70.3

Benchmarking on Conjunction-two Relationship Verification

The tables below report the accuracy (in %) of the investigated LLMs on the task of conjunction-two relationship verification.
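The p1/p2/p12 labels in the following tables presumably index which of the two conjoined prerequisites is corrupted in a negative sample. A hypothetical sketch under that assumption (the distractors and naming are illustrative, not the benchmark's actual generation code):

```python
# Sketch: hypothetical negatives for a conjunction of two prerequisites
# (p1 AND p2 -> target). The distractors d1/d2 and the scheme itself are
# assumptions inferred from the p1/p2/p12 labels, not the benchmark's code.

def conjunction_two_negatives(p1, p2, target, d1, d2):
    return {
        "p1":  (d1, p2, target),  # first prerequisite replaced
        "p2":  (p1, d2, target),  # second prerequisite replaced
        "p12": (d1, d2, target),  # both prerequisites replaced
    }

negs = conjunction_two_negatives("limits", "derivatives", "integrals",
                                 "grammar", "rhyme")
print(negs["p12"])  # ('grammar', 'rhyme', 'integrals')
```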

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 55.5 11.1 92.3 69.2
Qwen-turbo 67.6 44.0 62.0 92.0
Moonshot-v1-128k 67.0 67.0 85.0 92.0
Claude-3-haiku-20240307 78.0 78.0 92.0 85.0
Yi-34b-chat-0205 44.0 56.0 85.0 77.0
Gemini-1.5-pro 33.0 67.0 77.0 69.0

2) Accuracy (in %) on negative p1 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 89.00 78.00 54.00 54.00
Qwen-turbo 67.00 89.00 69.00 62.00
Moonshot-v1-128k 33.00 67.00 23.00 0.00
Claude-3-haiku-20240307 56.00 44.00 15.00 46.00
Yi-34b-chat-0205 78.00 78.00 31.00 54.00
Gemini-1.5-pro 89.00 89.00 77.00 77.00

3) Accuracy (in %) on negative p2 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 56.00 67.00 69.00 54.00
Qwen-turbo 44.00 56.00 46.00 23.00
Moonshot-v1-128k 44.00 44.00 8.00 8.00
Claude-3-haiku-20240307 44.00 22.00 54.00 69.00
Yi-34b-chat-0205 56.00 56.00 38.00 69.00
Gemini-1.5-pro 89.00 89.00 92.00 69.00

4) Accuracy (in %) on negative p12 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 100.00 89.00 54.00 92.00
Qwen-turbo 56.00 78.00 85.00 54.00
Moonshot-v1-128k 33.00 78.00 46.00 15.00
Claude-3-haiku-20240307 67.00 78.00 77.00 92.00
Yi-34b-chat-0205 67.00 89.00 46.00 62.00
Gemini-1.5-pro 89.00 89.00 77.00 69.00

5) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 68.58 44.55 75.65 67.93
Qwen-turbo 61.63 59.17 64.33 69.17
Moonshot-v1-128k 51.83 65.00 55.33 49.83
Claude-3-haiku-20240307 66.83 63.00 70.33 77.00
Yi-34b-chat-0205 55.50 65.17 61.67 69.33
Gemini-1.5-pro 61.00 78.00 79.50 70.33

Benchmarking on Conjunction-three Relationship Verification

The tables below report the accuracy (in %) of the investigated LLMs on the task of conjunction-three relationship verification.

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 80.0 67.0 67.0
Qwen-turbo 80.0 60.0 56.0 100.0
Moonshot-v1-128k 60.0 80.0 100.0 100.0
Claude-3-haiku-20240307 80.0 100.0 89.0 89.0
Yi-34b-chat-0205 60.0 80.0 100.0 89.0
Gemini-1.5-pro 20.0 40.0 56.0 89.0

2) Accuracy (in %) on negative p1 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 60.0 67.0 56.0
Qwen-turbo 60.0 60.0 67.0 67.0
Moonshot-v1-128k 60.0 40.0 56.0 22.0
Claude-3-haiku-20240307 80.0 80.0 67.0 56.0
Yi-34b-chat-0205 80.0 80.0 67.0 56.0
Gemini-1.5-pro 80.0 80.0 56.0 56.0

3) Accuracy (in %) on negative p2 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 80.0 80.0 67.0 56.0
Qwen-turbo 60.0 60.0 89.0 44.0
Moonshot-v1-128k 60.0 80.0 56.0 0.0
Claude-3-haiku-20240307 80.0 20.0 33.0 22.0
Yi-34b-chat-0205 80.0 100.0 56.0 56.0
Gemini-1.5-pro 80.0 100.0 100.0 56.0

4) Accuracy (in %) on negative p3 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 40.0 78.0 33.0
Qwen-turbo 80.0 60.0 78.0 44.0
Moonshot-v1-128k 60.0 80.0 22.0 11.0
Claude-3-haiku-20240307 80.0 80.0 56.0 56.0
Yi-34b-chat-0205 100.0 80.0 67.0 67.0
Gemini-1.5-pro 60.0 60.0 44.0 56.0

5) Accuracy (in %) on negative p12 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 80.0 89.0 89.0
Qwen-turbo 100.0 80.0 89.0 67.0
Moonshot-v1-128k 100.0 80.0 67.0 44.0
Claude-3-haiku-20240307 60.0 40.0 78.0 44.0
Yi-34b-chat-0205 80.0 80.0 78.0 56.0
Gemini-1.5-pro 100.0 60.0 67.0 56.0

6) Accuracy (in %) on negative p13 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 40.0 60.0 78.0 89.0
Qwen-turbo 40.0 40.0 78.0 56.0
Moonshot-v1-128k 60.0 20.0 67.0 22.0
Claude-3-haiku-20240307 40.0 0.0 33.0 11.0
Yi-34b-chat-0205 80.0 80.0 56.0 67.0
Gemini-1.5-pro 60.0 80.0 67.0 56.0

7) Accuracy (in %) on negative p23 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 80.0 60.0 56.0 67.0
Qwen-turbo 40.0 80.0 67.0 22.0
Moonshot-v1-128k 100.0 80.0 67.0 11.0
Claude-3-haiku-20240307 60.0 40.0 56.0 22.0
Yi-34b-chat-0205 100.0 100.0 44.0 56.0
Gemini-1.5-pro 80.0 80.0 56.0 56.0

8) Accuracy (in %) on negative p123 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 100.0 100.0 89.0 89.0
Qwen-turbo 80.0 100.0 100.0 100.0
Moonshot-v1-128k 80.0 40.0 89.0 67.0
Claude-3-haiku-20240307 80.0 40.0 78.0 67.0
Yi-34b-chat-0205 100.0 100.0 89.0 56.0
Gemini-1.5-pro 100.0 60.0 78.0 67.0

9) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 64.3 74.3 70.9 67.7
Qwen-turbo 72.9 64.3 68.6 78.6
Moonshot-v1-128k 67.1 70.0 80.3 62.6
Claude-3-haiku-20240307 74.3 71.4 73.1 64.4
Yi-34b-chat-0205 74.3 84.3 82.6 74.1
Gemini-1.5-pro 50.0 57.1 61.4 73.3

Benchmarking on Two-hop Reasoning

The tables below report the accuracy (in %) of the investigated LLMs on the task of two-hop reasoning.
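The negative categories in the following tables suggest perturbations of a gold two-hop chain A → B → C. A hypothetical sketch of how such negatives might be constructed (all names and perturbation details are assumptions, not the benchmark's actual generation code):

```python
# Sketch: hypothetical negatives for a gold two-hop prerequisite chain
# A -> B -> C. Every perturbation detail here is an illustrative assumption.

def two_hop_negatives(chain, distractor):
    a, b, c = chain
    return {
        "invert_relationships": (c, b, a),           # reverse the claimed direction
        "replace_terminal": (a, b, distractor),      # unrelated terminal concept
        "replace_intermediate": (a, distractor, c),  # broken bridging step
        # "path disruption" (removing an edge so no A -> C path remains)
        # would operate on the graph itself and is omitted from this sketch.
    }

negs = two_hop_negatives(("sets", "functions", "calculus"), "phonetics")
print(negs["replace_terminal"])  # ('sets', 'functions', 'phonetics')
```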

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 93.8 71.9 100.0 71.9 100.0 65.6
Qwen-turbo 90.6 56.3 90.6 71.9 93.8 71.9
Moonshot-v1-128k 100.0 59.4 100.0 68.8 93.8 59.4
Claude-3-haiku-20240307 93.8 93.8 100.0 78.1 93.8 90.6
Yi-34b-chat-0205 65.6 59.4 68.8 56.3 71.9 56.3
Gemini-1.5-pro 65.6 56.3 90.6 68.8 71.9 59.4

2) Accuracy (in %) on inverted-relationship samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 25.0 37.5 50.0 100.0 25.0 0.0
Qwen-turbo 12.5 50.0 75.0 62.5 0.0 37.5
Moonshot-v1-128k 12.5 25.0 0.0 25.0 25.0 75.0
Claude-3-haiku-20240307 37.5 25.0 12.5 37.5 12.5 12.5
Yi-34b-chat-0205 12.5 25.0 50.0 62.5 37.5 25.0
Gemini-1.5-pro 37.5 50.0 50.0 50.0 25.0 37.5

3) Accuracy (in %) on replaced-terminal samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 50.0 0.0 37.5 0.0 75.0
Qwen-turbo 25.0 75.0 37.5 50.0 12.5 75.0
Moonshot-v1-128k 0.0 75.0 0.0 50.0 0.0 50.0
Claude-3-haiku-20240307 37.5 25.0 0.0 37.5 0.0 37.5
Yi-34b-chat-0205 62.5 62.5 25.0 75.0 0.0 50.0
Gemini-1.5-pro 37.5 62.5 0.0 37.5 12.5 50.0

4) Accuracy (in %) on replaced-intermediate samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 50.0 0.0 50.0 0.0 25.0
Qwen-turbo 25.0 50.0 25.0 37.5 12.5 50.0
Moonshot-v1-128k 12.5 37.5 0.0 25.0 0.0 25.0
Claude-3-haiku-20240307 12.5 50.0 0.0 25.0 0.0 0.0
Yi-34b-chat-0205 50.0 50.0 0.0 75.0 37.5 37.5
Gemini-1.5-pro 25.0 37.5 12.5 25.0 12.5 25.0

5) Accuracy (in %) on path-disruption samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 25.0 0.0 50.0 0.0 37.5
Qwen-turbo 37.5 62.5 12.5 12.5 50.0 62.5
Moonshot-v1-128k 12.5 37.5 0.0 50.0 12.5 62.5
Claude-3-haiku-20240307 25.0 25.0 0.0 12.5 25.0 0.0
Yi-34b-chat-0205 25.0 62.5 12.5 75.0 37.5 62.5
Gemini-1.5-pro 12.5 50.0 12.5 50.0 25.0 0.0

6) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 50.0 56.3 56.3 65.6 53.1 50.0
Qwen-turbo 57.8 57.8 64.1 56.3 56.3 64.1
Moonshot-v1-128k 54.7 51.6 50.0 53.1 51.6 56.3
Claude-3-haiku-20240307 60.9 62.5 51.6 53.1 51.6 51.6
Yi-34b-chat-0205 51.6 54.7 45.3 64.1 50.0 50.0
Gemini-1.5-pro 46.9 53.1 54.7 54.7 45.3 43.8

Benchmarking on Three-hop Reasoning

The tables below report the accuracy (in %) of the investigated LLMs on the task of three-hop reasoning.

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 96.0 56.0 100.0 92.0 100.0 52.0
Qwen-turbo 100.0 64.0 96.0 64.0 96.0 32.0
Moonshot-v1-128k 100.0 72.0 100.0 60.0 92.0 48.0
Claude-3-haiku-20240307 88.0 84.0 100.0 84.0 100.0 88.0
Yi-34b-chat-0205 84.0 84.0 88.0 28.0 96.0 68.0
Gemini-1.5-pro 84.0 72.0 88.0 60.0 88.0 56.0

2) Accuracy (in %) on all negative samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 60.0 0.0 16.0 0.0 44.0
Qwen-turbo 8.0 56.0 8.0 36.0 12.0 60.0
Moonshot-v1-128k 0.0 52.0 0.0 44.0 4.0 76.0
Claude-3-haiku-20240307 0.0 12.0 0.0 20.0 16.0 24.0
Yi-34b-chat-0205 8.0 56.0 4.0 76.0 4.0 52.0
Gemini-1.5-pro 32.0 80.0 0.0 44.0 28.0 76.0

3) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 48.0 58.0 50.0 54.0 50.0 48.0
Qwen-turbo 54.0 60.0 52.0 50.0 54.0 46.0
Moonshot-v1-128k 50.0 62.0 50.0 52.0 48.0 62.0
Claude-3-haiku-20240307 44.0 48.0 50.0 52.0 58.0 56.0
Yi-34b-chat-0205 46.0 70.0 46.0 52.0 50.0 60.0
Gemini-1.5-pro 58.0 76.0 44.0 52.0 58.0 66.0

Benchmarking on Relationship Extraction with Node Pairs

The following tables report the accuracy, recall, precision, AUROC, and AUPRC of the investigated LLMs on the task of relationship extraction with node pairs.
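The Binary and Float columns in the AUROC/AUPRC tables plausibly correspond to thresholded yes/no answers versus raw confidence scores; that reading is an assumption. For reference, a dependency-free sketch of the two ranking metrics (my own minimal implementations on toy data, not the benchmark's code):

```python
# Sketch: minimal AUROC and AUPRC implementations on toy labels/scores.
# The data and the binary-vs-float interpretation are illustrative assumptions.

def auroc(y_true, scores):
    """Probability a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(y_true, scores):
    """Average precision: mean of precision at each true-positive rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions = []
    for i in order:
        if y_true[i] == 1:
            tp += 1
            precisions.append(tp / (tp + fp))
        else:
            fp += 1
    return sum(precisions) / sum(y_true)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1]   # float outputs (confidence scores)
print(round(auroc(labels, scores), 3))    # 0.889
print(round(auprc(labels, scores), 3))    # 0.917
```

With binary outputs the score takes only two values, so the ranking is coarse; this is one reason the Float columns often exceed the Binary columns in the tables below.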

1) Accuracy, recall, and precision values (in %) of all investigated LLMs on relationship extraction with binary outputs on node pairs from four datasets.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Acc Recall Prec Acc Recall Prec Acc Recall Prec Acc Recall Prec
GPT-4 68.0 50.0 87.5 52.0 28.6 66.7 36.0 0.0 0.0 68.0 60.0 81.8
Qwen-turbo 64.0 50.0 77.8 56.0 35.7 71.4 60.0 33.3 100.0 72.0 66.7 83.3
Moonshot-v1-128k 60.0 42.9 75.0 56.6 28.6 80.0 40.0 6.7 50.0 72.0 53.3 100.0
Claude-3-haiku-20240307 84.0 78.6 91.7 80.0 78.6 84.6 56.0 46.7 70.0 72.0 60.0 90.0
Yi-34b-chat-0205 60.0 35.7 83.3 64.0 50.0 77.8 60.0 40.0 85.7 56.0 33.3 83.3
Gemini-1.5-pro 52.0 28.6 66.7 52.0 21.4 75.0 36.0 6.7 33.3 56.0 26.7 100.0

2) AUROC (in %) of all investigated LLMs on relationship extraction with binary and float outputs on node pairs from four datasets.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Binary Float Binary Float Binary Float Binary Float
GPT-4 70.5 89.3 55.2 77.3 46.0 67.3 70.0 92.3
Qwen-turbo 65.9 64.3 58.8 71.0 66.7 63.6 73.3 78.5
Moonshot-v1-128k 62.3 67.2 59.7 36.7 48.3 45.6 76.7 75.0
Claude-3-haiku-20240307 84.7 79.5 80.2 66.7 58.3 78.7 75.0 83.7
Yi-34b-chat-0205 63.3 81.8 65.9 93.3 65.0 44.9 61.7 96.8
Gemini-1.5-pro 55.2 82.5 56.2 73.0 33.3 43.4 63.3 92.0

3) AUPRC (in %) of all investigated LLMs on relationship extraction with binary and float outputs on node pairs from four datasets.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Binary Float Binary Float Binary Float Binary Float
GPT-4 71.8 86.0 59.1 73.9 60.0 81.2 73.1 90.8
Qwen-turbo 66.9 64.7 61.5 61.0 73.3 77.4 75.6 78.0
Moonshot-v1-128k 64.1 61.6 62.9 40.0 59.3 72.1 81.3 78.9
Claude-3-haiku-20240307 84.0 65.9 78.5 58.1 64.7 87.5 78.0 79.6
Yi-34b-chat-0205 65.8 74.4 66.9 86.7 70.3 64.8 67.8 94.1
Gemini-1.5-pro 59.1 76.9 60.1 63.2 58.2 64.1 70.7 93.4

Benchmarking on Relationship Extraction with Subgraphs

The following tables report the AUROC and AUPRC of the investigated LLMs on the task of relationship extraction with subgraphs on the DBE-KT22 dataset.

1) Average AUROC (in %) of investigated LLMs on relationship extraction with subgraphs on the DBE-KT22 dataset.

LLM Subgraph 5 Subgraph 10 Subgraph 15
Binary Float Binary Float Binary Float
GPT-4 54.5 65.4 69.9 70.1 60.6 66.2
Qwen-turbo 67.0 77.3 60.8 71.4 56.8 65.8
Moonshot-v1-128k 51.0 70.4 61.0 65.4 56.1 60.0
Claude-3-haiku-20240307 60.5 70.5 50.4 66.3 47.9 59.9
Yi-34b-chat-0205 61.5 75.1 54.7 61.0 50.9 59.0
Gemini-1.5-pro 63.0 80.2 59.9 72.2 54.5 63.5

2) Average AUPRC (in %) of investigated LLMs on relationship extraction with subgraphs on the DBE-KT22 dataset.

LLM Subgraph 5 Subgraph 10 Subgraph 15
Binary Float Binary Float Binary Float
GPT-4 61.2 72.3 63.7 69.2 58.8 63.4
Qwen-turbo 69.7 77.0 53.6 71.0 53.4 63.5
Moonshot-v1-128k 54.3 71.3 55.5 64.5 54.5 57.8
Claude-3-haiku-20240307 59.7 70.6 47.7 61.4 52.8 56.3
Yi-34b-chat-0205 61.5 79.6 49.4 61.1 51.6 57.5
Gemini-1.5-pro 63.0 79.7 54.1 63.0 53.8 58.1