Benchmarking LLMs on KGs with KCs and prerequisite relationships.

The benchmarking results are presented below.

Benchmarking on One-hop Relationship Verification

The tables below report the accuracy (in %) of the investigated LLMs on the task of one-hop relationship verification.
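As a minimal sketch of how the per-split numbers relate, assuming the task is posed as a yes/no verification judged against gold labels (the sample structure below is illustrative, not the benchmark's actual code), the positive-, negative-, and overall-sample accuracies are simply the same metric computed over different subsets:

```python
# Sketch: scoring yes/no verification answers on positive and negative samples.
# The sample dicts and toy predictions are illustrative assumptions.

def accuracy(samples):
    """Percentage of samples where the model's yes/no answer matches the label."""
    correct = sum(1 for s in samples if s["prediction"] == s["label"])
    return 100.0 * correct / len(samples)

positives = [{"label": True,  "prediction": True},
             {"label": True,  "prediction": False}]
negatives = [{"label": False, "prediction": False},
             {"label": False, "prediction": False}]

acc_pos = accuracy(positives)               # accuracy on positive samples
acc_neg = accuracy(negatives)               # accuracy on negative samples
acc_all = accuracy(positives + negatives)   # accuracy on all samples
print(acc_pos, acc_neg, acc_all)  # 50.0 100.0 75.0
```

Note that the overall accuracy is a sample-count-weighted mean of the per-split accuracies, so a model that answers "no" indiscriminately can still score well overall when negatives dominate.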

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Q. R. Q. R. Q. R. Q. R.
GPT-4 28.6 46.6 54.4 56.6 20.2 8.6 58.9 46.7
Qwen-turbo 50.2 64.6 49.9 68.6 42.1 62.7 72.2 76.8
Moonshot-v1-128k 46.5 66.0 56.7 92.4 38.1 30.4 80.5 68.5
Claude-3-haiku-20240307 56.3 81.9 82.9 86.5 65.9 52.9 80.9 76.8
Yi-34b-chat-0205 26.4 26.3 54.6 76.7 56.2 34.7 62.5 50.1
Gemini-1.5-pro 23.9 60.4 22.1 64.5 23.9 8.4 24.8 64.2

2) Accuracy (in %) on all negative samples.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Q. R. Q. R. Q. R. Q. R.
GPT-4 96.8 86.8 80.6 58.8 82.3 86.7 88.8 90.9
Qwen-turbo 84.4 86.8 86.1 56.7 74.3 72.9 94.4 97.0
Moonshot-v1-128k 88.7 80.2 64.9 24.6 76.3 80.6 70.7 88.7
Claude-3-haiku-20240307 84.5 54.1 45.0 50.7 52.1 66.9 54.7 72.3
Yi-34b-chat-0205 94.6 96.5 86.9 38.7 74.1 80.9 78.7 88.3
Gemini-1.5-pro 98.1 88.6 88.3 76.7 66.1 88.6 93.0 76.4

3) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Q. R. Q. R. Q. R. Q. R.
GPT-4 62.7 66.7 67.5 57.7 47.3 67.7 73.9 68.8
Qwen-turbo 75.3 83.7 68.0 62.7 58.2 67.8 79.3 68.9
Moonshot-v1-128k 67.6 73.1 60.8 58.5 57.2 55.5 75.6 78.6
Claude-3-haiku-20240307 70.4 68.0 63.9 68.6 59.0 59.9 82.9 76.9
Yi-34b-chat-0205 60.5 61.4 70.7 57.8 65.3 57.8 70.6 69.2
Gemini-1.5-pro 61.0 74.5 55.2 70.6 45.0 48.5 58.9 70.3

Benchmarking on Conjunction-two Relationship Verification

The tables below report the accuracy (in %) of the investigated LLMs on the task of conjunction-two relationship verification.
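The p1/p2/p12 labels in the following tables presumably index which of the two conjoined prerequisites is corrupted in a negative sample. A hypothetical sketch under that assumption (the distractors and naming are illustrative, not the benchmark's actual generation code):

```python
# Sketch: hypothetical negatives for a conjunction of two prerequisites
# (p1 AND p2 -> target). The distractors d1/d2 and the scheme itself are
# assumptions inferred from the p1/p2/p12 labels, not the benchmark's code.

def conjunction_two_negatives(p1, p2, target, d1, d2):
    return {
        "p1":  (d1, p2, target),  # first prerequisite replaced
        "p2":  (p1, d2, target),  # second prerequisite replaced
        "p12": (d1, d2, target),  # both prerequisites replaced
    }

negs = conjunction_two_negatives("limits", "derivatives", "integrals",
                                 "grammar", "rhyme")
print(negs["p12"])  # ('grammar', 'rhyme', 'integrals')
```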

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 55.5 11.1 92.3 69.2
Qwen-turbo 67.6 44.0 62.0 92.0
Moonshot-v1-128k 67.0 67.0 85.0 92.0
Claude-3-haiku-20240307 78.0 78.0 92.0 85.0
Yi-34b-chat-0205 44.0 56.0 85.0 77.0
Gemini-1.5-pro 33.0 67.0 77.0 69.0

2) Accuracy (in %) on negative p1 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 89.00 78.00 54.00 54.00
Qwen-turbo 67.00 89.00 69.00 62.00
Moonshot-v1-128k 33.00 67.00 23.00 0.00
Claude-3-haiku-20240307 56.00 44.00 15.00 46.00
Yi-34b-chat-0205 78.00 78.00 31.00 54.00
Gemini-1.5-pro 89.00 89.00 77.00 77.00

3) Accuracy (in %) on negative p2 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 56.00 67.00 69.00 54.00
Qwen-turbo 44.00 56.00 46.00 23.00
Moonshot-v1-128k 44.00 44.00 8.00 8.00
Claude-3-haiku-20240307 44.00 22.00 54.00 69.00
Yi-34b-chat-0205 56.00 56.00 38.00 69.00
Gemini-1.5-pro 89.00 89.00 92.00 69.00

4) Accuracy (in %) on negative p12 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 100.00 89.00 54.00 92.00
Qwen-turbo 56.00 78.00 85.00 54.00
Moonshot-v1-128k 33.00 78.00 46.00 15.00
Claude-3-haiku-20240307 67.00 78.00 77.00 92.00
Yi-34b-chat-0205 67.00 89.00 46.00 62.00
Gemini-1.5-pro 89.00 89.00 77.00 69.00

5) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 68.58 44.55 75.65 67.93
Qwen-turbo 61.63 59.17 64.33 69.17
Moonshot-v1-128k 51.83 65.00 55.33 49.83
Claude-3-haiku-20240307 66.83 63.00 70.33 77.00
Yi-34b-chat-0205 55.50 65.17 61.67 69.33
Gemini-1.5-pro 61.00 78.00 79.50 70.33

Benchmarking on Conjunction-three Relationship Verification

The tables below report the accuracy (in %) of the investigated LLMs on the task of conjunction-three relationship verification.

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 80.0 67.0 67.0
Qwen-turbo 80.0 60.0 56.0 100.0
Moonshot-v1-128k 60.0 80.0 100.0 100.0
Claude-3-haiku-20240307 80.0 100.0 89.0 89.0
Yi-34b-chat-0205 60.0 80.0 100.0 89.0
Gemini-1.5-pro 20.0 40.0 56.0 89.0

2) Accuracy (in %) on negative p1 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 60.0 67.0 56.0
Qwen-turbo 60.0 60.0 67.0 67.0
Moonshot-v1-128k 60.0 40.0 56.0 22.0
Claude-3-haiku-20240307 80.0 80.0 67.0 56.0
Yi-34b-chat-0205 80.0 80.0 67.0 56.0
Gemini-1.5-pro 80.0 80.0 56.0 56.0

3) Accuracy (in %) on negative p2 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 80.0 80.0 67.0 56.0
Qwen-turbo 60.0 60.0 89.0 44.0
Moonshot-v1-128k 60.0 80.0 56.0 0.0
Claude-3-haiku-20240307 80.0 20.0 33.0 22.0
Yi-34b-chat-0205 80.0 100.0 56.0 56.0
Gemini-1.5-pro 80.0 100.0 100.0 56.0

4) Accuracy (in %) on negative p3 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 40.0 78.0 33.0
Qwen-turbo 80.0 60.0 78.0 44.0
Moonshot-v1-128k 60.0 80.0 22.0 11.0
Claude-3-haiku-20240307 80.0 80.0 56.0 56.0
Yi-34b-chat-0205 100.0 80.0 67.0 67.0
Gemini-1.5-pro 60.0 60.0 44.0 56.0

5) Accuracy (in %) on negative p12 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 60.0 80.0 89.0 89.0
Qwen-turbo 100.0 80.0 89.0 67.0
Moonshot-v1-128k 100.0 80.0 67.0 44.0
Claude-3-haiku-20240307 60.0 40.0 78.0 44.0
Yi-34b-chat-0205 80.0 80.0 78.0 56.0
Gemini-1.5-pro 100.0 60.0 67.0 56.0

6) Accuracy (in %) on negative p13 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 40.0 60.0 78.0 89.0
Qwen-turbo 40.0 40.0 78.0 56.0
Moonshot-v1-128k 60.0 20.0 67.0 22.0
Claude-3-haiku-20240307 40.0 0.0 33.0 11.0
Yi-34b-chat-0205 80.0 80.0 56.0 67.0
Gemini-1.5-pro 60.0 80.0 67.0 56.0

7) Accuracy (in %) on negative p23 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 80.0 60.0 56.0 67.0
Qwen-turbo 40.0 80.0 67.0 22.0
Moonshot-v1-128k 100.0 80.0 67.0 11.0
Claude-3-haiku-20240307 60.0 40.0 56.0 22.0
Yi-34b-chat-0205 100.0 100.0 44.0 56.0
Gemini-1.5-pro 80.0 80.0 56.0 56.0

8) Accuracy (in %) on negative p123 samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 100.0 100.0 89.0 89.0
Qwen-turbo 80.0 100.0 100.0 100.0
Moonshot-v1-128k 80.0 40.0 89.0 67.0
Claude-3-haiku-20240307 80.0 40.0 78.0 67.0
Yi-34b-chat-0205 100.0 100.0 89.0 56.0
Gemini-1.5-pro 100.0 60.0 78.0 67.0

9) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course
Q. R. Q. R.
GPT-4 64.3 74.3 70.9 67.7
Qwen-turbo 72.9 64.3 68.6 78.6
Moonshot-v1-128k 67.1 70.0 80.3 62.6
Claude-3-haiku-20240307 74.3 71.4 73.1 64.4
Yi-34b-chat-0205 74.3 84.3 82.6 74.1
Gemini-1.5-pro 50.0 57.1 61.4 73.3

Benchmarking on Two-hop Reasoning

The tables below report the accuracy (in %) of the investigated LLMs on the task of two-hop reasoning.
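The negative categories in the following tables suggest perturbations of a gold two-hop chain A → B → C. A hypothetical sketch of how such negatives might be constructed (all names and perturbation details are assumptions, not the benchmark's actual generation code):

```python
# Sketch: hypothetical negatives for a gold two-hop prerequisite chain
# A -> B -> C. Every perturbation detail here is an illustrative assumption.

def two_hop_negatives(chain, distractor):
    a, b, c = chain
    return {
        "invert_relationships": (c, b, a),           # reverse the claimed direction
        "replace_terminal": (a, b, distractor),      # unrelated terminal concept
        "replace_intermediate": (a, distractor, c),  # broken bridging step
        # "path disruption" (removing an edge so no A -> C path remains)
        # would operate on the graph itself and is omitted from this sketch.
    }

negs = two_hop_negatives(("sets", "functions", "calculus"), "phonetics")
print(negs["replace_terminal"])  # ('sets', 'functions', 'phonetics')
```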

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 93.8 71.9 100.0 71.9 100.0 65.6
Qwen-turbo 90.6 56.3 90.6 71.9 93.8 71.9
Moonshot-v1-128k 100.0 59.4 100.0 68.8 93.8 59.4
Claude-3-haiku-20240307 93.8 93.8 100.0 78.1 93.8 90.6
Yi-34b-chat-0205 65.6 59.4 68.8 56.3 71.9 56.3
Gemini-1.5-pro 65.6 56.3 90.6 68.8 71.9 59.4

2) Accuracy (in %) on inverted-relationship samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 25.0 37.5 50.0 100.0 25.0 0.0
Qwen-turbo 12.5 50.0 75.0 62.5 0.0 37.5
Moonshot-v1-128k 12.5 25.0 0.0 25.0 25.0 75.0
Claude-3-haiku-20240307 37.5 25.0 12.5 37.5 12.5 12.5
Yi-34b-chat-0205 12.5 25.0 50.0 62.5 37.5 25.0
Gemini-1.5-pro 37.5 50.0 50.0 50.0 25.0 37.5

3) Accuracy (in %) on replaced-terminal samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 50.0 0.0 37.5 0.0 75.0
Qwen-turbo 25.0 75.0 37.5 50.0 12.5 75.0
Moonshot-v1-128k 0.0 75.0 0.0 50.0 0.0 50.0
Claude-3-haiku-20240307 37.5 25.0 0.0 37.5 0.0 37.5
Yi-34b-chat-0205 62.5 62.5 25.0 75.0 0.0 50.0
Gemini-1.5-pro 37.5 62.5 0.0 37.5 12.5 50.0

4) Accuracy (in %) on replaced-intermediate samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 50.0 0.0 50.0 0.0 25.0
Qwen-turbo 25.0 50.0 25.0 37.5 12.5 50.0
Moonshot-v1-128k 12.5 37.5 0.0 25.0 0.0 25.0
Claude-3-haiku-20240307 12.5 50.0 0.0 25.0 0.0 0.0
Yi-34b-chat-0205 50.0 50.0 0.0 75.0 37.5 37.5
Gemini-1.5-pro 25.0 37.5 12.5 25.0 12.5 25.0

5) Accuracy (in %) on path-disruption samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 25.0 0.0 50.0 0.0 37.5
Qwen-turbo 37.5 62.5 12.5 12.5 50.0 62.5
Moonshot-v1-128k 12.5 37.5 0.0 50.0 12.5 62.5
Claude-3-haiku-20240307 25.0 25.0 0.0 12.5 25.0 0.0
Yi-34b-chat-0205 25.0 62.5 12.5 75.0 37.5 62.5
Gemini-1.5-pro 12.5 50.0 12.5 50.0 25.0 0.0

6) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 50.0 56.3 56.3 65.6 53.1 50.0
Qwen-turbo 57.8 57.8 64.1 56.3 56.3 64.1
Moonshot-v1-128k 54.7 51.6 50.0 53.1 51.6 56.3
Claude-3-haiku-20240307 60.9 62.5 51.6 53.1 51.6 51.6
Yi-34b-chat-0205 51.6 54.7 45.3 64.1 50.0 50.0
Gemini-1.5-pro 46.9 53.1 54.7 54.7 45.3 43.8

Benchmarking on Three-hop Reasoning

The tables below report the accuracy (in %) of the investigated LLMs on the task of three-hop reasoning.

1) Accuracy (in %) on all positive samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 96.0 56.0 100.0 92.0 100.0 52.0
Qwen-turbo 100.0 64.0 96.0 64.0 96.0 32.0
Moonshot-v1-128k 100.0 72.0 100.0 60.0 92.0 48.0
Claude-3-haiku-20240307 88.0 84.0 100.0 84.0 100.0 88.0
Yi-34b-chat-0205 84.0 84.0 88.0 28.0 96.0 68.0
Gemini-1.5-pro 84.0 72.0 88.0 60.0 88.0 56.0

2) Accuracy (in %) on all negative samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 0.0 60.0 0.0 16.0 0.0 44.0
Qwen-turbo 8.0 56.0 8.0 36.0 12.0 60.0
Moonshot-v1-128k 0.0 52.0 0.0 44.0 4.0 76.0
Claude-3-haiku-20240307 0.0 12.0 0.0 20.0 16.0 24.0
Yi-34b-chat-0205 8.0 56.0 4.0 76.0 4.0 52.0
Gemini-1.5-pro 32.0 80.0 0.0 44.0 28.0 76.0

3) Accuracy (in %) on all samples.

LLM DBE-KT22 WDKG-Course Junyi-Prerequisites
Q. R. Q. R. Q. R.
GPT-4 48.0 58.0 50.0 54.0 50.0 48.0
Qwen-turbo 54.0 60.0 52.0 50.0 54.0 46.0
Moonshot-v1-128k 50.0 62.0 50.0 52.0 48.0 62.0
Claude-3-haiku-20240307 44.0 48.0 50.0 52.0 58.0 56.0
Yi-34b-chat-0205 46.0 70.0 46.0 52.0 50.0 60.0
Gemini-1.5-pro 58.0 76.0 44.0 52.0 58.0 66.0

Benchmarking on Relationship Extraction with Node Pairs

The following tables report the accuracy, recall, precision, AUROC, and AUPRC of the investigated LLMs on the task of relationship extraction with node pairs.
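The Binary and Float columns in the AUROC/AUPRC tables plausibly correspond to thresholded yes/no answers versus raw confidence scores; that reading is an assumption. For reference, a dependency-free sketch of the two ranking metrics (my own minimal implementations on toy data, not the benchmark's code):

```python
# Sketch: minimal AUROC and AUPRC implementations on toy labels/scores.
# The data and the binary-vs-float interpretation are illustrative assumptions.

def auroc(y_true, scores):
    """Probability a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(y_true, scores):
    """Average precision: mean of precision at each true-positive rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions = []
    for i in order:
        if y_true[i] == 1:
            tp += 1
            precisions.append(tp / (tp + fp))
        else:
            fp += 1
    return sum(precisions) / sum(y_true)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1]   # float outputs (confidence scores)
print(round(auroc(labels, scores), 3))    # 0.889
print(round(auprc(labels, scores), 3))    # 0.917
```

With binary outputs the score takes only two values, so the ranking is coarse; this is one reason the Float columns often exceed the Binary columns in the tables below.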

1) Accuracy, recall, and precision values (in %) of all investigated LLMs on relationship extraction with binary outputs on node pairs from four datasets.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Acc Recall Prec Acc Recall Prec Acc Recall Prec Acc Recall Prec
GPT-4 68.0 50.0 87.5 52.0 28.6 66.7 36.0 0.0 0.0 68.0 60.0 81.8
Qwen-turbo 64.0 50.0 77.8 56.0 35.7 71.4 60.0 33.3 100.0 72.0 66.7 83.3
Moonshot-v1-128k 60.0 42.9 75.0 56.6 28.6 80.0 40.0 6.7 50.0 72.0 53.3 100.0
Claude-3-haiku-20240307 84.0 78.6 91.7 80.0 78.6 84.6 56.0 46.7 70.0 72.0 60.0 90.0
Yi-34b-chat-0205 60.0 35.7 83.3 64.0 50.0 77.8 60.0 40.0 85.7 56.0 33.3 83.3
Gemini-1.5-pro 52.0 28.6 66.7 52.0 21.4 75.0 36.0 6.7 33.3 56.0 26.7 100.0

2) AUROC (in %) of all investigated LLMs on relationship extraction with binary and float outputs on node pairs from four datasets.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Binary Float Binary Float Binary Float Binary Float
GPT-4 70.5 89.3 55.2 77.3 46.0 67.3 70.0 92.3
Qwen-turbo 65.9 64.3 58.8 71.0 66.7 63.6 73.3 78.5
Moonshot-v1-128k 62.3 67.2 59.7 36.7 48.3 45.6 76.7 75.0
Claude-3-haiku-20240307 84.7 79.5 80.2 66.7 58.3 78.7 75.0 83.7
Yi-34b-chat-0205 63.3 81.8 65.9 93.3 65.0 44.9 61.7 96.8
Gemini-1.5-pro 55.2 82.5 56.2 73.0 33.3 43.4 63.3 92.0

3) AUPRC (in %) of all investigated LLMs on relationship extraction with binary and float outputs on node pairs from four datasets.

LLM DBE-KT22 WDKG-Course WDKG-KnowledgePoints Junyi-Prerequisites
Binary Float Binary Float Binary Float Binary Float
GPT-4 71.8 86.0 59.1 73.9 60.0 81.2 73.1 90.8
Qwen-turbo 66.9 64.7 61.5 61.0 73.3 77.4 75.6 78.0
Moonshot-v1-128k 64.1 61.6 62.9 40.0 59.3 72.1 81.3 78.9
Claude-3-haiku-20240307 84.0 65.9 78.5 58.1 64.7 87.5 78.0 79.6
Yi-34b-chat-0205 65.8 74.4 66.9 86.7 70.3 64.8 67.8 94.1
Gemini-1.5-pro 59.1 76.9 60.1 63.2 58.2 64.1 70.7 93.4

Benchmarking on Relationship Extraction with Subgraphs

The following tables report the AUROC and AUPRC of the investigated LLMs on the task of relationship extraction with subgraphs on the DBE-KT22 dataset.

1) Average AUROC (in %) of investigated LLMs on relationship extraction with subgraphs on the DBE-KT22 dataset.

LLM Subgraph 5 Subgraph 10 Subgraph 15
Binary Float Binary Float Binary Float
GPT-4 54.5 65.4 69.9 70.1 60.6 66.2
Qwen-turbo 67.0 77.3 60.8 71.4 56.8 65.8
Moonshot-v1-128k 51.0 70.4 61.0 65.4 56.1 60.0
Claude-3-haiku-20240307 60.5 70.5 50.4 66.3 47.9 59.9
Yi-34b-chat-0205 61.5 75.1 54.7 61.0 50.9 59.0
Gemini-1.5-pro 63.0 80.2 59.9 72.2 54.5 63.5

2) Average AUPRC (in %) of investigated LLMs on relationship extraction with subgraphs on the DBE-KT22 dataset.

LLM Subgraph 5 Subgraph 10 Subgraph 15
Binary Float Binary Float Binary Float
GPT-4 61.2 72.3 63.7 69.2 58.8 63.4
Qwen-turbo 69.7 77.0 53.6 71.0 53.4 63.5
Moonshot-v1-128k 54.3 71.3 55.5 64.5 54.5 57.8
Claude-3-haiku-20240307 59.7 70.6 47.7 61.4 52.8 56.3
Yi-34b-chat-0205 61.5 79.6 49.4 61.1 51.6 57.5
Gemini-1.5-pro 63.0 79.7 54.1 63.0 53.8 58.1