Benchmarking on One-hop Relationship Verification
The tables below report the accuracy (in %) of the investigated LLMs on the one-hop relationship verification task.
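As a minimal sketch of how per-sample verification accuracy can be scored, assuming the task elicits a yes/no judgement for each concept pair (the answer-parsing heuristic below is illustrative, not the benchmark's exact protocol):

```python
def parse_verdict(response: str) -> bool:
    """Map a free-form LLM response to a boolean verdict (assumed yes/no format)."""
    return response.strip().lower().startswith("yes")

def accuracy(responses, gold_labels) -> float:
    """Percentage of responses whose parsed verdict matches the gold label."""
    hits = sum(parse_verdict(r) == g for r, g in zip(responses, gold_labels))
    return 100.0 * hits / len(gold_labels)

responses = ["Yes, A must be learned before B.", "No.", "Yes"]
gold = [True, False, False]                 # the last prediction is wrong
print(round(accuracy(responses, gold), 1))  # 66.7
```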
1) Accuracy (in %) on all positive samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | WDKG-KnowledgePoints (Q.) | WDKG-KnowledgePoints (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 28.6 | 46.6 | 54.4 | 56.6 | 20.2 | 8.6 | 58.9 | 46.7 |
| Qwen-turbo | 50.2 | 64.6 | 49.9 | 68.6 | 42.1 | 62.7 | 72.2 | 76.8 |
| Moonshot-v1-128k | 46.5 | 66.0 | 56.7 | 92.4 | 38.1 | 30.4 | 80.5 | 68.5 |
| Claude-3-haiku-20240307 | 56.3 | 81.9 | 82.9 | 86.5 | 65.9 | 52.9 | 80.9 | 76.8 |
| Yi-34b-chat-0205 | 26.4 | 26.3 | 54.6 | 76.7 | 56.2 | 34.7 | 62.5 | 50.1 |
| Gemini-1.5-pro | 23.9 | 60.4 | 22.1 | 64.5 | 23.9 | 8.4 | 24.8 | 64.2 |
2) Accuracy (in %) on all negative samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | WDKG-KnowledgePoints (Q.) | WDKG-KnowledgePoints (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 96.8 | 86.8 | 80.6 | 58.8 | 82.3 | 86.7 | 88.8 | 90.9 |
| Qwen-turbo | 84.4 | 86.8 | 86.1 | 56.7 | 74.3 | 72.9 | 94.4 | 97.0 |
| Moonshot-v1-128k | 88.7 | 80.2 | 64.9 | 24.6 | 76.3 | 80.6 | 70.7 | 88.7 |
| Claude-3-haiku-20240307 | 84.5 | 54.1 | 45.0 | 50.7 | 52.1 | 66.9 | 54.7 | 72.3 |
| Yi-34b-chat-0205 | 94.6 | 96.5 | 86.9 | 38.7 | 74.1 | 80.9 | 78.7 | 88.3 |
| Gemini-1.5-pro | 98.1 | 88.6 | 88.3 | 76.7 | 66.1 | 88.6 | 93.0 | 76.4 |
3) Accuracy (in %) on all samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | WDKG-KnowledgePoints (Q.) | WDKG-KnowledgePoints (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 62.7 | 66.7 | 67.5 | 57.7 | 47.3 | 67.7 | 73.9 | 68.8 |
| Qwen-turbo | 75.3 | 83.7 | 68.0 | 62.7 | 58.2 | 67.8 | 79.3 | 68.9 |
| Moonshot-v1-128k | 67.6 | 73.1 | 60.8 | 58.5 | 57.2 | 55.5 | 75.6 | 78.6 |
| Claude-3-haiku-20240307 | 70.4 | 68.0 | 63.9 | 68.6 | 59.0 | 59.9 | 82.9 | 76.9 |
| Yi-34b-chat-0205 | 60.5 | 61.4 | 70.7 | 57.8 | 65.3 | 57.8 | 70.6 | 69.2 |
| Gemini-1.5-pro | 61.0 | 74.5 | 55.2 | 70.6 | 45.0 | 48.5 | 58.9 | 70.3 |
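The per-class tables above and the "all samples" table are tied together by a sample-count-weighted average; a minimal sketch (the class sizes below are hypothetical, as the benchmark's actual splits are not stated here):

```python
def overall_accuracy(acc_pos: float, n_pos: int, acc_neg: float, n_neg: int) -> float:
    """Overall accuracy as the class-size-weighted mean of per-class accuracies."""
    return (acc_pos * n_pos + acc_neg * n_neg) / (n_pos + n_neg)

# Hypothetical split of 400 positive and 600 negative pairs:
print(overall_accuracy(50.0, 400, 90.0, 600))  # 74.0
```

Only when the two classes are the same size does this reduce to the plain mean of the two per-class accuracies.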
Benchmarking on Conjunction-two Relationship Verification
The tables below report the accuracy (in %) of the investigated LLMs on the conjunction-two relationship verification task.
1) Accuracy (in %) on all positive samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 55.5 | 11.1 | 92.3 | 69.2 |
| Qwen-turbo | 67.6 | 44.0 | 62.0 | 92.0 |
| Moonshot-v1-128k | 67.0 | 67.0 | 85.0 | 92.0 |
| Claude-3-haiku-20240307 | 78.0 | 78.0 | 92.0 | 85.0 |
| Yi-34b-chat-0205 | 44.0 | 56.0 | 85.0 | 77.0 |
| Gemini-1.5-pro | 33.0 | 67.0 | 77.0 | 69.0 |
2) Accuracy (in %) on negative p1 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 89.0 | 78.0 | 54.0 | 54.0 |
| Qwen-turbo | 67.0 | 89.0 | 69.0 | 62.0 |
| Moonshot-v1-128k | 33.0 | 67.0 | 23.0 | 0.0 |
| Claude-3-haiku-20240307 | 56.0 | 44.0 | 15.0 | 46.0 |
| Yi-34b-chat-0205 | 78.0 | 78.0 | 31.0 | 54.0 |
| Gemini-1.5-pro | 89.0 | 89.0 | 77.0 | 77.0 |
3) Accuracy (in %) on negative p2 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 56.0 | 67.0 | 69.0 | 54.0 |
| Qwen-turbo | 44.0 | 56.0 | 46.0 | 23.0 |
| Moonshot-v1-128k | 44.0 | 44.0 | 8.0 | 8.0 |
| Claude-3-haiku-20240307 | 44.0 | 22.0 | 54.0 | 69.0 |
| Yi-34b-chat-0205 | 56.0 | 56.0 | 38.0 | 69.0 |
| Gemini-1.5-pro | 89.0 | 89.0 | 92.0 | 69.0 |
4) Accuracy (in %) on negative p12 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 100.0 | 89.0 | 54.0 | 92.0 |
| Qwen-turbo | 56.0 | 78.0 | 85.0 | 54.0 |
| Moonshot-v1-128k | 33.0 | 78.0 | 46.0 | 15.0 |
| Claude-3-haiku-20240307 | 67.0 | 78.0 | 77.0 | 92.0 |
| Yi-34b-chat-0205 | 67.0 | 89.0 | 46.0 | 62.0 |
| Gemini-1.5-pro | 89.0 | 89.0 | 77.0 | 69.0 |
5) Accuracy (in %) on all samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 68.58 | 44.55 | 75.65 | 67.93 |
| Qwen-turbo | 61.63 | 59.17 | 64.33 | 69.17 |
| Moonshot-v1-128k | 51.83 | 65.00 | 55.33 | 49.83 |
| Claude-3-haiku-20240307 | 66.83 | 63.00 | 70.33 | 77.00 |
| Yi-34b-chat-0205 | 55.50 | 65.17 | 61.67 | 69.33 |
| Gemini-1.5-pro | 61.00 | 78.00 | 79.50 | 70.33 |
Benchmarking on Conjunction-three Relationship Verification
The tables below report the accuracy (in %) of the investigated LLMs on the conjunction-three relationship verification task.
1) Accuracy (in %) on all positive samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 60.0 | 80.0 | 67.0 | 67.0 |
| Qwen-turbo | 80.0 | 60.0 | 56.0 | 100.0 |
| Moonshot-v1-128k | 60.0 | 80.0 | 100.0 | 100.0 |
| Claude-3-haiku-20240307 | 80.0 | 100.0 | 89.0 | 89.0 |
| Yi-34b-chat-0205 | 60.0 | 80.0 | 100.0 | 89.0 |
| Gemini-1.5-pro | 20.0 | 40.0 | 56.0 | 89.0 |
2) Accuracy (in %) on negative p1 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 60.0 | 60.0 | 67.0 | 56.0 |
| Qwen-turbo | 60.0 | 60.0 | 67.0 | 67.0 |
| Moonshot-v1-128k | 60.0 | 40.0 | 56.0 | 22.0 |
| Claude-3-haiku-20240307 | 80.0 | 80.0 | 67.0 | 56.0 |
| Yi-34b-chat-0205 | 80.0 | 80.0 | 67.0 | 56.0 |
| Gemini-1.5-pro | 80.0 | 80.0 | 56.0 | 56.0 |
3) Accuracy (in %) on negative p2 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 80.0 | 80.0 | 67.0 | 56.0 |
| Qwen-turbo | 60.0 | 60.0 | 89.0 | 44.0 |
| Moonshot-v1-128k | 60.0 | 80.0 | 56.0 | 0.0 |
| Claude-3-haiku-20240307 | 80.0 | 20.0 | 33.0 | 22.0 |
| Yi-34b-chat-0205 | 80.0 | 100.0 | 56.0 | 56.0 |
| Gemini-1.5-pro | 80.0 | 100.0 | 100.0 | 56.0 |
4) Accuracy (in %) on negative p3 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 60.0 | 40.0 | 78.0 | 33.0 |
| Qwen-turbo | 80.0 | 60.0 | 78.0 | 44.0 |
| Moonshot-v1-128k | 60.0 | 80.0 | 22.0 | 11.0 |
| Claude-3-haiku-20240307 | 80.0 | 80.0 | 56.0 | 56.0 |
| Yi-34b-chat-0205 | 100.0 | 80.0 | 67.0 | 67.0 |
| Gemini-1.5-pro | 60.0 | 60.0 | 44.0 | 56.0 |
5) Accuracy (in %) on negative p12 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 60.0 | 80.0 | 89.0 | 89.0 |
| Qwen-turbo | 100.0 | 80.0 | 89.0 | 67.0 |
| Moonshot-v1-128k | 100.0 | 80.0 | 67.0 | 44.0 |
| Claude-3-haiku-20240307 | 60.0 | 40.0 | 78.0 | 44.0 |
| Yi-34b-chat-0205 | 80.0 | 80.0 | 78.0 | 56.0 |
| Gemini-1.5-pro | 100.0 | 60.0 | 67.0 | 56.0 |
6) Accuracy (in %) on negative p13 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 40.0 | 60.0 | 78.0 | 89.0 |
| Qwen-turbo | 40.0 | 40.0 | 78.0 | 56.0 |
| Moonshot-v1-128k | 60.0 | 20.0 | 67.0 | 22.0 |
| Claude-3-haiku-20240307 | 40.0 | 0.0 | 33.0 | 11.0 |
| Yi-34b-chat-0205 | 80.0 | 80.0 | 56.0 | 67.0 |
| Gemini-1.5-pro | 60.0 | 80.0 | 67.0 | 56.0 |
7) Accuracy (in %) on negative p23 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 80.0 | 60.0 | 56.0 | 67.0 |
| Qwen-turbo | 40.0 | 80.0 | 67.0 | 22.0 |
| Moonshot-v1-128k | 100.0 | 80.0 | 67.0 | 11.0 |
| Claude-3-haiku-20240307 | 60.0 | 40.0 | 56.0 | 22.0 |
| Yi-34b-chat-0205 | 100.0 | 100.0 | 44.0 | 56.0 |
| Gemini-1.5-pro | 80.0 | 80.0 | 56.0 | 56.0 |
8) Accuracy (in %) on negative p123 samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 100.0 | 100.0 | 89.0 | 89.0 |
| Qwen-turbo | 80.0 | 100.0 | 100.0 | 100.0 |
| Moonshot-v1-128k | 80.0 | 40.0 | 89.0 | 67.0 |
| Claude-3-haiku-20240307 | 80.0 | 40.0 | 78.0 | 67.0 |
| Yi-34b-chat-0205 | 100.0 | 100.0 | 89.0 | 56.0 |
| Gemini-1.5-pro | 100.0 | 60.0 | 78.0 | 67.0 |
9) Accuracy (in %) on all samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) |
| --- | --- | --- | --- | --- |
| GPT-4 | 64.3 | 74.3 | 70.9 | 67.7 |
| Qwen-turbo | 72.9 | 64.3 | 68.6 | 78.6 |
| Moonshot-v1-128k | 67.1 | 70.0 | 80.3 | 62.6 |
| Claude-3-haiku-20240307 | 74.3 | 71.4 | 73.1 | 64.4 |
| Yi-34b-chat-0205 | 74.3 | 84.3 | 82.6 | 74.1 |
| Gemini-1.5-pro | 50.0 | 57.1 | 61.4 | 73.3 |
Benchmarking on Two-hop Reasoning
The tables below report the accuracy (in %) of the investigated LLMs on the two-hop reasoning task.
1) Accuracy (in %) on all positive samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 93.8 | 71.9 | 100.0 | 71.9 | 100.0 | 65.6 |
| Qwen-turbo | 90.6 | 56.3 | 90.6 | 71.9 | 93.8 | 71.9 |
| Moonshot-v1-128k | 100.0 | 59.4 | 100.0 | 68.8 | 93.8 | 59.4 |
| Claude-3-haiku-20240307 | 93.8 | 93.8 | 100.0 | 78.1 | 93.8 | 90.6 |
| Yi-34b-chat-0205 | 65.6 | 59.4 | 68.8 | 56.3 | 71.9 | 56.3 |
| Gemini-1.5-pro | 65.6 | 56.3 | 90.6 | 68.8 | 71.9 | 59.4 |
2) Accuracy (in %) on inverted-relationship samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 25.0 | 37.5 | 50.0 | 100.0 | 25.0 | 0.0 |
| Qwen-turbo | 12.5 | 50.0 | 75.0 | 62.5 | 0.0 | 37.5 |
| Moonshot-v1-128k | 12.5 | 25.0 | 0.0 | 25.0 | 25.0 | 75.0 |
| Claude-3-haiku-20240307 | 37.5 | 25.0 | 12.5 | 37.5 | 12.5 | 12.5 |
| Yi-34b-chat-0205 | 12.5 | 25.0 | 50.0 | 62.5 | 37.5 | 25.0 |
| Gemini-1.5-pro | 37.5 | 50.0 | 50.0 | 50.0 | 25.0 | 37.5 |
3) Accuracy (in %) on replace-terminal samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 0.0 | 50.0 | 0.0 | 37.5 | 0.0 | 75.0 |
| Qwen-turbo | 25.0 | 75.0 | 37.5 | 50.0 | 12.5 | 75.0 |
| Moonshot-v1-128k | 0.0 | 75.0 | 0.0 | 50.0 | 0.0 | 50.0 |
| Claude-3-haiku-20240307 | 37.5 | 25.0 | 0.0 | 37.5 | 0.0 | 37.5 |
| Yi-34b-chat-0205 | 62.5 | 62.5 | 25.0 | 75.0 | 0.0 | 50.0 |
| Gemini-1.5-pro | 37.5 | 62.5 | 0.0 | 37.5 | 12.5 | 50.0 |

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 0.0 | 50.0 | 0.0 | 50.0 | 0.0 | 25.0 |
| Qwen-turbo | 25.0 | 50.0 | 25.0 | 37.5 | 12.5 | 50.0 |
| Moonshot-v1-128k | 12.5 | 37.5 | 0.0 | 25.0 | 0.0 | 25.0 |
| Claude-3-haiku-20240307 | 12.5 | 50.0 | 0.0 | 25.0 | 0.0 | 0.0 |
| Yi-34b-chat-0205 | 50.0 | 50.0 | 0.0 | 75.0 | 37.5 | 37.5 |
| Gemini-1.5-pro | 25.0 | 37.5 | 12.5 | 25.0 | 12.5 | 25.0 |
5) Accuracy (in %) on path-disruption samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 0.0 | 25.0 | 0.0 | 50.0 | 0.0 | 37.5 |
| Qwen-turbo | 37.5 | 62.5 | 12.5 | 12.5 | 50.0 | 62.5 |
| Moonshot-v1-128k | 12.5 | 37.5 | 0.0 | 50.0 | 12.5 | 62.5 |
| Claude-3-haiku-20240307 | 25.0 | 25.0 | 0.0 | 12.5 | 25.0 | 0.0 |
| Yi-34b-chat-0205 | 25.0 | 62.5 | 12.5 | 75.0 | 37.5 | 62.5 |
| Gemini-1.5-pro | 12.5 | 50.0 | 12.5 | 50.0 | 25.0 | 0.0 |
6) Accuracy (in %) on all samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 50.0 | 56.3 | 56.3 | 65.6 | 53.1 | 50.0 |
| Qwen-turbo | 57.8 | 57.8 | 64.1 | 56.3 | 56.3 | 64.1 |
| Moonshot-v1-128k | 54.7 | 51.6 | 50.0 | 53.1 | 51.6 | 56.3 |
| Claude-3-haiku-20240307 | 60.9 | 62.5 | 51.6 | 53.1 | 51.6 | 51.6 |
| Yi-34b-chat-0205 | 51.6 | 54.7 | 45.3 | 64.1 | 50.0 | 50.0 |
| Gemini-1.5-pro | 46.9 | 53.1 | 54.7 | 54.7 | 45.3 | 43.8 |
Benchmarking on Three-hop Reasoning
The tables below report the accuracy (in %) of the investigated LLMs on the three-hop reasoning task.
1) Accuracy (in %) on all positive samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 96.0 | 56.0 | 100.0 | 92.0 | 100.0 | 52.0 |
| Qwen-turbo | 100.0 | 64.0 | 96.0 | 64.0 | 96.0 | 32.0 |
| Moonshot-v1-128k | 100.0 | 72.0 | 100.0 | 60.0 | 92.0 | 48.0 |
| Claude-3-haiku-20240307 | 88.0 | 84.0 | 100.0 | 84.0 | 100.0 | 88.0 |
| Yi-34b-chat-0205 | 84.0 | 84.0 | 88.0 | 28.0 | 96.0 | 68.0 |
| Gemini-1.5-pro | 84.0 | 72.0 | 88.0 | 60.0 | 88.0 | 56.0 |
2) Accuracy (in %) on all negative samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 0.0 | 60.0 | 0.0 | 16.0 | 0.0 | 44.0 |
| Qwen-turbo | 8.0 | 56.0 | 8.0 | 36.0 | 12.0 | 60.0 |
| Moonshot-v1-128k | 0.0 | 52.0 | 0.0 | 44.0 | 4.0 | 76.0 |
| Claude-3-haiku-20240307 | 0.0 | 12.0 | 0.0 | 20.0 | 16.0 | 24.0 |
| Yi-34b-chat-0205 | 8.0 | 56.0 | 4.0 | 76.0 | 4.0 | 52.0 |
| Gemini-1.5-pro | 32.0 | 80.0 | 0.0 | 44.0 | 28.0 | 76.0 |
3) Accuracy (in %) on all samples.

| LLM | DBE-KT22 (Q.) | DBE-KT22 (R.) | WDKG-Course (Q.) | WDKG-Course (R.) | Junyi-Prerequisites (Q.) | Junyi-Prerequisites (R.) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 48.0 | 58.0 | 50.0 | 54.0 | 50.0 | 48.0 |
| Qwen-turbo | 54.0 | 60.0 | 52.0 | 50.0 | 54.0 | 46.0 |
| Moonshot-v1-128k | 50.0 | 62.0 | 50.0 | 52.0 | 48.0 | 62.0 |
| Claude-3-haiku-20240307 | 44.0 | 48.0 | 50.0 | 52.0 | 58.0 | 56.0 |
| Yi-34b-chat-0205 | 46.0 | 70.0 | 46.0 | 52.0 | 50.0 | 60.0 |
| Gemini-1.5-pro | 58.0 | 76.0 | 44.0 | 52.0 | 58.0 | 66.0 |
Benchmarking on Relationship Extraction
The tables below report the accuracy, recall, precision, AUROC and AUPRC of the investigated LLMs on the relationship extraction task with node pairs.
1) Accuracy, precision and recall values (in %) of all investigated LLMs on relationship extraction tasks with binary outputs on node pairs from four datasets.

| LLM | DBE-KT22 Acc | DBE-KT22 Recall | DBE-KT22 Prec | WDKG-Course Acc | WDKG-Course Recall | WDKG-Course Prec | WDKG-KnowledgePoints Acc | WDKG-KnowledgePoints Recall | WDKG-KnowledgePoints Prec | Junyi-Prerequisites Acc | Junyi-Prerequisites Recall | Junyi-Prerequisites Prec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 68.0 | 50.0 | 87.5 | 52.0 | 28.6 | 66.7 | 36.0 | 0.0 | 0.0 | 68.0 | 60.0 | 81.8 |
| Qwen-turbo | 64.0 | 50.0 | 77.8 | 56.0 | 35.7 | 71.4 | 60.0 | 33.3 | 100.0 | 72.0 | 66.7 | 83.3 |
| Moonshot-v1-128k | 60.0 | 42.9 | 75.0 | 56.6 | 28.6 | 80.0 | 40.0 | 6.7 | 50.0 | 72.0 | 53.3 | 100.0 |
| Claude-3-haiku-20240307 | 84.0 | 78.6 | 91.7 | 80.0 | 78.6 | 84.6 | 56.0 | 46.7 | 70.0 | 72.0 | 60.0 | 90.0 |
| Yi-34b-chat-0205 | 60.0 | 35.7 | 83.3 | 64.0 | 50.0 | 77.8 | 60.0 | 40.0 | 85.7 | 56.0 | 33.3 | 83.3 |
| Gemini-1.5-pro | 52.0 | 28.6 | 66.7 | 52.0 | 21.4 | 75.0 | 36.0 | 6.7 | 33.3 | 56.0 | 26.7 | 100.0 |
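The accuracy, recall and precision columns above follow the standard confusion-matrix definitions; a self-contained sketch for binary edge predictions (illustrative, not the authors' evaluation code):

```python
def binary_metrics(preds, labels):
    """Accuracy, recall and precision (in %) from binary predictions vs. gold labels."""
    tp = sum(p and l for p, l in zip(preds, labels))          # predicted edge, edge exists
    fp = sum(p and not l for p, l in zip(preds, labels))      # predicted edge, no edge
    fn = sum(not p and l for p, l in zip(preds, labels))      # missed edge
    tn = sum(not p and not l for p, l in zip(preds, labels))  # correctly rejected
    acc = 100.0 * (tp + tn) / len(labels)
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    prec = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return acc, recall, prec

preds  = [True, True, False, False, True]
labels = [True, False, True, False, True]
acc, recall, prec = binary_metrics(preds, labels)
print(round(acc, 1), round(recall, 1), round(prec, 1))  # 60.0 66.7 66.7
```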
2) AUROC values (in %) of all investigated LLMs on relationship extraction tasks with binary and float outputs on node pairs from four datasets.

| LLM | DBE-KT22 (Binary) | DBE-KT22 (Float) | WDKG-Course (Binary) | WDKG-Course (Float) | WDKG-KnowledgePoints (Binary) | WDKG-KnowledgePoints (Float) | Junyi-Prerequisites (Binary) | Junyi-Prerequisites (Float) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 70.5 | 89.3 | 55.2 | 77.3 | 46.0 | 67.3 | 70.0 | 92.3 |
| Qwen-turbo | 65.9 | 64.3 | 58.8 | 71.0 | 66.7 | 63.6 | 73.3 | 78.5 |
| Moonshot-v1-128k | 62.3 | 67.2 | 59.7 | 36.7 | 48.3 | 45.6 | 76.7 | 75.0 |
| Claude-3-haiku-20240307 | 84.7 | 79.5 | 80.2 | 66.7 | 58.3 | 78.7 | 75.0 | 83.7 |
| Yi-34b-chat-0205 | 63.3 | 81.8 | 65.9 | 93.3 | 65.0 | 44.9 | 61.7 | 96.8 |
| Gemini-1.5-pro | 55.2 | 82.5 | 56.2 | 73.0 | 33.3 | 43.4 | 63.3 | 92.0 |
3) AUPRC values (in %) of all investigated LLMs on relationship extraction tasks with binary and float outputs on node pairs from four datasets.

| LLM | DBE-KT22 (Binary) | DBE-KT22 (Float) | WDKG-Course (Binary) | WDKG-Course (Float) | WDKG-KnowledgePoints (Binary) | WDKG-KnowledgePoints (Float) | Junyi-Prerequisites (Binary) | Junyi-Prerequisites (Float) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 71.8 | 86.0 | 59.1 | 73.9 | 60.0 | 81.2 | 73.1 | 90.8 |
| Qwen-turbo | 66.9 | 64.7 | 61.5 | 61.0 | 73.3 | 77.4 | 75.6 | 78.0 |
| Moonshot-v1-128k | 64.1 | 61.6 | 62.9 | 40.0 | 59.3 | 72.1 | 81.3 | 78.9 |
| Claude-3-haiku-20240307 | 84.0 | 65.9 | 78.5 | 58.1 | 64.7 | 87.5 | 78.0 | 79.6 |
| Yi-34b-chat-0205 | 65.8 | 74.4 | 66.9 | 86.7 | 70.3 | 64.8 | 67.8 | 94.1 |
| Gemini-1.5-pro | 59.1 | 76.9 | 60.1 | 63.2 | 58.2 | 64.1 | 70.7 | 93.4 |
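A float confidence score induces a ranking over candidate pairs, which is what AUROC measures; with hard binary outputs, many pairs tie and the ranking is coarser. A minimal pure-Python sketch using the pairwise-comparison formulation of AUROC (equivalent to the usual ROC-curve integral; ties count as half):

```python
def auroc(scores, labels):
    """Probability that a randomly chosen positive is scored above a random negative."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    # Each positive/negative pair contributes 1 for a win, 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
print(auroc([0.9, 0.6, 0.7, 0.2], labels))  # float confidences: 0.75
print(auroc([1, 0, 1, 0], labels))          # binary outputs, ties blur the ranking: 0.5
```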
The tables below report the AUROC and AUPRC of the investigated LLMs on the relationship extraction task with subgraphs from the DBE-KT22 dataset.
1) AUROC values (in %) with binary (B) and float (F) outputs on the Subgraph 5, 10 and 15 settings.

| LLM | Subgraph 5 (B) | Subgraph 5 (F) | Subgraph 10 (B) | Subgraph 10 (F) | Subgraph 15 (B) | Subgraph 15 (F) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 54.5 | 65.4 | 69.9 | 70.1 | 60.6 | 66.2 |
| Qwen-turbo | 67.0 | 77.3 | 60.8 | 71.4 | 56.8 | 65.8 |
| Moonshot-v1-128k | 51.0 | 70.4 | 61.0 | 65.4 | 56.1 | 60.0 |
| Claude-3-haiku-20240307 | 60.5 | 70.5 | 50.4 | 66.3 | 47.9 | 59.9 |
| Yi-34b-chat-0205 | 61.5 | 75.1 | 54.7 | 61.0 | 50.9 | 59.0 |
| Gemini-1.5-pro | 63.0 | 80.2 | 59.9 | 72.2 | 54.5 | 63.5 |
2) AUPRC values (in %) with binary (B) and float (F) outputs on the Subgraph 5, 10 and 15 settings.

| LLM | Subgraph 5 (B) | Subgraph 5 (F) | Subgraph 10 (B) | Subgraph 10 (F) | Subgraph 15 (B) | Subgraph 15 (F) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 61.2 | 72.3 | 63.7 | 69.2 | 58.8 | 63.4 |
| Qwen-turbo | 69.7 | 77.0 | 53.6 | 71.0 | 53.4 | 63.5 |
| Moonshot-v1-128k | 54.3 | 71.3 | 55.5 | 64.5 | 54.5 | 57.8 |
| Claude-3-haiku-20240307 | 59.7 | 70.6 | 47.7 | 61.4 | 52.8 | 56.3 |
| Yi-34b-chat-0205 | 61.5 | 79.6 | 49.4 | 61.1 | 51.6 | 57.5 |
| Gemini-1.5-pro | 63.0 | 79.7 | 54.1 | 63.0 | 53.8 | 58.1 |