dbpedia - Stuck loading large dataset with GraphDB
Problem description
When I load this DBpedia dataset (2015-10, en, ~1 billion triples) into GraphDB 9.1.1, the CPU load drops to 0% after around 13M triples and the process idles from then on. It does not terminate until I kill it manually.
The machine has enough disk space and considerably more RAM than the 512 GB assigned to java via the -Xmx option.
The file that I tried to load is provided here: https://hobbitdata.informatik.uni-leipzig.de/dbpedia_2015-10_en_wo-comments_c.nt.zst
It can be decompressed with:
zstd -d "dbpedia_2015-10_en_wo-comments_c.nt.zst" -o "dbpedia_2015-10_en_wo-comments_c.nt"
I use the following command to load the data:
java -Xmx512G -cp "$HOME/graphdb/graphdb-free-9.1.1/lib/*" -Dgraphdb.dist=$HOME/graphdb/graphdb-free-9.1.1 -Dgraphdb.home.data=$HOME/dbpedia2015/data/ -Djdk.xml.entityExpansionLimit=0 com.ontotext.graphdb.loadrdf.LoadRDF -f -m parallel -p -c $HOME/graphdb/graphdb-dbpedia2015.ttl $HOME/dbpedia_2015-10_en_wo-comments_c.nt
The configuration file $HOME/graphdb/graphdb-dbpedia2015.ttl looks like:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
    rep:repositoryID "dbpedia2015" ;
    rdfs:label "Repository for dataset dbpedia2015" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;
            # ruleset to use
            owlim:ruleset "rdfsplus-optimized" ;
            # disable the context index (my data does not use contexts)
            owlim:enable-context-index "false" ;
            # indexes to speed up read queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ] .
The log output is:
16:11:07.438 [main] INFO com.ontotext.graphdb.loadrdf.Params - MODE: parallel
16:11:07.439 [main] INFO com.ontotext.graphdb.loadrdf.Params - STOP ON FIRST ERROR: false
16:11:07.439 [main] INFO com.ontotext.graphdb.loadrdf.Params - PARTIAL LOAD: true
16:11:07.439 [main] INFO com.ontotext.graphdb.loadrdf.Params - CONFIG FILE: /home/me/graphdb-dbpedia2015.ttl
16:11:07.444 [main] INFO com.ontotext.graphdb.loadrdf.LoadRDF - Attaching to location: /home/me/graphdb/dbpedia2015/data
16:11:07.618 [main] INFO c.o.t.u.l.LimitedObjectCacheFactory - Using LRU cache type: synch
16:11:08.025 [main] WARN com.ontotext.plugin.literals-index - Rebuilding literals indexes. Starting from id:1
16:11:08.029 [main] WARN com.ontotext.plugin.literals-index - Complete in 0.004, num entries indexed:0
16:11:08.780 [main] INFO c.o.rio.parallel.ParallelLoader - Data will be parsed + resolved + loaded.
16:11:08.788 [main] INFO c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:09.984 [main] INFO com.ontotext.graphdb.loadrdf.LoadRDF - Loading file: dbpedia_2015-10_en_wo-comments_c.nt
16:11:09.991 [main] INFO c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:19.987 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 2,111,690 stmts. Rate: 211,147 st/s. Statements overall: 2,111,690. Global average rate: 211,000 st/s. Now: Tue Mar 10 16:11:19 UTC 2020. Total memory: 22144M, Free memory: 4890M, Max memory: 524288M.
16:11:30.515 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 3,955,363 stmts. Rate: 192,662 st/s. Statements overall: 3,955,363. Global average rate: 192,596 st/s. Now: Tue Mar 10 16:11:30 UTC 2020. Total memory: 66432M, Free memory: 53925M, Max memory: 524288M.
16:11:40.515 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 6,889,662 stmts. Rate: 225,661 st/s. Statements overall: 6,889,662. Global average rate: 225,609 st/s. Now: Tue Mar 10 16:11:40 UTC 2020. Total memory: 199296M, Free memory: 177241M, Max memory: 524288M.
16:11:51.185 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 9,124,978 stmts. Rate: 221,474 st/s. Statements overall: 9,124,978. Global average rate: 221,437 st/s. Now: Tue Mar 10 16:11:51 UTC 2020. Total memory: 199296M, Free memory: 185106M, Max memory: 524288M.
16:12:02.877 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 11,083,153 stmts. Rate: 209,539 st/s. Statements overall: 11,083,153. Global average rate: 209,511 st/s. Now: Tue Mar 10 16:12:02 UTC 2020. Total memory: 199296M, Free memory: 184331M, Max memory: 524288M.
16:12:15.800 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 13,166,352 stmts. Rate: 200,047 st/s. Statements overall: 13,166,352. Global average rate: 200,026 st/s. Now: Tue Mar 10 16:12:15 UTC 2020. Total memory: 329312M, Free memory: 313496M, Max memory: 524288M.
Any idea why it is stuck after around 13M triples?
Solution
First, assign a smaller Xmx to the process (around 38-42 GB should be enough). The database needs additional off-heap memory, so make sure not to allocate all of the RAM to the heap. If you still cannot load the dataset, send a jstack of the process, or, if you use the Oracle JDK, you can use Java Flight Recorder:
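As a sketch, the same LoadRDF invocation with a reduced heap might look like this (the 40 GB figure is illustrative, within the suggested 38-42 GB range; everything else is copied from the question):

```shell
java -Xmx40G -cp "$HOME/graphdb/graphdb-free-9.1.1/lib/*" \
  -Dgraphdb.dist=$HOME/graphdb/graphdb-free-9.1.1 \
  -Dgraphdb.home.data=$HOME/dbpedia2015/data/ \
  -Djdk.xml.entityExpansionLimit=0 \
  com.ontotext.graphdb.loadrdf.LoadRDF -f -m parallel -p \
  -c $HOME/graphdb/graphdb-dbpedia2015.ttl \
  $HOME/dbpedia_2015-10_en_wo-comments_c.nt
```

Leaving the remaining physical memory to the OS page cache and GraphDB's off-heap structures is the point of the reduction, not a limitation of the tool.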
jcmd <pid> VM.unlock_commercial_features
jcmd <pid> JFR.start duration=60s name=production filename=production.jfr settings=profile
Set the duration to a value that allows the execution to be traced. You can send the result to support@ontotext.com, as it will contain information about your environment.
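For the jstack option, a minimal capture sequence might look like the following (assuming the JDK tools are on the PATH; <pid> stands for the loader's process id, which jps reports):

```shell
jps -l                               # list running JVMs to find the LoadRDF process id
jstack <pid> > graphdb-threads.txt   # write a thread dump to a file you can attach
```

A thread dump taken while the loader sits at 0% CPU would show whether the threads are blocked, waiting, or deadlocked.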
Another option is to use the Preload tool - its purpose is loading large datasets - http://graphdb.ontotext.com/documentation/enterprise/loading-data-using-preload.html
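A hedged sketch of a Preload invocation, assuming the bin/preload script of the GraphDB 9.x distribution layout and reusing the repository config from the question (the exact flags and availability depend on your edition, so check the linked documentation first):

```shell
$HOME/graphdb/graphdb-free-9.1.1/bin/preload -f \
  -c $HOME/graphdb/graphdb-dbpedia2015.ttl \
  $HOME/dbpedia_2015-10_en_wo-comments_c.nt
```

Preload builds the repository images directly rather than going through the transactional insert path, which is why it is the recommended route for datasets in the billion-triple range.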