Loading Wikidata latest-truthy.nt with tdb2.tdbloader fails with Code: 58/PROHIBITED_COMPONENT_PRESENT in USER

Question

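Using Apache Jena Fuseki, I am trying to load the latest-truthy.nt dataset from Wikidata, but the import fails with the error below. I was inspired by the success story from BITPlan, who did manage to load it.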

Error log:

14:36:16 INFO  loader          :: Add: 198.500.000 latest-truthy.nt (Batch: 453.309 / Avg: 213.382)
14:36:17 ERROR riot            :: [line: 198884173, col: 87] Bad IRI: <https://abertillerymuseum@btconnect.com> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
org.apache.jena.riot.RiotException: [line: 198884173, col: 87] Bad IRI: <https://abertillerymuseum@btconnect.com> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
    at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
    at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
    at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
    at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
    at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
    at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:109)
    at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
    at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
    at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:323)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:298)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
    at org.apache.jena.tdb2.loader.base.LoaderOps.inputFile(LoaderOps.java:107)
    at org.apache.jena.tdb2.loader.base.LoaderBase.loadOne(LoaderBase.java:125)
    at org.apache.jena.tdb2.loader.base.LoaderBase.lambda$load$0(LoaderBase.java:102)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at org.apache.jena.tdb2.loader.base.LoaderBase.load(LoaderBase.java:99)
    at tdb2.tdbloader.lambda$execBulkLoad$4(tdbloader.java:196)
    at org.apache.jena.atlas.lib.Timer.time(Timer.java:85)
    at tdb2.tdbloader.execBulkLoad(tdbloader.java:194)
    at tdb2.tdbloader.loadQuads(tdbloader.java:175)
    at tdb2.tdbloader.exec(tdbloader.java:136)
    at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
    at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
    at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
    at tdb2.tdbloader.main(tdbloader.java:64)

Import script:

@ECHO off
cd apache-jena-4.0.0
echo start import on %DATE% %TIME%

tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data" "F:\latest-truthy.nt" > tdb2-out.log 2> tdb2-err.log

echo finish import on %DATE% %TIME%
pause

File structure:

- C:/fuseki/
-- apache-jena-4.0.0/
-- apache-jena-fuseki-4.0.0/
-- data/
-- startfusekidb.bat
-- wikidata2fuseki.bat

- F:/
-- latest-truthy.nt

Is this a Fuseki problem? I cannot open the .nt file myself to fix the issue. Is there a flag I can use so that tdbloader skips validation for a given import?

I have also asked this in Wikidata's IRC channel to see whether they can help me.

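Update: I got an answer from someone on IRC, who told me that the dataset contains many errors (errors in Wikidata). So I now know that I need a way to skip the lines that cause errors and continue loading, but the Fuseki TDB2 command does not show any option for that.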

I also tried --help, which prints the following and suggests that no skip option exists:

c:\fuseki\apache-jena-4.0.0\bin>tdb2_tdbloader -h
tdbloader--loader= [--desc DATASET | --loc DIR] FILE ...
  Location
      --loc=DIR              Location (a directory)
      --tdb=                 Assembler description file
      --graph=IRI            Act on a named graph
      --loader=              Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or 'light'
      --syntax=LANG          Syntax of data from stdin
  Symbol definition
      --set                  Set a configuration symbol to a value
      --mem=FILE             Execute on an in-memory TDB database (for testing)
      --desc=                Assembler description file
  General
      -v   --verbose         Verbose
      -q   --quiet           Run with minimal output
      --debug                Output information for debugging
      --help
      --version              Version information
      --strict               Operate in strict SPARQL mode (no extensions of any kind)

Tags: rdf, wikidata, fuseki, tdb, tdbloader

Answer


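@NLxDoDge - thanks for pointing to my BITPlan success story. Indeed, the Wikidata nt dumps may contain triples that a Jena 4.1 import rejects. I ran into a similar problem today with the human settlements dump at https://wdumps.toolforge.org/dump/1607.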

A triple such as:

<http://www.wikidata.org/entity/Q992883> <http://www.wikidata.org/prop/direct/P856> <http://www.sonora.gob.mx/portal/Runscript.asp?p=ASP\\pg239.asp> .

stops the whole load with the error:

10:45:06 ERROR riot            :: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
org.apache.jena.riot.RiotException: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
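Since the loader stops at the first such triple, it can save a few restarts to look for similar lines up front. The following is only a sketch: `find_bad_escapes` is a hypothetical helper name, and the regular expression is a heuristic that assumes the problem is a backslash inside `<...>` that does not start a `\u` or `\U` escape (the only escapes N-Triples allows in an IRI). It will also flag angle-bracketed text that happens to sit inside a string literal.

```shell
#!/bin/bash
# find_bad_escapes FILE   (hypothetical helper, not part of Jena)
# Prints "LINE:content" for every line that holds an IRI (<...>) with a
# backslash that is not followed by u or U.
# Heuristic only: angle-bracketed text inside a literal is matched as well.
find_bad_escapes() {
  grep -nE '<[^>]*\\[^uU][^>]*>' "$1"
}

# Example call (file name taken from this answer):
#   find_bad_escapes wdump-1607.nt | cut -d: -f1
```

The line numbers it prints use the same numbering as the `[line: ...]` part of the riot message.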

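I simply edited the 1.2 GB wdump-1607.nt file with vim, where you can jump straight to the offending line with

:6964090

then fix or delete that triple and save the file with

:wq!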
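If the file is too large to open in an editor, which was the situation described in the question, the same kind of fix can be sketched with sed. The helper names `show_line` and `drop_line` are made up for this example. `drop_line` writes a full copy of the file, so it needs as much free disk space again as the input.

```shell
#!/bin/bash
# show_line FILE N   (hypothetical helper)
# Prints line N and stops reading, so it stays fast on very large files.
show_line() {
  [[ $2 =~ ^[0-9]+$ ]] || return 2   # accept only a plain line number
  sed -n "${2}{p;q;}" "$1"
}

# drop_line FILE N OUT   (hypothetical helper)
# Copies FILE to OUT, leaving out line N.
drop_line() {
  [[ $2 =~ ^[0-9]+$ ]] || return 2
  sed "${2}d" "$1" > "$3"
}

# Example calls with the line number from the error message:
#   show_line wdump-1607.nt 6964090
#   drop_line wdump-1607.nt 6964090 wdump-1607-fixed.nt
```

Dropping the line loses that one triple; repairing it by hand keeps it, at the cost of opening the file.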

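Before attempting the full Wikidata import, which in the end needs more than 2 TB of SSD disk space to work, you may want to try out your environment with this small 100 MB dump file.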
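If you only have the big file at hand, another way to get a small trial input is to cut off its first lines. N-Triples holds one triple per line, so any prefix of the file is itself a loadable file. This is a sketch rather than what I did for the run shown here; `make_sample` is a hypothetical helper name and the line count is arbitrary.

```shell
#!/bin/bash
# make_sample FILE N OUT   (hypothetical helper)
# Writes the first N lines (triples) of FILE to OUT.
make_sample() {
  head -n "$2" "$1" > "$3"
}

# Example call (file name from the question, count chosen arbitrarily):
#   make_sample latest-truthy.nt 5000000 sample.nt
```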

Below you will find the scripts I used to import the dump and to start the Fuseki server.

You should get a result similar to:

tail -f tdb2--err.log 
10:57:24 INFO  loader          :: Loader = LoaderPhased
10:57:24 INFO  loader          :: Start: wdump-1607.nt
10:57:27 INFO  loader          :: Add: 500,000 wdump-1607.nt (Batch: 170,706 / Avg: 170,706)
10:57:29 INFO  loader          :: Add: 1,000,000 wdump-1607.nt (Batch: 255,102 / Avg: 204,540)
10:57:31 INFO  loader          :: Add: 1,500,000 wdump-1607.nt (Batch: 229,885 / Avg: 212,344)
10:57:33 INFO  loader          :: Add: 2,000,000 wdump-1607.nt (Batch: 245,579 / Avg: 219,780)
10:57:36 INFO  loader          :: Add: 2,500,000 wdump-1607.nt (Batch: 185,804 / Avg: 212,026)
10:57:39 INFO  loader          :: Add: 3,000,000 wdump-1607.nt (Batch: 146,627 / Avg: 197,355)
10:57:43 INFO  loader          :: Add: 3,500,000 wdump-1607.nt (Batch: 140,567 / Avg: 186,587)
10:57:46 INFO  loader          :: Add: 4,000,000 wdump-1607.nt (Batch: 142,166 / Avg: 179,573)
10:57:50 INFO  loader          :: Add: 4,500,000 wdump-1607.nt (Batch: 134,444 / Avg: 173,116)
10:57:53 WARN  riot            :: [line: 4869426, col: 86] Bad IRI: <http://:goku.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
10:57:54 INFO  loader          :: Add: 5,000,000 wdump-1607.nt (Batch: 143,307 / Avg: 169,589)
10:57:54 INFO  loader          ::   Elapsed: 29.48 seconds [2021/08/14 10:57:54 CEST]
10:57:57 INFO  loader          :: Add: 5,500,000 wdump-1607.nt (Batch: 139,314 / Avg: 166,303)
10:58:01 INFO  loader          :: Add: 6,000,000 wdump-1607.nt (Batch: 146,842 / Avg: 164,487)
10:58:04 INFO  loader          :: Add: 6,500,000 wdump-1607.nt (Batch: 143,678 / Avg: 162,674)
10:58:08 INFO  loader          :: Add: 7,000,000 wdump-1607.nt (Batch: 142,085 / Avg: 161,008)
10:58:11 INFO  loader          :: Add: 7,500,000 wdump-1607.nt (Batch: 144,300 / Avg: 159,775)
10:58:15 INFO  loader          :: Add: 8,000,000 wdump-1607.nt (Batch: 141,362 / Avg: 158,484)
10:58:18 INFO  loader          :: Add: 8,500,000 wdump-1607.nt (Batch: 141,083 / Avg: 157,343)
10:58:22 INFO  loader          :: Add: 9,000,000 wdump-1607.nt (Batch: 147,492 / Avg: 156,761)
10:58:23 INFO  loader          :: Finished: wdump-1607.nt: 9,179,041 tuples in 58.91s (Avg: 155,812)
10:58:32 INFO  loader          :: Finish - index SPO
10:58:32 INFO  loader          :: Start replay index SPO
10:58:32 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP
10:58:32 INFO  loader          :: Add: 1,000,000 Index (Batch: 8,928,571 / Avg: 8,928,571)
10:58:34 INFO  loader          :: Add: 2,000,000 Index (Batch: 508,130 / Avg: 961,538)
10:58:36 INFO  loader          :: Add: 3,000,000 Index (Batch: 381,388 / Avg: 638,026)
10:58:39 INFO  loader          :: Add: 4,000,000 Index (Batch: 370,233 / Avg: 540,321)
10:58:42 INFO  loader          :: Add: 5,000,000 Index (Batch: 362,450 / Avg: 492,029)
10:58:45 INFO  loader          :: Add: 6,000,000 Index (Batch: 370,644 / Avg: 466,562)
10:58:47 INFO  loader          :: Add: 7,000,000 Index (Batch: 367,647 / Avg: 449,293)
10:58:50 INFO  loader          :: Add: 8,000,000 Index (Batch: 366,166 / Avg: 436,895)
10:58:53 INFO  loader          :: Add: 9,000,000 Index (Batch: 380,952 / Avg: 429,881)
10:58:54 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP [9,174,324 items, 22.0 seconds]
10:58:58 INFO  loader          :: Finish - index OSP
10:58:59 INFO  loader          :: Finish - index POS
10:58:59 INFO  loader          :: Time = 94.428 seconds : Triples = 9,179,041 : Rate = 97,207 /s
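In a long log like this the few riot messages are easy to miss. A small sketch for pulling out just the affected input line numbers follows. `riot_problem_lines` is a hypothetical helper, and the pattern assumes the log layout shown in this post (time, level, logger name, `::`, then `[line: N, col: M]`).

```shell
#!/bin/bash
# riot_problem_lines LOG   (hypothetical helper)
# Prints "LEVEL LINE" for every riot WARN or ERROR entry in a loader log.
# Stack trace lines are skipped because they do not start with a time stamp.
riot_problem_lines() {
  sed -nE 's/^[0-9:]+ +(WARN|ERROR) +riot +:: \[line: ([0-9]+), col: [0-9]+\].*/\1 \2/p' "$1"
}

# Example call (log name as produced by the load script in this answer):
#   riot_problem_lines tdb2--err.log
```

In the run above a WARN entry did not stop the load, while the ERROR entries shown earlier did.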

Script to run Fuseki

#!/bin/bash
# WF 2020-06-25
# WF 2021-08-14
# Jena Fuseki server installation
# see https://jena.apache.org/documentation/fuseki2/fuseki-run.html
version=4.1.0
fuseki=apache-jena-fuseki-$version
if [ ! -d $fuseki ]
then
  if [ ! -f $fuseki.tar.gz ]
  then
    wget http://archive.apache.org/dist/jena/binaries/$fuseki.tar.gz
  else
    echo $fuseki.tar.gz already downloaded
  fi
  echo "unpacking $fuseki.tar.gz"
  tar xvfz $fuseki.tar.gz
else
  echo $fuseki already downloaded and unpacked
fi
cd $fuseki
java -jar fuseki-server.jar --tdb2 --loc=../data /wdhs
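Once the server is up, a quick sanity check is to count the triples over HTTP. The sketch below assumes Fuseki is listening on its default port 3030 on localhost and that the dataset is published as /wdhs, as in the script above. The helper names `fuseki_endpoint` and `count_triples` are made up for this example.

```shell
#!/bin/bash
# fuseki_endpoint [HOST] [PORT] [DATASET]   (hypothetical helper)
# Builds the SPARQL query URL. The defaults are assumptions: localhost,
# Fuseki's default port 3030, and the dataset name wdhs from the script above.
fuseki_endpoint() {
  local host="${1:-localhost}" port="${2:-3030}" dataset="${3:-wdhs}"
  echo "http://$host:$port/$dataset/sparql"
}

# count_triples [HOST] [PORT] [DATASET]   (hypothetical helper)
# Sends a COUNT query as a form POST and prints the JSON result.
count_triples() {
  curl -s "$(fuseki_endpoint "$@")" \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'
}

# Example call against the running server:
#   count_triples
```

For the small dump the count should be close to the number of triples the loader reported. On a full Wikidata store a query like this can take a long time.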

Script to load the data

#!/bin/bash
# WF 2020-10-05
# WF 2021-08-14

# global settings
jena=apache-jena-4.1.0
tgz=$jena.tar.gz
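# downloads.apache.org only carries current releases; older versions
# such as 4.1.0 are kept on archive.apache.org
mirror=https://archive.apache.org/dist/jena/binaries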
jenaurl=$mirror/$tgz
base=$(pwd)
#base=/hd/luxio/gnd
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader

getjena() {
# download
if [ ! -f $tgz ]
then
  echo "downloading $tgz from $jenaurl"
  wget $jenaurl
else
  echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
  echo "unpacking $jena from $tgz"
  tar xvzf $tgz
else
  echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
  echo "creating $data directory"
  mkdir -p $data
else
  echo "$data directory already created"
fi
}

#
# show the given timestamp
#
timestamp() {
 local msg="$1"
 local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 echo "$msg at $ts"
}

#
# load data for the given data dir and input
#
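loaddata() {
  local data="$1"
  local input="$2"
  timestamp "start loading $input to $data"
  # note: $phase is never set in this script, so the logs end up as
  # tdb2--out.log and tdb2--err.log and are overwritten for each input file
  $tdbloader --loc "$data" "$input" > tdb2-$phase-out.log 2> tdb2-$phase-err.log
  timestamp "finished loading $input to $data"
}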

getjena
export TMPDIR=$base/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
# quote the arguments so paths with spaces survive
for file in *.nt
do
  loaddata "$data" "$file"
done
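As far as I can see from the help output in the question, tdbloader has no option to skip bad lines, so one workaround is to filter the input before the loop above runs. The sketch below drops only the two kinds of triples that were fatal in this thread: an http(s) IRI with a user part before the host, and an IRI with a backslash that does not start a `\u` or `\U` escape. `filter_known_bad` is a hypothetical helper and the patterns are heuristics, not a real IRI validator. Other bad triples may still turn up, and the rejected lines are kept in a separate file so nothing is lost silently.

```shell
#!/bin/bash
# filter_known_bad IN OUT REJECTS   (hypothetical helper)
# Splits IN into OUT (kept lines) and REJECTS (dropped lines).
# Dropped are lines with:
#   - an http(s) IRI that has "user@" before the host
#   - an IRI containing a backslash not followed by u or U
filter_known_bad() {
  local pattern='<https?://[^/?#>@]*@[^>]*>|<[^>]*\\[^uU][^>]*>'
  grep -E  "$pattern" "$1" > "$3" || true   # grep exits 1 when nothing matches
  grep -vE "$pattern" "$1" > "$2" || true
}

# Example call before loading (file names are placeholders):
#   filter_known_bad latest-truthy.nt latest-truthy-clean.nt rejected.nt
```

Two full passes over a file the size of the complete truthy dump take a while and need the disk space for a second copy, so it is worth trying this on a small sample first.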
