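rdf - Loading Wikidata latest-truthy.nt with tdb2.tdbloader fails with Code: 58/PROHIBITED_COMPONENT_PRESENT in USER
Problem description
Using Apache Jena Fuseki, I am trying to load the latest-truthy.nt dataset from Wikidata, but the import fails with the error below. I was inspired by a success story from BITPlan, who did manage to load it.
Error log: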
14:36:16 INFO loader :: Add: 198.500.000 latest-truthy.nt (Batch: 453.309 / Avg: 213.382)
14:36:17 ERROR riot :: [line: 198884173, col: 87] Bad IRI: <https://abertillerymuseum@btconnect.com> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
org.apache.jena.riot.RiotException: [line: 198884173, col: 87] Bad IRI: <https://abertillerymuseum@btconnect.com> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:109)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:323)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:298)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
at org.apache.jena.tdb2.loader.base.LoaderOps.inputFile(LoaderOps.java:107)
at org.apache.jena.tdb2.loader.base.LoaderBase.loadOne(LoaderBase.java:125)
at org.apache.jena.tdb2.loader.base.LoaderBase.lambda$load$0(LoaderBase.java:102)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at org.apache.jena.tdb2.loader.base.LoaderBase.load(LoaderBase.java:99)
at tdb2.tdbloader.lambda$execBulkLoad$4(tdbloader.java:196)
at org.apache.jena.atlas.lib.Timer.time(Timer.java:85)
at tdb2.tdbloader.execBulkLoad(tdbloader.java:194)
at tdb2.tdbloader.loadQuads(tdbloader.java:175)
at tdb2.tdbloader.exec(tdbloader.java:136)
at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at tdb2.tdbloader.main(tdbloader.java:64)
Import script:
@ECHO off
cd apache-jena-4.0.0
echo start import on %DATE% %TIME%
tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data" "F:\latest-truthy.nt" > tdb2-out.log 2> tdb2-err.log
echo finish import on %DATE% %TIME%
pause
File structure:
- C:/fuseki/
-- apache-jena-4.0.0/
-- apache-jena-fuseki-4.0.0/
-- data/
-- startfusekidb.bat
-- wikidata2fuseki.bat
- F:/
-- latest-truthy.nt
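Is this a problem with Fuseki? The .nt file is too large for me to open and fix by hand. Is there a flag that makes tdbloader skip validation for a given import?
I have also asked in Wikidata's IRC channel to see whether they can help.
Update: someone on IRC answered and told me that the dataset contains many errors ("errors in Wikidata"), so I now need a way to skip the lines with errors and continue loading. The help for the Fuseki TDB2 command does not show anything for that, though.
I also tried --help, which prints the following and suggests that no skip option exists?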
c:\fuseki\apache-jena-4.0.0\bin>tdb2_tdbloader -h
tdbloader --loader= [--desc DATASET | --loc DIR] FILE ...
Location
--loc=DIR Location (a directory)
--tdb= Assembler description file
--graph=IRI Act on a named graph
--loader= Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or 'light'
--syntax=LANG Syntax of data from stdin
Symbol definition
--set Set a configuration symbol to a value
--mem=FILE Execute on an in-memory TDB database (for testing)
--desc= Assembler description file
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
--strict Operate in strict SPARQL mode (no extensions of any kind)
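Solution
@NLxDoDge - thanks for pointing to my BITPlan success story. Wikidata nt dumps may indeed contain triples that the Jena 4.1 import does not accept - I ran into a similar problem today with the human settlements dump at https://wdumps.toolforge.org/dump/1607.
A triple such as: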
<http://www.wikidata.org/entity/Q992883> <http://www.wikidata.org/prop/direct/P856> <http://www.sonora.gob.mx/portal/Runscript.asp?p=ASP\\pg239.asp> .
stops the load with the error:
10:45:06 ERROR riot :: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
org.apache.jena.riot.RiotException: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
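The number after "line:" in the message is the line of the offending triple in the .nt file. As a minimal sketch, assuming the loader's stderr was redirected to a log file as in the scripts on this page, that number can be pulled out of the log with grep and sed. The function name errline is a hypothetical helper, not part of Jena:

```shell
# Hypothetical helper: print the input line number of the first ERROR entry in a loader log.
# usage: errline LOGFILE
errline() {
  local log="$1"
  # keep only log entries at ERROR level, then cut the number out of "[line: N, col: M]"
  grep ' ERROR ' "$log" \
    | sed -n 's/.*\[line: \([0-9]*\), col: [0-9]*].*/\1/p' \
    | head -n 1
}
```

Called as errline tdb2--err.log on a log containing the error above, it prints 6964090.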
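I simply edited the 1.2 GB wdump-1607.nt file with vim, where you can jump to the offending line with
:6964090
fix or delete the triple on that line, and then save the file with
:wq!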
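If the file is too large to open comfortably in an editor, the same step can be sketched with sed, which streams the file instead of loading it. This is only an illustration: the function name dropline and the .fixed suffix are assumptions, and it deletes the triple rather than repairing it:

```shell
# Hypothetical helper: show one line of a large N-Triples file and write a copy without it.
# usage: dropline FILE LINENUMBER   (writes FILE.fixed, leaves FILE untouched)
dropline() {
  local file="$1"
  local line="$2"
  # print the offending triple for inspection and stop reading right after it
  sed -n "${line}p;${line}q" "$file"
  # copy every line except the given one into FILE.fixed
  sed "${line}d" "$file" > "$file.fixed"
}
```

For the case above this would be called as dropline wdump-1607.nt 6964090. Writing to a separate file keeps the original dump intact in case the wrong line number was given.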
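Before attempting the full Wikidata import, which ends up needing more than 2 TB of SSD(!) disk space to work, you may want to try out your environment with this small 100 MB dump file.
Below are the scripts I used to import the dump and to start the Fuseki server.
You should get a result like this: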
tail -f tdb2--err.log
10:57:24 INFO loader :: Loader = LoaderPhased
10:57:24 INFO loader :: Start: wdump-1607.nt
10:57:27 INFO loader :: Add: 500,000 wdump-1607.nt (Batch: 170,706 / Avg: 170,706)
10:57:29 INFO loader :: Add: 1,000,000 wdump-1607.nt (Batch: 255,102 / Avg: 204,540)
10:57:31 INFO loader :: Add: 1,500,000 wdump-1607.nt (Batch: 229,885 / Avg: 212,344)
10:57:33 INFO loader :: Add: 2,000,000 wdump-1607.nt (Batch: 245,579 / Avg: 219,780)
10:57:36 INFO loader :: Add: 2,500,000 wdump-1607.nt (Batch: 185,804 / Avg: 212,026)
10:57:39 INFO loader :: Add: 3,000,000 wdump-1607.nt (Batch: 146,627 / Avg: 197,355)
10:57:43 INFO loader :: Add: 3,500,000 wdump-1607.nt (Batch: 140,567 / Avg: 186,587)
10:57:46 INFO loader :: Add: 4,000,000 wdump-1607.nt (Batch: 142,166 / Avg: 179,573)
10:57:50 INFO loader :: Add: 4,500,000 wdump-1607.nt (Batch: 134,444 / Avg: 173,116)
10:57:53 WARN riot :: [line: 4869426, col: 86] Bad IRI: <http://:goku.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
10:57:54 INFO loader :: Add: 5,000,000 wdump-1607.nt (Batch: 143,307 / Avg: 169,589)
10:57:54 INFO loader :: Elapsed: 29.48 seconds [2021/08/14 10:57:54 CEST]
10:57:57 INFO loader :: Add: 5,500,000 wdump-1607.nt (Batch: 139,314 / Avg: 166,303)
10:58:01 INFO loader :: Add: 6,000,000 wdump-1607.nt (Batch: 146,842 / Avg: 164,487)
10:58:04 INFO loader :: Add: 6,500,000 wdump-1607.nt (Batch: 143,678 / Avg: 162,674)
10:58:08 INFO loader :: Add: 7,000,000 wdump-1607.nt (Batch: 142,085 / Avg: 161,008)
10:58:11 INFO loader :: Add: 7,500,000 wdump-1607.nt (Batch: 144,300 / Avg: 159,775)
10:58:15 INFO loader :: Add: 8,000,000 wdump-1607.nt (Batch: 141,362 / Avg: 158,484)
10:58:18 INFO loader :: Add: 8,500,000 wdump-1607.nt (Batch: 141,083 / Avg: 157,343)
10:58:22 INFO loader :: Add: 9,000,000 wdump-1607.nt (Batch: 147,492 / Avg: 156,761)
10:58:23 INFO loader :: Finished: wdump-1607.nt: 9,179,041 tuples in 58.91s (Avg: 155,812)
10:58:32 INFO loader :: Finish - index SPO
10:58:32 INFO loader :: Start replay index SPO
10:58:32 INFO loader :: Index set: SPO => SPO->POS, SPO->OSP
10:58:32 INFO loader :: Add: 1,000,000 Index (Batch: 8,928,571 / Avg: 8,928,571)
10:58:34 INFO loader :: Add: 2,000,000 Index (Batch: 508,130 / Avg: 961,538)
10:58:36 INFO loader :: Add: 3,000,000 Index (Batch: 381,388 / Avg: 638,026)
10:58:39 INFO loader :: Add: 4,000,000 Index (Batch: 370,233 / Avg: 540,321)
10:58:42 INFO loader :: Add: 5,000,000 Index (Batch: 362,450 / Avg: 492,029)
10:58:45 INFO loader :: Add: 6,000,000 Index (Batch: 370,644 / Avg: 466,562)
10:58:47 INFO loader :: Add: 7,000,000 Index (Batch: 367,647 / Avg: 449,293)
10:58:50 INFO loader :: Add: 8,000,000 Index (Batch: 366,166 / Avg: 436,895)
10:58:53 INFO loader :: Add: 9,000,000 Index (Batch: 380,952 / Avg: 429,881)
10:58:54 INFO loader :: Index set: SPO => SPO->POS, SPO->OSP [9,174,324 items, 22.0 seconds]
10:58:58 INFO loader :: Finish - index OSP
10:58:59 INFO loader :: Finish - index POS
10:58:59 INFO loader :: Time = 94.428 seconds : Triples = 9,179,041 : Rate = 97,207 /s
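The log above contains one WARN entry for a bad IRI that did not stop the load. As a small sketch, assuming the log format shown here, the IRIs that riot complained about can be listed for later review. The function name badiris is a hypothetical helper:

```shell
# Hypothetical helper: list the IRIs that riot reported as "Bad IRI" in a loader log.
# usage: badiris LOGFILE
badiris() {
  local log="$1"
  # keep riot entries at WARN or ERROR level, then print the text between "Bad IRI: <" and ">"
  grep -E ' (WARN|ERROR) +riot ' "$log" \
    | sed -n 's/.*Bad IRI: <\([^>]*\)>.*/\1/p'
}
```

For the log above, badiris tdb2--err.log prints http://:goku.com.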
Script to run Fuseki
#!/bin/bash
# WF 2020-06-25
# WF 2021-08-14
# Jena Fuseki server installation
# see https://jena.apache.org/documentation/fuseki2/fuseki-run.html
version=4.1.0
fuseki=apache-jena-fuseki-$version
if [ ! -d $fuseki ]
then
  if [ ! -f $fuseki.tar.gz ]
  then
    wget http://archive.apache.org/dist/jena/binaries/$fuseki.tar.gz
  else
    echo $fuseki.tar.gz already downloaded
  fi
  echo "unpacking $fuseki.tar.gz"
  tar xvfz $fuseki.tar.gz
else
  echo $fuseki already downloaded and unpacked
fi
cd $fuseki
java -jar fuseki-server.jar --tdb2 --loc=../data /wdhs
Script to load the data
#!/bin/bash
# WF 2020-10-05
# WF 2021-08-14
# global settings
jena=apache-jena-4.1.0
tgz=$jena.tar.gz
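# note: downloads.apache.org only carries current releases; older versions
# such as 4.1.0 are served from archive.apache.org/dist/jena/binaries
mirror=https://downloads.apache.org/jena/binaries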
jenaurl=$mirror/$tgz
base=$(pwd)
#base=/hd/luxio/gnd
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader
getjena() {
  # download
  if [ ! -f $tgz ]
  then
    echo "downloading $tgz from $jenaurl"
    wget $jenaurl
  else
    echo "$tgz already downloaded"
  fi
  # unpack
  if [ ! -d $jena ]
  then
    echo "unpacking $jena from $tgz"
    tar xvzf $tgz
  else
    echo "$jena already unpacked"
  fi
  # create data directory
  if [ ! -d $data ]
  then
    echo "creating $data directory"
    mkdir -p $data
  else
    echo "$data directory already created"
  fi
}
#
# show the given timestamp
#
timestamp() {
  local msg="$1"
  local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "$msg at $ts"
}
#
# load data for the given data dir and input
#
loaddata() {
  local data="$1"
  local input="$2"
  timestamp "start loading $input to $data"
  # $phase is never set in this script, so the logs end up as
  # tdb2--out.log and tdb2--err.log
  $tdbloader --loc "$data" "$input" > tdb2-$phase-out.log 2> tdb2-$phase-err.log
  timestamp "finished loading $input to $data"
}
getjena
export TMPDIR=$base/tmp
if [ ! -d "$TMPDIR" ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir "$TMPDIR"
else
  echo "using temporary directory $TMPDIR"
fi
# load every N-Triples file in the current directory
for file in *.nt
do
  loaddata "$data" "$file"
done
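The question asked how to skip bad lines, since the loader there aborted on the first ERROR. One option is to filter the dump before running the loop above. The sketch below is a heuristic, not a Jena feature: the function name filternt and both patterns are assumptions derived only from the two errors quoted on this page (a user part before "@" in an http(s) IRI, and a backslash in an IRI that does not start a \u escape). It will miss other kinds of bad triples and may reject valid ones whose literals happen to contain similar text, so rejected lines go to a separate file for review:

```shell
# Hypothetical pre-filter: split an N-Triples file into kept and rejected lines.
# usage: filternt INPUT KEPT REJECTED
filternt() {
  local input="$1"
  local kept="$2"
  local rejected="$3"
  # alternative 1: http(s) IRI with an "@" before the first "/" (user part)
  # alternative 2: "<...\" where the backslash is not followed by u or U
  local pattern='<https?://[^/> ]*@|<[^>" ]*\\[^uU]'
  # grep exits with status 1 when nothing matches, which is not an error here
  grep -E "$pattern" "$input" > "$rejected" || true
  grep -E -v "$pattern" "$input" > "$kept" || true
}
```

A call such as filternt latest-truthy.nt latest-truthy.kept.nt latest-truthy.rejected.nt reads the input twice, which is slow for a full dump but keeps the sketch simple. Give the kept file a name that the *.nt loop will pick up only if the original dump has been moved out of the directory.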