Why does a SPARQL query take 30 times longer in an R package than in the Land Registry console?

Problem description

When I run the following SPARQL query in the Land Registry console, it takes about 0.4 seconds and returns all 2599 results:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix ukhpi: <http://landregistry.data.gov.uk/def/ukhpi/>

SELECT
  ?stripped_regionName ?stripped_date ?ukhpi ?avprice ?volume ?newbuildvolume ?regionName ?regionId ?region

WHERE
{
  VALUES ?regionId {
    <http://landregistry.data.gov.uk/id/region/southampton>
    <http://landregistry.data.gov.uk/id/region/london>
    <http://landregistry.data.gov.uk/id/region/england>
    <http://landregistry.data.gov.uk/id/region/wales>
    <http://landregistry.data.gov.uk/id/region/scotland>
    <http://landregistry.data.gov.uk/id/region/barking>
    <http://landregistry.data.gov.uk/id/region/southwark>
    <http://landregistry.data.gov.uk/id/region/westminster>
    <http://landregistry.data.gov.uk/id/region/merton>
    <http://landregistry.data.gov.uk/id/region/greenwich>
    <http://landregistry.data.gov.uk/id/region/camden>
  }

  ?region ukhpi:refRegion ?regionId .
  ?region ukhpi:refMonth ?date .
  ?region ukhpi:housePriceIndex ?ukhpi .
  ?region ukhpi:averagePrice ?avprice .
  ?region ukhpi:salesVolume ?volume .
  ?region ukhpi:salesVolumeNewBuild ?newbuildvolume .

  ?regionId rdfs:label ?regionName
  FILTER (langMatches( lang(?regionName), "EN") ) .
  BIND (STR(?regionName) AS ?stripped_regionName) .
  BIND (STR(?date) AS ?stripped_date) .
}

When I run the same query with R's SPARQL package, it takes about 15.0 seconds to return all 2599 results:

  library(SPARQL)  # provides SPARQL()

  endpoint <- "https://landregistry.data.gov.uk/landregistry/query"

  # Same query as above, passed to the endpoint as a single string
  query <- '
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix ukhpi: <http://landregistry.data.gov.uk/def/ukhpi/>

    SELECT
      ?stripped_regionName ?stripped_date ?ukhpi ?avprice ?volume ?newbuildvolume ?regionName ?regionId ?region

    WHERE
    {
      VALUES ?regionId {
        <http://landregistry.data.gov.uk/id/region/southampton>
        <http://landregistry.data.gov.uk/id/region/london>
        <http://landregistry.data.gov.uk/id/region/england>
        <http://landregistry.data.gov.uk/id/region/wales>
        <http://landregistry.data.gov.uk/id/region/scotland>
        <http://landregistry.data.gov.uk/id/region/barking>
        <http://landregistry.data.gov.uk/id/region/southwark>
        <http://landregistry.data.gov.uk/id/region/westminster>
        <http://landregistry.data.gov.uk/id/region/merton>
        <http://landregistry.data.gov.uk/id/region/greenwich>
        <http://landregistry.data.gov.uk/id/region/camden>
      }

      ?region ukhpi:refRegion ?regionId .
      ?region ukhpi:refMonth ?date .
      ?region ukhpi:housePriceIndex ?ukhpi .
      ?region ukhpi:averagePrice ?avprice .
      ?region ukhpi:salesVolume ?volume .
      ?region ukhpi:salesVolumeNewBuild ?newbuildvolume .

      ?regionId rdfs:label ?regionName
      FILTER (langMatches( lang(?regionName), "EN") ) .
      BIND (STR(?regionName) AS ?stripped_regionName) .
      BIND (STR(?date) AS ?stripped_date) .
    }'

  # Run the query; the result set is returned as a data frame in $results
  qd <- SPARQL(endpoint, query)
  hpi_df <- qd$results

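Is there a way to speed up the query when running it through R, or is the delay unavoidable? I am hoping there is a fix, but I imagine the difference may arise because the Land Registry console is always connected, whereas my R query has to connect to the server first.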

Tags: r, sparql

Solution


Just to answer this question: as far as I can see, it does indeed seem to be the R package that is slow.

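My workaround was to download all the data and preload it, then create a function that only queries for and appends new data each time someone loads the app.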
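
A minimal sketch of that preload-and-append idea, not the code actually used in the app. The names `update_hpi_cache`, `cache_path` and `fetch_since` are hypothetical. It assumes the cached data frame has a `stripped_date` column holding strings that sort chronologically (such as year-month values), and that `fetch_since` is a user-supplied function wrapping the slow `SPARQL()` call, returning all rows when given `NULL` and otherwise rows from the given month onwards.

```r
# Hypothetical helper: keep a local copy of the results on disk and
# only ask the endpoint for rows newer than what is already cached.
update_hpi_cache <- function(cache_path, fetch_since) {
  cached <- if (file.exists(cache_path)) readRDS(cache_path) else NULL

  if (is.null(cached) || nrow(cached) == 0) {
    # No usable cache yet: do the slow full download once.
    combined <- fetch_since(NULL)
  } else {
    # Assumption: date strings compare in chronological order.
    latest <- max(cached$stripped_date)
    fresh  <- fetch_since(latest)
    # Keep only rows strictly newer than the cache to avoid duplicates.
    fresh  <- fresh[fresh$stripped_date > latest, , drop = FALSE]
    combined <- rbind(cached, fresh)
  }

  saveRDS(combined, cache_path)
  combined
}
```

On the first load this pays the full query cost once. Later loads read the cached file and transfer only the new months, so the per-load delay depends on how much data has been published since the last run rather than on the whole history.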

Filtering the data returned by the query down to the minimum fields required also helps.
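
One way to trim the result set is to build the query from only the measures that are needed. The sketch below is an illustration under assumptions: `measures` and `build_hpi_query` are hypothetical names, the predicates and variable names are copied from the query in the question, and the label lookup and `BIND` columns are dropped on the assumption that the app can work from `?regionId` and `?date` alone.

```r
# Measure variables and their predicates, as used in the question's query.
measures <- c(
  ukhpi          = "ukhpi:housePriceIndex",
  avprice        = "ukhpi:averagePrice",
  volume         = "ukhpi:salesVolume",
  newbuildvolume = "ukhpi:salesVolumeNewBuild"
)

# Hypothetical helper: build a query that selects only the wanted measures.
build_hpi_query <- function(region_slugs, wanted = c("ukhpi", "avprice")) {
  stopifnot(all(wanted %in% names(measures)))

  # Turn short region names into the full region IRIs.
  iris <- paste0("<http://landregistry.data.gov.uk/id/region/",
                 region_slugs, ">", collapse = " ")

  # One triple pattern per requested measure.
  patterns <- paste0("  ?region ", measures[wanted], " ?", wanted, " .",
                     collapse = "\n")

  paste(
    "prefix ukhpi: <http://landregistry.data.gov.uk/def/ukhpi/>",
    paste("SELECT ?regionId ?date", paste0("?", wanted, collapse = " ")),
    "WHERE {",
    paste0("  VALUES ?regionId { ", iris, " }"),
    "  ?region ukhpi:refRegion ?regionId .",
    "  ?region ukhpi:refMonth ?date .",
    patterns,
    "}",
    sep = "\n"
  )
}
```

For example, `build_hpi_query(c("london", "camden"), wanted = "avprice")` produces a query returning three columns instead of nine, which can then be passed to `SPARQL(endpoint, query)` as in the question.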

