Removing stop words in neo4j using a text file

Question

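I have successfully loaded a CSV file into neo4j, and I want to remove the stop words from the dataset. I keep my stop words as a separate list in a text file. I found sample code that filters stop words, but I want to replace its hard-coded list with my own. How should I proceed? Can we load 2 datasets (kbv5.txt and stopwords.txt) in one query?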

I want to include the stop word list file in my code:

LOAD CSV FROM "file:///kbv5.txt"  as row fieldterminator "."
with row
unwind row as text
with reduce(t=tolower(text), delim in ["","",",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized
with [w in split(normalized," ") | trim(w)] as words
unwind range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
ON CREATE SET w1.count = 1
ON MATCH SET w1.count = w1.count + 1
MERGE (w2:Word {name:words[idx+1]})
ON CREATE SET w2.count = 1
ON MATCH SET w2.count = w2.count + (case when idx = size(words)-2 then 1 else 0 end)
MERGE (w1)-[r:NEXT]->(w2)
 ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1

Sample code that uses stop words:

with "Great device, but the calls drop too frequently." as text
with replace(replace(tolower(text),".",""),",","") as normalized
with [w in split(normalized," ") | trim(w)] as words
with [w in words WHERE NOT w IN ["the","an","on"]] as words
UNWIND range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
MERGE (w2:Word {name:words[idx+1]})
MERGE (w1)-[:NEXT]->(w2)

Thanks in advance.

Tags: neo4j, cypher, graph-databases

Answer


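This code demonstrates how to remove stop words from text. Try it; it does not write anything to your database. In your own query, you can apply the same filter near the top, right after the import.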

WITH split('some of these words are unnecessary', ' ') AS text,
     split('are but of in the these', ' ') AS stopwords
// List comprehension; the older FILTER() function was removed in Neo4j 4.0.
RETURN [word IN text WHERE NOT word IN stopwords] AS filtered
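
On the question of loading both files in one query: a single Cypher statement can contain more than one LOAD CSV clause, so one possible approach is to read stopwords.txt first, collect it into a list, and carry that list through the rest of the import. The following is a rough sketch of that idea, not a tested import. It assumes stopwords.txt holds one stop word per line (the question does not say how the file is laid out), and it uses a shortened punctuation list compared with the one in the question.

```cypher
// Assumption: stopwords.txt contains one stop word per line.
LOAD CSV FROM "file:///stopwords.txt" AS line
WITH collect(toLower(trim(line[0]))) AS stopwords

// Second file in the same query; the stop word list is carried along.
LOAD CSV FROM "file:///kbv5.txt" AS row FIELDTERMINATOR "."
UNWIND row AS text
WITH stopwords, text
WHERE text IS NOT NULL   // skip empty fields

// Lowercase the sentence and strip punctuation (shortened list).
WITH stopwords,
     reduce(t = toLower(text),
            delim IN [",", "!", "?", '"', ":", ";", "'", "-"] |
            replace(t, delim, "")) AS normalized

// Split into words, dropping blanks and anything in the stop word list.
WITH [w IN split(normalized, " ")
      WHERE trim(w) <> "" AND NOT trim(w) IN stopwords | trim(w)] AS words

// Link each remaining word to the one that follows it.
UNWIND range(0, size(words) - 2) AS idx
MERGE (w1:Word {name: words[idx]})
MERGE (w2:Word {name: words[idx + 1]})
MERGE (w1)-[r:NEXT]->(w2)
  ON CREATE SET r.count = 1
  ON MATCH SET r.count = r.count + 1
```

The per-word count bookkeeping from the question is left out here to keep the sketch short; it can be added back to the MERGE clauses unchanged, since only the construction of words differs.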
