Searching in CLOB for words in a list/table

Problem description

I have a large table (100,000+ rows) with a CLOB column, in which I need to search for specific words within a certain timeframe.

select id, clob_field,
       dbms_lob.instr(clob_field, '.doc', 1, 1)  as doc,   --ideally want .doc
       dbms_lob.instr(clob_field, '.docx', 1, 1) as docx,  --ideally want .docx
       dbms_lob.instr(clob_field, '.DOC', 1, 1)  as DOC,   --ideally want .DOC
       dbms_lob.instr(clob_field, '.DOCX', 1, 1) as DOCX   --ideally want .DOCX
  from clob_table, search_words s
 where (date_entered  -- compare the DATE column directly, not its to_char() string
       between to_date('01-SEP-2018', 'DD-MON-YYYY') and to_date('30-SEP-2018', 'DD-MON-YYYY'))
   AND (contains(clob_field, s.words) > 0);

The set of words is '.doc', '.DOC', '.docx', and '.DOCX'. When I use CONTAINS() it seems to ignore the dot, so it returns lots of rows, but not the ones with document extensions in them. It finds emails with .doc as part of the address, where the doc has a period on either side of it.

e.g. mail.doc.george@here.com

I don't want those occurrences. I have tried it with a space at the end of the word and it ignores the spaces. I have put these in a search table I created, as shown above, and it still ignores the spaces. Any suggestions?

Thanks!!

Tags: oracle, select, contains, clob

Solution


Here are two suggestions.

The simple, inefficient way is to use something other than CONTAINS. Context indexes are notoriously tricky to get right. So instead of the last line, you could do:

AND regexp_instr(clob_field, '\.docx', 1,1,0,'i') > 0
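If you also want to reject cases like mail.doc.george@here.com, one possible variation is to require that the extension is followed by something that is neither a letter, a digit, nor another period (or by the end of the text). This is only a sketch: the inline `samples` rows are made-up test data rather than your table, and the pattern is my guess at what counts as "a file extension" for you. It would, for example, skip a file name that is immediately followed by a sentence-ending period.

```sql
-- Hypothetical sample rows standing in for clob_table
WITH samples (id, clob_field) AS (
  SELECT 1, TO_CLOB('see attached report.docx for details') FROM dual UNION ALL
  SELECT 2, TO_CLOB('contact mail.doc.george@here.com')     FROM dual UNION ALL
  SELECT 3, TO_CLOB('old file: NOTES.DOC')                  FROM dual
)
SELECT id
  FROM samples
 -- \.docx?          : a literal dot, then "doc", then an optional "x"
 -- ([^[:alnum:].]|$): next char is not a letter, digit or dot, or we are at the end
 -- 'i'              : case-insensitive, so .DOC and .DOCX are covered too
 WHERE REGEXP_LIKE(clob_field, '\.docx?([^[:alnum:].]|$)', 'i');
```

The intent is that rows 1 and 3 are kept and the email address in row 2 is not. Like the one-liner above, it scans every CLOB, so the same caveat about speed applies.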

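I think this should work, but it might be slow. That is when you would use an index. But Oracle Text indexes are more complicated than normal indexes. This old doc explains that punctuation characters (as defined in the index parameters) are not indexed, because the point of Oracle Text is to index words. If you want special characters to be indexed as part of a word, you need to add them to the printjoins character set. The doc explains how, but I'm pasting it here. You need to drop your existing CONTEXT index and recreate it with this preference: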

begin
ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
ctx_ddl.set_attribute('mylex', 'printjoins', '._-'); -- periods, underscores, dashes can be parts of words
end;
/

CREATE INDEX myindex on clob_table(clob_field) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('LEXER mylex');

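Keep in mind that CONTEXT indexes are case-insensitive by default; I think that's what you want, but FYI you can change it by setting the "mixed_case" attribute to "Y" on the lexer, right below where you set the printjoins attribute above.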
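For reference, a case-sensitive version of the lexer preference might look like the sketch below. I'm assuming the preference is still named mylex and that it gets dropped first if it already exists; I've written the value as 'YES', which is the spelling I remember from the Oracle Text reference, so check it against your version.

```sql
begin
  -- create_preference raises an error if 'mylex' already exists,
  -- so drop the old one first if needed: ctx_ddl.drop_preference('mylex');
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mylex', 'printjoins', '._-');  -- same as above
  ctx_ddl.set_attribute('mylex', 'mixed_case', 'YES');  -- index tokens case-sensitively
end;
/
```

The index has to be dropped and recreated with `parameters ('LEXER mylex')` for the change to take effect. With mixed_case on, you would have to search for the .doc and .DOC spellings separately.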

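Also, it looks like you're trying to search for words that end in .docx, but CONTAINS is not INSTR: by default it matches whole words, not strings. You probably want to modify your query to do AND contains(clob_field, '%.docx')>0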
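Putting it together, the whole query could look roughly like this. It's a sketch that assumes the index has already been recreated with the printjoins lexer above, and it hard-codes the two extensions instead of joining to your search_words table.

```sql
SELECT id, clob_field
  FROM clob_table
 WHERE date_entered BETWEEN TO_DATE('01-SEP-2018', 'DD-MON-YYYY')
                        AND TO_DATE('30-SEP-2018', 'DD-MON-YYYY')
   -- % is the Oracle Text wildcard; OR combines the two terms in one CONTAINS call
   AND CONTAINS(clob_field, '%.doc OR %.docx') > 0;
```

Since the index is case-insensitive by default, this should cover .DOC and .DOCX as well. Leading wildcards can be expensive on a large index, so it's worth testing against the regexp version.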

