首页 > 解决方案 > 为什么 SQL Server 全文搜索索引 SCR 或 SUR 首字母缩写词后跟一个数字?

问题描述

我发现 SQL Server 全文搜索的一个非常奇怪的行为是索引 SUR、SCR 和可能的其他一些首字母缩略词,以及它后面的数字 - 作为“精确匹配”。

SELECT * FROM sys.dm_fts_parser ('"SUR 12345"', 1033, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
苏尔 1 2 3 4 5 1 0 1 完全符合 苏尔 12345 0 苏尔 12345
nn 1 2 3 4 5 苏尔 1 0 1 完全符合 nn12345sur 0 苏尔 12345
SELECT * FROM sys.dm_fts_parser ('"SCR 12345"', 1033, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
SCR 1 2 3 4 5 1 0 1 完全符合 SCR 12345 0 可控硅 12345
nn 1 2 3 4 5 屏幕 1 0 1 完全符合 nn12345scr 0 可控硅 12345

其他首字母缩写词或文本,包括小写 sur,不受影响:

SELECT * FROM sys.dm_fts_parser ('"sur 12345"', 1033, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
苏尔 1 0 1 完全符合 苏尔 0 苏尔 12345
1 2 3 4 5 1 0 2 完全符合 12345 0 苏尔 12345
nn 1 2 3 4 5 1 0 2 完全符合 nn12345 0 苏尔 12345
SELECT * FROM sys.dm_fts_parser ('"ABC 12345"', 1033, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
美国广播公司 1 0 1 完全符合 美国广播公司 0 ABC 12345
1 2 3 4 5 1 0 2 完全符合 12345 0 ABC 12345
nn 1 2 3 4 5 1 0 2 完全符合 nn12345 0 ABC 12345
SELECT * FROM sys.dm_fts_parser ('"XYZ 76"', 1033, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
xyz 1 0 1 完全符合 xyz 0 XYZ 76
7 6 1 0 2 完全符合 76 0 XYZ 76
7 6 1 0 2 完全符合 nn76 0 XYZ 76

这种行为似乎出乎意料,很可能是错误的,但我也可能遗漏了一些与断字有关的明显内容(尝试过 1033 和 2057 - 效果相同)。我在 SQL Server 2019 Linux 15.0.4053.23 和 2017 CU20 和 CU25 上复制了它,我可以立即访问它们。

有没有人有类似的问题和解决方案,以便 SUR、SCR 和任何其他可能损坏的首字母缩写词将独立于以下数字进行索引?

编辑:

将语言更改为 0(中性)会导致奇怪的行为 - 当使用 SUR 首字母缩写词时,它不能解决问题,但会修复 SCR 首字母缩写词!

SELECT * FROM sys.dm_fts_parser ('"SUR 12345"', 0, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
苏尔 1 2 3 4 5 1 0 1 完全符合 苏尔 12345 0 苏尔 12345
nn 1 2 3 4 5 苏尔 1 0 1 完全符合 nn12345sur 0 苏尔 12345
SELECT * FROM sys.dm_fts_parser ('"SCR 12345"', 0, 0, 0)
关键词 group_id 短语 ID 发生 特殊术语 显示术语 扩展类型 source_term
scr 1 0 1 完全符合 scr 0 可控硅 12345
1 2 3 4 5 1 0 2 完全符合 12345 0 可控硅 12345
nn 1 2 3 4 5 1 0 2 完全符合 nn12345 0 可控硅 12345

我决定给这个问题一个赏金,因为理想情况下我需要通过重新配置数据库索引来解决找不到搜索词的问题。

为了帮助重现下面的问题,是一个创建数据库的脚本(带有注释掉的 DROP 脚本以帮助重置状态)

/*
DROP FULLTEXT INDEX ON EnglishTexts
DROP FULLTEXT INDEX ON NeutralTexts
DROP FULLTEXT CATALOG TestSearchCatalog
USE master
DROP DATABASE TestSearch
*/

CREATE DATABASE TestSearch
GO

USE [TestSearch]
GO

CREATE FULLTEXT CATALOG TestSearchCatalog WITH ACCENT_SENSITIVITY = OFF
GO

CREATE TABLE EnglishTexts (Id INT IDENTITY(1,1) NOT NULL, Text NVARCHAR(MAX), CONSTRAINT PK_EnglishTexts PRIMARY KEY CLUSTERED (Id))
CREATE FULLTEXT INDEX ON EnglishTexts (Text LANGUAGE 'English') KEY INDEX PK_EnglishTexts ON ([TestSearchCatalog]) WITH (CHANGE_TRACKING = AUTO, STOPLIST = OFF)
INSERT INTO EnglishTexts(Text) VALUES ('PRFX 12233')
INSERT INTO EnglishTexts(Text) VALUES ('SUR 12233')
INSERT INTO EnglishTexts(Text) VALUES ('SCR 12233')

CREATE TABLE NeutralTexts (Id INT IDENTITY(1,1) NOT NULL, Text NVARCHAR(MAX), CONSTRAINT PK_NeutralTexts PRIMARY KEY CLUSTERED (Id))
CREATE FULLTEXT INDEX ON NeutralTexts (Text LANGUAGE 'Neutral') KEY INDEX PK_NeutralTexts ON ([TestSearchCatalog]) WITH (CHANGE_TRACKING = AUTO, STOPLIST = OFF)
INSERT INTO NeutralTexts(Text) VALUES ('PRFX 12233')
INSERT INTO NeutralTexts(Text) VALUES ('SUR 12233')
INSERT INTO NeutralTexts(Text) VALUES ('SCR 12233')

-- following query returns 1 row but should 3 - a possible bug in english word breaker
SELECT * FROM EnglishTexts WHERE CONTAINS(Text, '"12233"')

-- following query returns 2 rows but should 3 - neutral language word breaker is also treating SUR acronym specially - another bug?
SELECT * FROM NeutralTexts WHERE CONTAINS(Text, '"12233"')

-- following query returns 1 row but should 3 - forcing neutral language on a query on english index should apply neutral language (i might misunderstand if this is even possible without a neutral index)
SELECT * FROM EnglishTexts WHERE CONTAINS(Text, '"12233"', LANGUAGE 0)

-- following query returns 2 rows but should 3 - using neutral language on neutral language indexed table should not make a difference
SELECT * FROM NeutralTexts WHERE CONTAINS(Text, '"12233"', LANGUAGE 0)

-- for reference - English word breaker does not split SCR with 12233 and SUR with 12233, causing above problems
SELECT * FROM sys.dm_fts_parser ('"SCR 12233 SUR 12233"', 1033, 0, 0)

-- for reference - Neutral word breaker correctly splits SCR and 12233 but not SUR with 12233
SELECT * FROM sys.dm_fts_parser ('"SCR 12233 SUR 12233"', 0, 0, 0)

标签: sqlsql-serverparsingindexingfull-text-search

解决方案


我检查你的情况很有趣。经过一些工作和研发,我发现有些东西可能对你有帮助。

问题是 1 或 2 个字符是停用词,因此当用户搜索包含它的短语时,全文索引搜索引擎会跳过它。解决方案:

1) Drop all full text indexes and full text catalog

2)Create a stop list and set in the stoplist Action=Delete All Stopwords and Full-Text Language=English

3)Create all full text indexes and full text catalog through scripts as given below instead of wizard

4)Attach stoplist to the full text indexes when creating them


/*******************Drop and create FULL TEXT CATALOG for AbstractSearch*************************************/

/****** Drop:  FullTextCatalog [abstractSearch]    Script Date: 12/02/2011 13:10:21 ******/

GO

IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Abstract]'))

ALTER FULLTEXT INDEX ON [dbo].[Abstract] DISABLE

GO

/****** Object:  FullTextIndex     Script Date: 12/02/2011 13:10:21 ******/

IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Abstract]'))

DROP FULLTEXT INDEX ON [dbo].[Abstract]

GO

IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Author]'))

ALTER FULLTEXT INDEX ON [dbo].[Author] DISABLE

GO

/****** Object:  FullTextIndex     Script Date: 12/02/2011 13:10:21 ******/

IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Author]'))
DROP FULLTEXT INDEX ON [dbo].[Author]

GO

IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Synonym]'))
ALTER FULLTEXT INDEX ON [dbo].[Synonym] DISABLE

GO

/****** Object:  FullTextIndex     Script Date: 12/02/2011 13:10:21 ******/

IF  EXISTS (SELECT * FROM sys.fulltext_indexes fti WHERE fti.object_id = OBJECT_ID(N'[dbo].[Synonym]'))
DROP FULLTEXT INDEX ON [dbo].[Synonym]
GO

IF  EXISTS (SELECT * FROM sysfulltextcatalogs ftc WHERE ftc.name = N'abstractSearch')

DROP FULLTEXT CATALOG [abstractSearch]
GO


/****** Create:  FullTextCatalog [abstractSearch]    Script Date: 12/02/2011 13:10:21 ******/

CREATE FULLTEXT CATALOG [abstractSearch]

AS DEFAULT

GO

/****** Create:  FullTextIndex on  Abstract with stoplist set to custom stoplist   Script Date: 12/02/2011 13:10:21 ******/

CREATE FULLTEXT INDEX ON dbo.Abstract

(abstractTitle, abstractDescription)

KEY INDEX PK_Abstract
ON [abstractSearch]

WITH STOPLIST = [AbstractSearchStopList]


/******  Create:  FullTextIndex on  Synonym with stoplist set to custom stoplist    Script Date: 12/02/2011 13:10:21 ******/
CREATE FULLTEXT INDEX ON dbo.Synonym

(synonyms,keywordSynonym)

KEY INDEX PK_Synonyms

ON [abstractSearch]

WITH STOPLIST = [AbstractSearchStopList]


/******  Create:  FullTextIndex on  Author with stoplist set to custom stoplist   Script Date: 12/02/2011 13:10:21 ******/

CREATE FULLTEXT INDEX ON dbo.Author

(firstName,lastName,middleName)

KEY INDEX PK_Author

ON [abstractSearch]
WITH STOPLIST = [AbstractSearchStopList]

推荐阅读