How full text search works within InstantKB



Full-text search supports almost 50 diverse languages, such as English, Spanish, Chinese, Japanese, Arabic, Bengali, and Hindi. Each of the columns contained in the full-text index is associated with a Microsoft Windows locale identifier (LCID) that equates to a language that is supported by full-text search. For example, LCID 1033 equates to U.S English, and LCID 2057 equates to British English. For each supported full-text language, SQL Server provides linguistic components that support indexing and querying full-text data that is stored in that language.

Language-specific components include the following:

  • Word breakers and stemmers.
  • Stoplists.
  • Thesaurus files.
  • Filters (iFilters).

Search Example

 To explain how search works with full text enabled let's say we search for "Skype for Business" within InstantKB.

The SQL server linguistic components (word breakers, stemmers, stoplists, thesaurus & filters) will be applied against your supplied search terms as detailed below.

Word Breakers

Word breakers perform linguistic analysis finding word boundaries (word-breaking) and conjugating verbs (stemming).

Word breakers and stemmers are language specific, and the rules for linguistic analysis differ for different languages. For a given language, a word breaker identifies individual words by determining where word boundaries exist based on the lexical rules of the language. Each word (also known as a token) is inserted into the full-text index.In our example the SQL Server word breaker would likely  break up our original search term into three separate terms - "Skype", "for" and "Business".

If you want to avoid SQL server breaking the search query into several separate terms you can create a custom dictionary to control the behaviour of the SQL Server word breakers. Further information on creating a custom dictionary can be found here 

https://docs.microsoft.com/en-us/sql/relational-databases/search/customize-the-behavior-of-word-breakers-with-a-custom-dictionary

https://technet.microsoft.com/library/cc263242.aspx

Within SQL Server you can get a list of registered word breakers like so...

EXEC sp_help_fulltext_system_components 'wordbreaker';  
GO

To get all English word breakers we would use the English LCID like so...

EXEC sp_help_fulltext_system_components @component_type = 'wordbreaker', @param = 1033;  
GO

Stemmer

The stemmer generates inflectional forms of our broken search term based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").

In our example the SQL Server stemmer generation would likely add additional search terms to our original query. For exampe "Businesses" is a count noun of Business so this would like be added. This would now leave us with 4 separate terms "Skype", "for", "Business" & "Businesses".

Stop List

Next each of our original & inflected search terms will be checked against the internal SQL Server stop list. A system stop list is provided by SQL Server that contains a basic set of stop words (also known as noise words). A stop word is a word that does not help the search and is ignored by full-text queries. For example, for the English locale words such as "a", "and", "is", and "the" are considered stopwords. Typically, you will need to configure one or more thesaurus files and stoplists. For more information, see https://technet.microsoft.com/en-us/library/ms142551(v=sql.110).aspx

To view a list of English stop words you can execute the following query within SQL Server

SELECT * FROM sys.fulltext_system_stopwords WHERE language_id=1033;

You can get a list of supported languages like so...

SELECT * FROM sys.fulltext_languages

To identify the language_id for your existing full text indexes you can use the following query

SELECT * FROM sys.fulltext_index_columns;

Further information can be found at https://msdn.microsoft.com/en-us/library/ms142509(v=sql.100).aspx

In our original search example the SQL Server stop list component will likely remove the term "for" from our original search terms. This will now leave us with 3 separate terms to query "Skype", "Business" & "Businesses".

Thesaurus

SQL Server also installs a thesaurus file for each full-text language, as well as a global thesaurus file. The installed thesaurus files are essentially empty, but you can edit them to define synonyms for a specific language or business scenario. A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in the same language, for example shut is a synonym of close.

By developing a thesaurus tailored to your full-text data, you can effectively broaden the scope of full-text queries on that data. For more information, see https://technet.microsoft.com/en-us/library/ms142491(v=sql.110).aspx
 
Providing your thesaurus definitions don't find any associated synonym's we will utilmately be left with 3 terms from our original query. So "Skype for Business" would be converted into "Skype", "Business", & "Businesses". These inflected terms not your original search query will ultimately be used to determine the relevancy ranking of each returned result.

Full-Text Search Indexing Process

As part of processing, the gathered text data is passed through a word breaker to separate the text into individual tokens, or keywords. The language used for tokenization is specified at the column level, or can be identified within varbinary(max), image, or xml data by the filter component.

When indexing data stored in a varbinary(max) or image column, the filter, which implements the IFilter interface, extracts text based on the specified file format for that data (for example, Microsoft Word). In some cases, the filter components require the varbinary(max), or image data to be written out to the filterdata folder, instead of being pushed into memory.

Additional processing may be performed to remove stopwords, and to normalize tokens before they are stored in the full-text index or an index fragment.

When a population has completed, a final merge process is triggered that merges the index fragments together into one master full-text index. This results in improved query performance since only the master index needs to be queried rather than a number of index fragments, and better scoring statistics may be used for relevance ranking.

Full-Text Querying Process

The query processor passes the full-text portions of a query to the Full-Text Engine for processing. The Full-Text Engine performs word breaking and, optionally, thesaurus expansions, stemming, and stopword (noise-word) processing.

TSQL v's Full Text Search

In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.

How InsttantKB Search Ranking Works

The ranking algorithm used within the SQL Server full text search is quite complex and probably best explained by linking to a Microsoft article. You can learn how rank is calculated for each article here http://technet.microsoft.com/en-us/library/ms142524(v=sql.105).aspx

The algorithm is very heavily weighted to give more occurrences of a text match a better rank (bumping them up higher in your results).

This algorithm is completely internal and can't be modified.

Another article that details how ranking words within SQL Server can be found below. This article references SQL Server 2005 but is still applicaable within newer versions of SQL Server.

http://technet.microsoft.com/en-us/library/ms345119(v=sql.90).aspx

Improving relevancy within InstantKB

InstantKB employs a few techniques to help you control the articles that appear for specific search terms.

Probably the best way to increase the relevancy of an article is tto simply tag the article with the search terms you feel users will use to locate tthe article.If you tag an article with a specific term, InstantKB will weight any tags more heavily than matches found within the article title or article text.

Taking our original example if we add the tags "Skype", "Business", "Businesses" to an article we can ensure this article will appear very high within the relevancy ranking.

General Hints & Tips

Full-text queries are not case-sensitive. For example, searching for "Aluminium" or "aluminium" returns the same results.

Each version of SQL Server typically includes new word breakers that have better linguistic rules and are more accurate than earlier word breakers. Potentially, the new word breakers might behave slightly differently from the word breakers in full-text indexes that were imported from previous versions of SQL Server.

This is significant if a full-text catalog was imported when a database was upgraded to the current version of SQL Server. One or more languages used by the full-text indexes in the full-text catalog might now be associated with new word breakers.

All columns listed in a single full-text query function clause must use the same language.