Suppose your company has a web application that allows customers to enter comments, such as to provide feedback about their product experience. How do you derive knowledge from these comments? Perhaps, customers are complaining a lot about a particular product. Or, they are raving about your service. Enter semantic search – a new feature in SQL Server 2012.
Building upon full text search, semantic search allows you to search not only for words, but also for the meaning (semantics) of these works. Mark Tabladillo, Ph.D., gave us great presentation at our Atlanta BI January meeting. He demoed how semantic search can be used to find the most common phrases in a document and how to find similar documents from a given search criteria.
Besides external documents, you can apply semantic search to text data already stored in a database, such as a text-based Comments field. Again, this is possible because semantic search builds upon the full-text search capabilities of SQL Server. So, any column that supports full-text search can be enhanced with semantic search.
Going back to my scenario, here is what it takes to configure a text field for semantic search:
- When you use semantic search, SQL Server performs statistical analysis of the words in the column contents. This analysis requires base data that is provided as a SQL Server database. Because the database is not installed by default, as a perquisite of using semantic search, you need to run the semantic database installer. The installer can be found on the SQL Server setup disk in the following folders:
For x86, the setup is \x86\Setup\SemanticLanguageDatabase.msi
For x64, the setup is \x64\Setup\SemanticLanguageDatabase.msi
- The setup simply extracts the MDF and LDF files of the semantic database to a folder that you specify during the installation. Next, simply attach the semantics database to your SQL Server 2012 instance as you would with any other database.
- The next step is another step that you need to perform only once for each SQL Server instance. Register the semantics database using this command:
EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = ‘SemanticsDB’
4. Next, create a full-text index using Statistical_Semantics option which is new with SQL Server 2012.
CREATE FULLTEXT INDEX ON [Production].[ProductReview]
[Comments] LANGUAGE [English] Statistical_Semantics
KEY INDEX [PK_ProductReview_ProductReviewID] ON ([AW2008FullTextCatalog], FILEGROUP [PRIMARY])
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM)
This command creates a semantic-enabled full-text index on the Comments column in the Production.ProductReview table on the AW2008FullTextCatalog full text catalog. The AW2008FullTextCatalog text catalog is included with the AdventureWorks2012 database. The CHANGE_TRACKING = AUTO clause instructs SQL Server to auto-update the index, and STOPLIST = SYSTEM specifies that the default full-text system STOPLIST should be used to parse words.
5. Now that you’ve done the ground work, you can perform the semantic search. For example, the following query returns the most frequent words for each document:
SELECT TOP(5) KEYP_TBL.document_key, KEYP_TBL.keyphrase, KEYP_TBL.score
) AS KEYP_TBL
ORDER BY KEYP_TBL.score DESC;
Note that the document_key matches the ProductReviewID column (primary key) of the [Production].[ProductReview] table allowing you to match the semantic search results to the document. For example, “socks” and “trip” were found in the row with the primary key of 1 whose Comments column contains the following text:
“I can’t believe I’m singing the praises of a pair of socks, but I just came back from a grueling 3-day ride and these socks really helped make the trip a blast. They’re lightweight yet really cushioned my feet all day. The reinforced toe is nearly bullet-proof and I didn’t experience any problems with rubbing or blisters like I have with other brands. I know it sounds silly, but it’s always the little stuff (like comfortable feet) that makes or breaks a long trip. I won’t go on another trip without them!”
You can feed these results to a data mining model if you want to learn which phrases customers tend to use together similar to identifying what products customers tend to purchase together (market basket analysis). “Together” here means within a specific document.
What if you want to get the most popular phrases across all comments? I think the following query should help although there might be a better way. The query groups phrases and sorts them in a descending order by the average score.
SELECT TOP(5) KEYP_TBL.keyphrase, AVG(score) AS Score
) AS KEYP_TBL
GROUP BY KEYP_TBL.keyphrase
ORDER BY AVG(KEYP_TBL.score) DESC;