Taking the Most Out of Linked Servers

What’s the first thing that comes to mind when you hear “linked server”? Performance issues or a snail image, right? Many developers don’t like linked servers because they are “slow” although they rarely bother to investigate the underling issue, much less to fix it.

Linked servers are a very useful and commonly used feature of SQL Server. They allow SQL Server to query external data sources, such as Oracle and DB2, and to present these data sources as SQL Servers to upstream systems. A query that involves a linked server is called a distributed query. The most common issue with a distributed query is that it doesn’t “remote”. In other words, while you expect the entire query (or parts of it) to be executed on the remote server, SQL Server downloads the data locally and then performs filters or joins. This could be very expensive with large tables,  hence the DQ notoriety. Conor Cunningham (Partner Software Architect, SQL Server Engine at Microsoft) covered this topic very well in his “Distributed Query Deep Dive” presentation a while back. I just want to add a few tips and tricks harvested from a recent project.

ETL developers typically implement incremental extraction to reduce data movement between systems. It’s a best practice. For example, when loading sales transactions, the query might ask for data since the last date the data was captured in the data warehouse. Here is what the distributed query might look like (note that 4-part reference notation):

select <columns> from [<linkedserver>].[<database>].[<schema>].[<table>] WHERE InvoiceDate>=‘2018-10-20’

Unfortunately, as simple this query is, it won’t remote as you can witness by the performance degradation and the query plan (the screenshot below uses Live Statistics):

Starting from right to left, we can see that the remote query fetches all rows from the target table before it applies the filter. In fact, unless you tweak the linked server properties, this query won’t remote if it filters on any type of column, not just date. However, you might be able to get a query with filters on other data types (except date) to remote if you set Collation Compatible to True on the linked server properties in SSMS, assuming of course that SQL Server and remote server use the same collation.

EXEC master.dbo.sp_serveroption @server=N'<linkedserver’, @optname=N’collation compatible’, @optvalue=N’true’

What’s so special about dates that prevents SQL Server from remoting them? Historically, date remoting was not done because of potential issues like different time zones and clock skews between the two servers. Unfortunately, SQL Server doesn’t give the developer the latitude to ignore such issues and force a distributed query to remote (I’ve tried many permutations of passing the date parameter with no luck). You can check the OLE DB Provider properties by enabling the OLEDB Provider Information event in the SQL Server Profiler to see if the provider implements SQLPROP_DATELITERALS. In my case, both the Oracle OLE DB Provider and Microsoft OLEDB Provider for DB2 didn’t implement this feature and I discovered this by the following property in the event results:

<DateLiteralsSupport>0</DateLiteralsSupport>

So, the only way to remote the query was to convert it from a distributed query to a pass-through the query either using OPENQUERY or EXEC AT.

select <column list> FROM OPENQUERY(<linkedserver>‘SELECT * from <database>.<schema>.<table> where INVDT >= ”2018-10-20”’)

EXEC (‘SELECT TOP 10 * FROM <database>.<schema>.<table> where INVDT >=?‘, ‘2018-10-20’) AT <linked server>

The advantage of the EXEC AT method is that the query can be more than 8,000 characters and you don’t have to concatenate parameters. This method requires that you first set the RPC and ‘RPC Out” properties of the linked server to True.

exec sp_serveroption @server='<linkedserver>’, @optname=’rpc’, @optvalue=’true’

exec sp_serveroption @server='<linkedserver>’, @optname=’rpc out’, @optvalue=’true’

This blog discussed some reasons why SQL Server might not remote even simple distributed queries. I recommended Microsoft make linked servers more flexible to support such common scenarios. It will be interesting to see how the forthcoming SQL Server 2019 external tables that use PolyBase will play out. They are expected to be more optimized for throughput than for OLTP operations. So, for OLAP queries we may get better performance in comparison with linked servers.