sql - IN vs. JOIN with large rowsets

Question


asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I want to select rows from a table where the primary key also appears in another table. I'm not sure whether I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries on a large dataset (i.e. millions of rows)?

SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)

SELECT a.*
FROM a JOIN b ON a.c = b.d


1 Answer

深蓝 · Answer 1 · 2021-10-17T00:13:21+0000

Update:

This article on my blog summarizes both my answer and my comments on other answers, and shows actual execution plans:

IN vs. JOIN vs. EXISTS

SELECT  *
FROM    a
WHERE   a.c IN (SELECT d FROM b)

SELECT  a.*
FROM    a
JOIN    b
ON      a.c = b.d

These queries are not equivalent. They can yield different results if the values of b.d are not unique (i.e. table b is not key-preserved): the JOIN returns a row of a once per matching row in b, while IN returns it at most once.
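The difference is easy to reproduce on a tiny dataset. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than SQL Server 2005, and the sample rows are made up. It shows the result semantics only and says nothing about SQL Server execution plans.

```python
import sqlite3

# Hypothetical sample data: the value 1 appears twice in b.d.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER PRIMARY KEY);
    CREATE TABLE b (d INTEGER);
    INSERT INTO a (c) VALUES (1), (2), (3);
    INSERT INTO b (d) VALUES (1), (1), (2);
""")

# IN: each row of a is returned at most once.
in_rows = conn.execute(
    "SELECT * FROM a WHERE a.c IN (SELECT d FROM b) ORDER BY a.c"
).fetchall()

# JOIN: a row of a is repeated for every matching row of b.
join_rows = conn.execute(
    "SELECT a.* FROM a JOIN b ON a.c = b.d ORDER BY a.c"
).fetchall()

print(in_rows)    # [(1,), (2,)]
print(join_rows)  # [(1,), (1,), (2,)]
```

With the duplicate in b.d, the JOIN returns the row with c = 1 twice, while IN returns it once.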

The equivalent of the first query is the following:

SELECT  a.*
FROM    a
JOIN    (
        SELECT  DISTINCT d
        FROM    b
        ) bo
ON      a.c = bo.d
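As a quick sanity check of that equivalence, the following sketch compares the IN query with the DISTINCT derived-table join on made-up data that contains a duplicate in b.d. It runs on Python's built-in sqlite3 module, not SQL Server, so it only checks that the two queries return the same rows.

```python
import sqlite3

# Hypothetical sample data: the value 1 appears twice in b.d.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER PRIMARY KEY);
    CREATE TABLE b (d INTEGER);
    INSERT INTO a (c) VALUES (1), (2), (3);
    INSERT INTO b (d) VALUES (1), (1), (2);
""")

in_query_rows = conn.execute(
    "SELECT * FROM a WHERE a.c IN (SELECT d FROM b) ORDER BY a.c"
).fetchall()

# Join against the de-duplicated values of b.d.
distinct_join_rows = conn.execute("""
    SELECT  a.*
    FROM    a
    JOIN    (
            SELECT  DISTINCT d
            FROM    b
            ) bo
    ON      a.c = bo.d
    ORDER BY a.c
""").fetchall()

print(in_query_rows == distinct_join_rows)  # True
```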

If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then the IN query and the plain JOIN return the same rows and will most probably use identical plans, since SQL Server is smart enough to take this into account.
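A small sketch of the unique case, again using Python's built-in sqlite3 module as a stand-in for SQL Server (the index name ux_b_d and the sample rows are hypothetical). Once the index is in place, duplicates cannot enter b.d, so the IN query and the plain JOIN return the same rows. Whether the plans match is specific to SQL Server and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER PRIMARY KEY);
    CREATE TABLE b (d INTEGER);
    CREATE UNIQUE INDEX ux_b_d ON b (d);  -- hypothetical index name
    INSERT INTO a (c) VALUES (1), (2), (3);
    INSERT INTO b (d) VALUES (1), (2);
""")

# The unique index rejects a second row with d = 1.
try:
    conn.execute("INSERT INTO b (d) VALUES (1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

unique_in_rows = conn.execute(
    "SELECT * FROM a WHERE a.c IN (SELECT d FROM b) ORDER BY a.c"
).fetchall()

unique_join_rows = conn.execute(
    "SELECT a.* FROM a JOIN b ON a.c = b.d ORDER BY a.c"
).fetchall()

print(duplicate_rejected)                  # True
print(unique_in_rows == unique_join_rows)  # True
```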

SQL Server can employ one of the following methods to run the IN query:

If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and a plain INNER JOIN is used (with b leading).
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and a LEFT SEMI JOIN is used. This method can also be used in the previous case.
If there are indexes on both b.d and a.c and both tables are large, then a MERGE SEMI JOIN is used.
If neither table has an index, then a hash table is built on b and a HASH SEMI JOIN is used.

None of these methods reevaluates the whole subquery for each row.

See this entry on my blog for more detail on how this works:

Counting missing rows: SQL Server

There are links for all of the big four RDBMSs.
