Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I'm analyzing the concept of anti semi-join in the SQL Server,

The Microsoft documentation that describes the keyword ANTISEMIJOIN of the U-SQL syntax quotes the following query for SQL:

SELECT * FROM A 
WHERE A.Key NOT IN (SELECT B.Key FROM B)

Others suggest the following query:

SELECT * FROM A 
WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.Key = B.Key)

Both of them seem to return the same records, the query proposed by Microsoft seems to have better performance, however I found a case in which the two queries give different results and I don't know which of them is the right one. I show it with an example:

DECLARE @A TABLE ([OnA] [int], [DataA] [text])
DECLARE @B TABLE ([OnB] [int], [DataB] [text])

INSERT INTO @A VALUES (1, 'A1'), (2, 'A2'), (NULL, 'A3')
INSERT INTO @B VALUES (1, 'B1'), (3, 'B2'), (NULL, 'B3')

--LEFT SEMI-JOIN
SELECT * FROM @A 
WHERE OnA IN (SELECT OnB FROM @B)           
--Returns with ANSI_NULLS ON:  1 | A1              (Correct!)       
--Returns with ANSI_NULLS OFF: 1 | A1, NULL | A3   (Correct!)

SELECT * FROM @A 
WHERE EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)  
--Returns with ANSI_NULLS ON:  1 | A1              (Correct!) 
--Returns with ANSI_NULLS OFF: 1 | A1              (Not correct!)("ANSI_NULLS OFF" insensitive!)

--LEFT ANTISEMI-JOIN
SELECT * FROM @A 
WHERE OnA NOT IN (SELECT OnB FROM @B)                
--Returns with ANSI_NULLS ON:  empty               (Not Correct and very dangerous!)                 
--Returns with ANSI_NULLS OFF: 2 | A2              (Correct!)

SELECT * FROM @A 
WHERE NOT EXISTS (SELECT 1 FROM @B WHERE OnA = OnB) 
--Returns with ANSI_NULLS ON:  2 | A2, NULL | A3   (Correct!)
--Returns with ANSI_NULLS OFF: 2 | A2, NULL | A3   (Not correct!)("ANSI_NULLS OFF" insensitive!)

It's evident that with the presence of NULL in the correspondence of the key columns, the query left antisemi-join quoted by Microsoft doesn't return (in normal case: when ANSI_NULLS ON) the complementary result to the left semi-join query and this is already a bug. In addition, zero records return always, it's very serious!

On the other hand, the other formulas with the EXIST and NOT EXIST operator also don't seem to be consistent with the "ANSI_NULLS OFF" setting. This error affects also the results of the "left/right semi join" operations!

So, I assume that the "semi join" operation is ANSI_NULLS proof in T-SQL/SQL Server, but it should be done only in this way:

--LEFT SEMI-JOIN
SELECT * FROM @A 
WHERE OnA IN (SELECT OnB FROM @B)

While the "antisemi join" operation is not safe and not ANSI_NULLS proof, so we have the responsibility to use the right variant in certainly static contexts:

--LEFT ANTISEMI-JOIN (when ANSI_NULLS is ON)
SELECT * FROM @A 
WHERE NOT EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)

--LEFT ANTISEMI-JOIN (when ANSI_NULLS is OFF and it's possible that there is at least one record with NULL in key value)
SELECT * FROM @A 
WHERE OnA NOT IN (SELECT OnB FROM @B)

Do you all agree with me?

These are my questions:

  1. Would it be right to handle the "NullVsNull" eventuality with SET ANSI_NULLS rather than an explicit query?

  2. What would be the best single query to faithfully reflect the ANTI SEMI JOIN behavior regardless of the setting ANSI_NULLS?

  3. Is it possible that Microsoft and Others are both making a mistake?

  4. Why does the EXIST operator seem to be insensitive to "ANSI_NULLS OFF" setting?

  5. Why does the NOT IN operator always return EMPTY in the fifth case?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
119 views
Welcome To Ask or Share your Answers For Others

1 Answer

First of all, here's a Fiddle to help play with all this:

http://sqlfiddle.com/#!18/aa477/20

Now we can talk about what's going on. The answer here starts with a correct understanding of what NULL means.

People often think of NULL as meaning, "this value is empty", but this understanding is incomplete. It is better to think of NULL as meaning, "I don't know what this value is".

With the former (empty) understanding, comparing two NULL values ({empty} == {empty}) could be reasonably seen to produce true. But with the new correct (I don't know) understanding of NULL, we can see it makes much less sense to assume one {I don't know} will be equal to another {I don't know}. The correct answer to this comparison is more obviously still {I don't know}. Therefore NULL == NULL is still... NULL. Furthermore, NULL is falsy when forced into a boolean expression.


This is what the ANSI standard defines, what you should stick with pretty much all the time, and also the correct way to understand how NULL formally interacts with any of the mathematical/relational algebra join concepts.


If I really wanted to formally express this join type, I'd use the NOT EXISTS() variant. You could also write it like this:

SELECT @A.* 
FROM @A 
LEFT JOIN @B ON OnA = OnB
WHERE  @B.OnB IS NULL

We sometimes call this an Exclusion Join, but NOT EXISTS() also typically executes faster than the LEFT JOIN.

Again, for all of these options, understanding what is correct still hinges on a correct understanding of the NULL comparisons. Since the NULL from the A3 row is not known to be the same value as the NULL from the B3 row, that record should be included.

However, (and this is important), all this is in the theoretical abstract. In the practical concrete world, the correct result for your application depends on the meaning of the value. Sometimes a particular system, application, developer, or data model will decide NULL really does just mean "empty" for this purpose. Whether this is wise, or formally correct, is irrelevant. What is relevant is understanding the meaning of the data you're looking at and producing query results that answer the questions asked by your requirements.


There is still the question of whether this query shows a mistake or bug with Sql Server:

SELECT * FROM @A 
WHERE EXISTS (SELECT 1 FROM @B WHERE OnA = OnB) 

With ANSI_NULLS on (the default), things are correct. We don't know the NULL values are the same, and so the record should not be included. However, we the same results when ANSI_NULLS are off, even though we'd expect the NULL = NULL comparison to include that row. You may be correct about a bug here, though people smarter than me may explain why it isn't. But if so, I'm not really bothered by it. In my opinion, if you turn off ANSI_NULLS you're already playing with fire and deserve to run into this kind of thing.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...