Oracle中的半联结和反联结详解
当两张表进行联结的时候,如果表1中的数据行是否出现在结果集中需要根据表2中出现或不出现至少一个相匹配的数据行来判断,这种情况就会发生半联结;而反联结便是半联结的补集,它们会作为数据库中常见的联结方法如NESTED LOOPS,MERGE SORT JOIN,HASH JOIN的选项出现。
实际上半联结和反联结本身也可以被认同是两种联结方法;在CBO优化模式下,优化器能够根据实际情况灵活的转换执行语句从而实现半联结和反联结方法,毕竟没有什么SQL语法可以显式的调用半联结和反联结,它们只是SQL语句满足某些条件时优化器可以选择的选项而已,不过仍然有必要深入这两种选项在特定情况下带来的性能优势。
半联结
半联结通常都发生在使用含有IN和EXISTS的相关子查询的时候,=ANY的用法与IN相同,所以也会出现发生半联结的情况;不过也是有例外的,在11gR2版本中,优化器不会为任何包含在OR分支中的子查询选择半联结,这也是现在官档中唯一明确标识的限制条件,来看几种发生场景:
-- 使用IN关键字的相关子查询 => 发生NESTED LOOPS半联结
SQL> select department_name
2 from hr.departments dept
3 where department_id in (select department_id from hr.employees emp);
11 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2605691773
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 190 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS SEMI | | 10 | 190 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 41 | 123 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("DEPARTMENT_ID"="DEPARTMENT_ID")
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
11 consistent gets
0 physical reads
0 redo size
742 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
11 rows processed
-- 使用EXISTS关键字的相关子查询 => 发生NESTED LOOPS半联结
SQL> select department_name
2 from hr.departments dept where exists
3 (select null from hr.employees emp where emp.department_id = dept.department_id);
11 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2605691773
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 190 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS SEMI | | 10 | 190 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 41 | 123 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("EMP"."DEPARTMENT_ID"="DEPT"."DEPARTMENT_ID")
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
11 consistent gets
0 physical reads
0 redo size
742 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
11 rows processed
-- 谓语中使用了OR分支中的EXISTS子查询 => 禁用半联结
SQL> select department_name
2 from hr.departments dept
3 where 1=2 OR exists
4 (select null from hr.employees emp where emp.department_id = dept.department_id);
11 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 440241596
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 27 | 432 | 4 (0)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 2 | 6 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT 0 FROM "HR"."EMPLOYEES" "EMP" WHERE
"EMP"."DEPARTMENT_ID"=:B1))
3 - access("EMP"."DEPARTMENT_ID"=:B1)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
35 consistent gets
0 physical reads
0 redo size
742 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
11 rows processed
从结果集来看,我们很容易联想到内联结,那为什么半联结通常来说会获得更高的性能呢?这实际也是半联结优化的关键,拿NESTED LOOPS来举例,在NESTED LOOPS联结中,驱动表被读取后需要逐个的进入内层循环来进行匹配工作,并且只有当外层循环的数据行与内层循环中的每一行数据匹配运算完成后才会结束一个结果集的获取;而相对而言,半联结的区别在于数据集1中的每一条记录只返回一次,而不管数据集2中有几条匹配的记录,因此,半联结会在找到子查询中匹配到的第一条数据后立即结束处理从而提高性能。
对于某些需要利用半联结来提高性能的场景,可以通过手动的方式控制半联结的执行计划,使用SEMIJOIN和NO_SEMIJOIN提示分别可以指定优化器使用和禁用半联结。
-- 使用NO_SEMIJOIN提示禁用半联结
SQL> select department_name
2 from hr.departments dept
3 where department_id in (select /*+ no_semijoin */department_id from hr.employees emp);
11 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3372191744
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 106 | 1802 | 4 (25)| 00:00:01 |
| 1 | VIEW | VM_NWVW_2 | 106 | 1802 | 4 (25)| 00:00:01 |
| 2 | HASH UNIQUE | | 106 | 2544 | 4 (25)| 00:00:01 |
| 3 | NESTED LOOPS | | 106 | 2544 | 3 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 567 | 3 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 4 | 12 | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("DEPARTMENT_ID"="DEPARTMENT_ID")
Statistics
----------------------------------------------------------
506 recursive calls
0 db block gets
188 consistent gets
7 physical reads
0 redo size
742 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
10 sorts (memory)
0 sorts (disk)
11 rows processed
除此之外,我们还可以使用_always_semi_join隐藏参数选择半联结的联结类型,有关_always_semi_join参数的可选值:
SQL> SELECT
2 PARNO_KSPVLD_VALUES pvalid_par#,
3 NAME_KSPVLD_VALUES pvalid_name,
4 VALUE_KSPVLD_VALUES pvalid_value,
5 DECODE(ISDEFAULT_KSPVLD_VALUES, 'FALSE', '', 'DEFAULT' ) pvalid_default
6 FROM
7 X$KSPVLD_VALUES
8 WHERE
9 LOWER(NAME_KSPVLD_VALUES) LIKE '%'||LOWER(nvl('&pname',name_kspvld_values))||'%'
10 ORDER BY
11 pvalid_par#,
12 pvalid_default,
13 pvalid_Value
14 /
PAR# PARAMETER VALUE DEFAULT
------ -------------------------------------------------- ------------------------------ -------
1705 _always_semi_join CHOOSE
_always_semi_join HASH
_always_semi_join MERGE
_always_semi_join NESTED_LOOPS
_always_semi_join OFF
该参数的默认值为choose,表示选用半联结的类型由优化器来决定,下面来使用_always_semi_join参数将上面的NESTED LOOPS半联结改变为HASH JOIN半联结:
-- 默认发生NESTED LOOPS SEMI
SQL> select department_name
2 from hr.departments dept
3 where department_id in (select department_id from hr.employees emp);
11 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2605691773
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 190 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS SEMI | | 10 | 190 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 41 | 123 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("DEPARTMENT_ID"="DEPARTMENT_ID")
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
11 consistent gets
0 physical reads
0 redo size
742 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
11 rows processed
-- session级别修改参数
SQL> alter session set "_always_semi_join"=merge;
Session altered.
-- 发生MERGE JOIN SEMI
SQL> select department_name
2 from hr.departments dept
3 where department_id in (select department_id from hr.employees emp);
11 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 954076352
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 190 | 4 (25)| 00:00:01 |
| 1 | MERGE JOIN SEMI | | 10 | 190 | 4 (25)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| DEPARTMENTS | 27 | 432 | 2 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | DEPT_ID_PK | 27 | | 1 (0)| 00:00:01 |
|* 4 | SORT UNIQUE | | 107 | 321 | 2 (50)| 00:00:01 |
| 5 | INDEX FULL SCAN | EMP_DEPARTMENT_IX | 107 | 321 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("DEPARTMENT_ID"="DEPARTMENT_ID")
filter("DEPARTMENT_ID"="DEPARTMENT_ID")
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
5 consistent gets
1 physical reads
0 redo size
742 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
11 rows processed
-- 从trace来看优化器的选择还是非常可靠的。。
反联结
从本质上来说,反联结和半联结很多相似的因素,反联结的发生通常是在使用含有NOT IN,NOT EXISTS的相关子查询的时候,同样,如果子查询谓语OR分支中,反联结也会被禁用,它和半联结主要的不同点还是在返回数据的匹配方式上,它是会返回在子查询中没有匹配到的数据行,不过其优化的原理是一致的,通过在子查询中找到第一条匹配记录而立即停止处理来提高效率,一下是发生的集中场景:
SQL> set autotrace traceonly
-- NOT IN 触发反联结
SQL> select department_name
2 from hr.departments
3 where department_id not in
4 (select department_id from hr.employees where department_id is not null);
16 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3082375452
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 323 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 17 | 323 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 41 | 123 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("DEPARTMENT_ID"="DEPARTMENT_ID")
filter("DEPARTMENT_ID" IS NOT NULL)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
12 consistent gets
6 physical reads
0 redo size
985 bytes sent via SQL*Net to client
535 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
16 rows processed
-- EXISTS触发反联结
SQL> select department_name
2 from hr.departments dept
3 where not exists
4 (select null from hr.employees emp where emp.department_id = dept.department_id);
16 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3082375452
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 323 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 17 | 323 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | EMP_DEPARTMENT_IX | 41 | 123 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("EMP"."DEPARTMENT_ID"="DEPT"."DEPARTMENT_ID")
Statistics
----------------------------------------------------------
3 recursive calls
0 db block gets
13 consistent gets
0 physical reads
0 redo size
985 bytes sent via SQL*Net to client
535 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
16 rows processed
从上面的例子可以看出minus和outer join操作都可以巧妙的实现同样的结果,不过从执行计划来看,minus操作显然没有反联结操作优化,而使用outer join虽然发生了反联结优化,但是由于使用了带空值的虚拟记录来匹配数据行,不便于理解,因此实际还是不建议使用的。
如果想要手动控制反联结的执行计划,这里也有一些hint和参数可以使用,常用的hint有:
1.ANTIJOIN-进行反联结,优化器决定联结类型
2.USE_ANTI-老版本的提示,和ANTIJOIN功能一致
3.[NL_AJ] | [HASH_AJ] | [MERGE_AJ]-指定发生反联结的类型(10g开始被弃用,不过仍然可以生效)
在参数控制方面,也有个和_always_semi_join非常相同的_always_anti_join参数,用法完全一致;还有参数_optimizer_null_aware_antijoin,_optimizer_outer_to_anti_enable用于控制对含空值和外联结的反联结转换。
-- 使用hint显式指定反联结类型
SQL> select department_name
2 from hr.departments dept
3 where not exists (select /*+ hash_aj */ null from hr.employees emp
4 where emp.department_id = dept.department_id);
16 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3587451639
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 323 | 5 (20)| 00:00:01 |
|* 1 | HASH JOIN ANTI | | 17 | 323 | 5 (20)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | EMP_DEPARTMENT_IX | 107 | 321 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("EMP"."DEPARTMENT_ID"="DEPT"."DEPARTMENT_ID")
Statistics
----------------------------------------------------------
566 recursive calls
0 db block gets
193 consistent gets
0 physical reads
0 redo size
985 bytes sent via SQL*Net to client
535 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
12 sorts (memory)
0 sorts (disk)
16 rows processed
-- 使用_optimizer_null_antijoin参数关闭反联结中的空值考虑选项-即返回空值的情况不使用反联结
SQL> alter session set "_optimizer_null_aware_antijoin"=false;
Session altered.
SQL> select department_name
2 from hr.departments
3 where department_id not in (select department_id from hr.employees);
no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 3416340233
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 26 | 416 | 30 (0)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL| DEPARTMENTS | 27 | 432 | 3 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| EMPLOYEES | 2 | 6 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( NOT EXISTS (SELECT 0 FROM "HR"."EMPLOYEES" "EMPLOYEES"
WHERE LNNVL("DEPARTMENT_ID"<>:B1)))
3 - filter(LNNVL("DEPARTMENT_ID"<>:B1))
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
172 consistent gets
0 physical reads
0 redo size
343 bytes sent via SQL*Net to client
513 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
0 rows processed