Greenplum secrets🎩

Secret 18 (IN or NOT IN: to BE or NOT to BE!)
Following up on today's poll, a bit of harsh truth.
If you use NOT IN, you really shouldn't: switch your code to NOT EXISTS as soon as you can.
It turns out that IN and NOT IN behave completely differently. While IN vs. EXISTS is a matter of taste, NOT IN triggers a Broadcast Motion,
i.e. the entire filter table is copied to all nodes of the cluster, and the query cost (execution time) of NOT IN and NOT EXISTS differs by almost 3 orders of magnitude.

A simple example:
Let's create 2 tables of 100 million rows each

create table public.t1   WITH (appendonly=true,orientation=column,compresstype=zstd,compresslevel=1)
as select generate_series(1,1e8::int) n distributed by(n);

create table public.t2   WITH (appendonly=true,orientation=column,compresstype=zstd,compresslevel=1)
as select * from public.t1  distributed by(n);

Query plan for NOT IN:

explain analyze
select t1.* from public.t1 where t1.n not in(select n from public.t2)

Gather Motion 720:1  (slice2; segments: 720)  (cost=0.00..23275.28 rows=100000000 width=4) (actual time=1078136.740..1078136.740 rows=0 loops=1)
  ->  Hash Left Anti Semi (Not-In) Join  (cost=0.00..22392.73 rows=138889 width=4) (actual time=0.000..1078112.753 rows=0 loops=1)
        Hash Cond: (t1.n = t2.n)
        Extra Text: (seg0)   Initial batch 0:
(seg0)     Wrote 1153562K bytes to inner workfile.
(seg0)     Wrote 1604K bytes to outer workfile.
(seg0)   Initial batches 1..63:
(seg0)     Read 1153562K bytes from inner workfile: 18311K avg x 63 nonempty batches, 18343K max.
(seg0)     Read 1604K bytes from outer workfile: 26K avg x 63 nonempty batches, 27K max.
(seg0)   Hash chain length 3.1 avg, 17 max, using 31848259 of 33554432 buckets.
        ->  Seq Scan on t1  (cost=0.00..432.38 rows=138889 width=4) (actual time=0.616..85.887 rows=139894 loops=1)
        ->  Hash  (cost=1000.99..1000.99 rows=100000000 width=4) (actual time=494428.395..494428.395 rows=100000000 loops=1)
              ->  Broadcast Motion 720:720  (slice1; segments: 720)  (cost=0.00..1000.99 rows=100000000 width=4) (actual time=4.791..247886.342 rows=100000000 loops=1)
                    ->  Seq Scan on t2  (cost=0.00..432.38 rows=138889 width=4) (actual time=0.398..242.821 rows=139894 loops=1)
Planning time: 38.989 ms
  (slice0)    Executor memory: 891K bytes.
  (slice1)    Executor memory: 220K bytes avg x 720 workers, 220K bytes max (seg0).
* (slice2)    Executor memory: 82285K bytes avg x 720 workers, 82285K bytes max (seg0).  Work_mem: 36686K bytes max, 2343750K bytes wanted.
Memory used:  229376kB
Memory wanted:  2344250kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 1 078 696.860 ms

I won't burden you with the full plan for the NOT EXISTS variant on a Friday evening, so here is just the result:

explain analyze
select t1.* from public.t1 where not exists (select 1 from public.t2 where t2.n = t1.n)

Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 1752.198 ms
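
One thing to keep in mind before swapping NOT IN for NOT EXISTS everywhere: the two are only interchangeable when the filter column has no NULLs. The sketch below uses two tiny throwaway tables (`a` and `b` are hypothetical names, not part of the benchmark above) to show the difference. My understanding is that this NULL-aware behaviour is also why the planner needs the special "(Not-In)" anti join, but treat that link as an assumption rather than a statement about GPORCA internals.

```sql
-- Hypothetical demo tables, tiny on purpose (not the 100M-row tables from the post).
create temp table a (n int) distributed by (n);
create temp table b (n int) distributed by (n);
insert into a values (1), (2), (3);
insert into b values (1), (null);

-- NOT IN: comparing with NULL yields "unknown", so no row can pass the filter.
select a.n from a where a.n not in (select n from b) order by 1;
-- returns no rows

-- NOT EXISTS: the NULL in b never matches anything, so 2 and 3 pass.
select a.n from a where not exists (select 1 from b where b.n = a.n) order by 1;
-- returns 2 and 3
```

In the benchmark tables `n` comes from generate_series and contains no NULLs, so both forms return the same (empty) result there.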

Let's also look at the LEFT JOIN variant, which was not in the poll but is a favorite of @andre_rumyanec:

explain analyze
select t1.* from public.t1 t1 left join public.t2 t2 on t1.n = t2.n
where t2.n is null

Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 1502.517 ms

And so LEFT JOIN takes the lead.
One caveat: if you are tempted to use a join for the IN filter as well, don't forget to wrap the result in DISTINCT, because if the filter table contains duplicates of the join key, rows can be multiplied.
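
The row multiplication mentioned above is easy to reproduce on a toy example. This is a minimal sketch with hypothetical tables `f` (the table being filtered) and `d` (the filter table with a duplicated key); neither is part of the benchmark.

```sql
-- Hypothetical demo tables (not from the post).
create temp table f (n int) distributed by (n);
create temp table d (n int) distributed by (n);
insert into f values (1), (2);
insert into d values (1), (1);   -- duplicate key in the filter table

-- Reference behaviour: IN returns each matching row of f once.
select f.n from f where f.n in (select n from d);
-- returns one row: 1

-- Plain join: the duplicate in d multiplies the matching row.
select f.n from f join d on f.n = d.n;
-- returns two rows: 1, 1

-- DISTINCT collapses the copies again.
select distinct f.n from f join d on f.n = d.n;
-- returns one row: 1
```

Note that DISTINCT on the result would also collapse genuine duplicates in `f` itself. If that matters, deduplicating the filter side instead, e.g. `join (select distinct n from d) d on f.n = d.n`, keeps the semantics closer to IN.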

To round out the topic, the photo below compares the plans of the gold medalist and the silver medalist, which trails by 0.25 s.
