Greenplum secrets🎩

Plan of the optimized query

Gather Motion 864:1  (slice2; segments: 864)  (cost=0.00..5135.44 rows=64796936 width=28) (actual time=4121.411..86304.398 rows=64796937 loops=1)
  Merge Key: ((('1'::double precision * tst_snap.fee) / (grc_origin.bal)::double precision))
  ->  Sort  (cost=0.00..1134.33 rows=74997 width=28) (actual time=3499.849..3512.324 rows=75895 loops=1)
        Sort Key: ((('1'::double precision * tst_snap.fee) / (grc_origin.bal)::double precision))
        Sort Method:  quicksort  Memory: 9726048kB
        ->  Result  (cost=0.00..941.51 rows=74997 width=28) (actual time=344.682..3222.810 rows=75895 loops=1)
              ->  Hash Join  (cost=0.00..939.41 rows=74997 width=24) (actual time=338.428..2445.560 rows=75895 loops=1)
                    Hash Cond: ((grc_origin.n)::double precision = tst_snap.up)
                    Extra Text: (seg302) Hash chain length 1.3 avg, 8 max, using 57270 of 131072 buckets.
                    ->  Redistribute Motion 864:864  (slice1; segments: 864)  (cost=0.00..442.66 rows=115741 width=16) (actual time=0.118..1403.245 rows=116893 loops=1)
                          Hash Key: (grc_origin.n)::double precision
                          ->  Seq Scan on grc_origin  (cost=0.00..433.42 rows=115741 width=16) (actual time=0.555..30.152 rows=116911 loops=1)
                    ->  Hash  (cost=432.24..432.24 rows=74997 width=16) (actual time=338.056..338.056 rows=75895 loops=1)
                          ->  Seq Scan on tst_snap  (cost=0.00..432.24 rows=74997 width=16) (actual time=1.638..178.638 rows=75895 loops=1)
Planning time: 39.724 ms
  (slice0)    Executor memory: 17457K bytes.
  (slice1)    Executor memory: 340K bytes avg x 864 workers, 356K bytes max (seg644).
  (slice2)    Executor memory: 20983K bytes avg x 864 workers, 20983K bytes max (seg0).  Work_mem: 11257K bytes max.
Memory used:  540672kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 90525.993 ms


Greenplum secrets🎩

Secret 25 (Bitcoin 2.0, not again but again: if you want it done well, do it yourself)
We have talked a lot about the importance of pre-materialization using SCD2 as an example; now let me show a new example in which the join key changes between steps.
I thought for a long time about how to stretch an owl over a globe here, and the road led me to the crypto paradigm.

The creator of the idea receives 1 GRC, and the n-th referral (ordered by the time its address was generated in the network) receives 1/n GRC.
/* Unlike Satoshi Nakamoto's idea, the emission is unlimited, since the harmonic series diverges, but that's not the point -) */
The founder can have no more than 12 referrals, and the same limit applies at every level of the pyramid; in other words, each father has no more than 12 sons.
The commission for a transfer is 0.01% of the transferred amount, and the system credits it to the father of the sender, i.e. the participant from whom the sender learned about the system.
Problem:
Consider the moment in the system's evolution when it has 100 million participants.
Let each of them make 1 transfer to anyone, for any amount within their balance.
The question: how much does each participant earn in transfer commissions?
Let's model the data:
System IDs

create table seq_100m
as select generate_series(1,1e8::int) n;

Father-son relationship table + balances

create table grc_origin   WITH (appendonly=true,orientation=column,compresstype=zstd,compresslevel=1)
as select n, trunc(n - (random()*12)) up, 1.0/n bal from seq_100m  distributed by(n);

The fathers of the earliest participants may end up with an id below 1; bring them back into the range of ids that exist in the system:

update grc_origin
set up = 1
where up < 1;

Table of transactions (transfers)

create table grc_pmt   WITH (appendonly=true,orientation=column,compresstype=zstd,compresslevel=1)
as select n n_from, (random()*1e8) n_to, bal * random() amt from  grc_origin distributed by(n_from);

Reference: total GRC in the system

select sum(bal) from grc_origin -- 18.99
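
As a sanity check on the 18.99 figure: since bal = 1/n, the total is simply the 100-millionth harmonic number, which can be estimated without touching the database. A minimal sketch in Python, assuming the standard asymptotic expansion H_n ≈ ln n + γ + 1/(2n) − 1/(12n²); the function names are illustrative and not part of the original post:

```python
import math

# Euler-Mascheroni constant (not provided by the math module).
EULER_GAMMA = 0.5772156649015329


def harmonic_exact(n: int) -> float:
    """Sum 1/k for k = 1..n directly; only practical for small n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def harmonic_approx(n: int) -> float:
    """Asymptotic estimate of the n-th harmonic number."""
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n) - 1.0 / (12 * n * n)


if __name__ == "__main__":
    # Total GRC held by 100 million participants with bal = 1/n.
    print(f"{harmonic_approx(10**8):.4f}")
```

The estimate comes out at just under 19 GRC, in line with the value returned by the query.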

The answer to the question posed is straightforward:

explain analyze
select x.n, bal, t.fee, 1.0 * t.fee / bal from grc_origin x
join (
select up, sum(amt * 0.0001) fee from grc_origin, grc_pmt
where n = n_from
group by 1) t
on x.n = t.up
order by 4 desc;
Execution time: 102615.943 ms (131005.427 ms)

"So what's the point?" the impatient listener asks, as in the joke about the camel caravan. What is the secret of Damascus steel?
Answer: the query above is not the optimal option.
How can it be improved?
Pre-materialize subquery t and join against the stored result:

create table tst_snap   WITH (appendonly=true,orientation=column,compresstype=zstd,compresslevel=1)
as select up, sum(amt * 0.0001) fee from grc_origin, grc_pmt
where n = n_from
group by 1 distributed by(up);
--64,796,937 rows affected in 6 s 375 ms

Let's see how the plan has changed.

explain analyze
select x.n, bal, t.fee, 1.0 * t.fee / bal from grc_origin x
join tst_snap t
on x.n = t.up
order by 4 desc;
Execution time: 90525.993 ms (94188.913 ms)

The times in parentheses are from a rerun; the first numbers come from the plans, which, as is the tradition here, are posted above.
Conclusion: pre-materializing a subquery (or CTE) whose result is joined on a different key in the next slice can give a gain of 6 to 30%.
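
The range can be roughly reproduced from the timings quoted above. The sketch below is in Python; whether the 6 s 375 ms spent building tst_snap should be counted against the optimized variant is my assumption, since the post does not say, so both readings are printed:

```python
def gain_pct(before_ms: float, after_ms: float) -> float:
    """Relative speed-up of after_ms over before_ms, in percent."""
    return (before_ms - after_ms) / before_ms * 100.0


# Timings quoted in the post as (first run, rerun), in milliseconds.
ORIGINAL_MS = (102615.943, 131005.427)
MATERIALIZED_MS = (90525.993, 94188.913)
CTAS_MS = 6375.0  # "6 s 375 ms" spent creating tst_snap

if __name__ == "__main__":
    for before, after in zip(ORIGINAL_MS, MATERIALIZED_MS):
        print(
            f"query only: {gain_pct(before, after):.1f}%  "
            f"including CTAS: {gain_pct(before, after + CTAS_MS):.1f}%"
        )
```

On these numbers the gain falls between roughly 6% (first run, CTAS included) and 28% (rerun, query only).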


Greenplum secrets🎩

Do the NOT IN and NOT EXISTS operators produce an identical dataset in GP?

Anonymous Quiz

30%

70%

113 voters

Greenplum secrets🎩

Secret 26 (Don't leave crumbs unattended)

We have already covered the problems that arise with empty tables
or with missing statistics.

This time a trivial join of 3 tables almost brought down our server, all because one of them had no statistics.
The query below, executed several times a day, caused a 31 TB spill:

select b.deal_fee_rk, b.invalid_id, b.effective_date, max(b.version_id) version_id
                     from big b
                              join medium d
                                   on d.c_comiss_arr_rk = b.collection_rk
                              join tiny as f on f.type_debt_rk = b.c_debt_rk
                     where (b.version_id between 1 and 2732523)
                       and b.valid_flg is true
                     group by b.deal_fee_rk, b.invalid_id, b.effective_date;

Here the AOCO zstd tables big, medium and tiny held 1.5 billion, 1 million and 5 rows respectively.
Only the last one had no statistics.

If you look at the query plan below, you will see 2 Broadcast Motions:
the 1st, as expected, replicates the tiny crumb to all segments, while the 2nd unexpectedly copies the result of the join of big and tiny to all segments as well.
Why GPORCA chooses the 2nd Broadcast instead of a Redistribute, given that medium's join key for the next slice is known, I can only guess.
But that makes this secret all the more valuable.

I also checked what happens if statistics are missing only for medium. Things are much better there:
the plan has only 1 Broadcast and there is no spill.

The bottom line: if the tables in your DWH are created by PL/pgSQL functions
and gp_autostats_mode_in_functions = on_change (set explicitly or at the GUC level),
the galaxy of your data (or of your finances, if you rent a cloud) is in danger if you stop there and do not take care of statistics for negligibly small tables.
With on_change, automatic statistics collection fires only when the number of affected rows exceeds gp_autostats_on_change_threshold, so a 5-row table never qualifies.

The moral: collect statistics for small tables, at least those with up to 10,000 rows,
so that such mines stay outside the perimeter of your data platform.

Happy Friday the 13th to all, and optimal computations!

Greenplum secrets🎩

Plan of the original query (when only the tiny table has no statistics):

Gather Motion 864:1  (slice4; segments: 864)  (cost=0.00..1371.49 rows=1 width=28) (actual time=1871747.423..1875490.705 rows=22129384 loops=1)
  ->  GroupAggregate  (cost=0.00..1371.49 rows=1 width=28) (actual time=1871747.210..1871768.905 rows=26077 loops=1)
        Group Key: big.deal_fee_rk, big.invalid_id, big.effective_date
        ->  Sort  (cost=0.00..1371.49 rows=1 width=28) (actual time=1871747.155..1871750.100 rows=26217 loops=1)
              Sort Key: big.deal_fee_rk, big.invalid_id, big.effective_date
              Sort Method:  quicksort  Memory: 2482272kB
              ->  Redistribute Motion 864:864  (slice3; segments: 864)  (cost=0.00..1371.49 rows=1 width=28) (actual time=1475307.979..1871714.814 rows=26217 loops=1)
                    Hash Key: big.deal_fee_rk, big.invalid_id, big.effective_date
                    ->  Hash Join  (cost=0.00..1371.49 rows=1 width=28) (actual time=1413328.254..1514773.884 rows=35025 loops=1)
                          Hash Cond: (medium.c_comiss_arr_rk = big.collection_rk)
                          Extra Text: (seg0)   Initial batch 0:
(seg0)     Wrote 10810674K bytes to inner workfile.
(seg0)     Wrote 23K bytes to outer workfile.
(seg0)   Overflow batches 1..255:
(seg0)     Read 15508008K bytes from inner workfile: 60816K avg x 255 nonempty batches, 236543K max.
(seg0)     Wrote 4697335K bytes to inner workfile: 36987K avg x 127 overflowing batches, 192573K max.
(seg0)     Read 23K bytes from outer workfile: 1K avg x 253 nonempty batches, 1K max.
(seg0)   Hash chain length 50.8 avg, 3476 max, using 4210779 of 33554432 buckets.
                          Extra Text: (seg575) Initial batch 0:
(seg575)   Wrote 10810674K bytes to inner workfile.
(seg575)   Wrote 24K bytes to outer workfile.
(seg575) Overflow batches 1..255:
(seg575)   Read 15507743K bytes from inner workfile: 60815K avg x 255 nonempty batches, 236532K max.
(seg575)   Wrote 4697070K bytes to inner workfile: 36985K avg x 127 overflowing batches, 192563K max.
(seg575)   Read 24K bytes from outer workfile: 1K avg x 252 nonempty batches, 1K max.
(seg575) Hash chain length 50.8 avg, 3476 max, using 4210779 of 33554432 buckets.
                          ->  Seq Scan on medium  (cost=0.00..431.01 rows=1204 width=8) (actual time=0.998..1.228 rows=1305 loops=1)
                          ->  Hash  (cost=940.25..940.25 rows=1 width=36) (actual time=1413176.714..1413176.714 rows=213719427 loops=1)
                                ->  Broadcast Motion 864:864  (slice2; segments: 864)  (cost=0.00..940.25 rows=1 width=36) (actual time=444.691..616240.052 rows=213719427 loops=1)
                                      ->  Hash Join  (cost=0.00..940.20 rows=1 width=36) (actual time=609.326..2707.949 rows=261266 loops=1)
                                            Hash Cond: (big.c_debt_rk = tiny.type_debt_rk)
                                            Extra Text: (seg427) Hash chain length 1.0 avg, 1 max, using 5 of 262144 buckets.
                                            ->  Seq Scan on big  (cost=0.00..486.84 rows=111417 width=44) (actual time=1.457..775.750 rows=299164 loops=1)
                                                  Filter: ((version_id >= 1) AND (version_id <= 2732523) AND (valid_flg IS TRUE))
                                            ->  Hash  (cost=431.01..431.01 rows=1 width=8) (actual time=375.789..375.789 rows=5 loops=1)
                                                  ->  Broadcast Motion 864:864  (slice1; segments: 864)  (cost=0.00..431.01 rows=1 width=8) (actual time=0.352..375.764 rows=5 loops=1)
                                                        ->  Seq Scan on tiny  (cost=0.00..431.00 rows=1 width=8) (actual time=10.545..10.553 rows=1 loops=1)
Planning time: 131.145 ms
  (slice0)    Executor memory: 2247K bytes.
  (slice1)    Executor memory: 172K bytes avg x 864 workers, 172K bytes max (seg0).
  (slice2)    Executor memory: 3231K bytes avg x 864 workers, 3231K bytes max (seg0).  Work_mem: 1K bytes max.
* (slice3)    Executor memory: 107915K bytes avg x 864 workers, 107989K bytes max (seg792).  Work_mem: 66795K bytes max, 13357465K bytes wanted.
  (slice4)    Executor memory: 3039K bytes avg x 864 workers, 5023K bytes max (seg2).  Work_mem: 4857K bytes max.
Memory used:  540672kB
Memory wanted:  40073392kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 1877810.477 ms


Greenplum secrets🎩

Plan when only the medium table has no statistics:

Gather Motion 864:1  (slice4; segments: 864)  (cost=0.00..1371.21 rows=1 width=28) (actual time=14458.358..18737.308 rows=22129384 loops=1)
  ->  GroupAggregate  (cost=0.00..1371.21 rows=1 width=28) (actual time=14453.359..14471.856 rows=26077 loops=1)
        Group Key: big.deal_fee_rk, big.invalid_id, big.effective_date
        ->  Sort  (cost=0.00..1371.21 rows=1 width=28) (actual time=14453.340..14455.898 rows=26217 loops=1)
              Sort Key: big.deal_fee_rk, big.invalid_id, big.effective_date
              Sort Method:  quicksort  Memory: 2478176kB
              ->  Redistribute Motion 864:864  (slice3; segments: 864)  (cost=0.00..1371.21 rows=1 width=28) (actual time=2317.858..14432.875 rows=26217 loops=1)
                    Hash Key: big.deal_fee_rk, big.invalid_id, big.effective_date
                    ->  Hash Join  (cost=0.00..1371.21 rows=1 width=28) (actual time=1931.744..8198.792 rows=11859767 loops=1)
                          Hash Cond: (big.c_debt_rk = tiny.type_debt_rk)
                          Extra Text: (seg102) Hash chain length 1.0 avg, 1 max, using 1 of 262144 buckets.
                          Extra Text: (seg300) Hash chain length 1.0 avg, 1 max, using 1 of 262144 buckets.
                          ->  Redistribute Motion 864:864  (slice2; segments: 864)  (cost=0.00..940.20 rows=1 width=36) (actual time=1908.235..4221.407 rows=11859767 loops=1)
                                Hash Key: big.c_debt_rk
                                ->  Hash Join  (cost=0.00..940.20 rows=1 width=36) (actual time=1906.750..2389.828 rows=34561 loops=1)
                                      Hash Cond: (big.collection_rk = medium.c_comiss_arr_rk)
                                      Extra Text: (seg116) Hash chain length 4.0 avg, 16 max, using 257143 of 262144 buckets.
                                      ->  Seq Scan on big  (cost=0.00..486.84 rows=111417 width=44) (actual time=41.365..356.523 rows=299164 loops=1)
                                            Filter: ((version_id >= 1) AND (version_id <= 2732523) AND (valid_flg IS TRUE))
                                      ->  Hash  (cost=431.01..431.01 rows=1 width=8) (actual time=1719.715..1719.715 rows=1039929 loops=1)
                                            ->  Broadcast Motion 864:864  (slice1; segments: 864)  (cost=0.00..431.01 rows=1 width=8) (actual time=0.161..719.373 rows=1039929 loops=1)
                                                  ->  Seq Scan on medium  (cost=0.00..431.00 rows=1 width=8) (actual time=0.178..0.529 rows=1305 loops=1)
                          ->  Hash  (cost=431.00..431.00 rows=1 width=8) (actual time=0.807..0.807 rows=1 loops=1)
                                ->  Seq Scan on tiny  (cost=0.00..431.00 rows=1 width=8) (actual time=0.787..0.795 rows=1 loops=1)
Planning time: 129.081 ms
  (slice0)    Executor memory: 2247K bytes.
  (slice1)    Executor memory: 181K bytes avg x 864 workers, 182K bytes max (seg243).
  (slice2)    Executor memory: 52431K bytes avg x 864 workers, 52431K bytes max (seg0).  Work_mem: 32498K bytes max.
  (slice3)    Executor memory: 2241K bytes avg x 864 workers, 2407K bytes max (seg102).  Work_mem: 1K bytes max.
  (slice4)    Executor memory: 3098K bytes avg x 864 workers, 5087K bytes max (seg2).  Work_mem: 4857K bytes max.
Memory used:  540672kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 21345.519 ms


Greenplum secrets🎩

Secret I (SQL inside Excel, or Donald Duck returns)

For lack of new secrets from GP, I will share goodies from the OLAP world that make my own life easier. Please don't take it as spam; these will be numbered with Roman numerals.

I have long wanted to slice my Excel data with SQL queries.
If you have too, there is a solution: DuckDB, a persistent columnar database (with NoSQL-style support for semi-structured data such as JSON) that installs easily anywhere, in my case on Win 10.
They write that its architecture is even multi-threaded, and someone has already checked that it counts the lines in a file faster than good old wc -l on *nix.

First impressions of the critter: delight, both with the functionality and with the speed.

Suppose we have a table of deals in an XLS file, to which we keep adding new crypto deals.
Filtering deals by a particular ticker at the XLS level is no problem.
Difficulties begin when we want to see the return on open deals for a particular ticker, say Nosana (a crypto token used to pay for an AI network built on a GPU grid).
Since such deals need not sit in adjacent rows of the sheet (we buy different coins at different times to diversify the portfolio), the SUM function in XLS over a range (with the ticker filter applied) gives a wrong result, because it picks up the intermediate rows of the range that belong to other tickers.

And here DuckDB comes to the rescue: at the snap of a finger it can load a CSV (to which we saved our XLS) into a table:

create table cdeal as select * from read_csv('C:\tmp\deals.csv', delim=';');

Want to look at the open deals we are interested in? Here you go (see the screenshot):

select * from cdeal where stock='NOS' and "Sell Date" is null;

Need to total up the profit and loss? Easy!

select sum(profit::numeric) from cdeal where stock='NOS' and "Sell Date" is null;

This was a simple 15-minute example that shows the elegance of this solution.

Greenplum secrets🎩

Your average net (take-home) salary in the outgoing 2024.

Anonymous Poll

153 voters

Greenplum secrets🎩

A salary poll for Greenplum people ran on the Greenplum Russia channel today.
Since some were unhappy with the histogram buckets, and there are hardly any random people here unrelated to GP,
I hasten to fix the situation; to be honest, I got curious myself.
The poll is above and below this disclaimer.

Greenplum secrets🎩

Your average net (take-home) salary in the outgoing 2024 (part 2)!

Anonymous Poll

I am a stakeholder (owner) of Arenadata PJSC

101 voters

Greenplum secrets🎩

Dear friends! Since it's time to cool down (the channel is going on vacation until 10.01.25), the last poll of the year is a work-related one.

Greenplum secrets🎩

Where do you cool down (archive) GP data?

Anonymous Poll

In the same GP, on slower disks

35%

There is no such need

124 voters

Greenplum secrets🎩

Refutation
Since I promised to lay off the secrets while the channel is on vacation, I would simply like to share the results of one experiment.
Yesterday @andreikapolin made a striking statement on the Greenplum Russia channel; I quote:

"There is a table partitioned by date; I create partitions only from 2022 on, and everything earlier sits in the default partition.
So the table holds a history of 3 billion rows, all of it in the default partition.
I start loading 300 million rows and creating partitions for them, and it hangs. What could this be related to?"

Since this intrigued me, even though I believe the default partition should not be crowded, I decided to check the claim.
Let's create a table and load 3 billion rows into its default partition:

CREATE TABLE tst1
  (id INT,
   order_date DATE
  )
WITH (appendoptimized=true, orientation=column, compresstype=ZLIB, compresslevel=1)
DISTRIBUTED BY(id)
PARTITION BY RANGE(order_date)
(START(date '2022-01-01') INCLUSIVE
END(date '2023-01-01') EXCLUSIVE
EVERY(INTERVAL '1 month'),
DEFAULT PARTITION other);

Synthetic data for the default partition (with a NULL date):

create table smpl_1m WITH (appendonly = true, orientation = column, compresstype = zstd, compresslevel = 1)
as
select generate_series(1, 1e6::int) id, null::date distributed by(id);

insert into smpl_1m
select s.* from smpl_1m s
join (select generate_series(1,3000)) a
on 1=1;

Check the row count:

select count(*) from smpl_1m; -- 3 001 000 000

Write the data into the default partition:

insert into tst1
select * from smpl_1m;

Now let's model the step of writing 300 million rows into the non-default partitions (the 12 months of 2022):

insert into tst1
select id, '2022-01-01'::date + mod(id, 365)
from smpl_1m
limit 300e6::int;

300,000,000 rows affected in 2 m 27 s 313 ms

Now let's create a table tst2 identical to tst1 (I omit the script to avoid clutter; it is the same as for tst1), leave its default partition empty, and write the same 300 million rows into it:

insert into tst2
select id, '2022-01-01'::date + mod(id, 365)
from smpl_1m
limit 300e6::int;

300,000,000 rows affected in 2 m 27 s 203 ms

Conclusion: given the inputs available, the author's claim is not confirmed. The timings match to within a second.

Perhaps @andreikapolin had a different scenario; if he reads this post, maybe he will share what the difference is.

Greenplum secrets🎩

Useful notes on PXF, or the nuances of integration with Hive

Friends, long live the day of the eternal student, January 25, which means it's time for lessons!
Here are a few that I learned while trying to cut a window from Greenplum into Hadoop,
especially since in places the answers could not be googled and were instead found in the neural networks of my enlightened colleagues.
I'll make a note of them here, for the record!

Lesson 1.
It happens that library versions in the DEV environment lag behind production, and Murphy's law is right there: "If something can go wrong, it will"!
One day the Hadoop build was updated; we created a test table from scratch and set up a hive profile in GP to read it via PXF.
We created an external table in GP, and a query against it fails with the error
PXF server error : Can not read value at 0 in block -1 in file ...

The post-mortem showed that in the new Hadoop configuration the Hive table was created as Parquet with the default compression zstd, a codec that the old PXF build does not support.

Solution:
Specify 'parquet.compression' = 'SNAPPY' in TBLPROPERTIES in the Hive DDL.

Lesson 2.
We created a writable external table in GP and tried to write 1 million rows through it to Hive via jdbc within a single transaction.
The query ended with an error:
ERROR: PXF server error : Failed to obtain secured JDBC connection
java.sql.SQLTransientConnectionException: HikariPool-74 - Connection is not available, request timed out after 30000ms

Nevertheless, 130k rows appeared in the Hive table.
I could not google the error right away, but it convinced me that Hive is not about ACID (atomicity, consistency, isolation, durability) and that transactional consistency is out of the question.

Lesson 3.
Reading from and writing to Hive via PXF over jdbc do not necessarily have equal chances of success.
I tried to write data to Hive via

CREATE WRITABLE EXTERNAL TABLE ext_hive_jdbc_snappy_w(
  "bool" boolean,
  "int4" int4,
  "int2" int2,
  "int_tiny" int2,
  "int64_col" int8,
  "float_col" float4,
  "double_col" float8,
  "json_col" text,
  "col_binary" bytea,
  "timestamp_micros" timestamp)
location ('pxf://...?PROFILE=jdbc&SERVER=...') on all format 'custom' ( formatter='pxfwritable_export' ) encoding 'utf8';

and ran into
ERROR: PXF server error : Method not supported
If the bytea column is excluded from the insert, the write completes without errors.
Yes, of course you shouldn't write to Hive via jdbc, there is Spark for that, but it is still unexpected, given that reading the binary column of the same Hive table via PXF works
without problems, as it should according to the documentation (see the type-mapping slide attached to the post).

No solution found.

Lesson 4.
An attentive subscriber will notice that the mapping in lesson 3 has no DATE type. Let me reassure you: this much-needed type is supported in Hive tables,
but in our PXF build, version 16.1, reading it (as well as timestamp) required adding the undocumented option date_wide_range=false to the DDL of the external table in GP;
otherwise a select of columns of these types from the PXF external table fails with
ERROR: PXF server error : Illegal conversion

Greenplum secrets🎩

Friends, please don't take this as spam. Since I am confident in the indestructible potential of open-source technologies, and crypto is a vivid example
of that potential realized for the benefit of civilization, some of its successes will be covered here:

Cryptonomics

The Cryptocurrency Economy and the Fight for Control of the World. by @smartyru

Greenplum secrets🎩

Secret 27 (Once again on rearranging the terms, or why 2 <> 1+1)
The other day I came across an e-mail from a senior manager with a question; I quote:
"
Table X
Size discrepancy of more than 10%
PROD 897.35 GB
DR 1016.38 GB

The row-count discrepancy at the time of reconciliation is less than 0.01%
Object structure and distribution are the same.
This happens on many objects.

Is there an explanation for this discrepancy?
"

I was intrigued and checked everything, down to the per-column compression types, because columns are sometimes added without compression
on the assumption that zstd will be inherited from the table itself, which is not the case, as we know from part 2 of secret 15.
In this list-partitioned AOCO table of 109 billion rows, split into 6 partitions, each with 3 bigint and 3 text columns, I found no differences except the one stated in the question: every partition on PROD was ~10% smaller than on DR.

Thinking of VACUUM, which according to pg_stat_all_tables had never been run on either cluster, I recommended running it to sweep out any zombie rows left by update/delete, and then checking whether the sizes had evened out.

What I did not think of, fresh back from vacation, is that the number of segments on DR and PROD differs by almost a factor of 3, and that was the real cause of the size difference.
I got curious how much this difference matters.
I created a trivial test: a 1-column table with 1 million rows.

create table tst_1m   WITH (appendonly=true,orientation=column,compresstype=zstd,compresslevel=1)
as select generate_series(1,1000000) n    distributed by(n);

Table size on PROD:

SQL> select pg_relation_size( 'tst_1m' )
pg_relation_size
-----------------------------------------
3 245 088

On DR:

SQL> select pg_relation_size( 'tst_1m' )
pg_relation_size
-----------------------------------------
3 166 432

Funnily enough, the real table shows the opposite of the synthetic test above: with about 3 times as many segments, the real table on PROD came out smaller, whereas the synthetic one came out larger. But the point, I think, is clear: the same data can occupy a different amount of disk space depending on the number of segments.
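
To put the two comparisons on the same scale, the relative differences can be computed from the figures quoted above. A small Python sketch; the helper name is illustrative, and only the numbers come from the post:

```python
def pct_larger(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100.0


# Synthetic 1M-row table: bytes reported by pg_relation_size.
SYNTHETIC_PROD = 3_245_088
SYNTHETIC_DR = 3_166_432

# Real table X from the quoted e-mail, in GB.
REAL_PROD = 897.35
REAL_DR = 1016.38

if __name__ == "__main__":
    print(f"synthetic: PROD is {pct_larger(SYNTHETIC_PROD, SYNTHETIC_DR):.2f}% larger than DR")
    print(f"real:      DR is {pct_larger(REAL_DR, REAL_PROD):.2f}% larger than PROD")
```

The synthetic gap is about 2.5% in one direction, while the real table differs by about 13% in the other, so the sign and the size of the effect evidently depend on the data.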

Greenplum secrets🎩

Secret 15 (2 in 1, or Precision is the politeness of kings)
part 1
The other day a wonderful colleague, who is also a subscriber of the channel, asked me to check a query for removing duplicates from a table of 2 billion rows.
I warned that a DELETE of 99% of the rows is not the best idea, but according to the regulations TRUNCATE +…
