Comparing JSONB and EAV model for extensions

Last modified 05 Apr 2022 15:32 +02:00

Table of Contents

Insert
Results for 15 million rows
- Performance with more RAM/CPU
- JSONB vs EAV comparison of similar operations
Update
Other types
- Example of JSON unwrapping to rows
Conclusion

Performance was first measured on default PG installation on 2GB RAM VirtualBox with 1 vCPU. The results can be better (or worse depending on the host) even by an order-of-magnitude where memory/CPU (mostly counts). Comparison with the same box resized to 8GB/4CPU is provided. All tests were run on the host with Ryzen 5 1600AF and 32GB RAM.

Structure for JSON is simple, single table tjson with (oid UUID PK, name VARCHAR, ext JSONB). Column name has a unique index and ext has a GIN index only by default. Other indexes are added as needed and mentioned during tests.

create table tjson (
    oid UUID NOT NULL DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    ext JSONB,

    CONSTRAINT tjson_oid_pk PRIMARY KEY (oid)
);

ALTER TABLE tjson ADD CONSTRAINT tjson_name_key UNIQUE (name);
CREATE INDEX tjson_ext_idx ON tjson USING gin (ext);

Structure for EAV model has a master table teav with (oid UUID PK, name VARCHAR) with unique index on the name. Detail table teav_ext_string contains (owner_oid FK, key VARCHAR, value VARCHAR). Extension table has PK(owner_oid, key, value), that is all columns, value could be omitted if uniqueness is not necessary. Because the PK covers all columns, it should be index-organized table in Oracle (PG doesn’t have this feature). In PG it can be clustered by an index, but this is potentially costly (12min for 15M rows) during which the table is unusable because of the used ACCESS EXCLUSIVE lock.

create table teav (
    oid UUID NOT NULL DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,

    CONSTRAINT teav_oid_pk PRIMARY KEY (oid)
);

ALTER TABLE teav ADD CONSTRAINT teav_name_key UNIQUE (name);

create table teav_ext_string (
    owner_oid UUID NOT NULL references teav(oid),
    key VARCHAR(32) NOT NULL,
    value VARCHAR(255) NOT NULL,

    -- this also covers the index on owner_oid FK
    CONSTRAINT teav_ext_string_pk PRIMARY KEY (owner_oid, key, value)
);

CREATE INDEX teav_ext_string_key_value_idx ON teav_ext_string (key, value);

Utility pgbench can be used for measurements like so:

pgbench -r -P 5 -f - -t 30 << "--EOF"
select * from teav limit 500;
... more selects
--EOF

If pgbench is prefixed with -- it can be part of SQL file as a comment, hence --EOF as a delimiter. It is easy to copy/paste from such an SQL file directly to shell to run the pgbench. Delimiter is quoted with " to avoid any accidental shell expansions within the block.

Results can be taken from the final report of average latency per statement (enabled by -r option). Results may vary widely between runs, especially for shorter queries. Also, running the whole script and running smaller bulks provides different results as well, typically better for smaller amount of statement (this may be caused by rather limited VM performance). Measurements for shorter queries were repeated with longer queries commented out and option -t 30 to get more repetitions per statement (30 in this case) which typically leads to lower, but also more representative, averages.

Full SQL file is attached here.

Insert

Insert loop is created in such a way that both version have exactly the same data, even where randomized.

-- start with smaller batches 1..1000, 1001..100000, etc.
DO
$$
DECLARE
    hobbies VARCHAR[];
    v VARCHAR;
    id UUID;
BEGIN
    FOR r IN 11000001..15000000 LOOP

        IF r % 10 <= 1 THEN
            -- some entries have no extension
            INSERT INTO tjson (name) VALUES ('user-' || LPAD(r::text, 10, '0'));
            INSERT INTO teav (name) VALUES ('user-' || LPAD(r::text, 10, '0'));
        ELSEIF r % 10 <= 3 THEN
            -- email+eid (constant keys) + other-key-{r} (variable key)
            INSERT INTO tjson (name, ext) VALUES (
                'user-' || LPAD(r::text, 10, '0'),
                ('{"eid": ' || r || ', "email": "user' || r || '@mycompany.com", "other-key-' || r || '": "other-value-' || r || '"}')::jsonb
            );
            INSERT INTO teav (name) VALUES ('user-' || LPAD(r::text, 10, '0')) RETURNING oid INTO id;
            INSERT INTO teav_ext_string (owner_oid, key, value) VALUES (id, 'email', 'user' || r || '@mycompany.com');
            INSERT INTO teav_ext_string (owner_oid, key, value) VALUES (id, 'eid', r);
            INSERT INTO teav_ext_string (owner_oid, key, value) VALUES (id, 'other-key-' || r, 'other-value-' || r);
        ELSEIF r % 10 <= 4 THEN
            -- rarely used values of hobbies key
            hobbies := random_pick(ARRAY['recording', 'guitar', 'beer', 'rum', 'writing', 'coding', 'debugging', 'gaming', 'shopping', 'watching videos', 'sleeping', 'dreaming'], 0.1);
            -- JSONB
            INSERT INTO tjson (name, ext) VALUES (
                'user-' || LPAD(r::text, 10, '0'),
                ('{"eid": ' || r || ', "hobbies": '|| array_to_json(hobbies)::text || '}')::jsonb
            );

            -- EAV
            INSERT INTO teav (name) VALUES ('user-' || LPAD(r::text, 10, '0')) RETURNING oid INTO id;
            INSERT INTO teav_ext_string (owner_oid, key, value) VALUES (id, 'eid', r);
            FOREACH v IN ARRAY hobbies LOOP
                    INSERT INTO teav_ext_string (owner_oid, key, value)
                    VALUES (id, 'hobbies', v);
                END LOOP;
        ELSE
            -- these values are used by many entries
            hobbies := random_pick(ARRAY['eating', 'books', 'music', 'dancing', 'walking', 'jokes', 'video', 'photo'], 0.4);
            -- JSONB
            INSERT INTO tjson (name, ext) VALUES (
                'user-' || LPAD(r::text, 10, '0'),
                ('{"eid": ' || r || ', "hobbies": '|| array_to_json(hobbies)::text || '}')::jsonb
            );

            -- EAV
            INSERT INTO teav (name) VALUES ('user-' || LPAD(r::text, 10, '0')) RETURNING oid INTO id;
            INSERT INTO teav_ext_string (owner_oid, key, value) VALUES (id, 'eid', r);
            FOREACH v IN ARRAY hobbies LOOP
                INSERT INTO teav_ext_string (owner_oid, key, value)
                VALUES (id, 'hobbies', v);
            END LOOP;
        END IF;

        IF r % 1000 = 0 THEN
            COMMIT;
        END IF;
    END LOOP;
END $$;

Results for 15 million rows

Both JSON and EAV tables contain 15M rows, EAV extension detail table contains 45.8M rows.

Object Size Note

Object	Size	Note
teav_ext_string	2841 MB	attribute value table (detail for EAV)
teav_ext_string_pk	2373 MB	PK of AV table, nearly the same size
tjson	1927 MB	JSON table (smaller than just AV table alone)
tjson_ext_idx	1616 MB	GIN index for `ext` column
teav_ext_string_key_value_idx	880 MB	index for searching by key+value
teav	862 MB	master table for EAV model
teav_oid_pk	451 MB	primary key index of that table
tjson_oid_pk	451 MB	PK of JSON table, virtually the same as for EAV
teav_name_key, tjson_name_key	451 MB	name indexes are the same

teav_ext_string

2841 MB

attribute value table (detail for EAV)

teav_ext_string_pk

2373 MB

PK of AV table, nearly the same size

tjson

1927 MB

JSON table (smaller than just AV table alone)

tjson_ext_idx

1616 MB

GIN index for ext column

teav_ext_string_key_value_idx

880 MB

index for searching by key+value

teav

862 MB

master table for EAV model

teav_oid_pk

451 MB

primary key index of that table

tjson_oid_pk

451 MB

PK of JSON table, virtually the same as for EAV

teav_name_key, tjson_name_key

451 MB

name indexes are the same

Counts JSON:

Operation Time avg (ms) Note

Operation	Time avg (ms)	Note
select count(*) from tjson	6512	seq-scan by default; doesn’t use PK index but if forced to, the time goes to ~2s; some additional index on `(ext→>'…')` can be uses and time may drop to ~1.3s
… where ext @> '{"hobbies":["video"]}'	8200	seq-scan, low selectivity, matches 3.4M of 15M rows
… where ext @> '{"hobbies":["sleeping"]}'	181	high selectivity, matches ~50k rows, uses GIn index, Gather / Partial Aggregate / Parallel Bitmap Heap Scan on tjson / Bitmap Index Scan on tjson_ext_idx
select count(*) from tjson where ext→>'email' LIKE 'user2%'	7174	matches ~222k rows, seq-scan; index `ON tjson ext→>'email'` doesn’t seem to help (not used, even after `ANALYZE`)
where UPPER(ext→>'email') LIKE 'USER2%'	7855	like above with little overhead because of the function; even ignores the index on (UPPER(ext→>'email'))

select count(*) from tjson

6512

seq-scan by default; doesn’t use PK index but if forced to, the time goes to ~2s; some additional index on (ext→>'…') can be uses and time may drop to ~1.3s

… where ext @> '{"hobbies":["video"]}'

8200

seq-scan, low selectivity, matches 3.4M of 15M rows

… where ext @> '{"hobbies":["sleeping"]}'

181

high selectivity, matches ~50k rows, uses GIn index, Gather / Partial Aggregate / Parallel Bitmap Heap Scan on tjson / Bitmap Index Scan on tjson_ext_idx

select count(*) from tjson where ext→>'email' LIKE 'user2%'

7174

matches ~222k rows, seq-scan; index ON tjson ext→>'email' doesn’t seem to help (not used, even after ANALYZE)

where UPPER(ext→>'email') LIKE 'USER2%'

7855

like above with little overhead because of the function; even ignores the index on (UPPER(ext→>'email'))

Counts EAV:

Operation Time avg (ms) Note

Operation	Time avg (ms)	Note
select count(*) from teav_ext_string	6382	informational, otherwise useless, for 15M `teav` rows returns ~47M rows
select count(*) from teav	2284	seq-scan
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')	17,465	seq-scan, low selectivity
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')	1965	Gather / Partial Aggregate / Nested Loop ( Parallel Bitmap Heap Scan on teav_ext_string / Bitmap Index Scan on teav_ext_string_key_value_idx, Index Only Scan using teav_oid_pk)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')	14,932	very slow, seq-scan on ext table, `teav` PK used for semi-join (that’s OK)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')	11,869	Aggregate / Nested Loop ( HashAggregate / Gather / seq-scan on the ext table, Index Only Scan using teav_oid_pk)
select count(owner_oid) from teav_ext_string where key = 'email' and value LIKE 'user2%'	~7s	the same result for single-valued extension, hadly to be expected from query interpreter, still seq-scan

select count(*) from teav_ext_string

6382

informational, otherwise useless, for 15M teav rows returns ~47M rows

select count(*) from teav

2284

seq-scan

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')

17,465

seq-scan, low selectivity

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')

1965

Gather / Partial Aggregate / Nested Loop ( Parallel Bitmap Heap Scan on teav_ext_string / Bitmap Index Scan on teav_ext_string_key_value_idx, Index Only Scan using teav_oid_pk)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')

14,932

very slow, seq-scan on ext table, teav PK used for semi-join (that’s OK)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')

11,869

Aggregate / Nested Loop ( HashAggregate / Gather / seq-scan on the ext table, Index Only Scan using teav_oid_pk)

select count(owner_oid) from teav_ext_string where key = 'email' and value LIKE 'user2%'

~7s

the same result for single-valued extension, hadly to be expected from query interpreter, still seq-scan

All selects have LIMIT 500 for practical reasons (unless stated differently), pgbench would try to read all the lines otherwise.

Selects - JSON:

Operation Time avg (ms) Note

Operation	Time avg (ms)	Note
select * from tjson	0.949	doesn’t need index
… where ext @> '{"hobbies":["video"]}'	2.034	doesn’t need index
… order by oid	6.32	index scan `tjson_oid_pk` + filter
… and oid>'fffe0000-0000-0000-0000-000000000000'	0.97	PK index scan + index cond + filter, this demonstrates the power of keyset pagination (this returns last "page", less than 500)
select * from tjson where ext @> '{"hobbies":["sleeping"]}'	44.94	rare compared to "video", may use `tjson_ext_idx` (GIN index) or seq-scan, both is OK, takes longer to find 500 values than for "video"
… order by oid	92.34	index scan `tjson_oid_pk` + filter; may also use `tjson_ext_idx` (both OK)
where ext→>'email' LIKE 'user2%'	2.25	seq-scan; ignores the index on `ext→>'email'` even though it matches only 1.5% of the table
… order by oid	90.7	index scan `tjson_oid_pk` + filter
… and oid>'fffe0000-0000-0000-0000-000000000000'	5.34	bitmap index scan `tjson_oid_pk` + filter (why bitmap?)
where UPPER(ext→>'email') LIKE 'USER2%'	2.40	seq-scan; ignores the functional index on `(upper(ext→>'email'))`
… order by oid	95.0	index scan `tjson_oid_pk` + filter
… and oid>'fffe0000-0000-0000-0000-000000000000'	5.00	index scan `tjson_oid_pk` + filter
where cast(ext→'eid' as int) = 5000	8133 (50 with index)	seq-scan, inefficient version of the query below, don’t use `=` where `@>` works better; but with the index on `(ext→'eid')::int` this goes down and can be used also for range searches (`<`, `>`, etc.)
where ext @> '{"eid":5000}'	0.32	Bitmap Index Scan on `tjson_ext_idx` + bitmap heap scan on tjson + recheck cond

select * from tjson

0.949

doesn’t need index

… where ext @> '{"hobbies":["video"]}'

2.034

doesn’t need index

… order by oid

6.32

index scan tjson_oid_pk + filter

… and oid>'fffe0000-0000-0000-0000-000000000000'

0.97

PK index scan + index cond + filter, this demonstrates the power of keyset pagination (this returns last "page", less than 500)

select * from tjson where ext @> '{"hobbies":["sleeping"]}'

44.94

rare compared to "video", may use tjson_ext_idx (GIN index) or seq-scan, both is OK, takes longer to find 500 values than for "video"

… order by oid

92.34

index scan tjson_oid_pk + filter; may also use tjson_ext_idx (both OK)

where ext→>'email' LIKE 'user2%'

2.25

seq-scan; ignores the index on ext→>'email' even though it matches only 1.5% of the table

… order by oid

90.7

index scan tjson_oid_pk + filter

… and oid>'fffe0000-0000-0000-0000-000000000000'

5.34

bitmap index scan tjson_oid_pk + filter (why bitmap?)

where UPPER(ext→>'email') LIKE 'USER2%'

2.40

seq-scan; ignores the functional index on (upper(ext→>'email'))

… order by oid

95.0

index scan tjson_oid_pk + filter

… and oid>'fffe0000-0000-0000-0000-000000000000'

5.00

index scan tjson_oid_pk + filter

where cast(ext→'eid' as int) = 5000

8133 (50 with index)

seq-scan, inefficient version of the query below, don’t use = where @> works better; but with the index on (ext→'eid')::int this goes down and can be used also for range searches (<, >, etc.)

where ext @> '{"eid":5000}'

0.32

Bitmap Index Scan on tjson_ext_idx + bitmap heap scan on tjson + recheck cond

Selects - EAV:

Operation Time avg (ms) Note

Operation	Time avg (ms)	Note
select * from teav	0.494	seq-scan
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')	6.184	index scan `teav_oid_pk` + seq-scan on ext table
… order by t.oid	4.734	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk` (order accidentally helps to pick better plan here)
… and t.oid>'fffe0000-0000-0000-0000-000000000000'	11.34	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk`
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')	24.28	index scan `teav_oid_pk` + bitmap index scan on `teav_ext_string_key_value_idx` (higher selectivity for "sleeping")
… order by t.oid	206	index scan `teav_oid_pk` + bitmap index scan on `teav_ext_string_key_value_idx`
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')	109	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk`
… order by t.oid	108	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk` + merge semi join
… and t.oid>'fffe0000-0000-0000-0000-000000000000'	78.1	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk` + gather merge / sort
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')	117	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk`
… order by t.oid	116	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk` + merge semi join
… and t.oid>'fffe0000-0000-0000-0000-000000000000'	90.6	index scan `teav_oid_pk` + index only scan `teav_ext_string_pk` + gather merge / sort
where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'eid' and es.value = '5000')	0.500	index scan `teav_oid_pk` + index scan `teav_ext_string_key_value_idx`

select * from teav

0.494

seq-scan

select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')

6.184

index scan teav_oid_pk + seq-scan on ext table

… order by t.oid

4.734

index scan teav_oid_pk + index only scan teav_ext_string_pk (order accidentally helps to pick better plan here)

… and t.oid>'fffe0000-0000-0000-0000-000000000000'

11.34

index scan teav_oid_pk + index only scan teav_ext_string_pk

select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')

24.28

index scan teav_oid_pk + bitmap index scan on teav_ext_string_key_value_idx (higher selectivity for "sleeping")

… order by t.oid

206

index scan teav_oid_pk + bitmap index scan on teav_ext_string_key_value_idx

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')

109

index scan teav_oid_pk + index only scan teav_ext_string_pk

… order by t.oid

108

index scan teav_oid_pk + index only scan teav_ext_string_pk + merge semi join

… and t.oid>'fffe0000-0000-0000-0000-000000000000'

78.1

index scan teav_oid_pk + index only scan teav_ext_string_pk + gather merge / sort

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')

117

index scan teav_oid_pk + index only scan teav_ext_string_pk

… order by t.oid

116

index scan teav_oid_pk + index only scan teav_ext_string_pk + merge semi join

… and t.oid>'fffe0000-0000-0000-0000-000000000000'

90.6

index scan teav_oid_pk + index only scan teav_ext_string_pk + gather merge / sort

where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'eid' and es.value = '5000')

0.500

index scan teav_oid_pk + index scan teav_ext_string_key_value_idx

There was little to no difference between semi-join (EXISTS) and LEFT JOIN wherever tried. Not all selects were tried with both, but EXISTS returns the right number of rows so it was preferred.

Performance with more RAM/CPU

After resizing the VirtualBox, the following parameters were used (uninformed setup using this site):

# DB Version: 13
# OS Type: linux
# DB Type: web
# Total Memory (RAM): 8 GB
# CPUs num: 4
# Connections num: 100
# Data Storage: ssd

max_connections = 100
shared_buffers = 2GB
effective_cache_size = 6GB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.7 # 0.9 for DB Type: oltp
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 10485kB
min_wal_size = 1GB # 2GB for oltp
max_wal_size = 4GB # 8GB for oltp
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2

Most of these are already commented out by default in postresql.conf, the rest was commented too, not to collide with the added section above. Benchmark was run a couple of times first to warm up the DB before results were used. It seems that selects are longer if checkpoint_completion_target is set to 0.9. Last column are results with -c 10 option added to pgbench, all previous tests were single-client. 10 clients should overload the configuration.

Operation	2GB/1CPU (ms)	8GB/4CPU (ms) (speedup)	with 10 clients (slowdown)
select count(*) from tjson	6512	440 (14.8x)	3324 (7.6x)
… where ext @> '{"hobbies":["video"]}'	8200	1388 (5.9x)	8691 (6.3x)
… where ext @> '{"hobbies":["sleeping"]}'	181	63.4 (2.9x)	161 (2.5x)
select count(*) from tjson where ext→>'email' LIKE 'user2%'	7174	758 (9.5x)	5201 (6.9x)
where UPPER(ext→>'email') LIKE 'USER2%'	7855	1196 (6.6x)	7080 (5.9x)
select count(*) from teav_ext_string	6382	2213 (2.9x)	7935 (3.6x)
select count(*) from teav	2284	337 (6.8x)	2997 (8.9x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')	17465	3324 (5.3x)	58565 (17.6x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')	1965	87 (22.6x)	2966 (34.1x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')	14932	2499 (6x)	22974 (9.2x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')	11869	2188 (5.4x)	9253 (4.2x)
select * from tjson limit 500	0.949	0.734 (1.3x)	5.781 (7.9x)
… where ext @> '{"hobbies":["video"]}'	2.034	1.873 (1.1x)	4.815 (2.6x)
… order by oid	6.32	2.425 (2.6x)	6.144 (2.5x)
… and oid>'fffe0000-0000-0000-0000-000000000000'	0.97	0.761 (1.3x)	3.548 (4.7x)
select * from tjson where ext @> '{"hobbies":["sleeping"]}'	44.94	41.82 (1.1x)	100.546 (2.4x)
… order by oid	92.34	58.79 (1.6x)	155.131 (2.6x)
where ext→>'email' LIKE 'user2%'	2.25	1.93 (1.2x)	5.068 (2.6x)
… order by oid	90.7	23.19 (3.9x)	78.79 (3.4x)
… and oid>'fffe0000-0000-0000-0000-000000000000'	5.34	0.59 (9.1x)	3.441 (5.8x)
where UPPER(ext→>'email') LIKE 'USER2%'	2.4	2.19 (1.1x)	5.152 (2.4x)
… order by oid	95	25.2 (3.8x)	78.07 (3.1x)
… and oid>'fffe0000-0000-0000-0000-000000000000'	5	0.65 (7.7x)	3.23 (5x)
where ext @> '{"eid":5000}'	0.32	0.25 (1.3x)	1.5 (6x)
select * from teav limit 500	0.494	0.419 (1.2x)	1.5 (3.6x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')	6.184	2.384 (2.6x)	6.725 (2.8x)
… order by t.oid	4.734	2.225 (2.1x)	7.919 (3.6x)
… and t.oid>'fffe0000-0000-0000-0000-000000000000'	11.34	4.815 (2.4x)	267.21 (55.5x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')	24.28	2.736 (8.9x)	11.992 (4.4x)
… order by t.oid	206	42.8 (4.8x)	267.21 (6.2x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')	109	40.6 (2.7x)	315.2 (7.8x)
… order by t.oid	108	35.7 (3x)	106.5 (3x)
… and t.oid>'fffe0000-0000-0000-0000-000000000000'	78.1	21.9 (3.6x)	43.26 (2x)
… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')	117	39.3 (3x)	114.4 (2.9x)
… order by t.oid	116	39 (3x)	113.1 (2.9x)
… and t.oid>'fffe0000-0000-0000-0000-000000000000'	90.6	21.4 (4.2x)	41.61 (1.9x)
where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'eid' and es.value = '5000')	0.5	0.468 (1.1x)	2.229 (4.8x)

Operation

2GB/1CPU (ms)

8GB/4CPU (ms) (speedup)

with 10 clients (slowdown)

select count(*) from tjson

6512

440 (14.8x)

3324 (7.6x)

… where ext @> '{"hobbies":["video"]}'

8200

1388 (5.9x)

8691 (6.3x)

… where ext @> '{"hobbies":["sleeping"]}'

181

63.4 (2.9x)

161 (2.5x)

select count(*) from tjson where ext→>'email' LIKE 'user2%'

7174

758 (9.5x)

5201 (6.9x)

where UPPER(ext→>'email') LIKE 'USER2%'

7855

1196 (6.6x)

7080 (5.9x)

select count(*) from teav_ext_string

6382

2213 (2.9x)

7935 (3.6x)

select count(*) from teav

2284

337 (6.8x)

2997 (8.9x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')

17465

3324 (5.3x)

58565 (17.6x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')

1965

87 (22.6x)

2966 (34.1x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')

14932

2499 (6x)

22974 (9.2x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')

11869

2188 (5.4x)

9253 (4.2x)

select * from tjson limit 500

0.949

0.734 (1.3x)

5.781 (7.9x)

… where ext @> '{"hobbies":["video"]}'

2.034

1.873 (1.1x)

4.815 (2.6x)

… order by oid

6.32

2.425 (2.6x)

6.144 (2.5x)

… and oid>'fffe0000-0000-0000-0000-000000000000'

0.97

0.761 (1.3x)

3.548 (4.7x)

select * from tjson where ext @> '{"hobbies":["sleeping"]}'

44.94

41.82 (1.1x)

100.546 (2.4x)

… order by oid

92.34

58.79 (1.6x)

155.131 (2.6x)

where ext→>'email' LIKE 'user2%'

2.25

1.93 (1.2x)

5.068 (2.6x)

… order by oid

90.7

23.19 (3.9x)

78.79 (3.4x)

… and oid>'fffe0000-0000-0000-0000-000000000000'

5.34

0.59 (9.1x)

3.441 (5.8x)

where UPPER(ext→>'email') LIKE 'USER2%'

2.4

2.19 (1.1x)

5.152 (2.4x)

… order by oid

25.2 (3.8x)

78.07 (3.1x)

… and oid>'fffe0000-0000-0000-0000-000000000000'

0.65 (7.7x)

3.23 (5x)

where ext @> '{"eid":5000}'

0.32

0.25 (1.3x)

1.5 (6x)

select * from teav limit 500

0.494

0.419 (1.2x)

1.5 (3.6x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')

6.184

2.384 (2.6x)

6.725 (2.8x)

… order by t.oid

4.734

2.225 (2.1x)

7.919 (3.6x)

… and t.oid>'fffe0000-0000-0000-0000-000000000000'

11.34

4.815 (2.4x)

267.21 (55.5x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping')

24.28

2.736 (8.9x)

11.992 (4.4x)

… order by t.oid

206

42.8 (4.8x)

267.21 (6.2x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%')

109

40.6 (2.7x)

315.2 (7.8x)

… order by t.oid

108

35.7 (3x)

106.5 (3x)

… and t.oid>'fffe0000-0000-0000-0000-000000000000'

78.1

21.9 (3.6x)

43.26 (2x)

… where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and UPPER(es.value) LIKE 'USER2%')

117

39.3 (3x)

114.4 (2.9x)

… order by t.oid

116

39 (3x)

113.1 (2.9x)

… and t.oid>'fffe0000-0000-0000-0000-000000000000'

90.6

21.4 (4.2x)

41.61 (1.9x)

where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'eid' and es.value = '5000')

0.5

0.468 (1.1x)

2.229 (4.8x)

JSONB vs EAV comparison of similar operations

Operations are shown as JSON version selects, avg times are taken from 8GB/4CPU configuration. Significant victories are in bold.

Operation (on JSON version)	JSON avg (ms)	EAV avg (ms)
select count(*) from tjson	440	337
… where ext @> '{"hobbies":["video"]}'	1388	3324
… where ext @> '{"hobbies":["sleeping"]}'	63.4	87
select count(*) from tjson where ext→>'email' LIKE 'user2%'	758	2499
where UPPER(ext→>'email') LIKE 'USER2%'	1196	2188
non-weighted counts total	3845	10648
select * from tjson	0.734	0.419
… where ext @> '{"hobbies":["video"]}'	1.873	2.384
… order by oid	2.425	2.225
… and oid>'fffe0000-0000-0000-0000-000000000000'	0.761	4.815
select * from tjson where ext @> '{"hobbies":["sleeping"]}'	41.82	2.736
… order by oid	58.79	42.8
where ext→>'email' LIKE 'user2%'	1.93	40.6
… order by oid	23.19	35.7
… and oid>'fffe0000-0000-0000-0000-000000000000'	0.59	21.9
where UPPER(ext→>'email') LIKE 'USER2%'	2.19	39.3
… order by oid	25.2	39
… and oid>'fffe0000-0000-0000-0000-000000000000'	0.65	21.4
where ext @> '{"eid":5000}'	0.25	0.468
non-weighted select total	160 ms	254 ms

Update

The problem with JSONB type is that it is not efficient to update, see this SO answer. Even though there are functions to update JSONB in a fine-grained manner, the whole row is written anyway, which also solves some MVCC (Multiversion Concurrency Control) problems. This can make little updates of large rows prohibitive.

There are cases when extension property holds thousands of items (e.g. groups in LDAP) and these must be written as a whole document when JSONB is used. Such properties are often index-only, that is, they are not stored in fullObject serialized form. With EAV model only small individual rows would be manipulated (deleted or inserted) which may be more efficient.

Using JSONB still may provide reasonable performance and there are various solutions:

The final extension value may be calculated in midPoint, JSONB is created and updated as a whole. This means the old value of the property must be obtained first (this can be controlled by configuration flags enableNoFetchExtensionValuesInsertion/Deletion for the old repository) it must travel from the DB to the application but makes the update process cleaner and more reliable from midPoint perspective.
Only updates are sent to the database. This does not prevent writing of the whole row internally but saves network bandwidth. There are currently no tests whether the updates (especially multiple updates like add/delete combinations) are more efficient in the database or in midPoint (including additional traffic). Also, this can lead to inconsistent data, e.g. double adds.

Also, main reason NOT to use JSONB is when the data is used in a relational way, e.g. for JOINs. This is not the case - except maybe for extension references if they are traversed with @. Various WHERE conditions can be supported by GIN indexes, optionally with gin_trgm_ops for better (I)LIKE support.

Other types

Various data types are possible in extension attributes and these must be represented in JSONB.

Keys will actually be surrogate IDs from extension item catalog, probably numbers in ". Shortcuts are possible if readability is important, but unwrapping the JSONB in selects is possbile if necessary (see the next section).
Numbers can be representad as JSON numbers which support NUMERIC range internally. This would allow for numeric comparison not possible with some number types in old repository.
Date/time types can be represented as strings formated using ISO 8601. This allows for comparison as well as the string representation order is the same as temporal (but not between various date/time types, of course).
Multivalues are stored consistently as arrays, even when one value is stored. No value/empty array can be omitted. It may be wise to store any value in an array, which makes the system simpler, although the values must be always unwrapped (e.g. for comparision and corresponding indexes). But this allows for change of maxOccurs in the extension schema from 1 to more or unbounded without any need of reindexing.

In general mixing scalar and arrays for operations can be tricky. Conditions ext @> '{"hobbies":["eating"]}' and ext @> '{"hobbies":"eating"}' (without array) are two different conditions that do not work the same way. Using arrays very time means there is no need to reindex when maxOccurs in shema changes. Threre are still other cases when recomputing of extension column (aka "reindexing") is relevant.

Example of JSON unwrapping to rows

Given a row:

select * from tjson
where oid = '0cbe39c7-c7af-4cf3-a334-098400284a0a';

That returns something like:

oid	name	ext
0cbe39c7-c7af-4cf3-a334-098400284a0a	user-0000038799	{"eid": 38799, "hobbies": ["eating", "books", "dancing", "walking", "video"]}

oid

name

ext

0cbe39c7-c7af-4cf3-a334-098400284a0a

user-0000038799

{"eid": 38799, "hobbies": ["eating", "books", "dancing", "walking", "video"]}

We can explode all the top level values including the array with this select:

select oid, key, coalesce(aval, sval) val from (
    select oid, key,
        CASE WHEN jsonb_typeof(value) = 'array' THEN value END avals,
        CASE WHEN jsonb_typeof(value) <> 'array' THEN value END sval
    from tjson, jsonb_each(tjson.ext) fields
) x left join jsonb_array_elements(avals) aval on true
where oid = '0cbe39c7-c7af-4cf3-a334-098400284a0a';

Which returns:

oid	key	val
0cbe39c7-c7af-4cf3-a334-098400284a0a	eid	38799
0cbe39c7-c7af-4cf3-a334-098400284a0a	hobbies	"eating"
0cbe39c7-c7af-4cf3-a334-098400284a0a	hobbies	"books"
0cbe39c7-c7af-4cf3-a334-098400284a0a	hobbies	"dancing"
0cbe39c7-c7af-4cf3-a334-098400284a0a	hobbies	"walking"
0cbe39c7-c7af-4cf3-a334-098400284a0a	hobbies	"video"

oid

key

val

0cbe39c7-c7af-4cf3-a334-098400284a0a

eid

38799

0cbe39c7-c7af-4cf3-a334-098400284a0a

hobbies

"eating"

0cbe39c7-c7af-4cf3-a334-098400284a0a

hobbies

"books"

0cbe39c7-c7af-4cf3-a334-098400284a0a

hobbies

"dancing"

0cbe39c7-c7af-4cf3-a334-098400284a0a

hobbies

"walking"

0cbe39c7-c7af-4cf3-a334-098400284a0a

hobbies

"video"

Obviously, this is not performance tuned, but for a single/few object(s) it is no problem at all.

Conclusion

JSON with index takes less space than EAV+ext table with their indexes (4GB vs 7GB, not counting unique indexes on names). This may change when keys are normalized using "ext item catalog", but both JSON and EAV would benefit from it.
Performance wise they are trading blows, but JSON is ~2 times faster overall, depending on the mix of selects.
- EAV’s biggest win was finding any 500 rows with hobbies=sleeping (15x faster).
- JSON was over 20x faster for 4 selects, all related to email LIKE searches (no index support on either solution).
There are ways how to improve searching with function based indexes, especially for specific attributes in JSON (see the slow select with eid above).
Extension table index on key+value was used only for rare multi-value values (sleeping) and for exact matches of single values (eid). There may be other cases where it is used that were not tested here. Also, changing the VARCHAR key to INTEGER (FK to ext-item catalog) could entice PG to use it.
Counts are always much slower than limited select.
Ordering slows things down
Avoid unlimited selects.

Which model to use? I’d start with JSON but added possibility to store extension attribute in a flexible way. This complicates the implementation but can offer the better performing options. But this can be done later, including transparent extension item storage migration, etc.

Was this page helpful?

YES NO

Thanks for your feedback