Practical JSON · PgConf India · 2020-03-16 · Ivan Panchenko · Postgres Professional
TRANSCRIPT
Practical JSON
Ivan Panchenko
Postgres Professional
About myself
• Astronomer
• Application developer since 1996
• Use PostgreSQL since 1999
• Use JSON in PostgreSQL since 2009
(yes, before it was supported)
• Use PL/Perl since 2009
• Cofounder of Postgres Professional Co.
About this talk
• Too little time to cover JSON fully
• So this talk does not cover JSON completely
• The documentation does not, either
• So I will give the main ideas
• And I hope they will be useful
How to store polymorphic rows?
EAV and brothers? No!
2003 (Pg 8.2) : HStore extension (T.Sigaev, O.Bartunov)
– a data type to store sets of key-value pairs: 'a=>1, b=>2'::hstore
Operators: • Retrieve value by key : hstore -> text
• Key exists? : hstore ? text
• Does one hstore contain another? : hstore @> hstore
Index support with GiST, GIN (since 2006)
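A minimal hstore session showing these operators (assuming the contrib extension is available; the table name is illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

SELECT 'a=>1, b=>2'::hstore -> 'a';     -- value by key: '1'
SELECT 'a=>1, b=>2'::hstore ? 'b';      -- key exists: true
SELECT 'a=>1, b=>2'::hstore @> 'a=>1';  -- containment: true

-- GIN index accelerating ? and @> searches:
CREATE TABLE h (attrs hstore);
CREATE INDEX h_attrs_idx ON h USING GIN (attrs);
```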
Jsonb comes to help
JSON:
• Standard
• Multi-level
• JSONB (better)
HStore:
• Fast
• Multi-level Hstore (proposal)
• 2011: Joseph Adams proposed a JSON extension for 9.1
(not accepted)
• 2012: 9.2: json data type (storage and verification)
• 2012: Nested HStore proposal
• 2014: 9.4 JSONB (Bartunov, Korotkov, Sigaev)
• 2016: SQL:2016 standard: JSON support (Technical Report)
• 2019: 12 SQL/JSON (Bartunov, Korotkov, Glukhov)
https://commitfest.postgresql.org/17/1471/
[Chart: Google Trends, jsonb vs hstore]
[Chart: DB-Engines.com ranking (relative)]
JSON or JSONB ?
JSON:
• Text
• Preserves key order
• Allows duplicate keys
• Keeps spaces
• Compact (if no spaces); may be 2 times smaller than JSONB for short values
• Fast INSERT
• No @> (contains) operator
JSONB:
• Fast values retrieval (up to
1000 times!)
• Sorted and unique keys.
• Index support
JSON was not
designed for
storage !!!
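The distinction is easy to see on a single cast: json keeps the input text verbatim, while jsonb parses it into a binary form, dropping spaces and duplicate keys (the last duplicate wins) and sorting keys:

```sql
SELECT '{"b":1, "a":2, "a":3}'::json;
-- {"b":1, "a":2, "a":3}

SELECT '{"b":1, "a":2, "a":3}'::jsonb;
-- {"a": 3, "b": 1}
```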
Start practicing….
Creating JSON(B)
1. From textual representation: '{}'::jsonb
2. Constructing functions: jsonb_build_object(), array_to_json(), …
https://postgrespro.com/docs/postgresql/12/functions-json
3. From database objects:
SELECT row_to_json(pg_class) FROM pg_class;
Values retrieval
1. Operators: json->text, json->>text:
SELECT row_to_json(pg_class)->>'relname'
FROM pg_class;
2. Operators: json #> text[], json #>>text[]
3. Iterator functions: json_each, json_each_text,
json_array_elements, json_array_elements_text
SELECT * FROM
json_each_text('{"a":"foo","b":"bar"}')
key | value
-----+-------
a | foo
b | bar
Simple search
1. Operators json->text, json->>text:
WHERE js ->>'key' = 'value';
2. Operators json #> text[], json #>>text[]
WHERE js #>>ARRAY['key'] = 'value';
3. Functions json_each, json_each_text
WHERE EXISTS (
SELECT * FROM json_each_text(js)
WHERE key='key' AND value='value'
);
(a B-Tree expression index can serve such searches)
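A sketch of such an expression index, using the tbl(js jsonb) table from the examples below (note the double parentheses required around the expression):

```sql
CREATE INDEX tbl_key_idx ON tbl ((js ->> 'key'));

-- This predicate can now use the B-Tree index:
SELECT * FROM tbl WHERE js ->> 'key' = 'value';
```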
Test data
{ "company_name": "Postgres Vr̥ttipara",
"offices": [
{ "name": "Main",
"city": "Bengaluru",
"area": 1000
},
{ "name": "Secondary",
"city": "Pune",
"area" : 200
}
]
}
CREATE TABLE tbl (js jsonb); -- Bad practice!
“Contains” search (@>)
Which companies have offices in Bengaluru?
WHERE js @> '{"offices":[ {"city" : "Bengaluru"} ]}'
CREATE INDEX idx ON tbl USING GIN (js);
Bitmap Heap Scan on tbl (cost=68.77..411.01 rows=100 width=194)
Recheck Cond: (js @> '{"offices": [{"city": "Bengaluru"}]}'::jsonb)
-> Bitmap Index Scan on idx (cost=0.00..68.75 rows=100 width=0)
Index Cond: (js @> '{"offices": [{"city": "Bengaluru"}]}'::jsonb)
(GIN index is used)
More complicated search
Which companies have offices in any of the two cities?
WHERE
js @>
'{"offices":[ {"city" : "Bengaluru"} ]}'
OR
js @>
'{"offices":[ {"city" : "Pune"} ]}'
Other (pure SQL) way:
WHERE EXISTS (
SELECT *
FROM jsonb_array_elements(js->'offices') t
WHERE t->>'city' IN ('Bengaluru', 'Pune')
);
(the GIN index serves the @> form, not the pure-SQL alternative)
Even more complicated search
Which companies have MAIN office in any of the two cities?
WHERE
js @>
'{"offices":[
{"city": "Bengaluru", "name": "Main" }
]}'
OR
js @>
'{"offices":[
{"city": "Pune", "name": "Main" }
]}'
(GIN index is used)
SQL style…
Which companies have MAIN office in any of the two cities?
WHERE EXISTS(
SELECT *
FROM jsonb_array_elements(js->'offices') t
WHERE t->>'city' IN ('Bengaluru', 'Pune')
AND t->>'name' = 'Main'
);
No index acceleration …
Most complicated search for today
Which companies have MAIN office with area > 500 sq.m in
any of the two cities?
WHERE EXISTS(
SELECT *
FROM jsonb_array_elements(js->'offices') t
WHERE t->>'city' IN ('Bengaluru', 'Pune')
AND t->>'name' = 'Main'
AND (t->>'area')::float > 500
);
@> operator cannot help for >…
(in fact, it can, if used for preselection)
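A sketch of that preselection pattern: the indexable @> condition narrows the candidate rows, and the exact per-element check runs only on those:

```sql
SELECT * FROM tbl
WHERE js @> '{"offices": [{"name": "Main"}]}'  -- GIN-indexable preselection
  AND EXISTS (
    SELECT 1
    FROM jsonb_array_elements(js -> 'offices') t
    WHERE t ->> 'name' = 'Main'
      AND t ->> 'city' IN ('Bengaluru', 'Pune')
      AND (t ->> 'area')::float > 500          -- the part @> cannot express
  );
```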
Some conclusions
• Search with -> is good and can be fast with B-Tree functional indexes, but has poor flexibility.
• Search with @> is more flexible, and fast with GIN, but not effective for complicated queries.
• SQL-style search is the most flexible, but not effective.
???????
Advanced JSONB search solutions
JSquery extension (Alexander Korotkov)
JSONPATH – SQL:2016 features in PostgreSQL 12
(O.Bartunov, N.Glukhov, A.Korotkov, L.Mantrova,
T.Sigaev)
JSON_TABLE – continuation in PostgreSQL 13
JSQuery
Defines a data type for a JSON search query language:
jsonb @@ jsquery
GIN index support (2 types of GIN indexes optimized for
different types of queries)
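A hedged sketch of what a jsquery search over the office data could look like (syntax and opclass name recalled from the extension's README; verify against the jsquery documentation before use):

```sql
CREATE EXTENSION jsquery;

SELECT * FROM tbl
WHERE js @@ 'offices.#(name = "Main" AND area > 500)'::jsquery;

-- one of the two GIN opclasses the extension provides:
CREATE INDEX tbl_jsq_idx ON tbl USING GIN (js jsonb_path_value_ops);
```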
JSONPATH
• Defines a JSONPATH data type for a jsonb query
language.
• Defines functions and operators.
• Defines an opclass for index search.
• Implements SQL:2016.
Why a datatype?
• Because we can (this is a Postgres way)
• To enable operators and indexes, leveraging
PostgreSQL extendability
JSONPATH: retrieval
City of the first office:
SELECT jsonb_path_query_first(
js, '$.offices[0].city'
) FROM tbl;
City of the Main office with area > 500
SELECT jsonb_path_query_first(
js,
'$.offices[*]
? (@.area > 500 && @.name == "Main")
.city'
) FROM tbl;
JSONPATH: search
SELECT * FROM tbl WHERE
js @?
'$.offices[*]
? (@.area > 500 && @.name == "Main")
.city'
CREATE INDEX idx1 ON tbl
USING gin (js jsonb_path_ops);
(GIN index is used)
Details on JSONPATH:
https://github.com/obartunov/sqljsondoc/blob/master/jsonpath.md
More opclasses in JSQuery
What else can we do with JSON ?
JSON enthusiast’s table:
-- ( not recommended! )
CREATE TABLE data (
fields jsonb
);
JSON enthusiast’s table:
(next day)
-- ( not recommended! )
CREATE TABLE data (
fields jsonb
);
CREATE INDEX data_pk
ON data ((fields->>'id'));
JSON enthusiast’s table:
(2 days later)
CREATE TABLE foo(
fields jsonb
);
CREATE UNIQUE INDEX data_pk
ON foo((fields->>'id'));
CREATE UNIQUE INDEX data_pki
ON foo (((fields->>'id')::int));
-- which is better?
INSERT INTO foo
SELECT jsonb_build_object('q', random(), 'id', x)
FROM generate_series(10000000,20000000) x;
Storing all fields in a single JSONB
Part 1. Storing the primary key in JSON
( not recommended! )
1. Impossible to define referential constraints:
CREATE TABLE data_smth (
data_id int4 REFERENCES ……… ???? WTF
);
2. Do not forget to check it for NOT NULL:
CHECK (fields->>'id' IS NOT NULL);
3. Still it is not a PK. Define IDENTITY for logical
replication…
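A sketch of the shape the talk recommends instead: promote the identifier to a real column and keep only the genuinely polymorphic data in jsonb:

```sql
CREATE TABLE data (
    id     int PRIMARY KEY,  -- real PK: NOT NULL, unique, replica identity for free
    fields jsonb
);

CREATE TABLE data_smth (
    data_id int REFERENCES data(id)  -- referential constraint now possible
);
```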
Storing all fields in a single JSONB
Part 2. Storing all other data in JSON
( not recommended! )
• Needs more space (keys + JSONB overhead), though not always!
• Slower retrieval (especially with TOAST)
• Index scan: roughly the same performance; Seq scan is 2-3 times
slower even without TOAST
• No statistics: planning is less effective.
• No data types. Even dates are missing!
• Less readability (sometimes)
A good article by Dan Robinson.
How to save space with JSON ?
• Merge several rows into one row
• Which rows to merge?
• Close in space-time, or semantically
• Example: part of a time series.
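For instance (all table and column names here are hypothetical), per-measurement rows can be rolled up into one row per sensor per hour, with the points packed into a jsonb array:

```sql
-- assumed source: measurements(sensor_id int, ts timestamptz, value float8)
SELECT sensor_id,
       date_trunc('hour', ts) AS hour,
       jsonb_agg(jsonb_build_object('ts', ts, 'value', value)
                 ORDER BY ts) AS points
FROM measurements
GROUP BY sensor_id, date_trunc('hour', ts);
```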
Which fields to store in JSON?
• Really polymorphic data
• Better to store schema metadata too
• Use something to restrict from chaos
• Data with internal structure
• Especially if you plan to use JSQuery or JSONPATH
• If you need to store history, regardless of the schema
changes
Polymorphic structure with versions
CREATE TABLE foo (
id int primary key,
subclass … ,
schema_version …
/*
SUPERCLASS_FIELDS
*/
subclass_fields jsonb
);
CREATE TRIGGER …
-- ensure schema
[Diagram: table foo stores Subclass1, Subclass2, Subclass3 rows;
metadata tables subclasses and subclass_versions describe the expected
subclass_fields]
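One possible sketch of the schema-ensuring trigger (the metadata table subclass_versions and its required_keys column are hypothetical):

```sql
CREATE FUNCTION check_subclass_fields() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    missing text;
BEGIN
    -- assumed metadata: subclass_versions(subclass, schema_version, required_keys text[])
    SELECT k INTO missing
    FROM subclass_versions v, unnest(v.required_keys) k
    WHERE v.subclass = NEW.subclass
      AND v.schema_version = NEW.schema_version
      AND NOT NEW.subclass_fields ? k   -- jsonb "key exists" operator
    LIMIT 1;
    IF missing IS NOT NULL THEN
        RAISE EXCEPTION 'subclass %: missing key %', NEW.subclass, missing;
    END IF;
    RETURN NEW;
END $$;

CREATE TRIGGER ensure_schema BEFORE INSERT OR UPDATE ON foo
FOR EACH ROW EXECUTE FUNCTION check_subclass_fields();
```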
Fields with internal structure
• Image
{
"file_name": "...",
"file_path": "...",
"width": ...,
"height": ...
}
• Query plan
a big tree returned by
EXPLAIN (FORMAT JSON)
Obtaining and storing query plans
• Cannot obtain it in pure SQL: SELECT FROM (EXPLAIN …) does not work
• Also impossible: EXPLAIN INTO …
• We need to write a function, like this:
CREATE FUNCTION explain (query text)
RETURNS jsonb
LANGUAGE plperl TRANSFORM FOR TYPE jsonb AS
$$
my ($sql) = @_;
my $res = spi_exec_query(
"EXPLAIN(FORMAT JSON) $sql", 1);
return $res->{rows}[0]->{"QUERY PLAN"};
$$;
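Once defined, a stored plan is just jsonb and can be queried like any other document; for example, extracting the top node type of a plan:

```sql
-- EXPLAIN (FORMAT JSON) yields an array with one element holding the "Plan" key
SELECT explain('SELECT 1') -> 0 -> 'Plan' ->> 'Node Type';
```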
JSON aggregates
• Problem: Obtain a book list together with authors
[Diagram: book -< authorship (pos int) >- person]
Field Theory // Landau, L.D., Lifshitz, E.M.
The C Programming Language // Kernighan, B.W., Ritchie, D.M.
JSON aggregates
Field Theory
Landau, L.D.; Lifshitz, E.M.
The C Programming Language
Kernighan, B.W.; Ritchie, D.M.
JSON aggregates
SELECT book.*,
( SELECT json_agg(row_to_json(x))
FROM (
SELECT person.*
FROM person
JOIN authorship
ON person.id = person_id
WHERE book_id = book.id
ORDER BY pos
) x
) authors
FROM book;
More JSON aggregates
• Expanding attributes from referenced tables
[Diagram: book references publisher and series]
SELECT book.*,
row_to_json(series) AS series,
row_to_json(publisher) AS publisher
FROM book
LEFT JOIN publisher
ON publisher.id = book.publisher
LEFT JOIN series
ON series.id = book.series
More complex JSON aggregates
• Retrieve a book with hierarchical classification
CREATE TABLE section ( /* Section tree */
id int PRIMARY KEY,
parent int REFERENCES section(id),
title text
);
CREATE TABLE book ( /* Books */
…
sections int[]
…
);
More complex JSON aggregates
• Retrieve a book with hierarchical classification
Output:
{ sections: [ /* paths of all sections of this book */
[ {id: ..., title: ...},
{id: ...} …
], -- 1st section path
[ {id: ...,
...
] -- 2nd section path
...
] }
Test data
INSERT INTO section VALUES
(1, NULL, 'DBMS'),
(2, 1, 'PostgreSQL'),
(3, NULL, 'OS'),
(4, 3, 'Solaris');
INSERT INTO book VALUES (
'Postgres for Solaris OS',
ARRAY[2,4]
);
Step 1: compute the section path
WITH RECURSIVE path AS (
SELECT s.id, s.title, s.parent , 0 AS level
FROM section s WHERE s.id = 4
UNION ALL
SELECT s.id, s.title, s.parent,
path.level+1 AS level
FROM section s
JOIN path ON s.id = path.parent
)
SELECT id, title FROM path ORDER BY level DESC;
Step 2: aggregate the section path
SELECT json_agg(row_to_json(path)) AS path
FROM ( WITH RECURSIVE path AS (
SELECT s.id, s.title, s.parent , 0 AS level
FROM section s WHERE s.id = 4
UNION ALL
SELECT s.id, s.title, s.parent, path.level+1 AS level
FROM section s JOIN path ON s.id = path.parent
) SELECT id,title FROM path ORDER BY level DESC
) path;
path
-----------------------------------------------------
[{"id":3,"title":"OS"}, {"id":4,"title":"Solaris"}]
(1 row)
Step 3: aggregate several paths
SELECT * FROM (
SELECT unnest(ARRAY[2,4]) node_id
) nodes
JOIN LATERAL ( SELECT json_agg(row_to_json(path)) AS path
FROM (
WITH RECURSIVE path AS (
SELECT s.id, s.title, s.parent, 0 AS level
FROM section s WHERE s.id = node_id
UNION ALL
SELECT s.id, s.title, s.parent, path.level+1 AS level
FROM section s JOIN path ON s.id = path.parent
) SELECT id,title FROM path ORDER BY level DESC
) path
) AS path ON true;
node_id | path
---------+----------------------------------------------------------
2 | [{"id":1,"title":"DBMS"}, {"id":2,"title":"PostgreSQL"}]
4 | [{"id":3,"title":"OS"}, {"id":4,"title":"Solaris"}]
(2 rows)
Step 4: final result
SELECT title, (
SELECT json_agg(path.path)
FROM ( SELECT unnest(book.sections) node_id
) nodes
JOIN LATERAL (
SELECT json_agg(row_to_json(path)) path
FROM (
WITH RECURSIVE path AS (
SELECT s.id, s.title, s.parent , 0 AS level
FROM section s WHERE s.id = node_id
UNION ALL
SELECT s.id, s.title, s.parent, path.level+1 AS level
FROM section s JOIN path ON s.id = path.parent
)
SELECT id,title FROM path ORDER BY level DESC
) path
) path ON true
) paths FROM book;
The output
-[ RECORD 1 ]-----------------------------------------------
title | Postgres for Solaris OS
paths |
[
[{"id":1,"title":"DBMS"}, {"id":2,"title":"PostgreSQL"}],
[{"id":3,"title":"OS"}, {"id":4,"title":"Solaris"}]
]
Disclaimer:
• The more complex the query is,
the sicker the optimizer gets
Extracting too
much data
increases your
carbon footprint
Several queries in one
Problem: return several result sets as one.
SELECT
( SELECT json_agg(row_to_json(book)) books
FROM book )::jsonb ||
( SELECT json_agg(row_to_json(section)) sections
FROM section)::jsonb;
or
SELECT json_build_object(
'books',
(SELECT json_agg(row_to_json(book)) books FROM
book ),
'sections',
(SELECT json_agg(row_to_json(section)) sections
FROM section));
Make a tree
Problem: build a multi-level hierarchical tree in JSON:
[{ id: 1, title: 'DBMS', nodes: [
{ id: 2, title: 'Relational DBMS',
nodes: [
……
etc.
• Of course we need a recursive CTE, but…
– a CTE adds rows, while we need to reduce the number of rows…
– Wow! Only aggregates do that! We need an aggregate.
Make a tree: plan
1) Preprocess the tree with CTE
2) Make a smart aggregate, which puts each node on its place
in the tree.
Step 1: Compute node paths WITH RECURSIVE positioned_section AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY parent ORDER BY id) -1 AS pos FROM section ), t AS ( SELECT id,title, ARRAY['nodes', pos::text] AS path FROM positioned_section WHERE parent IS NULL
UNION ALL
SELECT s.id, s.title, t.path || ARRAY['nodes', pos::text] AS path
FROM positioned_section s JOIN t ON s.parent = t.id
)
SELECT row_to_json(t)::jsonb node FROM t ORDER BY path;
Step 1: Result
node
---------------------------------------------------------
{"id": 1, "path": ["nodes", "0"], "title": "DBMS"}
{"id": 2, "path": ["nodes", "0", "nodes", "0"], "title": "PostgreSQL"}
{"id": 3, "path": ["nodes", "1"], "title": "OS"}
{"id": 4, "path": ["nodes", "1", "nodes", "0"], "title": "Solaris"}
(4 rows)
Step 2: Aggregate
CREATE OR REPLACE FUNCTION tree_agg_f (state jsonb, item jsonb)
RETURNS jsonb LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT jsonb_set(
state,
(SELECT array_agg(x)
FROM jsonb_array_elements_text(item->'path') x
),
jsonb_build_object(
'id', item->'id',
'title', item->'title',
'nodes','[]'::jsonb
),
true);
$$ ;
CREATE AGGREGATE tree_agg (item jsonb) (
SFUNC=tree_agg_f, STYPE=jsonb, INITCOND='{"nodes":[]}');
Step 3: 1+2
SELECT jsonb_pretty(tree_agg(x.node)) FROM (
WITH RECURSIVE
positioned_section AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY parent ORDER BY id)
-1 AS pos
FROM section
),
t AS (
SELECT id,title,
ARRAY['nodes', pos::text] AS path
FROM positioned_section WHERE parent IS NULL
UNION ALL
SELECT s.id, s.title,
t.path || ARRAY['nodes',pos::text] AS path
FROM positioned_section s JOIN t ON s.parent = t.id
)
SELECT row_to_json(t)::jsonb node FROM t ORDER BY path)
x;
One more example: Histogram
A single aggregate for calculating complex statistics.
For example:
Compute the frequencies of random integers from 0 to 20.
Obtaining random numbers:
SELECT (random()*20)::int
FROM generate_series(1,20);
Aggregate for frequencies
CREATE OR REPLACE FUNCTION freq_agg_f (
state jsonb, item int)
RETURNS jsonb LANGUAGE sql IMMUTABLE STRICT AS
$$
SELECT jsonb_set(
state,
ARRAY[item]::text[],
to_jsonb(
COALESCE((state->>(item::text))::int,0)+1
)
);
$$;
CREATE AGGREGATE freq_agg (int) (
SFUNC=freq_agg_f, STYPE=jsonb, INITCOND='{}');
The result
SELECT freq_agg ( (random()*20)::int ) FROM
generate_series(1,20);
freq_agg
----------------------------------------------------
{"2": 1, "3": 2, "5": 1, "6": 2, "10": 2, "11": 2,
"13": 1, "15": 3, "16": 3, "17": 1, "18": 2}
(1 row)
JSON in PL/*
SQL and PL/pgSQL are not always effective with JSON:
• The code is not always readable
• The object is copied for every operation
What about PL/Perl, PL/Python, PL/v8 ?
• PL/Perl and PL/Python get JSON as text, unless
TRANSFORM is explicitly used
• PL/v8 gets JSON transformed into internal
representation automatically
• PL/v8 is not included in PostgreSQL distribution and has
a version lag
Aggregate function in PL/v8
CREATE EXTENSION plv8;
CREATE FUNCTION v8_agg_f (state jsonb, item int)
RETURNS jsonb AS $$
state[item] = (state[item] || 0) + 1; // plain state[item]++ on undefined yields NaN
return state;
$$
LANGUAGE plv8 IMMUTABLE STRICT;
CREATE AGGREGATE v8_agg (int)
(SFUNC=v8_agg_f, STYPE=jsonb,
INITCOND='{}');
-- 4 times faster!
Aggregate function in PL/Perl
CREATE EXTENSION plperl;
CREATE EXTENSION jsonb_plperl; -- defines TRANSFORM
CREATE FUNCTION pl_agg_f (state jsonb, item int)
RETURNS jsonb AS $$
my ($state, $item) = @_;
$state->{$item} ++;
return $state;
$$
LANGUAGE plperl TRANSFORM FOR TYPE jsonb
IMMUTABLE STRICT;
CREATE AGGREGATE pl_agg (int)( SFUNC=pl_agg_f, STYPE=jsonb,
INITCOND='{}');
-- 2.5 times faster than SQL
For a snack: a NULL pitfall
• The jsonb_set function is STRICT
SELECT jsonb_set (some_large_json, path, NULL)
gives NULL. Usually people do not expect this.
Correct way:
SELECT jsonb_set (some_large_json, path,
'null'::jsonb)
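A quick demonstration of the pitfall on a small document:

```sql
SELECT jsonb_set('{"a": 1}'::jsonb, '{a}', NULL);          -- result is NULL
SELECT jsonb_set('{"a": 1}'::jsonb, '{a}', 'null'::jsonb); -- {"a": null}
```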
Future
• Single data type for JSON
• Storage optimization (TOAST modification, compression, etc)
• Lazy transform
• Faster search inside
• JSONPATH search engine optimization
More info
• Rethinking JSONB (PgCon 2015)
• JSON, JSONB, JSQuery (PgConf.Russia 2017)
• PL/{Perl,Python,V8}
• JSONB by example (PgConf.EU 2019)
• JSONPATH manual
Thanks for your attention
Questions: [email protected]