
DEVOXX 2023: Unlock the power of PostgreSQL

5 July 2023 | Java

Last month I had the opportunity to attend the 11th edition of Devoxx and a conference about PostgreSQL advanced features called “Profitez de PostgreSQL pour passer à la vitesse supérieure” by Emmanuel REMY.

Postgres is now at version 15 and is the 4th most popular database.

It is appreciated not only for its robustness but also because it offers, among other things, features such as support for JSON, geographical data and other non-relational data types.

In his 45-minute talk, he presented advanced features that Postgres 15 offers and that may not yet be fully exploited by many users.

Through a live demo, we got a glimpse of data types and functionalities, and discovered a little more about this ecosystem.

I’ll try to break down some of the topics and examples presented:

Examples:

  • So, besides the common data types such as varchar, numeric and date, we have specific types integrated into the language that we can manipulate and operate on natively: for example inet (IP addresses), macaddr, UUID, range types, etc. (a small range sketch follows this list)
  • Checking if an IP address is contained in a network
> select '192.168.1.19'::inet << '192.168.1.0/24'::cidr;
  • Adding 300 to an IP address
> SELECT '192.168.1.6'::inet + 300;
  • Creating an Enum as a data type, to define possible values for a column
> CREATE TYPE service_type AS ENUM ('IT', 'FINANCE', 'COMM');
  • Creating a domain, which is a data type with optional user-defined constraints. In this example, a regex for a telephone number.
> CREATE DOMAIN tel_type AS text CHECK (value ~ '\+336\d{8}$');
  • Not respecting the constraint defined for the data type will result in an error
> SELECT '+3378787878'::tel_type;

ERROR:  value for domain tel_type violates check constraint "tel_type_check"
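Range types, mentioned in the list above, were not part of the live demo, but here is a minimal sketch (my addition) of how they can be queried natively:

> SELECT 15 <@ int4range(10, 20);                                    -- true: 15 falls within [10,20)
> SELECT daterange('2023-01-01', '2023-07-01') @> date '2023-05-15'; -- true: the date falls in the range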

I – Generating series

  • You can generate a data series in order to populate a database (for test purposes, for instance). This lets you generate numerous rows without having to write and upload files (a small sketch of using generate_series in a FROM clause follows the outputs below).
    The function generate_series takes 3 arguments (start, stop, step)
> SELECT generate_series (5,100,3);
  • It is also possible to generate a series of dates. For instance, a series covering the last 10 days, with a 1-day interval.
> SELECT generate_series (current_date -interval '10 days', current_date, interval '1 day');

   generate_series
---------------------
 2023-05-04 00:00:00
 2023-05-05 00:00:00
 2023-05-06 00:00:00
 2023-05-07 00:00:00
 2023-05-08 00:00:00
 2023-05-09 00:00:00
 2023-05-10 00:00:00
 2023-05-11 00:00:00
 2023-05-12 00:00:00
 2023-05-13 00:00:00
 2023-05-14 00:00:00
(11 rows)

  • Or generate a series of hourly timestamps since yesterday, with an interval of one hour
> SELECT generate_series (current_date -interval '1 days', current_date, interval '1 hour');

   generate_series
---------------------
 2023-05-13 00:00:00
 2023-05-13 01:00:00
 2023-05-13 02:00:00
 2023-05-13 03:00:00
 2023-05-13 04:00:00
 2023-05-13 05:00:00
 2023-05-13 06:00:00
 2023-05-13 07:00:00
 2023-05-13 08:00:00
 2023-05-13 09:00:00
 2023-05-13 10:00:00
 2023-05-13 11:00:00
 2023-05-13 12:00:00
 2023-05-13 13:00:00
 2023-05-13 14:00:00
 2023-05-13 15:00:00
 2023-05-13 16:00:00
 2023-05-13 17:00:00
 2023-05-13 18:00:00
 2023-05-13 19:00:00
 2023-05-13 20:00:00
 2023-05-13 21:00:00
 2023-05-13 22:00:00
 2023-05-13 23:00:00
 2023-05-14 00:00:00
(25 rows)
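Beyond selecting the series directly, generate_series can also be used in the FROM clause to fabricate rows, which is what makes it handy for populating tables. A minimal sketch (my addition; the full insert example further below does the same at scale):

> SELECT id, 'item_' || id AS label FROM generate_series(1, 3) AS id;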

  • When creating a table, it is also possible to define a column as a concatenation of other columns. If the generated column is defined as STORED, its value is computed at the moment the data is written.
> CREATE TABLE clients (id serial, nom text, prenom text, email text, tel tel_type,
service service_type, code_interne int, date_naissance date,
description text GENERATED ALWAYS AS (prenom || ' ' || nom || ' (' || email || ')') STORED);
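As a quick illustration (my addition, not from the talk), the generated column is filled in automatically when a row is inserted and can be read like any other column, but it cannot be assigned a value explicitly:

> INSERT INTO clients (nom, prenom, email) VALUES ('Durand', 'Alice', 'alice@client.fr');
> SELECT description FROM clients WHERE nom = 'Durand';
-- expected result: Alice Durand (alice@client.fr)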
  • The next examples use an extension called postgresql-anonymizer
  • It allows masking or replacing personally identifiable information or sensitive data
  • It also provides functions to create random but plausible values
  • To learn more about it and the installation procedure, you can check: https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/

  • After installing it, you have to initialize it
> CREATE EXTENSION anon CASCADE;
> SELECT anon.init();
  • Creating a random (but plausible) last name and first name
SELECT anon.fake_last_name() as nom, anon.fake_first_name() as prenom;
  • Here is the complete example inserting the elements generated on the fly. 10,000 rows are generated to populate the table clients

> insert into clients (nom, prenom, email, tel, service, code_interne, date_naissance)
select anon.fake_last_name() as nom, anon.fake_first_name() as prenom,
'collaborateur_' || id || '@client.fr' as email,
('+336' || Left((random() * 100000000000)::int8 || '000', 8))::tel_type as tel,
CASE WHEN (id % 3 =0) THEN 'IT'
WHEN (id % 3 = 1) THEN 'FINANCE'
WHEN (id % 3 = 2) THEN 'COMM'
END :: service_type AS service,
((id % 29)+1)::int as code_interne,
date '2000-01-01' + random() * interval '-20 years' as date_naissance
from generate_series (1,10000) id;

INSERT 0 10000

  • Checking the results:
select * from clients;
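A quick way to sanity-check the generated data (my addition, not shown in the demo) is to look at how the rows are distributed per service; each of the three services should hold roughly a third of the 10,000 rows:

select service, count(*) from clients group by service;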

II – List Partitioning

  • First, drop the table previously created:
drop table clients;
  • Now a similar table can be created, but partitioned by service:
create table clients (id serial, nom text, prenom text, email text, tel tel_type,
service service_type, code_interne int, date_naissance date,
description text GENERATED ALWAYS AS (prenom || ' ' || nom || ' (' || email || ')') STORED)
PARTITION BY LIST (service);

  • Create the partitions:
create table clients_it PARTITION OF clients for values in ('IT');
create table clients_finance partition of clients for values in ('FINANCE');
create table clients_comm partition of clients for values in ('COMM');
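A detail not covered in the demo (my note): with list partitioning, a row whose partition key matches none of the declared values is rejected. A default partition can catch such rows; here all three enum values are already covered, so it is optional, but it would be created like this (clients_autres being a name of my choosing):

create table clients_autres partition of clients default;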
  • Generate and reinsert the data:
insert into clients (nom, prenom, email, tel, service, code_interne, date_naissance)
select anon.fake_last_name() as nom, anon.fake_first_name() as prenom,
'collaborateur_' || id || '@client.fr' as email,
('+336' || Left((random() * 100000000000)::int8 || '000', 8))::tel_type as tel,
CASE WHEN (id % 3 =0) THEN 'IT'
WHEN (id % 3 = 1) THEN 'FINANCE'
WHEN (id % 3 = 2) THEN 'COMM'
END :: service_type AS service,
((id % 29)+1)::int as code_interne,
date '2000-01-01' + random() * interval '-20 years' as date_naissance
from generate_series (1,10000) id;
  • Now let’s verify with ‘explain’ what happens when filtering by service. It does a sequential scan, but not on the whole table; only on the partition ‘clients_it’
explain select * from clients where service = 'IT';

Seq Scan on clients_it clients  (cost=0.00..104.66 rows=3333 width=115)

   Filter: (service = 'IT'::service_type)

(2 rows)
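As an aside (my addition, not from the talk), psql’s \d+ clients lists the partitions of the table, and since PostgreSQL 12 the partition tree can also be queried directly:

> SELECT relid, parentrelid, isleaf, level FROM pg_partition_tree('clients');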

III – Anonymizing and security labels

  • Besides being useful for tests, postgresql-anonymizer is especially interesting for preserving the original data while providing anonymized or pseudonymized data when it is read.
  • It works with the notion of SECURITY LABELS. You can specify whether you want to mask a field, show another value, show nothing at all, or pseudonymize it, for a particular role for example

Example:

  • To always show the string ‘NO_NAME’ for the column nom
SECURITY LABEL FOR anon ON COLUMN public.clients.nom IS 'MASKED WITH VALUE $$NO_NAME$$';
  • To show null for the column id
SECURITY LABEL FOR anon ON COLUMN public.clients.id IS 'MASKED WITH VALUE NULL';
  • To show another value for an email
SECURITY LABEL FOR anon ON COLUMN public.clients.email IS 'MASKED WITH VALUE $$test_email@mycompany.fr$$';
  • Declare a masked role
CREATE ROLE analyst WITH LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst;
SECURITY LABEL FOR anon ON ROLE analyst IS 'MASKED';
  • If a user with the role ‘analyst’ connects to the database and reads this table, they will not see the actual values but the masked ones instead:
select * from clients;
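One point I am less sure about and that may depend on the version of the extension: dynamic masking may have to be switched on explicitly before the masked role sees the substituted values. In the documentation linked above this is done with:

> SELECT anon.start_dynamic_masking();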

IV – Securing data access

Through RLS (Row Level Security), a user has access to the table, but only to the rows their role allows.

Example: you can do this by creating a group role soc_securite and 2 member roles: soc_finance and soc_it

CREATE ROLE soc_securite;
GRANT ALL ON clients TO soc_securite;
CREATE ROLE soc_it; GRANT soc_securite to soc_it;
CREATE ROLE soc_finance; GRANT soc_securite to soc_finance;
  • The key to restricting data access is to create a policy for a role.
  • In this example, each member role has access only to the rows related to a certain service (defined with the keyword USING). Note that row level security also has to be enabled on the table; see the note after this list.
CREATE POLICY sp_acces_clients_par_service ON clients
AS RESTRICTIVE
FOR SELECT TO soc_securite
USING ( service = upper (substring(current_user from '.*_(.*)$'))::service_type );
  • This way, a user with the soc_finance role will only have access to the rows whose service is ‘FINANCE’
  • In the same way, a user with the soc_it role will only have access to the rows whose service is ‘IT’
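One step the example leaves implicit (my note): the policy only takes effect once row level security is enabled on the table, for instance with:

ALTER TABLE clients ENABLE ROW LEVEL SECURITY;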

Many other interesting topics were touched on during this conference, such as cron, writing functions in other languages (like Python, JS, Java), Foreign Data Wrappers and JSON manipulation.

To know more about it, you can check :
https://www.youtube.com/watch?v=Zfdx2URaEOk&list=PLTbQvx84FrAQ4oAXbCkQugzA0o7ujccM-

Reference : Devoxx 2023 conference “Profitez de PostgreSQL pour passer à la vitesse supérieure” by Emmanuel REMY.