Mixed isolation levels inside transactions

We ran into an Serializable Snapshot Isolation issue recently when doing atomic schema replacement in our data warehouse, where the end user would see zero rows of data even though when they started their transaction and set their isolation level to REPEATABLE READ READ ONLY or SERIALIZABLE READ ONLY and they queried the table and the query had results.

Requirements:

These are some of the requirements for our data warehouse.

  • Readers to have transactionally consistent data with at least REPEATABLE READ isolation.
  • Readers to see the datasets as relations organized by schema/table.
  • A single writer to be able to add/remove datasets, change the logical structure* of the relations, and update data without blocking on readers.
  • Update the data as often as 1 hour.

* The width, number of columns, of any table may change without notice during the build of the datasets.

Requirement Issues:
We can't naively delete the old data and replace it with a new dataset, because the problem then becomes table, index, and WAL bloat. The oldest running transaction on the entire cluster/server is what sets the autovacuum event horizon. This means that long running transaction will cause excessive table/index bloat when you are churning large amounts of data. We found ourselves sometimes burning through a TB of drive space in 24 hours.

We can't truncate the tables, because truncate requires an exclusive lock which will then require all the read transactions to finish before the truncate can be executed.

We can't use Table Inheritance aka Parent/Child tables due to the fact that fields can be added/removed at any time.

We can't use Functions due to the fact that fields can be added/removed at any time which would alter the functions returning TYPE.

We can't just swap out the tables as the ALTER TABLE command will be blocked until the reader finishes their transaction.

What we thought was the solution:

We came up with an innovative way to deal with the problem. When refreshing the data, do so into a new schema. Then inside a single transaction swap out the schemas. Any previous transactions should see the old data and the new transactions would see the new data when using SERIALIZABLE or REPEATABLE READ. We were encouraged by the PostgreSQL documentation to use SERIALIZABLE READY ONLY DEFERRABLE which removed the risk of contributing to or being canceled by a serialization failure, which this mode is well suited for long-running reports or backups.

The DEFERRABLE transaction property has no effect unless the transaction is also SERIALIZABLE and READ ONLY. When all three of these properties are selected for a transaction, the transaction may block when first acquiring its snapshot, after which it is able to run without the normal overhead of a SERIALIZABLE transaction and without any risk of contributing to or being canceled by a serialization failure. This mode is well suited for long-running reports or backups.

https://www.postgresql.org/docs/12/sql-set-transaction.html

The Problem:

The issue we ran into was when we are doing the atomic schema replacement in our data warehouse, where the end user would see zero rows of data even though when they started their transaction and set their isolation level to REPEATABLE READ READ ONLY or SERIALIZABLE READ ONLY and they queried the table and the query had results.

Reproducible Test Case:

Here is a simple test case to show off the issue we are seeing and has been tested with PostgreSQL versions 10.6 & 12.2.

Writer Reader
BEGIN;
CREATE SCHEMA test;
CREATE TABLE test.table ( test TEXT );
INSERT INTO test.table (test) VALUES ('row 1');
COMMIT;
 
  BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT * FROM test.table;
test
-------
row 1
(1 row)
BEGIN;
CREATE SCHEMA new_test;
CREATE TABLE new_test.table (test TEXT);
INSERT INTO new_test.table VALUES ('row 2');
ALTER SCHEMA test RENAME TO old_test;
ALTER SCHEMA new_test RENAME TO test;
COMMIT;
 
  SELECT * FROM test.table;
test
------
(0 rows)

DEFERRABLE vers NOT DEFERRABLE does not make a difference.
REPEATABLE READ vers SERIALIZABLE does not make a difference.
READY ONLY vers READ WRITE does not make a difference.

I believe that the issue is that the Reader connection is using the current version of the catalog instead of the historic snapshot of the catalog. I think this is a bug, a type of snapshot isolation anomaly, in PostgreSQL. PostgreSQL should either work properly, using the historic snapshot of the catalog, throw an error, or block the ALTER SCHEMA until the Reader's transaction completes. To test this theory, I used this test case.

Writer Reader
BEGIN;
CREATE SCHEMA test;
CREATE TABLE test.table ( test TEXT );
INSERT INTO test.table (test) VALUES ('row 1');
COMMIT;
 
  BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT * FROM test.table;
test
-------
row 1
(1 row)
BEGIN;
CREATE SCHEMA new_test;
CREATE TABLE new_test.table (test TEXT);
INSERT INTO new_test.table VALUES ('row 2');
ALTER SCHEMA test RENAME TO old_test;
ALTER SCHEMA new_test RENAME TO test;
COMMIT;
 
  SELECT * FROM old_test.table;
test
-------
row 1
(1 row)

Since the query is returning data from a schema that did not exist at the start of the Readers transaction, this means that the SELECT statement is using the PostgreSQL catalog generated after the Writers transaction completed instead of when the Readers transaction started. The query did return the correct results.

Viewing the problem inside the the PostgreSQL Catalog:

We can use this test to to see that the Reader is reading the wrong row of data from the pg_catalog.pg_namespace.

Writer Reader
CREATE EXTENSION pageinspect;
BEGIN;
CREATE SCHEMA test;
CREATE TABLE test.table ( test TEXT );
INSERT INTO test.table (test) VALUES ('row 1');
COMMIT;
 
  BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT * FROM test.table;
test
-------
row 1
(1 row)
SELECT txid_current();
txid_current
--------------
349902815
(1 row)
SELECT oid, nspname FROM pg_catalog.pg_namespace WHERE nspname = 'test';
    oid    | nspname
-----------+---------
 513537401 | test
(1 row)
-- Use this OID for the next Query
SELECT lp as tuple, t_xmin, t_xmax, t_field3 as t_cid, t_ctid, t_oid
FROM heap_page_items(get_raw_page('pg_catalog.pg_namespace', 0))
WHERE t_oid = '(replace with saved OID)';
 tuple |  t_xmin   |  t_xmax   | t_cid | t_ctid |   t_oid
-------+-----------+-----------+-------+--------+-----------
    13 | 349902806 |         0 |     0 | (0,13) | 513537401
(1 row)
BEGIN;
CREATE SCHEMA new_test;
CREATE TABLE new_test.table (test TEXT);
INSERT INTO new_test.table VALUES ('row 2');
ALTER SCHEMA test RENAME TO old_test;
ALTER SCHEMA new_test RENAME TO test;
COMMIT;
 
  SELECT * FROM test.table;
test
------
(0 rows)
SELECT lp as tuple, t_xmin, t_xmax, t_field3 as t_cid, t_ctid, t_oid
FROM heap_page_items(get_raw_page('pg_catalog.pg_namespace', 0))
WHERE t_oid = '(replace with saved OID)';
 tuple |  t_xmin   |  t_xmax   | t_cid | t_ctid |   t_oid
-------+-----------+-----------+-------+--------+-----------
    13 | 349902806 | 349902827 |     7 | (0,15) | 513537401
    15 | 349902827 |         0 |     7 | (0,15) | 513537401
(2 rows)

We can see that our Reading transaction id is 349902815. This is greater than the t_xmin and less than the t_xmax for the original record, tuple 13, so in theory this is the row that should be returned. We can also see that the transaction id is less than the t_xmin of tuple 15, so theory it should not be possible to read this row during an isolation level of SERIALIZABLE READ ONLY DEFERRABLE. This means that the query used the Read Committed Isolation Level when talking to the system tables instead of the isolation level specified in the transaction causing a mixed/hybrid isolation level to actually be used.

Conclusion - Proposed Solution:

When the isolation level is set to REPEATABLE READ READ ONLY or SERIALIZABLE READ ONLY, then use the current transaction id to check what records to are returned from the system tables. This way the version of the data matches the version of the catalog.

Bug Report:

I submitted PostgreSQL Bug #16454 on 5/20/2020.

References:

Serializable Snapshot Isolation in PostgreSQL
http://vldb.org/pvldb/vol5/p1850_danrkports_vldb2012.pdf
Presented at the 38th International Conference on Very Large Databases provides more detail and perspective than the other links, along with references to the papers which laid the groundwork for this implementation.

The Internals of PostgreSQL - Chapter 5 - Concurrency Control
http://www.interdb.jp/pg/pgsql05.html

PostgreSQL Wiki - Serializable
https://wiki.postgresql.org/wiki/Serializable

PostgreSQL Documentation - 13.2. Transaction Isolation
https://www.postgresql.org/docs/12/transaction-iso.html

PostgreSQL Documentation - SET TRANSACTION
https://www.postgresql.org/docs/12/sql-set-transaction.html

Acknowledgments:

I would like to thank the following people for their input and help during the research of this issue:
Tobin Stelling a co-worker at SCHARP / Fred Hutchinson Cancer Research Center
David Costanzo a co-worker at SCHARP / Fred Hutchinson Cancer Research Center