Minimal wal_level optimization can cause slowdowns

We found that using wal_level = minimal together with CREATE TABLE and COPY commands inside a single transaction causes a 10x to 300x slowdown, even though the PostgreSQL code and docs say this combination will make the commands faster, thanks to an optimization that writes the table directly to disk, bypassing the WAL files. The slowdown is amplified by the size of shared_buffers, because the FlushRelationBuffers function sequentially searches the entire shared buffer pool. The comments on FlushRelationBuffers even state that the function is not optimized because it is not used in any "performance-critical code paths". I would argue that it is in fact a performance-critical code path when it slows down your system by 10 to 300 times. Skipping the WAL files might be faster overall, but it is slower on the client's connection, because work that normally runs in a background process is moved into the foreground process, slowing that foreground process down. I show the PostgreSQL 10.6 source code because that is the version I am running; the issue also affects PostgreSQL 12. In this white paper I will describe how we diagnosed this issue and reviewed the code to find the cause of the slowdown.

Normally we run with archive mode and streaming replication turned on, but we needed a new server that would run standalone. I disabled WAL archival and streaming replication, which follows the suggested guidelines in the docs:

14.4.7. Disable WAL Archival and Streaming Replication

When loading large amounts of data into an installation that uses WAL archiving or streaming replication, it might be faster to take a new base backup after the load has completed than to process a large amount of incremental WAL data. To prevent incremental WAL logging while loading, disable archiving and streaming replication, by setting wal_level to minimal, archive_mode to off, and max_wal_senders to zero. But note that changing these settings requires a server restart.

Aside from avoiding the time for the archiver or WAL sender to process the WAL data, doing this will actually make certain commands faster, because they are designed not to write WAL at all if wal_level is minimal. (They can guarantee crash safety more cheaply by doing an fsync at the end than by writing WAL.) This applies to the following commands:

  • CREATE TABLE AS SELECT
  • CREATE INDEX (and variants such as ALTER TABLE ADD PRIMARY KEY)
  • ALTER TABLE SET TABLESPACE
  • CLUSTER
  • COPY FROM, when the target table has been created or truncated earlier in the same transaction

https://www.postgresql.org/docs/10/populate.html#POPULATE-PITR

When running our application against the new server, it took approximately 7.5 hours, whereas on the old server it would take less than 1 hour. I was quick to think that this might be an OS issue, as we had just changed from openSUSE Leap 42.2 to Ubuntu 18.04.3 LTS (Bionic Beaver) due to a slight version change in the hardware RAID controller. When reviewing the PostgreSQL log files, however, I found that the new server seemed to be taking 10 times longer for the COPY command than the other servers.

Machine      | OS                                 | COPY time  | Server Age | Hard Drives
Development  | openSUSE Leap 42.2                 | 45 minutes | 7 years    | Spinning disk RAID
New Server   | Ubuntu 18.04.3 LTS (Bionic Beaver) | 7.5 hours  | 2 weeks    | SSD RAID

Here is some of the relevant information from our postgresql.auto.conf file.

wal_level = 'minimal' # default replica
max_wal_senders = 0
#archive_mode = off #default off
shared_buffers = '50GB'
work_mem = '327MB'
maintenance_work_mem = '10GB'
effective_cache_size = '96GB'
#wal_sync_method = fsync # default for linux is fsync
random_page_cost = '1.1' 
effective_io_concurrency = '256' 

For more information about random_page_cost, see https://momjian.us/main/writings/pgsql/hw_selection.pdf (page 17). For more information about effective_io_concurrency, see the same document (page 17), https://news.ycombinator.com/item?id=12517368, and https://portavita.github.io/2019-07-19-PostgreSQL_effective_io_concurren....

Since our application is very large, we needed to create a small test case to isolate the problem. The programming group created a 50,000 row CSV file (480,752 bytes) and a Groovy script that creates 10 tables and loads the CSV file into them 1,000 rows at a time, rotating through the tables. The script ends up issuing 5 COPY commands against each of the 10 tables. Looking at the PostgreSQL log files, each COPY was taking 70ms to 120ms, compared to 0.8ms to 1.3ms on our 7 year old test machine, so clearly there was a problem with the new machine. I also tested on 1 year old hardware with SSDs that is identical to the new standalone machine apart from the OS, and there the COPY commands took 1.2ms to 1.5ms. We later boiled this down to a test SQL file that reproduces the same issue.

Machine      | OS                                 | COPY time      | Server Age | Hard Drives
Development  | openSUSE Leap 42.2                 | 0.8ms to 1.3ms | 7 years    | Spinning disk RAID
Production   | openSUSE Leap 42.2                 | 1.2ms to 1.5ms | 1 year     | SSD RAID
New Server   | Ubuntu 18.04.3 LTS (Bionic Beaver) | 70ms to 120ms  | 2 weeks    | SSD RAID
On paper, the new server should have been the fastest of the three:
  • wal_level is set to minimal, so less data is written to the WAL files, which should be faster.
  • The new machine has SSDs versus the spinning disks of the 7 year old hardware.
  • Faster processor.
  • More RAM.
Development:
  COPY 1000  Time: 1.365 ms
  COPY 1000  Time: 0.893 ms
  COPY 1000  Time: 0.892 ms
  COPY 1000  Time: 0.996 ms
  COPY 1000  Time: 0.958 ms
  COPY 1000  Time: 0.899 ms
  COPY 1000  Time: 1.019 ms
  COPY 1000  Time: 1.031 ms
  COPY 1000  Time: 1.116 ms
  COPY 1000  Time: 1.041 ms
  COPY 1000  Time: 1.073 ms
  COPY 1000  Time: 0.911 ms
  COPY 1000  Time: 0.939 ms
  COPY 1000  Time: 0.968 ms

Production:
  COPY 1000  Time: 1.527 ms
  COPY 1000  Time: 1.461 ms
  COPY 1000  Time: 1.423 ms
  COPY 1000  Time: 1.241 ms
  COPY 1000  Time: 1.362 ms
  COPY 1000  Time: 1.365 ms
  COPY 1000  Time: 1.383 ms
  COPY 1000  Time: 1.406 ms
  COPY 1000  Time: 1.358 ms
  COPY 1000  Time: 1.408 ms
  COPY 1000  Time: 1.258 ms
  COPY 1000  Time: 1.339 ms
  COPY 1000  Time: 1.349 ms
  COPY 1000  Time: 1.384 ms

New Server:
  COPY 1000  Time: 121.331 ms
  COPY 1000  Time: 101.869 ms
  COPY 1000  Time: 101.375 ms
  COPY 1000  Time: 101.558 ms
  COPY 1000  Time: 102.500 ms
  COPY 1000  Time: 103.690 ms
  COPY 1000  Time: 102.611 ms
  COPY 1000  Time: 101.177 ms
  COPY 1000  Time: 101.614 ms
  COPY 1000  Time: 103.256 ms
  COPY 1000  Time: 102.448 ms
  COPY 1000  Time: 101.128 ms
  COPY 1000  Time: 100.869 ms
  COPY 1000  Time: 101.728 ms

This led me down the path of looking at the wal_sync_method, because I assumed that for ACID compliance all the data has to be written to the WAL files; I would later learn that this is not always true.

/usr/local/apps/postgres/current/bin/pg_test_fsync -s 3
3 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     44835.635 ops/sec      22 usecs/op
        fdatasync                         42148.497 ops/sec      24 usecs/op
        fsync                              8383.168 ops/sec     119 usecs/op
        fsync_writethrough                              n/a
        open_sync                          8501.906 ops/sec     118 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     22627.033 ops/sec      44 usecs/op
        fdatasync                         36627.596 ops/sec      27 usecs/op
        fsync                              8155.612 ops/sec     123 usecs/op
        fsync_writethrough                              n/a
        open_sync                          4256.142 ops/sec     235 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          8320.870 ops/sec     120 usecs/op
         2 *  8kB open_sync writes         4247.050 ops/sec     235 usecs/op
         4 *  4kB open_sync writes         2149.511 ops/sec     465 usecs/op
         8 *  2kB open_sync writes         1063.186 ops/sec     941 usecs/op
        16 *  1kB open_sync writes          533.648 ops/sec    1874 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                7801.971 ops/sec     128 usecs/op
        write, close, fsync                7804.802 ops/sec     128 usecs/op

Non-sync'ed 8kB writes:
        write                            553329.822 ops/sec       2 usecs/op

While the above output shows that open_datasync has the most operations per second, that did not play out when actually running our test script. Below are the timings from running our 50,000 row test script with each of the different wal_sync_method settings. This leads me to believe that fsync is still better than open_datasync, even though open_datasync shows more operations per second than fsync in the pg_test_fsync output.

wal_sync_method | Total Test Time
fsync           | 0:09.27
fdatasync       | 0:15.75
open_datasync   | 0:15.16
open_sync       | 0:11.73

At this point, we decided to look at pg_stat_statements. Since I already have it in shared_preload_libraries, it is easy to add to a database and use.

CREATE EXTENSION pg_stat_statements;
SELECT pg_stat_statements_reset();

Now run the test script and then query the pg_stat_statements afterwards.

SELECT * FROM pg_stat_statements;

Values that are 0 for both the fast and slow tests have been removed from the following tables. They are: local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written, temp_blks_read, temp_blks_written, blk_read_time and blk_write_time.

queryid calls total_time min_time max_time mean_time stddev_time rows shared_blks_hit shared_blks_read shared_blks_dirtied shared_blks_written
2862154959 5 390.252225 76.88265 79.205529 78.050445 0.974304313734883 5000 19803 186 190 74
1125121860 5 394.627177 77.423235 79.932728 78.9254354 0.835203713590062 5000 19803 186 190 74
757962155 5 392.687423 76.734166 79.275727 78.5374846 0.930741207527982 5000 19803 186 190 74
4979868 5 386.722666 76.950663 77.755313 77.3445332 0.315288358332753 5000 19803 186 190 74
4108769240 5 385.043845 76.662604 77.72198 77.008769 0.370750516167408 5000 19803 186 190 74
982812922 5 389.727785 77.449501 78.762465 77.945557 0.435613480251932 5000 19803 186 190 74
3468077325 5 390.009387 77.282116 79.029509 78.0018774 0.658153792248161 5000 19803 186 190 74
3466826688 5 385.310237 76.51649 77.757097 77.0620474 0.460984058299894 5000 19803 186 190 74
3943553597 5 412.839826 77.706408 97.593635 82.5679652 7.54129887371314 5000 19803 186 190 74
1256848945 5 388.654889 76.841249 79.531977 77.7309778 1.09063370887762 5000 19803 186 190 74

In testing, I found that if we commit after the CREATE TABLE statements, the copies are fast. You will need to reset pg_stat_statements before running the new test.

SELECT pg_stat_statements_reset();

Here is the pg_stat_statements for this test.

queryid calls total_time min_time max_time mean_time stddev_time rows shared_blks_hit shared_blks_read shared_blks_dirtied shared_blks_written
2862154959 5 39.106633 6.855429 9.02918 7.8213266 0.726818419868842 5000 19967 188 187 0
1125121860 5 39.710882 7.167101 8.675202 7.9421764 0.656001857053652 5000 19968 188 187 0
757962155 5 38.905694 6.890894 8.868478 7.7811388 0.677629815512688 5000 19968 188 187 0
4979868 5 38.482345 7.08392 8.226157 7.696469 0.380047955497198 5000 19967 188 187 0
4108769240 5 39.587987 6.878957 8.880604 7.9175974 0.658522980833501 5000 19967 188 187 0
982812922 5 39.087789 7.566033 8.191758 7.8175578 0.237778606090119 5000 19967 188 187 0
3468077325 5 39.635136 7.382916 8.945652 7.9270272 0.542390095733837 5000 19967 188 187 0
3466826688 5 39.615545 6.897731 9.071549 7.923109 0.709376866450831 5000 19967 188 187 0
3943553597 5 39.76028 7.632807 8.212925 7.952056 0.201387453103713 5000 19969 188 187 0
1256848945 5 39.041628 7.036939 8.444774 7.8083256 0.478956203302807 5000 19967 188 187 0

We can see from the two tests that when we committed after the CREATE TABLE statements, PostgreSQL did not write any of the shared buffers to disk (shared_blks_written is 0).

So maybe these writes are causing some type of I/O issue, or something else that we can observe using PostgreSQL's wait events. We can monitor the wait events by looping over the pg_stat_activity view.

while true; do \
/usr/local/postgres-current/bin/psql -d sandbox -t -A -c \
"SELECT wait_event_type, wait_event, query FROM pg_stat_activity WHERE query LIKE 'copy %' AND state = 'active'"; \
done;

Now run the test script.

/usr/local/postgres-current/bin/psql -h db0 -d sandbox -f slow-copy-inside-txn-repro.sql
 pg_sleep
----------

(1 row)

BEGIN
psql:slow-copy-inside-txn-repro.sql:9: NOTICE:  schema "deleteme" does not exist, skipping
DROP SCHEMA
CREATE SCHEMA
CREATE TABLE
Timing is on.
COPY 1000
Time: 121.331 ms
COPY 1000
Time: 101.869 ms
COPY 1000
Time: 101.375 ms
COPY 1000
Time: 101.558 ms
COPY 1000
Time: 102.500 ms
COPY 1000
Time: 103.690 ms
COPY 1000
Time: 102.611 ms
COPY 1000
Time: 101.177 ms
COPY 1000
Time: 101.614 ms
COPY 1000
Time: 103.256 ms
COPY 1000
Time: 102.448 ms
COPY 1000
Time: 101.128 ms
COPY 1000
Time: 100.869 ms
COPY 1000
Time: 101.728 ms
Timing is off.
ROLLBACK

Once the test script is finished, press Ctrl+C to end the loop.

||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
^C

As you can see from the blank wait events, PostgreSQL is not waiting on anything. This means that PostgreSQL is CPU bound, i.e. running its own code.

If we run this same test against our 7 year old test hardware, we see an amazing difference.

/usr/local/postgres-current/bin/psql -h db0 -d sandbox -f slow-copy-inside-txn-repro.sql
 pg_sleep
----------

(1 row)

BEGIN
psql:slow-copy-inside-txn-repro.sql:9: NOTICE:  schema "deleteme" does not exist, skipping
DROP SCHEMA
CREATE SCHEMA
CREATE TABLE
Timing is on.
COPY 1000
Time: 1.365 ms
COPY 1000
Time: 0.893 ms
COPY 1000
Time: 0.892 ms
COPY 1000
Time: 0.996 ms
COPY 1000
Time: 0.958 ms
COPY 1000
Time: 0.899 ms
COPY 1000
Time: 1.019 ms
COPY 1000
Time: 1.031 ms
COPY 1000
Time: 1.116 ms
COPY 1000
Time: 1.041 ms
COPY 1000
Time: 1.073 ms
COPY 1000
Time: 0.911 ms
COPY 1000
Time: 0.939 ms
COPY 1000
Time: 0.968 ms
Timing is off.
ROLLBACK

Our brand new SSD server is 100 times slower than our 7 year old spinning disk server.

Now we will want to profile PostgreSQL to find out where it is spending all its time. To do so, we need to recompile PostgreSQL to make it easier to debug and profile.

https://wiki.postgresql.org/wiki/Developer_FAQ#Compile-time

./configure --enable-cassert --enable-debug \
CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"

To profile PostgreSQL and see where it is spending its time, we need to install perf. For more information on perf, see http://www.brendangregg.com/perf.html

perf top

Now that I know all the time is being spent in FlushRelationBuffers, I would like to know what inside that function is taking up the time. I can do this by monitoring the specific backend process running the COPY commands. To more easily find that process, add a pg_sleep(30) just after the connection is made; this gives you time to find the connection and attach perf to the process before any important work is performed. We can watch pg_stat_activity until we see the sleeping process in the test database, and then ask perf to record what it sees happening.

while true; \
do /usr/local/postgres-current/bin/psql -d postgres -t -A -c \
"SELECT pid FROM pg_stat_activity WHERE query LIKE '%pg_sleep%' AND datname = 'sandbox'"; \
done;
perf record -p PID

When the test script process ends, perf will automatically end as well. We can then view the perf report.


Samples: 22K of event 'cycles:ppp', Event count (approx.): 24367076801      
Overhead  Command   Shared Object           Symbol                          

  90.08%  postgres  postgres                [.] FlushRelationBuffers
   1.19%  postgres  postgres                [.] rnode_comparator
   0.82%  postgres  postgres                [.] nocache_index_getattr
   0.67%  postgres  postgres                [.] DropRelFileNodesAllBuffers

We can now see that FlushRelationBuffers is taking up 90% of the time for our test script. To find out what inside that function is taking the time, press the down arrow to highlight the FlushRelationBuffers line and then press a to annotate the source code of that function. Once the annotation is displayed, press k to show the source code line numbers, which helps when referencing the full source code.


FlushRelationBuffers /usr/local/postgres-10.6-debug/bin/postgres
Percent│3200         for (i = 0; i < NBuffers; i++)
  0.03 │19e:   add    $0x1,%r13d
       │1a2:   cmp    %r13d,NBuffers
       │     ↓ jle    279
       │3204                 bufHdr = GetBufferDescriptor(i);
  3.07 │       movslq %r13d,%rbx
       │       shl    $0x6,%rbx
       │       add    BufferDescriptors,%rbx
       │3210                 if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
       │       mov    0x8(%r12),%eax
  0.04 │       cmp    %eax,0x8(%rbx)
 96.86 │     ↑ jne    19e
       │       mov    0x4(%r12),%eax
       │        cmp    %eax,0x4(%rbx)
       │     ↑ jne    19e
       │       mov    (%r12),%eax
       │       cmp    %eax,(%rbx)
       │     ↑ jne    19e
       │3213                 ReservePrivateRefCountEntry();
       │     → callq ReservePrivateRefCountEntry

At this point, go to https://doxygen.postgresql.org/ and enter FlushRelationBuffers into the search field. Just remember that this searches the master code base, not your specific PostgreSQL version; in my case, I am using version 10.6. Master has this same issue, just at different line numbers, so it was a fast way to find the file. The file manager on the left hand side of doxygen shows the directory structure location of the file. I then looked up the specific version on GitHub at https://github.com/postgres/postgres/tree/REL_10_6, where you can follow the same directory structure down to the exact same file for your specific version of PostgreSQL.

https://github.com/postgres/postgres/blob/REL_10_6/src/backend/storage/b...

/* ---------------------------------------------------------------------
 *		FlushRelationBuffers
 *
 *		This function writes all dirty pages of a relation out to disk
 *		(or more accurately, out to kernel disk buffers), ensuring that the
 *		kernel has an up-to-date view of the relation.
 *
 *		Generally, the caller should be holding AccessExclusiveLock on the
 *		target relation to ensure that no other backend is busy dirtying
 *		more blocks of the relation; the effects can't be expected to last
 *		after the lock is released.
 *
 *		XXX currently it sequentially searches the buffer pool, should be
 *		changed to more clever ways of searching.  This routine is not
 *		used in any performance-critical code paths, so it's not worth
 *		adding additional overhead to normal paths to make it go faster;
 *		but see also DropRelFileNodeBuffers.
 * --------------------------------------------------------------------
 */
void
FlushRelationBuffers(Relation rel)
{
	for (i = 0; i < NBuffers; i++)
	{
		uint32		buf_state;

		bufHdr = GetBufferDescriptor(i);

		/*
		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
		 * and saves some cycles.
		 */
		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
			continue;

The comment in the code appears to be wrong, because to me this is a performance-critical code path if it slows down PostgreSQL by 10x or more.

XXX currently it sequentially searches the buffer pool, should be changed to more clever ways of searching. This routine is not used in any performance-critical code paths, so it's not worth adding additional overhead to normal paths to make it go faster; but see also DropRelFileNodeBuffers.

As stated in the comment, the code loops through the shared buffers, checking each buffer / page to see if it belongs to the relation and needs to be flushed to disk. The "for loop" above iterates over NBuffers, which is shared_buffers divided by the block size. That is more than 6.5 million buffers on our server.

6,553,600 NBuffers = 50G shared_buffers / 8K block_size
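The arithmetic above can be sketched directly (the 8kB block size is PostgreSQL's compile-time default):

```python
# NBuffers is the number of block_size pages that fit in shared_buffers;
# FlushRelationBuffers walks every one of them on each call.
shared_buffers = 50 * 1024 ** 3   # 50GB in bytes
block_size = 8 * 1024             # 8kB default PostgreSQL block size
nbuffers = shared_buffers // block_size
print(nbuffers)  # 6553600
```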

Tuning your shared_buffers: the PostgreSQL High Performance book talks about using 25% of RAM, with a maximum of 8GB, for shared_buffers, and Robert Haas has written a blog post saying that if you can put 100% of your database in RAM, then you should do so.

PostgreSQL 9.6 High Performance book by Ibrar Ahmed & Gregory Smith pages 129 - 131 on "Starting size guidelines" and "Platform, version, and workload limitations" (talks about 25% of RAM with max of 8G RAM.)
https://www.packtpub.com/big-data-and-business-intelligence/postgresql-h...
https://www.packtpub.com/big-data-and-business-intelligence/postgresql-1...

http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html

For shared_buffers, the quick answer is to allocate about 25% of system memory to shared_buffers, as recommended by the official documentation and by the wiki article on Tuning Your PostgreSQL server, but not more than about 8GB on Linux or 512MB on Windows, and sometimes less. However, I've recently become aware of a number of cases which suggest that higher values may perform much better. PostgreSQL uses the operating system's memory for caching, but, as recent performance results have shown, there can be a huge performance benefit to having the entire database in shared_buffers as opposed to merely having it in RAM. Several EnterpriseDB customers or potential customers have reported excellent results from cranking up shared_buffers to very large values, even giving PostgreSQL the lion's share of system memory for its own use. This flies in the face of some conventional wisdom that PostgreSQL doesn't handle large shared_buffers settings well, but enough people have reported good results that it seems worth taking this strategy seriously.

RelFileNodeEquals is really just a macro that does a comparison. When we look it up, we can find its source code.

https://github.com/postgres/postgres/blob/REL_10_6/src/include/storage/r...

/*
 * Note: RelFileNodeEquals and RelFileNodeBackendEquals compare relNode first
 * since that is most likely to be different in two unequal RelFileNodes.  It
 * is probably redundant to compare spcNode if the other fields are found equal,
 * but do it anyway to be sure.  Likewise for checking the backend ID in
 * RelFileNodeBackendEquals.
 */
#define RelFileNodeEquals(node1, node2) \
	((node1).relNode == (node2).relNode && \
	 (node1).dbNode == (node2).dbNode && \
	 (node1).spcNode == (node2).spcNode)

We are actually looking at the negated (!) version. Applying De Morgan's law, the condition the compiled code evaluates is equivalent to:

!RelFileNodeEquals(node1, node2)  =>
	((node1).relNode != (node2).relNode ||
	 (node1).dbNode != (node2).dbNode ||
	 (node1).spcNode != (node2).spcNode)

There is no such macro in the source; this is just the expanded form of the negated check, and it matches the three compare-and-jump pairs in the annotated assembly above.
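The perf annotation showed 96.86% of samples on the first `jne`, which makes sense given the short-circuit order: almost every buffer fails the very first relNode comparison. Here is a toy model of that behavior (the buffer tags and counts below are made up for illustration, not taken from a real server):

```python
# Model of the scan inside FlushRelationBuffers: the negated check
# short-circuits on the first mismatching field, so nearly every buffer
# is rejected after a single relNode comparison -- matching the hot
# first `jne` in the annotation above.
NBUFFERS = 1_000_000
target = (1663, 16384, 99_999)          # (spcNode, dbNode, relNode) - invented
buffers = [(1663, 16384, rel) for rel in range(NBUFFERS)]

first_field_checks = 0
matches = 0
for spc, db, rel in buffers:
    first_field_checks += 1             # relNode is compared for every buffer
    if rel != target[2]:
        continue                        # short-circuit: the common case
    if db != target[1] or spc != target[0]:
        continue
    matches += 1

print(first_field_checks, matches)  # 1000000 1
```

One field comparison per non-matching buffer, a million times over, is exactly the kind of work that shows up as a single hot branch in perf.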

If we look at what called FlushRelationBuffers, we find that it was the heap_sync function. When we look up its code, we can see the call to FlushRelationBuffers.

https://github.com/postgres/postgres/blob/REL_10_6/src/backend/access/he...

/*
 *	heap_sync		- sync a heap, for use when no WAL has been written
 *
 * This forces the heap contents (including TOAST heap if any) down to disk.
 * If we skipped using WAL, and WAL is otherwise needed, we must force the
 * relation down to disk before it's safe to commit the transaction.  This
 * requires writing out any dirty buffers and then doing a forced fsync.
 *
 * Indexes are not touched.  (Currently, index operations associated with
 * the commands that use this are WAL-logged and so do not need fsync.
 * That behavior might change someday, but in any case it's likely that
 * any fsync decisions required would be per-index and hence not appropriate
 * to be done here.)
 */
void
heap_sync(Relation rel)
{
	/* non-WAL-logged tables never need fsync */
	if (!RelationNeedsWAL(rel))
		return;

	/* main heap */
	FlushRelationBuffers(rel);
	/* FlushRelationBuffers will have opened rd_smgr */
	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);

If we look at what called heap_sync, we find that it was the CopyFrom function. When we look up its code, we can see where it decides to call heap_sync.

https://github.com/postgres/postgres/blob/REL_10_6/src/backend/commands/...

	/*----------
	 * Check to see if we can avoid writing WAL
	 *
	 * If archive logging/streaming is not enabled *and* either
	 *	- table was created in same transaction as this COPY
	 *	- data is being written to relfilenode created in this transaction
	 * then we can skip writing WAL.  It's safe because if the transaction
	 * doesn't commit, we'll discard the table (or the new relfilenode file).
	 * If it does commit, we'll have done the heap_sync at the bottom of this
	 * routine first.
	 *
	 * As mentioned in comments in utils/rel.h, the in-same-transaction test
	 * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
	 * can be cleared before the end of the transaction. The exact case is
	 * when a relation sets a new relfilenode twice in same transaction, yet
	 * the second one fails in an aborted subtransaction, e.g.
	 *
	 * BEGIN;
	 * TRUNCATE t;
	 * SAVEPOINT save;
	 * TRUNCATE t;
	 * ROLLBACK TO save;
	 * COPY ...
	 *
	 * Also, if the target file is new-in-transaction, we assume that checking
	 * FSM for free space is a waste of time, even if we must use WAL because
	 * of archiving.  This could possibly be wrong, but it's unlikely.
	 *
	 * The comments for heap_insert and RelationGetBufferForTuple specify that
	 * skipping WAL logging is only safe if we ensure that our tuples do not
	 * go into pages containing tuples from any other transactions --- but this
	 * must be the case if we have a new table or new relfilenode, so we need
	 * no additional work to enforce that.
	 *----------
	 */
	/* createSubid is creation check, newRelfilenodeSubid is truncation check */
	if (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
		cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
	{
		hi_options |= HEAP_INSERT_SKIP_FSM;
		if (!XLogIsNeeded())
			hi_options |= HEAP_INSERT_SKIP_WAL;
	}

How does it tell if the WAL is needed? Looking up the XLogIsNeeded macro, we find that it checks whether the WAL level is greater than or equal to replica.

https://github.com/postgres/postgres/blob/REL_10_6/src/include/access/xl...

/*
 * Is WAL-logging necessary for archival or log-shipping, or can we skip
 * WAL-logging if we fsync() the data before committing instead?
 */
#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)

We can then look up WAL_LEVEL_REPLICA to see all the possible values and their numerical order. This tells us which levels are greater than replica and which are less.

https://github.com/postgres/postgres/blob/REL_10_6/src/backend/access/rm...

/*
 * GUC support
 */
const struct config_enum_entry wal_level_options[] = {
	{"minimal", WAL_LEVEL_MINIMAL, false},
	{"replica", WAL_LEVEL_REPLICA, false},
	{"archive", WAL_LEVEL_REPLICA, true},	/* deprecated */
	{"hot_standby", WAL_LEVEL_REPLICA, true},	/* deprecated */
	{"logical", WAL_LEVEL_LOGICAL, false},
	{NULL, 0, false}
};

At the end of the CopyFrom function, we can see that it calls heap_sync when it has skipped writing to the WAL files.

https://github.com/postgres/postgres/blob/REL_10_6/src/backend/commands/...

	/*
	 * If we skipped writing WAL, then we need to sync the heap (but not
	 * indexes since those use WAL anyway)
	 */
	if (hi_options & HEAP_INSERT_SKIP_WAL)
		heap_sync(cstate->rel);

	return processed;
}
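Putting the pieces together, the decision CopyFrom makes can be condensed into a few lines. This is my own sketch of the control flow walked through above, not the actual implementation:

```python
# WAL level ordering mirrors the wal_level_options GUC table:
# minimal < replica (= archive = hot_standby) < logical.
WAL_LEVELS = {"minimal": 0, "replica": 1, "archive": 1,
              "hot_standby": 1, "logical": 2}

def xlog_is_needed(wal_level):
    # mirrors: #define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
    return WAL_LEVELS[wal_level] >= WAL_LEVELS["replica"]

def copy_from_path(wal_level, table_is_new_in_txn):
    # A table created or truncated in this transaction can skip the FSM,
    # and additionally skip WAL when the level is minimal.  Skipping WAL
    # means heap_sync() -> FlushRelationBuffers() runs at the end of COPY.
    if table_is_new_in_txn and not xlog_is_needed(wal_level):
        return "skip WAL, flush + fsync relation at end of COPY"
    return "write WAL"

print(copy_from_path("minimal", True))   # the slow case in this paper
print(copy_from_path("replica", True))
print(copy_from_path("minimal", False))
```

This makes the trigger condition plain: the slow path only fires when wal_level is minimal *and* the table is new in the current transaction, which is exactly our CREATE TABLE + COPY pattern.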

Now that we know the problem is caused by the minimal WAL level, I reviewed the docs again. https://www.postgresql.org/docs/10/populate.html#POPULATE-PITR says that with wal_level set to minimal and archiving turned off, these commands will be faster, but as we have just proven, that is not always true.

(This is the same documentation passage and command list quoted at the beginning of this paper.)

Now that we know what we need to change, we can start testing with different configurations.

Development (wal_level = replica, shared_buffers = 8GB):
  COPY 1000  Time: 1.365 ms
  COPY 1000  Time: 0.893 ms
  COPY 1000  Time: 0.892 ms
  COPY 1000  Time: 0.996 ms
  COPY 1000  Time: 0.958 ms
  COPY 1000  Time: 0.899 ms
  COPY 1000  Time: 1.019 ms
  COPY 1000  Time: 1.031 ms
  COPY 1000  Time: 1.116 ms
  COPY 1000  Time: 1.041 ms
  COPY 1000  Time: 1.073 ms
  COPY 1000  Time: 0.911 ms
  COPY 1000  Time: 0.939 ms
  COPY 1000  Time: 0.968 ms

Production (wal_level = replica, shared_buffers = 50GB):
  COPY 1000  Time: 1.527 ms
  COPY 1000  Time: 1.461 ms
  COPY 1000  Time: 1.423 ms
  COPY 1000  Time: 1.241 ms
  COPY 1000  Time: 1.362 ms
  COPY 1000  Time: 1.365 ms
  COPY 1000  Time: 1.383 ms
  COPY 1000  Time: 1.406 ms
  COPY 1000  Time: 1.358 ms
  COPY 1000  Time: 1.408 ms
  COPY 1000  Time: 1.258 ms
  COPY 1000  Time: 1.339 ms
  COPY 1000  Time: 1.349 ms
  COPY 1000  Time: 1.384 ms

New Server (wal_level = minimal, shared_buffers = 50GB):
  COPY 1000  Time: 121.331 ms
  COPY 1000  Time: 101.869 ms
  COPY 1000  Time: 101.375 ms
  COPY 1000  Time: 101.558 ms
  COPY 1000  Time: 102.500 ms
  COPY 1000  Time: 103.690 ms
  COPY 1000  Time: 102.611 ms
  COPY 1000  Time: 101.177 ms
  COPY 1000  Time: 101.614 ms
  COPY 1000  Time: 103.256 ms
  COPY 1000  Time: 102.448 ms
  COPY 1000  Time: 101.128 ms
  COPY 1000  Time: 100.869 ms
  COPY 1000  Time: 101.728 ms

New Server (wal_level = replica, shared_buffers = 50GB):
  COPY 1000  Time: 0.472 ms
  COPY 1000  Time: 0.353 ms
  COPY 1000  Time: 0.344 ms
  COPY 1000  Time: 0.337 ms
  COPY 1000  Time: 0.340 ms
  COPY 1000  Time: 0.330 ms
  COPY 1000  Time: 0.317 ms
  COPY 1000  Time: 0.347 ms
  COPY 1000  Time: 0.350 ms
  COPY 1000  Time: 0.343 ms
  COPY 1000  Time: 0.331 ms
  COPY 1000  Time: 0.345 ms
  COPY 1000  Time: 0.325 ms
  COPY 1000  Time: 0.317 ms

New Server (wal_level = minimal, shared_buffers = 8GB):
  COPY 1000  Time: 20.279 ms
  COPY 1000  Time: 16.195 ms
  COPY 1000  Time: 15.576 ms
  COPY 1000  Time: 15.499 ms
  COPY 1000  Time: 16.040 ms
  COPY 1000  Time: 15.916 ms
  COPY 1000  Time: 15.096 ms
  COPY 1000  Time: 15.081 ms
  COPY 1000  Time: 15.063 ms
  COPY 1000  Time: 15.034 ms
  COPY 1000  Time: 15.197 ms
  COPY 1000  Time: 14.960 ms
  COPY 1000  Time: 14.889 ms
  COPY 1000  Time: 16.139 ms

When testing our app, we could see these differences.

wal_level             | Total Time (H:MM:SS) | Processing Time (H:MM:SS) | COPY rows | shared_buffers | NBuffers
minimal               | 7:38:23              | 6:55:21                   | 1,000     | 50G            | 6,553,600
hot_standby / replica | 0:43:54              | 0:15:35                   | 1,000     | 50G            | 6,553,600
minimal               | 1:54:06              | 1:16:57                   | 1,000     | 8G             | 1,048,576
minimal               | 1:00:42              | 0:31:13                   | 10,000    | 8G             | 1,048,576
hot_standby / replica | 0:55                 | 0:15:03                   | 10,000    | 50G            | 6,553,600

The difference between the Total Time and the Processing Time is the amount of time to fetch / download the data from external sources to the local application server.

We can see that even after lowering shared_buffers to 8GB, the minimal setting still takes over twice as long as replica, because FlushRelationBuffers has to loop through all 1+ million shared buffer blocks/pages. Since we really do want to go with minimal WAL, we are revising our code so that it does not use 1,000-row writes and instead writes each full table in a single COPY command. Once this is finished, I will update this white paper with the new, improved timings.
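As a rough cost model (my own estimate, not a measurement), the number of buffer-header checks grows with the number of COPY commands, because each COPY into a new-in-transaction table ends with its own heap_sync / FlushRelationBuffers pass. One big COPY per table pays the scan only once:

```python
# Total buffer-header comparisons for a load, assuming one full
# FlushRelationBuffers scan per COPY command (illustrative model only).
def total_buffer_checks(rows, rows_per_copy, nbuffers):
    copies = -(-rows // rows_per_copy)   # ceiling division
    return copies * nbuffers

nbuffers_50g = (50 * 1024 ** 3) // 8192  # 6,553,600
print(total_buffer_checks(50_000, 1_000, nbuffers_50g))   # 50 COPYs -> 327680000
print(total_buffer_checks(50_000, 50_000, nbuffers_50g))  # 1 COPY   -> 6553600
```

Under this model, batching the 50,000 rows into a single COPY cuts the scanning work by a factor of 50, which is why we expect the code revision to help.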

NOTE: If you wish to see the above links in the latest version of the code, just replace "REL_10_6" in the URL with "master" and remove the #Lxxx at the end of the URL, as the line numbers will not be the same between version 10.6 and the current master. Once the page is displayed, use your web browser's search function to find the function you want to look at in the code.

I would like to thank the following people for their input and help during the research of this issue:
Tobin Stelling a co-worker at SCHARP / Fred Hutchinson Cancer Research Center
David Costanzo a co-worker at SCHARP / Fred Hutchinson Cancer Research Center
Jeremy Schneider from Amazon AWS