How reasonable is it to use temporary SQL tables? Where is the borderline?

July 12, 2012, 5:00 PM

A practical performance test in PostgreSQL.

You have thousands of rows to import, but you need to filter them using the power of SQL. What should you use: a temporary table or an ordinary permanent table (emptying the latter manually)? The answer is not trivial…

Given

The SQL table for storing data about forest fires.

CREATE TABLE public.fires (
    id BIGINT DEFAULT nextval('meteo_fires_id_seq'::regclass) NOT NULL,
    when_registered TIMESTAMP WITH TIME ZONE NOT NULL,
    geo_center public.geometry NOT NULL,
    confidence SMALLINT NOT NULL,
    -- … (and 6 other service fields)
    CONSTRAINT meteo_fires_pkey PRIMARY KEY(id)
) WITHOUT OIDS;

There are indexes on the "geo_center" (GiST) and "when_registered" fields.
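
For reference, those two indexes could be defined roughly like this (the index names here are my own assumption, not taken from the actual schema):

-- Hypothetical index names; the article only states that these indexes exist
CREATE INDEX fires_geo_center_idx ON public.fires USING gist (geo_center);
CREATE INDEX fires_when_registered_idx ON public.fires USING btree (when_registered);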

The data is supplied in large CSV files. But there is a peculiarity: we always load data from a file that is updated somewhat like a stack, though by an unknown mechanism. All we know is that it can contain both new data and old (already loaded) data.

The CSV file is very large, so we need a database buffer to work with it. Thus we parse the CSV file, insert all the lines into a buffer table, compare the data using SQL, and then extract the new information and insert it into the permanent storage.

The options

We can use either a temporary or a permanent table. Either way, we have to insert a lot of rows and execute a complicated query comparing the new data with the old. Then we truncate the buffer table.
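
Schematically, one import cycle looks roughly like this (the buffer table service.import_fires_buffer is defined in the next section, the sample values are invented, and the comparison query itself is shown later):

-- One import cycle, permanent-buffer variant (sketch)
INSERT INTO service.import_fires_buffer (when_registered, longitude, latitude, confidence)
VALUES ('2012-07-12 12:00:00+00', 37.61, 55.75, 80);  -- repeated for every parsed CSV row

-- ... the comparison INSERT ... SELECT shown below runs here ...

TRUNCATE service.import_fires_buffer;  -- empty the buffer before the next file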

Which way would be better in performance?

In both cases we should use indexes (for faster comparison). And of course PostgreSQL will use the hard disk for storage, but it is not trivial to predict in advance which set of operations will be the most optimal. So I decided to compare.

Buffer table

CREATE TABLE service.import_fires_buffer (
    id SERIAL,
    when_registered TIMESTAMP WITH TIME ZONE NOT NULL,
    longitude DOUBLE PRECISION NOT NULL,
    latitude DOUBLE PRECISION NOT NULL,
    confidence SMALLINT NOT NULL,
    -- …
    CONSTRAINT meteo_fires_pkey PRIMARY KEY(id)
) WITHOUT OIDS;

CREATE INDEX import_fires_buffer_idx ON service.import_fires_buffer
    USING btree (when_registered, longitude, latitude);
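
For the temporary-table variant the buffer is created afresh in each session. A minimal sketch, assuming the same columns as above (the elided fields are omitted):

-- Temporary buffer (sketch): it lives only for the current session and is
-- dropped automatically; temp tables always go to pg_temp, so the comparison
-- query would reference it without the "service." prefix.
CREATE TEMPORARY TABLE import_fires_buffer (
    id SERIAL PRIMARY KEY,
    when_registered TIMESTAMP WITH TIME ZONE NOT NULL,
    longitude DOUBLE PRECISION NOT NULL,
    latitude DOUBLE PRECISION NOT NULL,
    confidence SMALLINT NOT NULL
    -- …
);
CREATE INDEX import_fires_buffer_tmp_idx
    ON import_fires_buffer USING btree (when_registered, longitude, latitude);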

Comparison and inserting

The code is the same for both cases. I put the query into an SQL procedure:

INSERT INTO public.meteoevents_fires (when_registered, geo_center, confidence, …)
SELECT
    when_registered,
    ST_GeomFromText(CONCAT('POINT(', longitude, ' ', latitude, ')'), 4326),
    confidence,
    …
FROM service.import_fires_buffer AS b
WHERE b.id NOT IN (
    SELECT fb.id
    FROM service.import_fires_buffer AS fb
    JOIN public.meteoevents_fires AS ff
        ON fb.when_registered = ff.when_registered
        AND fb.longitude = ST_X(ff.geo_center)
        AND fb.latitude = ST_Y(ff.geo_center)
);

But I cannot use a stored procedure when it has to work with temporary tables, so this is quite an important point.
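
For the permanent buffer the query above can be compiled once into a function and then simply called from PHP. A minimal sketch, with a function name of my own choosing and the elided columns omitted:

-- Sketch: wrapping the comparison query in an SQL function
-- (practical only for the permanent buffer; the name is illustrative)
CREATE OR REPLACE FUNCTION service.import_new_fires() RETURNS void AS $$
    INSERT INTO public.meteoevents_fires (when_registered, geo_center, confidence)
    SELECT
        when_registered,
        ST_GeomFromText(CONCAT('POINT(', longitude, ' ', latitude, ')'), 4326),
        confidence
    FROM service.import_fires_buffer AS b
    WHERE b.id NOT IN (
        SELECT fb.id
        FROM service.import_fires_buffer AS fb
        JOIN public.meteoevents_fires AS ff
            ON fb.when_registered = ff.when_registered
            AND fb.longitude = ST_X(ff.geo_center)
            AND fb.latitude = ST_Y(ff.geo_center)
    );
$$ LANGUAGE sql;

-- The import script then only needs:
SELECT service.import_new_fires();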

The test

Bear in mind that the execution times are not pure SQL times, since they include PHP overhead. But these conditions are equal for both cases.

Parsing the CSV (excluding downloading) and inserting are measured together, because I did not use multi-row inserts. The second step was the complicated INSERT query shown above. Here is the resulting table:

Buffer type                               | Temporary table | Permanent table
Was in storage: 32864. Came from CSV: 36351
Inserting into the buffer                 | 21.17 s         | 47.1 s
Comparison and inserting into the storage | 0.88 s          | 1.36 s
Was in storage: 41591. Came from CSV: 43019
Inserting into the buffer                 | 26.6 s          | 54.48 s
Comparison and inserting into the storage | 1.3 s           | 1.54 s
Was in storage: 51588. Came from CSV: 45824
Inserting into the buffer                 | 26.44 s         | 63.73 s
Comparison and inserting into the storage | 1.01 s          | 0.7 s

Wow! As you can see, there is a tendency: as the storage fills up and the buffer grows, the comparison-and-insert step becomes faster with the permanent table! And that is with truncation included!

Now let's simply disable autocommit:

Buffer type                               | Temporary table | Permanent table
Was in storage: 51588. Came from CSV: 45824
Inserting into the buffer                 | 25.89 s         | 26.34 s
Comparison and inserting into the storage | 0.83 s          | 0.68 s

So disabling autocommit barely helped the temporary-table method, but it worked wonders for the other one! It clearly ought to be used in cases like this.
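
In practice disabling autocommit simply means wrapping all the buffer inserts in one explicit transaction. A minimal sketch in plain SQL (in PHP the same effect is usually achieved via the driver's autocommit setting; the sample values are made up):

BEGIN;  -- one transaction for the whole batch instead of one per INSERT
INSERT INTO service.import_fires_buffer (when_registered, longitude, latitude, confidence)
VALUES ('2012-07-12 12:00:00+00', 37.61, 55.75, 80);
-- ... thousands more single-row inserts ...
COMMIT;  -- the WAL is flushed once here instead of after every row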

Obviously, when working with large volumes of data, the database has to perform roughly the same operations for both types of tables.

Conclusion

There is no particular benefit to using temporary tables with large volumes of data. Moreover, your PHP code becomes much cleaner when it does not have to create tables and when you can compile procedures that operate on a permanent buffer table.

But of course, on client-server architectures where similar operations are requested by many processes at once, temporary tables are still the right choice: they are exactly the tool for keeping per-session buffers separate and avoiding data conflicts.