Create UPSERT Functionality Yourself for Amazon Redshift Databases
With a separate staging table, inserting and updating data in Amazon Redshift is straightforward.
Introduction
One of the many things I enjoy about working at a startup is the variety of people I get to work with across the company and the interesting problems I get to solve. A few months back, the marketing team was working on improving our analytics reporting and wanted to store data in our Amazon Redshift database that could then be used to build more detailed reports.
While this would normally be a relatively straightforward INSERT SQL request if all the data was brand new, one thing that made it a little tricky was that I needed to update certain rows in the data table AND insert new rows as well. Normally, I'd use an UPSERT SQL statement to do both things at once, but Redshift doesn't support UPSERT statements, so I had to get a little more creative to make this happen manually. With some research and a lot of testing, I learned that by using a temporary data table and a few SQL statements strung together it was possible, and it wasn't as complicated as I thought it would be.
Let's look at how to create our own version of an UPSERT SQL statement in a PostgreSQL-based Amazon Redshift database, updating data that already exists and adding new data as well.
Amazon Redshift
Redshift is Amazon Web Service's PostgreSQL-based cloud data warehouse built with added functionality to manage very large datasets and support high-performance analysis and reporting of those data.
It's similar to Google's BigQuery or Microsoft's Azure cloud data warehouses, if you've ever used either of those in the past.
And although Redshift is based on PostgreSQL, it does have some differences, one of them being a lack of UPSERT functionality.
No built-in UPSERT functionality
Amazon acknowledges this shortcoming readily enough, and suggests using a staging table to perform a merge. This sounds sensible, but the AWS documentation on how exactly to do this is a little sparse.
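For comparison, here's the kind of statement stock PostgreSQL (9.5+) accepts but Redshift rejects. This example is purely illustrative, reusing this article's sample columns and assuming a unique constraint on userid:
-- Works in stock PostgreSQL; Redshift errors on ON CONFLICT
INSERT INTO users (userid, firstname, lastname, lastseen)
VALUES (42, 'Ada', 'Lovelace', '2022-11-13')
ON CONFLICT (userid) DO UPDATE
SET firstname = EXCLUDED.firstname,
    lastname = EXCLUDED.lastname,
    lastseen = EXCLUDED.lastseen;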
So after I figured it out, I wanted to share it, with code examples, for each step. Let's get to it.
Create your own UPSERT
NOTE: This tutorial assumes you're familiar with basic SQL syntax. If you wish to reference any of the commands in the article, I recommend checking the Redshift SQL command docs.
If there's no native UPSERT, we'll make our own. There's more than one way to recreate UPSERT, but the option I chose to go with essentially deletes all the existing rows in the production table that match rows in the temporary table, then inserts all the data from the temporary table into the production table.
It sounds complex (because it is, a bit), but it's not so bad once each step in the process is broken down.
For this article, I'll set up a table (USER_UPDATES) that's a copy of a fictitious USERS table. In USER_UPDATES, I'll modify some existing data (updates) and create some new data (inserts), then we'll merge that modified data back into the USERS table.
Prerequisite data modification setup
-- Create a sample table as a copy of the fake USERS table
CREATE TABLE user_updates AS
SELECT * FROM users;
-- Change every third row so we have updated users
UPDATE user_updates
SET
firstname = upper(firstname),
lastname = upper(lastname),
lastseen = '2022-11-13'
WHERE mod(userid, 3) = 0;
-- Add some new rows of users so we have insert examples
-- This example creates a duplicate of every fifth row of user data
INSERT INTO user_updates
SELECT
(userid + 127) AS userid,
firstname,
lastname,
getdate() AS lastseen
FROM user_updates
WHERE mod(userid, 5) = 0;
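Before moving on, a quick sanity check (my own habit, not part of the upsert itself) confirms that USER_UPDATES now holds rows that USERS doesn't:
-- Sanity check: count the rows in user_updates that users doesn't have yet
SELECT count(*)
FROM user_updates
WHERE userid NOT IN (SELECT userid FROM users);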
Now that there's some updated user data, it's time to add it back to the USERS table, while ensuring the user data for users already in the table gets overwritten and updated, and the new user data gets added.
1. Make a staging table
The first thing we'll need to do is write a SQL command that creates a temporary table that is an exact copy of the production table the data will eventually be written into.
-- Create temporary table that's a duplicate of the production users table
CREATE TEMP TABLE temp_users (LIKE users);
Creating a temporary table is almost the same as creating a regular table in SQL (just add TEMP), and instead of having to list out each column in detail, we can simply use LIKE <table_name> to tell SQL to make this table's columns a copy of the table it will eventually add its data to.
2. Insert the new data into the temp table
With our temp table now created, it's time to insert the modified data from the prerequisite setup step into it. In this INSERT statement, we'll take all the data from the USER_UPDATES table and insert it into the temporary TEMP_USERS table.
-- Take all the data from the user_updates table and insert it into the temporary table
INSERT INTO temp_users
SELECT
userid,
firstname,
lastname,
lastseen
FROM user_updates;
Here, all the data from USER_UPDATES is put in the TEMP_USERS table for the moment.
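As an aside, steps 1 and 2 can be collapsed into a single statement if you prefer; this shortcut isn't from the AWS docs, just an equivalent I've found handy:
-- Equivalent shortcut: create and fill the temp table in one statement
CREATE TEMP TABLE temp_users AS
SELECT * FROM user_updates;
The tradeoff is that CREATE TABLE AS derives its column definitions from the query rather than from the USERS table's DDL, which is why the two-step version above can be the safer habit.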
3. Begin the transaction
A transaction is a unit of work done in a database; the work can be anything from creating tables to deleting them. The transaction taking place here will delete particular rows of user data from the production USERS table (these are existing users who have updated data in the temp table) before adding their new data into the table.
-- Start the work of deleting existing user data from the table and adding the new data instead
BEGIN TRANSACTION;
The thing about transactions in PostgreSQL (and Redshift) is that a commit is performed implicitly at the end of the transaction as long as every statement succeeds. If any statement fails, the whole transaction is rolled back instead. That's especially convenient here: if the new user data fails to be added to the USERS table, the rollback automatically restores the old user data that was just deleted.
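If you want to see that safety net in action before running the real thing, here's a throwaway experiment (the userid is arbitrary); an explicit ROLLBACK undoes everything since BEGIN:
-- Throwaway experiment: nothing here is kept
BEGIN TRANSACTION;
DELETE FROM users WHERE userid = 1;
ROLLBACK;
-- userid 1 is still in the table; the delete was never committed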
4. Delete the rows in the permanent table that need to be replaced with new data
Once the transaction's begun, the first thing to do is delete any rows from the permanent table (USERS) where the user ID matches a user ID in the TEMP_USERS table.
-- Delete the users from the database whose data needs to be updated
DELETE FROM users
USING
temp_users
WHERE users.userid = temp_users.userid;
By doing this DELETE first, we ensure that there will be no duplicate user data when the new data is added to the USERS table.
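If the DELETE ... USING join syntax is new to you, the same delete can also be written with a subquery; performance may differ depending on table sizes, but the result is identical:
-- Equivalent delete written with a subquery instead of USING
DELETE FROM users
WHERE userid IN (SELECT userid FROM temp_users);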
5. Insert all the data from the temp table to the permanent table
Now it's safe to insert everything from the TEMP_USERS table into the USERS table, without fear of duplicating existing user data.
-- Add all the user data from the temp table to the prod table
INSERT INTO users
SELECT * FROM temp_users;
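Because TEMP_USERS was created with LIKE users, the column order is guaranteed to match, so SELECT * is safe here. If you'd rather not rely on that, an explicit column list (using the four columns from this article's examples) works just as well:
-- Defensive variant that spells out the columns
INSERT INTO users (userid, firstname, lastname, lastseen)
SELECT userid, firstname, lastname, lastseen
FROM temp_users;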
6. End the transaction
Once all the parts of the transaction have completed successfully, it can be ended just like it began.
-- End the work
END TRANSACTION;
There's one last thing to do now: a little clean up.
7. Delete the temp table
Finally, since the TEMP_USERS table has done its job and there's no further need for it, it's dropped (deleted).
-- Delete the temporary table now that its job is done
DROP TABLE temp_users;
Creating our own UPSERT isn't so bad once each of the steps is broken down and explained.
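For easy copying, here's the whole sequence stitched together, exactly as covered in the steps above:
-- The full manual UPSERT, start to finish
CREATE TEMP TABLE temp_users (LIKE users);

INSERT INTO temp_users
SELECT userid, firstname, lastname, lastseen
FROM user_updates;

BEGIN TRANSACTION;

DELETE FROM users
USING temp_users
WHERE users.userid = temp_users.userid;

INSERT INTO users
SELECT * FROM temp_users;

END TRANSACTION;

DROP TABLE temp_users;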
Conclusion
Amazon Redshift is a popular, powerful data warehouse, but even though it's based on PostgreSQL, it lacks some of the nicer features like UPSERT, which is exactly the feature I needed to help out one of the teams at work.
Since I needed to add new data to a table and update existing data in that table, I needed something more than a simple SQL INSERT. Lucky for me, creating UPSERT functionality wasn't as difficult as I thought it would be with the help of a temporary table and a couple of targeted DELETE and INSERT queries. Not too shabby.
Check back in a few weeks — I’ll be writing more about JavaScript, React, IoT, or something else related to web development.
Thanks for reading. I hope you find it helpful to see how to create your own SQL UPSERT functionality. This probably won't be the first or the last time you need to insert and update data in a table, and it's nice to have options for how best to accomplish it.
Want to be notified first when I publish new content? Subscribe to my newsletter.