Zero Downtime Postgres Database Migration from GCP to AWS RDS

In 2023, I was working at a startup where we were running our Elixir backend on Gigalixir.

Gigalixir is a PaaS similar to Heroku that, at the time we were using it, provided its managed Postgres database service via GCP (Google Cloud Platform). We had trouble with it since it was a bit of a black box with limited scaling and observability. We needed to scale it more granularly, monitor CPU, memory and disk usage, and have better backups and restores.

A zero downtime migration strategy was ideal since more 9s = happier users + happier engineers. I decided to use Postgres logical replication which allows you to make the source database a publisher and the migration target a subscriber to synchronise data between them. In our case, the GCP DB would be the publisher whereas the AWS one would be the subscriber.
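
For context, this is roughly what native logical replication looks like when you control both ends yourself (a minimal sketch with hypothetical publication/subscription names; as noted below, RDS wouldn't let me create the subscriber side directly, which is the part AWS DMS ends up handling):

     -- On the source (publisher): publish all tables, or a chosen subset
     CREATE PUBLICATION my_app_pub FOR ALL TABLES;

     -- On the target (subscriber): copies existing data first, then streams ongoing changes
     CREATE SUBSCRIPTION my_app_sub
       CONNECTION 'host=<source-host> port=5432 dbname=my_app user=repl_user password=<password>'
       PUBLICATION my_app_pub;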

Strategy

This was the general strategy:

  1. Replicate the schema from GCP onto a new RDS DB instance

  2. Enable logical replication on RDS, sync the existing data, then continue with ongoing replication

  3. Cut over the application to the new AWS RDS database

I found that AWS RDS prevents you from creating a subscriber directly. There is an rds_replication role on the database, but it seemed artificially limited and couldn't actually create subscriptions, forcing you to go through their paid service wrapper, AWS Database Migration Service (DMS). So that's what I used.

Logical replication happens table by table, capturing DML changes on each of the source database tables and replicating them on the target. As such, it requires that all tables have a primary key in order for UPDATEs and DELETEs to be replicated.
💡
It is also limited in that you can't capture DDL changes with this strategy so it's best to pause DDL modifications until the migration is completed.
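
Before kicking things off, it may be worth listing any tables that are missing a primary key so they can be fixed up front. A query along these lines (my own sketch, not part of the migration tooling) should surface them:

     -- Tables in the public schema with no primary key constraint;
     -- UPDATEs and DELETEs on these won't replicate correctly.
     SELECT t.table_name
     FROM information_schema.tables t
     LEFT JOIN information_schema.table_constraints tc
       ON tc.table_schema = t.table_schema
      AND tc.table_name = t.table_name
      AND tc.constraint_type = 'PRIMARY KEY'
     WHERE t.table_schema = 'public'
       AND t.table_type = 'BASE TABLE'
       AND tc.constraint_name IS NULL;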

Here's a set of instructions for how I did it:

Steps

I. Prep the GCP source database

  1. Ensure that wal_level is set to logical and that pglogical, the extension that helps with this replication strategy, is installed.

    Set the cloudsql.logical_decoding and cloudsql.enable_pglogical flags to on and reboot the DB; this could make it unavailable for a few minutes, though it usually takes only seconds.

     SHOW wal_level;
     -- should return logical
    
     CREATE EXTENSION pglogical;
    
     SELECT * FROM pg_catalog.pg_extension WHERE extname = 'pglogical';
     -- should return row showing pglogical is installed
    
  2. Create the replication user that AWS DMS will use to log in, and grant it the appropriate roles (a quick verification query follows this list):

     CREATE USER repl_user_from_aws WITH REPLICATION IN ROLE cloudsqlsuperuser LOGIN PASSWORD '<generated password>';
     -- permit AWS' access to pglogical tables
    
     GRANT USAGE ON SCHEMA pglogical TO repl_user_from_aws;
    
     GRANT SELECT ON ALL TABLES IN SCHEMA pglogical TO repl_user_from_aws;
     -- permit AWS' read access to all our tables
    
     GRANT USAGE ON SCHEMA public TO repl_user_from_aws;
    
     GRANT SELECT ON ALL TABLES IN SCHEMA public TO repl_user_from_aws;
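
As a quick sanity check before handing these credentials to DMS, you can confirm the user was created with the right attributes (my own verification query, not strictly required):

     -- rolreplication and rolcanlogin should both be true
     SELECT rolname, rolreplication, rolcanlogin
     FROM pg_roles
     WHERE rolname = 'repl_user_from_aws';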
    

II. Prep the AWS RDS target database

  1. Set the following DB parameters in the parameter group to switch on logical replication

     rds.logical_replication = 1
     session_replication_role = replica
    
  2. Setup the following PG extensions:

     CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
     CREATE EXTENSION IF NOT EXISTS pglogical;
    
  3. Fetch the DB schema from the source database and create it on the target RDS DB.

     pg_dump $GCP_DB_ADMIN --schema-only --no-privileges --no-subscriptions --no-publications -Fd -f ./db_schema
    
     pg_restore -Fd -v -n public --single-transaction -s --no-privileges -h <target-db-host> -U <target-db-username> -p 5432 -d <target-db-name> ./db_schema
    
     # pg_restore: connecting to database for restore
     # Password:
     # pg_restore: creating TABLE "public.****"
     # pg_restore: creating TABLE "public.****"
     # pg_restore: creating TABLE "public.****"
     # …
    
  4. Validate table schema equivalency between GCP DB and AWS RDS DB

     --- Show columns, indices and constraints counts
    
     ---- column count
     SELECT table_name, count(column_name) AS column_count FROM information_schema.columns WHERE table_schema = 'public' GROUP BY table_name ORDER BY table_name;
    
     ---- index count
     SELECT tablename AS table_name, indexname, COUNT(indexname) AS index_count FROM pg_indexes WHERE schemaname = 'public' GROUP BY indexname, tablename ORDER BY tablename;
    
     ---- constraint count
     SELECT conrelid::regclass AS table_name, COUNT(conname) AS constraint_count FROM pg_constraint WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public') GROUP BY table_name ORDER BY table_name;
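
Once the parameter group changes from step 1 have been applied (static parameters need a reboot), it's worth confirming logical decoding is actually on before wiring up DMS. A quick check I'd suggest, using standard Postgres settings only:

     -- rds.logical_replication = 1 should result in wal_level = logical
     SHOW wal_level;

     SELECT name, setting
     FROM pg_settings
     WHERE name IN ('wal_level', 'session_replication_role');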
    

III. Initialize migration with AWS DMS

  1. Via the AWS console, create the DMS replication instance with the following settings. Make sure to use the same VPC as the target RDS DB if one exists; otherwise, place it in the default VPC.

     name: staging-migrator
     size: dms.t3.medium
     storage: 5 GiB
     vpc: vpc-xxxxxx
     subnet group: default
     high availability: prod (for the prod migration), dev (for the staging migration)
     publicly accessible: true
    
    Setting it to be publicly accessible makes your database reachable from the internet. Since AWS and GCP are separate cloud providers, this is the easiest way to let them connect, short of setting up network peering. Consider whether that's acceptable for your security requirements; for most use cases, enforcing SSL and using a strong password might be enough.
  2. Create the source endpoint using the GCP database's credentials.

     name: prod-gcp-db
     engine: PostgreSQL
     server name: <Gigalixir DB host>
     port: 5432
     username: repl_user_from_aws
     password: <repl_user_from_aws password>
     ssl mode: require
     database name: <db name>
    

  3. Create the target endpoint using the RDS DB credentials. You can use the raw credentials or, since the DB is on RDS, create and use an AWS Secrets Manager secret like I did.

  4. Create a migration task using the source and target endpoints from the steps above, setting the migration type to Migrate existing data and replicate ongoing changes.

    💡
    Enable Cloudwatch logging so it's easier to debug in case it halts or there are conflicts.

    Specify the task settings and table mapping rules shown in this gist.

Once it starts, the task performs a full load before proceeding to ongoing replication, where changes stream from GCP as they happen. Once it is in ongoing replication, it's time to move to the cutover stage.
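
While the task is in ongoing replication, you can keep an eye on the replication slot that gets created on the GCP source and roughly how far behind it is (a sketch assuming Postgres 10+; the slot name will vary):

     -- Replication slots on the source and their approximate lag
     SELECT slot_name,
            active,
            pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
     FROM pg_replication_slots;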

IV. Cutover

  1. Once the full load of all the tables is completed and it switches to ongoing replication, validate that the target RDS DB tables match the source:

     --- show the row counts for each table.
     --- https://stackoverflow.com/a/2611745/7558991
     WITH tbl AS (
       SELECT table_schema, table_name
       FROM information_schema.tables
       WHERE table_name NOT LIKE 'pg_%'
         AND table_schema IN ('public')
     )
     SELECT table_schema,
            table_name,
            (xpath('/row/c/text()',
                   query_to_xml(format('select count(*) as c from %I.%I', table_schema, table_name),
                                FALSE, TRUE, '')))[1]::text::int AS rows_n
     FROM tbl
     ORDER BY table_name;
    

    Proceed to cutting over the application only if the row counts match.

  2. Switch over your application configs to use the new DB connection string postgres://<user>:<password>@<host>:5432/my_app.

  3. Once your application is switched over to the RDS DB, stop the DMS migration task.

  4. Celebrate (cautiously).

V. Cleanup

  1. Drop the replication user and AWS DMS objects from your GCP database

     GRANT ALL ON awsdms_ddl_audit TO public;
     GRANT ALL ON awsdms_ddl_audit_c_key_seq TO public;
     ALTER TABLE awsdms_ddl_audit OWNER TO gigalixir_admin;
     REVOKE ALL ON ALL TABLES IN SCHEMA public, pglogical FROM repl_user_from_aws;
     REVOKE ALL ON SCHEMA public, pglogical FROM repl_user_from_aws;
     DROP FUNCTION awsdms_intercept_ddl CASCADE;
     DROP USER repl_user_from_aws;
     DROP TABLE awsdms_ddl_audit CASCADE;
    
  2. Revert the RDS DB parameters:

     rds.logical_replication = 0
     session_replication_role = origin
    
  3. Remove the DMS replication instance and source & target endpoints.
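
Finally, it's worth double-checking that no replication slots were left behind on the GCP source, since an inactive slot keeps WAL around and will slowly eat disk:

     -- There should be no leftover DMS/pglogical slots after cleanup
     SELECT slot_name, plugin, active
     FROM pg_replication_slots;

     -- If one lingers, drop it explicitly:
     -- SELECT pg_drop_replication_slot('<slot_name>');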

Evaluation and validation examples

Here's what the output of some of the validation commands looked like for me:

  1. Confirming schema equivalency between source and target

    source:

    target:

  2. Row counts after migration is complete

    source:

    target:

Conclusion

That's how I did it. Let me know your thoughts and alternative approaches you might have in mind.