October 17, 2020

I deleted the production database by accident 😱


Today at around 10:45pm, after a couple of glasses of red wine, I deleted the production database by accident 😨.

Thankfully my database is a managed database from DigitalOcean, which means that DigitalOcean automatically do backups once a day. After 5 minutes of hand-wringing and panic, We went into maintenance mode and were able to restore a backup. At around 11:15pm, 30 minutes after the disaster, we went back online, however 7 hours of scoreboard data was gone forever

Here's the full story.

  1. 4

    It's sad, it happens, more frequent than you'd head aloud

    With all the protections in place things still happen.

    Did you separate a DDL vs DML user? (app shouldn't be allowed to change schema, only a deployment user)
    You'r production doesn't use a server on localhost? are you sure it's that part of the code than? unlikely..

    config.DevelopmentConfig a bit weird, don't you have some config.Env to check?..

    Did you have someone connecting from their dev env to a production DB server? (hopefully this is blocked)
    Or do you run things on the production nodes manually?
    Were you trying to debug something in production?

    1. 1

      These are all great suggestions, do you have more written out (maybe w/ more detail)?

      A pre-deploy checklist or todo/readme to refer to would be great

      1. 2

        mmm... I don't know where all of these come from, surely there is some rules of thumb security list, I kinda know them part from schools bigger part from work exp..
        It mostly comes from the base "Principle of Least Privilege" - if it doesn't absolutely need a privilege/permission, it shouldn't have it

        • DB users, app should have it's own user, never on root/admin, should only have select, insert, update and delete could be argued about, depending on system design. (honestly it's hard to remove delete in reality, but it's awesome if you can)
        • For DB schema changes should have it's own user with create/update table, commonly a deployment user if you have proper deploy process, preferably it doesn't need a drop table permission, due it might be considered. It shouldn't be able to query or update the data inside the table as it's not used for that, will show issues if the streams are crossed.
        • The root/admin user is only used to admin other users. Even that you might create it's own user if you have scale in terms of people...

        Never cross the streams
        Production env. should be as locked down as possible in terms of who can access it and how it can be accessed. (beware due of locking yourself out accidentally, and you might block your own app 1st time if you do it badly ^^)

        • Production DB should only allow access from the app servers. Depending on your architecture on how to implement, but for sure it should never be available freely over the net by ip..
        • Your app servers shouldn't be accessible directly, only from a load balancer and only to pre-defined end points with specific protocols, commonly the LB should deal with SSL by the client.
        • The only way a developer should be able to access anything in production which isn't accessible by a public user should be with SSH.
        • No internal tool should be exposed to the internet, if you run any DB admin tool, it should only be made available using a VPN or SSH Tunneling. So should any reporting or other non end user tool.

        This might be partial but it's the big tickets of access lock down to prevent issues with production and production data.

        1. 1

          This is great! I will try and implement some of these once I've figured out what went wrong

  2. 3

    I've faced such situations too 😅

    Which is why now I created this docker image using RCLONE and crontab that syncs up Mysql database and Minio storage to Google Drive (for free!)

    One can easily customize the bash scripts to fit specific needs

    Now all my servers have 30 min backups that I can quickly restore :D .. + backup service is free!

  3. 2

    I've come close to it, but for a lesson I learned from a Hacker News thread a while back. I think it was when that one employee at Gitlab accidentally deleted a production database which caused 24 hours of chaos. There was one commenter on there who said that they have mandated within their own organisation, that before any database query is run, the operator has to pause and verbally explain (even if to themselves sitting at the console) in plain English what the command is going to do.

    So, if you type in:

    DELETE FROM dbproduction.user WHERE last_login < DATE(2020, 06, 01)

    Then before hitting the 'Execute' button, you pause and say "I am about to delete users from the production database where their last login date was before 1st of June this year".

    It is amazing how that focuses you, and any distractions by co-workers or emails or phone notifications get pushed aside as you listen to your own words. It really works to bring your mind back to what you are about to do.

    Another we've done here is to have a separate db login for general viewing and maintenance on our database that will not allow a DELETE or DROP command to be run. For that, we have to use a separate login where the password has to be looked up in our password management system, and that password is never remembers, and rotated frequently.

    1. 1

      Very interesting! Thanks.

      The thing is, I wasn't writing raw SQL.

  4. 2

    I have created a corn job on my servers which backup the date every hour and delete 12 old backup.

    So if anything like this happens my risk is just of data create last one hour.

    Risk well managed ?

    1. 1

      What if you only discover the issue after more than 12 hours? Do you have a daily backup as well?

      1. 1

        Yes...I am sorry I confused you in my comments...

        It deletes 24 hour old backup . Not 12....

        Technically deleting functionality runs once in a day at a particular time... So programmatically it generally end up delete around 12 old backups at a time.

  5. 1

    I know that pain. I once lost a production database by stopping an EC2 instance. The MySQL DB was on that same box. I didn't realize that stopping the instance wipes all of the data ("restarting" does not).

    I was trying to save costs by not using a managed database (RDS), and had to learn the hard way.

  6. 1

    We've all been there - we've all done that. Growing by mistakes also means that those things hopefully only happen once during a career :)

  7. 1

    How you did that? could you tell us the story?

  8. 1

    Sobered up pretty quickly I imagine.

  9. 1

    That's why the backup plays a very important role... We often think that, Whats the point to pay for automatic backup or infrastructure security. But this couple of bucks saved the day.. Cheers!

  10. 1

    It happens man..!
    One of my colleagues at last startup accidently rhn the command
    Sudo rm -rf /
    He deleted the whole server.

  11. 1

    Are you using environment variables for database connexion uri/username/password? Or did you put them in a config file?

    You should use environment variables, so developers have localhost variables configured, but production variables are only used on the production server.

    If anyone needs access to the production database, just give them read-only access and force them to use a GUI. It should be forbidden to have production access in your local env variables.

    1. 1

      Yes, the problem is I have production ENV variables accessible on my dev machine. I need a change of policy.

  12. 1

    That sucks , but things like that happen and your users should understand!

    I backup my production database hourly , I wonder why digital ocean doesn’t offer hourly backups on managed databases?

Recommended Posts