My understanding of containers is as follows:
This is awesome for server code, because the containers are dumb and all do the same job.
But, I've seen MySQL databases inside containers.
So if there are multiple database containers, how do they sync up the data with each other?
I asked someone I knew, and he said it's a single instance.
Can someone ELI5?
You have already received some good information on how this works from the other peeps in this thread. But if you are also looking for advice as to whether or not you should containerize your database, then I would advise the following:
For local development 👇
Are you the only developer? You probably don't need to containerize your database. Just run it locally on your machine. If you are running everything else in docker then it might make sense to containerize your DB as well though for consistency. Otherwise it's mostly just mental overhead IMO.
Are you one of several developers working on the same application? You can consider containerizing your DB to keep the development environment consistent for all the devs and make it easy to get started developing your app.
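As a sketch of that shared-dev-environment setup: a minimal `docker-compose.yml` for a local Postgres database might look like this (the service name, credentials, database name, and volume name are all placeholders you'd pick yourself):

```yaml
# docker-compose.yml -- minimal local dev database (illustrative values only)
services:
  db:
    image: postgres:16          # pin the version the whole team develops against
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev    # fine for local development only
      POSTGRES_DB: app
    ports:
      - "5432:5432"             # publish to the host so local tools can connect
    volumes:
      - db-data:/var/lib/postgresql/data   # persist data across container restarts
volumes:
  db-data:
```

Every dev then runs `docker compose up -d` and gets the same database version and config.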
For production 👇
Don't containerize your DB unless you have a very good reason to do so. I would definitely pay someone else to host your database for you so you don't have to worry about your data going bye-bye unexpectedly, and to make it easy to perform backups and roll-backs. Heroku, AWS RDS and Google Cloud SQL are some good options for example.
This is database-specific and doesn't have anything to do with containers. Containers are a technology to package, isolate and run your application, but how databases replicate is a whole different story. A lot of relational databases can be configured to run as a master or child node. Child nodes replicate data and are read-only. The master node can be used for both read and write access. This is usually the case because relational databases have strong consistency guarantees and therefore only allow one writer at any time. Of course, there are other database systems that allow multi-write access, but they usually don't make such strong consistency claims.
That makes more sense. Thank you.
Do they all have read-only child nodes?
I don't know about MongoDB, but Postgres and MySQL have read-only replication. There are replicators that sit on top of multiple database nodes that allow read and write operations by multiple clients, but I haven't tried that myself.
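For a flavor of what the read-only replication mentioned above looks like in MySQL, here's a hedged config sketch (the server IDs and log name are illustrative; a real setup also needs a replication user and network access between the nodes):

```ini
; source (writable) node -- my.cnf
[mysqld]
server-id = 1
log_bin   = mysql-bin   ; the binary log is what feeds the replicas

; replica (read-only) node -- my.cnf
[mysqld]
server-id = 2           ; must differ from the source's server-id
read_only = ON          ; this node only serves reads
```

The replica is then pointed at the source with `CHANGE REPLICATION SOURCE TO ...` (MySQL 8.0.23+; older versions use `CHANGE MASTER TO`) and started with `START REPLICA`.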
Another bit worth pointing out is that containers are ideally stateless. And if they do have state, they will/should keep that state in volumes. That's usually where data stores (or any stateful container, really) will save their data. That makes scaling up stateless backends (container or not) easy, as you already noticed.
Things get interesting when the data store needs to be scaled up. You might also want to read up on sharding. It's another strategy to distribute data across multiple nodes, where one node is only responsible for part of the data. This can also be combined with replication.
...and that's just the start of it :)
I am not sure you understand 100% what containers are. They are two things at once.
Packaging. You create an image (= archive of files). This solves the distribution question (you can take your image to another system).
Runtime. Docker engine can run your image as a process in isolation (using kernel features like namespaces).
To interact with the host system, you have to explicitly permit things to happen (exposing ports, accessing /certain/location/on/filesystem).
This brings us to databases.
Databases need some disk space to save data. When you run a database in a container, the data in the container will be lost once you remove the container. That's why you'll need to bind mount (-v) a volume (disk space from the host system).
To access a database you need to access its port. For that you either have to map this port to a host system port (-p) or let the container share the network (--network host).
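As a concrete sketch of those two flags (the host path, container name, and password below are placeholders, not recommendations):

```shell
# Bind mount host disk space for the data directory (-v) and publish the port (-p).
# /srv/mysql-data and the password are placeholders -- pick your own.
docker run -d --name mydb \
  -e MYSQL_ROOT_PASSWORD=changeme \
  -v /srv/mysql-data:/var/lib/mysql \
  -p 3306:3306 \
  mysql:8.0
# Clients on the host can now reach the database at localhost:3306,
# and the data files live on the host in /srv/mysql-data.
```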
So imagine you run two database containers. If they have different bindings, they are different databases. If they point to the same space, then two different database processes work with the same disk space without knowing about each other.
Depending on the database, that could be a problem. Some data stores can form a distributed cluster, but to sync their access you'll need their containers to share network and find each other.
In case of MySQL and PostgreSQL, you mostly want to do replicas or sharding. For replicas you want to have read-only followers, for sharding each node will care about one part of the database. In both those cases you won't be sharing the same disk space, but rather just connecting the nodes with the network so they know about each other.
In my book I recommend installing a database on your host from a proven system package, and scale vertically for as long as you can ;).
Similar to some of the reasoning in Ludwig's response concerning production, I would leverage "cloud native" database services.
On AWS, these are things like Aurora (where you can choose MySQL or Postgres) or DynamoDB (a NoSQL option). Managing a database is difficult, but these types of services abstract away much of the hard parts. I think doing this empowers you, as a developer, to not be afraid to use databases exactly how you want. It allows you to do small experiments with new databases and much more easily use polyglot persistence. Generally, you won't have access to the "bleeding edge" features of the newest versions of the databases, so that is something that needs to be considered on an app-by-app (or even service-by-service) basis.
The cost analysis for this choice is always difficult. You are comparing the cost of paying for the cloud native hosting (which is a relatively easy estimate) versus the cost of development time and overall architecture sacrifices you will make when creating a new database or table or scaling a database across AZs or regions (which is very difficult to estimate).
I would also look into leveraging all the containerization, orchestration, kubernetes-ation services your cloud provider offers. For me, I want as much of my focus (and my engineers' focus) to be on my product and the code as possible. I offload as much of the infrastructure worries as I can to my cloud provider.
A common pattern I see is people using containers locally for their database (to ensure everyone is testing against the correct version/configuration), but then in production use something like Amazon's RDS service.
RDS (and other similar services) can handle replication, backups, scaling, and other production concerns for you.
From what I've seen, containerizing your database in production doesn't really give you many benefits at small-to-medium scales.
Thank you all for the wonderful answers, I think I have a better idea 💡 now.
Think of it like any other application, except here the database is the application.
Your application CAN have state which is stored on the filesystem, like most databases do. What happens when you destroy the container and recreate it? That state disappears.
Same thing with databases. So most people mount volumes onto the container to persist the database state across containers.
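You can see this directly with a throwaway container (the volume, container names, and password below are arbitrary):

```shell
# Without a volume: the data dies with the container.
docker run --name tmp-db -e POSTGRES_PASSWORD=pw -d postgres:16
docker rm -f tmp-db   # everything written inside is gone

# With a named volume: the data directory outlives the container.
docker volume create pgdata
docker run --name db1 -v pgdata:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=pw -d postgres:16
docker rm -f db1
# A new container mounting the same volume sees the old data:
docker run --name db2 -v pgdata:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=pw -d postgres:16
```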