5 STEPS TO CHOOSING A DATABASE FOR YOUR APPLICATION
You’ve just been handed a shiny new project. After days of deliberation on requirements, the specs have finally been frozen, at least for now. Now is the time to put in place the high-level building blocks of what will be called the “System Architecture”: the choice of technology, the front-end frameworks and, of course, the Database!
Choosing the right Database can be a bit overwhelming, given all the choices available today. The DB-Engines Ranking lists more than 350 actively maintained databases to select from, all offering slightly different solutions to the same problem: storing and retrieving (largely) text-based data! There’s really no single right or wrong answer, because databases are not all created equal and every selection has its pros and cons.
Consideration #1: Let NFRs (Non-Functional Requirements) show the way
The initial inputs for database selection should focus on supporting the right Structure, Size and Speed to meet the needs of your Application.
Structure matters because you need to store and retrieve data in a wide variety of formats. What kinds of data structures are inherent in the application, and how do they map to the choice of database? A mismatch here causes significant rework later.
Size is about the quantity of data to be stored. How much data can the database comfortably handle before performance starts to deteriorate? The answer also determines how much you will depend on the database’s support for sharding and partitioning.
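To make the Size question concrete, a back-of-the-envelope estimate like the one below can help. It is only a sketch: the function name, the workload figures and the 1.5x overhead factor (an assumed allowance for indexes and similar overhead) are all hypothetical placeholders to be replaced with numbers from your own requirements.

```python
def projected_size_gb(rows_per_day: int, avg_row_bytes: int,
                      retention_days: int, overhead_factor: float = 1.5) -> float:
    """Rough storage estimate in gigabytes (1 GB = 10^9 bytes).

    overhead_factor is an assumed allowance for indexes and similar overhead,
    not a measured value for any particular database.
    """
    raw_bytes = rows_per_day * avg_row_bytes * retention_days
    return raw_bytes * overhead_factor / 1_000_000_000


# Hypothetical workload: 2 million rows a day, 500 bytes each, kept for a year
estimate = projected_size_gb(2_000_000, 500, 365)
print(f"{estimate:.1f} GB")
```

Running the same calculation for a few growth scenarios gives a quick feel for whether a single node is likely to be enough.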
Speed is the time taken to service your Application’s read and write requests. Some databases are optimized for fast reads, while others are designed for fast writes. The Application’s requirements should dictate which matters more, and whether availability is a primary concern as well.
Consideration #2: Proprietary or Open Source?
A mundane question that still needs to be answered is: are you prepared to pay for your database? But hang on: aren’t there fabulous open-source databases that you can use for free? Of course. But if you’d rather have the peace of mind that comes with a contractual uptime guarantee (99.9999%, say) and a support number you can call even at midnight, then you’d want to go for a commercial license.
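It helps to translate an uptime percentage into the downtime it actually permits. The short sketch below does that arithmetic for a 365-day year; the function name and the list of percentages are illustrative choices, not terms from any vendor’s contract.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def allowed_downtime_seconds(uptime_percent: float) -> float:
    """Seconds of downtime per year that a given uptime percentage permits."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)


for sla in (99.9, 99.99, 99.9999):
    seconds = allowed_downtime_seconds(sla)
    print(f"{sla}% uptime allows about {seconds:.1f} s of downtime per year")
```

At 99.9999% the budget works out to roughly 31.5 seconds a year, while 99.9% leaves close to nine hours, so the number of nines in the guarantee matters a great deal.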
Consideration #3: Cloud or otherwise?
Getting a database running is dead simple: just install it, start the service and your database springs into action. But managing a database in production is hard: partitioning, patching and updates, creating read and write replicas, continuous backups without downtime, monitoring. It’s a lot of work!
So, an alternative could be to use serverless or “backend-as-a-service” databases. These are proprietary offerings such as DynamoDB from AWS, Cosmos DB from Azure or Firebase from Google. It’s as simple as connecting to an endpoint and storing your data.
Or else, use a managed cloud service that hosts a popular open-source database for you. AWS RDS, for example, offers MySQL and PostgreSQL instances as a fully managed service. Even here, you will still need to make decisions such as “when should I back up the database?” or “how many database instances do I need?”
Consideration #4: Let the Data Model prevail
One of the most important considerations is to let the Data Model guide the choice. What does your data look like? Is it highly structured, or voluminous? What kinds of operations need to be performed on it (full-text search, for example)?
Relational Databases have been around for ages and have influenced every database in use today. They are good at storing highly structured data and can scale up well. Relational Databases such as PostgreSQL and MySQL are mature and offer the expressive SQL (Structured Query Language) to access data. They are considered rather inflexible because the schema must be defined up front, although normalization lets them avoid duplicating data. Proponents of other database types argue that storage has become so cheap that this tradeoff (flexibility vs. storage) doesn’t make sense any more.
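Here is a minimal sketch of the relational idea, using Python’s built-in sqlite3 module with an in-memory database so it runs anywhere. The customers and orders tables and their contents are made up for illustration. Each customer’s name is stored once, and a SQL join brings it together with that customer’s orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Fixed schema: customer names live in one table, orders refer to them by id
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# One declarative query joins and aggregates across both tables
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Asha', 2, 65.0), ('Ben', 1, 15.0)]
```

The price of this tidiness is that adding a new kind of attribute means changing the table definition first.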
Document-based or Object-based NoSQL Databases store each record as a single, loosely structured document in a popular data format such as JSON, with no fixed schema required, thus providing extreme flexibility for developers. Typically used for semi-structured data, this storage mechanism offers great simplicity and massive scalability for both reads and writes. Access methods vary widely, from RESTful HTTP APIs as in CouchDB to dedicated query languages and MapReduce-style operations as in MongoDB. The flexibility comes at a price: queries that span differently shaped documents are harder to write, and this needs to be offset by developer discipline and vigilance, including designing and documenting the schema for each collection in the application itself.
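The tradeoff can be illustrated without any database at all. In the sketch below, plain Python dictionaries stand in for documents in a hypothetical products collection; the field names and the find_by_min_ram helper are invented for the example and do not correspond to any particular product’s API.

```python
import json

# Two documents in the same collection with entirely different shapes
products = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16, "cpu": "8-core"}},
    {"_id": 2, "name": "T-shirt", "sizes": ["S", "M", "L"]},
]


def find_by_min_ram(docs, min_ram_gb):
    """Return names of products with at least min_ram_gb of RAM.

    The query has to tolerate documents that lack the field entirely,
    which is the discipline a schemaless store pushes onto the developer.
    """
    return [
        d["name"]
        for d in docs
        if d.get("specs", {}).get("ram_gb", 0) >= min_ram_gb
    ]


matches = find_by_min_ram(products, 8)
print(matches)                    # ['Laptop']
print(json.dumps(products[1]))    # each document serializes independently
```

Storing the T-shirt needed no schema change, but every query now has to decide what a missing field means.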
Graph Databases offer a great compromise between structured tables and loose entities: nodes represent entities, and edges represent the relationships between them. Graph-capable databases like Azure Cosmos DB (through its Gremlin API) typically come with a feature-rich set of tools for querying, evaluating and traversing complex networks. Usually used for highly interconnected data, they provide flexibility as well as structure. Graph Databases are not recommended for simple use cases, where they add unnecessary overhead. There is also the additional challenge of learning to think in graph terms.
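To get a feel for “thinking in graph terms”, the sketch below models a tiny, made-up follower network as an adjacency list and answers a typical graph question: who can be reached from a given person within a set number of hops? It uses a plain breadth-first search in Python rather than a real graph query language, so treat it as an illustration of the data model only.

```python
from collections import deque

# Nodes are people; each edge means "follows" (hypothetical data)
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the start node is not a result
    return distance


hops = reachable_within(follows, "alice", 2)
print(hops)  # {'bob': 1, 'carol': 1, 'dave': 2}
```

In a relational model the same question would need one self-join per hop, which is where a graph model starts to pay off.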
Wide-Column Databases like Cassandra and HBase look relational, with tables, rows and columns, but they are not: each row can hold its own set of columns, and rows are located by key, a layout that is highly optimized for retrieving data by key even across massive amounts of data. In Cassandra, tables live in a keyspace, which plays roughly the role that a schema or database plays in a relational system. The result has traits of both worlds: key-value access with a table-like structure. They scale horizontally with ease and handle high write volumes well. However, they are usually slower than Relational databases for ad hoc queries, joins and aggregations that do not follow the row key, so tables are typically designed around the queries you plan to run.
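The wide-column layout can be pictured as a map from row key to a map of columns. The sketch below models that shape with nested Python dictionaries; the put and get_row helpers and the sample rows are hypothetical and ignore everything a real system adds on top (partitioning across nodes, timestamps, column families).

```python
from collections import defaultdict

# row key -> {column name -> value}; each row may carry different columns
table = defaultdict(dict)


def put(row_key, column, value):
    """Write a single cell; no schema change is needed for a new column."""
    table[row_key][column] = value


def get_row(row_key):
    """Fetch a whole row by key, the access path this layout is built for."""
    return dict(table.get(row_key, {}))


put("user:1", "name", "Asha")
put("user:1", "email", "asha@example.com")
put("user:2", "name", "Ben")
put("user:2", "last_login", "2024-01-01")  # a column user:1 does not have

row_1 = get_row("user:1")
row_2 = get_row("user:2")
print(row_1)
print(row_2)
```

Lookups by row key are a single dictionary access here, while a question such as “which users have an email?” means scanning every row, which mirrors the query-driven design these databases ask for.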
Time-Series Databases do one thing and they do it well. They store two-dimensional data: a timestamp and, usually, one more value recorded at that time. Applications in which only part of the data is time-series might be served just as well by a Relational database.
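The typical time-series operation is rolling up (timestamp, value) pairs into time buckets. The sketch below does this in plain Python for some invented temperature readings; the hourly_average function is a hypothetical stand-in for what a time-series database offers as a built-in query.

```python
from collections import defaultdict
from datetime import datetime

# Each point is (timestamp, value); the readings are made up
readings = [
    (datetime(2024, 1, 1, 10, 5), 20.0),
    (datetime(2024, 1, 1, 10, 35), 22.0),
    (datetime(2024, 1, 1, 11, 10), 25.0),
]


def hourly_average(points):
    """Group points by the hour they fall in and average each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


averages = hourly_average(readings)
for hour, avg in averages.items():
    print(hour.isoformat(), avg)
```

If this kind of roll-up is only a small part of what your application does, a timestamp column in a Relational table may be all you need.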
Consideration #5: Other Aspects
Once you’ve decided on the key aspects (commercial or open source, on-premise or cloud, and your data model), you still need to take a few more things into account: whether you need real-time querying, your read-to-write ratio, whether you require complex full-text search, or whether you’d rather have a fully programmable data environment such as PostgreSQL. Sometimes it takes a combination of more than one type of database to get what you want, for example a primary database backed by a secondary one, resulting in Polyglot persistence. The Gartner Magic Quadrant might also come in handy in such cases.
Putting it all together: Choosing a Database for your Application
So, here’s how you should go about choosing a Database for your Application:
1. Figure out Non-Functional Requirements such as the amount of data you need to store, the speed & scaling requirements and also the structure of data
2. Decide the Licensing Model that your project is willing to adopt: Proprietary or Open Source
3. Consider Hosting Options: On-Prem or Cloud, and whether “as a service” databases would help
4. Model your Data to determine if a relational, document, wide-column, or graph database is the most appropriate for your needs
5. Consider Additional Aspects such as read-to-write ratio, throughput requirements and the use of multiple databases under different usage patterns