The best long-brief history of Database Management Systems
To comply with Medium's rules, I disclose my affiliation with the TryDB service. Texts, links, images, and resources mentioning other parties are used without any endorsement or compensation from them.
I called this text a long-brief history because, yes, it's quite long compared to an average Medium article and contains a lot of information. On the other hand, it's a brief history: it lacks many details, because even a whole 1000-page book would not be enough to contain them all. Still, I call it the best, not because I wrote it, but because the information I present here filled the knowledge gap my friends and I had. As part of a small team, I develop a service that democratizes the way we interact with database management systems, and we needed a basis for our feature and design choices. We felt we had to inspect DBMS history as a whole to get the insights that should guide our decisions. But it was hard to find a single good source of information, since most of them were either too small and general, or too big and at the same time too focused on a particular theme. We needed a good survey covering both history and modern times, and we failed to find one. So we had to dig into many sources and process a lot of data to get a general understanding of what has happened in the DBMS field over the last 60 years. As a result, we have written this text, which we would have been glad to have during our initial research, and we hope it might be useful to anyone who needs a good overview of the DBMS landscape.
You will find this text informal, a bit opinionated, and possibly fun, and that is exactly how we wanted to write it, so that you don't use it as a sleeping pill. Go!
Do we need a DBMS?
Strange question? Or not? I bet you have answered "NO" at least once in practice. I mean those experiments during programming studies, when you tried to save and load some data on the filesystem in a unique self-invented format with a bunch of hacky optimizations, which you thought would be great for your task. Finishing the data manipulation code brought you great satisfaction! You tried to avoid a complex DBMS for simplicity's sake, but what have you actually done? You got a collection of custom data organized in some way, and by definition it could be called a database. You created a programming interface to manage it, which in turn can be called a management system. "Wow!" you think, "I have created a DBMS!" Well, definitely yes, and your custom-tailored system was probably nice for your particular task. But it was unsuitable for general usage, not to mention its poor efficiency, reliability, and security. It was OK for a small learning-stage task, but not even close to production usage.
So the question in the title of this chapter is meaningless, since in most cases, when you need to store data, you are using a DBMS of some kind, explicitly or implicitly. The real question that should worry you, with a production system in mind, is this:
Which DBMS should we use to fulfill all the system requirements?
This question doesn't have a simple right answer, and it becomes even harder to answer when you look at the current state of the DBMS landscape: there are already hundreds of products on the market, with dozens of new ones released every year.
As the title of this text suggests, I will focus on the historical aspect to help you get the right feeling for the modern state of this field. I have split the whole history into several "Eras", whose names refer to human history. We will start from the Ancient Era, and we will finish with the seven trends leading us to the future of database systems.
Ancient Era (1960's)
In the Ancient Era, you navigated the computer. In Modern times, the computer navigates you.
This joke reflects what happened in the 60's, when Navigational DBMS (NDBMS) appeared. Engineers understood databases as raw data structures like trees or graphs of linked memory areas with the information inside. NDBMS allowed only simple operations, the most notorious of which was "navigation" between nodes by memory pointers, which played the role of links. These systems were a huge step forward, since they allowed fast queries over arbitrary data, whereas previous data processing systems only allowed sequential access to the full list of records, so that every search began at the start of the list and had to pass through all records sequentially, which was obviously slow.
The IDS system, which is often presented as the first DBMS, was created by Charles Bachman and released in 1964. It had a "network" data model, which means that data was stored in a graph-like structure of linked nodes (we will refer to this network subclass as NnDBMS later). To get some data, you had to first select the starting node (or nodes) by an identifier, get the pointers of the linked nodes and jump like this until you got all the data you needed. The important thing is that you had to do it with low-level, unsafe, imperative code with direct memory manipulation; there was no specialized declarative query language available. The query itself was an algorithm, written by a programmer, which described how to traverse the nodes.
In 1968 IBM released another NDBMS called IMS, which had a different, "hierarchical" storage model (NhDBMS). Data was organized as a tree of nodes, where each node could be linked to exactly one parent and could have many children, very similar to what we see now in a filesystem viewer. IMS was stricter than IDS in this sense, but the basic idea remained the same: to get some data, we had to traverse a set of nodes using a low-level imperative approach.
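To get a feeling for what "navigation" meant in practice, here is a minimal toy sketch in Python. It is not real IDS or IMS code, just an illustration of the idea under my own assumptions: the "query" is ordinary imperative code written by the programmer that follows links from record to record.

```python
# A toy illustration (not real IDS/IMS code): records linked by "pointers"
# that the programmer must follow manually, one hop at a time.

class Record:
    def __init__(self, data):
        self.data = data
        self.links = []          # "pointers" to related records

# Build a tiny network: a customer linked to its orders.
customer = Record({"name": "ACME Corp"})
order1 = Record({"order_id": 1, "total": 100})
order2 = Record({"order_id": 2, "total": 250})
customer.links.extend([order1, order2])

# The "query" is an algorithm: start from a known record and navigate
# link by link, collecting what you need along the way.
def total_for(customer_record):
    total = 0
    for linked in customer_record.links:   # follow each pointer
        total += linked.data["total"]
    return total

print(total_for(customer))   # 350
```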
IMS and IDS offered efficient instruments to operate on data, but at the same time they lacked convenience in data manipulation. It was hard to read and modify information in these systems. To understand how ancient those systems are, consider that the famous Quicksort algorithm, which you probably thought was invented somewhere between the death of the last dinosaur and the creation of the javelin, was actually published just three years before IDS was released! If you want to dive deeper and feel the atmosphere of that time, I strongly recommend an interview with Charles Bachman, the creator of IDS, where he summed up his career.
Medieval Era (1970-late 1980's)
The Medieval Era was a time of great battles. The same holds for the history of DBMS, in which one of its most significant battles took place. As we said before, the defining feature of an Ancient Era DBMS was a view of data as just a continuation of the algorithm that operates over it. Data was not considered distinct from code and valuable by itself, but rather a part of an imperative sequence of steps to achieve some goal. In this historical context, one guy from IBM did not agree with the situation. He thought that data is something different, autonomous, and essential by itself, which should have implications not only for how we process it but even for the way we think of it. This guy was Mr. Edgar Codd. I tried to find any information about his activities during the protests of 1968. I failed, so I assume he focused all his revolutionary energy on the DBMS field alone. And to change the DBMS world of that time, great energy was needed for sure. IMS and IDS had very strong positions on the market, and IMS was even marketed as the solution that had brought the first man to the Moon, definitely not the easiest stone to move. Also, in 1969 the so-called CODASYL group released a language specification for data manipulation based on the ideas of IDS and NnDBMS overall. It looked like a routine action, but the goal was far from routine: to make the standard for all future data processing systems, the one and only. The intention was not evil in any way, but rather an attempt to give engineers a solid foundation for their creations… and to establish a monopoly of the NDBMS for the good of their creators.
In 1970 Codd published a small paper, "A Relational Model of Data for Large Shared Data Banks," which introduced the new "relational" approach. The main idea was to abstract away the hardware and software details of the target system in favor of pure data manipulation. Codd proposed grouping information into "tables" where rows contain distinct entities and columns contain their data attributes. The column properties in a particular table were fixed, which was referred to as the table schema and allowed one to better conceptualize the type of entities stored in the table. Relations between entities were easily set using identifiers inside the appropriate columns, without the need for special mechanisms based on low-level memory pointer manipulation, as previous systems had. The relational model was a data-first approach, which abstracted the hardware, operating system, and low-level details of memory organization away, leaving only the domain data to work with. It was much easier and much more maintainable in the long run.
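To make the contrast with the navigational sketch above concrete, here is a minimal relational sketch using Python's built-in sqlite3 module; the table names and data are made up for the example. Notice that the relation is just an identifier in a column and the query is declarative: it says what we want, not how to navigate to it.

```python
import sqlite3

# Minimal illustration of the relational model with Python's built-in sqlite3;
# table names and data are made up for the example.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fixed column properties per table: the schema.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

cur.execute("INSERT INTO customers VALUES (1, 'ACME Corp')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 100.0), (2, 1, 250.0)])

# Relations are expressed with identifiers in columns, not memory pointers,
# and the query is declarative: WHAT we want, not HOW to reach it.
cur.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(cur.fetchall())   # [('ACME Corp', 350.0)]
```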
The relational model later turned the database world upside down. But not in a moment. IBM was not interested in Codd's ideas at first, because they had already invested much in IMS and didn't want to shake those investments in any way. Codd's ideas first spread in the research and engineering community, and only after conquering minds did they become the mainstream reality of the DBMS of the 80's. In 1973 IBM finally started to develop the prototype of their Relational DBMS (RDBMS), System R, and also a query language, which became what we now know as SQL. But IBM failed to be the first company to release a commercial RDBMS; their DB2, a cousin of System R, was shown to the world only in 1983. The first commercial RDBMS was Oracle, released in 1979. Oracle took a huge share of the fast-growing market and is still a king today.
There was also a relational hero in the "open source" field of that time: Ingres, which appeared in 1974 as a research project inspired by Codd's ideas. It was not open source in the way we are used to now, but if you paid a small fee, the developers promised to open the code for you to dig into (now you can simply get it on GitHub). At the beginning of the 80's Ingres was considered a competitor to Oracle, since functionally they were very close at that time. In 1980 copies of the Ingres source code were distributed among US universities, which led to the rapid proliferation of relational ideas, and many relational DBMS projects started as a consequence.
In addition to the other problems of NDBMS, there was one which probably made the greatest contribution to their fall. NDBMS were bound to specialized hardware, mainframes, which were very expensive and hard to maintain. It was a stack for "serious" data, which at the same time implied a very expensive vendor lock-in. Unlike NDBMS, the first commercial relational systems focused on the minicomputer market, which was growing very fast back then and offered rising portability opportunities. Oracle, for example, initially targeted the famous VAXes.
The relational paradigm became so extremely popular that DBMS creators tried to attach a "relational" label to their products no matter how "relational" those systems actually were. In 1985 Codd even published a paper, "Is Your DBMS Really Relational?", where he postulated his famous twelve principles of what a relational DBMS is, just in case some marketer tried to trick the public again. Another important date is the standardization of SQL in 1986, which stopped the war of custom query languages for relational systems and made SQL the lingua franca for software developers.
Since the mid-80's the leading role of the relational paradigm has been beyond question, but in the 70's NDBMS still tried to compete. There were a number of direct debates at conferences between Codd's and CODASYL's supporters, the most famous of which, between Codd and Bachman, took place at the SIGMOD conference in 1974. One of the main arguments of the CODASYL advocates was performance: since NDBMS worked at a low level with raw data, they were really fast compared to their RDBMS counterparts. During the 70's a number of NDBMS releases occurred: IDMS in 1973, which was based on the CODASYL specification, IDS/II in 1975, and IMS/II in 1976.
By the 80's the battle was over and relational ideas had won for general-purpose database management. I'm saying general-purpose because for some specific domains NDBMS are still in use even nowadays! During the 80's many DBMS based on relational ideas were released, including the most famous of today: PostgreSQL (1989), MS SQL Server (1989), and MySQL released a bit later in 1995. At that time relational ideas became what the CODASYL ideas had wanted to become before: the mainstream paradigm of database management, a kind of religion in the DBMS field.
The interesting thing about that time was the presence of alternative approaches in different domains. You might assume they could be called "heretics", following the idea that the relational paradigm had become the main religion. But the truth is they were mostly domain-specific systems, some of which hold their niches to this day, and they did not compete with mainstream enterprise systems directly. One such approach was MUMPS, a programming language with embedded database capabilities, which was developed initially for the healthcare domain. The ANSI standard for MUMPS was approved in 1977, about 10 years before SQL was first standardized. MUMPS spinoffs like GT.M have been broadly used in the healthcare and banking domains since the 70's. Other examples of domain-specific DBMS were AceDB and HGDBMS, which focused on the storage and processing of genetic data. I'm pretty sure there were more domain-specific DBMS released at that time, probably for the chemical or nuclear physics domains, but I haven't found any clues. If you know such systems, please leave a comment with a link.
I also have to mention ADABAS, released in 1971. It was a Multivalue DBMS, initially based on an inverted index, and in the 80's elements of the relational model were added. ADABAS was very popular back in the 70's; however, it lost most of its popularity after relational databases came to power.
Renaissance Era (late 1980's-2000)
At the end of the 80's RDBMS seemed the only viable faith for those in the DBMS field. The relational hype tide had washed away many ideas, leaving the landscape almost blank, and until the next century RDBMS held the leading role on the market. Yet one contender was born during this period and tried to conquer a place in the sun.
In 1985 Bjarne Stroustrup published his famous book "The C++ Programming Language", which made a boom in the programming world. Object-oriented programming was not new at that time, but mixing C, one of the top procedural languages, with the object-oriented paradigm looked really promising, and it finally led to great success. C++ started an almost 25-year period of OOP domination in the industry. This trend came to the DBMS world as well: researchers and engineers started attempts to structure databases using the object-oriented approach. The term Object-Oriented DBMS (OODBMS) first appeared around 1985 and soon became trendy. A number of products were released, including GemStone/S (1986), Versant Object Database (1988), InterSystems Caché (1997) and others. OODBMS never reached the peaks that RDBMS had conquered, and I bet not many of you have heard those names, let alone used those systems. The reason for this is debatable, but most people came to the conclusion that OODBMS had many limitations compared to RDBMS: the lack of a standardized query language, limited interoperability with non-object-oriented languages and tools, insufficient flexibility, etc.
Modern Era (2000–2015)
The year 2000 was a notable one in the computer industry. It was heavily marked by the Y2K problem, when every human on Earth got the sense that the most critical parts of our lives depend on computer systems, no matter what we are doing. Today this is obvious (just try to imagine turning off your phone for a couple of days), but it was not obvious to the masses at that time.
Another thing was the Dot-com bubble burst, which showed the immaturity of the early Internet industry overall. To me it also showed that system scaling was very expensive back then, since investments were partly justified by technical needs, so the state of DBMS made a small contribution to the bubble as well.
The last one I want to mention, which directly affected the DBMS field, was the presentation by Eric Brewer, "Towards Robust Distributed Systems", where he explicated what would later be called the CAP theorem. It states that any distributed storage system can provide at most two of the following three properties at the same time: consistency, availability, and partition tolerance. It was much criticized and spawned a number of misleading interpretations, but the reason it is so important for DBMS history is that it showed you fundamentally can't have everything in one package when dealing with DBMS.
The relational DBMS vision was like a religion that tried to give you an answer to any question, but as Brewer showed, there are fundamental tradeoffs that the creators of a DBMS make, explicitly or implicitly. Relational DBMS creators made a bet on ACID, but at the same time they sacrificed what we today call horizontal scaling, and for the Modern Era, with its exponential growth of RPS, that was not acceptable. For people in the 90's, hundreds of thousands or millions of requests per second were the case for a very small number of companies, which tailored custom (and costly) solutions to deal with the situation. But since 2000, the growth of the Internet industry has raised the RPS scaling problem for exponentially more companies. Every notable startup of the Modern Era pretended to become a leader in its market segment with a potentially huge audience, which required appropriate technical solutions. Open source preferred.
Starting from the year 2000 a number of new DBMS types appeared (and some reappeared), reflecting the needs of the new Era: key-value and document stores, graph and time series DBMS, search engines and wide column stores. You will meet the word "NoSQL" for sure, which is a bit confusing in my opinion, but still widely used to set a border between the old and new epochs. In this chapter, we will focus on the most notable members of this Era instead of a simple chronology.
Memcached and Redis
There are only two hard things in Computer Science: cache invalidation and naming things
I assume naming things is the fundamental one, but cache invalidation has one prerequisite: a cache. At the beginning of the Modern Era caching started to be essential for Internet-focused companies. Databases were stored on HDDs, which had slow access times compared to RAM, and if accessed often for routine reads, databases easily became latency bottlenecks. The solution was a caching layer, which actually was implemented everywhere: applications had a caching layer, DBMS themselves had a caching layer, even HDDs had a caching layer, but separate systems for caching arbitrary data were also necessary. One of the first solutions was Memcached, released in 2003. It was a distributed in-memory key-value store (KVS) with a very simple API. It was also open source and quickly became the standard as a separate caching layer system. The success of Memcached brought many followers to life; however, none of them got a chance to beat Memcached in popularity.
None, that is, until Redis was released in 2009. It was different from Memcached and other KVS because, in addition to a simple key-value API, it provided a rich set of data structures to store data in, like lists, bitmaps, sets, sorted sets and more, which gave programmers a wide range of possibilities while at the same time remaining very easy to use. Redis became a game-changer and is one of the most popular caching platforms even now.
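Here is a minimal cache-aside sketch with the redis-py client, assuming a Redis server running locally on the default port; the key names and the slow "database" lookup are made up for the example.

```python
import redis

# Minimal cache-aside sketch; assumes a Redis server on localhost:6379.
r = redis.Redis(host="localhost", port=6379, db=0)

def load_profile_from_db(user_id):
    return f"profile-of-{user_id}"     # stand-in for a slow database query

def get_user_profile(user_id):
    key = f"user:{user_id}:profile"
    cached = r.get(key)
    if cached is not None:             # cache hit: skip the slow database
        return cached.decode()
    profile = load_profile_from_db(user_id)
    r.set(key, profile, ex=300)        # cache for 5 minutes
    return profile

print(get_user_profile(42))

# Beyond plain key-value: a sorted set, one of Redis' richer data structures.
r.zadd("leaderboard", {"alice": 420, "bob": 370})
print(r.zrevrange("leaderboard", 0, 1, withscores=True))
```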
MongoDB
MongoDB was released in 2009 and was classified as a Document-oriented DBMS (DoDBMS). Its ideas were not new: CouchDB and a few others had been released earlier and had much in common with it: storing schemaless documents, using JavaScript as a query language, MapReduce, etc. The competitive intention behind the new database was to conquer the hearts of both newbies and professionals from the RDBMS domain. MongoDB offered solutions to the weak sides of RDBMS: friendliness, a very low entry threshold with easy setup, JSON documents stored with no schema, good documentation and of course JavaScript (yum yum!): this was the menu for newbies.
RDBMS professionals suffered from other pains: scaling and failover, which demanded much effort to get working properly on prior databases. MongoDB seemed like a solution: a sharding mechanism was added directly into the DBMS engine, allowing it to scale write operations out of the box. The same held for failover: for any shard you could have a replica set, which allowed scaling of read operations and automated failover in case a master node went down. MongoDB became extremely popular despite its many drawbacks. The last important piece of news about MongoDB is that it received multi-document transaction support in version 4.0, which is awesome..! But beware, please don't miss the "IMPORTANT" paragraph carefully left by the authors in the manual 😉
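A minimal sketch with the pymongo driver gives the flavor of both selling points; it assumes a MongoDB instance on localhost, and the database, collection, and field names are made up. Note that multi-document transactions (4.0+) only work against a replica set, not a standalone server.

```python
from pymongo import MongoClient

# Minimal sketch with pymongo; assumes MongoDB on localhost:27017,
# database/collection/field names are made up for the example.
client = MongoClient("mongodb://localhost:27017")
db = client.shop

# Schemaless documents: records in the same collection may differ in shape.
db.users.insert_one({"name": "Alice", "email": "alice@example.com"})
db.users.insert_one({"name": "Bob", "tags": ["admin"], "age": 42})
print(db.users.find_one({"name": "Bob"}))

# Multi-document transaction (MongoDB 4.0+, requires a replica set).
with client.start_session() as session:
    with session.start_transaction():
        db.accounts.update_one({"_id": 1}, {"$inc": {"balance": -50}}, session=session)
        db.accounts.update_one({"_id": 2}, {"$inc": {"balance": 50}}, session=session)
```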
Cassandra
When I wrote about MongoDB, I mentioned failover and sharding, which I stated were much easier with MongoDB than with prior relational databases. Even though that's generally true, both failover and sharding in MongoDB were still not the easiest things. MongoDB needed much configuration effort and, more importantly, reconfiguration when the structure of a cluster changed. Also, a failover event implied downtime, since the master election in a replica set was not instant. So with MongoDB you didn't feel like everything was done with a snap of the fingers.
To be precise, the techniques used in DBMS like MongoDB weren't very different from those used to scale relational databases, except that they had better support for sharding and replication at the DBMS engine level. But of course, when you dream about scaling and failover, you imagine that your system does all the work automagically behind the scenes, with no additional configuration and downtime needed. To a certain extent this dream came true with Cassandra, released in 2008. It employed a smart masterless technique, which allowed data to automatically spread across a cluster when nodes were added or removed, with no reconfiguration or downtime needed. Cassandra supports automatic failover even if multiple nodes become unavailable, with no downtime. Also, scalability is almost linear when adding nodes, so that a 100-node Cassandra cluster can handle roughly 10 times more RPS than a 10-node one. It sounds like heaven, and it actually is if you have an appropriate task, but for many scenarios you can think of, Cassandra will be unsuitable, so be sure about your task when choosing it.
Cassandra's data model is referred to as a Wide Column Store (WCS), which structures data in rows with an arbitrary number of columns. It can also be thought of as a two-dimensional KVS, where the first key gives you a row and the second one gives you the value of a particular column in that row. Wide column stores are often confused with Columnar DBMS, which are very different and mostly used for analytic purposes; don't make this mistake in an interview!
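To make the "two-dimensional KVS" picture concrete, here is a minimal sketch with the DataStax cassandra-driver, assuming a single Cassandra node on localhost; the keyspace, table, and data are made up. The partition key picks the "row" and the clustering column picks the cell inside it.

```python
from cassandra.cluster import Cluster

# Minimal sketch with the DataStax cassandra-driver; assumes a node on
# localhost, and the keyspace/table/data are made up for the example.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# The partition key (user_id) selects the "row", the clustering column
# (metric) selects the cell inside it: a two-dimensional key-value lookup.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.user_metrics (
        user_id text, metric text, value double,
        PRIMARY KEY (user_id, metric)
    )
""")
session.execute("INSERT INTO demo.user_metrics (user_id, metric, value) "
                "VALUES ('u1', 'logins', 17)")
row = session.execute("SELECT value FROM demo.user_metrics "
                      "WHERE user_id = 'u1' AND metric = 'logins'").one()
print(row.value)
```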
Solr and Elasticsearch
What do you imagine when you hear the term "search engine"? Something cool like Google, right? In 2019 almost anyone who builds backends needs an honorable and shiny search engine to store… well, garbage, or you can call it logs if you wish. But this story started differently.
In the 90's the search engine theme became popular with the rise of Internet-based companies like Yahoo and Google. At that time many proprietary search engine projects were developed inside companies for internal needs. One person, who participated in four such projects, just got tired of it. It was Doug Cutting, who developed search engines at Apple, Xerox and Excite and did the same things over and over again. Probably when someone asked him to create one more search engine, he shouted "Stop it!" and created his famous Lucene search library. Lucene was open source and became widely used by enterprises that needed to add search capabilities to their products. But it was low-level and Java-based, so you had to embed it into your application directly.
A search engine (SE) is a special kind of DBMS designed to quickly find text entries among a large number of documents. Usually they use a variation of the inverted index technique as the basis for their search capabilities; the same holds for Lucene and all systems based on it.
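The inverted index idea fits in a few lines of plain Python; this toy sketch is mine, and real engines like Lucene add tokenization, ranking, compression and much more on top.

```python
from collections import defaultdict

# Toy inverted index: map each word to the set of document ids containing it.
docs = {
    1: "error connecting to payment service",
    2: "user logged in",
    3: "payment accepted for user 42",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# A query becomes an intersection of posting lists instead of a scan
# of every document.
def search(*words):
    sets = [index[w.lower()] for w in words]
    return set.intersection(*sets) if sets else set()

print(search("payment"))          # {1, 3}
print(search("payment", "user"))  # {3}
```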
In 2004 Solr was created. Initially it was an in-house project at CNET, but later, in 2007, they open-sourced it. Solr was based on Lucene but had a high-level HTTP API, so it could be used independently as a separate system, providing search capabilities to services that needed them. Solr quickly became popular in both the research and enterprise communities. It was created with general-purpose search in mind, where reads exceed writes, and had replication capabilities to scale reads if needed.
I don't know who first proposed using search engines to store logs, but the enterprise world has never been the same since. The idea was simple: you add a number of markers to the log text which define a particular situation you want to dig into when analyzing logs, for example a user identifier or an event type; then with the help of a search engine you quickly find all the records with the needed marker, sort them in creation order and voila: you have all the information needed for the analysis. The task of processing logs implied the need to scale writes for high-load systems, and Solr was not good at it. In this context Elasticsearch was created in 2010, with an initial focus on horizontally scaling both writes and reads. It was log-centric from the beginning, with the Logstash subsystem, which gathered logs from different sources and loaded them into Elasticsearch. This gave Elasticsearch a significant advantage over Solr, and today it is considered the main SE platform for backend development.
But don't delude yourself: search engines are good at text search, and not so good at general data processing. So don't just throw all the JSONs you have into an SE in the hope that you can manipulate and query the data inside them efficiently.
Neo4j
When researching Graph DBMS (GDBMS), which are well-known modern pieces of software, one question kept bothering me: how do they differ from the NnDBMS of the Ancient Era? I was not alone, and it seems the question is debatable. The reason: both modern, lively GDBMS and ancient, dying NnDBMS model data in almost the same way, using a graph structure of nodes and links. Are there any differences then? Yes, and many. To name a few: NnDBMS lack a high-level declarative query language; GDBMS have richer modeling capabilities, including data on links; you can't just start using any Navigational DBMS right away (unless, of course, you have access to a couple of million dollars to purchase specialized hardware and a license); etc. But despite the differences, the mathematical basis for data modeling remains the same: graphs.
The leading example of a GDBMS today is Neo4j, initially released in 2007. In 2011 it received the powerful declarative query language Cypher, which made it even more attractive. Neo4j's model is very simple and powerful: you have nodes and links between them, each of which can have attributes, holding the main data, and labels, which allow fast querying. Neo4j allows complex queries, which can traverse graphs in sophisticated ways, and also comes with a bunch of predefined graph algorithms. It also supports transactions and strong data consistency.
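Here is a minimal sketch using the official neo4j Python driver, assuming a local Neo4j instance at the default bolt port; the credentials, labels, and data are made up for the example.

```python
from neo4j import GraphDatabase

# Minimal sketch with the official neo4j driver; assumes a local Neo4j at
# bolt://localhost:7687, and the credentials/labels/data are made up.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes and relationships both carry properties; labels allow fast lookups.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS {since: 2015}]->(b)",
        a="Alice", b="Bob",
    )
    # A declarative Cypher query traverses the graph for us.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:KNOWS]->(friend) RETURN friend.name",
        name="Alice",
    )
    for record in result:
        print(record["friend.name"])

driver.close()
```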
It's just perfect, right? Tradeoffs… remember the tradeoffs. First of all, Neo4j is schemaless, as any proud member of the NoSQL family must be; sometimes that's good, sometimes it's not, especially ten years after the start of a huge project. But unlike other members of the NoSQL family, Neo4j has problems with write scaling. The reason: how do you cut a graph? You want to split your graph across a bunch of servers, but how do you decide which entities to store on which server? And even if you found a reasonable rule for a split, what if the next release brings a new query that traverses the graph, jumping from one server to another and back many times, killing performance? These questions remain unanswered to this day, but Neo4j is still well suited for a wide range of tasks, even the scandalous ones.
Time Series DBMS
Time is money.
This is especially true for Time Series DBMS (TSDBMS), which are widely used by traders and investment banks. Not only by them, by the way: TSDBMS are on the rise today, since more and more data is produced by sources like IoT devices and backend and frontend systems, all of which need specialized storage for telemetry, and TSDBMS suit this task best. They are optimized to store time series data, which is essentially a timestamp and a corresponding numerical value for a large number of serial observations. Although time series data has a simple structure, TSDBMS use sophisticated algorithms for compression, indexing, and aggregation of data in order to reduce storage size, optimize queries, and provide analytic insights. Also, most TSDBMS include visualization tools to support data analysis.
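To show the kind of aggregation a TSDBMS performs constantly, here is a tiny plain-Python sketch of my own that downsamples raw (timestamp, value) points into per-minute averages; real systems add compression, retention policies, and indexing on top.

```python
from collections import defaultdict
from statistics import mean

# Raw time series data: (unix_timestamp, value) pairs, e.g. CPU load samples.
points = [
    (1700000000, 0.42), (1700000015, 0.55), (1700000030, 0.61),
    (1700000065, 0.30), (1700000080, 0.35),
]

# Downsample into 1-minute buckets and average each bucket.
buckets = defaultdict(list)
for ts, value in points:
    buckets[ts - ts % 60].append(value)

for bucket_start, values in sorted(buckets.items()):
    print(bucket_start, round(mean(values), 3))
```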
Until the 90's TSDBMS were internal projects of big enterprises, not available for public usage. In 1999 Tobi Oetiker released a set of tools to work with time series, called RRDtool. It allowed gathering data from different sources and stored it in a circular buffer to keep memory consumption constant. RRDtool was mostly used for the telemetry of computer systems.
In 2003 Kdb+ was released and became very popular in the area of high-frequency trading, which needed near real-time processing of huge amounts of time series data. At present Kdb+ is marketed as a general-purpose TSDBMS and you can even try it in the cloud, though not very cheaply.
There were not many TSDBMS released during the late 2000's; the only notable example was Graphite, released in 2008, which still used a circular buffer as its default storage data structure and was not scalable.
In the 2010's the boom of TSDBMS started, following the demand, with 3–4 releases per year on average. In 2011 OpenTSDB was released. It was different from previous TSDBMS in that it used HBase as a storage backend, so it scaled well and also didn't use a circular buffer. However, it was not easy to install and maintain due to the Hadoop nature of the underlying HBase. Another example of a scalable TSDBMS was Druid, released in 2012. It had a different focus: while OpenTSDB was better suited for write scaling, Druid's goal was to provide fast analytic queries over huge datasets of time series data.
In 2013 InfluxDB was released, which is the most popular TSDBMS at present. It provided horizontal scaling out of the box as well as query capabilities with an SQL-like language. An important advantage of InfluxDB was its ease of use and good documentation, which made it very popular. In 2015 Prometheus was released, which played well with clouds and container environments, which brought it much attention. However, it used a pull-based data gathering model, which has problems with scaling and is not flexible, especially if opening an additional port is not favorable due to security or other considerations.
Cloud storage
15 years ago, in 2004, the first post from the now well-known AWS speaker Jeff Barr was published. It looked like an ordinary "Hello world from yet another content producer" and I bet if I had read it at that time, I would have forgotten it in a minute. A perfect post to enter the new epoch! Two years later AWS started a cloud revolution with the release of AWS S3, which was a simple object storage service. We do not know the inner implementation details, since it is a proprietary system; all we know is the API… which has probably processed some zettabytes of data over the last 13 years. The greatest benefit of S3 was simplicity: you just add some data to a bucket by a path and it starts living online with all those cloud benefits like fault tolerance and scaling out of the box. And all this comes for free from the technical perspective, since we don't see any of the internals that make it possible; all we see is the result: our object is simply available to anyone, with no downtime, no matter how great the load is.
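Here is a minimal sketch of that simplicity using the boto3 SDK; it assumes AWS credentials are already configured and the bucket name and key are made up.

```python
import boto3

# Minimal sketch with boto3; assumes AWS credentials are configured
# (environment or ~/.aws) and uses a made-up bucket name and key.
s3 = boto3.client("s3")

# "Just add some data to a bucket by a path": no servers, no capacity planning.
s3.put_object(Bucket="my-example-bucket",
              Key="reports/2019/summary.json",
              Body=b'{"status": "ok"}')

obj = s3.get_object(Bucket="my-example-bucket", Key="reports/2019/summary.json")
print(obj["Body"].read())
```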
After S3, Amazon released more services focused on storage, the most notable one being DynamoDB, which came online in 2012. It tried to compete with both Cassandra's automagic and MongoDB's document approach. At the same time, DynamoDB reduced maintenance effort to a tiny amount, with just two main tuning parameters: the counts of read and write operations per second per table. Waaaay easier than installing and tuning MongoDB or Cassandra clusters from scratch. Also, AWS introduced the managed relational DBMS service RDS, which allows you to easily start a node or a read replica cluster with a well-known RDBMS installed and tuned: PostgreSQL, Oracle, MySQL, etc. At present Amazon tries to cover all possible storage needs: AWS has a Graph DBMS, a Time Series DBMS, they even have a managed blockchain!
AWS was followed by Google Cloud Platform (GCP) in 2008 and Microsoft Azure in 2010. It seems the only difference between these platforms today is the service names. AWS, GCP, and Azure form the triadic basis of Modern Era cloud DBMS providers, with others taking smaller shares of the market.
Cloud data storage tradeoffs? There are some. First of all, using the custom services of cloud providers usually implies vendor lock-in. If AWS decides to triple its prices and you have DynamoDB tables in production with some terabytes of important data, you are probably caught in a trap: both migration and payment will be a pain. While the probability of such an event is low, there are many situations when you would prefer to change provider but can't, since your whole system is based on a proprietary API. Of course there are techniques to reduce the impact of vendor lock-in, like abstracting proprietary APIs behind your own wrappers, which can be reimplemented for another provider, but still the impact could be huge, especially if the system is big enough or if you use unique features that are not available in other services. Another problem is outages. Every cloud provider suffers from outages. We can't control them in any way and sometimes they hurt businesses badly. The last tradeoff is cost, which is usually quite high compared to lesser-known cloud providers.
We are near the end of the Modern Era. So what about relational DBMS? We haven't mentioned them at all in this chapter; did the NoSQL hype kill them, as they killed their contenders earlier? Definitely not. As we said before, RDBMS favor properties like strict consistency, transactions, and a schema-centric approach over the more relaxed NoSQL systems, and there is certainly demand for that on the market. As we can see, relational DBMS are still at the top of the mountain and nowhere near leaving it. Instead of killing RDBMS, the NoSQL movement brought to life a new approach called "polyglot persistence". It states that different storage types should be used for different types of data within an application, to achieve great flexibility and fulfill a wide range of requirements. You could have a small fraction of important data, let's say user profiles, for which ACID is crucial, and you could have a huge amount of schemaless event data with reduced consistency requirements and no need for fast custom queries. In this hypothetical case, you could go polyglot and select both relational and some NoSQL storage to deal with your requirements.
Future Era…sort of (2015 — now)
The future always starts with a promise, no matter whether it will be fulfilled or not. Blockchain appeared in the late 2000's as the engine of the first cryptocurrency, Bitcoin, but later became an independent technology. Blockchain promised a data storage revolution, because it implied automatic trust guarantees for the contents of the data through special peer-to-peer algorithms, almost removing the human factor. In prior approaches, trust was based on our belief that the authority controlling the data would not play against the rules, which was a very fragile assumption. Blockchain tried to fully replace it with smart algorithms that provide trust to all parties involved in the data exchange automatically.
The first blockchain systems were mostly used for so-called cryptocurrencies, essentially holding a key-value mapping between digital wallet identifiers and how much money they hold. Those systems lacked flexibility, since they were bound to digital currency operations only. In 2015 the Ethereum platform was released for public usage, and it allowed arbitrary custom data to be stored in the system, not only tokens of some predefined kind like prior blockchains. The database state was changed by transactions, created by a special kind of program called a smart contract. After Ethereum many similar systems were released, but as we can see now, blockchain usage is still very limited. Current blockchain systems have all the attributes of technological immaturity, including unfriendliness to programmers and users, excessive complexity, and scaling issues, which together do not allow blockchain to become a new leader for general data storage and processing for now. Despite this, the aftershocks of the blockchain hype have had broad consequences for how we understand data, and more importantly, they showed that new paradigms of data storage may still appear.
I don't want to dig deeper into the blockchain theme, which is by all means a deep one; I just needed to make a timeline notch for the Future Era, so please don't blame me 😊
So here are my seven opinionated trends bringing us to the future of DBMS:
- Convergence. In 2014 PostgreSQL added full-fledged JSON support including indexing, so that you can store documents in your relational tables. In 2016 AWS presented a new service called Athena, which allowed SQL queries over their S3 object store, which is not relational in any way. In 2018 MongoDB and AWS DynamoDB added multi-document transactional capabilities, previously seen only in the relational DBMS world. These pieces of news have one thing in common: they show how the SQL and NoSQL worlds are converging over time (see the small sketch after this list). This trend will probably continue, and soon we will see more multi-model solutions under single brands.
- Democratization. Fifty years ago the entry threshold for starting to use a DBMS was very high. By now it has almost totally vanished. All you need to start today is to spend 10 minutes on YouTube and just do what you find there. God, you can even try a DBMS using just your phone! The next step is surely the ability to manage databases right from your IoT oven while cooking.
- Increased traffic and data volumes. In the 21st century the amount of data grows very fast, blah blah blah… I bet you have heard it many times. But how fast? Very fast: about 5 times over the next 5 years. "OK, that's manageable," you say. Yes, but the growth rate is increasing as well. x5 per 3 years? x5 per year? Even x5 per hour? Possible… in the future.
- Bringing data closer to regular data consumers. In previous epochs, the art of creating data models, and even the data itself, was purely the privilege of programmers, or data analysts at best. In order to understand what was going on in your own data storage, you would hire a gang of skilled ninjas to craft a usable system on top of the raw values in databases. Today there is a demand to disband those gangs to optimize costs. But who will analyze data then? Simple: the one who needs to get the answers. At present many DBMS are equipped with convenient data navigation tools out of the box, for example the Neo4j Browser, which we have already mentioned. Not only DBMS, by the way: many tools have appeared in recent years which address data visualization and analysis challenges. We have seen the rise to glory of Tableau as software that enables visualization of almost any data, no matter what storage it comes from. More tools like KeyLines and Linkurio are focused on the graph visualization domain. There are also open-source tools like Ontodia, which empowers the user to browse practically any graph-based data from different sources. I'm pretty sure the boom of data visualization will continue in the future.
- Increased maintenance automation. If you want to become a DBA, I suggest you think twice. Not because you will be bad at it (I'm pretty sure you are smart enough for any role), but because this role will vanish. DBMS of the past needed much manual effort at all lifecycle stages, so the DBA role was important. Things have changed, and many tasks are now managed automatically behind the scenes without you even knowing it. In the future, I expect "many tasks" to become "everything".
- New storage paradigms. The emergence of blockchain technology showed that there is definitely a place for innovation in the data storage domain, and I'm pretty sure we will see the rise of new approaches in the upcoming decade.
- Prevalence of event-centric approaches. There are basically two ways to write a diary (which are usually mixed in some proportion): you either write down the state you are in on some date, or you describe the events that happened to you that day. The state-centric approach is actually a lossy compression of events into some important values; it saves you memory and time, but some details are lost, and those can be important when viewed from a different point of view. DBMS history started mostly with the state-centric approach because of technical limitations that did not allow storing and processing the huge amounts of event data, but today the event-centric trend is coming to the fore. Blockchains, source control systems, Time Series DBMS, and event sourcing as a general approach to system architecture are becoming trendy and will probably get even more attention in the upcoming decade… but as usual, don't forget about tradeoffs.
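And, as promised in the Convergence bullet, a small sketch of that trend: JSON documents living inside an ordinary relational table, queried with SQL via PostgreSQL's JSONB type. It uses psycopg2 and assumes a hypothetical local PostgreSQL database with made-up credentials.

```python
import psycopg2

# Convergence illustration: schemaless JSON documents inside a relational
# table (PostgreSQL JSONB). Assumes a local PostgreSQL with made-up credentials.
conn = psycopg2.connect(host="localhost", dbname="demo", user="demo", password="demo")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, payload jsonb)")
cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
            ('{"type": "signup", "user": {"name": "Alice"}}',))

# SQL and documents in one query: filter on a field inside the JSON payload.
cur.execute("SELECT payload->'user'->>'name' FROM events WHERE payload->>'type' = %s",
            ("signup",))
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()
```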
So Long, and Thanks for All the Data!
I'm not a professional writer, so I can't assure myself that this text is awesome. It's probably bad, and if you are reading this line, you are definitely a hero! As a reward for your patience I will give you the advice that saved my technical career many times, when I was not too lazy to follow it, especially when I needed to make decisions regarding DBMS:
Consider different choices, choose wisely.
Also, I want to share the data which I used for the plots in this article. I took it from https://db-engines.com/ with a small processing effort.
Thx for reading!
If you enjoyed this article don’t forget the applause ❤
As I mentioned at the beginning, our team is creating the TryDB service. It democratizes the way we interact with DBMS through a messenger-based UI right on your phone. At the current stage we badly need your feedback about the service: what you like and don't like about it. It will help us make it a better thing to use. If you enjoyed this article, the best thanks would be to try our service and write a few words in the "Leave a feedback" menu. Thx!