Learning Journey

  1. Roadmap
  2. Working with modern SQL
  3. Understanding the Application Data Stack
  4. RDBMS Internals
  5. Application Patterns
  6. Time Series Databases
  7. Caching
  8. Advanced OLAP Internals
  9. Related CS topics
    1. Distributed Systems
    2. Hashing and Compression
    3. Compilers

If you have decided that it is a good idea to learn about database systems, that’s great! It is an interesting topic and specialists are in high demand!

I hope that if you learn most of it, you will be able to evaluate different database systems. You will be able to appreciate them and improve existing solutions. Maybe even join a database company to work on the products themselves.

Roadmap

I will assume that many people come from the application developer space. They can start a database and know the basics of SQL. You can get pretty far without having deep knowledge about databases. However, eventually, it is very rewarding to get into the details of these systems.

The ideas here will apply to any database. However, I mainly use PostgreSQL as a good example. It is an open-source database.

Learning these topics will give you an insight to NoSQL databases as well. They have similar building blocks. However, they relax parts of the ACID properties and capabilities of an RDBMS. This is done to either optimize latency or scalability.

Working with modern SQL

Start with modern SQL, not only SQL-92. Understanding modern constructs like window functions or lateral joins is very useful. They can represent huge performance benefits. The High-Performance SQL course of Vlad Michaela is a great place to start. (Disclaimer: it is an affiliate link, but I bought the course myself and I genuinely loved it. – I use the money to make this site better.)

“Use the index Luke” is a good extension as a database performance guide for developers. However, I believe it complements the courses of Vlad instead of providing a free alternative.

Understanding SQL also involves being able to read query plans. Depesz has an amazing series of blogpost on this topic.

In practice, this topic could be after the internals discussion. But you can profit from it quicker if you focus on it beforehand.

Understanding the Application Data Stack

If you are a Java person, then I recommend High-Performance JPA persistence from Vlad to understand how Hibernate works.

ORMs have sometimes a bad reputation in the industry, I think half of it is justified. The other half comes from the necessity of understanding an ORM to use it efficiently. People rarely take dedicated time for that sadly.

There is a good description of the SqlAlchemy internals if you are even more interested in ORMs. This is useful even as a Java developer, since SqlAlchemy was heavily influenced by Hibernate. So by reading it, you can learn a lot about how Hibernate works internally as well.

At dynamically scalable systems an external database proxy layer is needed between the database and the application server instances. A good example for this is PG-Bouncer. The goal is to manage the number of connections to the database, even when new application instances are started.

RDBMS Internals

I believe latest at this point it makes sense to learn how relational databases work internally. NoSQL databases are similar as well, they just the internal setup for speed or scalability.

The best course for that is from the CMU Database group on Youtube. It is top quality material from one of the best research universities on the world for free, all assignments included.

The course from the CMU focuses on database systems engineering in a generic sense. The focus is less on “how to use a database” and more on “how to build a database”. Which made it much better than my old university course. It is a truly enjoyable lecture series.

When someone understands the generics, specifics help to add another layer of texture to the conversation. There are two great free e-books discussing the internals of PostgreSQL:

  1. This by Hironobu Suzuki has nicer visualizations for clean understanding.
  2. The PostgreSQL 14 Internals book on the other hand is more complete and has more coding examples.

Hence, I would recommend reading them in this order.

Application Patterns

Application patterns is an important topic in system design. It should, I believe, follow the internals discussion. This is because the trade-offs involved require good knowledge of the internal structure in databases.

Some common topics are: soft-deletes, queue-like workloads or counting records.

You find here some the patterns I already described.

Time Series Databases

Most developers meet time series at one point in their careers. It is good to know how the unique properties of time series data enable a specific optimizations. The CMU has a nice lecture series on this topic.

Caching

Caching for databases in itself a big topic, there are multiple layers and solutions to consider.

Only considering data caching behind the app server, there are:

  1. Application server synchronisers – Hollow from Netflix OSS
  2. Traditional in-memory caching – Caffeine for Java
  3. Simple external caches – Memcached
  4. Feature rich external caches – Redis
  5. ORM Caching – Hibernate 1st and 2nd level cache
  6. Materialized views in the Database
  7. Trigger based aggregations in a separate DB table

In the future I will release a more complete set of resources as a generic guide to caching.

Advanced OLAP Internals

If you’re interested in OLAP databases and how to optimize them, CMU offers a great course on the topic.

Distributed Systems

There is this great site about distributed systems resources.

This is not the main topic of my website but of course it can become important at big workloads.

I assume that most people work at companies where one database node can handle all traffic. If they were smart, they did not make it much more complicated than that.

The database, the application server, and the frontend already build a small heterogeneous distributed system. However, it is mostly a simple one because the state is actually stored on a single system, the central database. Every other node is just caching.

Distributed systems algorithms come in at homogeneous stateful systems, like a multi-node database cluster and what-not. And they become perplexingly complicated in multi-leader systems. The last type are environments that I assume only a few people have to worry about. Nevertheless it beneficially shapes our thinking to be aware of these systems.

Hashing and Compression

Hashing and compression algorithms are a staple in databases. Some examples include executing a hash-join, column-store compression, or Postgres trying to compress before sending data to TOAST.

This is a good overview of the most common compression and hashing algorithms.

Hashing algorithms are sometimes used together with a data structure like a hashmap (linear probe, cuckoo, chained or extensible). In other cases, they serve as a standardization step in other algorithms. This happens for example, at communication between nodes.

Compilers

Compiler-based solutions are used in databases at the query planning and execution:

  1. The process of parsing a SQL statement to actual execution has elements of compilers.
  2. Sometimes the query executor dynamically compiles code snippets for faster execution.

There is a good compilers class from Stanford on Youtube:

Leave a comment

Leave a comment

Things mainly about Databases, Hibernate & JDBC.