Projects
I'm really interested in databases, specifically in the context of distributed systems. Much of my work is focused on building better databases and related tooling from first principles.
Problems with Databases
Software engineers can choose from a variety of database software, from conventional relational databases to widely distributed key-value stores. Even though there are many systems available, each one has its engineering trade-offs. Unfortunately, if a given technology does not fit all of the use cases for a system, one may resort to connecting various databases together and incur the enormous headaches of keeping them synchronized and consistent.
Problem #1: Not enough data structures
Relational databases, for example, generally provide two data structures: a table (which I like to think of as a list of structs) and an index (usually a B+ tree). Tables are a fantastic choice for many types of transactional data (e.g. orders and invoices) but are poorly suited for others (e.g. hierarchical relationships and social network graphs). Forcing developers into a very constrained set of data structures from the start is a bad idea, yet we've been using relational databases for decades.
Many NoSQL database solutions exist to solve a subset of the above-mentioned problems, but they present their own challenges. For example, Redis is a popular data structure store that offers a variety of data structures but abandons data validation and structured schema entirely.
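To make the mismatch with hierarchical data concrete, here is a minimal sketch that models a table as a list of structs and walks a category tree stored as parent pointers. The `CategoryRow` type, the sample rows, and the `descendants` helper are all made up for illustration; the point is only that finding a subtree takes one pass over the table per level of the hierarchy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CategoryRow:
    """One row of a hypothetical 'categories' table."""
    id: int
    parent_id: Optional[int]
    name: str


# A "table" modeled as a plain list of structs (illustrative data).
CATEGORIES = [
    CategoryRow(1, None, "Electronics"),
    CategoryRow(2, 1, "Computers"),
    CategoryRow(3, 2, "Laptops"),
    CategoryRow(4, 1, "Phones"),
]


def descendants(rows: List[CategoryRow], root_id: int) -> List[CategoryRow]:
    """Collect every descendant of root_id, scanning the table once per tree level."""
    found: List[CategoryRow] = []
    frontier = [root_id]
    while frontier:
        # Each iteration is a full scan, much like one self-join per level.
        children = [row for row in rows if row.parent_id in frontier]
        found.extend(children)
        frontier = [child.id for child in children]
    return found


names = [row.name for row in descendants(CATEGORIES, 1)]
print(names)
```

A native tree or graph structure could answer the same question by following child references directly, without rescanning the whole table.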
Problem #2: Polling
Most databases, and nearly all relational databases, rely on a request-response API that ultimately leads to polling for data. While JavaScript libraries have realized the benefit of automatically re-rendering the DOM when data is updated, back-end servers still resort to polling databases to fetch new data. Polling is simple to implement, but it is wasteful and slow. Most data is infrequently modified, so with short polling intervals most requests waste network bandwidth, CPU, and memory. With longer polling intervals, users perceive the application as slow and unresponsive. It would be much better if the database could notify clients when data changes without the need for clients to request it.
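As a rough sketch of the alternative to polling, the in-memory store below pushes changes to subscribers instead of waiting to be asked. `ObservableStore`, its methods, and the key names are hypothetical and exist only for this example; a real database would also need to deliver notifications over the network, which is not shown.

```python
from typing import Any, Callable, Dict, List, Tuple

Listener = Callable[[str, Any], None]


class ObservableStore:
    """Toy key-value store that notifies subscribers when a value changes."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: Dict[str, List[Listener]] = {}

    def subscribe(self, key: str, listener: Listener) -> None:
        """Register a callback that runs whenever `key` changes."""
        self._listeners.setdefault(key, []).append(listener)

    def get(self, key: str) -> Any:
        return self._data.get(key)

    def set(self, key: str, value: Any) -> None:
        # Skip the notification when nothing actually changed.
        if key in self._data and self._data[key] == value:
            return
        self._data[key] = value
        for listener in self._listeners.get(key, []):
            listener(key, value)


store = ObservableStore()
received: List[Tuple[str, Any]] = []
store.subscribe("order:42:status", lambda key, value: received.append((key, value)))

store.set("order:42:status", "paid")
store.set("order:42:status", "paid")     # unchanged, so no notification
store.set("order:42:status", "shipped")
print(received)
```

The subscriber does no work between changes, whereas a polling client would have issued a request on every interval regardless of whether anything had happened.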
Problem #3: Query languages
SQL is a relatively standardized language to interact with many databases, but it also has hidden costs. Read queries written in SQL merely declare the desired data output with little to no regard for the data structures and algorithms that will be employed to execute the query. Query languages need to be parsed and analyzed by a query planner to determine the most efficient way to fulfill the user's request. This means that query performance is opaque, which can lead to inadvertently poor data organization and missed opportunities to improve performance. A real API is almost always better than a custom language.
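To illustrate how the execution strategy hides behind the query text, the sketch below uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN` statement. The `users` table, the index name, and the `plan` helper are assumptions made for this example, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")

# The query text stays the same throughout; only the plan behind it changes.
QUERY = "SELECT name FROM users WHERE email = ?"


def plan(connection: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return the planner's description of how it would execute `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn, QUERY, ("a@example.com",))

conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(conn, QUERY, ("a@example.com",))

print("without index:", plan_before)
print("with index:   ", plan_after)
```

Nothing in the query itself reveals whether it runs as a full table scan or an index lookup; that only becomes visible by asking the planner.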
Problem #4: Data migrations
Overhauling the schema or layout of a production database can quickly turn into a nightmare for any database administrator or software developer. Developers often need to meticulously maintain SQL scripts to migrate the data forward (and also backward in case of a failure or regression). Migrations regularly lead to system downtime and/or degraded performance.
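For a sense of the bookkeeping involved, here is a minimal sketch of paired forward and backward migration scripts, using Python's built-in `sqlite3` module and SQLite's `PRAGMA user_version` to track the schema version. The `MIGRATIONS` list, the table names, and the `migrate` helper are invented for illustration, and the sketch ignores the hard parts (moving existing data, locking, and downtime).

```python
import sqlite3

# Each entry: (version, forward script, backward script). Illustrative only.
MIGRATIONS = [
    (1, "CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)",
        "DROP TABLE orders"),
    (2, "CREATE TABLE invoices (id INTEGER PRIMARY KEY, order_id INTEGER)",
        "DROP TABLE invoices"),
]


def current_version(conn: sqlite3.Connection) -> int:
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection, target: int) -> None:
    """Apply forward or backward scripts until the schema is at `target`."""
    version = current_version(conn)
    while version < target:
        number, up, _ = MIGRATIONS[version]
        conn.execute(up)
        version = number
        conn.execute(f"PRAGMA user_version = {version}")
    while version > target:
        _, _, down = MIGRATIONS[version - 1]
        conn.execute(down)
        version -= 1
        conn.execute(f"PRAGMA user_version = {version}")


def table_names(conn: sqlite3.Connection) -> list:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
migrate(conn, 2)
tables_after_upgrade = table_names(conn)
migrate(conn, 1)  # roll back the most recent migration
tables_after_rollback = table_names(conn)
print(tables_after_upgrade, tables_after_rollback)
```

Even in this toy form, every schema change needs a hand-written inverse, and the two scripts must be kept in sync by the developer.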
Problem #5: Data synchronization is hard
Database systems often do very little to help software developers synchronize data between servers and clients. For example, horizontally scaling a database across multiple servers is notoriously hard, especially for relational databases, and when you need to scale a database to more and more users, this day of reckoning will eventually come.
For these reasons and many more, I really want to build a database that addresses these shortcomings.
Project List
For now, please feel free to check out my GitHub page.