Database Systems Implementation - CS3223
Course IDCS3223
AY TakenAY22/23 Sem 2
Taught byChan Chee Yong

Database Systems Implementation: CS3223 Review


All contents listed here are taught in-depth with all the technicalities and calculations:

  1. Data and Storage
  2. Data Indexing (B-tree and hash-based)
  3. Query Evaluation (Sorting, Selection, Projection)
  4. Join Query Evaluation (Sort Merge, Nested Loop Join Algs, Grace Hash)
  5. Query Optimization
  6. Transactions
  7. Concurrency Control
  8. Crash Recovery (ARIES)

Preface and tl;dr

Welcome to my overly objective course review for courses that I believe deserve a good feedback.

CS3223, under Prof. Chan Chee Yong, is up there in the list of the most well-designed courses. The course explores the implementation of database systems in a nice goldilocks style: It is not too difficult in terms of calculations or algorithms (no divine inspiration or prior knowledge needed) but still challenges you enough to reward your practices. The topics are well-structured and high-level enough for you to easily give layers of abstractions upon whilst not sacrificing the necessary technical depth.

The course is like CS2106 with (far) less contents, more practicalities (in terms of relatability to real-life), and more fun. Even if you don't like database, I'd strongly argue that the direct and indirect practical knowledge about cost of data/storage and systems design taught in the course are mandatory for everyone to know.

Contents and Usefulness

Under data and storage, we learned to appreciate and calculate cost of moving data around and why we care. The whole module builds upon this premise of minimizing page access even if it means we're using an n^3 algorithms to do simple tasks that can technically be done with lower time complexity.

Indexing is also really useful to be discussed in great depth like we did in this course; you can appreciate the usefulness after doing assignment 2. You learn the nitty gritty details of building a proper index on your relations for different uses, calculate the cost, and find out best plans to execute some queries.

Query evaluation and optimization are rather tedious and it is easy to just put everything under abstraction boxes but be wary that you'll need to be able to do the tedious parts naturally if you wanna score well. I'd say that these are rather less-practical but we do get to truly understand what the previous contents are useful for.

The last few topics on transactions, concurrency control, and ARIES are quite different in spirit compared to the previous ones. The whole point is basically for us to understand the design decisions of transaction controls and their implications towards the correctness of our transaction. This is very important especially for fine-tuning your concurrency control, lock granularities, and queries to achieve speedy ACID.

Stats

With 0 being nothing, 5 being neutral, and 10 being the maximum, the following are the stats for CS3223:

MetricScore
Workload for understanding the contents (incl. lecture)7
Workload for project/assignment 1 (given 2 weeks time)9
Workload for projects/assignments 2 and 3 (assuming you have a week for each)4
Workload for midterm/finals prep (assuming a week of prep)7
Technical depth of course (contents taught)8
Technical depth of exams7
Technical depth of assignment 1 - modifying the buffer manager's replacement policy of an RDBMS (PGSQL)10
Technical depth of assignment 2 - trying out different query plans4
Technical depth of assignment 3 - practical quiz on concurrency control8
Lectures' usefulness wrt. exams9
Tutorials' usefulness wrt. exams10
----

Flaw(s)(-ish?)

I don't think there's any major flaw in this course. If anything, I wish (in descending order of importance) some stuff could be improved:

  1. Assignment 1 timing. Assignment 1 is (as I'll explain later) extremely time-consuming for my team and lots of other teams. Prof had already taught the contents from lecture 1 and yet the assignment was released a few weeks after, giving us only two weeks to complete it with a deadline on recess week which gave a major hit for me and all my midterms on Week 7.
  2. Prof's laser pointer is really hard to see lol please tell him to have stronger laser.
  3. I wish tutorials are 2 hours or that there's an additional hour on recitation of some sorts to go through the topics in a little further depth to get more fun :)

Learning Pattern and Time Allocation

Friday is 2-hours of lecture, Wednesday and Thursday is for tutorials. For people currently taking the course, you might wanna read this.

Lectures

Prof Chan is a really good lecturer (you gotta trust me on this) and after the lecture, my friends and I usually gather around prof for another 10 to 30 minutes of discussion which was really fun. I recommend that you allocate an hour time after the lecture as this is probably your first time learning the content (it's rather rare to pre-study these stuff) and you will need clarifications from the Prof after the lectures.

Tutorials

The 1 hour tutorials are mandatory. My tutor (Prof Chan himself) preassigns each question to multiple students and then we discuss them together in class as Prof asks for explanations from the students assigned - this might differ between different tutorial groups.

Tutorial questions are do-able (not super hard/tedious like 2106/2100 tutorial questions) but you have to review the contents taught first. I highly recommend discussing/reviewing together with a friend; it usually takes me and my friend 1 to 3 hours of discussion to fully grasp what's going on and do the tutorial questions assigned to us. Not discussing with anyone will give you pain.

The tutorial questions are the essential pieces that made this course very well-designed. While you might think that you understand stuff taught in the lectures, doing the tutorial questions is an other-worldly experience that will wake you up and made you realize that you didn't really understand the technical details taught in the lecture even though you had felt smarter after leaving the lecture theatre.

This nice timing of the tutorial made sure that you are prepared for the next lecture's content. Beware that the content-building cycle is around 3 to 4 weeks; if you don't understand the first week then you'll fall behind for the next 3 weeks. Scary (not really).

The way Prof Chan taught this module is such that if you know how to do the tutorial questions, then you can do the exams. Apparently past year papers from different Profs aren't that useful anyway (real experience).

Time Management

Other than attending lectures and tutorials, I highly suggest you allocate 2 to 3 additional hours a week to review the contents (and do tut) to make yourself comfortable with the contents; don't be stupid and think you can cram everything in the end. In total, you need around 6 hours a week to study for this module and perhaps 10 hours for exam prep which I think is more than fair. For some weeks, you'll need to allocate more time for the assignments.

Assessments

Assignment 1 (2 people)

Ah, yes, the notorious reading thousands of lines of C code and try to modify some hundreds of lines of it (to be fair, PGSQL source code is one of the most well-documented and well-designed code I've ever seen in my whole life).

Prof explicitly stated that he wants us to learn tracing the documentation/code and warned us that it might take a few hours just to understand the code, and so got educated we did. Don't try to understand all this alone. A joint effort to grasp what the codebase is about (before able to start any development) should take around 7 hours.

ChatGPT can be a good tool as PGSQL source code was probably one of its training data and thus able to explain complex relation between multiple interlinked functions down to the small details.

Technically there is only one file you need to change. Any change you made in that file needs to be compiled and re-linked with the whole PGSQL codebase which takes an awful 3 to 10 minutes depending on where you work on. You'll learn about simple lock mechanisms to synchronize the buffer manager on your own if you hadn't know what those are from 2106 yet.

Overall, my friend and I put around a combined 50 to 60 hours of work into this assignment (<30 hours each) which I believe is higher than what most people spent (I heard some stronger groups can do it in less than 15 hours combined). Do take this into your consideration as midterm prep will be heavily affected if you do all the work on recess week.

On the bright side, I am now a master at jumping around in vim, know about how to make a really well-designed system and documentation, and just have a better grasp of everything overall. Again, I agree that this assignment educated me or some sorts but I wish we were given more time and that the deadline was before recess week.

Assignments 2 and 3

Both of these assignments are (shockingly) far easier than the previous one, with the 3rd one being a little more time-consuming than the 2nd. Second assignment was a quiz where we try out query plans and do some explanations which honestly was really easy. Expect 2 to 4 hours in total.

Third assignment was a 10 question true or false quiz but it do be quite hard in terms of brain use as we had to actually understand the lecture contents in depth. However, I'd say discussing this one was quite fun since we got to apply what we learn into something that actually matters: What kind of query and schema would allow us to sacrifice stronger concurrency control in our database in exchange for transaction speed? Can we proof that the transactions remain valid without anomaly if we use some certain weaker locking protocol?

You can expect around 10 hours in total for this assignment as you'll have to confirm everything over and over and over again if you aim for full marks. Otherwise you can perhaps sacrifice a point or two and do everything in less than 3 hours.

Midterms and Finals

Easier than what you probably thought it would be (final exam was significantly harder than the midterm though). All you need to do, provided you already understand the contents, is to practice on the tutorial questions and practice explaining stuff to your friend. Around 12 to 18 hours of prep is sufficient to get a fairly decent mastery in doing the exams.

Afterword

Total time spent in this course for me is around 147 hours including attending classes. Amortized over 15 weeks, that is around 10 hours a week and I'd say that is fair.

CS3223 (just like other DB courses in NUS other than CS2102) is one of the most well-taught courses I've ever taken and I am happy to fully grasp almost everything in this course as the knowledge is really useful. If you are a person who have mostly dealt with theoretical computer science and want to get a little technical without sacrificing the science-stuff much, I recommend you to take this module.

Check out my latest posts!


# First Pass Systematic testing of concurrent programs has been researched for quite some time and it is known to have the problem of *state-space explosion* whereby testing all possible interleaving of concurrent programs is exponential in the execution length. Deterministic Partial Order Reduction...
Partial Order Aware Concurrency Sampling (POS) - Yuan, et al.

Partial Order Aware Concurrency Sampling (POS) - Yuan, et al.

PaperBlog
# Problem Even when requirements are clear, verifying software correctness and ensuring our code works in all scenarios without affecting existing logic can be challenging. This is more apparent on ‘hot’ codes: Excerpts of code that got updated often due to its status as a new base logic or due to ...
Improving Software Testability with Math

Improving Software Testability with Math

Blog
# Introduction Database normalization is a fundamental process in database design, ensuring data integrity and minimizing redundancy. Despite its theoretical foundations, practical implications and occurrences of different normal forms in real-world databases are underexplored. Our recent work addre...
Testing DB Normalization Theory vs Practice - FDSampleRush

Testing DB Normalization Theory vs Practice - FDSampleRush

ProjectResearchPaper
# Overview From full stack development, data management, automation, random hops on other teams’ meetings, to realizing that people having double eyelid (or lack thereof) is a thing, I have expanded my knowledge and honed my SWE skills by a lot during my 6-month internship at ShopBack. This article ...
My ShopBack Adventure

My ShopBack Adventure

Blog