Big Data Systems for Data Science - CS4225
Course ID: CS4225
AY Taken: AY22/23 Sem 2
Taught by: He Bingsheng, Ai Xin



Contents taught are as follows:

  1. Principles of Big Data Systems
  2. MapReduce/Hadoop
  3. Performance analysis of Big Data Systems
  4. NoSQL
  5. Apache Spark
  6. (Very Large) Graph Processing and Framework
  7. Stream Processing
  8. Data Lake

Preface and tl;dr

Welcome to my overly objective course reviews for courses that I believe deserve good feedback.

CS4225 is without doubt one of the most enjoyable courses I've taken throughout my academic career. It serves as an introduction to the history, technical analysis, ongoing research, and how-tos of big data systems. As with CS3223, I'd recommend taking this course regardless of whether you want to specialize in databases: the exact contents taught may not be directly useful to you, but you leave the course with a bag of tricks, an appreciation for the field, and useful basic-to-intermediate knowledge of data flow patterns and management.

Stats

With 0 being nothing, 5 being neutral, and 10 being the maximum, the following are the stats for CS4225:

| Metric | Score |
| --- | --- |
| Workload for understanding the contents (incl. lectures) | 3 |
| Workload for the projects (assuming a week for each project) | 2 |
| Workload for midterm/finals prep (assuming a week of prep) | 3 |
| Technical depth of course (contents taught) | 5 |
| Technical depth of exams | 6 |
| Technical depth of assignment 1 | 7 |
| Technical depth of assignment 2 | 4 |
| Technical depth of optional contents | 6 |
| Lectures' usefulness wrt. exams | 7 |

Who I'd recommend this course for:

  • People with little to no knowledge of big data systems and non-relational data stores
  • People who have had experience with projects/work involving RDBMSs or anything related to managing or collecting data for analysis. Not having this experience might deprive you of an appreciation for what's being taught throughout the course.
  • People willing to truly learn something from the course: I'd imagine just memorizing things for the exams would be boring. What made this course interesting is the profs' extensive experience in the fields relating to the subject: listening to the little tangents throughout the lectures and reading through the optional papers are fun and enlightening.

If you're just planning to skip all lectures and tutorials, skip the optional materials, and rote-memorize things for a few hours before the exams, then I don't think this course will be of much use to you, despite the likelihood of scoring a rather decent grade.

Flaw(s)

The tutorials lack any care whatsoever. The tutor reads the tutorial answers off the slides verbatim in a tone so monotonous that even an AI voiceover couldn't compete. Tutorials are over in 5 to 15 minutes, as the questions aren't that profound anyway. My lectures were in the afternoon, but my tutorial was 08:30pm to 09:30pm; I spent 20 minutes getting to the tutorials and another 30 minutes going back, so those 5-15 minute tutorials were a total disappointment. I wish the prof had let me replace the tutor in reading the slides on the spot so I could be paid while giving the audience a better reading performance; two birds with one stone :).

Usefulness

The first part of the semester covered the introduction to and some technical details of Hadoop. I don't think anyone wants to use Hadoop now, but learning it was not for nothing. Hadoop was a proof of concept that horizontally scaling computing resources, duplicating them for distributed data processing, could work as a viable solution for big data systems. Learning about it to a certain level of technical depth from a person with highly extensive experience in both academic research and industry (Prof He Bingsheng) makes you appreciate the details of the design decisions and prepares you for the second, more high-level part of the course.
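For readers unfamiliar with the MapReduce pattern at the heart of Hadoop, here is a minimal word-count sketch in plain Python. This is not actual Hadoop code; the function names are my own, and the "shuffle" here is just an in-memory grouping standing in for what the framework does across nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in an input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between
    the map and reduce phases (here, simply an in-memory dict)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one word."""
    return (key, sum(values))

# Each string stands in for one input split processed on a different node.
splits = ["big data systems", "data flow in big systems"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'systems': 2, 'flow': 1, 'in': 1}
```

In real Hadoop, the mappers and reducers run on separate machines over HDFS blocks and the shuffle moves data over the network; the point of the sketch is only the shape of the computation.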

The second part was about NoSQL, Spark, graph processing, and data lakes, some of which are very recently developed technologies. I need to stress that this course deals with a topic that is evolving, growing, and advancing at a very fast pace. Any technology used as an example in the course might expand or merge with something else at any time, and the course tries to match this by covering just the right amount of detail, which, in my opinion, is a very good decision. For example, we learn that the early design decision for NoSQL data stores was to compromise the ACID transactional properties of traditional RDBMSs and opt for the BASE properties instead; had the course taught only a specific technology in detail, such as MongoDB, much of it would already be invalidated, since developments as of 2018 made MongoDB support ACID transactions. The decision to instead teach the development of NoSQL data stores and their continuous effort to support transactional properties is a wise, more relevant, and more future-proof move, and I am grateful for that.
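To give a feel for what "BASE" trades away, here is a toy model of an eventually consistent store in plain Python. All names are my own invention; real stores use mechanisms like vector clocks and quorums rather than the naive last-writer-wins merge below.

```python
import time

class Replica:
    """One node of a toy eventually consistent key-value store."""
    def __init__(self):
        self.data = {}  # key -> (value, write timestamp)

    def write(self, key, value):
        self.data[key] = (value, time.time())

    def read(self, key):
        value, _ = self.data.get(key, (None, 0.0))
        return value

def anti_entropy(a, b):
    """Background sync: last-writer-wins merge of two replicas."""
    for key in set(a.data) | set(b.data):
        va = a.data.get(key, (None, 0.0))
        vb = b.data.get(key, (None, 0.0))
        newest = va if va[1] >= vb[1] else vb
        a.data[key] = newest
        b.data[key] = newest

r1, r2 = Replica(), Replica()
r1.write("user:42", "alice")  # the write lands on replica 1 only
print(r2.read("user:42"))     # None: replica 2 is stale ("Basically Available")
anti_entropy(r1, r2)          # background sync runs...
print(r2.read("user:42"))     # alice: the replicas have converged ("Eventually consistent")
```

Under ACID, the first read from replica 2 would never observe a state the transaction hadn't committed everywhere; under BASE, stale reads are accepted in exchange for availability.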

One of the tangible benefits I got from this course was being able to connect and talk with just enough depth about big data collection/processing systems and NoSQL design decisions with some of my internship interviewers (which, along with some other things, helped me land some roles and/or at least secure the next interview stage).

This course has also made me more knowledgeable when teaching topics in CS1101S and especially the second half of CS2030S, where I can let the students know why and how what they're learning is useful for them.

Assessments

  • Two assignments (25% each)
  • Midterm (25%)
  • Final Test on week 13 (25%)

The first assignment was on writing a MapReduce program for Hadoop, and it was quite fun. It is something you can do in less than three hours once you learn your way around it. I blundered this assignment by making final modifications and not realizing that my submitted code did not compile. In the end I managed to appeal to a certain extent but still got heavily penalized (which, not including the final test, I think pulled my cumulative score straight down to the median, as the curve is quite steep). This was a bummer since I really like this module, but, eh, I can live with it; just make sure your code compiles if you're taking this module.

The second assignment, on SparkSQL and SparkMLlib, was super high-level, which honestly is fine, but I found it a little boring, as not much knowledge from the lectures or tutorials is applied in it, and there are none of the smart decisions or performance tunings that we were able to make in the first assignment. It was alright regardless; you can complete this assignment (not including setting things up) in around 1.5 hours if you have some ML experience from CS2109S, and at most 5 hours if you're really taking your time.

The midterm was pretty alright, too. It consists of nicely thought-out questions that you can definitely answer if you understand the material and perhaps practice a little on one or two past year papers. There were one or two trivia-ish questions, though, and I regret relying on my cheat sheet instead of printing all the lecture materials, as there was one 1.5-mark (percent) question that I could have answered if I had all the lecture materials with me. You can finish the exam in 15 minutes and recheck it three more times in another 30 minutes before leaving the exam early.

The finals were a little different, with a bunch of short/long-answer questions. I needed the whole exam duration, with barely any rechecking, to complete the final exam, but, as with the midterm, it was manageable if you understand the topics well. I must also stress that understanding the topics well is not hard for this course compared to other, more complicated CS courses.

Afterword

Take this course. If your career plan has anything to do with data (big data, low-latency data management, data distribution, or basically anything tech-related) and you don't already know what the contents of the course are about just from seeing the syllabus, take this course. At the very least, you'll spend some effort, have some fun, and probably get a good grade.
