Course ID | CS4225 |
AY Taken | AY22/23 Sem 2 |
Taught by | He Bingsheng, Ai Xin |
Contents taught are as follows:
Welcome to my overly objective course reviews for courses that I believe deserve good feedback.
CS4225 is without doubt one of the most enjoyable courses I've taken in my academic career. It serves as an introduction to the history, technical analysis, ongoing research, and how-tos of big data systems. As with CS3223, I'd recommend taking this course regardless of whether you want to specialize in databases: the exact contents taught may not be directly useful to you, but you leave the course with a bag of tricks, some appreciation, and useful basic-to-intermediate-ish knowledge of data flow patterns and management.
With 0 being nothing, 5 being neutral, and 10 being the maximum, the following are the stats for CS4225:
Metric | Score |
---|---|
Workload for understanding the contents (incl. lecture) | 3 |
Workload for the projects (assuming you have a week for each project) | 2 |
Workload for midterm/finals prep (assuming a week of prep) | 3 |
Technical depth of course (contents taught) | 5 |
Technical depth of exams | 6 |
Technical depth of assignment 1 | 7 |
Technical depth of assignment 2 | 4 |
Technical depth of optional contents | 6 |
Lectures' usefulness wrt. exams | 7 |
Who do I recommend this course for:
If you're just planning to skip all lectures and tutorials, ignore the optional materials, and rote-memorize things for a few hours before the exams, then I don't think this course will be of much use to you, even though you would probably still score a rather decent grade.
The tutorials lack any care whatsoever. The tutor reads the tutorial answers off the slides verbatim, in a tone so monotonous that even an AI voiceover cannot compete. Tutorials are over in 5 to 15 minutes, as the questions aren't that profound anyway. My lecture was in the afternoon but my tutorial ran from 8:30pm to 9:30pm; I spent 20 minutes getting to each tutorial and another 30 minutes getting back, so those 5-to-15-minute tutorials were a total disappointment. I wish the prof had let me replace the tutors and read the slides on the spot, so I could be paid whilst giving the audience a better reading performance; two birds with one stone :).
The first part of the semester was basically an intro to Hadoop with some light technical detail. I don't think anyone wants to use Hadoop now, but learning it was not for nothing. Hadoop was kind of like a proof of concept that scaling horizontally, by adding more machines for distributed data processing, could work as a viable approach to big data systems. Learning about it to a certain level of technical depth from someone with extensive experience and involvement in both academic research and industry (Prof He Bingsheng) makes you appreciate the details of the design decisions and prepares you for the second part of the course, which is more high-level-ish.
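For anyone who hasn't met the MapReduce model that Hadoop popularized, here is a minimal sketch of the idea in plain Python. It is not Hadoop's API and not the course's assignment; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, the step a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence (hypothetical single-process driver)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data flows"]))
```

The point of the split is that each mapper call only needs its own line and each reducer call only needs one key's values, which is what lets a real framework spread the work over many machines.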
The second part was about NoSQL, Spark, graph processing, and data lakes, all of which are fairly recent technologies. I need to stress that this course deals with a topic that is evolving, growing, and advancing at a very fast pace. Any technology used as an example in the course might expand or merge with something else at any time, so the course tries to match this by covering just the right amount of detail, which, in my opinion, is a very good decision. For example, we learn that the early decision for NoSQL data stores was to compromise on the ACID transactional properties of traditional RDBMSs and opt for BASE properties instead. Had the course taught only one specific technology in detail, like MongoDB, much of it would have been invalidated when MongoDB added support for ACID transactions in 2018. The decision to teach the development of NoSQL data stores and their continuous effort to support transactional properties instead is a wise, more relevant, and more future-proof move, and I am grateful for that.
One of the tangible benefits I got from this course was being able to talk, with just enough depth, about big data collection/processing systems and NoSQL design decisions with some of my internship interviewers (which, along with some other things, helped me land some roles and/or at least secure the next interview stage).
This course has also made me more knowledgeable when teaching topics in CS1101S and especially the second half of CS2030S, where I can let the students know why and how what they're learning is useful for them.
The first assignment was on writing a MapReduce program for Hadoop, and it was quite fun. It is something you can do in less than three hours if you know how to learn your way around it. I blundered this assignment by making final modifications and not realizing that my submitted code did not compile. In the end I managed to appeal to some extent but still got heavily penalized (which, not including the final, I think pulled my cumulative score straight down to the median, as the curve is quite steep). This was a bummer since I really like this module but, eh, I can kinda live with it. Just make sure your code compiles if you're taking this module.
The second assignment, on Spark SQL and Spark MLlib, was super high-level, which honestly is fine, but I found it a little boring: not much knowledge from the lectures or tutorials gets applied, so there is little room for the smart decisions and performance tuning that we were able to do in the first assignment. It was alright regardless; you can complete this assignment (not including setting things up) in around 1.5 hours if you have some sort of ML experience from CS2109S, and in at most 5 hours if you're really taking your time.
The midterm was pretty alright, too. It consisted of really nicely thought-out questions that you can definitely answer if you understand the content and perhaps practice a little on one or two past-year papers. There were one or two trivia-ish questions, though, and I regret relying on just my cheat sheet instead of printing all the lecture materials, as there was one 1.5-mark (1.5%) question that I could have answered if I had had all the lecture materials with me. You can finish the exam in 15 minutes and redo it 3 more times over another 30 minutes before leaving early.
The final was a little different, as it had a bunch of short/long answer questions. I needed the whole exam duration, with barely any redoing, to complete it but, like the midterm, it was manageable if you understand the topics well. I must also stress that understanding the topics well is not hard for this course compared to other, more complicated CS courses.
Take this course. If your career plan has something to do with data (big data, low-latency data management, data distribution, or basically everything tech-related) and you can't already tell what the contents of the course are about just from seeing the syllabus, take this course. At the very least, you'll spend some effort, have some fun, and probably get a good grade.