Visualizing my first year in Berkeley’s data science master’s program

W205 (Data Engineering)

Richard Mathews II
4 min read · Jan 11, 2023

To round off my first year in Berkeley’s MIDS program, I took W205, a data engineering class that teaches modern cloud computing and data engineering technologies, covering topics like data lakes, NoSQL, and Docker.

One of the things I appreciated about the course was the hands-on nature of the assignments. We were given business-oriented datasets and tasked with designing and implementing ETL pipelines to extract, transform, and load the data into different databases. All the assignments ran on an AWS EC2 instance.
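For readers who have not built one, here is a minimal sketch of the extract, transform, load pattern in Python. It is not taken from the course assignments: the `orders` table, its columns, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for whatever database a real pipeline would target.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: trim whitespace, cast types, and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue  # skip incomplete rows
        customer = row["customer"].strip().title()
        cleaned.append((int(row["order_id"]), customer, float(amount)))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


# Hypothetical sample data, made up for illustration.
SAMPLE_CSV = """order_id,customer,amount
1, alice ,19.99
2,bob,
3,carol,5.00
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    loaded = load(transform(extract(SAMPLE_CSV)), conn)
    print(f"Loaded {loaded} rows")
```

Keeping the three stages as separate functions makes each one easy to test on its own, which is useful once the source or the target database changes.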

As I enter my second year, I want to take a step back and reflect on my first year with some visualizations :)

The Evolution of my Second Brain

Before I get to W205 Data Engineering, let me first revisit my Second Brain graphs from the first and second semesters.

After taking W200 (Python programming) and W201 (research design, business sense) in my first semester, my Obsidian graph looked like this.

My Obsidian knowledge graph with color-coded course notes (W200=blue, W201=red)

Once I took W203, a course on statistics and probability, my network of notes grew.

My Obsidian knowledge graph with color-coded course notes (W200=blue, W201=red, W203=gold)

In my third semester, I took W205. As you can see below, the W205 cluster (green) emerged closer to the blue W200 cluster (the programming class), which makes sense because W205 touched on more programming concepts than business sense (W201) or statistics (W203). It’s also interesting to see my technical knowledge group together in one “hemisphere” of my Second Brain, whereas my more nontechnical concepts have aggregated in the other hemisphere.

My Obsidian knowledge graph with color-coded course notes (W200=blue, W201=red, W203=gold, W205=green)

Visualizing Workloads

Based on my subjective feelings about workloads, here is my ranking of the courses I’ve taken thus far in order of least difficult to most difficult.

  1. W205 — Data Engineering
  2. W200 — Python Programming
  3. W201 — Research Design
  4. W203 — Statistics

I can also back these feelings up with data. Here are the plots of my time spent per day and per week for each semester.

Time-series plots for the time I spent on W200 and W201
Time-series plots for the time I spent on W203
Time-series plots for the time I spent on W205

W203 took more time than any other course by far. W200 and W201 together averaged 17 hours per week, while W205 took only 6 hours per week on average. Of course, the time spent on each class depends on the student’s prior background in that domain. I already had plenty of experience with SQL and Python, which is why those classes were easier for me.
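If you want to compute a similar per-week average from your own time log, the arithmetic can be sketched as follows. This is an illustrative sketch, not the script behind the plots above: the `(date, course, hours)` log format and the sample entries are assumptions, and the numbers are made up.

```python
from collections import defaultdict
from datetime import date


def average_weekly_hours(log):
    """Return {course: mean hours per ISO week} from (date, course, hours) entries.

    Weeks with no logged time are not counted, so the result is the
    average over active weeks only.
    """
    weekly = defaultdict(float)
    for day, course, hours in log:
        iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
        weekly[(course, iso[0], iso[1])] += hours

    per_course = defaultdict(list)
    for (course, _year, _week), hours in weekly.items():
        per_course[course].append(hours)

    return {course: sum(v) / len(v) for course, v in per_course.items()}


# Hypothetical log entries, made up for illustration.
SAMPLE_LOG = [
    (date(2022, 9, 5), "W205", 1.5),
    (date(2022, 9, 7), "W205", 2.5),
    (date(2022, 9, 12), "W205", 6.0),
]

if __name__ == "__main__":
    print(average_weekly_hours(SAMPLE_LOG))
```

Grouping by ISO year and week keeps weeks that straddle a month boundary intact, which a simple group-by-month would split.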

Visualizing W205

Knowledge

W205 covered quite a few topics related to Big Data, the most notable being data cleansing methods, SQL, NoSQL, Serverless SQL, data lakes and data warehouses, ETL/ELT pipelines, Docker, graph algorithms, and enterprise message queues. Here is my local W205 network at the end of it all.

My Obsidian local graph for W205 (blue nodes are course modules)

Words

I ran the word cloud script on my markdown files and found the most common terms were data, web server, database, container, memory, graph, and VM. No surprises here.

Word cloud generated using text from my W205 notes
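A word cloud is essentially a picture of term frequencies, so the counting step can be sketched with the standard library alone. This is an assumption-laden sketch rather than the script used for the figure above: the stopword list, the `strip_markdown` helper, and the sample note are hypothetical, and the rendering step is left out.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "into"}


def strip_markdown(text):
    """Remove common markdown syntax so only the words remain."""
    # Turn Obsidian-style [[target|alias]] links into their target text.
    text = re.sub(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]", r"\1", text)
    # Replace leftover markup characters with spaces.
    return re.sub(r"[#*_`>\[\]()]", " ", text)


def top_terms(text, n=5):
    """Return the n most frequent non-stopword terms as (word, count) pairs."""
    words = re.findall(r"[a-z]+", strip_markdown(text).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return counts.most_common(n)


# Hypothetical note text, made up for illustration.
SAMPLE_NOTE = "# Data lakes\nA [[data lake]] stores raw data. The data is loaded into a database.\n"

if __name__ == "__main__":
    print(top_terms(SAMPLE_NOTE, n=3))
```

To run this over a folder of notes, you could read each markdown file, concatenate the text, and pass the result to `top_terms`; the resulting counts are the kind of input a word cloud renderer needs.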

Conclusion

I put these blog posts together as a fun way to give upcoming, prospective, and current MIDS students a feel for the topics and workload of each course. I plan on continuing these visualizations throughout my MIDS journey and posting them here on my Medium channel. If you want to learn more about my experience with Berkeley’s MIDS program, reach out to me on LinkedIn or through my personal website.

The code used to create my word cloud and time series visuals can be found here.
