This 5-day camp offers an overview of Machine Learning using the Python programming language. Students will work in pairs and small groups on worksheets and Jupyter notebooks, interspersed with brief lectures and instructor-led live-coding segments.
Prerequisites: Participants should already have some familiarity with Python programming fundamentals, e.g. loops, conditional execution, importing modules, and calling functions. Participants should have access to a laptop computer. Anaconda should already be installed.
Date: Monday, August 12th to Friday, August 16th, 2024
Time: 9:30 AM to 4:00 PM each day (with breaks in between)
Location: McGill Downtown Campus, room 150
Instructors: Jacob Errington, Faculty Lecturer in McGill University's School of Computer Science, and Eric Mayhew, Professor of Computer Science Technology at Dawson College.
This summer camp is offered for free to the McGill community and priority is given to students.
Day 1: Fundamentals of Machine Learning in Python
Nowadays, machine learning (ML) is perhaps the hottest topic in all of computer science, and with good reason: the variety of tasks that machine learning models can complete has exploded in the last 15 years as computing power has reached new heights. But what exactly is a “machine learning algorithm”? And what do these advances in computing cost society and the environment? This workshop will introduce you to the basic terminology and concepts associated with machine learning in a hands-on way. We will explore common ML tasks such as data acquisition and cleaning, as well as model training, testing, and validation, by focusing on a particularly simple kind of model called k-nearest neighbours.
This workshop provides the necessary background for the subsequent sessions.
By the end of day 1, you will be able to:
- Articulate applications, limitations, and ethical considerations of machine learning.
- Enumerate the machine learning pipeline: data acquisition, data cleaning, algorithm selection, training, testing, and validation.
- Explain in plain English how the k-nearest neighbours algorithm works.
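To give a flavour of the k-nearest neighbours idea mentioned above, here is a minimal from-scratch sketch using only the Python standard library. It is an illustration, not the camp's own material: the function name `knn_predict` and the toy data are made up, and the camp notebooks may well use a library instead.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with that point's label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: two measurements per point, two possible labels.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["small", "small", "small", "large", "large", "large"]

prediction_a = knn_predict(points, labels, (1.8, 1.2))
prediction_b = knn_predict(points, labels, (6.2, 6.8))
print(prediction_a, prediction_b)
```

Note that there is no separate "training" computation here: the model simply stores the labelled points and does all of its work at prediction time.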
Day 2: Intro to regression and data collection
This lesson will dive into regression and the process of data collection in machine learning. We will explore what regression is and how it differs from classification. We will discuss how two algorithms, decision trees and support vector machines, are used for regression tasks, and you will work with these types of machine learning models in a hands-on way. We will also cover the data collection step of the machine learning pipeline.
By the end of day 2, you will be able to:
- Describe plainly how decision trees and support vector machines work.
- Given a scaffolded environment and curated data set, train a decision tree and describe how this algorithm works at a high level.
- Articulate the data collection process along with common problems in data collection.
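As a rough illustration of how a decision tree can do regression (predicting a number rather than a label), the sketch below fits the smallest possible tree: a single split, sometimes called a "stump". The names `fit_stump` and `predict_stump` and the data are hypothetical, chosen for this example only. A full decision tree would apply the same splitting step again within each side.

```python
def fit_stump(xs, ys):
    """Find the threshold on x that minimises squared error when each side predicts its mean."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Total squared error if each side predicts its own mean.
        error = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    """Predict the mean of whichever side of the threshold x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Made-up data: hours studied versus a numeric score.
hours = [1, 2, 3, 8, 9, 10]
scores = [50.0, 52.0, 54.0, 80.0, 82.0, 84.0]

stump = fit_stump(hours, scores)
print(stump, predict_stump(stump, 2), predict_stump(stump, 9))
```

The output is a number (a predicted score), which is what distinguishes this regression task from classification.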
Day 3: Intro to unsupervised clustering and data cleaning
This session will focus on unsupervised machine learning and data cleaning. Unsupervised machine learning is a powerful technique in which the algorithm analyzes and clusters unlabelled datasets. This workshop will scratch the surface of this side of machine learning, introducing unsupervised learning using the k-means and DBSCAN algorithms. It will also explore the data cleaning step of the machine learning pipeline in more detail.
By the end of day 3, you will be able to:
- Differentiate between supervised and unsupervised learning.
- Given a scaffolded environment and curated data set, train a DBSCAN model and describe how this algorithm works at a high level.
- Articulate the steps in data cleaning, along with the common issues and solutions to incomplete or faulty datasets.
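For readers curious about what DBSCAN does at a high level, here is a compact from-scratch sketch in plain Python. It is a simplified illustration under stated assumptions: the parameter names `eps` and `min_samples` follow common usage, a point counts as its own neighbour, and the data are invented. No labels are supplied anywhere, which is what makes this unsupervised.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one cluster id per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # not dense enough to start a cluster (may be claimed later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not dense itself
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # dense point: keep growing the cluster
    return labels


# Made-up data: two tight groups and one far-away outlier.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
cluster_ids = dbscan(data, eps=1.5, min_samples=3)
print(cluster_ids)
```

Unlike k-means, this approach does not need the number of clusters in advance, and it can flag isolated points as noise instead of forcing them into a cluster.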
Day 4: Intro to classification and the Train / Test Split
Classification is a task in which a data point is to be associated with a label. A machine learning model is able to perform such a task by first being trained on a dataset of known points already associated with labels. For example, a biologist might measure several body parts of a large number of penguins of different species; the set of measurements is the data point, and the species is the label. This labeled dataset, constructed manually, can be used to train a machine learning model that can then classify new, unlabelled data points.
In this session, you will implement some classifiers with two machine learning models: support vector machines and decision trees. We will explore how to assess the performance of our machine learning models by using the so-called “train/test split,” and we will discuss the danger of overfitting.
By the end of day 4, you will be able to:
- Describe at a high level how a support vector machine functions as a binary classifier.
- Differentiate between model testing and model validation.
- Describe the concern of overfitting and how a train / test split can reassure us we haven’t overfit our model.
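The train/test split and the danger of overfitting can be sketched in a few lines of plain Python. This is only an illustration with synthetic data: the helper names (`train_test_split`, `nearest_label`, `accuracy`) are hypothetical, and a 1-nearest-neighbour model stands in for the support vector machines and decision trees used in the session because it memorises its training data so visibly.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and hold out `test_fraction` of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def nearest_label(train_rows, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]


def accuracy(train_rows, eval_rows):
    """Fraction of `eval_rows` whose label the model predicts correctly."""
    correct = sum(nearest_label(train_rows, x) == label for x, label in eval_rows)
    return correct / len(eval_rows)


# Synthetic data: the label is "high" when x > 50, but about 20% of labels are flipped as noise.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0, 100)
    label = "high" if x > 50 else "low"
    if rng.random() < 0.2:
        label = "low" if label == "high" else "high"
    rows.append((x, label))

train_rows, test_rows = train_test_split(rows)
train_accuracy = accuracy(train_rows, train_rows)  # scored on data the model has seen
test_accuracy = accuracy(train_rows, test_rows)    # scored on held-out data
print(train_accuracy, test_accuracy)
```

Because every training point is its own nearest neighbour, the model scores perfectly on the data it was trained on, noise included. Only the held-out test rows give an honest estimate of how it would do on new data, and a large gap between the two scores is a typical sign of overfitting.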
Day 5: Basics of Neural Networks and Algorithm Training
One of the most discussed and perhaps mysterious machine learning models is the neural network. Neural networks are a kind of machine learning model inspired by biological processes taking place in the brain. This lesson will demystify neural networks and provide you with a plain-English explanation of how they work. We will train a neural network to recognize handwritten digits; this is a classification task. We will discuss some variants on neural networks such as convolutional neural networks. We will also discuss deep learning and further explore the training step in the machine learning pipeline.
By the end of day 5, you will be able to:
- Given a scaffolded environment and curated data set, follow a tutorial that trains a neural network to perform classification.
- Describe in plain English what a neural network is and what deep learning is.
- Describe at a high level what the training process is for neural networks and distinguish it from the training processes seen previously.
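To hint at how neural network training differs from the earlier models, the sketch below trains a single artificial neuron by gradient descent to reproduce the logical AND of two inputs. It is a deliberately tiny stand-in, not the handwritten-digit network from the session: the names `train_neuron` and `predict`, the learning rate, and the number of epochs are all assumptions chosen for this example. Real networks connect many such neurons in layers.

```python
import math


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(inputs, targets, learning_rate=0.5, epochs=2000):
    """Fit one neuron's weights and bias by repeated small corrections (gradient descent)."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(inputs, targets):
            output = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = output - target  # how far off the neuron currently is on this example
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Nudge every parameter a little in the direction that reduces the error.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Output 1 if the neuron's activation is at least 0.5, else 0."""
    activation = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if activation >= 0.5 else 0


# Truth table for logical AND.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]

weights, bias = train_neuron(and_inputs, and_targets)
predictions = [predict(weights, bias, x) for x in and_inputs]
print(predictions)
```

The contrast with the earlier days is the loop over epochs: instead of storing the data (k-nearest neighbours) or choosing splits (decision trees), the model starts out knowing nothing and improves gradually over many passes through the data.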