Getting started 1: Working with Jupyter and Python
This tutorial offers a short introduction to Jupyter Notebooks and Python. Feel free to skip this section if you are already familiar with them.
Tutorial Objectives
Set up an execution environment for this tutorial using Python and Jupyter Notebooks
Learn how to use Jupyter Notebooks
Get started with Python, NumPy, Pandas, and Matplotlib
Where to run these tutorials
You can run these tutorials in the following three ways:
| Tool | Purpose | Best For | Key Strengths |
|---|---|---|---|
| Local execution environment | Run Python & Jupyter locally | Offline dev, custom setups, large datasets, privacy-sensitive work | Full control; no internet required; persistent storage; customizable hardware |
| Google Colab | Cloud-based Jupyter notebooks | Quick experiments, free GPU/TPU, education | No installation; free GPU/TPU; easy sharing via links; Google Drive integration |
| Kaggle Notebooks | Cloud notebooks for data analysis & competitions | Kaggle competitions, reproducible analysis, dataset exploration | Direct dataset access; free CPU/GPU; collaboration; versioned notebooks |
Set up a local Environment
In this section, we provide an overview of different options for setting up a local Python environment and isolating separate execution environments.
Step 1: Understand the Tools
Before setting up your local Python environment, it’s important to understand what each tool does and when to use it.
| Tool | Purpose | Best For | Key Strengths |
|---|---|---|---|
| Conda | Full environment + package manager | Data science, ML, scientific computing | Handles both Python & non-Python dependencies |
| pyenv + pyenv-virtualenv | Manage Python versions and isolated environments | Professional development, backend systems, multi-project setups | Fine-grained control over Python versions; lightweight; reproducible environments; integrates with CI/CD |
| virtualenv / venv | Create isolated environments for one Python version | Small to medium projects | Built into Python since version 3.3; simple and fast; no external dependencies |
Step 2: Installation
After choosing your preferred toolset, you can install the dependencies.
Option 1: Miniconda (Recommended for Data Science / ML)
Since the full Anaconda distribution ships with an extensive collection of packages that we don't need in this course, we recommend installing Miniconda, a free, minimal installer that includes only conda, Python, and a small number of essential packages.
Download the latest version for your OS:
Windows Command Prompt
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o .\miniconda.exe
start /wait "" .\miniconda.exe /S
del .\miniconda.exe
macOS
mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all
Linux
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all
Option 2: pyenv + pyenv-virtualenv (Multi-Python Development for Linux and MacOS)
Install pyenv and friends:
curl https://pyenv.run | bash
👉 Note: This command redirects to the pyenv-installer script.
Install a Python version:
pyenv install 3.12.0
pyenv global 3.12.0
Create your first isolated environment:
pyenv virtualenv 3.12.0 mlcysec
pyenv activate mlcysec
Option 3: virtualenv / venv (Lightweight / Single Python Version)
The built-in venv module is available only if you are running Python 3.3 or later.
Create a virtual environment:
python3 -m venv mlcysec
Activate the environment:
macOS/Linux
source mlcysec/bin/activate
Windows
mlcysec\Scripts\activate
Jupyter and Colab Notebooks
Jupyter Notebooks allow us to combine code, visualizations, and written explanations in a single document, making it much easier to learn, present, and share analyses. The cell-based environment also lets us run and test small sections of code independently. A Jupyter Notebook consists of two main components:
cells: A container for code or text (e.g., this is written within a markdown cell)
kernels: The “computational engine” which executes code blocks of the notebook
Installing Jupyter
To run a Jupyter Notebook locally, you first need to install Jupyter in your virtual environment and activate it. This installation step is not necessary if you are using web-based platforms such as Google Colab or Kaggle, which provide ready-to-use notebook environments.
Install the classic Jupyter Notebook with:
pip install notebook
Run the notebook:
jupyter notebook
Open your browser and visit http://localhost:8888.

> Note: If you're missing dependencies, you can optionally run the following on Google Colab:
>
> ```bash
> !pip install --upgrade pip && pip install -r requirements.txt
> ```
>
> Or in your local Python environment:
>
> ```bash
> pip install -r requirements.txt
> ```
Cells
Cells can contain either code or Markdown.

Check out the keyboard shortcuts via Cmd/Ctrl + Shift + P. A few important ones:

- Shift + Enter: executes the current cell and moves to the next
- Tab: autocompletes
- Shift + Tab: brings up documentation. Try this after entering `np.ones(`
[1]:
def say_hello():
print('This is a code cell')
say_hello()
This is a code cell
[2]:
import sys
print("Python version")
print(sys.version)
print("Version info.")
print(sys.version_info)
Python version
3.12.2 (main, May 4 2025, 19:19:40) [Clang 17.0.0 (clang-1700.0.13.3)]
Version info.
sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)
or it can be a markdown cell, like this one.
If you’re unfamiliar with Markdown syntax, check out this cheat sheet.
Some things you can do with Markdown:
This is a level 2 heading
This is a level 3 heading
Syntax
This is some plain text that forms a paragraph. Add emphasis via **bold** (`**bold**` or `__bold__`), or *italic* (`*italic*` or `_italic_`).
Paragraphs must be separated by an empty line.
Horizontal lines
You can divide the flow of your text using horizontal lines, like so:
---
Quotes
If you need to quote a phrase, put a `>` right in front of the quote.
“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.” ― Tony Hoare
Lists
Sometimes we want to include lists.
Which can be indented.
Lists can also be numbered.
For ordered lists.
Insert hyperlinks
You can create weblinks to point to a page outside the document, such as using Google Colab Notebooks for running notebooks on the cloud.
The code for this one is
[Google Colab Notebooks](https://colab.research.google.com/notebooks/intro.ipynb#recent=true)
Code blocks
Inline code uses single backticks: `foo()`, and code blocks use triple backticks:

```
bar()
```

Or they can be indented by 4 spaces:

    foo()
Latex Code
\(y=x^2\)
\(e^{i\pi} + 1 = 0\)
\(e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i\)
\(\frac{n!}{k!(n-k)!} = {n \choose k}\)
Tables
First column name | Second column name
--- | ---
Row 1, Col 1 | Row 1, Col 2
Row 2, Col 1 | Row 2, Col 2
becomes:
| First column name | Second column name |
|---|---|
| Row 1, Col 1 | Row 1, Col 2 |
| Row 2, Col 1 | Row 2, Col 2 |
If you want to make your life easier, you can also use an online table generator: build your table with visual tools, and it will generate the markdown code for you.
HTML and images
In a notebook text cell, you may also enter HTML code like this:
<img src='https://www.uni-saarland.de/typo3conf/ext/uni_saarland/Resources/Public/Images/logo_uni_saarland.svg' width="200">
which will give the following
Alternatively, you may insert an image inline as follows

Measuring time
When experimenting with different Python implementations, you often want to know which runs fastest.
The timeit magic command allows you to measure the execution time of a statement or an entire cell. By comparing alternatives, you can identify the most efficient approach.
[3]:
%timeit l = [k for k in range(10**6)]
23.1 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Python
Python Versions
Python has two main versions: 2.7 and 3.x.
Python 3.x is not backward-compatible with Python 2, meaning Python 2 code may not run in Python 3.
For this course, we will use Python 3.7 or higher.
Note: Python 2.7 was once widely used, but it has been deprecated and no longer receives security updates or bug fixes. If you still use Python 2 in any environment, it is strongly recommended to migrate to Python 3.
Basic Data Types
Numbers: Python supports integers (`int`) and floating-point numbers (`float`):
[4]:
# Integers
x = 3
print(type(x)) # <class 'int'>
print(x) # 3
# Arithmetic operations
print(x + 1) # 4
print(x - 1) # 2
print(x * 2) # 6
print(x ** 2) # 9
# In-place updates
x += 1
print(x) # 4
x *= 2
print(x) # 8
# Floating-point numbers
y = 2.5
print(type(y)) # <class 'float'>
print(y, y + 1, y * 2, y ** 2) # 2.5 3.5 5.0 6.25
<class 'int'>
3
4
2
6
9
4
8
<class 'float'>
2.5 3.5 5.0 6.25
- `+`, `-`, `*`, `**` are addition, subtraction, multiplication, and exponentiation, respectively.
- `+=` and `*=` are shorthand for updating a variable in place.

Booleans
Python supports Boolean values: True and False. These are commonly used in logical operations:
[5]:
t = True
f = False
print(type(t)) # <class 'bool'>
# Logical operations
print(t and f) # Logical AND; prints False
print(t or f) # Logical OR; prints True
print(not t) # Logical NOT; prints False
print(t != f) # Logical XOR; prints True
<class 'bool'>
False
True
False
True
- `and` returns `True` if both operands are `True`.
- `or` returns `True` if at least one operand is `True`.
- `not` negates the Boolean value.
- `!=` can be used as a simple XOR for two Boolean values.
Strings

Python supports strings, which are sequences of characters:
[6]:
# String literals
hello = 'hello' # Single quotes
world = "world" # Double quotes (equivalent)
print(hello) # hello
print(len(hello)) # String length; 5
# Concatenation
hw = hello + ' ' + world
print(hw) # hello world
# Recommended: f-strings for inlining variables
hw12 = f'{hello} {world} {12}'
print(hw12) # hello world 12
hello
5
hello world
hello world 12
- Strings can use single or double quotes interchangeably.
- `len()` returns the length of a string.
- `+` concatenates strings.
- f-strings (formatted string literals) allow embedding variables and expressions directly.
Python Containers
Python extensively relies on four types of containers:
Lists
Dictionaries
Sets
Tuples
Lists
Lists are resizable arrays that can hold heterogeneous elements:
[7]:
# Creating a list
xs = [3, 1, 2, 'foo']
print(xs, xs[2]) # [3, 1, 2, 'foo'] 2
# Accessing elements
print(xs[-1]) # Negative indices count from the end; 'foo'
# Updating elements
xs[2] = 'foo'
print(xs) # [3, 1, 'foo', 'foo']
# Adding and removing elements
xs.append('bar') # Add to end
print(xs) # [3, 1, 'foo', 'foo', 'bar']
x = xs.pop() # Remove and return last element
print(x, xs) # bar [3, 1, 'foo', 'foo']
[3, 1, 2, 'foo'] 2
foo
[3, 1, 'foo', 'foo']
[3, 1, 'foo', 'foo', 'bar']
bar [3, 1, 'foo', 'foo']
Looping over a list
[8]:
animals = ['cat', 'dog', 'monkey']
for animal in animals:
print(animal)
cat
dog
monkey
Notes:

- Lists are ordered and mutable.
- Use negative indices to access elements from the end.
- `append()` adds elements; `pop()` removes and returns the last element.
- Lists can contain elements of different types.
Dictionaries:

A dictionary stores (key, value) pairs, similar to a map in other languages.

Tip: `collections.defaultdict` can be convenient when you need default values for missing keys.
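The `defaultdict` tip above can be sketched with a small example (the word-counting use case here is illustrative, not from the tutorial):

```python
from collections import defaultdict

# defaultdict calls the factory (here int, which returns 0) for missing keys
counts = defaultdict(int)
for word in ['cat', 'dog', 'cat']:
    counts[word] += 1  # no KeyError on first access

print(counts['cat'])   # 2
print(counts['fish'])  # 0 (created on first access)
```

With a plain `dict`, the first `counts[word] += 1` for each word would raise a `KeyError`.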
[9]:
# Creating a dictionary
d = {'cat': 'cute', 'dog': 'furry'}
# Accessing values
print(d['cat']) # cute
print('cat' in d) # True
# Adding or updating entries
d['fish'] = 'wet'
print(d)
print(d['fish']) # wet
# Accessing keys safely
# print(d['monkey']) # KeyError if key does not exist
print(d.get('monkey', 'N/A')) # N/A
print(d.get('fish', 'N/A')) # wet
# Removing entries
del d['fish']
print(d.get('fish', 'N/A')) # N/A
cute
True
{'cat': 'cute', 'dog': 'furry', 'fish': 'wet'}
wet
N/A
wet
N/A
Notes:

- Dictionaries store key-value pairs.
- Use `in` to check for keys.
- Use `get(key, default)` to safely access values without raising an error.
- `del` removes a key-value pair.
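Dictionaries also support iteration and comprehensions; a short sketch:

```python
d = {'cat': 'cute', 'dog': 'furry'}

# Iterate over key-value pairs
for animal, trait in d.items():
    print(f'A {animal} is {trait}')

# Dictionary comprehension: build a new dict from an iterable
lengths = {animal: len(animal) for animal in d}
print(lengths)  # {'cat': 3, 'dog': 3}
```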
Sets:
Sets are unordered collections of unique elements.
Note: Both dictionaries and sets use curly braces `{...}`, so be cautious when creating them.
[10]:
# Creating a set
animals = {'cat', 'dog'}
# Membership check
print('cat' in animals) # True
print('fish' in animals) # False
# Adding elements
animals.add('fish')
print('fish' in animals) # True
# Number of elements
print(len(animals)) # 3
# Adding duplicates does nothing
animals.add('cat')
print(len(animals)) # 3
# Removing elements
animals.remove('cat')
print(len(animals)) # 2
True
False
True
3
3
2
Notes:

- Sets are unordered and do not allow duplicates.
- Use `in` to check membership.
- `add()` inserts elements; `remove()` deletes elements.
- Curly braces `{...}` are used for both sets and dictionaries.
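Sets additionally support the usual set algebra (`|`, `&`, `-`) and comprehensions; a short sketch:

```python
a = {'cat', 'dog', 'fish'}
b = {'dog', 'bird'}

print(a | b)  # union: all elements from both sets
print(a & b)  # intersection: {'dog'}
print(a - b)  # difference: {'cat', 'fish'}

# Set comprehension: unique word lengths
lengths = {len(x) for x in a}
print(lengths)  # {3, 4}
```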
Tuples:

A tuple is an immutable, ordered list of values:
[11]:
# Creating a dictionary with tuple keys
d = {(x, x + 1): x for x in range(10)}
# Creating a tuple
t = (5, 6)
print(type(t)) # <class 'tuple'>
# Accessing dictionary values using tuples as keys
print(d[t]) # 5
print(d[(1, 2)]) # 1
<class 'tuple'>
5
1
Notes:

- Tuples are immutable; you cannot modify, add, or remove elements.
- Useful as dictionary keys or to represent fixed collections of items.
- Use parentheses `()` to create a tuple.
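Tuples also support unpacking into separate variables, which is worth a short sketch:

```python
t = (5, 6)

# Unpack a tuple into separate variables
x, y = t
print(x, y)  # 5 6

# Swap values without a temporary variable
x, y = y, x
print(x, y)  # 6 5

# Tuples are immutable: item assignment raises TypeError
try:
    t[0] = 1
except TypeError as e:
    print('Tuples cannot be modified:', e)
```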
Functions
Functions are reusable blocks of code that perform a specific task.
[12]:
# Define a function
def sign(x):
if x > 0:
return 'positive'
elif x < 0:
return 'negative'
else:
return 'zero'
# Test the function
for x in [-1, 0, 1]:
print(sign(x))
# Output:
# negative
# zero
# positive
negative
zero
positive
Notes:

- `def` is used to define a function.
- `return` outputs a value from the function.
- Functions can be called multiple times with different arguments.
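Functions can also take optional keyword arguments with default values; a small sketch (the `hello` function here is illustrative, not from the tutorial):

```python
def hello(name, loud=False):
    # loud defaults to False if the caller does not pass it
    if loud:
        return f'HELLO, {name.upper()}!'
    return f'Hello, {name}'

print(hello('Bob'))              # Hello, Bob
print(hello('Fred', loud=True))  # HELLO, FRED!
```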
PEP 484 introduced type hints for Python to improve code readability and provide optional static type checking, for example:
[13]:
def greet(name: str) -> str:
return "Hello " + name
greet("Alice")
[13]:
'Hello Alice'
NumPy
NumPy is the fundamental package for scientific computing in Python. It provides:

- A multidimensional array object (`ndarray`)
- Various derived objects, such as masked arrays and matrices
- A wide range of fast operations on arrays, including:
  - Mathematical and logical operations
  - Shape manipulation
  - Sorting and selecting
  - I/O
  - Discrete Fourier transforms
  - Basic linear algebra and statistical operations
  - Random simulations
Reference: What is NumPy?
[14]:
import numpy as np # Standard import convention
Motivation for using NumPy

NumPy is fast ⚡:
[15]:
# Pure Python matrix multiplication
def matrixmult(A, B):
rows_A = len(A)
cols_A = len(A[0])
rows_B = len(B)
cols_B = len(B[0])
if cols_A != rows_B:
print("Cannot multiply the two matrices. Incorrect dimensions.")
return
# Create the result matrix
C = [[0 for _ in range(cols_B)] for _ in range(rows_A)]
for i in range(rows_A):
for j in range(cols_B):
for k in range(cols_A):
C[i][j] += A[i][k] * B[k][j]
return C
[16]:
# Create random matrices
A = np.random.random((10**2, 10**2))
B = np.random.random((10**2, 10**2))
print(A.shape, B.shape)
(100, 100) (100, 100)
[17]:
# Timing pure Python multiplication
# Note: %time must be on the same line as the statement it measures;
# on a line of its own it times an empty statement
%time C = matrixmult(A, B)
print(np.sum(C))
250602.59693873872
[18]:
# Timing NumPy multiplication
%time C = A.dot(B)  # Note: A*B performs element-wise multiplication
print(np.sum(C))
250602.59693873874
Notes:

- NumPy's `dot()` function (and other operations) are highly optimized, often using compiled C/Fortran code under the hood.
- For large arrays, NumPy can be orders of magnitude faster than nested Python loops.
- `A * B` in NumPy performs element-wise multiplication, not matrix multiplication.
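As a quick check of the element-wise versus matrix-product distinction, a short sketch using the `@` operator (equivalent to `dot()`):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

print(x * y)          # element-wise product
print(x @ y)          # matrix product, same as x.dot(y)
print(np.matmul(x, y))  # also the matrix product
```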
Creating Numpy Arrays: NumPy arrays are multidimensional, fast, and convenient for numerical computations.
[19]:
import numpy as np
# Create an array from a Python list
a = np.array([
[1, 2, 3, 1],
[5, 7, 9, 10],
[4, 6, 8, 2],
])
print(a.shape) # Shape of the array (3, 4)
print(a[2, 2]) # Access single element (8)
print(a[1:2, 2:4]) # Slice rows and columns
print(a[:-1]) # All rows except the last
(3, 4)
8
[[ 9 10]]
[[ 1 2 3 1]
[ 5 7 9 10]]
Other ways of creating arrays
[20]:
# Array of zeros
a = np.zeros((2,2))
print(a) # [[0. 0.]
# [0. 0.]]
# Array of ones
b = np.ones((1,2))
print(b) # [[1. 1.]]
# Constant array
c = np.full((2,2), 7)
print(c) # [[7. 7.]
# [7. 7.]]
# Identity matrix
d = np.eye(2)
print(d) # [[1. 0.]
# [0. 1.]]
# Array of random values
e = np.random.random((2,2))
print(e)
print(e > 0.5) # Boolean array
print(e[e > 0.5]) # Filter values greater than 0.5
[[0. 0.]
[0. 0.]]
[[1. 1.]]
[[7 7]
[7 7]]
[[1. 0.]
[0. 1.]]
[[0.58519003 0.79314108]
[0.87911272 0.79805063]]
[[ True True]
[ True True]]
[0.58519003 0.79314108 0.87911272 0.79805063]
Notes:

- Use `np.array()` to convert Python lists to arrays.
- NumPy provides convenience functions: `zeros()`, `ones()`, `full()`, `eye()`, `random.random()`.
- Boolean indexing allows filtering arrays based on conditions (e.g., `e[e > 0.5]`).
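Beyond the constructors above, NumPy also offers `arange`, `linspace`, and `reshape` for building and restructuring arrays; a brief sketch:

```python
import numpy as np

a = np.arange(6)         # [0 1 2 3 4 5]
print(a.reshape(2, 3))   # reshape into 2 rows x 3 columns

b = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced values from 0 to 1
print(b)                 # [0.   0.25 0.5  0.75 1.  ]
```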
Numpy Operations
NumPy supports elementwise operations and many convenient mathematical functions:
[21]:
x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)
# Elementwise addition
print(x + y)
print(np.add(x, y))
# [[ 6.0 8.0]
# [10.0 12.0]]
# Elementwise subtraction
print(x - y)
print(np.subtract(x, y))
# [[-4.0 -4.0]
# [-4.0 -4.0]]
# Elementwise multiplication
print(x * y)
print(np.multiply(x, y))
# [[ 5.0 12.0]
# [21.0 32.0]]
# Elementwise division
print(x / y)
print(np.divide(x, y))
# [[0.2 0.33333333]
# [0.42857143 0.5 ]]
# Elementwise square root
print(np.sqrt(x))
# [[1. 1.41421356]
# [1.73205081 2. ]]
[[ 6. 8.]
[10. 12.]]
[[ 6. 8.]
[10. 12.]]
[[-4. -4.]
[-4. -4.]]
[[-4. -4.]
[-4. -4.]]
[[ 5. 12.]
[21. 32.]]
[[ 5. 12.]
[21. 32.]]
[[0.2 0.33333333]
[0.42857143 0.5 ]]
[[0.2 0.33333333]
[0.42857143 0.5 ]]
[[1. 1.41421356]
[1.73205081 2. ]]
Notes:

- NumPy operations are vectorized, meaning they apply to all elements without explicit loops.
- Most arithmetic operations have both operator and function forms (`+` vs `np.add`, `*` vs `np.multiply`).
- NumPy provides many universal functions (`sqrt`, `exp`, `log`, `sin`, etc.) that operate elementwise.
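Reductions such as `sum` and `mean` can also operate along a chosen axis; a short sketch:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)

print(np.sum(x))          # 10.0: sum of all elements
print(np.sum(x, axis=0))  # [4. 6.]: sum of each column
print(np.sum(x, axis=1))  # [3. 7.]: sum of each row
print(x.mean())           # 2.5
```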
Pandas
Note: We’ll explore Pandas in more detail in Getting Started 2
[22]:
import pandas as pd # Standard import convention
Pandas DataFrames
[23]:
df = pd.DataFrame({
'A': [1., 2., 3., 4.],
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': ['foo', 'bar', 'foo', 'bar']
})
display(df)
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 2.0 | 2013-01-02 | 1.0 | 3 | train | bar |
| 2 | 3.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 4.0 | 2013-01-02 | 1.0 | 3 | train | bar |
Notes:
DataFrames can hold heterogeneous data types (numbers, strings, timestamps, categorical, etc.).
Each column is a Series, which is a labeled, one-dimensional array.
DataFrames are indexed by default, allowing for easy access and manipulation of rows and columns.
Load some existing data
We can load datasets from the web or local files directly into a pandas DataFrame.
[24]:
# bash
# Download the dataset if not already present
! if [ ! -f iris.csv ]; then wget https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv; fi
[25]:
# bash
# Preview the first lines of the file
! head iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
[26]:
# Load the CSV file into a DataFrame
iris = pd.read_csv('iris.csv')
# Check the type
print(type(iris)) # <class 'pandas.core.frame.DataFrame'>
# Display the DataFrame
iris
<class 'pandas.core.frame.DataFrame'>
[26]:
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
Notes:

- Use `pd.read_csv()` to read CSV files into pandas DataFrames.
- Once loaded, the DataFrame can be manipulated, inspected, and visualized efficiently.
Viewing
Pandas provides convenient functions to explore and inspect a DataFrame:
[27]:
# Show the first few rows
print('Head')
display(iris.head(n=5))
# Show the last few rows
print('Tail')
display(iris.tail(n=3))
# Show a random sample of rows
print('Random sample')
display(iris.sample(n=5))
# List column names
display(iris.columns)
# Summary statistics for numeric columns
display(iris.describe())
Head
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Tail
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
Random sample
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 37 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 34 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 69 | 5.6 | 2.5 | 3.9 | 1.1 | versicolor |
| 79 | 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
| sepal_length | sepal_width | petal_length | petal_width | |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Notes:

- `head(n)` shows the first `n` rows; `tail(n)` shows the last `n` rows.
- `sample(n)` provides a random subset of rows.
- `columns` lists all column names.
- `describe()` provides basic statistics for numeric columns (mean, std, min, max, quartiles).
Selection: Pandas allows flexible selection of rows and columns, as well as filtering based on conditions.
[28]:
# Select specific columns
sample = iris.sample(n=5)
print('Selecting columns')
display(sample[['sepal_length', 'species']])
# Select specific rows
print('Selecting rows')
display(sample[:3])
# Filter rows based on criteria
print('Filter rows based on some criteria')
# Single condition
display(iris[iris['petal_length'] > 6.0])
# Multiple conditions (logical AND)
display(iris[(iris['petal_length'] > 6.0) & (iris['petal_width'] < 2.0)])
Selecting columns
| sepal_length | species | |
|---|---|---|
| 149 | 5.9 | virginica |
| 39 | 5.1 | setosa |
| 2 | 4.7 | setosa |
| 47 | 4.6 | setosa |
| 17 | 5.1 | setosa |
Selecting rows
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
| 39 | 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Filter rows based on some criteria
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 105 | 7.6 | 3.0 | 6.6 | 2.1 | virginica |
| 107 | 7.3 | 2.9 | 6.3 | 1.8 | virginica |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 | virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 107 | 7.3 | 2.9 | 6.3 | 1.8 | virginica |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
Notes:

- Use `df[columns]` to select specific columns.
- Slice `df[start:end]` to select rows by position.
- Boolean indexing allows filtering rows based on one or more conditions.
- Combine multiple conditions using `&` (AND) or `|` (OR) with parentheses.
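For label-based and position-based selection, pandas also provides `loc` and `iloc`; a minimal sketch using a small stand-in DataFrame (not the iris data above):

```python
import pandas as pd

# A small stand-in DataFrame; the iris data would work the same way
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'species': ['setosa', 'setosa', 'virginica'],
})

# loc: label-based selection (rows by index label, columns by name)
print(df.loc[0, 'species'])  # setosa
print(df.loc[df['species'] == 'setosa', 'sepal_length'])

# iloc: purely positional selection
print(df.iloc[0, 0])   # 5.1
print(df.iloc[:2, :])  # first two rows
```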
Operations: Pandas allows vectorized operations and applying functions across rows or columns.
[29]:
# Apply a function to each column or row
display(iris.sample(n=5).apply(np.cumsum)) # Cumulative sum down each column (string values are concatenated)
# Compute the mean of each numeric column
display(iris.mean(numeric_only=True))
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 75 | 6.6 | 3.0 | 4.4 | 1.4 | versicolor |
| 108 | 13.3 | 5.5 | 10.2 | 3.2 | versicolorvirginica |
| 26 | 18.3 | 8.9 | 11.8 | 3.6 | versicolorvirginicasetosa |
| 100 | 24.6 | 12.2 | 17.8 | 6.1 | versicolorvirginicasetosavirginica |
| 84 | 30.0 | 15.2 | 22.3 | 7.6 | versicolorvirginicasetosavirginicaversicolor |
sepal_length 5.843333
sepal_width 3.054000
petal_length 3.758667
petal_width 1.198667
dtype: float64
Notes:

- `apply(func)` applies a function to each column by default (axis=0) or to each row (axis=1).
- Use `numeric_only=True` when applying operations like `mean()` to ignore non-numeric columns.
- You can use NumPy functions like `np.cumsum`, `np.mean`, `np.sum`, etc., directly on numeric DataFrames.
- Operations on DataFrames are vectorized and efficient, avoiding explicit Python loops.
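Another common vectorized pattern is grouping rows and aggregating per group with `groupby`; a minimal sketch on a small stand-in DataFrame (not the iris data above):

```python
import pandas as pd

# A small stand-in for the iris DataFrame
df = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 4.5],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# Group rows by species and average each numeric column per group
means = df.groupby('species').mean(numeric_only=True)
print(means)
```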
Matplotlib
Matplotlib is one of the most widely used libraries for data visualization in Python.
[30]:
import matplotlib.pyplot as plt # Standard import convention
# To display plots inline, use this special Jupyter command
%matplotlib inline
Notes:

- `matplotlib.pyplot` is typically imported as `plt`; this is the standard convention.
- `%matplotlib inline` is a Jupyter-specific magic command that ensures plots appear directly within the notebook instead of opening in a separate window.
- Matplotlib supports multiple backends (inline, notebook, interactive, etc.), making it versatile for both exploratory research and production environments.
Barebones example
A simple example of plotting two functions (sine and cosine) using Matplotlib.
[31]:
# Create an array of values from 0 to 3π with a step of 0.01
t = np.arange(0, np.pi * 3, 0.01)
# Compute sine and cosine for each value in t
y1 = np.sin(t)
y2 = np.cos(t)
# Plot both functions
plt.plot(t, y1)
plt.plot(t, y2)
# Display the figure
plt.show()
Notes:

- `np.arange(start, stop, step)` creates evenly spaced values.
- `plt.plot()` draws a line plot for given x and y values.
- `plt.show()` renders the figure in the notebook or output cell.
- Multiple calls to `plt.plot()` allow overlaying several curves on the same axes.
Let’s beautify this
[32]:
# Create figure and axes
fig, ax = plt.subplots(nrows=1, ncols=1)
# Plot sine and cosine curves with labels and line width
# (raw strings avoid a SyntaxWarning for the LaTeX escape sequences)
ax.plot(t, y1, label=r'$\sin(x)$', linewidth=3.0)
ax.plot(t, y2, label=r'$\cos(x)$', linewidth=3.0)
# Add title and axis labels
ax.set_title('Sine and Cosine Functions', fontsize=16)
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$f(x)$', fontsize=16)
# Add legend with style options
ax.legend(loc='best', fancybox=True, framealpha=0.5, fontsize=16)
# Add gridlines
ax.grid(True)
# Display the figure
plt.show()
Notes:

- `plt.subplots()` creates a figure (`fig`) and axes (`ax`) for more control over plot elements.
- `ax.plot()` plots data on the specified axes. You can set labels, line width, and other style parameters.
- `ax.set_title()`, `ax.set_xlabel()`, and `ax.set_ylabel()` add titles and axis labels.
- `ax.legend()` displays a legend; `fancybox` and `framealpha` improve its appearance.
- `ax.grid(True)` adds gridlines for readability.
- This structured approach is preferred for complex or multi-panel plots.
Enhancing Plots with Seaborn
Seaborn is a Python library built on top of Matplotlib that simplifies plot styling, color palettes, and overall visual aesthetics. It is particularly useful for creating publication-quality plots with minimal configuration.
You can install Seaborn using:
pip install seaborn
Using Seaborn to Beautify Plots
[33]:
import seaborn as sns
# Set Seaborn theme for nicer default styles
sns.set_theme()
# Create figure and axes
fig, ax = plt.subplots(nrows=1, ncols=1)
# Plot sine and cosine functions (raw strings avoid a SyntaxWarning)
ax.plot(t, y1, label=r'$\sin(x)$', linewidth=3.0)
ax.plot(t, y2, label=r'$\cos(x)$', linewidth=3.0)
# Add title and axis labels
ax.set_title('Sine and Cosine Functions')
ax.set_xlabel('$x$')
ax.set_ylabel('$f(x)$')
# Display legend
ax.legend()
# Show the plot
plt.show()
# Reset to original Matplotlib styles if needed
sns.reset_orig()
Notes:

- `sns.set_theme()` automatically adjusts font sizes, line widths, and colors for a cleaner look.
- Seaborn overrides Matplotlib defaults, so use `sns.reset_orig()` to revert to the original Matplotlib styling.
- Seaborn works seamlessly with Matplotlib's object-oriented interface, allowing full control while improving aesthetics.
- Ideal for creating quick, attractive plots without manually adjusting every styling parameter.
Subplots
When you want to display multiple plots simultaneously, Matplotlib's `subplots()` function lets you arrange several axes within a single figure.
[34]:
# Time vector
dt = 0.01
t = np.arange(0, 30, dt)
# Generate two white noise signals
nse1 = np.random.randn(len(t)) # white noise 1
nse2 = np.random.randn(len(t)) # white noise 2
# Two signals with a coherent part at 10Hz and a random part
s1 = np.sin(2 * np.pi * 10 * t) + nse1
s2 = np.sin(2 * np.pi * 10 * t) + nse2
# Create figure and two subplots
fig, axs = plt.subplots(2, 1, figsize=(10.0, 6.0))
# Top plot: time series
axs[0].plot(t, s1, t, s2)
axs[0].set_xlim(0, 2)
axs[0].set_xlabel('time', fontsize=16)
axs[0].set_ylabel('s1 and s2', fontsize=16)
axs[0].grid(True)
# Bottom plot: coherence (pass NFFT and Fs as keywords; positional NFFT is deprecated)
cxy, f = axs[1].cohere(s1, s2, NFFT=256, Fs=1. / dt)
axs[1].set_xlabel('frequency', fontsize=16)
axs[1].set_ylabel('coherence', fontsize=16)
# Adjust layout to prevent overlap
fig.tight_layout()
plt.show()
Notes:
plt.subplots(nrows, ncols) returns a figure and an array of axes for plotting multiple plots in one figure.
You can index into axs to plot on individual subplots (e.g., axs[0], axs[1]).
figsize controls the overall figure size.
fig.tight_layout() automatically adjusts spacing between subplots to avoid overlapping labels.
Subplots allow you to compare related plots side by side or stacked vertically.
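The example above uses a single column of subplots, so axs is a 1-D array. When the grid has multiple rows and columns, axs becomes a 2-D array; a small sketch (not from the original notebook) showing how to iterate over it with axs.flat:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, axs = plt.subplots(2, 2, figsize=(8.0, 6.0))
# axs is a 2x2 array of Axes; axs.flat iterates over it in row-major order
for i, ax in enumerate(axs.flat):
    ax.plot(x, np.sin((i + 1) * x))
    ax.set_title(f'sin({i + 1}x)')
fig.tight_layout()
```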
Scatter Plots
[35]:
x_label = 'sepal_length'
y_label = 'sepal_width'
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5.0, 5.0))
# Plot each species with different colors
for spec in ['setosa', 'versicolor', 'virginica']:
df = iris[iris['species'] == spec]
ax.scatter(df[x_label], df[y_label], label=spec)
# Add title and axis labels
ax.set_title('Iris Dataset', fontsize=16)
ax.set_xlabel(x_label, fontsize=16)
ax.set_ylabel(y_label, fontsize=16)
# Display legend with style options
ax.legend(loc='best', fancybox=True, framealpha=0.5, fontsize=16)
# Add gridlines
ax.grid()
plt.show()
Notes:
ax.scatter(x, y) creates a scatter plot of x versus y.
Looping over categories allows coloring and labeling points by class.
Legends help identify different categories in the plot.
Gridlines improve readability of the scatter plot.
Figure size can be controlled using figsize.
Optional: JAX as NumPy on GPU
PyTorch uses torch.Tensor for arrays and supports operations like matrix multiplication and computing the mean. JAX provides a similar API to NumPy, called jax.numpy (jnp), which allows you to write code almost identical to NumPy while leveraging accelerators like GPUs or TPUs.
Think of JAX as NumPy on accelerators.
Note: The web version of this notebook was compiled without GPU acceleration. To see the performance benefits of JAX, run this notebook on Colab with a GPU or TPU backend.
[36]:
import jax
import jax.numpy as jnp # Common alias to differentiate from regular NumPy
import numpy as np
import torch
import random
import matplotlib.pyplot as plt
Notes:
jax.numpy (jnp) mirrors most of the NumPy API, allowing easy switching between them with minimal code changes.
Arrays created with JAX are immutable and support automatic differentiation, GPU/TPU acceleration, and just-in-time compilation.
Using jnp provides a clear distinction from regular NumPy arrays (np) to avoid confusion.
Arrays on GPU with JAX
JAX automatically places arrays on available devices, such as CPUs, GPUs, or TPUs. Unlike PyTorch, you usually don’t need to manually move arrays to a device.
Distinction from NumPy:
Regular NumPy arrays (np.array) always reside on the CPU and cannot utilize GPUs or TPUs.
With JAX (jnp.array), the same NumPy-like operations can run on accelerators automatically.
This allows you to write code almost identical to NumPy while benefiting from hardware acceleration.
[37]:
# Create a simple array
x = jnp.arange(10)
print(x)
[0 1 2 3 4 5 6 7 8 9]
[38]:
# Check which device the array x is on
print(x.device)
TFRT_CPU_0
[39]:
# Perform a computation
y = jnp.dot(x, x)
print(y)
285
[40]:
# Check which device the array y is on
print(y.device)
TFRT_CPU_0
Notes:
JAX automatically assigns arrays and computations to the best available device.
Use x.device (a property, as in the cells above) to check the current device.
If you are using Google Colab, try switching the runtime type between CPU, GPU, or TPU and observe how the device changes.
Operations on arrays in JAX are compiled and accelerated on the selected device without additional code.
JAXPRs (JAX Program Representations)
JAX traces Python functions into an intermediate representation called a JAXPR, which is what makes transformations such as jax.grad possible.
[41]:
# Define a simple function
def myfun(x, y):
z = x ** 2 + y
return z
x = jnp.array(2.0)
y = jnp.array(3.0)
# Display the JAXPR of the function
jax.make_jaxpr(myfun)(x, y)
[41]:
{ lambda ; a:f32[] b:f32[]. let
c:f32[] = integer_pow[y=2] a
d:f32[] = add c b
in (d,) }
[42]:
# Compute the derivative of the function with respect to the first argument
d_myfun = jax.grad(myfun) # returns a new function
jax.make_jaxpr(d_myfun)(x, y)
[42]:
{ lambda ; a:f32[] b:f32[]. let
c:f32[] = integer_pow[y=2] a
d:f32[] = integer_pow[y=1] a
e:f32[] = mul 2.0:f32[] d
_:f32[] = add c b
f:f32[] = mul 1.0:f32[] e
in (f,) }
[43]:
# Evaluate the derivative at a specific point
x = jnp.array(10.0)
y = jnp.array(12.0)
d_myfun(x, y)
[43]:
Array(20., dtype=float32, weak_type=True)
[44]:
# Compute the second-order derivative
jax.grad(jax.grad(myfun))(x, y)
[44]:
Array(2., dtype=float32, weak_type=True)
Notes:
JAXPRs are an intermediate representation that makes JAX transformations possible.
jax.grad(f) returns a new function representing the gradient of f with respect to its first argument.
You can nest jax.grad to compute higher-order derivatives.
JAX allows you to differentiate through nearly arbitrary Python+JAX code efficiently and automatically.
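As a plain-Python sanity check (not part of the original notebook), the derivative value 20.0 returned by d_myfun above can be reproduced with a central finite difference, since for \(f(x, y) = x^2 + y\) the derivative with respect to x is \(2x\):

```python
def myfun(x, y):
    return x ** 2 + y

# Central finite difference at the same point used above: x = 10, y = 12
x, y, h = 10.0, 12.0, 1e-6
fd = (myfun(x + h, y) - myfun(x - h, y)) / (2 * h)
# fd is approximately 2 * x = 20.0, matching jax.grad(myfun)(x, y)
```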
JAX is FAST ⚡
[45]:
# Create two random 500x500 matrices in JAX
rng = jax.random.PRNGKey(0)
key, rng = jax.random.split(rng)
m1 = jax.random.normal(key, (500, 500))
key, rng = jax.random.split(rng)
m2 = jax.random.normal(key, (500, 500))
# Check the shapes of the matrices
m1.shape, m2.shape
[45]:
((500, 500), (500, 500))
[46]:
# Time JAX matrix multiplication
# block_until_ready() ensures the GPU computation finishes before timing
%timeit jnp.dot(m1, m2).block_until_ready()
654 μs ± 20.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[47]:
# Create two random 500x500 matrices in PyTorch
b1 = torch.normal(torch.ones(500, 500), torch.ones(500, 500))
b2 = torch.normal(torch.ones(500, 500), torch.ones(500, 500))
# Time PyTorch matrix multiplication (synchronize for GPU timing)
%timeit torch.matmul(b1, b2); torch.cuda.synchronize() if torch.cuda.is_available() else None
152 μs ± 2.72 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Notes:
.block_until_ready() ensures accurate timing for asynchronous GPU operations in JAX.
JAX automatically dispatches computations to the GPU or TPU if available.
PyTorch requires explicit torch.cuda.synchronize() to measure GPU execution time accurately.
Both frameworks can perform matrix multiplication much faster on hardware accelerators than on CPU.
JIT compilation with jax.jit
JAX can just-in-time (JIT) compile functions to accelerate execution by fusing operations and optimizing memory usage.
[48]:
# Define a simple function
def myfun(x, y):
z = x ** 2 + y
return z
x = jnp.array(2.0)
y = jnp.array(3.0)
# Measure execution time without JIT
%timeit myfun(x, y).block_until_ready()
22 μs ± 822 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
[49]:
# Measure execution time with JIT
%timeit jax.jit(myfun)(x, y).block_until_ready()
29.6 μs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Notes:
jax.jit takes a JAX function and compiles it for faster execution.
Operations are fused into a single kernel, optimizing memory and compute.
This is conceptually similar to writing a CUDA kernel manually.
For a tiny scalar function like myfun, per-call dispatch overhead dominates, which is why the JIT-compiled version is not faster in the timings above; the gains show up for larger computations.
jax.jit can also be used as a decorator for cleaner syntax:
@jax.jit
def myfun(x, y):
    # code
    ...
myfun(x, y)
Using JIT is especially beneficial for repeated computations on large arrays or in loops.
Parallelization with jax.vmap
jax.vmap allows you to vectorize functions over batch dimensions automatically, eliminating explicit Python loops.
[50]:
# Define a function for single vector input
def myfun(x, y):
# compute dot product and then square the result
return jnp.dot(x, y) ** 2
# Single vector example
x = jnp.array([2, 2], dtype=jnp.float32)
y = jnp.array([3, 3], dtype=jnp.float32)
z = myfun(x, y)
print('result = (2*3 + 2*3)^2 =', z)
result = (2*3 + 2*3)^2 = 144.0
myfun takes two 1-dimensional arrays and returns a scalar. Shapes:
[51]:
# currently our function takes two 1-dimensional arrays as input
print('x shape', x.shape)
print('y shape', y.shape)
print('z shape', z.shape)
x shape (2,)
y shape (2,)
z shape ()
What if we have batches of data?
[52]:
x = jnp.array([[2, 2], [4, 4], [6, 6]], dtype=jnp.float32) # 3 batches of vectors
y = jnp.array([[3, 3], [5, 5], [7, 7]], dtype=jnp.float32) # 3 batches of vectors
print('x shape', x.shape)
print('y shape', y.shape)
try:
myfun(x, y)
except Exception as e:
print()
print('EXCEPTION THROWN!')
print(e)
x shape (3, 2)
y shape (3, 2)
EXCEPTION THROWN!
dot_general requires contracting dimensions to have the same shape, got (2,) and (3,).
We get an error because myfun is written for single vectors, not batches. The dot product cannot be directly applied to the input matrices in the intended batched manner.
Solution: jax.vmap
jax.vmap vectorizes a function to automatically map it over leading array axes:
[53]:
jax.vmap(myfun)(x, y)
[53]:
Array([ 144., 1600., 7056.], dtype=float32)
Each row of x and y is passed to myfun in parallel. This avoids writing explicit for-loops and is much faster on accelerators.
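For comparison, the same batched result can be computed in plain NumPy by vectorizing by hand (a sketch, not in the original); jax.vmap automates exactly this kind of rewrite:

```python
import numpy as np

x = np.array([[2, 2], [4, 4], [6, 6]], dtype=np.float32)  # 3 batches of vectors
y = np.array([[3, 3], [5, 5], [7, 7]], dtype=np.float32)  # 3 batches of vectors

# Row-wise dot products, then square: the NumPy analogue of jax.vmap(myfun)
z = np.sum(x * y, axis=1) ** 2
# z matches jax.vmap(myfun)(x, y): [144., 1600., 7056.]
```

The difference is that jax.vmap derives this batched version automatically from the single-vector function, with no manual rewriting.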
Combining with JIT
For even better performance, we can JIT-compile the vectorized function:
[54]:
# We can use JIT compilation on top of jax.vmap, speeding things up further!
jax.jit(jax.vmap(myfun))(x, y)
[54]:
Array([ 144., 1600., 7056.], dtype=float32)
Exercise 1: Linear Regression with the Normal Equation (NumPy or JAX)
In this exercise, you will compute the coefficient vector (\(\boldsymbol{\beta}\)) for Linear Regression using the Normal Equation and NumPy (or JAX), and use it to make predictions.
The Normal Equation formula is:
\(\boldsymbol{\beta} = (X^T X)^{-1} X^T y\)
Where:
\(X \in \mathbb{R}^{m \times n}\) is the feature matrix.
\(y \in \mathbb{R}^{m \times 1}\) is the target vector.
\(\boldsymbol{\beta} \in \mathbb{R}^{n \times 1}\) is the vector of coefficients (it becomes \((n+1) \times 1\) if a column of ones is prepended to \(X\) for the intercept).
The Prediction Equation formula is:
Once \(\boldsymbol{\beta}\) is computed, predictions are made using:
\(\hat{y} = X \boldsymbol{\beta}\)
Import Libraries and Generate a Synthetic Dataset
[55]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Parameters
a = 2.5
num_points = 50
noise_std = 5.0
# Generate x values
x = np.linspace(0, 10, num_points)
# Reshape x to be a feature matrix (num_points x 1)
X = x.reshape(-1, 1)
# Generate noisy y values
noise = np.random.normal(0, noise_std, size=num_points)
y = a * x + noise
# for demonstration purposes we'll use a sine function
#y = a * np.sin(x) + noise
TODO: Plot the Data using matplotlib
[56]:
# TODO: # Visualize the dataset
TODO: Compute Beta Matrix Using Normal Equation
[57]:
# TODO: Compute the beta matrix
# beta = (X_b^T X_b)^(-1) X_b^T y
beta = None # replace None
TODO: Make Predictions Using Beta Matrix
[58]:
# TODO: Use prediction equation: y_hat = X_b @ beta
y_pred = None # replace None
TODO: Visualize the predictions
[59]:
# TODO: Visualize the predictions
Solution - Exercise 1
Derivation of the Normal Equation of Linear Regression
Step 1: Linear Model
For a simple linear regression:
\(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n\)
where:
\(y_i\) = observed response
\(x_i\) = predictor
\(\beta_0\) = intercept
\(\beta_1\) = slope
\(\varepsilon_i\) = error term
Step 2: Design Matrix Form
All \(n\) equations can be written in matrix form:
\(y = X \beta + \varepsilon\)
with:
First column of \(X\) is all 1’s → represents the intercept \(\beta_0\).
Second column of \(X\) contains the predictor values \(x_i\) → represents the slope \(\beta_1\).
\(y\) is the vector of observed responses.
\(\varepsilon\) is the vector of residuals/errors.
Step 3: Residual Sum of Squares (RSS)
The RSS is a function of \(\beta\):
\(\mathrm{RSS}(\beta) = \varepsilon^T \varepsilon = (y - X\beta)^T (y - X\beta)\)
Residual vector:
\(\varepsilon = y - X\beta\)
Step 4: Expanding the RSS
\(\mathrm{RSS}(\beta) = y^T y - 2 y^T X \beta + \beta^T X^T X \beta\)
Quadratic term: \(\beta^T X^T X \beta\)
Linear term: \(-2 y^T X \beta\)
Constant term: \(y^T y\)
Step 5: Differentiation with Matrix Rules
Quadratic form: \(\frac{\partial}{\partial \beta} (\beta^T A \beta) = 2 A \beta\) if \(A\) is symmetric
Linear form: \(\frac{\partial}{\partial \beta} (b^T \beta) = b\)
Constant term: derivative = 0
Apply to RSS:
\(\frac{\partial \mathrm{RSS}}{\partial \beta} = 2 X^T X \beta - 2 X^T y\)
Step 6: Solve for \(\beta\) (Normal Equation)
Set derivative to zero:
\(2 X^T X \beta - 2 X^T y = 0 \quad \Rightarrow \quad X^T X \beta = X^T y\)
Closed-form solution:
\(\hat{\beta} = (X^T X)^{-1} X^T y\)
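As a quick check of the closed-form solution (a sketch, not part of the original exercise), the normal-equation estimate can be compared against NumPy's built-in least-squares solver on synthetic data with a known slope and intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)
X = np.column_stack((np.ones(n), x))           # design matrix with intercept column
y = 1.0 + 2.5 * x + rng.normal(0, 0.5, n)      # data from a known line plus noise

beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y       # normal equation
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # NumPy's least-squares solver

# Both solvers recover (approximately) intercept 1.0 and slope 2.5
```

In practice np.linalg.lstsq (or np.linalg.solve on the normal equations) is preferred over explicitly inverting \(X^T X\), which can be numerically unstable.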
[60]:
# Visualize the dataset - using a scatter plot
plt.scatter(x, y, label='Data Points')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy Linear Data')
plt.legend()
plt.show()
[61]:
# Compute the beta matrix - using the Normal Equation
# beta = (X_b^T X_b)^(-1) X_b^T y
# the matrix inverse is computed using np.linalg.inv
# the @ operator denotes matrix multiplication
beta = np.linalg.inv((X.T @ X)) @ X.T @ y
# alternatively matrix multiplication using np.dot
#beta = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
print('Estimated beta:', beta)
Estimated beta: [2.25792712]
[62]:
# Use prediction equation: y_hat = X_b @ beta
y_pred = X @ beta
# alternatively using np.dot
#y_pred = np.dot(X, beta)
print('Predicted y values:', y_pred)
Predicted y values: [ 0. 0.46080145 0.9216029 1.38240436 1.84320581 2.30400726
2.76480871 3.22561017 3.68641162 4.14721307 4.60801452 5.06881598
5.52961743 5.99041888 6.45122033 6.91202179 7.37282324 7.83362469
8.29442614 8.75522759 9.21602905 9.6768305 10.13763195 10.5984334
11.05923486 11.52003631 11.98083776 12.44163921 12.90244067 13.36324212
13.82404357 14.28484502 14.74564647 15.20644793 15.66724938 16.12805083
16.58885228 17.04965374 17.51045519 17.97125664 18.43205809 18.89285955
19.353661 19.81446245 20.2752639 20.73606536 21.19686681 21.65766826
22.11846971 22.57927116]
[63]:
# Visualize the predictions
# data points as scatter plot
plt.scatter(x, y, label='Data Points')
# line plot of predictions
plt.plot(x, y_pred, color='red', label='Linear Regression Fit')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy Linear Data with Predictions')
plt.legend()
plt.show()
Helper function for plots
A simple NumPy-based implementation of LOESS (locally weighted scatterplot smoothing).
[64]:
def lowess_numpy(x, y, frac=0.6, iters=3):
"""
Simple NumPy implementation of LOESS smoothing.
Parameters:
x (array): predictor values (1D)
y (array): response values (1D)
frac (float): fraction of data used in local regression
iters (int): robustness iterations
Returns:
smoothed (ndarray): smoothed y values corresponding to x
"""
n = len(x)
r = int(np.ceil(frac * n))
smoothed = np.zeros(n)
x_sorted_idx = np.argsort(x)
x_sorted = x[x_sorted_idx]
y_sorted = y[x_sorted_idx]
# Distance weights (tricube)
def tricube(d):
w = np.clip(1 - np.abs(d)**3, 0, 1)**3
return w
# Initial robustness weights
robustness = np.ones(n)
for iteration in range(iters):
for i in range(n):
# Distances to all points
distances = np.abs(x_sorted - x_sorted[i])
# Find bandwidth based on frac
bandwidth = np.sort(distances)[r]
# Compute weights
w = tricube(distances / bandwidth) * robustness
# Weighted linear regression
Xw = np.column_stack((np.ones(n), x_sorted))
W = np.diag(w)
beta = np.linalg.pinv(Xw.T @ W @ Xw) @ Xw.T @ W @ y_sorted
smoothed[i] = beta[0] + beta[1] * x_sorted[i]
# Update robustness weights (based on residuals)
residuals_iter = y_sorted - smoothed
s = np.median(np.abs(residuals_iter))
if s == 0:
break
robustness = tricube(residuals_iter / (6.0 * s))
# Return smoothed values in original order
smoothed_unsorted = np.zeros_like(smoothed)
smoothed_unsorted[x_sorted_idx] = smoothed
return smoothed_unsorted
Optional - Side note: Residual Analysis
In this section we’re using four common techniques to measure the quality of our linear regression model.
Purpose of Residual Analysis:
Check linearity: Residuals should show no systematic pattern when plotted against predicted values.
Check homoscedasticity: Residuals should have constant variance across all fitted values.
Check normality: Residuals should be approximately normally distributed for valid inference.
Identify outliers/influential points: Large residuals or high leverage points can disproportionately affect the model.
1 Residuals vs Fitted Plot
Curved pattern → consider adding nonlinear terms
Funnel shape → variance changes with fitted values (heteroscedasticity)
Clusters → possible missing categorical variables or interactions
[65]:
# Residual Plot for Linear Regression with outer quantiles
# compute residuals
residuals = y - y_pred
# Compute LOESS-smoothed residuals
loess_line = lowess_numpy(y_pred, residuals, frac=0.6)
plt.scatter(y_pred, residuals)
plt.plot(y_pred, loess_line, color='red', linewidth=2, label='LOESS smooth')
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Predicted y values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
2 Normal Q–Q Plot
S-shaped curve → skewed residuals
Heavy tails → outliers or non-normal errors
[66]:
# Normal Q-Q Plot
import scipy.stats as stats
# Generate a Q-Q plot using scipy.stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Normal Q-Q Plot')
plt.show()
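Under the hood, stats.probplot essentially pairs the sorted residuals with theoretical normal quantiles at evenly spread probabilities. A minimal sketch (on synthetic residuals, not the ones from the model above; the exact plotting positions probplot uses differ slightly):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=200)  # synthetic, normally distributed residuals

sample_q = np.sort(residuals)                                    # sample quantiles
probs = (np.arange(1, residuals.size + 1) - 0.5) / residuals.size
theory_q = stats.norm.ppf(probs)                                 # theoretical quantiles

# For normal residuals the (theory_q, sample_q) points lie close to a straight line,
# so their correlation is close to 1
corr = np.corrcoef(theory_q, sample_q)[0, 1]
```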
3 Scale–Location (Spread–Location) Plot
Upward/downward slope → variance changes with fitted values
Rescue: Transform the response (e.g., log(y)) or use weighted least squares.
[67]:
# Scale−Location Plot
# Compute sqrt of absolute residuals
sqrt_residuals = np.sqrt(np.abs(residuals))
# Compute LOESS smooth for sqrt(|residuals|)
loess_line = lowess_numpy(y_pred, sqrt_residuals, frac=0.6)
# Plot Scale–Location
plt.figure(figsize=(7,5))
plt.scatter(y_pred, sqrt_residuals, facecolors='none', edgecolors='k', label='Data')
plt.plot(y_pred, loess_line, color='red', linewidth=2, label='LOESS smooth')
plt.xlabel('Predicted y values')
plt.ylabel('Sqrt of |Residuals|')
plt.title('Scale–Location Plot with NumPy LOESS Smooth')
plt.legend()
plt.show()
4 Residuals vs Leverage Plot (with Cook’s Distance)
High leverage + large residuals → influential outliers
Cook’s D > 0.5–1 → may unduly affect model coefficients
Rescue: Investigate these points, consider removal, transformation, or special handling.
[68]:
# Residuals vs Leverage Plot
# Add constant (intercept term)
X_const = np.column_stack((np.ones(X.shape[0]), X))
# Fit OLS using NumPy: beta = (X'X)^(-1) X'y
beta = np.linalg.inv(X_const.T @ X_const) @ X_const.T @ y
# Predictions and residuals
y_pred = X_const @ beta
residuals = y - y_pred
# Compute leverage values (diagonal of hat matrix)
H = X_const @ np.linalg.inv(X_const.T @ X_const) @ X_const.T
leverage = np.diag(H)
# Estimate residual variance
n, p = X_const.shape
s2 = np.sum(residuals**2) / (n - p)
# Standardized residuals
std_resid = residuals / np.sqrt(s2 * (1 - leverage))
# Cook’s distance
cooks_d = (std_resid**2 * leverage) / (p * (1 - leverage))
# Compute LOESS smooth for residuals vs leverage
loess_line = lowess_numpy(leverage, std_resid, frac=0.6)
# --- Plot ---
plt.figure(figsize=(8,6))
sc = plt.scatter(leverage, std_resid, c=cooks_d, cmap='viridis', edgecolors='k', label='Data')
plt.plot(leverage, loess_line, color='red', linewidth=2, label='LOESS smooth')
plt.xlabel('Leverage')
plt.ylabel('Standardized Residuals')
plt.ylim(-5, 5)
plt.title('Residuals vs Leverage Plot (NumPy + LOESS)')
plt.colorbar(sc, label="Cook's Distance")
plt.axhline(y=0, color='gray', linestyle='--')
# Add Cook's distance contours (approximate reference lines)
x_vals = np.linspace(1e-3, np.max(leverage)*1.1, 100)  # start just above 0 to avoid division by zero
for d, color in zip([0.5, 1], ['orange', 'red']):
y_vals = np.sqrt((d * p * (1 - x_vals)) / x_vals)
plt.plot(x_vals, y_vals, color=color, linestyle='--', label=f"Cook's D = {d}")
plt.plot(x_vals, -y_vals, color=color, linestyle='--')
plt.legend()
plt.show()
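Two standard properties of the hat matrix offer a sanity check for the leverage computation above (a sketch on synthetic data, not in the original): its trace equals the number of parameters \(p\), and for a model with an intercept each leverage value lies between \(1/n\) and 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
# Intercept column plus one random predictor, as in the plot above
X_const = np.column_stack((np.ones(n), rng.normal(size=n)))

# Hat matrix H projects y onto the column space of X_const
H = X_const @ np.linalg.inv(X_const.T @ X_const) @ X_const.T
leverage = np.diag(H)
p = X_const.shape[1]

# trace(H) == p because H is a projection onto a p-dimensional subspace,
# and 1/n <= leverage_i <= 1 when the model contains an intercept
```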
Conclusion
This tutorial has guided you through setting up a Python environment, running Jupyter or Colab notebooks, and using basic notebook features like code cells, Markdown, and timing commands.
Basic Python programming concepts were introduced, along with the most important libraries for data analysis:
NumPy for numerical computing and array operations
Pandas for data loading, cleaning, and processing
Matplotlib and Seaborn for data visualization
With these foundations, you are now ready to experiment, document, and share your Python analyses efficiently.
References
Here are some useful references and resources to deepen your understanding:
Python
NumPy
Pandas
Matplotlib
Seaborn
JAX (Optional / Advanced)
General Data Science Resources
Additional References Used in This Tutorial