Getting started 1: Working with Jupyter and Python

Author: Christoph R. Landolt

This tutorial offers a short introduction to Jupyter Notebooks and Python. Feel free to skip this section if you are already familiar with them.

Tutorial Objectives

  • Set up an execution environment for this tutorial using Python and Jupyter Notebooks

  • Learn how to use Jupyter Notebooks

  • Get started with Python, NumPy, Pandas, and Matplotlib

Where to run these tutorials

You can run these tutorials in the following three ways:

Tool | Purpose | Best For | Key Strengths
--- | --- | --- | ---
Local execution environment | Run Python & Jupyter locally | Offline dev, custom setups, large datasets, privacy-sensitive work | Full control; no internet required; persistent storage; customizable hardware
Google Colab | Cloud-based Jupyter notebooks | Quick experiments, free GPU/TPU, education | No installation; free GPU/TPU; easy sharing via links; Google Drive integration
Kaggle Notebooks | Cloud notebooks for data analysis & competitions | Kaggle competitions, reproducible analysis, dataset exploration | Direct dataset access; free CPU/GPU; collaboration; versioned notebooks


Set up a local Environment

In this section, we provide an overview of different options for setting up a local Python environment and isolating separate execution environments.

Step 1: Understand the Tools

Before setting up your local Python environment, it’s important to understand what each tool does and when to use it.

Tool | Purpose | Best For | Key Strengths
--- | --- | --- | ---
Conda | Full environment + package manager | Data science, ML, scientific computing | Handles both Python & non-Python dependencies
pyenv + pyenv-virtualenv | Manage Python versions and isolated environments | Professional development, backend systems, multi-project setups | Fine-grained control over Python versions; lightweight; reproducible environments; integrates with CI/CD
virtualenv / venv | Create isolated environments for one Python version | Small to medium projects | venv is built into Python since version 3.3; simple and fast; no external dependencies


Step 2: Installation

After choosing your preferred toolset you can install the dependencies.

Option 1: Miniconda (Recommended for Data Science / ML)

The Anaconda distribution ships with an extensive collection of packages that we don’t need in this course. We therefore recommend Miniconda, a free minimal installer that includes only conda, Python, and a small number of essential packages.

Download the latest version for your OS:

👉 Miniconda Download

Windows Command Prompt

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o .\miniconda.exe
start /wait "" .\miniconda.exe /S
del .\miniconda.exe

macOS

mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all

Linux

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all

Option 2: pyenv + pyenv-virtualenv (Multi-Python Development for Linux and macOS)

  1. Install pyenv and friends:

curl https://pyenv.run | bash

👉 Note: https://pyenv.run redirects to the install script of the pyenv-installer project.

  2. Install a Python version:

pyenv install 3.12.0
pyenv global 3.12.0

  3. Create your first isolated environment:

pyenv virtualenv 3.12.0 mlcysec
pyenv activate mlcysec

Option 3: virtualenv / venv (Lightweight / Single Python Version)

The venv module is part of the standard library from Python 3.3 onward, so this option requires Python >= 3.3.

  1. Create a virtual environment:

python3 -m venv mlcysec

  2. Activate the environment:

macOS/Linux

source mlcysec/bin/activate

Windows

mlcysec\Scripts\activate
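
To confirm from within Python that the environment is really active, one option is to compare `sys.prefix` with `sys.base_prefix`. This is a minimal sketch: the helper name `in_virtualenv` is ours and not part of the tutorial, and the check assumes venv or a recent virtualenv (conda environments are detected differently).

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints False after activation, the shell is most likely still resolving `python` to the system interpreter.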

Jupyter and Colab Notebooks

Jupyter Notebooks allow us to combine code, visualizations, and written explanations in a single document, making it much easier to learn, present, and share analyses. The cell-based environment also lets us run and test small sections of code independently. A Jupyter Notebook consists of two main components:

  1. cells: A container for code or text (e.g., this is written within a markdown cell)

  2. kernels: The “computational engine” which executes code blocks of the notebook

Installing Jupyter

To run a Jupyter Notebook locally, first activate your virtual environment and then install Jupyter into it. This installation step is not necessary if you are using web-based platforms such as Google Colab or Kaggle, which provide ready-to-use notebook environments.

  1. Install the classic Jupyter Notebook with:

pip install notebook

  2. Run the notebook:

jupyter notebook

  3. Open the browser and visit http://localhost:8888

Note: If you’re missing dependencies, you can optionally run the following on Google Colab:

!pip install --upgrade pip && pip install -r requirements.txt

Or in your local Python environment:

pip install -r requirements.txt

Cells

Cells can contain either code or markdown

Check out keyboard shortcuts via Cmd/Ctrl + Shift + P.

A few important ones:

  • Shift + Enter: Executes the current cell and moves to the next

  • Tab: Autocompletes

  • Shift + Tab: Brings up documentation. Try this after entering np.ones(

[1]:
def say_hello():
    print('This is a code cell')

say_hello()
This is a code cell
[2]:
import sys
print("Python version")
print(sys.version)
print("Version info.")
print(sys.version_info)
Python version
3.12.2 (main, May  4 2025, 19:19:40) [Clang 17.0.0 (clang-1700.0.13.3)]
Version info.
sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)

or it can be a markdown cell, like this one.

If you’re unfamiliar with Markdown syntax, check out this cheat sheet.

Some things you can do with Markdown:

This is a level 2 heading

This is a level 3 heading

Syntax

This is some plain text that forms a paragraph. Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

Horizontal lines

You can divide the flow of your text using horizontal lines like so:

---

Quotes

If you need to quote a phrase, put a > right in front of the quote.

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.” ― Tony Hoare

Lists

  • Sometimes we want to include lists.

  • Which can be indented.

  1. Lists can also be numbered.

  2. For ordered lists.

Code blocks

Inline code uses single backticks: foo(), and code blocks use triple backticks:

bar()

Alternatively, code can be indented by 4 spaces:

foo()

LaTeX Code

\(y=x^2\)

\(e^{i\pi} + 1 = 0\)

\(e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i\)

\(\frac{n!}{k!(n-k)!} = {n \choose k}\)

\[\begin{split}A_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}\end{split}\]

Tables

First column name | Second column name
--- | ---
Row 1, Col 1 | Row 1, Col 2
Row 2, Col 1 | Row 2, Col 2

becomes:

First column name

Second column name

Row 1, Col 1

Row 1, Col 2

Row 2, Col 1

Row 2, Col 2

If you want to make your life easier, you can also use this online table generator. You can easily create your table using visual tools and then it will generate the markdown code for you.

HTML and images

In a notebook text cell, you may also enter HTML code like this:

<img src='https://www.uni-saarland.de/typo3conf/ext/uni_saarland/Resources/Public/Images/logo_uni_saarland.svg' width="200">

which gives the following:

[Image: Saarland University logo]

Alternatively, you may insert an image inline as follows

![Saarland University Logo](https://www.uni-saarland.de/typo3conf/ext/uni_saarland/Resources/Public/Images/logo_uni_saarland.svg)

Measuring time

When experimenting with different Python implementations, you often want to know which runs fastest.

The timeit magic command allows you to measure the execution time of a statement or an entire cell. By comparing alternatives, you can identify the most efficient approach.

[3]:
%timeit l = [k for k in range(10**6)]
23.1 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
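
Outside a notebook the `%timeit` magic is not available, but the standard-library `timeit` module offers similar measurements. The sketch below compares two ways of building the same list; the list size and the `number=20` repetition count are arbitrary choices for illustration, and the timings depend on your machine.

```python
import timeit

# Time two ways of building the same list; number=20 keeps the run short.
t_comprehension = timeit.timeit("[k for k in range(10**5)]", number=20)
t_constructor = timeit.timeit("list(range(10**5))", number=20)

print(f"list comprehension: {t_comprehension:.4f} s for 20 runs")
print(f"list(range(...)):   {t_constructor:.4f} s for 20 runs")
```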

Python

Python Versions

Python has two main versions: 2.7 and 3.x.

  • Python 3.x is not backward-compatible with Python 2, meaning Python 2 code may not run in Python 3.

  • For this course, we will use Python 3.7 or higher.

Note: Python 2.7 was once widely used, but it reached end of life on January 1, 2020 and no longer receives security updates or bug fixes. If you still use Python 2 anywhere, migrating to Python 3 is strongly recommended.

Basic Data Types

  • Numbers: Python supports integers (int) and floating-point numbers (float):

[4]:
# Integers
x = 3
print(type(x))  # <class 'int'>
print(x)        # 3

# Arithmetic operations
print(x + 1)    # 4
print(x - 1)    # 2
print(x * 2)    # 6
print(x ** 2)   # 9

# In-place updates
x += 1
print(x)        # 4
x *= 2
print(x)        # 8

# Floating-point numbers
y = 2.5
print(type(y))  # <class 'float'>
print(y, y + 1, y * 2, y ** 2)  # 2.5 3.5 5.0 6.25
<class 'int'>
3
4
2
6
9
4
8
<class 'float'>
2.5 3.5 5.0 6.25
  • +, -, *, ** are addition, subtraction, multiplication, and exponentiation, respectively.
  • += and *= are shorthand for updating a variable in place.
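
The cell above does not cover division, which behaves differently for the two number types. As a small supplementary sketch (the variable names `a` and `b` are ours):

```python
a, b = 7, 2

print(a / b)         # true division always returns a float: 3.5
print(a // b)        # floor division: 3
print(a % b)         # remainder: 1
print(-a // b)       # floor division rounds toward negative infinity: -4
print(divmod(a, b))  # quotient and remainder together: (3, 1)
print(type(a / 1))   # <class 'float'>, even though the value is whole
```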

  • Booleans

Python supports Boolean values: True and False. These are commonly used in logical operations:

[5]:
t = True
f = False

print(type(t))  # <class 'bool'>

# Logical operations
print(t and f)  # Logical AND; prints False
print(t or f)   # Logical OR; prints True
print(not t)    # Logical NOT; prints False
print(t != f)   # Logical XOR; prints True
<class 'bool'>
False
True
False
True
  • and returns True if both operands are True.

  • or returns True if at least one operand is True.

  • not negates the Boolean value.

  • != can be used as a simple XOR for two Boolean values.
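
Note that `!=` only acts as XOR when both operands are real Booleans. The sketch below shows the difference for other values and how `and`/`or` short-circuit; the helper `xor` is a hypothetical name introduced for illustration.

```python
def xor(a, b) -> bool:
    """Logical XOR that also works for non-Boolean (truthy/falsy) inputs."""
    return bool(a) != bool(b)


print(True != False)   # True: != behaves like XOR on Booleans
print(1 != 2)          # True, although both values are truthy
print(xor(1, 2))       # False: converting to bool first gives a real XOR

# and/or short-circuit and return one of their operands
print(0 or 'default')  # default
print([] and 'never')  # []
```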

  • Strings: Python supports strings, which are sequences of characters:

[6]:
# String literals
hello = 'hello'    # Single quotes
world = "world"    # Double quotes (equivalent)

print(hello)       # hello
print(len(hello))  # String length; 5

# Concatenation
hw = hello + ' ' + world
print(hw)          # hello world

# Recommended: f-strings for inlining variables
hw12 = f'{hello} {world} {12}'
print(hw12)        # hello world 12
hello
5
hello world
hello world 12
  • Strings can use single or double quotes interchangeably.

  • len() returns the length of a string.

  • + concatenates strings.

  • f-strings (formatted string literals) allow embedding variables and expressions directly.
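
f-strings also accept arbitrary expressions and format specifiers inside the braces. A short sketch, with example values chosen by us:

```python
name = 'world'
value = 3.14159

print(f'{name.upper()}!')   # method call inside the braces: WORLD!
print(f'{value:.2f}')       # two decimal places: 3.14
print(f'{name:>10}|')       # right-aligned in a field of width 10
print(f'{len(name) * 2}')   # any expression works: 10
```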

Python Containers

Python relies extensively on four built-in container types:

  1. Lists

  2. Dictionaries

  3. Sets

  4. Tuples

Lists

Lists are resizable arrays that can hold heterogeneous elements:

[7]:
# Creating a list
xs = [3, 1, 2, 'foo']
print(xs, xs[2])  # [3, 1, 2, 'foo'] 2

# Accessing elements
print(xs[-1])     # Negative indices count from the end; 'foo'

# Updating elements
xs[2] = 'foo'
print(xs)         # [3, 1, 'foo', 'foo']

# Adding and removing elements
xs.append('bar')  # Add to end
print(xs)         # [3, 1, 'foo', 'foo', 'bar']
x = xs.pop()      # Remove and return last element
print(x, xs)      # bar [3, 1, 'foo', 'foo']
[3, 1, 2, 'foo'] 2
foo
[3, 1, 'foo', 'foo']
[3, 1, 'foo', 'foo', 'bar']
bar [3, 1, 'foo', 'foo']

Looping over a list

[8]:
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
cat
dog
monkey

Notes:

  • Lists are ordered and mutable.

  • Use negative indices to access elements from the end.

  • append() adds elements; pop() removes and returns the last element.

  • Lists can contain elements of different types.
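
Beyond single-element access, lists support slicing and list comprehensions. The following sketch uses a made-up list `nums` to illustrate both:

```python
nums = [0, 1, 2, 3, 4, 5]

print(nums[1:4])    # elements at indices 1, 2, 3: [1, 2, 3]
print(nums[:2])     # first two elements: [0, 1]
print(nums[-2:])    # last two elements: [4, 5]
print(nums[::2])    # every second element: [0, 2, 4]

# A list comprehension builds a new list from an existing iterable
squares = [n ** 2 for n in nums if n % 2 == 0]
print(squares)      # [0, 4, 16]

# enumerate() yields (index, element) pairs while looping
for i, n in enumerate(nums[:2]):
    print(i, n)
```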

Dictionaries:

Dictionaries store key-value pairs, similar to a Map in other languages.
They are mutable, and since Python 3.7 they preserve insertion order (earlier versions gave no ordering guarantee).

Tip: collections.defaultdict can be convenient when you need default values for missing keys.

[9]:
# Creating a dictionary
d = {'cat': 'cute', 'dog': 'furry'}

# Accessing values
print(d['cat'])        # cute
print('cat' in d)      # True

# Adding or updating entries
d['fish'] = 'wet'
print(d)
print(d['fish'])       # wet

# Accessing keys safely
# print(d['monkey'])   # KeyError if key does not exist
print(d.get('monkey', 'N/A'))  # N/A
print(d.get('fish', 'N/A'))    # wet

# Removing entries
del d['fish']
print(d.get('fish', 'N/A'))    # N/A
cute
True
{'cat': 'cute', 'dog': 'furry', 'fish': 'wet'}
wet
N/A
wet
N/A

Notes:

  • Dictionaries store key-value pairs.

  • Use in to check for keys.

  • Use get(key, default) to safely access values without raising an error.

  • del removes a key-value pair.
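
The `collections.defaultdict` mentioned in the tip above can be sketched as follows. The word list is made up for illustration; `int` is used as the default factory so that missing keys start at 0.

```python
from collections import defaultdict

words = ['cat', 'dog', 'cat', 'fish', 'cat']

# int() returns 0, so a missing key starts at 0 instead of raising KeyError
counts = defaultdict(int)
for w in words:
    counts[w] += 1

# items() iterates over (key, value) pairs in insertion order
for word, n in counts.items():
    print(word, n)

print(dict(counts))  # {'cat': 3, 'dog': 1, 'fish': 1}
```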

Sets:

Sets are unordered collections of unique elements.

Note: Both dictionaries and sets use curly braces {...}. In particular, {} creates an empty dictionary; use set() to create an empty set.

[10]:
# Creating a set
animals = {'cat', 'dog'}

# Membership check
print('cat' in animals)    # True
print('fish' in animals)   # False

# Adding elements
animals.add('fish')
print('fish' in animals)   # True

# Number of elements
print(len(animals))        # 3

# Adding duplicates does nothing
animals.add('cat')
print(len(animals))        # 3

# Removing elements
animals.remove('cat')
print(len(animals))        # 2
True
False
True
3
3
2

Notes:

  • Sets are unordered and do not allow duplicates.

  • Use in to check membership.

  • add() inserts elements; remove() deletes elements.

  • Curly braces {...} are used for both sets and dictionaries.
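
A short sketch of the curly-brace pitfall and of the usual set operations; the example sets `a` and `b` are ours:

```python
empty_dict = {}      # curly braces alone create a dictionary
empty_set = set()    # an empty set needs the set() constructor
print(type(empty_dict), type(empty_set))

a = {'cat', 'dog', 'fish'}
b = {'dog', 'monkey'}

print(a & b)   # intersection: {'dog'}
print(a | b)   # union (element order may vary when printed)
print(a - b)   # difference: cat and fish (order may vary)

# remove() raises KeyError for a missing element; discard() does not
a.discard('elephant')
print(len(a))  # 3
```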

Tuples:

Tuples are immutable sequences: they are similar to lists but cannot be changed after creation.
Because they are immutable, tuples whose elements are all hashable can be used as dictionary keys.
[11]:
# Creating a dictionary with tuple keys
d = {(x, x + 1): x for x in range(10)}

# Creating a tuple
t = (5, 6)
print(type(t))    # <class 'tuple'>

# Accessing dictionary values using tuples as keys
print(d[t])       # 5
print(d[(1, 2)])  # 1
<class 'tuple'>
5
1

Notes:

  • Tuples are immutable; you cannot modify, add, or remove elements.

  • Useful as dictionary keys or to represent fixed collections of items.

  • Use parentheses () to create a tuple.
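
The sketch below illustrates unpacking, the single-element tuple syntax, and what immutability means in practice. The variable names are ours.

```python
point = (3, 4)
x, y = point        # unpacking assigns each element to a name
print(x, y)         # 3 4

single = (5,)       # a one-element tuple needs a trailing comma
not_a_tuple = (5)   # without the comma this is just the int 5
print(type(single), type(not_a_tuple))

try:
    point[0] = 10   # item assignment is not allowed on tuples
except TypeError as err:
    print('Tuples are immutable:', err)

# A tuple is hashable only if all of its elements are hashable
try:
    hash((1, [2, 3]))
except TypeError as err:
    print('Not usable as a dictionary key:', err)
```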

Functions

Functions are reusable blocks of code that perform a specific task.

[12]:
# Define a function
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

# Test the function
for x in [-1, 0, 1]:
    print(sign(x))
# Output:
# negative
# zero
# positive
negative
zero
positive

Notes:

  • def is used to define a function.

  • return outputs a value from the function.

  • Functions can be called multiple times with different arguments.

PEP 484 introduced type hints for Python to improve code readability and provide optional static type checking, for example:

[13]:
def greet(name: str) -> str:
    return "Hello " + name

greet("Alice")
[13]:
'Hello Alice'
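
Type hints are not enforced when the code runs; they are metadata for readers and for static checkers such as mypy. A minimal sketch, using a hypothetical function `double`:

```python
def double(x: int) -> int:
    return x * 2


print(double(21))              # 42
print(double("ab"))            # abab: the int hint is not checked at runtime
print(double.__annotations__)  # the hints are stored on the function object
```

Running a static type checker over this file would flag the second call, while the interpreter executes it without complaint.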

NumPy

NumPy is the fundamental package for scientific computing in Python.
It provides:
  • A multidimensional array object (ndarray)

  • Various derived objects, such as masked arrays and matrices

  • A wide range of fast operations on arrays, including:

    • Mathematical and logical operations

    • Shape manipulation

    • Sorting and selecting

    • I/O

    • Discrete Fourier transforms

    • Basic linear algebra and statistical operations

    • Random simulations

Reference: What is NumPy?

[14]:
import numpy as np  # Standard import convention

Motivation for using NumPy

NumPy is fast ⚡:

One of the main reasons to use NumPy is speed, especially for large numerical computations.
Below is an example comparing pure Python matrix multiplication versus NumPy’s optimized routines:
[15]:
# Pure Python matrix multiplication
def matrixmult(A, B):
    rows_A = len(A)
    cols_A = len(A[0])
    rows_B = len(B)
    cols_B = len(B[0])

    if cols_A != rows_B:
        print("Cannot multiply the two matrices. Incorrect dimensions.")
        return

    # Create the result matrix
    C = [[0 for _ in range(cols_B)] for _ in range(rows_A)]

    for i in range(rows_A):
        for j in range(cols_B):
            for k in range(cols_A):
                C[i][j] += A[i][k] * B[k][j]
    return C
[16]:
# Create random matrices
A = np.random.random((10**2, 10**2))
B = np.random.random((10**2, 10**2))
print(A.shape, B.shape)
(100, 100) (100, 100)
[17]:
%%time
# Timing pure Python multiplication
# (%%time must be the first line of the cell so that it times the whole cell;
#  a bare %time line only times an empty statement)
C = matrixmult(A, B)
print(np.sum(C))
250602.59693873872
[18]:
%%time
# Timing NumPy multiplication
# (%%time must be the first line of the cell so that it times the whole cell)
C = A.dot(B)  # Note: A*B performs element-wise multiplication
print(np.sum(C))
250602.59693873874

Notes:

  • NumPy’s dot() function and its other array operations are highly optimized, often using compiled C/Fortran code under the hood.

  • For large arrays, NumPy can be orders of magnitude faster than nested Python loops.

  • A * B in NumPy performs element-wise multiplication, not matrix multiplication.
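
The difference between matrix and element-wise multiplication can be checked directly. The sketch below uses its own matrices `X` and `Y` (names, sizes, and the seed are our choices) and `time.perf_counter` so that it also runs outside a notebook; the measured time depends on your machine.

```python
import time

import numpy as np

rng = np.random.default_rng(0)   # seeded generator for reproducibility
X = rng.random((200, 200))
Y = rng.random((200, 200))

start = time.perf_counter()
Z = X @ Y                        # matrix product, equivalent to X.dot(Y)
elapsed = time.perf_counter() - start
print(f"X @ Y took {elapsed * 1e3:.3f} ms")

print(np.allclose(Z, X.dot(Y)))  # True: @ and dot() agree
print(np.allclose(Z, X * Y))     # False: X * Y is element-wise
```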

Creating NumPy Arrays

NumPy arrays are multidimensional, fast, and convenient for numerical computations.

[19]:
import numpy as np

# Create an array from a Python list
a = np.array([
    [1, 2, 3, 1],
    [5, 7, 9, 10],
    [4, 6, 8, 2],
])

print(a.shape)        # Shape of the array (3, 4)
print(a[2, 2])        # Access single element (8)
print(a[1:2, 2:4])    # Slice rows and columns
print(a[:-1])         # All rows except the last
(3, 4)
8
[[ 9 10]]
[[ 1  2  3  1]
 [ 5  7  9 10]]

Other ways of creating arrays

[20]:
# Array of zeros
a = np.zeros((2,2))
print(a)  # [[0. 0.]
          #  [0. 0.]]

# Array of ones
b = np.ones((1,2))
print(b)  # [[1. 1.]]

# Constant array (the dtype follows the fill value, here an integer)
c = np.full((2,2), 7)
print(c)  # [[7 7]
          #  [7 7]]

# Identity matrix
d = np.eye(2)
print(d)  # [[1. 0.]
          #  [0. 1.]]

# Array of random values
e = np.random.random((2,2))
print(e)
print(e > 0.5)      # Boolean array
print(e[e > 0.5])   # Filter values greater than 0.5
[[0. 0.]
 [0. 0.]]
[[1. 1.]]
[[7 7]
 [7 7]]
[[1. 0.]
 [0. 1.]]
[[0.58519003 0.79314108]
 [0.87911272 0.79805063]]
[[ True  True]
 [ True  True]]
[0.58519003 0.79314108 0.87911272 0.79805063]

Notes:

  • Use np.array() to convert Python lists to arrays.

  • NumPy provides convenience functions: zeros(), ones(), full(), eye(), random.random().

  • Boolean indexing allows filtering arrays based on conditions (e.g., e[e > 0.5]).
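
Two further constructors that appear often are `np.arange` and `np.linspace`, usually combined with `reshape`. A brief sketch with values chosen by us:

```python
import numpy as np

r = np.arange(6)                  # integers 0..5
print(r)                          # [0 1 2 3 4 5]

m = r.reshape(2, 3)               # same data viewed as 2 rows x 3 columns
print(m.shape)                    # (2, 3)

pts = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced values, endpoints included
print(pts)                        # [0.   0.25 0.5  0.75 1.  ]

print(m[m % 2 == 0])              # Boolean indexing keeps even values: [0 2 4]
```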

NumPy Operations

NumPy supports elementwise operations and many convenient mathematical functions:

[21]:
x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

# Elementwise addition
print(x + y)
print(np.add(x, y))
# [[ 6.0  8.0]
#  [10.0 12.0]]

# Elementwise subtraction
print(x - y)
print(np.subtract(x, y))
# [[-4.0 -4.0]
#  [-4.0 -4.0]]

# Elementwise multiplication
print(x * y)
print(np.multiply(x, y))
# [[ 5.0 12.0]
#  [21.0 32.0]]

# Elementwise division
print(x / y)
print(np.divide(x, y))
# [[0.2        0.33333333]
#  [0.42857143 0.5       ]]

# Elementwise square root
print(np.sqrt(x))
# [[1.         1.41421356]
#  [1.73205081 2.        ]]
[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]
[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]
[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[1.         1.41421356]
 [1.73205081 2.        ]]

Notes:

  • NumPy operations are vectorized, meaning they apply to all elements without explicit loops.

  • Most arithmetic operations have both operator and function forms (+ vs np.add, * vs np.multiply).

  • NumPy provides many universal functions (sqrt, exp, log, sin, etc.) that operate elementwise.
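
Besides element-wise functions, arrays support reductions along an axis and a true matrix product. The following is a small sketch reusing the 2×2 array `x` from above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)

print(np.sum(x))          # sum of all elements: 10.0
print(np.sum(x, axis=0))  # column sums: [4. 6.]
print(np.sum(x, axis=1))  # row sums: [3. 7.]
print(x.mean(axis=0))     # column means: [2. 3.]
print(x.T)                # transpose
print(x @ x)              # matrix product: [[ 7. 10.] [15. 22.]]
```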

Pandas

Pandas is a Python library designed to make working with relational or labeled data easy and intuitive.
It is a convenient tool for real-world data analysis in Python.

Note: We’ll explore Pandas in more detail in Getting Started 2.

[22]:
import pandas as pd # Standard import convention

Pandas DataFrames

Often you’ll be dealing with \(d\)-dimensional data points (that is, \(d\) features per data point).
A DataFrame provides a convenient way to encapsulate this data in a tabular structure with rows and columns.
[23]:
df = pd.DataFrame({
    'A': [1., 2., 3., 4.],
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': ['foo', 'bar', 'foo', 'bar']
})

display(df)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 2.0 2013-01-02 1.0 3 train bar
2 3.0 2013-01-02 1.0 3 test foo
3 4.0 2013-01-02 1.0 3 train bar

Notes:

  • DataFrames can hold heterogeneous data types (numbers, strings, timestamps, categorical, etc.).

  • Each column is a Series, which is a labeled, one-dimensional array.

  • DataFrames are indexed by default, allowing for easy access and manipulation of rows and columns.
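
To see that each column is a Series with its own dtype, consider this small sketch. The DataFrame contents and column names are made up for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'score': [1.5, 2.5, 3.0],
    'count': [1, 2, 3],
})

print(df_small.dtypes)          # one dtype per column

col = df_small['score']         # selecting one column returns a Series
print(type(col))
print(col.mean())               # Series support vectorized statistics

print(df_small.shape)           # (3, 3)
print(df_small.index.tolist())  # default integer index: [0, 1, 2]
```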

Load some existing data

We can load datasets from the web or local files directly into a pandas DataFrame.

[24]:
# bash
# Download the dataset if not already present
! if [ ! -f iris.csv ]; then wget https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv; fi
[25]:
# bash
# Preview the first lines of the file
! head iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
[26]:
# Load the CSV file into a DataFrame
iris = pd.read_csv('iris.csv')

# Check the type
print(type(iris))  # <class 'pandas.core.frame.DataFrame'>

# Display the DataFrame
iris
<class 'pandas.core.frame.DataFrame'>
[26]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Notes:

  • Use pd.read_csv() to read CSV files into pandas DataFrames.

  • Once loaded, the DataFrame can be manipulated, inspected, and visualized efficiently.
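
`pd.read_csv()` accepts file-like objects as well as paths, which is handy for quick tests without downloading anything. The sketch below parses the header and first two data rows of the preview above from an in-memory string; the variable names are ours.

```python
import io

import pandas as pd

csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
"""

# StringIO wraps the text in a file-like object that read_csv can consume
small = pd.read_csv(io.StringIO(csv_text))

print(small.shape)   # (2, 5)
print(small.dtypes)  # numeric columns are parsed as floats
```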

Viewing

Pandas provides convenient functions to explore and inspect a DataFrame:

[27]:
# Show the first few rows
print('Head')
display(iris.head(n=5))

# Show the last few rows
print('Tail')
display(iris.tail(n=3))

# Show a random sample of rows
print('Random sample')
display(iris.sample(n=5))

# List column names
display(iris.columns)

# Summary statistics for numeric columns
display(iris.describe())
Head
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Tail
sepal_length sepal_width petal_length petal_width species
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
Random sample
sepal_length sepal_width petal_length petal_width species
37 4.9 3.1 1.5 0.1 setosa
34 4.9 3.1 1.5 0.1 setosa
69 5.6 2.5 3.9 1.1 versicolor
79 5.7 2.6 3.5 1.0 versicolor
103 6.3 2.9 5.6 1.8 virginica
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Notes:

  • head(n) shows the first n rows; tail(n) shows the last n rows.

  • sample(n) provides a random subset of rows.

  • columns lists all column names.

  • describe() provides basic statistics for numeric columns (mean, std, min, max, quartiles).

Selection

Pandas allows flexible selection of rows and columns, as well as filtering based on conditions.

[28]:
# Select specific columns
sample = iris.sample(n=5)
print('Selecting columns')
display(sample[['sepal_length', 'species']])

# Select specific rows
print('Selecting rows')
display(sample[:3])

# Filter rows based on criteria
print('Filter rows based on some criteria')

# Single condition
display(iris[iris['petal_length'] > 6.0])

# Multiple conditions (logical AND)
display(iris[(iris['petal_length'] > 6.0) & (iris['petal_width'] < 2.0)])
Selecting columns
sepal_length species
149 5.9 virginica
39 5.1 setosa
2 4.7 setosa
47 4.6 setosa
17 5.1 setosa
Selecting rows
sepal_length sepal_width petal_length petal_width species
149 5.9 3.0 5.1 1.8 virginica
39 5.1 3.4 1.5 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
Filter rows based on some criteria
sepal_length sepal_width petal_length petal_width species
105 7.6 3.0 6.6 2.1 virginica
107 7.3 2.9 6.3 1.8 virginica
109 7.2 3.6 6.1 2.5 virginica
117 7.7 3.8 6.7 2.2 virginica
118 7.7 2.6 6.9 2.3 virginica
122 7.7 2.8 6.7 2.0 virginica
130 7.4 2.8 6.1 1.9 virginica
131 7.9 3.8 6.4 2.0 virginica
135 7.7 3.0 6.1 2.3 virginica
sepal_length sepal_width petal_length petal_width species
107 7.3 2.9 6.3 1.8 virginica
130 7.4 2.8 6.1 1.9 virginica

Notes:

  • Use df[columns] to select specific columns.

  • Slice df[start:end] to select rows by position.

  • Boolean indexing allows filtering rows based on one or more conditions.

  • Combine multiple conditions using & (AND) or | (OR) with parentheses.
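
For explicit label-based or position-based access, pandas also provides `.loc` and `.iloc`. The sketch below uses a toy DataFrame with made-up values and a non-default index to show the difference, and combines two conditions with `|`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'petal_length': [1.4, 4.7, 6.1],
        'species': ['setosa', 'versicolor', 'virginica'],
    },
    index=[10, 20, 30],   # index labels differ from positions on purpose
)

print(toy.loc[20, 'species'])   # label-based lookup: versicolor
print(toy.iloc[0, 1])           # position-based lookup: setosa

# Logical OR of two conditions; each condition needs its own parentheses
mask = (toy['petal_length'] < 2.0) | (toy['species'] == 'virginica')
print(toy[mask])                # rows with index labels 10 and 30
```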

Operations

Pandas allows vectorized operations and applying functions across rows or columns.

[29]:
# Apply a function to each column or row
display(iris.sample(n=5).apply(np.cumsum))  # Cumulative sum down each column (strings are concatenated)

# Compute the mean of each numeric column
display(iris.mean(numeric_only=True))
sepal_length sepal_width petal_length petal_width species
75 6.6 3.0 4.4 1.4 versicolor
108 13.3 5.5 10.2 3.2 versicolorvirginica
26 18.3 8.9 11.8 3.6 versicolorvirginicasetosa
100 24.6 12.2 17.8 6.1 versicolorvirginicasetosavirginica
84 30.0 15.2 22.3 7.6 versicolorvirginicasetosavirginicaversicolor
sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64

Notes:

  • apply(func) applies a function to each column by default (axis=0) or to each row (axis=1).

  • Use numeric_only=True when applying operations like mean() to ignore non-numeric columns.

  • You can use NumPy functions like np.cumsum, np.mean, np.sum, etc., directly on numeric DataFrames.

  • Operations on DataFrames are vectorized and efficient, avoiding explicit Python loops.
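
The effect of the `axis` argument is easiest to see on a tiny numeric DataFrame. This is an illustrative sketch with made-up values:

```python
import numpy as np
import pandas as pd

num = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

col_sums = num.apply(np.sum)           # axis=0 (default): one value per column
row_sums = num.apply(np.sum, axis=1)   # axis=1: one value per row
print(col_sums)
print(row_sums)

# Vectorized arithmetic works without apply()
print(num * 2)

# A lambda receives each column as a Series; here every column is centered
centered = num.apply(lambda col: col - col.mean())
print(centered)
```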

Matplotlib

Matplotlib is one of the most widely used libraries for data visualization in Python.

[30]:
import matplotlib.pyplot as plt # Standard import convention

# To display plots inline, use this special Jupyter command
%matplotlib inline

Notes:

  • matplotlib.pyplot is typically imported as plt — this is the standard convention.

  • %matplotlib inline is a Jupyter-specific magic command that ensures plots appear directly within the notebook instead of opening in a separate window.

  • Matplotlib supports multiple backends (inline, notebook, interactive, etc.), making it versatile for both exploratory research and production environments.

Barebones example

A simple example of plotting two functions (sine and cosine) using Matplotlib.

[31]:
# Create an array of values from 0 to 3π with a step of 0.01
t = np.arange(0, np.pi * 3, 0.01)

# Compute sine and cosine for each value in t
y1 = np.sin(t)
y2 = np.cos(t)

# Plot both functions
plt.plot(t, y1)
plt.plot(t, y2)

# Display the figure
plt.show()
[Figure: sine and cosine curves plotted over 0 to 3π]

Notes:

  • np.arange(start, stop, step) creates evenly spaced values.

  • plt.plot() draws a line plot for given x and y values.

  • plt.show() renders the figure in the notebook or output cell.

  • Multiple calls to plt.plot() allow overlaying several curves on the same axes.

Let’s beautify this

[32]:
# Create figure and axes
fig, ax = plt.subplots(nrows=1, ncols=1)

# Plot sine and cosine curves with labels and line width
# Raw strings (r'...') keep the LaTeX backslashes from being read as escape sequences
ax.plot(t, y1, label=r'$\sin(x)$', linewidth=3.0)
ax.plot(t, y2, label=r'$\cos(x)$', linewidth=3.0)

# Add title and axis labels
ax.set_title('Sine and Cosine Functions', fontsize=16)
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$f(x)$', fontsize=16)

# Add legend with style options
ax.legend(loc='best', fancybox=True, framealpha=0.5, fontsize=16)

# Add gridlines
ax.grid(True)

# Display the figure
plt.show()
[Figure: sine and cosine curves with title, axis labels, legend, and grid]

Notes:

  • plt.subplots() creates a figure (fig) and axes (ax) for more control over plot elements.

  • ax.plot() plots data on the specified axes. You can set labels, line width, and other style parameters.

  • ax.set_title(), ax.set_xlabel(), and ax.set_ylabel() add titles and axis labels.

  • ax.legend() displays a legend; fancybox and framealpha improve its appearance.

  • ax.grid(True) adds gridlines for readability.

  • This structured approach is preferred for complex or multi-panel plots.

Enhancing Plots with Seaborn

Seaborn is a Python library built on top of Matplotlib that simplifies plot styling, color palettes, and overall visual aesthetics. It is particularly useful for creating publication-quality plots with minimal configuration.

You can install Seaborn using:

pip install seaborn

Using Seaborn to Beautify Plots

[33]:
import seaborn as sns

# Set Seaborn theme for nicer default styles
sns.set_theme()

# Create figure and axes
fig, ax = plt.subplots(nrows=1, ncols=1)

# Plot sine and cosine functions
# Raw strings (r'...') keep the LaTeX backslashes from being read as escape sequences
ax.plot(t, y1, label=r'$\sin(x)$', linewidth=3.0)
ax.plot(t, y2, label=r'$\cos(x)$', linewidth=3.0)

# Add title and axis labels
ax.set_title('Sine and Cosine Functions')
ax.set_xlabel('$x$')
ax.set_ylabel('$f(x)$')

# Display legend
ax.legend()

# Show the plot
plt.show()

# Reset to original Matplotlib styles if needed
sns.reset_orig()
[Figure: sine and cosine curves rendered with the Seaborn theme]

Notes:

  • sns.set_theme() automatically adjusts font sizes, line widths, and colors for a cleaner look.

  • Seaborn overrides Matplotlib defaults, so use sns.reset_orig() to restore the original Matplotlib styling.

  • Seaborn works seamlessly with Matplotlib’s object-oriented interface, allowing full control while improving aesthetics.

  • Ideal for creating quick, attractive plots without manually adjusting every styling parameter.

Subplots

When you want to display multiple plots in one figure, Matplotlib’s subplots() function creates the figure together with a grid of axes, one per plot.

[34]:
# Time vector
dt = 0.01
t = np.arange(0, 30, dt)

# Generate two white noise signals
nse1 = np.random.randn(len(t))  # white noise 1
nse2 = np.random.randn(len(t))  # white noise 2

# Two signals with a coherent part at 10Hz and a random part
s1 = np.sin(2 * np.pi * 10 * t) + nse1
s2 = np.sin(2 * np.pi * 10 * t) + nse2

# Create figure and two subplots
fig, axs = plt.subplots(2, 1, figsize=(10.0, 6.0))

# Top plot: time series
axs[0].plot(t, s1, t, s2)
axs[0].set_xlim(0, 2)
axs[0].set_xlabel('time', fontsize=16)
axs[0].set_ylabel('s1 and s2', fontsize=16)
axs[0].grid(True)

# Bottom plot: coherence
# NFFT and Fs are passed by keyword; positional NFFT is deprecated since Matplotlib 3.10
cxy, f = axs[1].cohere(s1, s2, NFFT=256, Fs=1. / dt)
axs[1].set_xlabel('frequency', fontsize=16)
axs[1].set_ylabel('coherence', fontsize=16)

# Adjust layout to prevent overlap
fig.tight_layout()
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_94_1.png

Notes:

  • plt.subplots(nrows, ncols) returns a figure and an array of axes for plotting multiple plots in one figure.

  • You can index into axs to plot on individual subplots (e.g., axs[0], axs[1]).

  • figsize controls the overall figure size.

  • fig.tight_layout() automatically adjusts spacing between subplots to avoid overlapping labels.

  • Subplots allow you to compare related plots side by side or stacked vertically.

Scatter Plots

Scatter plots are useful for visualizing the relationship between two numeric variables.
In this example, we plot sepal length vs sepal width for the Iris dataset, coloring each species differently.
[35]:
x_label = 'sepal_length'
y_label = 'sepal_width'

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5.0, 5.0))

# Plot each species with different colors
for spec in ['setosa', 'versicolor', 'virginica']:
    df = iris[iris['species'] == spec]
    ax.scatter(df[x_label], df[y_label], label=spec)

# Add title and axis labels
ax.set_title('Iris Dataset', fontsize=16)
ax.set_xlabel(x_label, fontsize=16)
ax.set_ylabel(y_label, fontsize=16)

# Display legend with style options
ax.legend(loc='best', fancybox=True, framealpha=0.5, fontsize=16)

# Add gridlines
ax.grid()

plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_97_0.png

Notes:

  • ax.scatter(x, y) creates a scatter plot of x versus y.

  • Looping over categories allows coloring and labeling points by class.

  • Legends help identify different categories in the plot.

  • Gridlines improve readability of the scatter plot.

  • Figure size can be controlled using figsize.

Optional: JAX as NumPy on GPU

Deep learning frameworks often provide their own APIs for handling data arrays.
For example, PyTorch uses torch.Tensor for arrays and supports operations like matrix multiplication and computing the mean.

JAX provides a similar API to NumPy, called jax.numpy (jnp), which allows you to write code almost identical to NumPy while leveraging accelerators like GPUs or TPUs.

Think of JAX as NumPy on accelerators.

Note: The web version of this notebook was compiled without GPU acceleration. To see the performance benefits of JAX, run this notebook on Colab with a GPU or TPU backend.

[36]:
import jax
import jax.numpy as jnp  # Common alias to differentiate from regular NumPy

import numpy as np
import torch
import random
import matplotlib.pyplot as plt

Notes:

  • jax.numpy (jnp) mirrors most of the NumPy API, allowing easy switching between them with minimal code changes.

  • Arrays created with JAX are immutable and support automatic differentiation, GPU/TPU acceleration, and just-in-time compilation.

  • Using jnp provides a clear distinction from regular NumPy arrays (np) to avoid confusion.

Arrays on GPU with JAX

JAX automatically places arrays on available devices, such as CPUs, GPUs, or TPUs. Unlike PyTorch, you usually don’t need to manually move arrays to a device.

Distinction from NumPy:

  • Regular NumPy arrays (np.array) always reside on the CPU and cannot utilize GPUs or TPUs.

  • With JAX (jnp.array), the same NumPy-like operations can run on accelerators automatically.

  • This allows you to write code almost identical to NumPy while benefiting from hardware acceleration.

[37]:
# Create a simple array
x = jnp.arange(10)
print(x)
[0 1 2 3 4 5 6 7 8 9]
[38]:
# Check which device the array x is on
print(x.device)
TFRT_CPU_0
[39]:
# Perform a computation
y = jnp.dot(x, x)
print(y)
285
[40]:
# Check which device the array y is on
print(y.device)
TFRT_CPU_0

Notes:

  • JAX automatically assigns arrays and computations to the best available device.

  • Use x.device to check the current device, as in the cells above (recent JAX versions expose it as a property rather than a method).

  • If you are using Google Colab, try switching the runtime type between CPU, GPU, or TPU and observe how the device changes.

  • Operations on arrays in JAX are compiled and accelerated on the selected device without additional code.

JAXPRs (JAX Program Representations)

JAX transforms Python functions into JAXPRs, a low-level, primitive representation of the computation.
These representations enable optimizations and function transformations, such as automatic differentiation with jax.grad.
[41]:
# Define a simple function
def myfun(x, y):
    z = x ** 2 + y
    return z

x = jnp.array(2.0)
y = jnp.array(3.0)

# Display the JAXPR of the function
jax.make_jaxpr(myfun)(x, y)
[41]:
{ lambda ; a:f32[] b:f32[]. let
    c:f32[] = integer_pow[y=2] a
    d:f32[] = add c b
  in (d,) }
[42]:
# Compute the derivative of the function with respect to the first argument
d_myfun = jax.grad(myfun)  # returns a new function
jax.make_jaxpr(d_myfun)(x, y)
[42]:
{ lambda ; a:f32[] b:f32[]. let
    c:f32[] = integer_pow[y=2] a
    d:f32[] = integer_pow[y=1] a
    e:f32[] = mul 2.0:f32[] d
    _:f32[] = add c b
    f:f32[] = mul 1.0:f32[] e
  in (f,) }
[43]:
# Evaluate the derivative at a specific point
x = jnp.array(10.0)
y = jnp.array(12.0)
d_myfun(x, y)
[43]:
Array(20., dtype=float32, weak_type=True)
[44]:
# Compute the second-order derivative
jax.grad(jax.grad(myfun))(x, y)
[44]:
Array(2., dtype=float32, weak_type=True)

Notes:

  • JAXPRs are an intermediate representation that makes JAX transformations possible.

  • jax.grad(f) returns a new function representing the gradient of f with respect to its first argument.

  • You can nest jax.grad to compute higher-order derivatives.

  • JAX allows you to differentiate through nearly arbitrary Python+JAX code efficiently and automatically.
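
The derivative values returned by jax.grad above can be sanity-checked without JAX. The sketch below approximates the same derivatives with central finite differences in plain Python; the helper name central_diff and the step size h are illustrative assumptions, and unlike automatic differentiation the result is only an approximation.

```python
def myfun(x, y):
    """Same toy function as above: f(x, y) = x**2 + y."""
    return x ** 2 + y


def central_diff(f, x, h=1e-3):
    """Approximate f'(x) with a central finite difference (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# First derivative with respect to x at (x, y) = (10, 12)
dfdx = central_diff(lambda x: myfun(x, 12.0), 10.0)

# Second derivative: apply the same approximation to the first derivative
d2fdx2 = central_diff(
    lambda x: central_diff(lambda u: myfun(u, 12.0), x), 10.0
)

# Both values should be close to the jax.grad results (20 and 2)
print(dfdx, d2fdx2)
```

Finite differences are handy for spot checks like this one, but they need one extra function evaluation per input and lose precision for small h, which is why JAX differentiates the traced program instead.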

JAX is FAST ⚡

JAX can leverage GPUs/TPUs to perform computations extremely efficiently.
Here we compare matrix multiplication using JAX and PyTorch.
[45]:
# Create two random 500x500 matrices in JAX
rng = jax.random.PRNGKey(0)
key, rng = jax.random.split(rng)
m1 = jax.random.normal(key, (500, 500))
key, rng = jax.random.split(rng)
m2 = jax.random.normal(key, (500, 500))

# Check the shapes of the matrices
m1.shape, m2.shape
[45]:
((500, 500), (500, 500))
[46]:
# Time JAX matrix multiplication
# block_until_ready() ensures the GPU computation finishes before timing
%timeit jnp.dot(m1, m2).block_until_ready()
654 μs ± 20.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[47]:
# Create two random 500x500 matrices in PyTorch
b1 = torch.normal(torch.ones(500, 500), torch.ones(500, 500))
b2 = torch.normal(torch.ones(500, 500), torch.ones(500, 500))

# Time PyTorch matrix multiplication (synchronize for GPU timing)
%timeit torch.matmul(b1, b2); torch.cuda.synchronize() if torch.cuda.is_available() else None
152 μs ± 2.72 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Notes:

  • .block_until_ready() ensures accurate timing for asynchronous GPU operations in JAX.

  • JAX automatically dispatches computations to the GPU or TPU if available.

  • PyTorch requires explicit torch.cuda.synchronize() to measure GPU execution time accurately.

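  • Both frameworks can perform matrix multiplication much faster on hardware accelerators than on CPU. The timings shown above come from a CPU-only run (JAX reported the device TFRT_CPU_0 earlier), so they do not demonstrate an accelerator speedup.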
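
%timeit is an IPython magic and is not available in plain Python scripts. A rough equivalent can be sketched with the standard-library timeit module. The pure-Python matmul helper and the matrix size below are illustrative assumptions that stand in for the JAX and PyTorch calls.

```python
import timeit


def matmul(a, b):
    """Naive pure-Python matrix product, used only as a stand-in workload."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


n = 20
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

# Warm-up call, excluded from the measurement
matmul(a, b)

# 5 runs of 10 calls each, similar in spirit to %timeit's "runs" and "loops"
calls_per_run = 10
runs = timeit.repeat(lambda: matmul(a, b), number=calls_per_run, repeat=5)
per_call = [t / calls_per_run for t in runs]

mean = sum(per_call) / len(per_call)
print(f"{mean * 1e6:.1f} us per call (mean of {len(per_call)} runs)")
```

When the timed callable launches asynchronous accelerator work, it would still need block_until_ready() or torch.cuda.synchronize() inside the lambda, as described in the notes above.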

JIT compilation with jax.jit

JAX can just-in-time (JIT) compile functions to accelerate execution by fusing operations and optimizing memory usage.

[48]:
# Define a simple function
def myfun(x, y):
    z = x ** 2 + y
    return z

x = jnp.array(2.0)
y = jnp.array(3.0)

# Measure execution time without JIT
%timeit myfun(x, y).block_until_ready()
22 μs ± 822 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
[49]:
# Measure execution time with JIT
%timeit jax.jit(myfun)(x, y).block_until_ready()
29.6 μs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Notes:

  • jax.jit takes a JAX function and compiles it for faster execution.

  • Operations are fused into a single kernel, optimizing memory and compute.

  • This is conceptually similar to writing a CUDA kernel manually.

  • jax.jit can also be used as a decorator for cleaner syntax:

@jax.jit
def myfun(x, y):
    # code
    ...

myfun(x, y)
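  • Using JIT is especially beneficial for repeated computations on large arrays or in loops. For a tiny scalar function like the one timed above, per-call dispatch overhead dominates, so JIT shows no speedup; the timed expression also calls jax.jit(myfun) again on every iteration instead of reusing one compiled function.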

Parallelization with jax.vmap

jax.vmap allows you to vectorize functions over batch dimensions automatically, eliminating explicit Python loops.

[50]:
# Define a function for single vector input
def myfun(x, y):
    # compute dot product and then square the result
    return jnp.dot(x, y) ** 2

# Single vector example
x = jnp.array([2, 2], dtype=jnp.float32)
y = jnp.array([3, 3], dtype=jnp.float32)
z = myfun(x, y)
print('result = (2*3 + 2*3)^2 =', z)
result = (2*3 + 2*3)^2 = 144.0
  • myfun takes two 1-dimensional arrays and returns a scalar.

  • Shapes:

[51]:
# currently our function takes two 1-dimensional arrays as input
print('x shape', x.shape)
print('y shape', y.shape)
print('z shape', z.shape)
x shape (2,)
y shape (2,)
z shape ()

What if we have batches of data?

[52]:
x = jnp.array([[2, 2], [4, 4], [6, 6]], dtype=jnp.float32)  # 3 batches of vectors
y = jnp.array([[3, 3], [5, 5], [7, 7]], dtype=jnp.float32)  # 3 batches of vectors

print('x shape', x.shape)
print('y shape', y.shape)

try:
    myfun(x, y)
except Exception as e:
    print()
    print('EXCEPTION THROWN!')
    print(e)

x shape (3, 2)
y shape (3, 2)

EXCEPTION THROWN!
dot_general requires contracting dimensions to have the same shape, got (2,) and (3,).
  • We get an error because myfun is written for single vectors, not batches.

  • The dot product cannot be directly applied to the input matrices in the intended batched manner.

Solution: ``jax.vmap``

  • jax.vmap vectorizes a function to automatically map it over leading array axes:

[53]:
jax.vmap(myfun)(x, y)
[53]:
Array([ 144., 1600., 7056.], dtype=float32)
  • Each row of x and y is passed to myfun in parallel.

  • This avoids writing explicit for-loops and is much faster on accelerators.
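
To make the semantics concrete, here is a library-free sketch of what the vectorized call computes: the function is applied to each pair of rows along the leading axis. This is only a reference for the result. jax.vmap does not run a Python loop internally, and the helper names dot and batched_myfun are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def myfun(x, y):
    """Single-example function: squared dot product."""
    return dot(x, y) ** 2


def batched_myfun(xs, ys):
    """Explicit loop over the leading (batch) axis."""
    return [myfun(x, y) for x, y in zip(xs, ys)]


x_batch = [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
y_batch = [[3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]

result = batched_myfun(x_batch, y_batch)
print(result)  # [144.0, 1600.0, 7056.0]
```

The values match the jax.vmap output above; the difference is that vmap produces a single batched computation rather than one call per row.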

Combining with JIT

  • For even better performance, we can JIT-compile the vectorized function:

[54]:
# We can use JIT compilation on top of jax.vmap, speeding things up further!
jax.jit(jax.vmap(myfun))(x, y)
[54]:
Array([ 144., 1600., 7056.], dtype=float32)

Exercise 1: Linear Regression with the Normal Equation in NumPy (or JAX)

In this exercise, you will compute the beta matrix (\(\boldsymbol{\beta}\)) for Linear Regression using the Normal Equation and NumPy (or JAX), and use it to make predictions.

The Normal Equation formula is:

\(\boldsymbol{\beta} = (X^T X)^{-1} X^T y\)

Where:

  • \(X \in \mathbb{R}^{m \times n}\) is the feature matrix.

  • \(y \in \mathbb{R}^{m \times 1}\) is the target vector.

  • \(\boldsymbol{\beta} \in \mathbb{R}^{n \times 1}\) is the coefficient vector (the "beta matrix"), with one entry per column of \(X\).

Once \(\boldsymbol{\beta}\) is computed, predictions are made using the prediction equation:

\(\hat{y} = X \boldsymbol{\beta}\)

Import Libraries and Generate a Synthetic Dataset

[55]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Parameters
a = 2.5
num_points = 50
noise_std = 5.0

# Generate x values
x = np.linspace(0, 10, num_points)

# Reshape x to be a feature matrix (num_points x 1)
X = x.reshape(-1, 1)

# Generate noisy y values
noise = np.random.normal(0, noise_std, size=num_points)
y = a * x + noise

# Optional alternative for experimentation: a sine-shaped target (uncomment to use)
#y = a * np.sin(x) + noise



TODO: Plot the Data using matplotlib

[56]:
# TODO: Visualize the dataset

TODO: Compute Beta Matrix Using Normal Equation

[57]:
# TODO: Compute the beta matrix
# beta = (X^T X)^(-1) X^T y

beta = None  # replace None

TODO: Make Predictions Using Beta Matrix

[58]:
# TODO: Use prediction equation: y_hat = X @ beta

y_pred = None  # replace None

TODO: Visualize the predictions

[59]:
# TODO: Visualize the predictions

Solution - Exercise 1

Derivation of the Normal Equation of Linear Regression

Step 1: Linear Model

For a simple linear regression:

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n\]

where:

  • \(y_i\) = observed response

  • \(x_i\) = predictor

  • \(\beta_0\) = intercept

  • \(\beta_1\) = slope

  • \(\varepsilon_i\) = error term

Step 2: Design Matrix Form

All \(n\) equations can be written in matrix form:

\[y = X \beta + \varepsilon\]

with:

\[\begin{split}X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}\end{split}\]
  • First column of \(X\) is all 1’s → represents the intercept \(\beta_0\).

  • Second column of \(X\) contains the predictor values \(x_i\) → represents the slope \(\beta_1\).

  • \(y\) is the vector of observed responses.

  • \(\varepsilon\) is the vector of residuals/errors.

Step 3: Residual Sum of Squares (RSS)

The RSS is a function of \(\beta\):

\[\text{RSS}(\beta) = \|y - X \beta\|^2 = (y - X \beta)^T (y - X \beta)\]

Residual vector:

\[\begin{split}y - X \beta = \begin{bmatrix} y_1 - (\beta_0 + \beta_1 x_1) \\ y_2 - (\beta_0 + \beta_1 x_2) \\ \vdots \\ y_n - (\beta_0 + \beta_1 x_n) \end{bmatrix}\end{split}\]

Step 4: Expanding the RSS

\[\text{RSS}(\beta) = (y - X \beta)^T (y - X \beta) = y^T y - 2 y^T X \beta + \beta^T X^T X \beta\]
  • Quadratic term: \(\beta^T X^T X \beta\)

  • Linear term: \(-2 y^T X \beta\)

  • Constant term: \(y^T y\)
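
For completeness, the intermediate step behind this expansion can be written out. Multiplying the two factors gives four terms:

\[(y - X \beta)^T (y - X \beta) = y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta\]

The two middle terms are scalars and each is the transpose of the other, so they are equal and combine into the linear term:

\[\beta^T X^T y = (y^T X \beta)^T = y^T X \beta \quad \Rightarrow \quad - y^T X \beta - \beta^T X^T y = -2 y^T X \beta\]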

Step 5: Differentiation with Matrix Rules

  • Quadratic form: \(\frac{\partial}{\partial \beta} (\beta^T A \beta) = 2 A \beta\) if \(A\) is symmetric

  • Linear form: \(\frac{\partial}{\partial \beta} (b^T \beta) = b\)

  • Constant term: derivative = 0

Apply to RSS:

\[\frac{\partial \text{RSS}}{\partial \beta} = 2 X^T X \beta - 2 X^T y\]

Step 6: Solve for \(\beta\) (Normal Equation)

Set derivative to zero:

\[2 X^T X \beta - 2 X^T y = 0 \quad \Rightarrow \quad X^T X \beta = X^T y\]

Closed-form solution:

\[\boxed{\beta = (X^T X)^{-1} X^T y}\]
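
As a quick sanity check of the closed-form solution, the following library-free sketch solves the 2×2 normal equations by hand for a tiny made-up dataset. The three data points and all variable names are illustrative assumptions and are not part of the exercise data.

```python
# Three points that lie exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0]

n = len(x)
sx = sum(x)                                # sum of x_i
sxx = sum(xi * xi for xi in x)             # sum of x_i^2
sy = sum(y)                                # sum of y_i
sxy = sum(xi * yi for xi, yi in zip(x, y)) # sum of x_i * y_i

# With a column of ones in X:
#   X^T X = [[n, sx], [sx, sxx]]   and   X^T y = [sy, sxy]
det = n * sxx - sx * sx  # determinant of X^T X (must be non-zero)

# beta = (X^T X)^(-1) X^T y, written out with the 2x2 inverse formula
beta0 = (sxx * sy - sx * sxy) / det  # intercept
beta1 = (n * sxy - sx * sy) / det    # slope

print(beta0, beta1)  # 1.0 2.0
```

Because the points lie exactly on a line, the recovered intercept and slope match the generating values; with noisy data the same formulas return the least-squares estimates.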
[60]:
# Visualize the dataset - using a scatter plot
plt.scatter(x, y, label='Data Points')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy Linear Data')
plt.legend()
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_145_0.png
[61]:
# Compute the beta matrix using the Normal Equation
# beta = (X^T X)^(-1) X^T y
# Note: X has no column of ones here, so this fit has no intercept (y ≈ beta * x)

# the matrix inverse is computed using np.linalg.inv
# the @ operator denotes matrix multiplication
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# alternatively matrix multiplication using np.dot
#beta = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))

print('Estimated beta:', beta)
Estimated beta: [2.25792712]
[62]:
# Use prediction equation: y_hat = X @ beta

y_pred = X @ beta

# alternatively using np.dot
#y_pred = np.dot(X, beta)

print('Predicted y values:', y_pred)

Predicted y values: [ 0.          0.46080145  0.9216029   1.38240436  1.84320581  2.30400726
  2.76480871  3.22561017  3.68641162  4.14721307  4.60801452  5.06881598
  5.52961743  5.99041888  6.45122033  6.91202179  7.37282324  7.83362469
  8.29442614  8.75522759  9.21602905  9.6768305  10.13763195 10.5984334
 11.05923486 11.52003631 11.98083776 12.44163921 12.90244067 13.36324212
 13.82404357 14.28484502 14.74564647 15.20644793 15.66724938 16.12805083
 16.58885228 17.04965374 17.51045519 17.97125664 18.43205809 18.89285955
 19.353661   19.81446245 20.2752639  20.73606536 21.19686681 21.65766826
 22.11846971 22.57927116]
[63]:
# Visualize the predictions

# data points as scatter plot
plt.scatter(x, y, label='Data Points')

# line plot of predictions
plt.plot(x, y_pred, color='red', label='Linear Regression Fit')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy Linear Data with Predictions')
plt.legend()
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_148_0.png

Helper function for plots

A simple NumPy-based implementation of LOESS (locally weighted scatterplot smoothing).

[64]:
def lowess_numpy(x, y, frac=0.6, iters=3):
    """
    Simple NumPy implementation of LOESS smoothing.

    Parameters:
        x (array): predictor values (1D)
        y (array): response values (1D)
        frac (float): fraction of data used in local regression
        iters (int): robustness iterations

    Returns:
        smoothed (ndarray): smoothed y values corresponding to x
    """
    n = len(x)
    r = int(np.ceil(frac * n))
    smoothed = np.zeros(n)
    x_sorted_idx = np.argsort(x)
    x_sorted = x[x_sorted_idx]
    y_sorted = y[x_sorted_idx]

    # Distance weights (tricube)
    def tricube(d):
        w = np.clip(1 - np.abs(d)**3, 0, 1)**3
        return w

    # Initial robustness weights
    robustness = np.ones(n)

    for iteration in range(iters):
        for i in range(n):
            # Distances to all points
            distances = np.abs(x_sorted - x_sorted[i])
            # Find bandwidth based on frac
            bandwidth = np.sort(distances)[min(r, n - 1)]  # clamp so frac=1.0 stays in bounds
            # Compute weights
            w = tricube(distances / bandwidth) * robustness
            # Weighted linear regression
            Xw = np.column_stack((np.ones(n), x_sorted))
            W = np.diag(w)
            beta = np.linalg.pinv(Xw.T @ W @ Xw) @ Xw.T @ W @ y_sorted
            smoothed[i] = beta[0] + beta[1] * x_sorted[i]

        # Update robustness weights (based on residuals)
        residuals_iter = y_sorted - smoothed
        s = np.median(np.abs(residuals_iter))
        if s == 0:
            break
        robustness = tricube(residuals_iter / (6.0 * s))

    # Return smoothed values in original order
    smoothed_unsorted = np.zeros_like(smoothed)
    smoothed_unsorted[x_sorted_idx] = smoothed
    return smoothed_unsorted
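
The tricube weighting used inside the helper can be inspected in isolation. The sketch below is a scalar, library-free version of the same formula, written only to show how the weight falls from 1 at distance 0 to 0 at the bandwidth and beyond; it is not a replacement for the vectorized helper above.

```python
def tricube(d):
    """Scalar tricube weight: (1 - |d|^3)^3, clipped to the range [0, 1]."""
    return max(0.0, min(1.0, 1.0 - abs(d) ** 3)) ** 3


# d is the distance divided by the bandwidth, so d >= 1 means "outside the window"
for d in (0.0, 0.5, 1.0, 1.5):
    print(d, tricube(d))
```

Points close to the target get a weight near 1, points at half the bandwidth still get roughly two thirds of the full weight, and anything at or beyond the bandwidth is ignored in the local regression.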

Optional - Side note: Residual Analysis

In this section we use four common diagnostic plots to assess the quality of our linear regression model.

Purpose of Residual Analysis:

  1. Check linearity: Residuals should show no systematic pattern when plotted against predicted values.

  2. Check homoscedasticity: Residuals should have constant variance across all fitted values.

  3. Check normality: Residuals should be approximately normally distributed for valid inference.

  4. Identify outliers/influential points: Large residuals or high leverage points can disproportionately affect the model.

1 Residuals vs Fitted Plot

Purpose: Check linearity and constant variance (homoscedasticity).
Good sign: Residuals randomly scattered around 0 → linear model fits well.
Bad signs:
  • Curved pattern → consider adding nonlinear terms

  • Funnel shape → variance changes with fitted values (heteroscedasticity)

  • Clusters → possible missing categorical variables or interactions

[65]:
# Residuals vs Fitted plot for Linear Regression with a LOESS trend line

# compute residuals
residuals = y - y_pred


# Compute LOESS-smoothed residuals
loess_line = lowess_numpy(y_pred, residuals, frac=0.6)

plt.scatter(y_pred, residuals)
plt.plot(y_pred, loess_line, color='red', linewidth=2, label='LOESS smooth')
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Predicted y values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.legend()
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_152_0.png

2 Normal Q–Q Plot

Purpose: Check if residuals are normally distributed.
Good sign: Points lie close to the diagonal line → residuals approximately normal.
Bad signs:
  • S-shaped curve → skewed residuals

  • Heavy tails → outliers or non-normal errors
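
To see what a Q-Q plot actually compares, the following standard-library sketch pairs the sorted residuals with the corresponding quantiles of a standard normal distribution. The plotting positions (i + 0.5) / n, the function name qq_points, and the sample values are assumptions made for illustration; scipy.stats.probplot uses a slightly different choice of positions.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical quantile, ordered residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up residuals, for illustration only
sample = [-1.2, 0.3, 0.8, -0.4, 1.5, -0.1, 0.0]
points = qq_points(sample)

for theo, obs in points:
    print(f"{theo:+.3f}  {obs:+.3f}")
```

If the residuals are approximately normal, these pairs fall close to a straight line, which is the pattern the plot makes visible.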

[66]:
# Normal Q-Q Plot
import scipy.stats as stats

# Generate a Q-Q plot using scipy.stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Normal Q-Q Plot')
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_154_0.png

3 Scale–Location (Spread–Location) Plot

Purpose: Check homoscedasticity (constant variance of residuals).
Good sign: LOESS (Locally Estimated Scatterplot Smoothing) line roughly horizontal, spread consistent.
Bad signs:
  • Upward/downward slope → variance changes with fitted values

Rescue: Transform the response (e.g., log(y)) or use weighted least squares.

[67]:
# Scale−Location Plot

# Compute sqrt of absolute residuals
sqrt_residuals = np.sqrt(np.abs(residuals))

# Compute LOESS smooth for sqrt(|residuals|)
loess_line = lowess_numpy(y_pred, sqrt_residuals, frac=0.6)

# Plot Scale–Location
plt.figure(figsize=(7,5))
plt.scatter(y_pred, sqrt_residuals, facecolors='none', edgecolors='k', label='Data')
plt.plot(y_pred, loess_line, color='red', linewidth=2, label='LOESS smooth')
plt.xlabel('Predicted y values')
plt.ylabel('Sqrt of |Residuals|')
plt.title('Scale–Location Plot with NumPy LOESS Smooth')
plt.legend()
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_156_0.png

4 Residuals vs Leverage Plot (with Cook’s Distance)

Purpose: Identify influential observations.
Good sign: Most points near center with small Cook’s distances.
Bad signs:
  • High leverage + large residuals → influential outliers

  • Cook’s D > 0.5–1 → may unduly affect model coefficients

Rescue: Investigate these points, consider removal, transformation, or special handling.
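
For a regression with one predictor and an intercept, leverage has a simple closed form, h_i = 1/n + (x_i - mean(x))² / Σ(x_j - mean(x))², which makes the quantities in this plot easy to compute by hand. The sketch below uses that formula on a small made-up dataset with one far-away x value; the data and variable names are illustrative assumptions.

```python
# Made-up data: the last point has an x value far from the others
x = [0.0, 1.0, 2.0, 3.0, 10.0]
y = [0.1, 1.9, 4.2, 5.8, 12.0]

n = len(x)
p = 2  # number of coefficients: intercept and slope

x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)

# Ordinary least squares fit with intercept
beta1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
beta0 = y_mean - beta1 * x_mean
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

# Leverage: diagonal of the hat matrix, in closed form for one predictor
leverage = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]

# Residual variance estimate and standardized residuals
s2 = sum(r * r for r in residuals) / (n - p)
std_resid = [r / (s2 * (1.0 - h)) ** 0.5 for r, h in zip(residuals, leverage)]

# Cook's distance combines residual size and leverage
cooks_d = [(t * t * h) / (p * (1.0 - h)) for t, h in zip(std_resid, leverage)]

for xi, h, d in zip(x, leverage, cooks_d):
    print(f"x={xi:5.1f}  leverage={h:.3f}  cooks_d={d:.3f}")
```

The leverages always sum to the number of coefficients, and the isolated point at x = 10 receives by far the largest leverage, which is why such points deserve a closer look.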

[68]:
# Residuals vs Leverage Plot
# Add constant (intercept term)
X_const = np.column_stack((np.ones(X.shape[0]), X))

# Fit OLS using NumPy: beta = (X'X)^(-1) X'y
beta = np.linalg.inv(X_const.T @ X_const) @ X_const.T @ y

# Predictions and residuals
y_pred = X_const @ beta
residuals = y - y_pred

# Compute leverage values (diagonal of hat matrix)
H = X_const @ np.linalg.inv(X_const.T @ X_const) @ X_const.T
leverage = np.diag(H)

# Estimate residual variance
n, p = X_const.shape
s2 = np.sum(residuals**2) / (n - p)

# Standardized residuals
std_resid = residuals / np.sqrt(s2 * (1 - leverage))

# Cook’s distance
cooks_d = (std_resid**2 * leverage) / (p * (1 - leverage))

# Compute LOESS smooth for residuals vs leverage
loess_line = lowess_numpy(leverage, std_resid, frac=0.6)

# --- Plot ---
plt.figure(figsize=(8,6))
sc = plt.scatter(leverage, std_resid, c=cooks_d, cmap='viridis', edgecolors='k', label='Data')
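# Sort by leverage so the smoothed line is drawn from left to right
order = np.argsort(leverage)
plt.plot(leverage[order], loess_line[order], color='red', linewidth=2, label='LOESS smooth')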
plt.xlabel('Leverage')
plt.ylabel('Standardized Residuals')
plt.ylim(-5, 5)
plt.title('Residuals vs Leverage Plot (NumPy + LOESS)')
plt.colorbar(sc, label="Cook's Distance")
plt.axhline(y=0, color='gray', linestyle='--')

# Add Cook's distance contours (approximate reference lines)
# Start slightly above 0 to avoid dividing by zero in the contour formula
x_vals = np.linspace(1e-3, np.max(leverage)*1.1, 100)
for d, color in zip([0.5, 1], ['orange', 'red']):
    y_vals = np.sqrt((d * p * (1 - x_vals)) / x_vals)
    plt.plot(x_vals, y_vals, color=color, linestyle='--', label=f"Cook's D = {d}")
    plt.plot(x_vals, -y_vals, color=color, linestyle='--')

plt.legend()
plt.show()
../../_images/tutorial_notebooks_getting_started_with_jupyter_and_python_getting_started_with_jupyter_and_python_158_1.png

Conclusion

This tutorial has guided you through setting up a Python environment, running Jupyter or Colab notebooks, and using basic notebook features like code cells, Markdown, and timing commands.

Basic Python programming concepts were introduced, along with the most important libraries for data analysis:

  • NumPy for numerical computing and array operations

  • Pandas for data loading, cleaning, and processing

  • Matplotlib and Seaborn for data visualization

With these foundations, you are now ready to experiment, document, and share your Python analyses efficiently.



Star our repository: If you found this tutorial helpful, please ⭐ star our repository to show your support.
Ask questions: For any questions, typos, or bugs, please open an issue on GitHub. We appreciate your feedback!