PyTorch fundamentals

What is PyTorch?
PyTorch is an open source machine learning and deep learning framework.

What can PyTorch be used for?
PyTorch allows you to manipulate and process data and write machine learning algorithms using Python code.

Who uses PyTorch?
Many of the world's largest technology companies such as Meta (Facebook), Tesla and Microsoft, as well as artificial intelligence research companies such as OpenAI, use PyTorch to power research and bring machine learning to their products.

pytorch being used across industry and research

For example, Andrej Karpathy (head of AI at Tesla) has given several talks (PyTorch DevCon 2019, Tesla AI Day 2021) about how Tesla use PyTorch to power their self-driving computer vision models.

PyTorch is also used in other industries such as agriculture to power computer vision on tractors.

Why use PyTorch?
Machine learning researchers love using PyTorch. And as of February 2022, PyTorch is the most used deep learning framework on Papers With Code, a website for tracking machine learning research papers and the code repositories attached to them.

PyTorch also helps take care of many things such as GPU acceleration (making your code run faster) behind the scenes.

So you can focus on manipulating data and writing algorithms and PyTorch will make sure it runs fast.

And if companies such as Tesla and Meta (Facebook) use it to build models they deploy to power hundreds of applications, drive thousands of cars and deliver content to billions of people, it’s clearly capable on the development front too.

What we’re going to cover in this module
This course is broken down into different sections (notebooks).

Each notebook covers important ideas and concepts within PyTorch.

Subsequent notebooks build upon knowledge from the previous one (numbering starts at 00, 01, 02 and goes to whatever it ends up going to).

This notebook deals with the basic building block of machine learning and deep learning, the tensor.

Specifically, we’re going to cover:

Topic Contents
Introduction to tensors Tensors are the basic building block of all of machine learning and deep learning.
Creating tensors Tensors can represent almost any kind of data (images, words, tables of numbers).
Getting information from tensors If you can put information into a tensor, you’ll want to get it out too.
Manipulating tensors Machine learning algorithms (like neural networks) involve manipulating tensors in many different ways such as adding, multiplying, combining.
Dealing with tensor shapes One of the most common issues in machine learning is dealing with shape mismatches (trying to mix wrongly shaped tensors with other tensors).
Indexing on tensors If you’ve indexed on a Python list or NumPy array, it’s very similar with tensors, except they can have far more dimensions.
Mixing PyTorch tensors and NumPy PyTorch plays with tensors (torch.Tensor), NumPy likes arrays (np.ndarray); sometimes you'll want to mix and match these.
Reproducibility Machine learning is very experimental and since it uses a lot of randomness to work, sometimes you’ll want that randomness to not be so random.
Running tensors on GPU GPUs (Graphics Processing Units) make your code faster, PyTorch makes it easy to run your code on GPUs.
Where can you get help?
All of the materials for this course live on GitHub.

And if you run into trouble, you can ask a question on the Discussions page there too.

There’s also the PyTorch developer forums, a very helpful place for all things PyTorch.

Importing PyTorch
Note: Before running any of the code in this notebook, you should have gone through the PyTorch setup steps.

However, if you’re running on Google Colab, everything should work (Google Colab comes with PyTorch and other libraries installed).

Let’s start by importing PyTorch and checking the version we’re using.

import torch
torch.__version__
'1.10.0+cu111'

Wonderful, it looks like we've got PyTorch 1.10.0 (as of December 2021). This means if you're going through these materials, you'll see most compatibility with PyTorch 1.10.0; however, if your version number is far higher than that, you might notice some inconsistencies.

And if you do have any issues, please post on the GitHub Discussions page.

Introduction to tensors
Now we’ve got PyTorch imported, it’s time to learn about tensors.

Tensors are the fundamental building block of machine learning.

Their job is to represent data in a numerical way.

For example, you could represent an image as a tensor with shape [3, 224, 224] which would mean [colour_channels, height, width], as in the image has 3 colour channels (red, green, blue), a height of 224 pixels and a width of 224 pixels.

example of going from an input image to a tensor representation of the image, image gets broken down into 3 colour channels as well as numbers to represent the height and width

In tensor-speak (the language used to describe tensors), the tensor would have three dimensions, one for colour_channels, height and width.

But we’re getting ahead of ourselves.

Let’s learn more about tensors by coding them.

Creating tensors
PyTorch loves tensors. So much so there’s a whole documentation page dedicated to the torch.Tensor class.

Your first piece of homework is to read through the documentation on torch.Tensor for 10-minutes. But you can get to that later.

Let’s code.

The first thing we’re going to create is a scalar.

A scalar is a single number and in tensor-speak it’s a zero dimension tensor.

Note: That's a trend for this course. We'll focus on writing specific code. But often I'll set exercises which involve reading and getting familiar with the PyTorch documentation. Because after all, once you're finished with this course, you'll no doubt want to learn more. And the documentation is somewhere you'll find yourself quite often.

# Scalar
scalar = torch.tensor(7)
scalar
tensor(7)
See how the above printed out tensor(7)?

That means although scalar is a single number, it’s of type torch.Tensor.

We can check the dimensions of a tensor using the ndim attribute.

scalar.ndim
0

What if we wanted to retrieve the number from the tensor?
As in, turn it from torch.Tensor to a Python integer?
To do so we can use the item() method.

# Get the Python number within a tensor (only works with one-element tensors)
scalar.item()
7

A vector is a single dimension tensor but can contain many numbers.

As in, you could have a vector [3, 2] to describe [bedrooms, bathrooms] in your house. Or you could have [3, 2, 2] to describe [bedrooms, bathrooms, car_parks] in your house.

The important trend here is that a vector is flexible in what it can represent (the same with tensors).

# Vector
vector = torch.tensor([7, 7])
vector
tensor([7, 7])
# Check the number of dimensions of vector
vector.ndim
1
# Check shape of vector
vector.shape
torch.Size([2])
# Matrix
MATRIX = torch.tensor([[7, 8],
[9, 10]])
MATRIX
tensor([[ 7, 8],
[ 9, 10]])
# Check number of dimensions
MATRIX.ndim
2
MATRIX.shape
torch.Size([2, 2])
#We get the output torch.Size([2, 2]) because MATRIX is two elements deep and two elements wide.
# Tensor
TENSOR = torch.tensor([[[1, 2, 3],
[3, 6, 9],
[2, 4, 5]]])
TENSOR
tensor([[[1, 2, 3],
[3, 6, 9],
[2, 4, 5]]])
# Check number of dimensions for TENSOR
TENSOR.ndim
3
# Check shape of TENSOR
TENSOR.shape
torch.Size([1, 3, 3])
#Alright, it outputs torch.Size([1, 3, 3]).

The dimensions go outer to inner.

That means there’s 1 dimension of 3 by 3.

example of different tensor dimensions

Note: You might’ve noticed me using lowercase letters for scalar and vector and uppercase letters for MATRIX and TENSOR. This was on purpose. In practice, you’ll often see scalars and vectors denoted as lowercase letters such as y or a. And matrices and tensors denoted as uppercase letters such as X or W.

You also might notice the names matrix and tensor used interchangeably. This is common, since in PyTorch you're often dealing with torch.Tensors (hence the tensor name); however, the shape and dimensions of what's inside will dictate what it actually is.

Let’s summarise.

Name | What is it? | Number of dimensions | Lower or upper (usually/example)
scalar | a single number | 0 | Lower (a)
vector | a number with direction (e.g. wind speed with direction) but can also have many other numbers | 1 | Lower (y)
matrix | a 2-dimensional array of numbers | 2 | Upper (Q)
tensor | an n-dimensional array of numbers | can be any number, a 0-dimension tensor is a scalar, a 1-dimension tensor is a vector | Upper (X)
scalar vector matrix tensor and what they look like

Random tensors
We’ve established tensors represent some form of data.

And machine learning models such as neural networks manipulate and seek patterns within tensors.

But when building machine learning models with PyTorch, it's rare you'll create tensors by hand (like what we've been doing).

Instead, a machine learning model often starts out with large random tensors of numbers and adjusts these random numbers as it works through data to better represent it.

In essence:

Start with random numbers -> look at data -> update random numbers -> look at data -> update random numbers…

As a data scientist, you can define how the machine learning model starts (initialization), looks at data (representation) and updates (optimization) its random numbers.

We’ll get hands on with these steps later on.

For now, let’s see how to create a tensor of random numbers.

We can do so using torch.rand() and passing in the size parameter.

# Create a random tensor of size (3, 4)
random_tensor = torch.rand(size=(3, 4))
random_tensor, random_tensor.dtype
(tensor([[0.4090, 0.2527, 0.8699, 0.2002],
[0.8421, 0.1428, 0.1431, 0.0111],
[0.2281, 0.0345, 0.6734, 0.3866]]), torch.float32)
#The flexibility of torch.rand() is that we can adjust the size to be whatever we want.
# Create a random tensor of size (224, 224, 3)
random_image_size_tensor = torch.rand(size=(224, 224, 3))
random_image_size_tensor.shape, random_image_size_tensor.ndim
(torch.Size([224, 224, 3]), 3)
#Zeros and ones
# Create a tensor of all zeros
zeros = torch.zeros(size=(3, 4))
zeros, zeros.dtype
(tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]]), torch.float32)
# Create a tensor of all ones
ones = torch.ones(size=(3, 4))
ones, ones.dtype
(tensor([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]]), torch.float32)
#Sometimes you might want a range of numbers, such as 1 to 10 or 0 to 100.
#You can use torch.arange(start, end, step) to do so.
# Use torch.arange(), torch.range() is deprecated
zero_to_ten_deprecated = torch.range(0, 10) # Note: this may return an error in the future

# Create a range of values 0 to 10
zero_to_ten = torch.arange(start=0, end=10, step=1)
zero_to_ten
#tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


Sometimes you might want one tensor of a certain type with the same shape as another tensor.
For example, a tensor of all zeros with the same shape as a previous tensor.
To do so you can use torch.zeros_like(input) or torch.ones_like(input) which return a tensor filled with zeros or ones in the same shape as the input respectively.

# Can also create a tensor of zeros similar to another tensor
ten_zeros = torch.zeros_like(input=zero_to_ten) # will have same shape
ten_zeros
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


Tensor datatypes
There are many different tensor datatypes available in PyTorch.

Some are specific for CPU and some are better for GPU.

Getting to know which is which can take some time.

Generally if you see torch.cuda anywhere, the tensor is being used for GPU (since Nvidia GPUs use a computing toolkit called CUDA).

The most common type (and generally the default) is torch.float32 or torch.float.

This is referred to as “32-bit floating point”.

But there’s also 16-bit floating point (torch.float16 or torch.half) and 64-bit floating point (torch.float64 or torch.double).

And to confuse things even more there’s also 8-bit, 16-bit, 32-bit and 64-bit integers.

Plus more!

Note: An integer is a whole number like 7, whereas a float has a decimal point, like 7.0.

The reason for all of these is to do with precision in computing.

Precision is the amount of detail used to describe a number.

The higher the precision value (8, 16, 32), the more detail and hence data used to express a number.

This matters in deep learning and numerical computing because you're performing so many operations: the more detail you have to calculate on, the more compute you have to use.

So lower precision datatypes are generally faster to compute on but sacrifice some performance on evaluation metrics like accuracy (faster to compute but less accurate).
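For a quick sense of what those bit counts mean in practice, here's a small sketch using torch.finfo(), torch.iinfo() and Tensor.element_size() (all part of PyTorch's public API) to inspect how much storage each datatype uses per element:

import torch

# Bits used per element for a few common datatypes
print(torch.finfo(torch.float32).bits) # 32
print(torch.finfo(torch.float16).bits) # 16
print(torch.iinfo(torch.int8).bits)    # 8

# element_size() gives the storage per element in bytes
print(torch.tensor([1.0], dtype=torch.float32).element_size()) # 4
print(torch.tensor([1.0], dtype=torch.float16).element_size()) # 2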

Resources:

See the PyTorch documentation for a list of all available tensor datatypes.
Read the Wikipedia page for an overview of what precision in computing is.
Let’s see how to create some tensors with specific datatypes. We can do so using the dtype parameter.

# Default datatype for tensors is float32
float_32_tensor = torch.tensor([3.0, 6.0, 9.0],
dtype=None, # defaults to None, which is torch.float32 or whatever datatype is passed
device=None, # defaults to None, which uses the default tensor type
requires_grad=False) # if True, operations performed on the tensor are recorded
float_32_tensor.shape, float_32_tensor.dtype, float_32_tensor.device
(torch.Size([3]), torch.float32, device(type='cpu'))

Aside from shape issues (tensor shapes don’t match up), two of the other most common issues you’ll come across in PyTorch are datatype and device issues.

For example, one of your tensors is torch.float32 and the other is torch.float16 (PyTorch often likes tensors to be in the same format).

Or one of your tensors is on the CPU and the other is on the GPU (PyTorch likes calculations between tensors to be on the same device).

We’ll see more of this device talk later on.

For now let’s create a tensor with dtype=torch.float16.

float_16_tensor = torch.tensor([3.0, 6.0, 9.0],
dtype=torch.float16) # torch.half would also work
float_16_tensor.dtype
torch.float16
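To see the datatype interplay mentioned above, here's a short, hedged sketch: in recent PyTorch versions, multiplying the float16 tensor by the float32 tensor promotes the result to float32, though some operations (and older versions) may raise a dtype error instead, so converting explicitly is the safer habit.

# Mixing dtypes: type promotion may handle it, but explicit conversion is clearer
mixed = float_16_tensor * float_32_tensor
print(mixed.dtype) # torch.float32 (via type promotion in recent PyTorch versions)

# Convert explicitly to avoid relying on promotion rules
print((float_16_tensor.type(torch.float32) * float_32_tensor).dtype) # torch.float32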


Getting information from tensors
Once you’ve created tensors (or someone else or a PyTorch module has created them for you), you might want to get some information from them.

We’ve seen these before but three of the most common attributes you’ll want to find out about tensors are:

shape – what shape is the tensor? (some operations require specific shape rules)
dtype – what datatype are the elements within the tensor stored in?
device – what device is the tensor stored on? (usually GPU or CPU)
Let’s create a random tensor and find out details about it.

# Create a tensor
some_tensor = torch.rand(3, 4)

# Find out details about it
print(some_tensor)
print(f"Shape of tensor: {some_tensor.shape}")
print(f"Datatype of tensor: {some_tensor.dtype}")
print(f"Device tensor is stored on: {some_tensor.device}") # will default to CPU
tensor([[0.7799, 0.8140, 0.0893, 0.2062],
[0.7525, 0.3845, 0.8207, 0.4587],
[0.9277, 0.8166, 0.9052, 0.0953]])
#Shape of tensor: torch.Size([3, 4])
#Datatype of tensor: torch.float32
#Device tensor is stored on: cpu


Note: When you run into issues in PyTorch, it’s very often one to do with one of the three attributes above. So when the error messages show up, sing yourself a little song called “what, what, where”:

“what shape are my tensors? what datatype are they and where are they stored? what shape, what datatype, where where where”
Manipulating tensors (tensor operations)
In deep learning, data (images, text, video, audio, protein structures, etc) gets represented as tensors.

A model learns by investigating those tensors and performing a series of operations (could be 1,000,000s+) on tensors to create a representation of the patterns in the input data.

These operations are often a wonderful dance between:

Addition
Subtraction
Multiplication (element-wise)
Division
Matrix multiplication
And that’s it. Sure there are a few more here and there but these are the basic building blocks of neural networks.

Stacking these building blocks in the right way, you can create the most sophisticated of neural networks (just like lego!).

Basic operations
Let's start with a few of the fundamental operations: addition (+), subtraction (-), multiplication (*).

They work just as you think they would.

# Create a tensor of values and add a number to it
tensor = torch.tensor([1, 2, 3])
tensor + 10
tensor([11, 12, 13])
# Multiply it by 10
tensor * 10
tensor([10, 20, 30])

Notice how the tensor values above didn’t end up being tensor([110, 120, 130]), this is because the values inside the tensor don’t change unless they’re reassigned.

# Tensors don't change unless reassigned
tensor
tensor([1, 2, 3])
Let's subtract a number and this time we'll reassign the tensor variable.
# Subtract and reassign
tensor = tensor - 10
tensor
tensor([-9, -8, -7])
# Add and reassign
tensor = tensor + 10
tensor
tensor([1, 2, 3])
# PyTorch also has a bunch of built-in functions like torch.mul() (short for multiplication) and torch.add() to perform basic operations.
# Can also use torch functions
torch.multiply(tensor, 10)
tensor([10, 20, 30])
# Original tensor is still unchanged
tensor
tensor([1, 2, 3])


However, it’s more common to use the operator symbols like * instead of torch.mul()

# Element-wise multiplication (each element multiplies its equivalent, index 0->0, 1->1, 2->2)
print(tensor, "*", tensor)
print("Equals:", tensor * tensor)
tensor([1, 2, 3]) * tensor([1, 2, 3])
#Equals: tensor([1, 4, 9])


Matrix multiplication (is all you need)
One of the most common operations in machine learning and deep learning algorithms (like neural networks) is matrix multiplication.

PyTorch implements matrix multiplication functionality in the torch.matmul() method.

The main two rules for matrix multiplication to remember are:

The inner dimensions must match:
(3, 2) @ (3, 2) won’t work
(2, 3) @ (3, 2) will work
(3, 2) @ (2, 3) will work
The resulting matrix has the shape of the outer dimensions:
(2, 3) @ (3, 2) -> (2, 2)
(3, 2) @ (2, 3) -> (3, 3)
Note: “@” in Python is the symbol for matrix multiplication.

Resource: You can see all of the rules for matrix multiplication using torch.matmul() in the PyTorch documentation.

Let’s create a tensor and perform element-wise multiplication and matrix multiplication on it.

import torch
tensor = torch.tensor([1, 2, 3])
tensor.shape
torch.Size([3])

The difference between element-wise multiplication and matrix multiplication is the addition of values.
For our tensor variable with values [1, 2, 3]:

Operation Calculation Code
Element-wise multiplication [1*1, 2*2, 3*3] = [1, 4, 9] tensor * tensor
Matrix multiplication [1*1 + 2*2 + 3*3] = [14] tensor.matmul(tensor)

# Element-wise matrix multiplication
tensor * tensor
tensor([1, 4, 9])
# Matrix multiplication
torch.matmul(tensor, tensor)
tensor(14)
# Can also use the "@" symbol for matrix multiplication, though not recommended
tensor @ tensor
tensor(14)

You can do matrix multiplication by hand but it’s not recommended.

The in-built torch.matmul() method is faster.

%%time
# Matrix multiplication by hand
# (avoid doing operations with for loops at all cost, they are computationally expensive)
value = 0
for i in range(len(tensor)):
    value += tensor[i] * tensor[i]
value
CPU times: user 146 µs, sys: 38 µs, total: 184 µs
Wall time: 227 µs
%%time
torch.matmul(tensor, tensor)
CPU times: user 27 µs, sys: 7 µs, total: 34 µs
Wall time: 36.7 µs
tensor(14)


One of the most common errors in deep learning (shape errors)
Because much of deep learning is multiplying and performing operations on matrices and matrices have a strict rule about what shapes and sizes can be combined, one of the most common errors you’ll run into in deep learning is shape mismatches.

# Shapes need to be in the right way
tensor_A = torch.tensor([[1, 2],
[3, 4],
[5, 6]], dtype=torch.float32)
tensor_B = torch.tensor([[7, 10],
[8, 11],
[9, 12]], dtype=torch.float32)
torch.matmul(tensor_A, tensor_B) # (this will error)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (3×2 and 3×2)
We can make matrix multiplication work between tensor_A and tensor_B by making their inner dimensions match.

One of the ways to do this is with a transpose (switch the dimensions of a given tensor).

You can perform transposes in PyTorch using either:

torch.transpose(input, dim0, dim1) – where input is the desired tensor to transpose and dim0 and dim1 are the dimensions to be swapped.
tensor.T – where tensor is the desired tensor to transpose.
Let’s try the latter.

# View tensor_A and tensor_B
print(tensor_A)
print(tensor_B)
tensor([[1., 2.],
[3., 4.],
[5., 6.]])
tensor([[ 7., 10.],
[ 8., 11.],
[ 9., 12.]])
# View tensor_A and tensor_B.T
print(tensor_A)
print(tensor_B.T)
tensor([[1., 2.],
[3., 4.],
[5., 6.]])
tensor([[ 7., 8., 9.],
[10., 11., 12.]])
# The operation works when tensor_B is transposed
print(f"Original shapes: tensor_A = {tensor_A.shape}, tensor_B = {tensor_B.shape}\n")
print(f"New shapes: tensor_A = {tensor_A.shape} (same as above), tensor_B.T = {tensor_B.T.shape}\n")
print(f"Multiplying: {tensor_A.shape} * {tensor_B.T.shape} <- inner dimensions match\n")
print("Output:\n")
output = torch.matmul(tensor_A, tensor_B.T)
print(output)
print(f"\nOutput shape: {output.shape}")
Original shapes: tensor_A = torch.Size([3, 2]), tensor_B = torch.Size([3, 2])

New shapes: tensor_A = torch.Size([3, 2]) (same as above), tensor_B.T = torch.Size([2, 3])

Multiplying: torch.Size([3, 2]) * torch.Size([2, 3]) <- inner dimensions match

Output:

tensor([[ 27.,  30.,  33.],
[ 61.,  68.,  75.],
[ 95., 106., 117.]])

Output shape: torch.Size([3, 3])

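Neural network layers like torch.nn.Linear() implement this same matrix multiplication: the layer computes y = x·Aᵀ + b, where A is a weights matrix and b is a bias vector. Below is a minimal sketch of such a layer applied to tensor_A; the in_features=2 and out_features=6 values are chosen here to line up with the question that follows (in_features must match the inner dimension of the input, out_features sets the output dimension).

# A torch.nn.Linear() layer performs y = x @ A.T + b
torch.manual_seed(42) # seed so the (random) layer weights are reproducible
linear = torch.nn.Linear(in_features=2, # must match the inner dimension of the input (tensor_A is [3, 2])
                         out_features=6) # sets the output feature dimension
x = tensor_A
output = linear(x)
print(f"Input shape: {x.shape}\n")
print(f"Output:\n{output}\n\nOutput shape: {output.shape}")

The exact output values depend on the randomly initialised weights, but because (3, 2) @ (2, 6) satisfies the inner dimension rule, the resulting shape is: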
Output shape: torch.Size([3, 6])
Question: What happens if you change in_features from 2 to 3 above? Does it error? How could you change the shape of the input (x) to accommodate the error? Hint: what did we have to do to tensor_B above?

If you’ve never done it before, matrix multiplication can be a confusing topic at first.

But after you’ve played around with it a few times and even cracked open a few neural networks, you’ll notice it’s everywhere.

Remember, matrix multiplication is all you need.

matrix multiplication is all you need

When you start digging into neural network layers and building your own, you’ll find matrix multiplications everywhere. Source: https://marksaroufim.substack.com/p/working-class-deep-learner

Finding the min, max, mean, sum, etc (aggregation)
Now we’ve seen a few ways to manipulate tensors, let’s run through a few ways to aggregate them (go from more values to less values).

First we’ll create a tensor and then find the max, min, mean and sum of it.

# Create a tensor
x = torch.arange(0, 100, 10)
x
tensor([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
Now let’s perform some aggregation.

print(f"Minimum: {x.min()}")
print(f"Maximum: {x.max()}")
# print(f"Mean: {x.mean()}") # this will error
print(f"Mean: {x.type(torch.float32).mean()}") # won't work without float datatype
print(f"Sum: {x.sum()}")
Minimum: 0
Maximum: 90
Mean: 45.0
Sum: 450
Note: You may find some methods such as torch.mean() require tensors to be in torch.float32 (the most common) or another specific datatype, otherwise the operation will fail.

You can also do the same as above with torch methods.

torch.max(x), torch.min(x), torch.mean(x.type(torch.float32)), torch.sum(x)
(tensor(90), tensor(0), tensor(45.), tensor(450))
Positional min/max
You can also find the index of a tensor where the max or minimum occurs with torch.argmax() and torch.argmin() respectively.

This is helpful in case you just want the position where the highest (or lowest) value is and not the actual value itself (we'll see this in a later section when using the softmax activation function).

# Create a tensor
tensor = torch.arange(10, 100, 10)
print(f"Tensor: {tensor}")

# Returns index of max and min values
print(f"Index where max value occurs: {tensor.argmax()}")
print(f"Index where min value occurs: {tensor.argmin()}")
Tensor: tensor([10, 20, 30, 40, 50, 60, 70, 80, 90])
Index where max value occurs: 8
Index where min value occurs: 0
Change tensor datatype
As mentioned, a common issue with deep learning operations is having your tensors in different datatypes.

If one tensor is in torch.float64 and another is in torch.float32, you might run into some errors.

But there’s a fix.

You can change the datatypes of tensors using torch.Tensor.type(dtype=None) where the dtype parameter is the datatype you’d like to use.

First we'll create a tensor and check its datatype (the default is torch.float32).

# Create a tensor and check its datatype
tensor = torch.arange(10., 100., 10.)
tensor.dtype
torch.float32
Now we’ll create another tensor the same as before but change its datatype to torch.float16.

# Create a float16 tensor
tensor_float16 = tensor.type(torch.float16)
tensor_float16
tensor([10., 20., 30., 40., 50., 60., 70., 80., 90.], dtype=torch.float16)
And we can do something similar to make a torch.int8 tensor.

# Create a int8 tensor
tensor_int8 = tensor.type(torch.int8)
tensor_int8
tensor([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=torch.int8)
Note: Different datatypes can be confusing to begin with. But think of it like this, the lower the number (e.g. 32, 16, 8), the less precisely a computer stores the value. And with a lower amount of storage, this generally results in faster computation and a smaller overall model. Mobile-based neural networks often operate with 8-bit integers, smaller and faster to run but less accurate than their float32 counterparts. For more on this, I'd read up about precision in computing.

Exercise: So far we’ve covered a fair few tensor methods but there’s a bunch more in the torch.Tensor documentation, I’d recommend spending 10-minutes scrolling through and looking into any that catch your eye. Click on them and then write them out in code yourself to see what happens.

Reshaping, stacking, squeezing and unsqueezing
Often times you’ll want to reshape or change the dimensions of your tensors without actually changing the values inside them.

To do so, some popular methods are:

Method One-line description
torch.reshape(input, shape) Reshapes input to shape (if compatible), can also use torch.Tensor.reshape().
torch.Tensor.view(shape) Returns a view of the original tensor in a different shape but shares the same data as the original tensor.
torch.stack(tensors, dim=0) Concatenates a sequence of tensors along a new dimension (dim), all tensors must be same size.
torch.squeeze(input) Squeezes input to remove all the dimensions with value 1.
torch.unsqueeze(input, dim) Returns input with a dimension value of 1 added at dim.
torch.permute(input, dims) Returns a view of the original input with its dimensions permuted (rearranged) to dims.
Why do any of these?

Because deep learning models (neural networks) are all about manipulating tensors in some way. And because of the rules of matrix multiplication, if you've got shape mismatches, you'll run into errors. These methods help you make sure the right elements of your tensors are mixing with the right elements of other tensors.

Let’s try them out.

First, we’ll create a tensor.

# Create a tensor
import torch
x = torch.arange(1., 8.)
x, x.shape
(tensor([1., 2., 3., 4., 5., 6., 7.]), torch.Size([7]))
Now let’s add an extra dimension with torch.reshape().

# Add an extra dimension
x_reshaped = x.reshape(1, 7)
x_reshaped, x_reshaped.shape
(tensor([[1., 2., 3., 4., 5., 6., 7.]]), torch.Size([1, 7]))
We can also change the view with torch.view().

# Change view (keeps same data as original but changes view)
# See more: https://stackoverflow.com/a/54507446/7900723
z = x.view(1, 7)
z, z.shape
(tensor([[1., 2., 3., 4., 5., 6., 7.]]), torch.Size([1, 7]))
Remember though, changing the view of a tensor with torch.view() really only creates a new view of the same tensor.

So changing the view changes the original tensor too.

# Changing z changes x
z[:, 0] = 5
z, x
(tensor([[5., 2., 3., 4., 5., 6., 7.]]), tensor([5., 2., 3., 4., 5., 6., 7.]))
If we wanted to stack our new tensor on top of itself four times, we could do so with torch.stack().

# Stack tensors on top of each other
x_stacked = torch.stack([x, x, x, x], dim=0) # try changing dim to dim=1 and see what happens
x_stacked
tensor([[5., 2., 3., 4., 5., 6., 7.],
[5., 2., 3., 4., 5., 6., 7.],
[5., 2., 3., 4., 5., 6., 7.],
[5., 2., 3., 4., 5., 6., 7.]])
How about removing all single dimensions from a tensor?

To do so you can use torch.squeeze() (I remember this as squeezing the tensor to only have dimensions over 1).

print(f"Previous tensor: {x_reshaped}")
print(f"Previous shape: {x_reshaped.shape}")

# Remove extra dimension from x_reshaped
x_squeezed = x_reshaped.squeeze()
print(f"\nNew tensor: {x_squeezed}")
print(f"New shape: {x_squeezed.shape}")
Previous tensor: tensor([[5., 2., 3., 4., 5., 6., 7.]])
Previous shape: torch.Size([1, 7])

New tensor: tensor([5., 2., 3., 4., 5., 6., 7.])
New shape: torch.Size([7])
And to do the reverse of torch.squeeze() you can use torch.unsqueeze() to add a dimension value of 1 at a specific index.

print(f"Previous tensor: {x_squeezed}")
print(f"Previous shape: {x_squeezed.shape}")

# Add an extra dimension with unsqueeze
x_unsqueezed = x_squeezed.unsqueeze(dim=0)
print(f"\nNew tensor: {x_unsqueezed}")
print(f"New shape: {x_unsqueezed.shape}")
Previous tensor: tensor([5., 2., 3., 4., 5., 6., 7.])
Previous shape: torch.Size([7])

New tensor: tensor([[5., 2., 3., 4., 5., 6., 7.]])
New shape: torch.Size([1, 7])
You can also rearrange the order of axes values with torch.permute(input, dims), where the input gets turned into a view with new dims.

# Create tensor with specific shape
x_original = torch.rand(size=(224, 224, 3))

# Permute the original tensor to rearrange the axis order
x_permuted = x_original.permute(2, 0, 1) # shifts axis 0->1, 1->2, 2->0

print(f"Previous shape: {x_original.shape}")
print(f"New shape: {x_permuted.shape}")
Previous shape: torch.Size([224, 224, 3])
New shape: torch.Size([3, 224, 224])
Note: Because permuting returns a view (shares the same data as the original), the values in the permuted tensor will be the same as the original tensor and if you change the values in the view, it will change the values of the original.
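As a quick check of that note, here's a small sketch showing that the permuted tensor shares memory with the original (the value 728218 is just an arbitrary marker):

# Changing a value in the original also changes the permuted view (they share memory)
x_original[0, 0, 0] = 728218
x_original[0, 0, 0], x_permuted[0, 0, 0] # both show tensor(728218.)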

Indexing (selecting data from tensors)
Sometimes you’ll want to select specific data from tensors (for example, only the first column or second row).

To do so, you can use indexing.

If you’ve ever done indexing on Python lists or NumPy arrays, indexing in PyTorch with tensors is very similar.

# Create a tensor
import torch
x = torch.arange(1, 10).reshape(1, 3, 3)
x, x.shape
(tensor([[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]]), torch.Size([1, 3, 3]))
Indexing values goes outer dimension -> inner dimension (check out the square brackets).

# Let's index bracket by bracket
print(f"First square bracket:\n{x[0]}")
print(f"Second square bracket: {x[0][0]}")
print(f"Third square bracket: {x[0][0][0]}")
First square bracket:
tensor([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Second square bracket: tensor([1, 2, 3])
Third square bracket: 1
You can also use : to specify “all values in this dimension” and then use a comma (,) to add another dimension.

# Get all values of 0th dimension and the 0 index of 1st dimension
x[:, 0]
tensor([[1, 2, 3]])
# Get all values of 0th & 1st dimensions but only index 1 of 2nd dimension
x[:, :, 1]
tensor([[2, 5, 8]])
# Get all values of the 0 dimension but only the 1 index value of the 1st and 2nd dimension
x[:, 1, 1]
tensor([5])
# Get index 0 of 0th and 1st dimension and all values of 2nd dimension
x[0, 0, :] # same as x[0][0]
tensor([1, 2, 3])
Indexing can be quite confusing to begin with, especially with larger tensors (I still have to try indexing multiple times to get it right). But with a bit of practice and following the data explorer’s motto (visualize, visualize, visualize), you’ll start to get the hang of it.

PyTorch tensors & NumPy
Since NumPy is a popular Python numerical computing library, PyTorch has functionality to interact with it nicely.

The two main methods you’ll want to use for NumPy to PyTorch (and back again) are:

torch.from_numpy(ndarray) – NumPy array -> PyTorch tensor.
torch.Tensor.numpy() – PyTorch tensor -> NumPy array.
Let’s try them out.

# NumPy array to tensor
import torch
import numpy as np
array = np.arange(1.0, 8.0)
tensor = torch.from_numpy(array)
array, tensor
(array([1., 2., 3., 4., 5., 6., 7.]),
tensor([1., 2., 3., 4., 5., 6., 7.], dtype=torch.float64))
Note: By default, NumPy arrays are created with the datatype float64 and if you convert it to a PyTorch tensor, it’ll keep the same datatype (as above).

However, many PyTorch calculations default to using float32.

So if you want to convert your NumPy array (float64) -> PyTorch tensor (float64) -> PyTorch tensor (float32), you can use tensor = torch.from_numpy(array).type(torch.float32).

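As a small sketch of that conversion (using the array created above):

# Convert to PyTorch's default float32 at the same time as converting from NumPy
tensor_float32 = torch.from_numpy(array).type(torch.float32)
tensor_float32.dtype # torch.float32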
Because array = array + 1 below creates a new array object (rather than modifying the original in place), changing the array this way leaves the tensor unchanged.

# Change the array, keep the tensor
array = array + 1
array, tensor
(array([2., 3., 4., 5., 6., 7., 8.]),
tensor([1., 2., 3., 4., 5., 6., 7.], dtype=torch.float64))
And if you want to go from PyTorch tensor to NumPy array, you can call tensor.numpy().

# Tensor to NumPy array
tensor = torch.ones(7) # create a tensor of ones with dtype=float32
numpy_tensor = tensor.numpy() # will be dtype=float32 unless changed
tensor, numpy_tensor
(tensor([1., 1., 1., 1., 1., 1., 1.]),
array([1., 1., 1., 1., 1., 1., 1.], dtype=float32))
And the same rule applies as above: because tensor = tensor + 1 creates a new tensor, changing the original tensor this way leaves numpy_tensor unchanged.

# Change the tensor, keep the array the same
tensor = tensor + 1
tensor, numpy_tensor
(tensor([2., 2., 2., 2., 2., 2., 2.]),
array([1., 1., 1., 1., 1., 1., 1.], dtype=float32))
Reproducibility (trying to take the random out of random)
As you learn more about neural networks and machine learning, you’ll start to discover how much randomness plays a part.

Well, pseudorandomness that is. Because after all, as they're designed, a computer is fundamentally deterministic (each step is predictable), so the randomness they create is simulated randomness (though there is debate on this too, but since I'm not a computer scientist, I'll let you find out more yourself).

How does this relate to neural networks and deep learning then?

We've discussed how neural networks start with random numbers to describe patterns in data (these numbers are poor descriptions) and try to improve those random numbers using tensor operations (and a few other things we haven't discussed yet) to better describe patterns in data.

In short:

start with random numbers -> tensor operations -> try to make better (again and again and again)

Although randomness is nice and powerful, sometimes you’d like there to be a little less randomness.

Why?

So you can perform repeatable experiments.

For example, you create an algorithm capable of achieving X performance.

And then your friend tries it out to verify you’re not crazy.

How could they do such a thing?

That’s where reproducibility comes in.

In other words, can you get the same (or very similar) results on your computer running the same code as I get on mine?

Let’s see a brief example of reproducibility in PyTorch.

We’ll start by creating two random tensors, since they’re random, you’d expect them to be different right?

import torch

# Create two random tensors
random_tensor_A = torch.rand(3, 4)
random_tensor_B = torch.rand(3, 4)

print(f"Tensor A:\n{random_tensor_A}\n")
print(f"Tensor B:\n{random_tensor_B}\n")
print(f"Does Tensor A equal Tensor B? (anywhere)")
random_tensor_A == random_tensor_B
Tensor A:
tensor([[0.8016, 0.3649, 0.6286, 0.9663],
[0.7687, 0.4566, 0.5745, 0.9200],
[0.3230, 0.8613, 0.0919, 0.3102]])

Tensor B:
tensor([[0.9536, 0.6002, 0.0351, 0.6826],
[0.3743, 0.5220, 0.1336, 0.9666],
[0.9754, 0.8474, 0.8988, 0.1105]])

Does Tensor A equal Tensor B? (anywhere)
tensor([[False, False, False, False],
[False, False, False, False],
[False, False, False, False]])
Just as you might’ve expected, the tensors come out with different values.

But what if you wanted to create two random tensors with the same values?

As in, the tensors would still contain random values but they would be of the same flavour.

That’s where torch.manual_seed(seed) comes in, where seed is an integer (like 42 but it could be anything) that flavours the randomness.

Let’s try it out by creating some more flavoured random tensors.

import torch
import random

# # Set the random seed
RANDOM_SEED=42 # try changing this to different values and see what happens to the numbers below
torch.manual_seed(seed=RANDOM_SEED)
random_tensor_C = torch.rand(3, 4)

# Have to reset the seed every time a new rand() is called
# Without this, tensor_D would be different to tensor_C
torch.random.manual_seed(seed=RANDOM_SEED) # try commenting this line out and seeing what happens
random_tensor_D = torch.rand(3, 4)

print(f"Tensor C:\n{random_tensor_C}\n")
print(f"Tensor D:\n{random_tensor_D}\n")
print(f"Does Tensor C equal Tensor D? (anywhere)")
random_tensor_C == random_tensor_D
Tensor C:
tensor([[0.8823, 0.9150, 0.3829, 0.9593],
[0.3904, 0.6009, 0.2566, 0.7936],
[0.9408, 0.1332, 0.9346, 0.5936]])

Tensor D:
tensor([[0.8823, 0.9150, 0.3829, 0.9593],
[0.3904, 0.6009, 0.2566, 0.7936],
[0.9408, 0.1332, 0.9346, 0.5936]])

Does Tensor C equal Tensor D? (anywhere)
tensor([[True, True, True, True],
[True, True, True, True],
[True, True, True, True]])
Nice!

It looks like setting the seed worked.

Resource: What we've just covered only scratches the surface of reproducibility in PyTorch. For more on reproducibility in general and random seeds, I'd check out:

The PyTorch reproducibility documentation (a good exercise would be to read through this for 10-minutes and even if you don't understand it now, being aware of it is important).
The Wikipedia random seed page (this’ll give a good overview of random seeds and pseudorandomness in general).
Running tensors on GPUs (and making faster computations)
Deep learning algorithms require a lot of numerical operations.

And by default these operations are often done on a CPU (central processing unit).

However, there’s another common piece of hardware called a GPU (graphics processing unit), which is often much faster at performing the specific types of operations neural networks need (matrix multiplications) than CPUs.

Your computer might have one.

If so, you should look to use it whenever you can to train neural networks because chances are it’ll speed up the training time dramatically.

There are a few ways to first get access to a GPU and secondly get PyTorch to use the GPU.

Note: When I reference "GPU" throughout this course, I'm referencing a Nvidia GPU with CUDA enabled (CUDA is a computing platform and API that allows GPUs to be used for general purpose computing and not just graphics) unless otherwise specified.

1. Getting a GPU
You may already know what’s going on when I say GPU. But if not, there are a few ways to get access to one.

Method Difficulty to setup Pros Cons How to setup
Google Colab Easy Free to use, almost zero setup required, can share work with others as easy as a link Doesn’t save your data outputs, limited compute, subject to timeouts Follow the Google Colab Guide
Use your own Medium Run everything locally on your own machine GPUs aren’t free, require upfront cost Follow the PyTorch installation guidelines
Cloud computing (AWS, GCP, Azure) Medium-Hard Small upfront cost, access to almost infinite compute Can get expensive if running continually, takes some time to setup right Follow the PyTorch installation guidelines
There are more options for using GPUs but the above three will suffice for now.

Personally, I use a combination of Google Colab and my own personal computer for small scale experiments (and creating this course) and go to cloud resources when I need more compute power.

Resource: If you’re looking to purchase a GPU of your own but not sure what to get, Tim Dettmers has an excellent guide.

To check if you’ve got access to a Nvidia GPU, you can run !nvidia-smi where the ! (also called bang) means “run this on the command line”.

!nvidia-smi
Thu Feb 10 02:09:18 2022
+—————————————————————————–+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|——————————-+———————-+———————-+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE… Off | 00000000:00:04.0 Off | 0 |
| N/A 36C P0 28W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+——————————-+———————-+———————-+

+—————————————————————————–+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+—————————————————————————–+
If you don’t have a Nvidia GPU accessible, the above will output something like:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
In that case, go back up and follow the install steps.

If you do have a GPU, the line above will output something like:

Wed Jan 19 22:09:08 2022
+—————————————————————————–+
| NVIDIA-SMI 495.46 Driver Version: 460.32.03 CUDA Version: 11.2 |
|——————————-+———————-+———————-+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE… Off | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+——————————-+———————-+———————-+

+—————————————————————————–+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+—————————————————————————–+
2. Getting PyTorch to run on the GPU
Once you've got a GPU ready to access, the next step is getting PyTorch to use it for storing data (tensors) and computing on data (performing operations on tensors).

To do so, you can use the torch.cuda package.

Rather than talk about it, let’s try it out.

You can test if PyTorch has access to a GPU using torch.cuda.is_available().

# Check for GPU
import torch
torch.cuda.is_available()
True
If the above outputs True, PyTorch can see and use the GPU, if it outputs False, it can’t see the GPU and in that case, you’ll have to go back through the installation steps.

Now, let’s say you wanted to setup your code so it ran on CPU or the GPU if it was available.

That way, if you or someone decides to run your code, it’ll work regardless of the computing device they’re using.

Let’s create a device variable to store what kind of device is available.

# Set device type
device = "cuda" if torch.cuda.is_available() else "cpu"
device
'cuda'
If the above output "cuda" it means we can set all of our PyTorch code to use the available CUDA device (a GPU) and if it output "cpu", our PyTorch code will stick with the CPU.

Note: In PyTorch, it’s best practice to write device agnostic code. This means code that’ll run on CPU (always available) or GPU (if available).

If you want to do faster computing you can use a GPU but if you want to do much faster computing, you can use multiple GPUs.

You can count the number of GPUs PyTorch has access to using torch.cuda.device_count().

# Count number of devices
torch.cuda.device_count()
1
Knowing the number of GPUs PyTorch has access to is helpful in case you wanted to run a specific process on one GPU and another process on another (PyTorch also has features to let you run a process across all GPUs).

3. Putting tensors (and models) on the GPU
You can put tensors (and models, we’ll see this later) on a specific device by calling to(device) on them. Where device is the target device you’d like the tensor (or model) to go to.

Why do this?

GPUs offer far faster numerical computing than CPUs do and if a GPU isn’t available, because of our device agnostic code (see above), it’ll run on the CPU.

Note: Putting a tensor on GPU using to(device) (e.g. some_tensor.to(device)) returns a copy of that tensor, e.g. the same tensor will be on CPU and GPU. To overwrite tensors, reassign them:

some_tensor = some_tensor.to(device)

Let’s try creating a tensor and putting it on the GPU (if it’s available).

# Create tensor (default on CPU)
tensor = torch.tensor([1, 2, 3])

# Tensor not on GPU
print(tensor, tensor.device)

# Move tensor to GPU (if available)
tensor_on_gpu = tensor.to(device)
tensor_on_gpu
tensor([1, 2, 3]) cpu
tensor([1, 2, 3], device='cuda:0')
If you have a GPU available, the above code will output something like:

tensor([1, 2, 3]) cpu
tensor([1, 2, 3], device='cuda:0')
Notice the second tensor has device='cuda:0', this means it's stored on the 0th GPU available (GPUs are 0 indexed, if two GPUs were available, they'd be 'cuda:0' and 'cuda:1' respectively, up to 'cuda:n').

4. Moving tensors back to the CPU
What if we wanted to move the tensor back to CPU?

For example, you’ll want to do this if you want to interact with your tensors with NumPy (NumPy does not leverage the GPU).

Let’s try using the torch.Tensor.numpy() method on our tensor_on_gpu.

# If tensor is on GPU, can’t transform it to NumPy (this will error)
tensor_on_gpu.numpy()
—————————————————————————
TypeError Traceback (most recent call last)
in ()
1 # If tensor is on GPU, can’t transform it to NumPy (this will error)
—-> 2 tensor_on_gpu.numpy()

TypeError: can’t convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Instead, to get a tensor back to CPU and usable with NumPy we can use Tensor.cpu().

This copies the tensor to CPU memory so it’s usable with CPUs.

# Instead, copy the tensor back to cpu
tensor_back_on_cpu = tensor_on_gpu.cpu().numpy()
tensor_back_on_cpu
array([1, 2, 3])
The above returns a copy of the GPU tensor in CPU memory so the original tensor is still on GPU.

tensor_on_gpu
tensor([1, 2, 3], device='cuda:0')
Exercises
Documentation reading – A big part of deep learning (and learning to code in general) is getting familiar with the documentation of a certain framework you’re using. We’ll be using the PyTorch documentation a lot throughout the rest of this course. So I’d recommend spending 10-minutes reading the following (it’s okay if you don’t get some things for now, the focus is not yet full understanding, it’s awareness):
The documentation on torch.Tensor.
The documentation on torch.cuda.
Create a random tensor with shape (7, 7).
Perform a matrix multiplication on the tensor from 2 with another random tensor with shape (1, 7) (hint: you may have to transpose the second tensor).
Set the random seed to 0 and do 2 & 3 over again. The output should be:
(tensor([[1.8542],
[1.9611],
[2.2884],
[3.0481],
[1.7067],
[2.5290],
[1.7989]]), torch.Size([7, 1]))
Speaking of random seeds, we saw how to set it with torch.manual_seed() but is there a GPU equivalent? (hint: you’ll need to look into the documentation for torch.cuda for this one)
If there is, set the GPU random seed to 1234.
Create two random tensors of shape (2, 3) and send them both to the GPU (you’ll need access to a GPU for this). Set torch.manual_seed(1234) when creating the tensors (this doesn’t have to be the GPU random seed). The output should be something like:
Device: cuda
(tensor([[0.0290, 0.4019, 0.2598],
[0.3666, 0.0583, 0.7006]], device=’cuda:0′),
tensor([[0.0518, 0.4681, 0.6738],
[0.3315, 0.7837, 0.5631]], device=’cuda:0′))
Perform a matrix multiplication on the tensors you created in 6 (again, you may have to adjust the shapes of one of the tensors).
Find the maximum and minimum values of the output of 7.
Find the maximum and minimum index values of the output of 7.
Make a random tensor with shape (1, 1, 1, 10) and then create a new tensor with all the 1 dimensions removed to be left with a tensor of shape (10). Set the seed to 7 when you create it and print out the first tensor and its shape as well as the second tensor and its shape. The output should look something like:
tensor([[[[0.5349, 0.1988, 0.6592, 0.6569, 0.2328, 0.4251, 0.2071, 0.6297,
0.3653, 0.8513]]]]) torch.Size([1, 1, 1, 10])
tensor([0.5349, 0.1988, 0.6592, 0.6569, 0.2328, 0.4251, 0.2071, 0.6297, 0.3653,
0.8513]) torch.Size([10])
Resource: To complete these exercises, see the exercises notebooks templates and potential solutions on the course GitHub.

Extra-curriculum
Spend 1-hour going through the PyTorch basics tutorial (I’d recommend the Quickstart and Tensors sections).
To learn more on how a tensor can represent data, see this video: What’s a tensor?

Apache Spark

1. Apache Spark Overview

Apache Spark is an open-source, distributed computing system designed for fast computation on large-scale data processing tasks. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

2. Spark Master

The Spark Master is the main node in a Spark cluster that manages the cluster’s resources and schedules tasks. It is responsible for allocating resources to different Spark applications and managing their execution. The Master keeps track of the state of the worker nodes and the applications running on them.

  • Role: Manages resources in the cluster.
  • Components Managed: Workers, applications.

3. Driver

The Driver program runs on a node in the cluster and is the entry point of the Spark application. It contains the application’s main function and is responsible for converting the user’s code into jobs and tasks that are executed by the Spark cluster.

  • Role: Manages the execution of the Spark application.
  • Responsibilities:
    • Converts user code into tasks.
    • Manages the job execution process.
    • Collects and displays output.

4. Cluster Manager

The Cluster Manager is a system that manages the resources of the cluster. Spark can work with several different types of cluster managers (selected via the master URL, as sketched after this list), including:

  • Standalone: Spark’s built-in cluster manager. It is simple to set up and is useful for smaller clusters.
  • Apache YARN: Used for larger clusters in environments where Hadoop is deployed.
  • Apache Mesos: A general-purpose cluster manager that can manage multiple types of distributed systems.
  • Kubernetes: Used for running Spark on Kubernetes clusters, allowing containerized Spark applications to be managed at scale.
  • Role: Allocates resources to Spark applications.
  • Responsibilities:
    • Manages the distribution of CPU, memory, and other resources.
    • Works with the Spark Master to schedule tasks.
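As an illustration, the cluster manager is typically chosen via the master URL when the SparkSession is created (or via spark-submit --master). The host names and ports below are placeholders, not real endpoints:

from pyspark.sql import SparkSession

# The master URL decides which cluster manager runs the application:
#   "local[*]"                 -> run locally on all cores (no cluster manager)
#   "spark://<host>:7077"      -> Spark standalone cluster manager
#   "yarn"                     -> Apache YARN (needs Hadoop/YARN configuration available)
#   "k8s://https://<host>:443" -> Kubernetes
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")   # swap for one of the URLs above on a real cluster
         .getOrCreate())

print(spark.sparkContext.master)  # shows the master/cluster manager in use
spark.stop()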

5. Spark Cluster Architecture

In a Spark cluster, the Driver communicates with the Cluster Manager to request resources (e.g., executors on worker nodes). The Cluster Manager allocates the resources and informs the Master node, which in turn assigns tasks to worker nodes.

  • Driver: Runs the main Spark application.
  • Master: Schedules resources across the cluster.
  • Workers: Execute the tasks assigned by the Master.
  • Cluster Manager: Manages the distribution of resources.

This architecture allows Spark to handle large-scale data processing efficiently by distributing the workload across multiple nodes in the cluster.

Spark Architecture

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager –

  • Master Daemon — (Master/Driver Process)
  • Worker Daemon –(Slave Process)
  • Cluster Manager

When you submit a PySpark job, the code execution is split between the Driver and the Worker Executors. Here’s how it works:

1. Driver

  • The Driver is the main program that controls the entire Spark application.
  • It is responsible for:
    • Converting the user-defined transformations and actions into a logical plan.
    • Breaking the logical plan into stages and tasks.
    • Scheduling tasks to be executed on the Worker nodes.
    • Collecting results from the workers if needed.

Driver Execution:

  • Any code that doesn’t involve transformations on distributed data (e.g., creating RDDs/DataFrames, defining transformations, and actions like collect, show, count) is executed in the Driver.
  • For example, commands like df.show(), df.collect(), or df.write.csv() are initially triggered in the Driver. The Driver then sends tasks to the Worker nodes to perform distributed computations.

2. Worker Executors

  • Executors are the processes running on the Worker nodes. They are responsible for executing the tasks that the Driver schedules.
  • Executors perform the actual data processing: reading data, performing transformations, and writing results.

Worker Execution:

  • All operations that involve transformations (e.g., map, filter, reduceByKey) on distributed datasets (RDDs or DataFrames) are executed on the Worker Executors.
  • The Driver sends tasks to the Executors, which operate on the partitions of the data. Each Executor processes its partition of the data independently.

Example Workflow:

Let’s consider an example to clarify:

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("example").getOrCreate()

# DataFrame creation (executed by the Driver)
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Transformation (lazy, plan is created by the Driver but not executed)
df_filtered = df.filter(df["Age"] > 30)

# Action (Driver sends tasks to Executors to execute the filter and collect the results)
result = df_filtered.collect()  # Executed by Executors

# The results are collected back to the Driver
print(result)

https://sunscrapers.com/blog/building-a-scalable-apache-spark-cluster-beginner-guide/

https://medium.com/@patilmailbox4/install-apache-spark-on-ubuntu-ffa151e12e30

Linux NFS (server,client)

Step 1: Configure the NFS Server

sudo apt update
sudo apt install nfs-kernel-server -y
sudo mkdir -p /srv/nfs/share
sudo chown nobody:nogroup /srv/nfs/share
echo "/srv/nfs/share    *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -arv

Step 2: Mount the NFS Share on the Client

sudo apt update
sudo apt install nfs-common -y
sudo mkdir -p /mnt/nfs/share
sudo mount -t nfs <nfs-server-ip>:/srv/nfs/share /mnt/nfs/share

Step 3: Permanent Mount (Optional)

echo "<nfs-server-ip>:/srv/nfs/share /mnt/nfs/share nfs defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a

Pandas

vocabularies

Data Structure | Dimensions | Description
Series | 1 | 1D labeled homogeneous array, size-immutable.
Data Frames | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel | 3 | General 3D labeled, size-mutable array.


create data frame

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

#change columns names
df=pd.DataFrame([[1,2,3],[4,5,6]],columns=['a','b','c'],index=['A','B'])
df.columns = ['x','y','z']

#index rows instead of numbers
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])


#read csv
df = pd.read_csv('data.csv')

#read json 
df = pd.read_json('data.json')

#series
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])

basic properties and functions

1. axes: Returns a list of the row axis labels.
2. dtype: Returns the dtype of the object.
3. empty: Returns True if the series is empty.
4. ndim: Returns the number of dimensions of the underlying data, by definition 1.
5. size: Returns the number of elements in the underlying data.
6. values: Returns the Series as an ndarray.
7. head(): Returns the first n rows.
8. tail(): Returns the last n rows.
9. columns: Get or set the column labels (e.g. df.columns = ['x', 'y', 'z']).
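Here's a short sketch demonstrating the attributes and methods listed above on a small Series and DataFrame:

import pandas as pd

s = pd.Series([1, 7, 2])
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

print(s.axes)   # [RangeIndex(start=0, stop=3, step=1)]
print(s.dtype)  # int64
print(s.empty)  # False
print(s.ndim)   # 1
print(s.size)   # 3
print(s.values) # [1 7 2]

print(df.head(2))  # first two rows
print(df.tail(1))  # last row
print(df.columns)  # Index(['calories', 'duration'], dtype='object')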

read data frame

1. .loc(): label based
2. .iloc(): integer based
3. .ix(): both label and integer based
#by row index
df.loc[0]
print(df.loc[[0, 1]])
df =pd.DataFrame({'good':[1,2,3],'bad':[4,None,6]})
print(df.loc[1,"bad"])#nan object
print(df.iloc[0:2, 0:2])#first two rows and first two columns
#ix (note: .ix was deprecated and removed in newer pandas versions; prefer .loc/.iloc)
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

# Integer slicing
print(df.ix[:4])
print(df.ix[:, 'A'])

Iterators

#iterate by columns then inside Series(Values)
df=pd.DataFrame([[1,2,3],[4,5,6]],columns=['a','b','c'],index=['A','B'])
df.columns = ['x','y','z']
for col in df:
    print(col)
    print(df[col].dtypes)
    for val in df[col]:
        print(val)


#iterate by rows
for row_index, row in df.iterrows():
    print(row_index, row)

#iterate by tuples
for index,*values in df.itertuples():
    print(index)
    print(values[0])
    print(values[1])
    print(values[2])

sort

#index sort
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2), index=[1,4,6,2,3,5,9,8,0,7], columns=['col2','col1'])

sorted_df = unsorted_df.sort_index()
print(sorted_df)

#values sort
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])

print(sorted_df)

edit data frame


import math

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], "B": [1, 2, 3, 4, 5]})
df['B'] = df['B'].apply(lambda x: math.pow(x, 2))   # square every value in B
df["B"] = [[math.sqrt(x), x/2] for x in df["B"]]    # replace B with lists [sqrt, half]

merge

  • left − A DataFrame object.
  • right − Another DataFrame object.
  • on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
  • left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
  • right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
  • left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
  • right_index − Same usage as left_index for the right DataFrame.
  • how − One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'; see the merge example after the DataFrame definitions below.
  • sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False in current pandas; enabling it can slow the merge down substantially.
import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
   {'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
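
A minimal sketch (using the left and right frames above) of how the on and how parameters change the result:

# inner join (the default): only subject_ids present in both frames
print(pd.merge(left, right, on='subject_id'))

# left join: keep every row from left, fill missing right values with NaN
print(pd.merge(left, right, on='subject_id', how='left'))

# join on multiple keys
print(pd.merge(left, right, on=['id', 'subject_id']))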

Concatenation

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one, two], keys=['x', 'y']))    # hierarchical index labelled x/y
print(pd.concat([one, two], ignore_index=True))  # renumber 0..9; keys would be ignored here

cut continuous values into classes (binning)

df = pd.DataFrame({'number': [1,2,3,4,5,6,7,8,9,10,11]})
df['bins'] = pd.cut(df['number'], (0, 5, 8, 11), 
                    labels=['low', 'medium', 'high'])

oop in python

magic functions

class tt:
    text:str
    def __init__(self,value):
        self.text = value
    def __str__(self):#print(obj)
        return self.text
    def __repr__(self):#repr(obj) / shown in the REPL, returns obj.text
        return self.text
    def __add__(self, other):#obj+other return obj.text+other.text
        return self.text+other.text
    def __len__(self):#len(obj)  return len(obj.text)
        return len(self.text)
    def __getattr__(self, item):#called only for missing attributes: obj.anything returns 'anything'
        return item
    def __getitem__(self, item):#obj[item] return item
        return item
    def __call__(self, *args, **kwargs):#obj() return obj.text
        return self.text
    def __eq__(self, other):#obj==other return obj.text==other.text
        return self.text==other.text
    def __ne__(self, other):#obj!=other return obj.text!=other.text
        return self.text!=other.text
    def __iter__(self):#for i in obj: return obj.text
        return iter(self.text)
class A:
    def __getitem__(self, item):
        if(isinstance(item, slice)):
            print(item.start)
            print(item.stop)
            print(item.step)

a = A()
a[1:3:4]
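
A quick usage sketch (not in the original notes) exercising the magic methods of the tt class above:

a = tt("hello")
b = tt(" world")
print(a)            # __str__ -> hello
print(a + b)        # __add__ -> hello world
print(len(a))       # __len__ -> 5
print(a())          # __call__ -> hello
print(a == b)       # __eq__ -> False
print(a["key"])     # __getitem__ -> key
print(a.missing)    # __getattr__ -> missing
for ch in a:        # __iter__ -> iterates over the characters of text
    print(ch)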

generic class

import decimal
from typing import TypeVar, Generic

T = TypeVar('T', int, float, complex, decimal.Decimal)
class Stack(Generic[T]):
    def __init__(self) -> None:
        # Create an empty list with items of type T
        self.items: list[T] = []

    def push(self, item: T) -> None:
        self.items.append(item)

    def pop(self) -> T:
        return self.items.pop()

    def empty(self) -> bool:
        return not self.items
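
A short usage sketch of the Stack class above; the type parameter only constrains static type checkers, not runtime behaviour:

s: Stack[int] = Stack()
s.push(1)
s.push(2)
print(s.pop())    # 2
print(s.empty())  # False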

predefined static properties

  • __dict__ − Dictionary containing the class’s namespace.
  • __doc__ − Class documentation string or none, if undefined.
  • __name__ − Class name.
  • __module__ − Module name in which the class is defined. This attribute is “__main__” in interactive mode.
  • __bases__ − A possibly empty tuple containing the base classes, in the order of their occurrence in the base class list.
class Employee:
   def __init__(self, name="Bhavana", age=24):
      self.name = name
      self.age = age
   def displayEmployee(self):
      print ("Name : ", self.name, ", age: ", self.age)

print ("Employee.__doc__:", Employee.__doc__)
print ("Employee.__name__:", Employee.__name__)
print ("Employee.__module__:", Employee.__module__)
print ("Employee.__bases__:", Employee.__bases__)
print ("Employee.__dict__:", Employee.__dict__ )

abstraction

from abc import ABC, abstractmethod
class demo(ABC):
   @abstractmethod
   def method1(self):
      print ("abstract method")
      return
   def method2(self):
      print ("concrete method")

access modifier

class Employee:
   def __init__(self, name, age, salary):
      self.name = name # public variable
      self.__age = age # private variable
      self._salary = salary # protected variable
   def displayEmployee(self):
      print ("Name : ", self.name, ", age: ", self.__age, ", salary: ", self._salary)

e1=Employee("Bhavana", 24, 10000)

print (e1.name)
print (e1._salary)        # protected by convention only, still accessible
print (e1._Employee__age) # e1.__age raises AttributeError because of name mangling

enum

from enum import Enum

class subjects(Enum):
   ENGLISH = "E"
   MATHS = "M"
   GEOGRAPHY = "G"
   SANSKRIT = "S"
   
obj = subjects.SANSKRIT
print (type(obj), obj.name, obj.value)#<enum 'subjects'> SANSKRIT S

from enum import Enum, unique

@unique
class subjects(Enum):
   ENGLISH = 1
   MATHS = 2
   GEOGRAPHY = 3
   SANSKRIT = 2 # ValueError: duplicate values (SANSKRIT -> MATHS)

reflections

class test:
   pass
   
obj = test()
print (type(obj))#<class '__main__.test'>

print (isinstance(10, int))#true
print (isinstance(2.56, float))#true
print (isinstance(2+3j, complex))#true
print (isinstance("Hello World", str))#true

def test():
   pass
   
print (callable("Hello"))
print (callable(abs))#true
print (callable(list.clear([1,2])))
print (callable(test))#true

class test:
   def __init__(self):
      self.name = "Manav"
      
obj = test()
print (getattr(obj, "name"))

class test:
   def __init__(self):
      self.name = "Manav"
      
obj = test()
setattr(obj, "age", 20)
setattr(obj, "name", "Madhav")
print (obj.name, obj.age)

class test:
   def __init__(self):
      self.name = "Manav"
      
obj = test()
print (hasattr(obj, "age"))
print (hasattr(obj, "name"))

dir(obj) # list the attributes belonging to an object

static methods

class A:
    value: int=14 #static variable you can access it by A.value
    @staticmethod
    def bar():#static method
        print("bar")

another way to call a method (through the class, passing the instance explicitly)

class A:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def mm(self):
        return self.x, self.y
fun=A.mm(A(1,2))
print(fun)  # (1, 2)

setter and getter

# employee.py

from datetime import date

class Employee:
    def __init__(self, name, birth_date, start_date):
        self.name = name
        self.birth_date = birth_date
        self.start_date = start_date

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        self._name = value.upper()

    @property
    def birth_date(self):
        return self._birth_date

    @birth_date.setter
    def birth_date(self, value):
        self._birth_date = date.fromisoformat(value)

    @property
    def start_date(self):
        return self._start_date

    @start_date.setter
    def start_date(self, value):
        self._start_date = date.fromisoformat(value)
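
A quick usage sketch (assuming the Employee class above); the setters normalize the values on assignment:

e = Employee("john doe", "1990-05-01", "2020-01-15")
print(e.name)        # JOHN DOE (upper-cased by the name setter)
print(e.birth_date)  # 1990-05-01 as a datetime.date object
e.name = "jane doe"  # goes through the setter again
print(e.name)        # JANE DOE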

collections and iterations

  • Set: Its unique feature is that items are either members or not. This means duplicates are ignored:
  • Mutable set: The set collection
  • Immutable set: The frozenset collection
  • Sequence: Its unique feature is that items are provided with an index position:
  • Mutable sequence: The list collection
  • Immutable sequence: The tuple collection
  • Mapping: Its unique feature is that each item has a key that refers to a value:
  • Mutable mapping: The dict collection.
  • Immutable mapping: Interestingly, there’s no built-in frozen mapping.
  • [:]: The start and stop are implied. The expression S[:] will create a copy of sequence S.
  • [:stop]: This makes a new list from the beginning to just before the stop value.
  • [start:]: This makes a new list from the given start to the end of the sequence.
  • [start:stop]: This picks a sublist, starting from the start index and stopping just before the stop index. Python works with half-open intervals. The start is included, while the end is not included.
  • [::step]: The start and stop are implied and include the entire sequence. The step, generally not equal to one, means we'll skip through the list from the start using the step.
  • [start::step]: The start is given, but the stop is implied. The idea is that the start is an offset and the step applies from that offset to the end of the sequence.
  • [:stop:step]: This is used to prevent processing the last few items in a list. Since the step is given, processing begins with element zero.
  • [start:stop:step]: This will pick elements from a subset of the sequence. Items prior to start and at or after stop will not be used.
a = slice(1, 2, 3)#[start:stop:step]
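
A short sketch illustrating the slice forms listed above on a plain list:

S = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(S[:])       # copy of S
print(S[:4])      # [0, 1, 2, 3]
print(S[6:])      # [6, 7, 8, 9]
print(S[2:5])     # [2, 3, 4] (half-open: start included, stop excluded)
print(S[::2])     # [0, 2, 4, 6, 8]
print(S[1::3])    # [1, 4, 7]
print(S[:8:2])    # [0, 2, 4, 6]
print(S[1:8:2])   # [1, 3, 5, 7]
print(S[slice(1, 8, 2)])  # same as S[1:8:2]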

List

list1 = [1, 2, 3, 4, 5]
list2 = [6, 7, 8, 9, 10]
#operations
print(list1 + list2)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(list1 * 2)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
#append
list1.append(6)  # [1, 2, 3, 4, 5, 6]
list1.extend([7, 3, 3])  # [1, 2, 3, 4, 5, 6, 7, 3, 3]
#remove
del list1[0]  # [2, 3, 4, 5, 6, 7, 3, 3]
list1.remove(3)  # [2, 4, 5, 6, 7, 3, 3]
list1.pop(0)  # [4, 5, 6, 7, 3, 3]
#insert
list1.insert(0, 1)  # [1, 4, 5, 6, 7, 3, 3]
#sort
list1.sort()  # [1, 3, 3, 4, 5, 6, 7]
list1.sort(reverse=True)  # [7, 6, 5, 4, 3, 3, 1]
list1.sort(key=lambda x: x % 2)  # [6, 4, 7, 5, 3, 3, 1]
#reverse
list1.reverse()  # [1, 3, 3, 5, 7, 4, 6]
#count
print(list1.count(3))  # 2
#clear
list1.clear()  # []
#copy
list1 = [1, 2, 3, 4, 5]
list2 = list1.copy()
#unpack the list
list3 = [*list1]
print(*list1)  # 1 2 3 4 5
import random
import typing
ls = list(range(10))
random.shuffle(ls)
mm: typing.Callable = ls.append  # store the bound method in a variable
mm(10)
print(ls)  # the shuffled digits 0-9 followed by 10

dict

Only hashable (immutable) objects such as numbers, strings, or tuples can be used as keys. A value can be an object of any type.

  • dict[key] − Extract or assign the value mapped to key. Example: print(d1['b']) retrieves 4; d1['b'] = 'Z' assigns a new value to key 'b'.
  • dict1 | dict2 − Union of two dictionaries, returning a new object (Python 3.9+). Example: d3 = d1 | d2; print(d3) gives {'a': 2, 'b': 4, 'c': 30, 'a1': 20, 'b1': 40, 'c1': 60}
  • dict1 |= dict2 − Augmented (in-place) dictionary union operator. Example: d1 |= d2; print(d1) gives {'a': 2, 'b': 4, 'c': 30, 'a1': 20, 'b1': 40, 'c1': 60}
d1=dict([('a', 100), ('b', 200)])
d2 = dict((('a', 'one'), ('b', 'two')))
d3=dict(a= 100, b=200)
d4 = dict(a='one', b='two')
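
A minimal sketch of the operators in the table above, assuming the example values d1 = {'a': 2, 'b': 4, 'c': 30} and d2 = {'a1': 20, 'b1': 40, 'c1': 60}; the | and |= operators need Python 3.9+:

d1 = {'a': 2, 'b': 4, 'c': 30}
d2 = {'a1': 20, 'b1': 40, 'c1': 60}

print(d1['b'])   # 4
d1['b'] = 'Z'    # assign a new value to key 'b'

d3 = d1 | d2     # new merged dict, d1 and d2 unchanged
d1 |= d2         # in-place union, d1 now also contains d2's pairs
print(d3)
print(d1)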

update dictionary

marks = {"Savita":67, "Imtiaz":88, "Laxman":91, "David":49}
print ("marks dictionary before update: \n", marks)
marks1 = {"Sharad": 51, "Mushtaq": 61, "Laxman": 89}
marks.update(marks1)
print (marks)#{'Savita': 67, 'Imtiaz': 88, 'Laxman': 89, 'David': 49, 'Sharad': 51, 'Mushtaq': 61}

d1.update(k1=v1, k2=v2)

unpack dictionary

marks = {"Savita":67, "Imtiaz":88, "Laxman":91, "David":49}
marks1 = {"Sharad": 51, "Mushtaq": 61, "Laxman": 89}
newmarks = {**marks, **marks1}#{'Savita': 67, 'Imtiaz': 88, 'Laxman': 89, 'David': 49, 'Sharad': 51, 'Mushtaq': 61}
a,*b=[1,2,3]#1 [2, 3]

delete and remove

d = {'a': 1, 'b': 2}
del d['a']          # remove a single key
val = d.pop('b')    # remove the key and return its value
d.clear()           # remove all items
1. dict.clear() − Removes all elements of dictionary dict.
2. dict.copy() − Returns a shallow copy of dictionary dict.
3. dict.fromkeys(seq, value) − Creates a new dictionary with keys from seq and values set to value.
4. dict.get(key, default=None) − For key, returns the value, or default if key is not in the dictionary.
5. dict.has_key(key) − Returns True if the key is in the dictionary (Python 2 only; in Python 3 use `key in dict`).
6. dict.items() − Returns a view of dict's (key, value) pairs.
7. dict.keys() − Returns a view of dict's keys.
8. dict.pop(key) − Removes the element with the specified key and returns its value.
9. dict.popitem() − Removes and returns the last inserted key-value pair.
10. dict.setdefault(key, default=None) − Like get(), but sets dict[key] = default if key is not already in dict.
11. dict.update(dict2) − Adds dictionary dict2's key-value pairs to dict.
12. dict.values() − Returns a view of dict's values.
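
A short sketch exercising a few of these methods (Python 3, so views instead of lists and no has_key):

d = {"Savita": 67, "Imtiaz": 88}

print(d.get("Laxman", 0))   # 0, key missing so the default is returned
d.setdefault("Laxman", 91)  # inserts Laxman=91 because the key was missing
print(d.pop("Imtiaz"))      # 88, and the key is removed
print(d.popitem())          # ('Laxman', 91), the last inserted pair
print(list(d.items()))      # [('Savita', 67)]
print("Savita" in d)        # True, the Python 3 replacement for has_key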

syntaxes

for else statement

arr = [1, 2, 3, 4, 5]
for i in arr:
    if i>3:
        break
    print(i)
else: #executes if the loop completes without a break
    print("done")
T1 = (10,20,30,40)
T2 = ('one', 'two', 'three', 'four')
L1, L2 = list(T1), list(T2)
L3 = tuple(y for x in [L1, L2] for y in x)#(10, 20, 30, 40, 'one', 'two', 'three', 'four')
mm=["a","b","c",1,2,3]
for m in filter(lambda x: isinstance(x, str), mm):
    print(m)
for m,v in vars(tt("a")).items():#vars return all attributes of object
    print(m,v)
ls=[1,2,3,4,5,6,7,8,9,10,11]
even= [i for i in ls if i%2==0]#only even numbers

builtin functions

numbers = [1.0, 1.5, 2.0, 2.5]
result_1 = map(lambda x: x.is_integer(), numbers)
result_2 = (x.is_integer() for x in numbers)
result_3 = map(float.is_integer, numbers)
# all three evaluate to [True, False, True, False] once materialized with list()

filter(lambda x: x>10, numbers)

len(numbers)

numericList = [12, 24, 36, 48, 60, 72, 84]
print(sorted(numericList, reverse=True))#84,72,60....

sum([1,2,3,4])#10
a,*b=[1,2,3]#1 [2, 3]

Iterable and Iterator

from typing import *
m=[1,2,3]
mm=iter(m)
print(next(mm))#1
print(next(m))# TypeError: a list is not an iterator (call iter() first)
print(isinstance(iter(m), Iterator))#True
print(isinstance(m, Iterator))#False
print(isinstance(m, Iterable))#True
print(isinstance(iter(m), Iterable))#True

map object

ls = list(range(1, 100, 2))
ls2 = [m for m in ls if 10 < m < 20]   # & binds tighter than >, so use a chained comparison
m1 = map(lambda x: x - 100 + 3j, ls2)  # map object is lazy
ls3 = list(m1)                         # [(-89+3j), (-87+3j), (-85+3j), (-83+3j), (-81+3j)]