Machine Learning

Date: August 2024
Reading time: 15 minutes

This serves as a collection of notes, examples, and tutorials on machine learning. As I work through the material, I will be documenting my learning process here. It is intended to be a helpful resource for others who are learning machine learning as well. The content is still in the early stages of development and will continue to be expanded in the future.

Running Example

Visit my Google Colab example to see some of the supervised learning models in action.

Definitions

Term	Description
ML	Machine learning is a type of artificial intelligence that enables computers to learn from data. It focuses on algorithms that improve through experience rather than through explicit programming.
Fields
AI (Artificial Intelligence)	Enable computers to perform human-like tasks/behaviors
ML (Machine Learning)	A branch of artificial intelligence that enables computers to learn from data and improve their performance without being explicitly programmed.
DS (Data Science)	Draw insights from data - could use ML
Learning Types
Supervised Learning	Uses inputs with corresponding outputs to train - labeled inputs. E.g. Picture A is a dog, picture B a cat.
Unsupervised Learning	Learns about patterns and finds structures to cluster - unlabeled data. E.g. Pictures A, D and E have something in common.
Reinforcement Learning	An agent takes actions in an interactive environment in order to maximize a reward. It learns by trial and error and from the feedback (rewards and penalties) it receives. E.g. This chess move was successful, maybe use it next time as well.
Common File Formats
CSV	Tabular with header row - `id,type,quantity\n0,books,3\n1,pens,5`
JSON	Tree-like with multiple layers - `[{"id": 0, "type": "books", "quantity": 3}, {"id": 1, "type": "pens", "quantity": 5}]`
Structure
Process	Input -> Model -> Output
Terms
Input/Feature/Feature Vector	Data which is fed into the model.
Model	Function which takes input data and gives a prediction.
Output/Prediction	Prediction made by the model, based on the input data.
Labels/Results	True values associated with input data.
Fitting	Process of adjusting the model's parameters to best explain the data.
Training/Learning	In order to learn, the labels (results) are separated from the data. Then, each row is fed into the model, which comes up with a prediction. This prediction gets compared with the true value. Based on the loss - the difference between prediction and actual result - the model makes adjustments. That's what's called training.
Training Data	Data used to fit the model
Loss	The loss is the difference between prediction and actual label. How far is the output from the truth? The smaller the loss, the better the model performs.
Accuracy	Indicates what proportion of the predictions were correct.
MAE	Mean Absolute Error - average of the absolute differences between predictions and actual values.
X	Input data - features.
y	Labels/Results - true values associated with input data.
X_train	Training data - input data used to fit the model.
X_valid	Validation data - input data used to assess the model's performance.
y_train	Training labels/Results - true values associated with training data.
y_valid	Validation labels/Results - true values associated with validation data.

Input Types

Qualitative: Finite number of categories or groups
- Nominal data: No inherent order
  E.g. Countries
  Country One-Hot Encoding
  Switzerland [1, 0, 0]
  USA [0, 1, 0]
  Italy [0, 0, 1]
- Ordinal data: Inherent order
  E.g. Age groups
  Baby is closer to child than adult
Quantitative: Numerical valued
- Continuous: Can be measured on a continuum or scale
  E.g. Temperature
- Discrete: Result of counting
  E.g. Number of heads in a sequence of coin tosses

Output Types

Classification
- Multiclass - e.g. Baseball/Basketball/Football
- Binary - e.g. spam/not
Regression
- Continuous values - e.g. Stock price

Loss Functions

A loss function is a mathematical function that measures the difference between the predicted output and the real output of a model. It is used to quantify the error between the expected and the actual results.
Here are three commonly used loss functions.

Mean Absolute Error (MAE)

The further off the prediction, the greater the loss:

L1 = (1/n) * Σ|y_real - y_predicted|, where n is the number of predictions

Mean Squared Error (MSE)

Measures the average squared difference between the predicted and actual values.
Minimal penalty for small misses, much higher loss for bigger ones:

L2 = (1/n) * Σ(y_real - y_predicted)², where n is the number of predictions
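To make the difference between the two losses concrete, here is a minimal pure-Python sketch. The function names `mae` and `mse` and the sample numbers are made up for illustration; both functions average over the number of predictions.

```python
def mae(y_real, y_predicted):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(r - p) for r, p in zip(y_real, y_predicted)) / len(y_real)


def mse(y_real, y_predicted):
    """Mean Squared Error: average of the squared differences."""
    return sum((r - p) ** 2 for r, p in zip(y_real, y_predicted)) / len(y_real)


y_real = [100, 200, 300, 400]
small_misses = [101, 199, 301, 399]  # every prediction is off by 1
one_big_miss = [100, 200, 300, 404]  # same total error, concentrated in one row

print(mae(y_real, small_misses))  # 1.0
print(mae(y_real, one_big_miss))  # 1.0
print(mse(y_real, small_misses))  # 1.0
print(mse(y_real, one_big_miss))  # 4.0
```

Both sets of predictions have the same MAE, but MSE punishes the single large miss four times as hard.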

Cross Entropy Loss

This loss function is used in classification problems where the target variable is categorical. It measures the difference between predicted probabilities and the true probabilities of each class. For the binary case (two classes), the formula is:

L3 = - Σ(y_true * log(y_predicted) + (1 - y_true) * log(1 - y_predicted))
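The binary formula can be sketched in a few lines of pure Python. The function name, the clipping constant `eps` and the sample probabilities are assumptions for illustration; unlike the formula above, this sketch also divides by the number of samples, which is a common convention.

```python
import math


def binary_cross_entropy(y_true, y_predicted, eps=1e-12):
    """Average binary cross entropy; y_predicted holds probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, y_predicted):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.9, 0.1]
confident_and_wrong = [0.1, 0.9, 0.1, 0.9]

print(binary_cross_entropy(y_true, confident_and_right))  # about 0.105
print(binary_cross_entropy(y_true, confident_and_wrong))  # about 2.303
```

Confident but wrong probabilities are penalized far more heavily than confident and correct ones.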

Model

Basic Concept (Decision Tree)

A real estate agent might say that they estimate house prices by intuition. But on closer inspection, you can see that they recognize price patterns and use them to predict the prices of new houses.
Machine learning works the same way.

The Decision Tree is one of many models.
It may not predict as accurately as others, but it is easy to understand and is the building block for some of the best models in data science.

This example groups houses into two groups (with price predictions).

    -----------------------> $1.100.000
  /
 / 3 or more
* bedrooms
 \ less than 3
  \
    -----------------------> $750.000

Training data is then used to fit the model, which means adjusting it so that the splits and the predicted prices are as good as possible.
After that, it can be used to predict the prices of new data.
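As a rough illustration of what fitting means for a tree with a single split, the sketch below tries every candidate threshold on one feature and keeps the one with the lowest squared error. The function name `fit_stump` and the data are invented for this example, and libraries such as scikit-learn use more refined procedures.

```python
def fit_stump(xs, ys):
    """Find the threshold on a single feature that minimizes squared error.

    Returns (threshold, left_prediction, right_prediction). Rows with
    x < threshold go left; each side predicts the mean of its targets.
    Assumes xs contains at least two distinct values.
    """
    best = None
    # skip the smallest value, otherwise the left side would be empty
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


bedrooms = [1, 2, 2, 3, 4, 4]
prices = [700_000, 750_000, 800_000, 1_050_000, 1_100_000, 1_150_000]

threshold, low_price, high_price = fit_stump(bedrooms, prices)
print(threshold, low_price, high_price)  # 3 750000.0 1100000.0
```

With this made-up data the best split is "3 or more bedrooms", and the two leaf values are the average prices on each side.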

The results of the above tree would be rather vague.
By using deeper trees (with more splits) you can capture more factors.

                                  ----> $1.300.000
                                /
                               / yes
      ----------------------- * larger than 11500 square feet lot size
    /                          \ no
   /                            \
  /                               ----> $750.000
 / 3 or more
* bedrooms
 \ less than 3
  \                               ----> $850.000
   \                            /
    \                          / yes
      ----------------------- * larger than 8500 square feet lot size
                               \ no
                                \
                                  ----> $400.000

The point at the bottom, where the prediction is made, is called a leaf.
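Once fitted, a tree is nothing more than nested conditions. The deeper tree from the diagram could be written by hand like this (the function and parameter names are chosen for this example):

```python
def predict_price(bedrooms, lot_size_sqft):
    """Walk the two-level tree from the diagram and return the leaf value."""
    if bedrooms >= 3:
        if lot_size_sqft > 11500:
            return 1_300_000
        return 750_000
    if lot_size_sqft > 8500:
        return 850_000
    return 400_000


print(predict_price(bedrooms=4, lot_size_sqft=12000))  # 1300000
print(predict_price(bedrooms=2, lot_size_sqft=5000))   # 400000
```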

Pandas

Pandas is the main tool for data scientists to explore and manipulate data. Pandas is often abbreviated as pd.

import pandas as pd

The library has powerful methods for most things that need to be done with data.
Its most important part is the DataFrame.

Print Data Summary

data_src = '../input/some-data.csv'
data = pd.read_csv(data_src)
data.describe()

Such a table can be interpreted like so:

Value	Description
count	Number of non-null objects
mean	Average
std	Standard deviation (how numerically spread out)
min	Minimum value
25%	First quartile of values (25th percentile)
50%	Second quartile of values (50th percentile/median)
75%	Third quartile of values (75th percentile)
max	Maximum value
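To see where these numbers come from, the same summary can be computed by hand with Python's standard library. The values are made up. As far as I know, `statistics.stdev` (sample standard deviation) and the `inclusive` quantile method correspond to what pandas reports by default, but treat that as an assumption rather than a guarantee.

```python
import statistics

values = [2, 4, 4, 5, 7, 9]  # stand-in for one numerical column

q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')

print('count', len(values))               # 6
print('mean ', statistics.mean(values))   # about 5.17
print('std  ', statistics.stdev(values))  # about 2.48
print('min  ', min(values))               # 2
print('25%  ', q1)                        # 4.0
print('50%  ', median)                    # 4.5
print('75%  ', q3)                        # 6.5
print('max  ', max(values))               # 9
```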

Print First 5 Rows

data.head()

Prediction Target

Using the dot notation, you can select the prediction target (column you want to predict).
This is by convention called y.

y = data.Price

Choosing Features

You could use all columns, except the target, as features. But sometimes you'll be better off with fewer features.
Those can be selected using a feature list.

features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = data[features]

Building Model (Decision Tree)

scikit-learn is a popular library for modeling the data (typically stored in DataFrames).
Follow the steps below to create a model.

1. Define the type of model and its parameters
2. Fit it by capturing patterns from the data
3. Predict the results

from sklearn.tree import DecisionTreeRegressor
 
# 1
model = DecisionTreeRegressor(random_state=1) # random_state for consistent outcome across calls
 
# 2
model.fit(X, y)
 
# 3
prediction = model.predict(X)

Validating Model

The fourth step is to evaluate the model by inspecting how accurate its predictions are.
Mean Absolute Error (MAE) is one of many metrics for determining a model's quality.
It starts from the error of each prediction: error=actual−predicted

MAE takes the absolute value of each error and averages them, so the metric shows how far off the predictions are on average.

from sklearn.metrics import mean_absolute_error

# 4
mean_absolute_error(y, prediction)

Data Splitting

A model shouldn't be trained on all data, because that way you couldn't know how it performs on unseen data. It might perform great on training data, because it has seen it over and over again, but make bad assumptions on new information.
To assess how well the model generalizes, the data is usually split into 3 datasets:

Training: Model improves by calculating the loss and learning from it
Validation: Acts as a reality check, to see if the model can handle unseen data. The loss doesn't get fed back into it.
Testing: Last check on how the final chosen model performs, based on the loss
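A three-way split can be sketched without any library. The function name, the fractions and the seed below are assumptions for illustration, not fixed conventions.

```python
import random


def three_way_split(rows, valid_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(len(shuffled) * test_fraction)
    n_valid = int(len(shuffled) * valid_fraction)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


rows = list(range(10))  # stand-in for the rows of a dataset
train, valid, test = three_way_split(rows)
print(len(train), len(valid), len(test))  # 6 2 2
```

Every row ends up in exactly one of the three parts, so the model is never evaluated on data it was trained on.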

This is how the data can be split up in two pieces:

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
 
 
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
 
model = DecisionTreeRegressor()
 
# fit using training data
model.fit(train_X, train_y)
 
# predict validation data
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

Underfitting and Overfitting

There are two problems that can reduce a model's accuracy on future predictions.

Overfitting: The model is tuned so precisely to the training set that it captures patterns which won't recur in new data
Underfitting: The model fails to capture relevant patterns, so it performs poorly even on the training data

The sweet spot for a decision tree can be found by comparing different tree sizes, here controlled through max_leaf_nodes:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
 
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
 
for max_leaf_nodes in [10, 100, 1000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Random Forest

A single decision tree often performs poorly because it either under- or overfits. Other models face the same problem, but many of them use ideas that can improve performance. One example is the random forest.

A random forest model uses many trees and averages their predictions in order to make a much more accurate prediction than a single tree could.
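The averaging idea can be sketched in pure Python. This is a heavily simplified stand-in: every "tree" is a single split on one feature, each is trained on a bootstrap sample, and the names and data are invented. A real random forest also randomizes which features each split may use.

```python
import random


def fit_stump(xs, ys):
    """Best single split on one feature; predicts the mean if no split is possible."""
    overall = sum(ys) / len(ys)
    best = (float('inf'), None, overall, overall)
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        error = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if error < best[0]:
            best = (error, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: overall if threshold is None else (lm if x < threshold else rm)


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample rows with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


lot_sizes = [4000, 5000, 6000, 9000, 10000, 12000]
prices = [400_000, 420_000, 450_000, 800_000, 850_000, 900_000]

forest = fit_forest(lot_sizes, prices)
print(forest(5000), forest(12000))
```

Because every tree sees a slightly different sample, their individual errors partly cancel out when averaged.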

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
 
model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)
melb_preds = model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

Choosing Model

Using a function you can try out different models.

from sklearn.metrics import mean_absolute_error
 
# assumes X_train, X_valid, y_train, y_valid and a list `models` of
# unfitted estimators have been defined beforehand
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)
 
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i, mae))

Missing Values

There are multiple approaches to deal with missing values.

Option	Description	Benefit	Disadvantage
Drop	Drop columns with missing values	Easy to implement	Loses access to a lot of potentially useful information
Imputation (possibly better than dropping)	Fill in the missing values with some number	Leads to more accurate models	The imputed value won't be exactly right in most cases
Imputation Extension (standard approach)	Impute missing values and add a new column to show the location of the imputed values	May meaningfully improve results	More complex
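The second and third option can be sketched for a single column in pure Python. The function name and the data are made up, and the mean is just one possible fill value.

```python
def impute_with_indicator(column):
    """Replace None with the column mean and record where values were missing."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    imputed = [mean if v is None else v for v in column]
    was_missing = [v is None for v in column]
    return imputed, was_missing


bathrooms = [1, None, 2, 3, None]
imputed, was_missing = impute_with_indicator(bathrooms)
print(imputed)      # [1, 2.0, 2, 3, 2.0]
print(was_missing)  # [False, True, False, False, True]
```

Plain imputation keeps only the first list; the extension keeps both, so the model can learn whether a missing value is itself informative. In practice the fill value should be computed from the training data only and then reused for the validation data.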

Examples

Don't forget to always adjust both the training and the validation sets/DataFrames.

Drop

# get names of cols with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
 
# drop those cols
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
 
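# score_dataset is assumed to be a helper defined elsewhere that fits a model
# on the training data and returns the MAE on the validation data
print("Drop MAE:")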
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

Imputation

from sklearn.impute import SimpleImputer
 
# imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
 
# readd col names (imputation removed them)
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
 
print("Imputation MAE:")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

Imputation with Extension

# make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
 
# make new cols indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
 
# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
 
# readd col names (imputation removed them)
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
 
print("Extensive Imputation MAE:")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

Categorical Variables

Categorical variables take one of a limited set of values (like an enum), e.g. "Bad", "OK", "Good" or "Great".
There are three approaches for handling them.

Drop

Just removing the columns from the dataset is maybe easier but will not work well if the columns contain useful information.

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
 
print("Drop MAE:")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

Ordinal Encoding

This encoding replaces each unique value with a different integer, which is a useful approach for ordinal variables:
Bad (0) < OK (1) < Good (2) < Great (3)
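Done by hand, the mapping above looks like this. The variable names are invented for the example.

```python
# the order is stated explicitly: the position in the list becomes the integer code
quality_order = ['Bad', 'OK', 'Good', 'Great']
quality_to_code = {label: code for code, label in enumerate(quality_order)}

ratings = ['Good', 'Bad', 'Great', 'OK', 'Good']
encoded = [quality_to_code[rating] for rating in ratings]
print(encoded)  # [2, 0, 3, 1, 2]
```

Stating the order explicitly matters. scikit-learn's OrdinalEncoder orders the categories it finds by sorting them unless a `categories` list is passed, which for these labels would not match Bad < OK < Good < Great.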

from sklearn.preprocessing import OrdinalEncoder
 
# copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
 
# cols with categorical (text) data
object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
 
# apply ordinal encoder to each col with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
 
print("Ordinal MAE:")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

One-Hot Encoding

One-hot encoding creates new columns to represent each unique value in the original data. This encoding typically performs best.

Before:

Color
Black
White
Blue
White
Blue

After:

Black	White	Blue
1	0	0
0	1	0
0	0	1
0	1	0
0	0	1
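The table above can be reproduced with a few lines of pure Python. This is only a sketch of the idea, with variable names chosen for the example.

```python
colors = ['Black', 'White', 'Blue', 'White', 'Blue']

# one column per unique value, in order of first appearance
categories = list(dict.fromkeys(colors))
one_hot = [[1 if color == category else 0 for category in categories]
           for color in colors]

print(categories)  # ['Black', 'White', 'Blue']
for row in one_hot:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
# [0, 0, 1]
```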

from sklearn.preprocessing import OneHotEncoder
 
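# apply one-hot encoder to each col with categorical data
# (object_cols: list of the categorical column names)
# sparse_output replaces the older `sparse` argument in recent scikit-learn versions
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))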
 
# readd index (one-hot encoding removed it)
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
 
# remove categorical cols (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
 
# add one-hot encoded cols to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
 
# ensure all col have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
 
print("One-Hot Encoding MAE:")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

Pipelines

Pipelines are a way to chain multiple data transformation and model steps together.
The data flows through the pipeline and the steps are applied in order.
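What "applied in order" means can be sketched with a hand-rolled mini pipeline. All three classes below are hypothetical and only imitate the fit/transform/predict convention; they are not scikit-learn's implementation.

```python
class FillMissing:
    """Transformer: learns the mean in fit() and uses it to replace None."""

    def fit(self, xs, ys=None):
        present = [x for x in xs if x is not None]
        self.mean_ = sum(present) / len(present)
        return self

    def transform(self, xs):
        return [self.mean_ if x is None else x for x in xs]


class ProportionalModel:
    """Model: least-squares fit of y = slope * x (no intercept)."""

    def fit(self, xs, ys):
        self.slope_ = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self

    def predict(self, xs):
        return [self.slope_ * x for x in xs]


class MiniPipeline:
    """Applies every transformer in order, then hands the data to the model."""

    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def fit(self, xs, ys):
        for transformer in self.transformers:
            xs = transformer.fit(xs, ys).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        for transformer in self.transformers:
            xs = transformer.transform(xs)  # reuse what was learned during fit
        return self.model.predict(xs)


pipeline = MiniPipeline([FillMissing()], ProportionalModel())
pipeline.fit([1, None, 3], [2, 4, 6])
print(pipeline.predict([None, 5]))  # [4.0, 10.0]
```

The point of the design is that new data passes through exactly the same preprocessing as the training data, using the values learned during fitting.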

Preprocessing Steps

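ColumnTransformer is a class used to bundle preprocessing steps together.
The example below imputes missing values in the numerical columns, and imputes and one-hot encodes the categorical columns. It assumes numerical_cols and categorical_cols are lists of column names defined beforehand.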

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
 
# preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
 
# preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
 
# bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

Model

Of course you need a model to train.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

Create and Evaluate the Pipeline

Then the Pipeline class is used to define a pipeline. Using this pipeline, the preprocessing and fitting can be done in a single line of code, which makes it very readable and easy to use.

from sklearn.metrics import mean_absolute_error
 
# bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
 
# preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
 
# preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
 
# evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

Cross-Validation

Cross-validation is a technique where the modeling process is repeated on different subsets of the data.
The data is split into multiple subsets called "folds". E.g. 4 folds, which hold 25% each of the full data.
Each of the folds is then used once as the validation, and the other 3 times as the training set.

Advantage: Accurate measure of model quality
Disadvantage: Takes long to run

Use it for small datasets, where the extra computation takes a couple of minutes or less. With larger datasets a single validation set is usually sufficient, since there is already enough data and no need to re-use some of it.
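How the folds rotate can be sketched with plain row indices. The function name is made up, and this version does not shuffle the rows first, which real data often requires.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, valid_indices); each fold is used once for validation."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # the first `remainder` folds get one extra row
        stop = start + fold_size + (1 if fold < remainder else 0)
        valid = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, valid
        start = stop


for train, valid in k_fold_indices(n_rows=8, n_folds=4):
    print('train:', train, 'valid:', valid)
# train: [2, 3, 4, 5, 6, 7] valid: [0, 1]
# train: [0, 1, 4, 5, 6, 7] valid: [2, 3]
# train: [0, 1, 2, 3, 6, 7] valid: [4, 5]
# train: [0, 1, 2, 3, 4, 5] valid: [6, 7]
```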

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
 
my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])
 
# -1 since sklearn calculates negative MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
 
print("MAE:\n", scores)

XGBoost

Gradient boosting is a very successful algorithm. This method goes through cycles and iteratively adds models to an ensemble.
The cycle looks like this:

An ensemble of models generates predictions for the dataset.
Loss function is calculated based on those predictions.
A new model gets fit based on the loss function.
This new model gets added to the ensemble.
Process is repeated.
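The cycle can be sketched from scratch for squared-error loss, where "fitting a model based on the loss" boils down to fitting it to the remaining errors (residuals). This is plain gradient boosting with single-split trees on one feature, not XGBoost's actual algorithm, and all names and data are invented.

```python
def fit_stump(xs, targets):
    """Single-split regression tree on one feature; returns a prediction function."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        error = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def fit_boosted(xs, ys, n_estimators, learning_rate=0.1):
    base = sum(ys) / len(ys)  # the initial ensemble just predicts the mean
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_estimators):
        # ensemble predicts, errors are measured
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        # a new model is fit to those errors and added to the ensemble
        stumps.append(fit_stump(xs, residuals))
    return predict


def training_mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]

one_cycle = fit_boosted(xs, ys, n_estimators=1)
many_cycles = fit_boosted(xs, ys, n_estimators=50)
print(training_mse(one_cycle, xs, ys))
print(training_mse(many_cycles, xs, ys))  # lower: each cycle corrects remaining errors
```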

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
 
model = XGBRegressor()
model.fit(X_train, y_train)
 
predictions = model.predict(X_valid)
print("MAE: " + str(mean_absolute_error(y_valid, predictions)))

Parameters

n_estimators

Defines the number of cycles, i.e. how many models are added to the ensemble. Too low a value causes underfitting, too high a value causes overfitting.

model = XGBRegressor(n_estimators=500)

(Typically values from 100-1000.)

early_stopping_rounds

Enables the model to find the ideal value for n_estimators by stopping early when the scores stop improving.

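# newer XGBoost releases take early_stopping_rounds in the constructor;
# older ones accepted it as an argument to fit()
model = XGBRegressor(n_estimators=500, early_stopping_rounds=10)
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          verbose=False)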

This can be combined with a higher n_estimators value to find the optimal cycle amount.
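The stopping rule itself is simple enough to sketch in pure Python. The function name and the validation errors are hypothetical; XGBoost's internal bookkeeping may differ in its details.

```python
def best_round(validation_errors, early_stopping_rounds):
    """Return the 1-based round with the lowest error, stopping once the error
    has not improved for `early_stopping_rounds` consecutive rounds."""
    best_error = float('inf')
    best_index = 0
    for index, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error, best_index = error, index
        elif index - best_index >= early_stopping_rounds:
            break  # no improvement for too long: stop adding models
    return best_index


# hypothetical validation MAE after each boosting round
errors = [50, 40, 33, 30, 29, 29.5, 30, 31, 28, 27]
print(best_round(errors, early_stopping_rounds=3))  # 5
print(best_round(errors, early_stopping_rounds=5))  # 10
```

A small value stops quickly but can miss later improvements (here rounds 9 and 10), while a larger value is more patient at the cost of extra training time.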

learning_rate

Predictions from each model are not simply added up; each is first multiplied by a small number called the learning rate.
Therefore each tree has a smaller effect on the predictions, which can help prevent overfitting.

A small learning rate combined with a larger number of estimators generally produces more accurate models, but training takes longer because more iterations are needed.

model = XGBRegressor(n_estimators=500, learning_rate=0.05) # smaller than the library default

n_jobs

This parameter has no effect on the resulting model, but it can be used to speed up the training process. With large datasets the runtime can be decreased by using parallel processing. On small datasets, this will not have a noticeable impact.

model = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=4)

Data Leakage

It can happen that the model performs great on the training and even the validation data, but badly in production.
This might be caused by data leakage. It happens when the training set contains information about the target which will not be available when the model is used on real data.

Model Types

Supervised learning models can be categorized into two main types:

Classification: Categorical outputs - multiclass or binary
Regression: Continuous outputs - values

Classification

k-Nearest Neighbors (kNN)

Assumption: Objects that are near each other are similar.

Classifies a data point based on its nearest neighbors, using a distance function such as Euclidean or city block (Manhattan) distance.

KNN decision surface animation By Paolo Bonfini - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=150465667
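A minimal kNN classifier fits in a few lines of pure Python. The sketch assumes Euclidean distance and a simple majority vote; the points and labels are made up.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

print(knn_predict(points, labels, query=(2, 2)))  # cat
print(knn_predict(points, labels, query=(6, 5)))  # dog
```

There is no training step: the "model" is the stored data, and all the work happens at prediction time.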

Naive Bayes

Assumption: Each feature is independent of the other features, given the class.

Classifies a data point as the class with the highest probability, calculated by combining the probabilities of all its features.

By Sharpr for svg version. original work by kakau in a png - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=44059691

Discriminant Analysis

Regression

Linear Regression

Assumption: Target value is a linear combination of the features.

Uses a linear function for predicting.

By Krishnavedala - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15462765
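For a single feature, the best-fitting line can be computed directly with the ordinary least squares formulas. The function name and the data are invented for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


rooms = [1, 2, 3, 4]
prices = [150, 250, 350, 450]  # made-up values that lie exactly on a line

slope, intercept = fit_line(rooms, prices)
print(slope, intercept)       # 100.0 50.0
print(slope * 5 + intercept)  # 550.0
```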

Nonlinear Regression
Generalized Linear Model
Gaussian Process Regression (GPR)

Classification or Regression

Support Vector Machine

Assumption: 2 classes can be separated by a divider.

Dataset is divided using a hyperplane, which should linearly separate the classes and have the largest margin between them.

By User:ZackWeinberg, based on PNG version by User:Cyc - This file was derived from: Svm separating hyperplanes.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22877598

Neural Network

Assumption: A structure inspired by the human brain can relate the inputs to desired predictions.

A network consisting of interconnected and layered nodes/neurons is trained by iteratively modifying the connection strengths.

By Mikael Häggström, M.D. - Own work. Reference: Ferrie, C., & Kaiser, S. (2019) Neural Networks for Babies, Sourcebooks ISBN: 1492671207., CC0, https://commons.wikimedia.org/w/index.php?curid=137892223

Decision Tree
Ensemble Trees
Generalized Additive Model (GAM)
