Machine Learning

Date: August 2024
Reading time: 15 minutes


This serves as a collection of notes, examples, and tutorials on machine learning. As I work through the material, I will be documenting my learning process here. It is intended to be a helpful resource for others who are learning machine learning as well. The content is still in the early stages of development and will continue to be expanded in the future.

Running Example

Visit my Google Colab example to see some of the supervised learning models in action.

Definitions

ML: Machine learning is a type of artificial intelligence that enables computers to learn from data. It focuses on algorithms that learn without the need for explicit programming.

Fields

  • AI (Artificial Intelligence): Enables computers to perform human-like tasks and behaviors.
  • ML (Machine Learning): A branch of artificial intelligence that enables computers to learn from data and improve their performance without being explicitly programmed.
  • DS (Data Science): Draws insights from data; may use ML.

Learning Types

  • Supervised Learning: Uses inputs with corresponding outputs to train - labeled data. E.g. picture A is a dog, picture B a cat.
  • Unsupervised Learning: Learns patterns and finds structures to cluster - unlabeled data. E.g. pictures A, D and E have something in common.
  • Reinforcement Learning: An agent takes actions in an interactive environment in order to maximize a reward. It learns by trial and error from the feedback (rewards and penalties) it receives. E.g. this chess move was successful, maybe use it again next time.

Common File Formats

  • CSV: Tabular with a header row - id,type,quantity\n0,books,3\n1,pens,5
  • JSON: Tree-like with multiple layers - [{"id": 0, "type": "books", "quantity": 3}, {"id": 1, "type": "pens", "quantity": 5}]

Structure

  • Process: Input -> Model -> Output

Terms

  • Input/Feature/Feature Vector: Data which is fed into the model.
  • Model: Function which takes input data and gives a prediction.
  • Output/Prediction: Prediction made by the model, based on the input data.
  • Labels/Results: True values associated with the input data.
  • Fitting: Process of adjusting the model's parameters to best explain the data.
  • Training/Learning: To learn, the labels (results) are extracted from the data. Then each row is fed into the model, which comes up with a prediction. This prediction is compared with the true value. Based on the loss - the difference between prediction and actual result - the model makes adjustments. That is what is called training.
  • Training Data: Data used to fit the model.
  • Loss: The difference between prediction and actual label. How far is the output from the truth? The smaller the loss, the better the model performs.
  • Accuracy: Indicates what proportion of the predictions were correct.
  • MAE: Mean Absolute Error - the average of the absolute differences between predictions and actual values.
  • X: Input data - features.
  • y: Labels/results - true values associated with the input data.
  • X_train: Training data - input data used to fit the model.
  • X_valid: Validation data - input data used to assess the model's performance.
  • y_train: Training labels - true values associated with the training data.
  • y_valid: Validation labels - true values associated with the validation data.

Input Types

  • Qualitative: Finite number of categories or groups
    • Nominal data: No inherent order
      E.g. Countries, which can be one-hot encoded (see the pandas sketch after this list):
      Switzerland -> [1, 0, 0]
      USA -> [0, 1, 0]
      Italy -> [0, 0, 1]
    • Ordinal data: Inherent order
      E.g. Age groups
      Baby is closer to child than adult
  • Quantitative: Numerical values
    • Continuous: Can be measured on a continuum or scale
      E.g. Temperature
    • Discrete: Result of counting
      E.g. Number of heads in a sequence of coin tosses
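
In pandas, nominal columns like the country example above can be one-hot encoded with get_dummies (a minimal sketch with made-up data):

import pandas as pd

df = pd.DataFrame({'Country': ['Switzerland', 'USA', 'Italy']})
print(pd.get_dummies(df, columns=['Country']))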

Output Types

  • Classification
    • Multiclass - e.g. Baseball/Basketball/Football
    • Binary - e.g. spam/not
  • Regression
    • Continuous values - e.g. Stock price

Loss Functions

A loss function is a mathematical function that measures the difference between the predicted output and the real output of a model. It is used to quantify the error between the expected and the actual results.
Here are three commonly used loss functions.

Mean Absolute Error (MAE)

The further off the prediction, the greater the loss:

L1 = (1/n) Σ |y_true - y_predicted|

Mean Squared Error (MSE)

Measures the average squared difference between the predicted and actual values.
Small misses are penalized only slightly, bigger ones much more heavily:

L2 = (1/n) Σ (y_true - y_predicted)²

Cross Entropy Loss

This loss function is used in classification problems where the target variable is categorical. It measures the difference between the predicted probabilities and the true class labels. For the binary case the formula is:

L3 = -(1/n) Σ [y_true * log(y_predicted) + (1 - y_true) * log(1 - y_predicted)]
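
As a quick illustration, all three losses can be computed with NumPy (a minimal sketch; y_true and y_predicted are assumed to be arrays of the same length, holding probabilities in the cross-entropy case):

import numpy as np

y_true = np.array([1, 0, 1, 1])
y_predicted = np.array([0.9, 0.2, 0.6, 0.8])

mae = np.mean(np.abs(y_true - y_predicted))
mse = np.mean((y_true - y_predicted) ** 2)
cross_entropy = -np.mean(y_true * np.log(y_predicted) + (1 - y_true) * np.log(1 - y_predicted))

print(mae, mse, cross_entropy)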

Model

Basic Concept (Decision Tree)

A real estate agent might say that they estimate house prices by intuition. On closer inspection, you can see that they recognize price patterns and use them to price new houses.
Machine learning works the same way.

The decision tree is one of many models.
It may not predict as accurately as others, but it is easy to understand and is the building block for some of the best models in data science.

This example groups houses into two groups (with price predictions).

    -----------------------> $1,100,000
  /
 / 3 or more
* bedrooms
 \ less than 3
  \
    -----------------------> $750,000

Training data is then used to fit the model, which means training/adjusting it so that the splits and leaf prices are as optimal as possible.
After that, it can be used to predict the prices of new houses.

The predictions of the tree above would be rather coarse.
By using deeper trees (with more splits) you can capture more factors.

                                  ----> $1,300,000
                                /
                               / yes
      ----------------------- * larger than 11,500 square feet lot size
    /                          \ no
   /                            \
  /                               ----> $750,000
 / 3 or more
* bedrooms
 \ less than 3
  \                               ----> $850,000
   \                            /
    \                          / yes
      ----------------------- * larger than 8,500 square feet lot size
                               \ no
                                \
                                  ----> $400,000

The points at the ends of the branches, where the predictions are made, are called leaves.
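
The same deeper tree could be written as plain if/else logic. This sketch simply mirrors the diagram above (thresholds and prices taken from it):

def predict_price(bedrooms, lot_size_sqft):
    # first split: number of bedrooms
    if bedrooms >= 3:
        # second split: lot size
        return 1_300_000 if lot_size_sqft > 11_500 else 750_000
    # smaller houses: split on a smaller lot size threshold
    return 850_000 if lot_size_sqft > 8_500 else 400_000

print(predict_price(bedrooms=4, lot_size_sqft=12_000))  # -> 1300000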

Pandas

Pandas is the main tool for data scientists to explore and manipulate data. Pandas is often abbreviated as pd.

import pandas as pd

The library has powerful methods for most things that need to be done with data.
Its most important part is the DataFrame.

Print Data Summary

data_src = '../input/some-data.csv'
data = pd.read_csv(data_src)
data.describe()

Such a table can be interpreted like so:

  • count: Number of non-null values
  • mean: Average
  • std: Standard deviation (how numerically spread out the values are)
  • min: Minimum value
  • 25%: First quartile (25th percentile)
  • 50%: Second quartile (50th percentile / median)
  • 75%: Third quartile (75th percentile)
  • max: Maximum value

Print First 5 Rows

data.head()

Prediction Target

Using dot notation, you can select the prediction target (the column you want to predict).
By convention it is called y.

y = data.Price

Choosing Features

You could use all columns except the target as features, but sometimes you'll be better off with fewer features.
These can be selected using a feature list.

features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = data[features]

Building Model (Decision Tree)

scikit-learn is a popular library for modeling the data (typically stored in DataFrames).
Follow the steps below to create a model.

  1. Define the type of model and its parameters
  2. Fit it by capturing patterns from the data
  3. Predict the results
from sklearn.tree import DecisionTreeRegressor
 
# 1
model = DecisionTreeRegressor(random_state=1) # random_state for consistent outcome across calls
 
# 2
model.fit(X, y)
 
# 3
prediction = model.predict(X)

Validating Model

The fourth step is to evaluate the model by inspecting the prediction accuracy.
Mean Absolute Error (MAE) is one of many metrics to determine a model's quality.
The error of a single prediction is error = actual - predicted; the MAE is the average of the absolute errors.

So the metric shows how much the predictions are off on average.

from sklearn.metrics import mean_absolute_error

# 4
mean_absolute_error(y, prediction)

Data Splitting

A model shouldn't be trained and evaluated on all of the data, because that way you couldn't know how it performs on unseen data. It might perform great on the training data, because it has seen it over and over again, but make bad predictions on new information.
To assess how well the model can generalize, the data is usually split into three datasets:

  • Training: Model improves by calculating the loss and learning from it
  • Validation: Acts as a reality check, to see if the model can handle unseen data. The loss doesn't get fed back into it.
  • Testing: Last check on how the final chosen model performs, based on the loss

This is how the data can be split into two pieces (training and validation):

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor()

# fit using training data
model.fit(train_X, train_y)

# predict validation data
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
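
To also get a held-out test set, train_test_split can be applied twice (a minimal sketch; the 60/20/20 split ratios are just an example):

from sklearn.model_selection import train_test_split

# first split off the test set, then split the rest into training and validation
train_val_X, test_X, train_val_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(train_val_X, train_val_y, test_size=0.25, random_state=0)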

Underfitting and Overfitting

There are two problems that can decrease a model's accuracy on future predictions.

  • Overfitting: The model is tuned so precisely to the training set that it captures patterns that won't recur in the future
  • Underfitting: The model fails to capture relevant patterns

The sweet spot for a decision tree can be found by testing it with different sizes (here controlled via max_leaf_nodes):

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
 
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
 
for max_leaf_nodes in [10, 100, 1000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Random Forest

A single decision tree often does not perform well because of under- or overfitting. Other models face the same problem, but many of them come with ideas that improve performance. One example is the random forest.

A random forest model uses many trees and averages their predictions in order to make a much more accurate prediction than a single tree could.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
 
model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)
melb_preds = model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

Choosing a Model

Using a helper function, you can try out different models and compare their scores; the list of candidate models below is just an example.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# candidate models to compare (this list is just an example, any estimators could go here)
models = [
    RandomForestRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

for i, model in enumerate(models):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i + 1, mae))

Missing Values

There are multiple approaches to deal with missing values.

  • Drop: Drop columns with missing values.
    Benefit: Easy to implement. Disadvantage: Loses access to a lot of potentially useful information.
  • Imputation (possibly better than dropping): Fill in the missing values with some number.
    Benefit: Leads to more accurate models. Disadvantage: The imputed value won't be exactly right in most cases.
  • Imputation Extension (standard approach): Impute missing values and add a new column that marks where values were imputed.
    Benefit: Can meaningfully improve results. Disadvantage: More complex.

Examples

Don't forget to always adjust both the training and the validation DataFrames.
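
The examples below call a score_dataset helper, which is not defined in these notes. A minimal sketch (assuming a random forest as the scoring model) could look like this:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # fit a random forest and return the validation MAE
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)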

Drop

# get names of cols with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
 
# drop those columns
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
 
print("Drop MAE:")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

Imputation

from sklearn.impute import SimpleImputer
 
# imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
 
# re-add column names (imputation removed them)
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
 
print("Imputation MAE:")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

Imputation with Extension

# make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
 
# make new cols indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
 
# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
 
# re-add column names (imputation removed them)
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
 
print("Extensive Imputation MAE:")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

Categorical Variables

Categorical variables take on a limited number of values, like "Bad", "OK", "Good", or "Great".
There are three approaches for handling them.

Drop

Simply removing the columns from the dataset is the easiest approach, but it will not work well if the dropped columns contain useful information.

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
 
print("Drop MAE:")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

Ordinal Encoding

This encoding replaces each unique value with a different integer, which is a useful approach for ordinal variables:
Bad (0) < OK (1) < Good (2) < Great (3)

from sklearn.preprocessing import OrdinalEncoder

# categorical columns (dtype "object" in pandas)
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# apply ordinal encoder to each col with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
 
print("Ordinal MAE:")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
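
By default, OrdinalEncoder assigns the integers in alphabetical order of the category names. For a truly ordinal scale, the order can be fixed explicitly (a small sketch using a hypothetical 'Quality' column):

from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordinal column with an explicit order
quality_order = [['Bad', 'OK', 'Good', 'Great']]
quality_encoder = OrdinalEncoder(categories=quality_order)
label_X_train[['Quality']] = quality_encoder.fit_transform(X_train[['Quality']])
label_X_valid[['Quality']] = quality_encoder.transform(X_valid[['Quality']])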

One-Hot Encoding

One-hot encoding creates new columns to represent each unique value in the original data. This encoding typically performs best.

Before:

Color
Black
White
Blue
White
Blue

After:

Black  White  Blue
1      0      0
0      1      0
0      0      1
0      1      0
0      0      1

from sklearn.preprocessing import OneHotEncoder
 
# apply one-hot encoder to each col with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)  # sparse=False returns a dense array (renamed to sparse_output in newer scikit-learn versions)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
 
# re-add index (one-hot encoding removed it)
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
 
# remove categorical cols (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
 
# add one-hot encoded cols to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
 
# ensure all column names have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
 
print("One-Hote Encoding MAE:")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

Pipelines

Pipelines are a way to chain multiple data transformation and model steps together.
The data flows through the pipeline and the steps are applied in order.

Preprocessing Steps

ColumnTransformer is a class used to bundle preprocessing steps together.
The example below imputes missing values in numerical columns and applies one-hot encoding to categorical data.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# select categorical and numerical columns
categorical_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

# preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

Model

Of course you need a model to train.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

Create and Evaluate the Pipeline

Then the Pipeline class is used to define a pipeline. Using this pipeline, the preprocessing and fitting can be done in a single line of code, which makes it very readable and easy to use.

from sklearn.metrics import mean_absolute_error
 
# bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
 
# preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
 
# preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
 
# evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

Cross-Validation

Cross-validation is a technique where the modeling process is repeated on different subsets of the data.
The data is split into multiple subsets called "folds", e.g. 4 folds that each hold 25% of the full data.
Each fold is then used once as the validation set, while the remaining folds form the training set.

Advantage: Accurate measure of model quality
Disadvantage: Takes long to run

Use it for small datasets, where a full run takes only a couple of minutes or less. If you already have enough data, there is less need to re-use parts of it.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
 
my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])
 
# -1 since sklearn calculates negative MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
 
print("MAE:\n", scores)

XGBoost

Gradient boosting is a very successful method that goes through cycles and iteratively adds models to an ensemble. XGBoost (extreme gradient boosting) is a popular implementation of it.
The cycle looks like this:

  1. An ensemble of models generates predictions for the dataset.
  2. A loss function is calculated based on those predictions.
  3. A new model is fit based on the loss function.
  4. This new model is added to the ensemble.
  5. The process is repeated.
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
 
model = XGBRegressor()
model.fit(X_train, y_train)
 
predictions = model.predict(X_valid)
print("MAE: " + str(mean_absolute_error(predictions, y_valid)))

Parameters

n_estimators

Defines the number of cycles, i.e. how many models end up in the ensemble (be aware of under-/overfitting).

model = XGBRegressor(n_estimators=500)

(Typical values range from 100 to 1000.)

early_stopping_rounds

Enables the model to find the ideal value for n_estimators by stopping early when the scores stop improving.

model.fit(X_train, y_train,
          early_stopping_rounds=10,
          eval_set=[(X_valid, y_valid)],
          verbose=False)

This can be combined with a higher n_estimators value to find the optimal cycle amount.

learning_rate

Predictions from each model are not simply added up, but multiplied by a small number called the learning rate before being added.
This way each tree has a smaller effect on the predictions, which can help prevent overfitting.

Small learning rates create more accurate models, but have longer training times due to the higher amount of iterations.

model = XGBRegressor(n_estimators=500, learning_rate=0.05) # default = 0.1

n_jobs

This parameter has no effect on the resulting model, but it can be used to speed up training. With large datasets, the runtime can be decreased by using parallel processing; on small datasets it will not have a noticeable impact.

model = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=4)

Data Leakage

It can happen that a model performs great on the training and even the validation data, but badly in production.
This might be caused by data leakage, which happens when the training data contains information about the target that will not be available when the model is used on real data.
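
A hypothetical illustration: if a column is only filled in after the target is known, it leaks the answer into the features and should be dropped before training (the column name here is made up):

# 'refund_issued' would only be recorded after an order has been flagged as
# fraudulent (the target), so it would not exist at prediction time - drop it
leaky_cols = ['refund_issued']
X_train = X_train.drop(leaky_cols, axis=1)
X_valid = X_valid.drop(leaky_cols, axis=1)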

Model Types

Supervised learning models can be categorized into two main types:

  • Classification: Categorical outputs - multiclass or binary
  • Regression: Continuous outputs - values

Classification

k-Nearest Neighbors (kNN)

Assumption: Objects that are near each other are similar.

Categorizes a data point based on its nearest neighbors, using a distance function such as Euclidean or city block (Manhattan) distance.
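
A minimal scikit-learn sketch (assuming X_train/y_train hold labeled classification data rather than the house prices used above):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # 5 nearest neighbors, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict(X_valid))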

[Image: kNN decision surface animation - Paolo Bonfini, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=150465667]

Naive Bayes

Assumption: A class's features are independent of each other.

Classifies based on the highest probability of belonging to a class, calculated from the probabilities of all features.
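
A minimal sketch with scikit-learn's GaussianNB (assuming classification data in X_train/y_train):

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()  # assumes each feature follows a Gaussian distribution per class
nb.fit(X_train, y_train)
print(nb.predict_proba(X_valid))  # per-class probabilities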

[Image: Naive Bayes classification - Sharpr (SVG version), original work by kakau, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=44059691]

More

  • Discriminant Analysis

Regression

Linear Regression

Assumption: Target value is a linear combination of the features.

Uses a linear function for predicting.
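
A minimal sketch with scikit-learn's LinearRegression (the house-price data from earlier would fit here, since the target is continuous):

from sklearn.linear_model import LinearRegression

lin = LinearRegression()
lin.fit(X_train, y_train)
print(lin.coef_, lin.intercept_)  # learned linear combination of the features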

[Image: Linear regression fit - Krishnavedala, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15462765]

More

  • Nonlinear Regression
  • Generalized Linear Model
  • Gaussian Process Regression (GPR)

Classification or Regression

Support Vector Machine

Assumption: 2 classes can be separated by a divider.

The dataset is divided using a hyperplane that linearly separates the classes with the largest possible margin between them.
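
A minimal sketch with scikit-learn's SVC (assuming two-class data in X_train/y_train):

from sklearn.svm import SVC

svm = SVC(kernel='linear')  # maximum-margin hyperplane with a linear kernel
svm.fit(X_train, y_train)
print(svm.predict(X_valid))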

[Image: SVM separating hyperplanes - ZackWeinberg, based on PNG version by Cyc, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22877598]

Neural Network

Assumption: A structure inspired by the human brain can relate the inputs to desired predictions.

A network of interconnected, layered nodes (neurons) is trained by iteratively modifying the connection strengths.
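
A minimal sketch with scikit-learn's MLPClassifier (assuming classification data; the layer sizes are just an example):

from sklearn.neural_network import MLPClassifier

nn = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
nn.fit(X_train, y_train)
print(nn.predict(X_valid))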

[Image: Neural network - Mikael Häggström, M.D., CC0, https://commons.wikimedia.org/w/index.php?curid=137892223]

More

  • Decision Tree
  • Ensemble Trees
  • Generalized Additive Model (GAM)