Est. read time: 7 minutes | Last updated: July 17, 2024 by John Gentile



“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” (Tom Mitchell, 1997)

Machine Learning (ML) is all about making digital logic get better at some specific task or problem area by learning from data, rather than through the traditional process of explicitly coding up rules or deriving closed-form algorithms to find a solution. There are many different types of ML systems and learning models, but across all of them, gathering good data is key.

These ML notes are based on the book Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow. Supporting notebooks with the datasets and example code used can also be found in the author’s GitHub repo ageron/handson-ml2.

When to Use Machine Learning

todo…

Taxonomy of Machine Learning Systems

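ML systems can be broadly categorized by how much supervision they get during training (supervised, unsupervised, semisupervised, or reinforcement learning), whether they can learn incrementally (batch vs. online learning), and how they generalize to new data (instance-based vs. model-based learning):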

Supervised Learning

Here the training dataset you feed the ML model includes the desired solutions/outcomes which are called labels. Some typical applications of Supervised Learning are:

  • Classification: many image classification tasks fit this model, such as passing in a training set of images, each labeled with a specific class (like “cat”, “dog”, “giraffe”, etc.), so the model learns to assign a class to new images.
  • Regression: for example, a model can be given a set of features/attributes like mileage, age, brand, etc. (called predictors) to predict a target numeric value such as a car’s price. The label is the known price for each configuration of predictors in the training set.
  • Logistic Regression: outputs a value, but can be used for classification, for instance outputting a probability of an input sample belonging to a given class.
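To make the last bullet a bit more concrete, here is a minimal sketch of how a logistic regression model turns a linear score into a class probability and then into a class decision. The weights, bias, sample values, and the function names `sigmoid`, `predict_proba`, and `predict_class` are made up for illustration; nothing here is trained on real data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability that the sample belongs to the positive class."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a hard 0/1 class decision."""
    return int(predict_proba(features, weights, bias) >= threshold)

# Hypothetical, hand-picked parameters (an assumption, not learned from data).
weights = [1.5, -2.0]
bias = 0.25

sample = [2.0, 0.5]
print(predict_proba(sample, weights, bias))   # a value between 0 and 1
print(predict_class(sample, weights, bias))   # 0 or 1
```

In a real supervised setup the weights and bias would be learned from the labeled training set rather than chosen by hand.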

Unsupervised Learning

Here the training dataset fed to the ML model is unlabeled, so the system must learn without an explicit “teacher”. Some examples of unsupervised learning applications are:

  • Clustering: here the model will try to “cluster”, or group, similar instances for you. If using a hierarchical clustering algorithm, it may also subdivide each group into smaller groups.
  • Visualization: these algorithms output a 2D or 3D representation of your unlabeled dataset that can be plotted, while trying to preserve its clustering structure.
  • Dimensionality Reduction: here the goal is to simplify an input dataset without losing too much information, for example by learning to merge correlated features into one, which is called feature extraction. Note that applying dimensionality reduction to very complex training data is often a good idea before feeding that dataset into another ML algorithm; the feature extraction can help to reduce the dataset size (e.g. to run faster) and may even improve performance of the ML algorithm.
  • Anomaly Detection: for applications like fraud detection (or anything where we want to detect “unusualness” in a dataset), we show mostly “normal” instances during training so when a “new instance” is seen, it can tell whether it looks like a normal one or an anomaly.
  • Association Rule Learning: here the goal is for the algorithm to discover new/interesting relations between attributes in a dataset.
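As a small illustration of the clustering bullet above, the following sketch runs plain k-means (one common clustering algorithm, chosen here as an example) on a handful of unlabeled 2D points. The data points and the choice of the first two points as starting centroids are assumptions made for the example.

```python
def kmeans(points, centroids, n_iters=10):
    """Plain k-means on 2D points, starting from the given centroids."""
    clusters = []
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data: two loose groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial = [points[0], points[1]]

final_centroids, final_clusters = kmeans(points, initial)
for centroid, cluster in zip(final_centroids, final_clusters):
    print(centroid, cluster)
```

No labels are passed in at any point; the grouping comes only from distances between the points.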

Semisupervised Learning

Labeling data can be a costly activity, so some semisupervised learning algorithms can deal with partially labeled datasets. Many are a combination of supervised and unsupervised learning algorithms. For instance, in Google Photos you can upload a bunch of pictures and the unsupervised part of the system will cluster faces/people; then, when you label a person (technically the supervised part), it is able to apply that label across any other photos it has already clustered. This is commonly used for searching for people across otherwise unlabeled photos.

Reinforcement Learning

Reinforcement Learning (RL) is very different from the (un)supervised learning methods; in this system, the learning model, called an agent, can observe the given environment, perform actions, and then get rewards (positive reinforcement) or penalties (negative reinforcement) in return. Based on the reinforcement, the system must then learn the best strategy, called a policy, to maximize rewards over time; a policy defines which action the agent should choose in a given situation.

RL is often used to teach robots how to walk, and famously it is how DeepMind’s AlphaGo program learned to play the complicated game of Go:

from IPython.display import IFrame
IFrame('https://www.youtube.com/embed/WXuK6gekU1Y', width=560, height=315)
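The agent/environment/reward/policy loop described above can be sketched with tabular Q-learning, one common RL algorithm (the notes do not name a specific one, so this is a swapped-in example). The toy environment is an assumed five-cell corridor where the agent starts in cell 0 and is rewarded for reaching cell 4; all names and hyperparameter values are illustrative.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment: move one cell; +1 reward at the goal, small penalty otherwise."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    if nxt == N_STATES - 1:
        return nxt, 1.0, True
    return nxt, -0.01, False

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values Q[state][action] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = q[state].index(max(q[state]))
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update toward reward plus discounted best future value.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q

q_table = train()
# The learned policy: the best-valued action in each non-terminal cell.
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)
```

The policy here is just a lookup from situation (cell) to action, which is the sense of "policy" used in the text.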

Batch Learning

A system that requires batch learning cannot learn incrementally; it must be trained with all available data. Since this is compute/time intensive, the system is first trained “offline” (that is, not in fielded/production use), and then once trained, it is deployed to production where it runs without learning anymore. This is also known as offline learning. The implication of a batch learning system is that if any new data/features need to be learned, a new version of the model must be trained from scratch on the full dataset (old and new data).

Online Learning

Conversely to a batch learning system, an online learning system can learn “online”; that is, after it has been initially trained and deployed, the system can continue to learn as new data comes in. This is useful for systems that need to adapt quickly to a continuous flow of data, such as a stock price predictor model.

Since the model can continuously learn, once data has gone through the system, it can be discarded (which saves lots of memory space). This also means online learning can be used with massive datasets that don’t fit in one machine’s main memory all at once, which is called out-of-core learning.

The learning rate is an important parameter for online systems; a faster learning rate will rapidly adapt to new data but conversely forget old data faster and be more sensitive to noise or outliers in the data set; this is very similar to loop bandwidth in Control Theory.

Since new data constantly changes the system’s response and actions, one needs to be careful that bad data fed to a deployed online system does not degrade its performance; monitoring the input data, for example with an anomaly detection algorithm, makes it possible to react to abnormal data.
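A minimal sketch of the online learning idea, assuming a one-feature linear model updated by stochastic gradient descent: each sample is used for a single update and then dropped, and `learning_rate` controls how strongly each new sample moves the parameters. The class `OnlineLinearModel`, its `learn_one` method, and the synthetic stream (y = 2x + 1) are hypothetical names and data for illustration.

```python
import random

class OnlineLinearModel:
    """y = w * x + b, updated one sample at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        """Single gradient step on the squared error of one sample."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error

def data_stream(n, seed=0):
    """Hypothetical stream of samples drawn from the line y = 2x + 1."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.random()
        yield x, 2.0 * x + 1.0

model = OnlineLinearModel(learning_rate=0.1)
for x, y in data_stream(5000):
    model.learn_one(x, y)   # the sample is not stored after this update

print(round(model.w, 3), round(model.b, 3))
```

Raising `learning_rate` makes the model follow recent samples more closely, which is the adaptability-versus-noise trade-off described above.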

Instance-Based Learning

In instance-based learning, the model is trained on a discrete set of data/features that the system “learns by heart”, and then generalizes (e.g. makes predictions) to new cases by using a measure of similarity between the new case and the learned data. This is the most basic form of learning.
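One simple instance-based method is k-nearest neighbors, sketched below under the assumption that similarity is measured by Euclidean distance. The training points, labels, and function name `knn_predict` are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k closest training points."""
    neighbors = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# "Learning by heart": the model is just the stored, labeled examples.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.0, 5.0)))
```

There is no separate training step here; all the work happens at prediction time, when the query is compared against the memorized instances.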

Model-Based Learning

A model-based learning system generalizes from a set of given input examples by building a “model” of these examples, which it later uses to make predictions. Similarly to how one would try to create a line-of-best-fit in a linear regression problem, a model-based learning system can use a set of model parameters to optimize/best-fit to non-linear problems. The measure of performance/fit can be defined with either a utility/fitness function (e.g. a function which tells the system how well the model is doing) or a cost function (the logical inverse, which measures how badly it’s doing); for instance, in the traditional linear regression problem, a cost function could be defined as the distance between the data points and the model’s predictive line, where training tries to minimize the total distance across all points.

The act of feeding training examples to a model and varying parameters to find a best fit (e.g. minimum cost) is called training. When a model is well trained, it will make good predictions. If it doesn’t make good predictions, one might need to use more attributes, get more/better quality input data, and/or select a different model. Applying a model to make predictions on new data is called inference.
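The training and inference steps above can be sketched for the simplest case, a straight line fit by ordinary least squares, with mean squared error as the cost function (one possible choice of "distance" between points and line). The data values and function names are assumptions for illustration.

```python
def mse(xs, ys, slope, intercept):
    """Cost function: mean squared vertical distance between points and the line."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Training: closed-form least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training examples (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print("cost at best fit:", mse(xs, ys, slope, intercept))
print("cost of a worse line:", mse(xs, ys, slope + 0.5, intercept))

# Inference: apply the trained model to a new input.
new_x = 6.0
print("prediction:", slope * new_x + intercept)
```

For models without a closed-form solution, the same cost function would instead be minimized iteratively, for example with gradient descent.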

Machine Learning Challenges

  • Not Enough Training Data: ML algorithms generally require a lot of data to learn, even for simple problems. It can be shown that accuracy goes up with more input data across algorithms, and that sometimes, with enough input data, the accuracy/performance of different algorithms converges. For instance, there are many examples and research papers that show that data can matter more than algorithms for complex problems. It’s thus often a trade-off between spending more time/cost on curating more valuable datasets and focusing on algorithm development.
  • Nonrepresentative Training Data: both instance-based and model-based learning require that the training data be representative of the data you expect to see/generalize to in deployment. If the sample dataset is too small, the model will suffer from sampling noise (nonrepresentative results due to chance), but even very large datasets can be bad if they do not accurately/fully represent the expected data (e.g. because of a flawed sampling method); this causes sampling bias, as the model is fit to only a subset of the cases that really exist in the wild.
  • Bad Quality Data: as a further extension of the importance of representative data, if your dataset is full of errors, outliers and noise (say due to just bad measurements), the system performance will be hindered. It’s another reason why spending the time to clean and groom your dataset is worth it; for instance, just discard instances that are clear outliers or fix errors manually.
  • Irrelevant Features: you want to include enough relevant features to properly train a system, but not so many irrelevant features that they degrade training. This process, called feature engineering, involves:
    • Feature Selection: selecting the most useful features to train on among all features.
    • Feature Extraction: combining existing features to produce a more useful one (for instance, the previously discussed dimensionality reduction algorithms).
    • Creating new features by gathering more representative data.

  • Overfitting: over-generalizing conclusions is something we humans do a lot, and an ML system can fall into the same trap, which is called overfitting; essentially, when a model is too complex relative to the amount and noisiness of the training data, the model (for example a deep neural net) can pick up on patterns in the noise and draw associations that do not hold for new data. To solve overfitting, one could:
    • Simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model. Constraining a model to make it simpler is called regularization. The amount of regularization can be controlled by a hyperparameter, which tunes the trade-off between fitting the training data closely and generalizing to new data. Note that simplifying a model too much can actually lead to underfitting, where the model’s predictions are inaccurate even on the training samples.
    • Gather more training data.
    • Reduce the noise in the training data by fixing data errors and removing outliers.
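To illustrate how a regularization hyperparameter constrains a model, here is a small sketch using ridge (L2) regularization on a one-parameter linear model with no intercept, where the closed-form solution is w = Σxy / (Σx² + alpha). Ridge is just one example of regularization; the data and the name `alpha` for the hyperparameter are assumptions.

```python
def ridge_weight(xs, ys, alpha):
    """Closed-form ridge solution for y = w * x; alpha = 0 is plain least squares."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)

def training_error(xs, ys, w):
    """Mean squared error of the model on the training samples."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.8, 6.1, 8.3]

for alpha in (0.0, 10.0, 1000.0):
    w = ridge_weight(xs, ys, alpha)
    print(alpha, round(w, 4), round(training_error(xs, ys, w), 4))
```

As `alpha` grows, the weight is pulled toward zero and the training error rises; a very large `alpha` pushes the model toward underfitting, while `alpha = 0` applies no constraint at all.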

ML System Testing & Validation

One way to test the performance of your ML system is to split your data into two sets: a training set and a test set. The model is trained on the training set only and then evaluated on the test set; the error rate on these new cases is called the generalization error. If the training error is low (the model makes few mistakes on the training set) but the generalization error is high (many mistakes on the test/new data set), the model is overfitting the training data.
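A minimal sketch of this procedure, assuming an 80/20 split and a 1-nearest-neighbor classifier (chosen because it memorizes its training set and so tends to show a gap between training and test error). The synthetic dataset with 20% flipped labels and all function names are assumptions for illustration.

```python
import math
import random

def make_data(n, rng):
    """Hypothetical noisy task: label is 1 when x + y > 1, flipped 20% of the time."""
    data = []
    for _ in range(n):
        point = (rng.random(), rng.random())
        label = int(point[0] + point[1] > 1.0)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((point, label))
    return data

def predict_1nn(train, query):
    """Return the label of the single closest training point."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]

def error_rate(train, dataset):
    """Fraction of samples in dataset that the 1-NN model gets wrong."""
    wrong = sum(predict_1nn(train, point) != label for point, label in dataset)
    return wrong / len(dataset)

rng = random.Random(42)
data = make_data(200, rng)
rng.shuffle(data)

split = int(0.8 * len(data))            # 80% training, 20% test
train_set, test_set = data[:split], data[split:]

train_error = error_rate(train_set, train_set)
test_error = error_rate(train_set, test_set)   # estimate of generalization error
print("training error:", train_error)
print("test error:", test_error)
```

Because each training point is its own nearest neighbor, the training error here is zero, while the test error is typically noticeably higher on data this noisy, which is the overfitting signature described above.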

Data Sources

There is a plethora of datasets one could use for ML: