## Debugging A Learning Algorithm

- Get more training examples -> fix high variance
- Try smaller sets of features -> fix high variance
- Try getting additional features -> fix high bias
- Try adding polynomial features -> fix high bias
- Try decreasing lambda -> fix high bias
- Try increasing lambda -> fix high variance
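A quick way to tell which of the fixes above applies is to compare training and validation error. The sketch below (synthetic quadratic data; all names and values are illustrative) shows the two signatures: high bias gives high error on both splits, while high variance gives low training error but much higher validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic data with noise.
x = rng.uniform(-1, 1, 60)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, 60)

x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def mse(degree):
    # Fit a polynomial of the given degree on the training split
    # and return (training error, validation error).
    coef = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    return train_err, val_err

# Degree 1 underfits (high bias): both errors are high.
# Degree 15 overfits (high variance): training error keeps dropping
# while validation error does not.
for d in (1, 2, 15):
    print(d, mse(d))
```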

## Parameter and Hyper-parameter Set

Machine learning problems can often be cast as non-convex optimization problems, which require a gradient descent algorithm to find a global/local optimum of the parameter set. Improper hyper-parameter values or improper initial parameter values will prevent gradient descent from working; for example, a large learning rate can produce dramatic parameter changes that increase the value of the cost function. In that case, the learning algorithm takes steps that are too big between parameter updates.

An effective and efficient learning algorithm requires smooth coordination between all parameter values (initial parameter values and hyper-parameter values).

- Large learning rates should be avoided
- Initial parameter values should be kept consistent (constant) across experiments for the convenience of debugging
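The divergence caused by an oversized learning rate can be seen on a toy quadratic cost; this is only a minimal sketch, not a real training loop.

```python
# Minimal gradient descent on f(w) = w^2 (minimum at w = 0),
# illustrating how an oversized learning rate makes the cost grow.
def descend(lr, steps=20, w0=1.0):
    w = w0
    for _ in range(steps):
        grad = 2 * w          # f'(w) = 2w
        w = w - lr * grad
    return w

# lr = 0.1 shrinks w toward 0 each step (w <- 0.8w).
# lr = 1.5 overshoots: each update flips the sign and doubles the
# magnitude (w <- -2w), so |w| and the cost blow up.
print(descend(0.1))
print(descend(1.5))
```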

## Learning Rate and Parameter Scale

A large parameter scale (weight scale) tends to require a smaller learning rate.

## Generalization Error Estimation

Training error, validation error, and test error can all serve as estimates of the generalization error of the model/hypothesis on the entire population. However, such an estimate might not be a fair estimate (an estimate with minimum bias). Also, we usually do not care about the generalization error of a model that fits the training set poorly.

## Hypothesis Evaluation

- Plot the hypothesis function and all the data points, and evaluate the hypothesis by manually inspecting the fit
- Evaluate the hypothesis function by the training/validation/testing error.

To evaluate the degree of fitness of the current model/hypothesis on our training set, we train (minimize the objective function) the model on the training set and evaluate the performance by the training error.

To evaluate the degree of fitness of the current model/hypothesis on our validation set, we train (minimize the objective function) the model on the training set and evaluate the performance by the evaluated error on the validation set.

To evaluate the degree of fitness of the current model/hypothesis on our test set, we train (minimize the objective function) the model on the training set and evaluate the performance by the evaluated error on the test set.

To select the best model/hypothesis on our training set, we train the model on the training set with different parameter sets, and select the model with the minimum training error.

To select the best model/hypothesis on our validation set, we train the model on the training set, and select the model (parameter set) with the minimum evaluated error on the validation set.

To estimate the generalization error of the best model/hypothesis selected on the validation set, we train the model on the training set, select the model (parameter set) with the minimum evaluated error on the validation set, and then report the performance as the evaluated error on the test set.
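The train/select/report pipeline described above can be sketched end to end; the data, splits, and candidate models below are all illustrative, with model form (polynomial degree) standing in for the parameter set being selected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: cubic signal plus noise, split 60/20/20 into
# training, validation, and test sets.
x = rng.uniform(-1, 1, 100)
y = x**3 - x + rng.normal(0, 0.05, 100)
x_tr, x_val, x_te = x[:60], x[60:80], x[80:]
y_tr, y_val, y_te = y[:60], y[60:80], y[80:]

def err(coef, xs, ys):
    # Mean squared error of a fitted polynomial on a split.
    return np.mean((np.polyval(coef, xs) - ys) ** 2)

# Fit one candidate model per degree on the training set only.
models = {d: np.polyfit(x_tr, y_tr, d) for d in range(1, 8)}

# Select the candidate with the minimum validation error...
best = min(models, key=lambda d: err(models[d], x_val, y_val))

# ...and report generalization performance on the untouched test set.
print(best, err(models[best], x_te, y_te))
```

Because the test set played no role in selection, its error is a fair estimate of the generalization error, whereas the (optimized) validation error is not.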

In the above process, we eventually want the minimum evaluated error on the test set for the model with the minimum validation error. We should only optimize the model based on feedback from the evaluated error on the validation set, and report (only) the performance of the model on the test set. The ultimate goal is to minimize the evaluated error on both the validation set and the test set.

In an absolute sense, the generalization error is the evaluated error of the model/hypothesis on the entire population. In other words, it is the fitness of the model on the entire population (as compared to the test set error, the fitness of the model on the test set).

Generalization error can be the objective of hypothesis/model optimization, but since the entire population is unavailable, the evaluated error on the validation set serves as the optimization goal instead. Because the model is selected to minimize the validation error, that error cannot be a fair estimate of the generalization error; the test set error of the model selected on the validation set, however, can be a fair estimate.

## Model Selection

Model selection refers to both the form of the hypothesis function and the parameter set of the hypothesis function. In fact, the choice of the form of the hypothesis function can be considered as a separate parameter of our hypothesis.

## Generalization Error

Generalization error = error caused by bias + error caused by variance + irreducible error
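For squared error, this decomposition can be written out explicitly (expectation taken over training sets D, with sigma-squared the irreducible noise):

```latex
% Bias-variance decomposition of expected squared error at a point x.
% \hat{f}_D is the hypothesis trained on dataset D; f is the true function.
\mathbb{E}_D\!\left[(y - \hat{f}_D(x))^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\bigr)^2\right]}_{\text{variance}}
  + \sigma^2
```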

## Feature Scaling

The purpose of feature scaling is to accelerate the speed of gradient descent convergence. Mean normalization is commonly used together with feature scaling.

Rule of thumb: scale all features into the range [-1, 1].
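A minimal sketch of mean normalization combined with range scaling (the feature values below are illustrative, e.g. house size vs. number of bedrooms):

```python
import numpy as np

# Mean normalization with range scaling: x' = (x - mean) / (max - min).
# Each column ends up with zero mean and values within [-1, 1].
def mean_normalize(X):
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Hypothetical feature matrix with very different scales.
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [852.0,  2.0]])
print(mean_normalize(X))
```

After scaling, both columns are comparable in magnitude, so gradient descent no longer zigzags along the large-scale feature's axis.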

## Learning Rate

Start from 0.0001, then increase by a factor of 10: 0.0001, 0.001, 0.01, 0.1, 1.

If that does not work, refine with a factor of roughly 3: 0.0001, 0.0003, 0.001, 0.003, ...
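The coarse-then-fine sweep described above can be scripted; here `final_cost` is a stand-in for whatever validation metric a real training loop would report, computed on the toy quadratic cost f(w) = w^2.

```python
# Sketch of a coarse-to-fine learning-rate sweep. `final_cost` is a
# hypothetical stand-in for the real training loop's final loss.
def final_cost(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w      # gradient of f(w) = w^2
    return w * w

coarse = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]          # multiply by 10
fine = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]     # multiply by ~3

# Pick the candidate rate with the lowest final cost.
best = min(coarse + fine, key=final_cost)
print(best)
```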

## Sampling Error

To avoid sampling error, the dataset should be drawn from the same distribution as the entire population.

## Taxonomy of Features

- Categorical
- Continuous
- Ordinal

## Feature Engineering for DNN

- Rescale bounded continuous features: rescale all bounded continuous inputs to [-1, 1] through x = (2x - max - min)/(max - min).
- Standardize all continuous features: for every continuous feature, compute its mean (u) and standard deviation (s) and set x = (x - u)/s.
- Binarize categorical/discrete features: represent every categorical feature as multiple boolean features. For example, instead of having one feature called marriage_status, have 3 boolean features - marriage_status_single, marriage_status_married, marriage_status_divorced - and appropriately set these features to 1 or -1. For every categorical feature, you are adding k binary features, where k is the number of values that the categorical feature takes.
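The binarization step can be sketched as a small one-hot encoder; for simplicity this sketch uses the common 0/1 convention rather than the 1/-1 variant mentioned above, and the category names are illustrative.

```python
# One-hot ("binarized") encoding of a categorical feature: one boolean
# column per category value, exactly one column set per row.
def one_hot(values, categories):
    return [[1 if v == c else 0 for c in categories] for v in values]

statuses = ["single", "married", "divorced", "married"]
cats = ["single", "married", "divorced"]
print(one_hot(statuses, cats))
```

So the single marriage_status column becomes k = 3 binary columns, one per category value.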