ML

Posted on Wed 10 January 2018 in Projects

ML Notebook

Supervised

model the relationship between measured features and known labels,
then apply that model to predict labels for new, unknown data

  • Classification: labels are discrete
    • Naive Bayes
    • Support Vector Machines
    • Decision Trees and Random Forests

feature 1, feature 2, etc. $\to$ normalized counts of important words or phrases ("Viagra", "Nigerian prince", etc.)
label $\to$ "spam" or "not spam"

  • Regression: labels are continuous

feature 1, feature 2, etc. $\to$ brightness of each galaxy at one of several wavelengths or colors.
label $\to$ distance or redshift of the galaxy
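A matching sketch for the regression case; the brightness values and their linear relation to redshift here are fabricated, only the array shapes matter:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# hypothetical features: brightness of 100 galaxies in 4 wavelength bands
X = rng.rand(100, 4)                        # shape [n_samples, n_features]
# hypothetical continuous label: redshift, loosely tied to the brightnesses
y = X @ [0.5, 1.0, -0.3, 0.2] + 0.05 * rng.randn(100)

model = LinearRegression().fit(X, y)
print(model.predict(X[:3]))                 # predicted redshifts for 3 galaxies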

Unsupervised

model features of a dataset without reference to any label
"let dataset speak for itself"

  • Clustering (ID groups)
    • k-Means (see the sketch after this list; note that kNN is a supervised classifier, not a clustering method)
  • Dimensionality reduction (more succinct representation of data)
  • etc.
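A minimal k-Means sketch on the Iris measurements (n_clusters=3 is chosen only because we happen to know there are three species):

import seaborn as sns
from sklearn.cluster import KMeans

iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)

# group the samples into 3 clusters using only the features, never the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
iris['cluster'] = kmeans.fit_predict(X)

# compare the discovered clusters against the (unused) species labels
print(iris.groupby(['cluster', 'species']).size())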

Semi-supervised

often useful when only incomplete labels are available
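Scikit-Learn's semi_supervised module covers this case; a hedged sketch with LabelPropagation, where unlabeled samples are marked -1 (the 70% masking fraction is arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# pretend most labels are missing: -1 marks "unlabeled"
rng = np.random.RandomState(0)
y_partial = np.where(rng.rand(len(y)) < 0.7, -1, y)

# propagate the few known labels through the feature space
model = LabelPropagation().fit(X, y_partial)
print((model.transduction_ == y).mean())  # fraction of labels recovered correctly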

Scikit-Learn

In [184]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
Out[184]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Features matrix, X, with shape [n_samples, n_features]

In [67]:
iris.shape
Out[67]:
(150, 5)

Label or target array, y: usually 1-D with length n_samples; it is what you want to predict from the data, i.e. the dependent variable.

In [68]:
sns.pairplot(iris, hue='species', height=2.0);
In [69]:
X_iris = iris.drop('species', axis=1)
print(X_iris.shape)
X_iris.head()
(150, 4)
Out[69]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [70]:
y_iris = iris['species']
y_iris.head()
Out[70]:
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

Scikit-Learn's Estimator API

  • Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.

  • Inspection: All specified parameter values are exposed as public attributes.

  • Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.

  • Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.

  • Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.

How-To

  1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
  2. Choose model hyperparameters by instantiating this class with desired values.
  3. Arrange data into a features matrix and target vector following the discussion above.
  4. Fit the model to your data by calling the fit() method of the model instance.
  5. Apply the Model to new data:
    • For supervised learning, often we predict labels for unknown data using the predict() method.
    • For unsupervised learning, we often transform or infer properties of the data using the transform() (like for PCA) or predict() (like for GaussianMixture) method.

Example

In [261]:
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);
In [268]:
# 1. Choose a class of model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 2. Choose model hyperparameters
model = LinearRegression(fit_intercept=True)
model
Out[268]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [34]:
# The CHOICE of model is done. Only now do we bring in the data.
# 3. Arrange data into a features matrix and target vector

Previously we detailed the Scikit-Learn data representation, which requires a two-dimensional features matrix and a one-dimensional target array. Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features].

In this case, this amounts to a simple reshaping of the one-dimensional array:

In [269]:
x
Out[269]:
array([3.74540119, 9.50714306, 7.31993942, 5.98658484, 1.5601864 ,
       1.5599452 , 0.58083612, 8.66176146, 6.01115012, 7.08072578,
       0.20584494, 9.69909852, 8.32442641, 2.12339111, 1.81824967,
       1.8340451 , 3.04242243, 5.24756432, 4.31945019, 2.9122914 ,
       6.11852895, 1.39493861, 2.92144649, 3.66361843, 4.56069984,
       7.85175961, 1.99673782, 5.14234438, 5.92414569, 0.46450413,
       6.07544852, 1.70524124, 0.65051593, 9.48885537, 9.65632033,
       8.08397348, 3.04613769, 0.97672114, 6.84233027, 4.40152494,
       1.22038235, 4.9517691 , 0.34388521, 9.09320402, 2.58779982,
       6.62522284, 3.11711076, 5.20068021, 5.46710279, 1.84854456])
In [270]:
x.shape
Out[270]:
(50,)
In [271]:
x[:, np.newaxis]
Out[271]:
array([[3.74540119],
       [9.50714306],
       [7.31993942],
       [5.98658484],
       [1.5601864 ],
       [1.5599452 ],
       [0.58083612],
       [8.66176146],
       [6.01115012],
       [7.08072578],
       [0.20584494],
       [9.69909852],
       [8.32442641],
       [2.12339111],
       [1.81824967],
       [1.8340451 ],
       [3.04242243],
       [5.24756432],
       [4.31945019],
       [2.9122914 ],
       [6.11852895],
       [1.39493861],
       [2.92144649],
       [3.66361843],
       [4.56069984],
       [7.85175961],
       [1.99673782],
       [5.14234438],
       [5.92414569],
       [0.46450413],
       [6.07544852],
       [1.70524124],
       [0.65051593],
       [9.48885537],
       [9.65632033],
       [8.08397348],
       [3.04613769],
       [0.97672114],
       [6.84233027],
       [4.40152494],
       [1.22038235],
       [4.9517691 ],
       [0.34388521],
       [9.09320402],
       [2.58779982],
       [6.62522284],
       [3.11711076],
       [5.20068021],
       [5.46710279],
       [1.84854456]])
In [272]:
X = x[:, np.newaxis]
X.shape
Out[272]:
(50, 1)
In [273]:
# 4. Fit model to your data
model.fit(X, y)
Out[273]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example in this linear model, we have the following:

In [274]:
model.coef_
Out[274]:
array([1.9776566])
In [275]:
model.intercept_
Out[275]:
-0.903310725531111
In [38]:
# 5. Predict labels for unknown data.

For the sake of this example, our "new data" will be a grid of x values, and we will ask what y values the model predicts:

In [276]:
xfit = np.linspace(-1, 11)
In [277]:
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
In [280]:
plt.scatter(x, y)
plt.plot(xfit, yfit);
In [284]:
r2_score(y, yfit)
Out[284]:
-1.652881857892614
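Note that r2_score(y, yfit) compares the noisy training targets y (sampled at x) against predictions made on the unrelated xfit grid, which is why the score comes out negative. The intended sanity check is presumably against predictions at the training points themselves (a sketch; this value is not part of the notebook output above):

# compare training targets with predictions at the same x values
r2_score(y, model.predict(X))  # should be close to 1 for this nearly-linear data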

Supervised learning: Iris classification

Q: Given a model trained on a portion of the Iris data, how well can we predict the remaining labels?

Use Gaussian Naive Bayes to start, because it is easy and fast.

We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. This could be done by hand, but it is more convenient to use the train_test_split utility function:

In [83]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)
In [84]:
from sklearn.naive_bayes import GaussianNB  # 1. choose model class
model = GaussianNB()                        # 2. instantiate model
model.fit(Xtrain, ytrain)                   # 3. fit model to data
y_model = model.predict(Xtest)              # 4. predict on new data
In [85]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Out[85]:
0.9736842105263158
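For a per-class breakdown of that accuracy, classification_report is handy (a quick sketch reusing ytest and y_model from above):

from sklearn.metrics import classification_report
print(classification_report(ytest, y_model))  # precision / recall / F1 per species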

Unsupervised learning: Iris dimensionality

Reduce dimensionality to more easily visualize it.

In [86]:
X_iris.head()
Out[86]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [87]:
from sklearn.decomposition import PCA
In [88]:
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)
In [90]:
iris['PCA1'] = X_2D[:, 0]  #first dimension (column) of reduced dataset
iris['PCA2'] = X_2D[:, 1]  #second dimension (column) of reduced dataset
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species', fit_reg=False);

We see that in the two-dimensional representation, the species are fairly well separated, even though the PCA algorithm had no knowledge of the species labels! This indicates to us that a relatively straightforward classification will probably be effective on the dataset, as we saw before.

In [91]:
from sklearn.mixture import GaussianMixture
In [92]:
model = GaussianMixture(n_components=3)
model.fit(X_iris)
yclassified = model.predict(X_iris)
In [93]:
iris['classified'] = yclassified
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species', col='classified', fit_reg=False);
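To quantify how well the unsupervised mixture components line up with the true species, the adjusted Rand index is one option (a sketch reusing y_iris and yclassified from above):

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_iris, yclassified)  # 1.0 = perfect agreement, ~0 = random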

Hand-written digits

Q: Identify hand-written digits (a basic OCR task)

In [94]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
Out[94]:
(1797, 8, 8)
In [102]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]})

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            color='green', transform=ax.transAxes)

For Scikit-Learn, we need [n_samples, n_features]. Each image is a sample and each pixel in the image is a feature, so 64 features per image:

In [146]:
X = digits.data
X.shape
Out[146]:
(1797, 64)
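That data matrix is just each 8x8 image flattened into a 64-long row; a quick sanity check (reusing the digits object and the numpy import from above):

np.allclose(digits.images.reshape(len(digits.images), -1), digits.data)  # expect True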
In [109]:
X[0]
Out[109]:
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
In [119]:
plt.imshow(digits.images[0], cmap='binary');
In [147]:
y = digits.target
y.shape
Out[147]:
(1797,)
In [148]:
y[0]
Out[148]:
0

How to visualize a 64-dimensional space (one dimension per feature)? Reduce it to 2 dimensions with an unsupervised method: the manifold learning algorithm Isomap.

In [124]:
from sklearn.manifold import Isomap
In [155]:
iso = Isomap()
iso.fit(X)
y_Iso = iso.transform(X)
y_Iso.shape
Out[155]:
(1797, 2)
In [156]:
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(y_Iso[:, 0], y_Iso[:, 1],
            c=y,
            cmap=plt.cm.get_cmap('jet', 10), edgecolors='None', alpha=0.5)
plt.colorbar(label='digit', ticks=range(10));
In [194]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
In [201]:
from sklearn.naive_bayes import GaussianNB  # 1. choose model class
model = GaussianNB()                        # 2. instantiate model
classifier = model.fit(Xtrain, ytrain)      # 3. fit model to data
y_model = model.predict(Xtest)              # 4. predict on new data
In [202]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Out[202]:
0.8511111111111112

Where is the model failing? Use a confusion matrix (here via sklearn.metrics.plot_confusion_matrix()):

In [197]:
from sklearn.metrics import plot_confusion_matrix
In [226]:
fig, ax = plt.subplots(figsize=(7,7))
plot_confusion_matrix(classifier, Xtest, ytest, normalize=None, 
                      cmap=plt.cm.Blues, ax = ax)
ax.grid(False)
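plot_confusion_matrix was deprecated and later removed (in scikit-learn 1.2); on recent versions the equivalent is ConfusionMatrixDisplay (a sketch using the same classifier and test split):

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(classifier, Xtest, ytest, cmap=plt.cm.Blues)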

Hyperparameters and Model Validation

In [246]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
In [247]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0,
                                                train_size=0.5)  # cross-validation is basically
                                                                 # rotating which slice of the data is used for training

model.fit(Xtrain, ytrain)

y2_model = model.predict(Xtest)
accuracy_score(ytest, y2_model)
Out[247]:
0.9066666666666666
In [248]:
y1 = model.fit(Xtrain, ytrain).predict(Xtest)
y2 = model.fit(Xtest, ytest).predict(Xtrain)
accuracy_score(ytest, y1), accuracy_score(ytrain, y2)
Out[248]:
(0.9066666666666666, 0.96)

Cross Validation

That was 2-fold cross-validation. Use cross_val_score() for ease:

In [245]:
from sklearn.model_selection import cross_val_score
In [249]:
cross_val_score(model, X, y, cv=5)
Out[249]:
array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])
In [250]:
cross_val_score(model, X, y, cv=5).mean()
Out[250]:
0.96
In [251]:
from sklearn.model_selection import LeaveOneOut
In [259]:
pass_fail = cross_val_score(model, X, y, cv=LeaveOneOut())  # train on all samples but one, predict the held-out one; 1 = correct, 0 = wrong
pass_fail
Out[259]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
In [260]:
pass_fail.mean()
Out[260]:
0.96

Model Selection and Bias/Variance

Q: If the model is underperforming, what can we do? Options:

  • more complicated model
  • less complicated model
  • more training samples
  • more features per sample

[Figure: bias-variance tradeoff]

Left (high bias): needs more complexity. The data is intrinsically more complicated, and the model does not have enough flexibility to account for all of its features.

  • High bias: $R^2$ relatively unchanged; performance on the test set is similar to the training set.

Right (high variance): plenty of flexibility for fine features; it accurately describes the training data but ignores the underlying process that may have generated it. There is so much flexibility in the model that it fits both the trend in the data and the random errors.

  • High variance: $R^2$ massively different; performance on the test set is far worse than on the training set.

[Figure: validation curve]

  • The training score is everywhere higher than the validation score. This is generally the case: the model will be a better fit to data it has seen than to data it has not seen.
  • For very low model complexity (a high-bias model), the training data is under-fit, which means that the model is a poor predictor both for the training data and for any previously unseen data.
  • For very high model complexity (a high-variance model), the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data.
  • For some intermediate value, the validation curve has a maximum. This level of complexity indicates a suitable trade-off between bias and variance.
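A sketch of how such a validation curve can be computed with Scikit-Learn's validation_curve, using a polynomial-regression pipeline as a stand-in for "model complexity" (the pipeline and degree range are assumptions, not part of the notes above):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve

# reuse the noisy linear data from the regression example above
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
X = x[:, np.newaxis]

degrees = np.arange(1, 16)  # the model-complexity axis
poly_model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    poly_model, X, y,
    param_name='polynomialfeatures__degree',
    param_range=degrees, cv=5)

# training score keeps rising with complexity; validation score peaks, then drops
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))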