ML

Posted on Wed 10 January 2018 in Projects

ML Notebook

Supervised

model the relationship between measured features and known labels,
then apply that model to predict labels for new, unknown data

  • Classification: labels are discrete
    • Naive Bayes
    • Support Vector Machines
    • Decision Trees and Random Forests

feature 1, feature 2, etc. $\to$ normalized counts of important words or phrases ("Viagra", "Nigerian prince", etc.)
label $\to$ "spam" or "not spam"

  • Regression: labels are continuous

feature 1, feature 2, etc. $\to$ brightness of each galaxy at one of several wavelengths or colors.
label $\to$ distance or redshift of the galaxy
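A matching sketch for the regression case; the brightness values and their linear relation to redshift here are fabricated, only the array shapes matter:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# hypothetical features: brightness of 100 galaxies in 4 wavelength bands
X = rng.rand(100, 4)                        # shape [n_samples, n_features]
# hypothetical continuous label: redshift, loosely tied to the brightnesses
y = X @ [0.5, 1.0, -0.3, 0.2] + 0.05 * rng.randn(100)

model = LinearRegression().fit(X, y)
print(model.predict(X[:3]))                 # predicted redshifts for 3 galaxies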

Unsupervised

model features of a dataset without reference to any label
"let dataset speak for itself"

  • Clustering (ID groups)
    • k-Means (see the sketch after this list; note that kNN is a supervised classifier, not a clustering method)
  • Dimensionality reduction (more succinct representation of data)
  • etc.
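A minimal k-Means sketch on the Iris measurements (n_clusters=3 is chosen only because we happen to know there are three species):

import seaborn as sns
from sklearn.cluster import KMeans

iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)

# group the samples into 3 clusters using only the features, never the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
iris['cluster'] = kmeans.fit_predict(X)

# compare the discovered clusters against the (unused) species labels
print(iris.groupby(['cluster', 'species']).size())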

Semi-supervised

often useful when only incomplete labels are available
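Scikit-Learn's semi_supervised module covers this case; a hedged sketch with LabelPropagation, where unlabeled samples are marked -1 (the 70% masking fraction is arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# pretend most labels are missing: -1 marks "unlabeled"
rng = np.random.RandomState(0)
y_partial = np.where(rng.rand(len(y)) < 0.7, -1, y)

# propagate the few known labels through the feature space
model = LabelPropagation().fit(X, y_partial)
print((model.transduction_ == y).mean())  # fraction of labels recovered correctly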

Scikit-Learn

In [184]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
Out[184]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Features matrix, X, with shape [n_samples, n_features]

In [67]:
iris.shape
Out[67]:
(150, 5)

Label or target array, y: usually 1-D with length n_samples; it is what you want to predict from the data, i.e. the dependent variable.

In [68]:
sns.pairplot(iris, hue='species', height=2.0);
In [69]:
X_iris = iris.drop('species', axis=1)
print(X_iris.shape)
X_iris.head()
(150, 4)
Out[69]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [70]:
y_iris = iris['species']
y_iris.head()
Out[70]:
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

Scikit-Learn's Estimator API

  • Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.

  • Inspection: All specified parameter values are exposed as public attributes.

  • Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.

  • Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.

  • Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.

How-To

  1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
  2. Choose model hyperparameters by instantiating this class with desired values.
  3. Arrange data into a features matrix and target vector following the discussion above.
  4. Fit the model to your data by calling the fit() method of the model instance.
  5. Apply the Model to new data:
    • For supervised learning, often we predict labels for unknown data using the predict() method.
    • For unsupervised learning, we often transform or infer properties of the data using the transform() (like for PCA) or predict() (like for GaussianMixture) method.

Example

In [261]:
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);
In [268]:
# 1. Choose a class of model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 2. Choose model hyperparameters
model = LinearRegression(fit_intercept=True)
model
Out[268]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [34]:
# The CHOICE of model is done. Only now do we bring in the data.
# 3. Arrange data into a features matrix and target vector

Previously we detailed the Scikit-Learn data representation, which requires a two-dimensional features matrix and a one-dimensional target array. Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features].

In this case, this amounts to a simple reshaping of the one-dimensional array:

In [269]:
x
Out[269]:
array([3.74540119, 9.50714306, 7.31993942, 5.98658484, 1.5601864 ,
       1.5599452 , 0.58083612, 8.66176146, 6.01115012, 7.08072578,
       0.20584494, 9.69909852, 8.32442641, 2.12339111, 1.81824967,
       1.8340451 , 3.04242243, 5.24756432, 4.31945019, 2.9122914 ,
       6.11852895, 1.39493861, 2.92144649, 3.66361843, 4.56069984,
       7.85175961, 1.99673782, 5.14234438, 5.92414569, 0.46450413,
       6.07544852, 1.70524124, 0.65051593, 9.48885537, 9.65632033,
       8.08397348, 3.04613769, 0.97672114, 6.84233027, 4.40152494,
       1.22038235, 4.9517691 , 0.34388521, 9.09320402, 2.58779982,
       6.62522284, 3.11711076, 5.20068021, 5.46710279, 1.84854456])
In [270]:
x.shape
Out[270]:
(50,)
In [271]:
x[:, np.newaxis]
Out[271]:
array([[3.74540119],
       [9.50714306],
       [7.31993942],
       [5.98658484],
       [1.5601864 ],
       [1.5599452 ],
       [0.58083612],
       [8.66176146],
       [6.01115012],
       [7.08072578],
       [0.20584494],
       [9.69909852],
       [8.32442641],
       [2.12339111],
       [1.81824967],
       [1.8340451 ],
       [3.04242243],
       [5.24756432],
       [4.31945019],
       [2.9122914 ],
       [6.11852895],
       [1.39493861],
       [2.92144649],
       [3.66361843],
       [4.56069984],
       [7.85175961],
       [1.99673782],
       [5.14234438],
       [5.92414569],
       [0.46450413],
       [6.07544852],
       [1.70524124],
       [0.65051593],
       [9.48885537],
       [9.65632033],
       [8.08397348],
       [3.04613769],
       [0.97672114],
       [6.84233027],
       [4.40152494],
       [1.22038235],
       [4.9517691 ],
       [0.34388521],
       [9.09320402],
       [2.58779982],
       [6.62522284],
       [3.11711076],
       [5.20068021],
       [5.46710279],
       [1.84854456]])
In [272]:
X = x[:, np.newaxis]
X.shape
Out[272]:
(50, 1)
In [273]:
# 4. Fit model to your data
model.fit(X, y)
Out[273]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example in this linear model, we have the following:

In [274]:
model.coef_
Out[274]:
array([1.9776566])
In [275]:
model.intercept_
Out[275]:
-0.903310725531111
In [38]:
# 5. Predict labels for unknown data.

For the sake of this example, our "new data" will be a grid of x values, and we will ask what y values the model predicts:

In [276]:
xfit = np.linspace(-1, 11)
In [277]:
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
In [280]:
plt.scatter(x, y)
plt.plot(xfit, yfit);
In [284]:
r2_score(y, yfit)
Out[284]:
-1.652881857892614
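Note that r2_score(y, yfit) compares the noisy training targets y (sampled at x) against predictions made on the unrelated xfit grid, which is why the score comes out negative. The intended sanity check is presumably against predictions at the training points themselves (a sketch; this value is not part of the notebook output above):

# compare training targets with predictions at the same x values
r2_score(y, model.predict(X))  # should be close to 1 for this nearly-linear data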

Supervised learning: Iris classification

Q: Given a model trained on a portion of the Iris data, how well can we predict the remaining labels?

Use Gaussian Naive Bayes to start, because it is easy and fast.

We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. This could be done by hand, but it is more convenient to use the train_test_split utility function:

In [83]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)
In [84]:
from sklearn.naive_bayes import GaussianNB  # 1. choose model class
model = GaussianNB()                        # 2. instantiate model
model.fit(Xtrain, ytrain)                   # 3. fit model to data
y_model = model.predict(Xtest)              # 4. predict on new data
In [85]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Out[85]:
0.9736842105263158
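For a per-class breakdown of that accuracy, classification_report is handy (a quick sketch reusing ytest and y_model from above):

from sklearn.metrics import classification_report
print(classification_report(ytest, y_model))  # precision / recall / F1 per species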

Unsupervised learning: Iris dimensionality

Reduce dimensionality to more easily visualize it.

In [86]:
X_iris.head()
Out[86]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [87]:
from sklearn.decomposition import PCA
In [88]:
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)
In [90]:
iris['PCA1'] = X_2D[:, 0]  #first dimension (column) of reduced dataset
iris['PCA2'] = X_2D[:, 1]  #second dimension (column) of reduced dataset
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species', fit_reg=False);

We see that in the two-dimensional representation, the species are fairly well separated, even though the PCA algorithm had no knowledge of the species labels! This indicates to us that a relatively straightforward classification will probably be effective on the dataset, as we saw before.

In [91]:
from sklearn.mixture import GaussianMixture
In [92]:
model = GaussianMixture(n_components=3)
model.fit(X_iris)
yclassified = model.predict(X_iris)
In [93]:
iris['classified'] = yclassified
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species', col='classified', fit_reg=False);
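To quantify how well the unsupervised mixture components line up with the true species, the adjusted Rand index is one option (a sketch reusing y_iris and yclassified from above):

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_iris, yclassified)  # 1.0 = perfect agreement, ~0 = random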

Hand-written digits

Q: Identify hand-written digits (a basic OCR task)

In [94]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
Out[94]:
(1797, 8, 8)
In [102]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]})

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            color='green', transform=ax.transAxes)

For Scikit-Learn, we need [n_samples, n_features]. Each image is a sample and each pixel in the image is a feature, so 64 features per image:

In [146]:
X = digits.data
X.shape
Out[146]:
(1797, 64)
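That data matrix is just each 8x8 image flattened into a 64-long row; a quick sanity check (reusing the digits object and the numpy import from above):

np.allclose(digits.images.reshape(len(digits.images), -1), digits.data)  # expect True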
In [109]:
X[0]
Out[109]:
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
In [119]:
plt.imshow(digits.images[0], cmap='binary');
In [147]:
y = digits.target
y.shape
Out[147]:
(1797,)
In [148]:
y[0]
Out[148]:
0

How to visualize a 64-dimensional space (one dimension per feature)? Reduce it to 2 dimensions with an unsupervised method: the manifold learning algorithm Isomap.

In [124]:
from sklearn.manifold import Isomap
In [155]:
iso = Isomap()
iso.fit(X)
y_Iso = iso.transform(X)
y_Iso.shape
Out[155]:
(1797, 2)
In [156]:
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(y_Iso[:, 0], y_Iso[:, 1],
            c=y,
            cmap=plt.cm.get_cmap('jet', 10), edgecolors='None', alpha=0.5)
plt.colorbar(label='digit', ticks=range(10));
In [194]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
In [201]:
from sklearn.naive_bayes import GaussianNB  # 1. choose model class
model = GaussianNB()                        # 2. instantiate model
classifier = model.fit(Xtrain, ytrain)      # 3. fit model to data
y_model = model.predict(Xtest)              # 4. predict on new data
In [202]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Out[202]:
0.8511111111111112

Where is the model failing? Use a confusion matrix (here via sklearn.metrics.plot_confusion_matrix()):

In [197]:
from sklearn.metrics import plot_confusion_matrix
In [226]:
fig, ax = plt.subplots(figsize=(7,7))
plot_confusion_matrix(classifier, Xtest, ytest, normalize=None, 
                      cmap=plt.cm.Blues, ax = ax)
ax.grid(False)
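plot_confusion_matrix was deprecated and later removed (in scikit-learn 1.2); on recent versions the equivalent is ConfusionMatrixDisplay (a sketch using the same classifier and test split):

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(classifier, Xtest, ytest, cmap=plt.cm.Blues)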

Hyperparameters and Model Validation

In [246]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
In [247]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0,
                                                train_size=0.5)  # cross-validation is basically
                                                                 # rotating which slice of the data is used for training

model.fit(Xtrain, ytrain)

y2_model = model.predict(Xtest)
accuracy_score(ytest, y2_model)
Out[247]:
0.9066666666666666
In [248]:
y1 = model.fit(Xtrain, ytrain).predict(Xtest)
y2 = model.fit(Xtest, ytest).predict(Xtrain)
accuracy_score(ytest, y1), accuracy_score(ytrain, y2)
Out[248]:
(0.9066666666666666, 0.96)

Cross Validation

That was 2-fold cross-validation. Use cross_val_score() for ease:

In [245]:
from sklearn.model_selection import cross_val_score
In [249]:
cross_val_score(model, X, y, cv=5)
Out[249]:
array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])
In [250]:
cross_val_score(model, X, y, cv=5).mean()
Out[250]:
0.96
In [251]:
from sklearn.model_selection import LeaveOneOut
In [259]:
pass_fail = cross_val_score(model, X, y, cv=LeaveOneOut())  # train on all samples but one, predict the held-out one; 1 = correct, 0 = wrong
pass_fail
Out[259]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
In [260]:
pass_fail.mean()
Out[260]:
0.96

Model Selection and Bias/Variance

Q: If the model is underperforming, what can we do? Options:

  • more complicated model
  • less complicated model
  • more training samples
  • more features per sample

[Figure: bias-variance tradeoff]

Left (high bias): needs more complexity. The data is intrinsically more complicated, and the model does not have enough flexibility to account for all of its features.

  • High bias: $R^2$ relatively unchanged; performance on the test set is similar to the training set.

Right (high variance): plenty of flexibility for fine features; it accurately describes the training data but ignores the underlying process that may have generated it. There is so much flexibility in the model that it fits both the trend in the data and the random errors.

  • High variance: $R^2$ massively different; performance on the test set is far worse than on the training set.

[Figure: validation curve]

  • The training score is everywhere higher than the validation score. This is generally the case: the model will be a better fit to data it has seen than to data it has not seen.
  • For very low model complexity (a high-bias model), the training data is under-fit, which means that the model is a poor predictor both for the training data and for any previously unseen data.
  • For very high model complexity (a high-variance model), the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data.
  • For some intermediate value, the validation curve has a maximum. This level of complexity indicates a suitable trade-off between bias and variance.
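A sketch of how such a validation curve can be computed with Scikit-Learn's validation_curve, using a polynomial-regression pipeline as a stand-in for "model complexity" (the pipeline and degree range are assumptions, not part of the notes above):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve

# reuse the noisy linear data from the regression example above
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
X = x[:, np.newaxis]

degrees = np.arange(1, 16)  # the model-complexity axis
poly_model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    poly_model, X, y,
    param_name='polynomialfeatures__degree',
    param_range=degrees, cv=5)

# training score keeps rising with complexity; validation score peaks, then drops
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))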