ML
Posted on Wed 10 January 2018 in Projects
ML Notebook¶
Supervised¶
model: measured features <--> labels
then apply labels to new, unknown data
- Classification: labels are discrete
- Naive Bayes
- Support Vector Machines
- Decision Trees and Random Forests
feature 1, feature 2, etc. $\to$ normalized counts of important words or phrases ("Viagra", "Nigerian prince", etc.)
label $\to$ "spam" or "not spam"
- Regression: labels are continuous
feature 1, feature 2, etc. $\to$ brightness of each galaxy at one of several wavelengths or colors.
label $\to$ distance or redshift of the galaxy
Unsupervised¶
model features of a dataset without reference to any label
"let dataset speak for itself"
- Clustering (ID groups)
- k-Means, Gaussian mixture models
- Dimensionality reduction (more succinct representation of data)
- ..
Semi-supervised¶
often useful when only incomplete labels are available
Scikit-Learn¶
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
Features matrix, X, with shape [n_samples, n_features]
iris.shape
Label or target array, y. Usually 1-D, with length n_samples. It is what you want to predict from the data, i.e. the dependent variable.
sns.pairplot(iris, hue='species', height=2.0);
X_iris = iris.drop('species', axis=1)
print(X_iris.shape)
X_iris.head()
y_iris = iris['species']
y_iris.head()
Scikit-Learn's Estimator API¶
Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.
Inspection: All specified parameter values are exposed as public attributes.
Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.
Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.
Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.
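A minimal sketch of the inspection, defaults, and composition principles, assuming a recent Scikit-Learn release. The choice of StandardScaler as the first pipeline step is only for illustration and is not something these notes prescribe.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Inspection: constructor arguments are stored as public attributes.
model = LinearRegression(fit_intercept=False)
fit_intercept_value = model.fit_intercept   # the value passed in above
params = model.get_params()                 # dict of all hyperparameters

# Sensible defaults: leaving the argument out falls back to the library default.
default_model = LinearRegression()

# Composition: a pipeline chains estimators and is itself an estimator.
# (StandardScaler is an illustrative choice, not taken from the notes.)
pipe = make_pipeline(StandardScaler(), LinearRegression())
step_names = [name for name, _ in pipe.steps]
```

Because the pipeline exposes the same fit() and predict() methods as a single estimator, it can be dropped into any of the steps described in the How-To list.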
How-To¶
- Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
- Choose model hyperparameters by instantiating this class with desired values.
- Arrange data into a features matrix and target vector following the discussion above.
- Fit the model to your data by calling the fit() method of the model instance.
- Apply the model to new data:
  - For supervised learning, often we predict labels for unknown data using the predict() method.
  - For unsupervised learning, we often transform or infer properties of the data using the transform() (like for PCA) or predict() (like for GaussianMixture) method.
Example¶
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);
# 1. Choose a class of model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# 2. Choose model hyperparameters
model = LinearRegression(fit_intercept=True)
model
# CHOICE of model is done. Only now..
# 3. Arrange data into a features matrix and target vector
Previously we detailed the Scikit-Learn data representation, which requires a two-dimensional features matrix and a one-dimensional target array.
Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features].
In this case, this amounts to a simple reshaping of the one-dimensional array:
x
x.shape
x[:, np.newaxis]
X = x[:, np.newaxis]
X.shape
# 4. Fit model to your data
model.fit(X, y)
In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example in this linear model, we have the following:
model.coef_
model.intercept_
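As a quick sanity check on the trailing-underscore convention, the sketch below regenerates the same synthetic data and compares the learned parameters with the slope (2) and intercept (-1) used to build it. The variable names slope, intercept, and has_coef_before are illustrative additions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the example: y = 2x - 1 plus Gaussian noise.
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

model = LinearRegression(fit_intercept=True)
has_coef_before = hasattr(model, "coef_")   # learned attributes do not exist before fit()
model.fit(x[:, np.newaxis], y)

slope = model.coef_[0]         # should be close to the true slope, 2
intercept = model.intercept_   # should be close to the true intercept, -1
```

Hyperparameters such as fit_intercept are set by the user and have no underscore; attributes ending in an underscore only appear after fit() has run.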
# 5. Predict labels for unknown data.
For the sake of this example, our "new data" will be a grid of x values, and we will ask what y values the model predicts:
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
plt.scatter(x, y)
plt.plot(xfit, yfit);
r2_score(y, model.predict(X)) # compare y with predictions at the same x values, not on the plotting grid
Supervised learning: Iris classification¶
Q: Given a model trained on a portion of the Iris data, how well can we predict the remaining labels?
Use Gaussian Naive Bayes to start, because it is easy and fast.
We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set.
This could be done by hand, but it is more convenient to use the train_test_split utility function:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
random_state=1)
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
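For reference, accuracy is just the fraction of test labels the model predicted correctly. The sketch below checks that by hand. It loads Iris from sklearn.datasets rather than seaborn so that it runs on its own, which is a substitution made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: load_iris stands in for the seaborn copy of the dataset.
X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)
y_model = GaussianNB().fit(Xtrain, ytrain).predict(Xtest)

# Accuracy by hand: the mean of the element-wise matches.
manual_accuracy = np.mean(y_model == ytest)
library_accuracy = accuracy_score(ytest, y_model)
```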
Unsupervised learning: Iris dimensionality¶
Reduce the dimensionality of the data so that it is easier to visualize.
X_iris.head()
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)
iris['PCA1'] = X_2D[:, 0] #first dimension (column) of reduced dataset
iris['PCA2'] = X_2D[:, 1] #second dimension (column) of reduced dataset
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species', fit_reg=False); # recent seaborn releases require x and y as keywords
# from mlxtend.plotting import plot_decision_regions
# plot_decision_regions(X, y, clf=lr)
# plt.title('Softmax Regression in scikit-learn')
# plt.show()
We see that in the two-dimensional representation, the species are fairly well separated, even though the PCA algorithm had no knowledge of the species labels! This indicates to us that a relatively straightforward classification will probably be effective on the dataset, as we saw before.
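One way to judge how much information the two-dimensional projection keeps is the explained_variance_ratio_ attribute of the fitted PCA. This is a hedged sketch. It uses load_iris in place of the seaborn DataFrame so that it is self-contained, and the name retained is an illustrative addition.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: load_iris stands in for the seaborn copy of the dataset.
X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2).fit(X)
X_2D = pca.transform(X)

# Fraction of the total variance captured by each retained component.
ratios = pca.explained_variance_ratio_
retained = ratios.sum()
```

A retained fraction close to 1 suggests that little structure is lost by plotting only the first two components.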
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3)
model.fit(X_iris)
yclassified = model.predict(X_iris)
iris['classified'] = yclassified
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species', col='classified', fit_reg=False);
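The cluster indices returned by GaussianMixture are arbitrary, so they cannot be compared with the species labels directly. A small count table of species against cluster shows how well the two line up. This is a sketch under two assumptions: load_iris replaces the seaborn DataFrame, and random_state=0 is added only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

iris_data = load_iris()

# random_state is an addition for reproducibility, not part of the original notes.
gmm = GaussianMixture(n_components=3, random_state=0)
clusters = gmm.fit(iris_data.data).predict(iris_data.data)

# Rows: true species (0, 1, 2); columns: cluster index assigned by the mixture model.
counts = np.array([
    np.bincount(clusters[iris_data.target == k], minlength=3)
    for k in range(3)
])
```

A row with a single large entry means that species fell almost entirely into one cluster; a row spread over two columns means the mixture model confused it with a neighbor.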
Hand-written digits¶
Q: ID hand-written digits using OCR
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]})
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            color='green', transform=ax.transAxes)
For Scikit-Learn, we need [n_samples, n_features]. Each image is a sample and each pixel in the image is a feature, so there are 64 features:
X = digits.data
X.shape
X[0]
plt.imshow(digits.images[0], cmap='binary');
y = digits.target
y.shape
y[0]
How do we visualize a 64-dimensional space (64 because there are 64 features)? Reduce the dimensions to 2 using an unsupervised method: the manifold learning algorithm Isomap transforms the data into 2-D:
from sklearn.manifold import Isomap
iso = Isomap()
iso.fit(X)
y_Iso = iso.transform(X)
y_Iso.shape
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(y_Iso[:, 0], y_Iso[:, 1],
c=y,
cmap=plt.get_cmap('jet', 10), edgecolors='None', alpha=0.5) # plt.cm.get_cmap is gone in recent matplotlib
plt.colorbar(label='digit', ticks=range(10));
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
classifier = model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Where is the model failing? Use sklearn.metrics.confusion_matrix()¶
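# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
fig, ax = plt.subplots(figsize=(7,7))
ConfusionMatrixDisplay.from_estimator(classifier, Xtest, ytest, normalize=None,
cmap=plt.cm.Blues, ax=ax)
ax.grid(False)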
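The same information is available numerically from confusion_matrix(). The sketch below finds the most frequent mistake on the digits data. The random_state=0 argument and the names errors, true_digit, and predicted_digit are additions for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
# random_state is an addition so the split is repeatable.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0)
y_model = GaussianNB().fit(Xtrain, ytrain).predict(Xtest)

# Rows are the true digits, columns are the predicted digits.
cm = confusion_matrix(ytest, y_model)

# Zero the diagonal so only misclassifications remain, then take the largest cell.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_digit, predicted_digit = np.unravel_index(errors.argmax(), errors.shape)
```

The diagonal of cm holds the correct predictions, so np.trace(cm) / cm.sum() reproduces the accuracy score.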
Hyperparameters and Model Validation¶
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0,
train_size=0.5) # hold out half of the data for testing; cross-validation
# rotates which part of the data is held out
model.fit(Xtrain, ytrain)
y2_model = model.predict(Xtest)
accuracy_score(ytest, y2_model)
y1 = model.fit(Xtrain, ytrain).predict(Xtest)
y2 = model.fit(Xtest, ytest).predict(Xtrain)
accuracy_score(ytest, y1), accuracy_score(ytrain, y2)
Cross Validation¶
That was 2-fold cross-validation. Use cross_val_score() for ease:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
cross_val_score(model, X, y, cv=5).mean()
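To make clear what cross_val_score() does, here is a sketch of the same five-fold loop written by hand. It assumes the documented default behavior that an integer cv with a classifier uses unshuffled StratifiedKFold and scores with accuracy. The name manual_scores is illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# Each fold is held out once while the model trains on the other four.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    predictions = fold_model.predict(X[test_idx])
    manual_scores.append(accuracy_score(y[test_idx], predictions))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, cv=5)
```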
from sklearn.model_selection import LeaveOneOut
pass_fail = cross_val_score(model, X, y, cv=LeaveOneOut()) #use all data except 1, then return 0 or 1 for fail or pass
pass_fail
pass_fail.mean()
Model Selection and Bias/Variance¶
Q: If model is underperforming, what to do? Options:
- more complicated model
- less complicated model
- more training samples
- more features per sample
Left: Needs more complexity. The data is intrinsically more complicated than the model, which does not have enough flexibility to account for all the features in the data.
- $R^2$ relatively unchanged between the two sets. High bias: performance on the test set is similar to performance on the training set.
Right: Plenty of flexibility for fine features, and it accurately describes the training data, but it ignores the underlying process that may have generated the data. The model is so flexible that it fits the random errors as well as the trend in the data.
- $R^2$ massively different between the two sets. High variance: performance on the test set is far worse than on the training set.
- The training score is everywhere higher than the validation score. This is generally the case: the model will be a better fit to data it has seen than to data it has not seen.
- For very low model complexity (a high-bias model), the training data is under-fit, which means that the model is a poor predictor both for the training data and for any previously unseen data.
- For very high model complexity (a high-variance model), the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data.
- For some intermediate value, the validation curve has a maximum. This level of complexity indicates a suitable trade-off between bias and variance.
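The observations above can be reproduced with validation_curve(). This is a sketch under stated assumptions: the synthetic data set, the polynomial-regression pipeline, the degree range 1 to 10, and cv=7 are all illustrative choices, not taken from these notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy, nonlinear data, used only for this illustration.
rng = np.random.RandomState(1)
X = rng.rand(40, 1) ** 2
y = 10 - 1.0 / (X.ravel() + 0.1) + rng.randn(40)

# Model complexity is controlled by the polynomial degree.
model = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = np.arange(1, 11)

train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=7,
)

# Average the R^2 scores over the folds for each degree.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_degree = degrees[val_mean.argmax()]
```

Plotting train_mean and val_mean against degrees gives the validation curve; best_degree marks the complexity at which the validation score peaks.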