%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
We can use the RAMP utilities to load the public train and test sets.
import problem
get_train_data loads the training data and returns a pandas DataFrame (the input) and a NumPy array (the output).
X_train, y_train = problem.get_train_data()
X_train.head()
y_train[:5]
Thus, for each entry in X_train, we get the corresponding forest cover type label in y_train. We can quickly get some statistics regarding the data we are dealing with.
X_train.describe()
We can get additional information using the info() method.
X_train.info()
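Beyond info(), a compact sanity check (a small sketch, not part of the original kit) of the dataset dimensions, the set of labels, and missing entries can be done as follows:
# Shape of the design matrix, the distinct cover type labels, and missing values
print("X_train shape:", X_train.shape)
print("classes:", np.unique(y_train))
print("missing values:", X_train.isnull().sum().sum())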
For the rest of the analysis, we will work on a subset of the data to speed up the computations.
from sklearn.model_selection import train_test_split
X_subset, _, y_subset, _ = train_test_split(
X_train, y_train, train_size=1/500,
stratify=y_train, random_state=42
)
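Since we pass stratify=y_train, the subset should keep roughly the same class proportions as the full training set. A quick check of this assumption (not part of the original kit):
# Compare class proportions in the full training set and in the stratified subset
full_dist = pd.Series(y_train).value_counts(normalize=True).sort_index()
subset_dist = pd.Series(y_subset).value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_dist, "subset": subset_dist}))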
We can check the class distribution to see whether we are dealing with an imbalanced dataset. This is always useful to know when assessing the performance of our predictive model.
_ = pd.Series(y_subset).value_counts().sort_index().plot(kind="bar")
We see that the first and second classes are dominant, which is something we need to take into account.
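To make the imbalance concrete, a simple majority-class baseline (a sketch using scikit-learn's DummyClassifier, not part of the original kit) gives the accuracy that any real model should beat:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predicting the most frequent class gives a lower bound on useful accuracy
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(
    baseline, X_subset, y_subset, cv=3, scoring="accuracy")
print("Majority-class accuracy: {:.4f}".format(np.mean(baseline_scores)))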
We can look at the interactions between pairs of features to see whether we can build some intuition about the rules a machine-learning model could come up with.
df = X_subset.copy()
df['Cover_Type'] = y_subset
features = [
'Elevation', 'Aspect', 'Slope',
'Horizontal_Distance_To_Hydrology',
'Hillshade_Noon', 'Soil_Type32',
'Cover_Type',
]
_ = sns.pairplot(
df[features], hue="Cover_Type"
)
Let us focus on the interaction between Horizontal_Distance_To_Hydrology and Elevation. Looking at this interaction, we can see that the cover type depends strongly on the elevation: the classes are largely separated along the Elevation axis.
_ = df.plot(
kind='scatter', x='Horizontal_Distance_To_Hydrology',
y='Elevation', c='Cover_Type', s=25, cmap=plt.cm.Paired,
figsize=(12, 8)
)
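To quantify this trend (a small complementary check, not in the original kit), we can summarise the distribution of Elevation within each cover type:
# Per-class summary statistics of Elevation; well-separated means and quartiles
# support the visual impression from the scatter plot
print(df.groupby("Cover_Type")["Elevation"].describe())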
For the rest of the notebook, we use the stratified subset as our training data and hold out a validation set for later inspection.
X_train, y_train = X_subset, y_subset
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
clf = make_pipeline(
StandardScaler(), LogisticRegression()
)
scores = cross_val_score(
clf, X_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: {:.4f} +/- {:.4f}".format(
np.mean(scores), np.std(scores))
)
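Depending on the scikit-learn version and the data, the logistic regression may emit convergence warnings with the default number of iterations; if so, increasing max_iter is a simple fix (a sketch, the exact value is an assumption):
# Same pipeline with a larger iteration budget for the solver
clf = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
)
scores = cross_val_score(
    clf, X_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: {:.4f} +/- {:.4f}".format(
    np.mean(scores), np.std(scores))
)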
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(
n_estimators=10, max_depth=10, max_features=10,
n_jobs=-1, random_state=42
)
scores = cross_val_score(
clf, X_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: {:.4f} +/- {:.4f}".format(
np.mean(scores), np.std(scores))
)
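Because the classes are imbalanced, plain accuracy can be optimistic; as a complementary check (not part of the original kit), we can also report the balanced accuracy of the same random forest:
# Balanced accuracy averages recall over classes, so minority classes count equally
balanced_scores = cross_val_score(
    clf, X_train, y_train, cv=3, scoring='balanced_accuracy')
print("Balanced accuracy: {:.4f} +/- {:.4f}".format(
    np.mean(balanced_scores), np.std(balanced_scores))
)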
We can now train a model and check which features are the most discriminative for our problem.
from sklearn.inspection import permutation_importance
clf.fit(X_train, y_train)
feature_importances = permutation_importance(
clf, X_val, y_val, n_repeats=3
)
sorted_idx = feature_importances.importances_mean.argsort()
fig, ax = plt.subplots(figsize=(10, 10))
ax.boxplot(feature_importances.importances[sorted_idx].T,
vert=False, labels=X_val.columns[sorted_idx])
ax.set_title("Permutation Importances (validation set)")
fig.tight_layout()
plt.show()
As we observed, the random-forest predictive model picked up the Elevation feature as the most important one.
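To complement the importance analysis, we can evaluate the fitted forest on the held-out validation split (a quick sketch, not part of the original kit):
from sklearn.metrics import accuracy_score, confusion_matrix

# Accuracy and per-class confusion on the validation set
y_pred = clf.predict(X_val)
print("Validation accuracy: {:.4f}".format(accuracy_score(y_val, y_pred)))
print(confusion_matrix(y_val, y_pred))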
To submit your code, you can refer to the online documentation.