TP: Overfit

Commit f99a3189 authored Jan 26, 2023 by PLN (Algolia)

Showing with 134 additions and 0 deletions

overfitting.py tp/02-train/overfitting.py +134 -0

--- a/tp/02-train/overfitting.py
+++ b/tp/02-train/overfitting.py
+"""
+Adapted from the scikit-learn examples: https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
+"""
+import numpy as np
+import matplotlib.pyplot as plt
+import sklearn
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import PolynomialFeatures
+from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import cross_val_score, validation_curve
+def true_fun(X):
+    """This is the actual underlying distribution."""
+    return np.cos(1.5 * np.pi * X)
+np.random.seed(0)
+n_samples = 30
+# FIXME Lvl 2: play with the degree of your polynomial model:
+#  - 2.0 Which degree best matches the reality described by true_fun?
+#  - 2.1 What do you think of the "test_error" metric (how closely your model's curve fits the Samples)
+#     as a way to decide which degree gives the best prediction?
+#  - 2.2 How does the "true error" (the Mean Squared Error computed by cross-validation) compare?
+#  - 2.3 What goes wrong if we go "further"? What is the danger of "only looking at the error"?
+#  - 2.4 BONUS: raise the degree to an absurd level and comment on the results. What is the model worth, in your opinion?
+degrees = [
+    1,
+    2,
+    # 3,
+    # 4,
+    # 15,
+]
+# We start with some datapoints: a variable X plotted against a variable Y, following a law of true_fun(), with noise
+X = np.sort(np.random.rand(n_samples))
+"""X = the input (x-axis) values"""
+print(X)
+noise_factor = 0.1
+# FIXME Lvl 3: play with noise_factor:
+#  - 3.0 What is the largest value you can reach while keeping a low error (say MSE < 0.5) at degree 1?
+#  - 3.1 And at degree 2?
+#  - 3.2 What do you conclude about how robust your polynomial model is to noise
+#  depending on its complexity (that is, its degree)? What do you notice when the degree goes up or down?
+Y = true_fun(X) + np.random.randn(n_samples) * noise_factor
+print(Y)
+plt.figure(figsize=(18, 8))
+for i, degree in enumerate(degrees):
+    ax = plt.subplot(1, len(degrees), i + 1)
+    plt.setp(ax, xticks=(), yticks=())
+    # Let's create a polynomial model to match the datapoints
+    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
+    linear_regression = LinearRegression()
+    # TODO Learn more about these Features:
+    #   - API Docs https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
+    #   - In the user guide https://scikit-learn.org/stable/modules/preprocessing.html#polynomial-features
+    pipeline = Pipeline(
+        [
+            ("polynomial_features", polynomial_features),
+            ("linear_regression", linear_regression),
+        ]
+    )
+    # TODO Learn more about doing a Linear Regression from scratch yourself:
+    #  https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/
+    # TODO FOR FUN: Wanna understand deeper the problem of 'too many variables' that we're seeing here?
+    #   Have a read of Overfitting and underfitting the Titanic:
+    #   https://www.kaggle.com/code/carlmcbrideellis/overfitting-and-underfitting-the-titanic/notebook
+    #   Where you'll find a lovely von Neumann quote, as relayed by Fermi: "...with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
+    #   It shows the risk of overly complex models: the more complex the model, the more it can fool you by memorizing the data!
+    pipeline.fit(X[:, np.newaxis], Y)
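+    # Note: despite its name, test_error is measured on the training samples: it is 1 - R^2 on (X, Y)
+    test_score = pipeline.score(X[:, np.newaxis], Y)
+    test_error = 1 - test_score
+    print(f"Polynomial model of degree {degree} fits the training samples with error {test_error}")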
+    # Evaluate the models using cross-validation
+    scores = cross_val_score(
+        pipeline, X[:, np.newaxis], Y, scoring="neg_mean_squared_error", cv=10
+    )
+    # TODO Learn more about what cross-validation is and how to use it yourself to evaluate your models
+    # https://scikit-learn.org/stable/modules/cross_validation.html
+    # Now let's use Matplotlib to graph our samples, the true function, and how close our model is to that function
+    # TODO Read about this awesome little library: https://matplotlib.org/stable/tutorials/introductory/pyplot.html
+    X_test = np.linspace(0, 1, 100)
+    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
+    plt.plot(X_test, true_fun(X_test), label="True function")
+    plt.scatter(X, Y, edgecolor="b", s=20, label="Samples")
+    plt.xlabel("x")
+    plt.ylabel("y")
+    plt.xlim((0, 1))
+    plt.ylim((-2, 2))
+    plt.legend(loc="best")
+    plt.title(
+        "Degree {}\nTraining Error = {:.2}\nCrossVal Error = {:.2} (+/- {:.1})".format(
+            degree, test_error, -scores.mean(), scores.std()
+        )
+    )
+plt.show()
+print("Done")
+# FIXME Lvl 4 BONUS: compute the real error of your model at degree 1 and at the best degree found above,
+#  by measuring the error between its output and true_fun (which describes the real world your model is trying to understand).
+#   A few ideas to explore:
+#   - You could generate a validation set from true_fun and measure the model's error on it
+#   - You could evaluate true_fun over an interval [for example X_val = np.arange(0, 1, 0.01)],
+#       and measure the Mean Squared Error between `prediction = pipeline.predict(X_val[:, np.newaxis])` and `ground_truth = true_fun(X_val)`
+# FIXME ADVANCED LEVEL:
+#   LVL 5 Follow this guide to explore underfitting and overfitting in a more complex model, a neural network:
+#   https://thedatafrog.com/en/articles/overfitting-illustrated/
+#   5.0: In a few words, what do you conclude? What do the complexity of our polynomial model and the complexity of a neural network have in common?
+#   5.1: Did you notice a difference in how "easily" these two models overfit?
+#   5.2 BONUS: Did you notice a link between "tendency to overfit" and "amount of data"? Does changing one affect the other?
+# FIXME EXPERT LEVEL:
+#   LVL 6 Follow this Google Cloud doc on preventing overfitting: https://cloud.google.com/bigquery-ml/docs/preventing-overfitting?
+#   6.1: Search online for more information on Early Stopping. What is it good for? What does scikit-learn offer for it, for example in the MLPRegressor model?
+#   6.2: Search online for more information on Regularization.
+#           Why does it matter in practice when developing a model? What does scikit-learn offer for it?
+#   6.3 BONUS: Follow this tutorial: https://medium.com/coinmonks/regularization-of-linear-models-with-sklearn-f88633a93a2
+#      What difference do you see between L1 and L2 regularization?
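
The Lvl 4 bonus in the file above suggests measuring the model's error against true_fun on a dense grid. A minimal sketch of that idea follows; it is not part of the commit. The helper name `true_error`, the use of `sklearn.metrics.mean_squared_error`, and the candidate degrees (1, 4, 15) are assumptions chosen for illustration, and degree 4 is only a candidate value, not a stated answer to question 2.0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(X):
    """Same ground-truth function as in overfitting.py."""
    return np.cos(1.5 * np.pi * X)


def true_error(degree, n_samples=30, noise_factor=0.1, seed=0):
    """Hypothetical helper: fit a polynomial model on noisy samples,
    then return its MSE against the noise-free true_fun on a dense grid."""
    rng = np.random.RandomState(seed)
    X = np.sort(rng.rand(n_samples))
    Y = true_fun(X) + rng.randn(n_samples) * noise_factor

    pipeline = Pipeline(
        [
            ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
            ("linear_regression", LinearRegression()),
        ]
    )
    pipeline.fit(X[:, np.newaxis], Y)

    # Validation grid, as suggested in the Lvl 4 comment
    X_val = np.arange(0, 1, 0.01)
    prediction = pipeline.predict(X_val[:, np.newaxis])
    ground_truth = true_fun(X_val)
    return mean_squared_error(ground_truth, prediction)


# Candidate degrees are assumptions, taken from the `degrees` list in the file
errors = {degree: true_error(degree) for degree in (1, 4, 15)}
for degree, err in errors.items():
    print(f"Degree {degree}: MSE against true_fun = {err:.3g}")
```

Unlike the training error shown in the plot titles, this number is computed on points the model never saw and against the noise-free function, so it can be compared directly with the cross-validation estimate for each degree.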