From 334a442e35b679edea0f7fac639ea2f5eaf56530 Mon Sep 17 00:00:00 2001
From: igraf <igraf@cl.uni-heidelberg.de>
Date: Fri, 23 Feb 2024 21:05:51 +0100
Subject: [PATCH] Update decision tree & random forest

---
 project/README.md | 82 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 62 insertions(+), 20 deletions(-)

diff --git a/project/README.md b/project/README.md
index de146ae..de541e6 100644
--- a/project/README.md
+++ b/project/README.md
@@ -323,9 +323,9 @@ We have conducted a series of experiments to evaluate the performance of differe
 To find the best hyperparameters for the Naive Bayes classifier, we have used the following parameter grid for the grid search, which tests 20 different values for the `var_smoothing` parameter:
 
-```json
+```
 {
-    'var_smoothing': np.logspace(0,-20, num=20)
+    "var_smoothing": np.logspace(0,-20, num=20)
 }
 ```
 The best accuracy we achieved on the development set was **0.178** with the **HSV + Sobel filters** on 50x50 images. The best value for `var_smoothing` was `4.2 * 10^-8`.
@@ -349,6 +349,7 @@ The following table shows an excerpt of the feature and size combination used.
 **Further findings:**
 - accuracy on the training set is also never higher than 0.20 :arrow_right: the classifier is not overfitting but also not learning anything
+- accuracy decreases with increasing `var_smoothing` beyond a threshold of about `4.2 * 10^-8` :arrow_right: the classifier is then smoothing too much, which leads to underfitting
 - for some classes, the diagonal in the confusion matrix below is quite bright (e.g. apricots and passion fruits) :arrow_right: the classifier is quite good at predicting these classes
 - but we also see that the classifier has a **strong bias** towards some classes (e.g. apricots, jostaberries, passion fruits and figs)
@@ -358,37 +359,79 @@
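As a minimal sketch of the `var_smoothing` search described above (assuming scikit-learn's `GridSearchCV` and `GaussianNB`; the random `X`, `y` arrays are placeholders for the real HSV + Sobel feature vectors, not project data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Same grid as above: 20 var_smoothing values from 1e0 down to 1e-20
param_grid = {"var_smoothing": np.logspace(0, -20, num=20)}

# Random features/labels stand in for the extracted image features
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 4, size=100)

# 5-fold cross-validated grid search over all 20 candidate values
search = GridSearchCV(GaussianNB(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_)
```

The selected value depends on the data; on the real feature sets it came out at roughly `4.2 * 10^-8`.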
### Decision Tree
 
+To find the best hyperparameters for the decision tree classifier, we have used the following parameter grid for the grid search:
+
+```
+{
+    "max_depth": list(range(10,81,10)) + [None],
+    "max_features": ["sqrt", "log2"],
+    "min_samples_leaf": [1,2,5,10,20],
+    "min_samples_split": [2,5,10,20],
+    "criterion": ["gini", "entropy"]
+}
+```
+
+The best accuracy on the development set is **0.23** with the **HSV + Sobel filters** on 50x50 images. The best parameters for the decision tree classifier, which can also be seen in the plot below, are:
+
+```
+{
+    "max_depth": 60,
+    "max_features": "sqrt",
+    "min_samples_leaf": 1,
+    "min_samples_split": 2,
+    "criterion": "gini"
+}
+```
+
+<img align="center" src="figures/decision_tree/grid_search_results_50x50_hsv_sobel_decision_tree_best_params.png" alt="Decision Tree Best Parameters" width="80%" height="auto">
 
 ### Random Forest
 
-**Feature Combinations:**
+To find the best hyperparameters for the random forest classifier, we have used the following parameter grid for the grid search:
 
-**50x50_** images
+```
+{
+    "max_depth": list(range(10,81,10)) + [None],
+    "n_estimators": [10,50,100],
+    "max_features": ["sqrt", "log2"],
+    "min_samples_leaf": [1,2,5,10,20],
+    "min_samples_split": [2,5,10,20]
+}
+```
 
-Tested:
+The best accuracy on the development set is **0.469** with the **HSV filters** (and also with **HSV + Sobel filters**) on 50x50 images.
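One point worth noting about this random forest grid (a sketch assuming scikit-learn's `ParameterGrid`; the counts follow directly from the lists above): it enumerates over a thousand candidate settings per feature combination, which is why a full search takes hours.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": list(range(10, 81, 10)) + [None],  # 9 values
    "n_estimators": [10, 50, 100],                  # 3 values
    "max_features": ["sqrt", "log2"],               # 2 values
    "min_samples_leaf": [1, 2, 5, 10, 20],          # 5 values
    "min_samples_split": [2, 5, 10, 20],            # 4 values
}

# 9 * 3 * 2 * 5 * 4 = 1080 parameter combinations,
# each of which is refit once per cross-validation fold
print(len(ParameterGrid(param_grid)))  # 1080
```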
+The best parameters for the random forest classifier, which can also be seen in the plot below, are:
 
-```json
-param_grid = {
-    "max_depth": list(range(10,81,10)) + [None],
-    "n_estimators": [10,50,100],
-    "max_features": ["sqrt", "log2"],
-    "min_samples_leaf": [2,5,10,20],
-    "min_samples_split": [2,5,10,20]
-}
+```
+{
+    "max_depth": 40,
+    "max_features": "sqrt",
+    "min_samples_leaf": 2,
+    "min_samples_split": 2,
+    "n_estimators": 100
+}
 ```
+
+<img align="center" src="figures/random_forest/grid_search_results_50x50_hsv_random_forest_best_params.png" alt="Random Forest Best Parameters" width="80%" height="auto">
 
 **Optimization:**
 - *"no filters"* = RGB values as features
 
 | Resized | Features | Accuracy (Dev) | Best Parameters | Comments |
 | ------- | -------- | -------- | --------------- | ---- |
 | 50x50 | No filters (7500 Features) | 0.417 | `{'max_depth': 70, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | |
-| 50x50 | HSV + Sobel (without normal pixel values) | 0.469 | `{'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}` | Time for optimization: 176 min |
-| 50x50 | HSV only (without normal pixel values) | 0.469 | `{'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | => no improvement compared to HSV + Sobel |
-| 50x50 | Sobel only | 0.392 | `{'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}` | |
-| 50x50 | No filters + Sobel | 0.432 | `{'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}` | time for optimization: 653 min |
-| 125x125 | Canny only | 0.214 | `{'max_depth': 80, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | time for optimization: 94 min |
+| 50x50 | HSV only | 0.469 | `{'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | |
+| 50x50 | HSV + Sobel | 0.469 | `{'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}` | => no improvement compared to HSV only |
+| 50x50 | Sobel only | 0.392 | `{'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}` | => a lot worse than HSV only |
+| 50x50 | No filters + Sobel | 0.432 | `{'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}` | |
+| 125x125 | Canny only | 0.214 | `{'max_depth': 80, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | => poor results despite the many features |
+
+Because we had strong results with HSV + Sobel, we also used this feature combination for another round of optimization with a different picture size (namely 75x75). The best accuracy we achieved was **0.469**, thus no improvement compared to the 50x50 images.
 
 ==> Best results for 50x50 images with HSV + Sobel filters & HSV only
-==> Therefore, we will use this feature combination for another round of optimization with a different picture size (namely 75x75)
@@ -439,7 +482,6 @@ Results for RandomForestClassifier classifier on 100x100_standard images:
 - :mag: the figure shows the accuracy when all parameters are fixed to their best value except for the one for which the accuracy is plotted (both for train and dev set)
 - Confusion Matrix - No filters - best parameters | Confusion Matrix - HSV features - best parameters
-- 
GitLab
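For illustration, the HSV + Sobel feature combination from the table above can be sketched as follows. This is a standalone sketch under stated assumptions: a random array stands in for an already-converted 50x50 HSV image, and `sobel_magnitude` is a hypothetical helper (a manual 3x3 Sobel convolution), not the project's actual extraction code.

```python
import numpy as np

def sobel_magnitude(gray):
    # Hypothetical helper: Sobel gradient magnitude via manual 3x3 convolution
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T  # vertical-gradient kernel
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros(gray.shape)
    gy = np.zeros(gray.shape)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            window = padded[i:i + 3, j:j + 3]
            gx[i, j] = (window * kx).sum()
            gy[i, j] = (window * ky).sum()
    return np.hypot(gx, gy)

rng = np.random.default_rng(0)
img_hsv = rng.random((50, 50, 3))          # stand-in for a 50x50 HSV image
edges = sobel_magnitude(img_hsv[:, :, 2])  # Sobel on the value channel

# Flatten and concatenate into one feature vector per image
features = np.concatenate([img_hsv.ravel(), edges.ravel()])
print(features.shape)  # (10000,) = 7500 HSV values + 2500 edge values
```

The 7500 figure matches the "No filters (7500 Features)" row above: a 50x50 image with three channels flattens to 7500 values, and the single Sobel channel adds another 2500.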