From e115e56242c26fb53dca6b3ee6214f946fd9a5c7 Mon Sep 17 00:00:00 2001
From: igraf <igraf@cl.uni-heidelberg.de>
Date: Fri, 23 Feb 2024 19:09:08 +0100
Subject: [PATCH] Update baseline part

---
 project/README.md | 54 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 16 deletions(-)

diff --git a/project/README.md b/project/README.md
index ffeab3a..14f9945 100644
--- a/project/README.md
+++ b/project/README.md
@@ -52,7 +52,7 @@ In our project, we have focused on **30** specific **classes** out of the 262 av
 The original dataset lacks a predefined split into training, development (validation), and testing sets. To tailor our dataset for effective model training and evaluation, we implemented a custom script that methodically divides the dataset into specific proportions.

 <figure>
-<img align="left" src="figures/dataset_split.png" alt= "Dataset Split" width="45%" height="auto">
+<img align="left" src="figures/dataset_split.png" alt= "Dataset Split" width="40%" height="auto">
 </figure>

@@ -76,13 +76,13 @@ The data partitioning script randomly segregates the images for each fruit class
 ↪️ **To prepare the dataset for using it with this project please refer to the Data Preparation section in the [data folder](data/README.md).**

----
+
 ### Insights

 The dataset is much cleaner than other fruit datasets we found on Kaggle, which often contain images of cooked food or fruit juices.

-<img align="right" width="60%" height="auto" src="figures/examples_from_dataset/bananas-different-stages.png" title="test">
+<img align="right" width="35%" height="auto" src="figures/examples_from_dataset/bananas-different-stages.png" title="test">

 The images in the dataset depict fruits in
 - varying stages of their life cycle
@@ -92,7 +92,6 @@ The images in the dataset depict fruits in

 This diversity in the dataset is beneficial for our project, as it allows our models to learn from a wide range of fruit images, making them more robust and adaptable to different real-world scenarios.

----

 ### Data Distribution Across Classes

@@ -127,15 +126,17 @@ We are setting our focus on the **accuracy** metric. The accuracy is a suitable
 ## Baseline

 ### Overview
-We have implemented two types of baseline models: Random and Majority. These are implemented both as custom models and using scikit-learn's `DummyClassifier`. Our dataset involves classifying one out of 30 classes, with a balanced dataset of about 26,500 data points.
+We have used two types of baseline models: **random** and **majority**. These are implemented both as custom models and with scikit-learn's `DummyClassifier` for validation.
+
+Our task involves predicting one out of 30 classes, with a mostly balanced dataset of about 29,500 data points. The random baseline therefore predicts a random class, while the majority baseline always predicts the most frequent class, which is apple in our case.

-It's noteworthy that the performance metrics for our baseline models are **consistent** across both the training and test sets. This uniformity suggests that our data splits are well-balanced and representative, reducing the likelihood of biased or skewed results due to data split anomalies.
+It's noteworthy that the performance metrics for our baseline models are **consistent** across the training, development, and test sets. This uniformity suggests well-balanced and representative data splits, reducing the likelihood of biased or skewed results due to data split irregularities.

-If you want to reproduce the results, please run the script [`baselines.py`](baselines.py).
+↪️ If you want to reproduce the results, please run the script [`classify_with_baseline.py`](src/classify_with_baseline.py).

 ### Results Table
-The following table summarizes the performance of different baseline models on the **test** set:
+The following table summarizes the performance of the different baseline models on the **test** set:

 | Baseline Model | Accuracy | Macro Precision | Macro Recall | Macro F1-Score | Micro Precision | Micro Recall | Micro F1-Score |
 |--------------------------------|--------------|-----------------|----------------|-----------------|--------------|----------------|----------------|
@@ -144,22 +145,43 @@ The following table summarizes the performance of different baseline models on t
 | Majority Baseline (Custom) | 0.041 | 0.001 | 0.033 | 0.003 | 0.041 | 0.041 | 0.041 |
 | Majority Baseline (Sklearn) | 0.041 | 0.001 | 0.033 | 0.003 | 0.041 | 0.041 | 0.041 |

+And here are the results for **all dataset splits**:

-### Random Baseline (Custom & Scikit-learn):
-- Macro Average Precision, Recall, F1-Score around 0.033-0.034: These scores are consistent with what you'd expect from a random classifier in a balanced multi-class setting. With 30 classes, a random guess would be correct about 1/30 times, or approximately 0.033. The consistency in results between our custom implementation and scikit-learn's version reinforces the correctness of our implementation.
-- Micro Average Precision, Recall, F1-Score around 0.034-0.035: Micro averages aggregate the contributions of all classes to compute the average metric. In a balanced dataset, micro and macro averages tend to be similar, as seen here.
+### Random Baseline
+
+<details> <summary> Macro Average Precision, Recall, F1-Score around 0.033-0.034 </summary>
+
+- These scores are consistent with what you'd expect from a random classifier in a balanced multi-class setting. With 30 classes, a random guess would be correct about 1/30 times, or approximately 0.033. The consistency in results between our custom implementation and scikit-learn's version reinforces the correctness of our implementation.
+</details>
+
+<details> <summary> Micro Average Precision, Recall, F1-Score around 0.034-0.035 </summary>
+
+- Micro averages aggregate the contributions of all classes to compute the average metric. In a balanced dataset, micro and macro averages tend to be similar, as seen here.
+</details>
+
+### Majority Baseline
+
+<details> <summary> Macro Average Precision, Recall, F1-Score around 0.001, 0.033, 0.003 </summary>
+
+- The macro precision is particularly low because the majority classifier always predicts the same class (= apple): apple's own precision is only about 0.04, the share of apples in the data, while the 29 classes that are never predicted each contribute a precision of 0, so the macro average drops to roughly 0.04/30 ≈ 0.001. The macro recall stays at 1/30 ≈ 0.033 because apple is recalled perfectly while every other class has a recall of 0.
+</details>

-### Majority Baseline (Custom & Scikit-learn)
-- Macro Average Precision, Recall, F1-Score around 0.001, 0.033, 0.003: The precision here is particularly low because the majority classifier always predicts the same class. In a balanced dataset with 30 classes, this means it will be correct only 1/30 times, but precision penalizes it for the other 29/30 times it is incorrect. Recall remains constant as it's just the hit rate of the single majority class.
-- Micro Average Precision, Recall, F1-Score around 0.041: The micro average is slightly higher because it accounts for the overall success rate across all classes. Since one class is always predicted, its success rate dominates this calculation.
+<details> <summary> Micro Average Precision, Recall, F1-Score around 0.041 </summary>
+
+- The micro average is slightly higher because it reflects the overall success rate across all instances: the always-predicted class (apple) makes up about 4.1% of the data, slightly more than 1/30, so the micro scores equal that share.
+</details>

 ### Additional Interpretations
-- Performance Lower Than Random Baseline: If a machine learning model performs worse than the random baseline, it suggests that the model is not learning effectively from the data. It could be due to several factors like poor feature selection, overfitting, or an issue with the training process.
+- Performance lower than random baseline: If a machine learning model performs worse than the random baseline, it suggests that the model is not learning effectively from the data. This could be due to several factors such as poor feature selection, overfitting, or an issue with the training process.

-- Performance Lower Than Majority Baseline: This scenario is more alarming because the majority baseline is a very naive model. If a model performs worse than this, it might indicate that the model is doing worse than a naive guess of the most frequent class. This could be due to incorrect model architecture, data preprocessing errors, or other significant issues in the training pipeline.
+- Performance lower than majority baseline: This scenario is even more alarming because the majority baseline is a very naive model. If a model performs worse than this, it is doing worse than a naive guess of the most frequent class, which could point to an incorrect model architecture, data preprocessing errors, or other significant issues in the training pipeline.

 ## Classifiers
--
GitLab
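
Below is a minimal, self-contained sketch of how the random and majority baselines described in the patch can be scored with scikit-learn's `DummyClassifier` and the same accuracy and macro/micro metrics. It is illustrative only: the label arrays are synthetic stand-ins, and it is not the project's [`classify_with_baseline.py`](src/classify_with_baseline.py) script.

```python
# Sketch: random ("uniform") and majority ("most_frequent") baselines scored
# with scikit-learn, mirroring the metrics reported in the results table.
# The labels below are synthetic stand-ins for the 30 fruit classes; the real
# project loads its own train/dev/test splits.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

rng = np.random.default_rng(42)
classes = [f"class_{i:02d}" for i in range(30)]
y_train = rng.choice(classes, size=20_000)   # stand-in training labels
y_test = rng.choice(classes, size=4_000)     # stand-in test labels

# DummyClassifier ignores the features, so zero placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

for name, strategy in [("Random", "uniform"), ("Majority", "most_frequent")]:
    clf = DummyClassifier(strategy=strategy, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    macro = precision_recall_fscore_support(y_test, y_pred, average="macro", zero_division=0)
    micro = precision_recall_fscore_support(y_test, y_pred, average="micro", zero_division=0)

    print(f"{name} baseline: accuracy={acc:.3f}, "
          f"macro P/R/F1={macro[0]:.3f}/{macro[1]:.3f}/{macro[2]:.3f}, "
          f"micro P/R/F1={micro[0]:.3f}/{micro[1]:.3f}/{micro[2]:.3f}")
```

With uniformly drawn stand-in labels both baselines land near 1/30 ≈ 0.033; on the project's slightly imbalanced splits the majority baseline instead scores the apple share of about 0.041.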