Update data preprocessing instructions

Commit 97bf4eea authored 1 year ago by F1nnH
--- a/project/data/README.md
+++ b/project/data/README.md
@@ -21,27 +21,8 @@ Dataset Acquisition:
 Extraction: 
 - Once downloaded, extract the dataset into the data folder within the project's directory (`project/data/Fruit-262`).

-**Step 2: Running the Preprocessing Script**
+Your data is downloaded! :sparkles:

-Run the `fruit_dataset_splitter.py` script found [here](data_preprocessing/fruit_dataset_splitter.py). This will filter the dataset to the 30 selected fruit classes and divide the data into training, validation, and test sets. First create a virtual environment and install the required packages:
+Now you can proceed to preparing the data for training and evaluation. 

-```bash
-cd data_preprocessing
-python3.11 -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
-```
-
-And then run the script:
-
-```bash
-python fruit_dataset_splitter.py
-```
-
-Your data is ready! :sparkles:
-
-To find out more about the data, running the [`fruit_dataset_analyze`](data_preprocessing/fruit_dataset_analyze.py) script generates a histogram and counts the datapoints:
-
-```bash
-python fruit_dataset_analyze.py
-```
\ No newline at end of file
+Next Step: [Data Preprocessing](data_preprocessing/README.md)
\ No newline at end of file
--- a/project/data/data_preprocessing/README.md
+++ b/project/data/data_preprocessing/README.md
 # Data Preprocessing

+First create a virtual environment and install the required packages:
+
+```bash
+python3.11 -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
+### Then run the scripts described below:
+
+
 ## `fruit_dataset_splitter.py`

 The script `fruit_dataset_splitter.py` is designed to split an image dataset into training, development, and test subsets. It filters the dataset to include only specified fruit classes, and then randomly divides the images into the three subsets.
@@ -17,9 +28,10 @@ Each run of the script will:
 - Copy images into these directories, maintaining the class-wise structure.
 - Ensure reproducibility by using a fixed random state in splitting.

-This script is particularly useful for preparing datasets for machine learning tasks where a specific subset of classes is needed.
+---
+

--- 
+### Find out more about the data:

 ## `fruit_dataset_analyze.py`

@@ -41,5 +53,4 @@ Each run of the script will:

 The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.

---
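
The splitter behaviour described above for `fruit_dataset_splitter.py` (keep only the selected classes, divide the images randomly into three subsets, copy them while preserving the class folders, and fix the random state) can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `split_dataset`, the 80/10/10 ratios, the subset folder names, and the seed value are all assumptions, and it uses the standard library's `random.Random` where the real script may rely on a different splitting utility.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src, dst, classes, ratios=(0.8, 0.1, 0.1), seed=42):
    """Copy images of the selected classes into train/dev/test folders.

    Hypothetical helper: names, ratios, and seed are illustrative only.
    Returns a dict mapping (subset, class) to the number of copied files.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    counts = {}
    for cls in classes:  # classes not listed here are simply skipped
        # Sort first so the shuffle starts from the same order on every run.
        files = sorted(p for p in (Path(src) / cls).iterdir() if p.is_file())
        rng.shuffle(files)
        n_train = int(len(files) * ratios[0])
        n_dev = int(len(files) * ratios[1])
        subsets = {
            "train": files[:n_train],
            "dev": files[n_train:n_train + n_dev],
            "test": files[n_train + n_dev:],  # remainder goes to test
        }
        for name, members in subsets.items():
            # Keep the class-wise structure: <dst>/<subset>/<class>/<file>
            out_dir = Path(dst) / name / cls
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in members:
                shutil.copy2(f, out_dir / f.name)
            counts[(name, cls)] = len(members)
    return counts
```

Calling `split_dataset("Fruit-262", "splits", ["apple", "banana"])` (with hypothetical class names) would leave the source folder untouched and return per-subset counts, which is also a quick way to sanity-check the class balance before training.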