- Once downloaded, extract the dataset into the data folder within the project's directory (`project/data/Fruit-262`).
Your data is downloaded! :sparkles:
Now you can proceed to preparing the data for training and evaluation.

**Step 2: Running the Preprocessing Script**

Run the `fruit_dataset_splitter.py` script found [here](data_preprocessing/fruit_dataset_splitter.py). It filters the dataset to the 30 selected fruit classes and divides the data into training, validation, and test sets. First, create a virtual environment and install the required packages:
```bash
cd data_preprocessing
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
And then run the script:
```bash
python fruit_dataset_splitter.py
```
Your data is ready! :sparkles:
To find out more about the data, run the [`fruit_dataset_analyze.py`](data_preprocessing/fruit_dataset_analyze.py) script, which generates a histogram and counts the datapoints:
```bash
python fruit_dataset_analyze.py
```
Next Step: [Data Preprocessing](data_preprocessing/README.md)
First create a virtual environment and install the required packages:
```bash
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### And then run the script:
## `fruit_dataset_splitter.py`
The script `fruit_dataset_splitter.py` is designed to split an image dataset into training, development, and test subsets. It filters the dataset to include only specified fruit classes, and then randomly divides the images into the three subsets.
...
...
Each run of the script will:
- Copy images into these directories, maintaining the class-wise structure.
- Ensure reproducibility by using a fixed random state in splitting.
This script is particularly useful for preparing datasets for machine learning tasks where a specific subset of classes is needed.
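
The behaviour described above (filter to selected classes, split reproducibly, copy into class-wise folders) can be sketched roughly as follows. This is a minimal illustration, not the code in `fruit_dataset_splitter.py`: the function name `split_dataset`, the 80/10/10 ratios, the `train`/`dev`/`test` folder names, and the seed value are all assumptions, and it uses a seeded shuffle from the standard library in place of whatever splitting utility the script actually relies on.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src, dst, classes, ratios=(0.8, 0.1, 0.1), seed=42):
    """Copy images of the selected classes into dst/{train,dev,test}/<class>/.

    Hypothetical helper: ratios, folder names, and seed are illustrative only.
    """
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)  # fixed seed, so every run yields the same split
    counts = {}
    for cls in sorted(classes):  # classes not listed here are simply skipped
        # Sort first so the shuffle always starts from the same order.
        images = sorted(p for p in (src / cls).iterdir() if p.is_file())
        rng.shuffle(images)
        n_train = int(len(images) * ratios[0])
        n_dev = int(len(images) * ratios[1])
        subsets = {
            "train": images[:n_train],
            "dev": images[n_train:n_train + n_dev],
            "test": images[n_train + n_dev:],  # remainder goes to test
        }
        for name, files in subsets.items():
            out_dir = dst / name / cls  # keep the class-wise structure
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out_dir / f.name)
        counts[cls] = {name: len(files) for name, files in subsets.items()}
    return counts
```

With a class list of your choosing, a call such as `split_dataset("data/Fruit-262", "data/split", ["apple", "banana"])` would return the number of images copied per class and subset; the output path and class names here are placeholders.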
---
### Find out more about the data:
## `fruit_dataset_analyze.py`
...
...
The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.
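
To give a feel for the kind of per-class count the analysis reports, here is a small standard-library sketch. It is not the code in `fruit_dataset_analyze.py`: it prints a text bar chart instead of building a DataFrame and saving a histogram image, and the names `count_images`, `text_histogram`, and `IMAGE_SUFFIXES` are hypothetical, as is the assumption that each class lives in its own subfolder.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def count_images(root):
    """Return a Counter mapping each class folder under root to its image count."""
    counts = Counter()
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def text_histogram(counts, width=40):
    """Render one text bar per class, scaled so the largest class fills `width`."""
    largest = max(counts.values(), default=0)
    lines = []
    for name, n in sorted(counts.items()):
        bar = "#" * (n * width // largest if largest else 0)
        lines.append(f"{name:<15} {n:>6} {bar}")
    lines.append(f"total images: {sum(counts.values())}")
    return "\n".join(lines)
```

Running `print(text_histogram(count_images("data/Fruit-262")))` from the project directory would list each class with its count and a proportional bar, followed by the total.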