Commit 97bf4eea authored by F1nnH

Update data preprocessing instructions

parent 93605eea
@@ -21,27 +21,8 @@ Dataset Acquisition:
Extraction:
- Once downloaded, extract the dataset into the data folder within the project's directory (`project/data/Fruit-262`).
**Step 2: Running the Preprocessing Script**
Your data is downloaded! :sparkles:
Run the `fruit_dataset_splitter.py` script found [here](data_preprocessing/fruit_dataset_splitter.py). This will filter the dataset to the 30 selected fruit classes and divide the data into training, validation, and test sets. First create a virtual environment and install the required packages:
Now you can proceed to preparing the data for training and evaluation.
```bash
cd data_preprocessing
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
And then run the script:
```bash
python fruit_dataset_splitter.py
```
Your data is ready! :sparkles:
To find out more about the data, run the [`fruit_dataset_analyze`](data_preprocessing/fruit_dataset_analyze.py) script, which generates a histogram and counts the data points:
```bash
python fruit_dataset_analyze.py
```
\ No newline at end of file
Next Step: [Data Preprocessing](data_preprocessing/README.md)
\ No newline at end of file
# Data Preprocessing
First create a virtual environment and install the required packages:
```bash
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### And then run the script:
## `fruit_dataset_splitter.py`
The `fruit_dataset_splitter.py` script splits an image dataset into training, development, and test subsets. It first filters the dataset down to the specified fruit classes, then randomly divides the images of each class among the three subsets.
@@ -17,9 +28,10 @@ Each run of the script will:
- Copy images into these directories, maintaining the class-wise structure.
- Ensure reproducibility by using a fixed random state in splitting.
This script is particularly useful for preparing datasets for machine learning tasks where a specific subset of classes is needed.
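
The steps listed above (filter to selected classes, split with a fixed random state, copy while keeping the class-wise folders) can be sketched as follows. This is a minimal illustration using only the standard library, not the actual contents of `fruit_dataset_splitter.py`: the function names, the 80/10/10 ratios, the seed value, and the `train`/`dev`/`test` folder names are all assumptions.

```python
import random
import shutil
from pathlib import Path

# Hypothetical defaults: the ratios, seed, and subset folder names actually
# used by fruit_dataset_splitter.py are not shown in this document.
SPLIT_RATIOS = {"train": 0.8, "dev": 0.1, "test": 0.1}
SEED = 42


def split_files(files, ratios=SPLIT_RATIOS, seed=SEED):
    """Shuffle files with a fixed seed and cut them into the named subsets."""
    files = sorted(files)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(files)
    names = list(ratios)
    subsets, start = {}, 0
    for name in names[:-1]:
        end = start + int(len(files) * ratios[name])
        subsets[name] = files[start:end]
        start = end
    subsets[names[-1]] = files[start:]  # the last subset takes the remainder
    return subsets


def split_dataset(source_dir, target_dir, classes, ratios=SPLIT_RATIOS, seed=SEED):
    """Copy the selected class folders into <target_dir>/<subset>/<class>/."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    for class_name in classes:  # classes not listed here are skipped entirely
        images = [p for p in (source_dir / class_name).iterdir() if p.is_file()]
        for subset, paths in split_files(images, ratios, seed).items():
            out_dir = target_dir / subset / class_name
            out_dir.mkdir(parents=True, exist_ok=True)
            for path in paths:
                shutil.copy2(path, out_dir / path.name)
```

Sorting before shuffling with a seeded `random.Random` instance is what makes repeated runs produce the same split, regardless of the order in which the file system lists the images.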
---
---
### Find out more about the data:
## `fruit_dataset_analyze.py`
@@ -41,5 +53,4 @@ Each run of the script will:
The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.
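
The counting part of such an analysis could look roughly like the sketch below. It is a standard-library illustration only: it leaves out the DataFrame and the histogram, and the function names, the one-folder-per-class layout, and the set of image suffixes are assumptions rather than details taken from `fruit_dataset_analyze.py`.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; the real script may use a different filter.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(dataset_dir):
    """Return a Counter mapping each class folder name to its image count."""
    counts = Counter()
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def print_summary(counts):
    """Print one line per class followed by the total number of images."""
    for name, n in sorted(counts.items()):
        print(f"{name:<20}{n}")
    print(f"Total images: {sum(counts.values())}")
```

The resulting counts can be passed to a plotting library to draw the per-class histogram.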
---