Commit b456e676 authored by F1nnH

Add README for preprocessing scripts

parent df7dfea8
# Data Preprocessing
## `fruit_dataset_splitter.py`
The script `fruit_dataset_splitter.py` splits an image dataset into training, development, and test subsets. It first filters the dataset to the specified fruit classes, then randomly assigns the remaining images to the three subsets.
### 💻 Usage
```bash
python fruit_dataset_splitter.py
```
### 📊 Outputs
Each run of the script will:
- Create directories for training, development, and test subsets in the specified output directory.
- Copy images into these directories, maintaining the class-wise structure.
- Use a fixed random state for the split, so repeated runs produce the same subsets.

This script is particularly useful for preparing datasets for machine learning tasks where only a specific subset of classes is needed.
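
The steps above can be illustrated with a minimal sketch. It is not the code of `fruit_dataset_splitter.py`: the function name `split_dataset`, the subset folder names `train`/`dev`/`test`, the 80/10/10 ratios, and the seed value are all assumptions made for illustration, and the standard library's seeded `random.Random` stands in for whatever fixed random state the script actually uses.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src_dir, out_dir, classes, ratios=(0.8, 0.1, 0.1), seed=42):
    """Copy images of the selected classes into train/dev/test folders.

    Hypothetical helper: assumes src_dir contains one folder per class.
    Returns a dict mapping (subset, class) to the number of copied files.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    rng = random.Random(seed)  # fixed seed, so the split is reproducible
    counts = {}
    for cls in classes:  # classes not listed here are ignored entirely
        # Sort first so the shuffle does not depend on filesystem order.
        images = sorted(p for p in (src_dir / cls).iterdir() if p.is_file())
        rng.shuffle(images)
        n_train = int(len(images) * ratios[0])
        n_dev = int(len(images) * ratios[1])
        subsets = {
            "train": images[:n_train],
            "dev": images[n_train:n_train + n_dev],
            "test": images[n_train + n_dev:],  # whatever remains
        }
        for name, files in subsets.items():
            # Keep the class-wise structure: <out_dir>/<subset>/<class>/
            target = out_dir / name / cls
            target.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, target / f.name)
            counts[(name, cls)] = len(files)
    return counts
```

A call such as `split_dataset("raw", "split", ["apple", "kiwi"])` (both paths and class names are placeholders) would create `split/train/apple`, `split/dev/apple`, and so on, and would produce the same assignment on every run with the same seed and input files.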
---
## `fruit_dataset_analyze.py`
The script `fruit_dataset_analyze.py` analyzes a dataset of images, providing insights into the class distribution across training, development, and test subsets. It counts the number of images per class and visualizes this distribution in a histogram.
### 💻 Usage
```bash
python fruit_dataset_analyze.py
```
### 📊 Outputs
Each run of the script will:
- Generate a DataFrame containing the counts of images per class across the entire dataset.
- Save this DataFrame as a CSV file to `../class_counts.csv`.
- Create a histogram showing the distribution of images across different classes in the dataset.
- Save the histogram plot to `../../figures/class_distribution_histogram-2.png`.

The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.
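
The counting step can be sketched as follows. This is an illustration under assumptions, not the code of `fruit_dataset_analyze.py`: the helper names `count_images` and `save_counts`, the subset folder names, and the CSV column names are hypothetical, and the sketch uses only the standard library, so it writes the CSV directly and leaves out the DataFrame and the histogram.

```python
import csv
from collections import Counter
from pathlib import Path

SUBSETS = ("train", "dev", "test")  # assumed subset folder names


def count_images(dataset_dir, subsets=SUBSETS):
    """Count files per class, summed over all subsets.

    Hypothetical helper: assumes a <dataset_dir>/<subset>/<class>/ layout.
    """
    counts = Counter()
    for subset in subsets:
        subset_dir = Path(dataset_dir) / subset
        if not subset_dir.is_dir():
            continue  # skip subsets that do not exist
        for class_dir in subset_dir.iterdir():
            if class_dir.is_dir():
                counts[class_dir.name] += sum(
                    1 for p in class_dir.iterdir() if p.is_file()
                )
    return counts


def save_counts(counts, csv_path):
    """Write one row per class, sorted by class name."""
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["class", "count"])  # assumed column names
        for cls, n in sorted(counts.items()):
            writer.writerow([cls, n])
```

The total number of images is then `sum(counts.values())`. The same rows could be loaded into a `pandas.DataFrame` and plotted, which is closer to what the outputs listed above describe.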
---