diff --git a/project/data/data_preprocessing/README.md b/project/data/data_preprocessing/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..023bbeb55d598aadf80fbde95c499d8a7bd04769
--- /dev/null
+++ b/project/data/data_preprocessing/README.md
@@ -0,0 +1,45 @@
+# Data Preprocessing
+
+## `fruit_dataset_splitter.py`
+
+The script `fruit_dataset_splitter.py` is designed to split an image dataset into training, development, and test subsets. It filters the dataset to include only specified fruit classes, and then randomly divides the images into the three subsets.
+
+### 💻 Usage
+
+```bash
+python fruit_dataset_splitter.py
+```
+
+### 📊 Outputs
+
+Each run of the script will:
+- Create directories for training, development, and test subsets in the specified output directory.
+- Copy images into these directories, maintaining the class-wise structure.
+- Ensure reproducibility by using a fixed random state in splitting.
+
+This script is particularly useful for preparing datasets for machine learning tasks where a specific subset of classes is needed.
+
+---
+
+## `fruit_dataset_analyze.py`
+
+The script `fruit_dataset_analyze.py` analyzes a dataset of images, providing insights into the class distribution across training, development, and test subsets. It counts the number of images per class and visualizes this distribution in a histogram.
+
+### 💻 Usage
+
+```bash
+python fruit_dataset_analyze.py
+```
+
+### 📊 Outputs
+
+Each run of the script will:
+- Generate a DataFrame containing the counts of images per class across the entire dataset.
+- Save this DataFrame as a CSV file to `../class_counts.csv`.
+- Create a histogram showing the distribution of images across different classes in the dataset.
+- Save the histogram plot to `../../figures/class_distribution_histogram-2.png`.
+
+The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.
+
+---
+
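
The README describes the splitter's behavior (class filtering, a random three-way split, a fixed random state, class-wise output directories) but does not show how it is done. The sketch below is one possible way to get that behavior using only the standard library; it is not the code in `fruit_dataset_splitter.py`. The function names, the 80/10/10 ratios, the seed value, the subset directory names, and the accepted file suffixes are all assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path

# Hypothetical defaults: the README does not state the real ratios,
# subset names, or accepted image formats.
SPLIT_RATIOS = {"train": 0.8, "dev": 0.1, "test": 0.1}
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_files(files, ratios=SPLIT_RATIOS, seed=42):
    """Shuffle files with a fixed seed and cut them into named subsets."""
    ordered = sorted(files)               # sort first so input order cannot change the result
    random.Random(seed).shuffle(ordered)  # private RNG: same seed gives the same shuffle
    names = list(ratios)
    subsets, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(ordered)            # last subset takes whatever remains
        else:
            end = start + int(len(ordered) * ratios[name])
        subsets[name] = ordered[start:end]
        start = end
    return subsets


def split_dataset(source_dir, output_dir, classes, ratios=SPLIT_RATIOS, seed=42):
    """Copy images of the selected classes into <output_dir>/<subset>/<class>/."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    counts = {}
    for class_name in classes:            # classes not listed here are skipped entirely
        images = [
            p for p in (source_dir / class_name).iterdir()
            if p.suffix.lower() in IMAGE_SUFFIXES
        ]
        for subset, members in split_files(images, ratios, seed).items():
            target = output_dir / subset / class_name
            target.mkdir(parents=True, exist_ok=True)
            for image in members:
                shutil.copy2(image, target / image.name)
            counts[(subset, class_name)] = len(members)
    return counts
```

A call such as `split_dataset("raw_images", "split_images", ["apple", "banana"])` (placeholder paths and class names) would return a dictionary of per-subset, per-class image counts. Sorting before shuffling is a deliberate choice here: directory listing order can differ between file systems, so the seed alone would not make the split repeatable.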