diff --git a/project/data/README.md b/project/data/README.md
index 436db2dd2e3d8e16d05d2b3ff106373ff584d6e0..0bb33b184c0e5104985e34b435ddaac5e5645eb2 100644
--- a/project/data/README.md
+++ b/project/data/README.md
@@ -21,27 +21,8 @@ Dataset Acquisition:
 Extraction:
 - Once downloaded, extract the dataset into the data folder within the project's directory (`project/data/Fruit-262`).
 
-**Step 2: Running the Preprocessing Script**
+Your data is downloaded! :sparkles:
 
-Run the `fruit_dataset_splitter.py` script found [here](data_preprocessing/fruit_dataset_splitter.py). This will filter the dataset to the 30 selected fruit classes and divide the data into training, validation, and test sets. First create a virtual environment and install the required packages:
+Now you can proceed to preparing the data for training and evaluation.
 
-```bash
-cd data_preprocessing
-python3.11 -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
-```
-
-And then run the script:
-
-```bash
-python fruit_dataset_splitter.py
-```
-
-Your data is ready! :sparkles:
-
-To find out more about the data, running the [`fruit_dataset_analyze`](data_preprocessing/fruit_dataset_analyze.py) script generates a histogram and counts the datapoints:
-
-```bash
-python fruit_dataset_analyze.py
-```
\ No newline at end of file
+Next Step: [Data Preprocessing](data_preprocessing/README.md)
\ No newline at end of file
diff --git a/project/data/data_preprocessing/README.md b/project/data/data_preprocessing/README.md
index 023bbeb55d598aadf80fbde95c499d8a7bd04769..931095254c0a32bab946b78a797b0c57def0ed94 100644
--- a/project/data/data_preprocessing/README.md
+++ b/project/data/data_preprocessing/README.md
@@ -1,5 +1,16 @@
 # Data Preprocessing
 
+First create a virtual environment and install the required packages:
+
+```bash
+python3.11 -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
+### And then run the script:
+
+
 ## `fruit_dataset_splitter.py`
 
 The script `fruit_dataset_splitter.py` is designed to split an image dataset into training, development, and test subsets. It filters the dataset to include only specified fruit classes, and then randomly divides the images into the three subsets.
@@ -17,9 +28,10 @@ Each run of the script will:
 - Copy images into these directories, maintaining the class-wise structure.
 - Ensure reproducibility by using a fixed random state in splitting.
 
-This script is particularly useful for preparing datasets for machine learning tasks where a specific subset of classes is needed.
+---
+
 
----
+### Find out more about the data:
 
 ## `fruit_dataset_analyze.py`
 
@@ -41,5 +53,4 @@ Each run of the script will:
 The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.
 
----