From 97bf4eeab5e974af363bc622e9dbe8e62f3b086e Mon Sep 17 00:00:00 2001
From: F1nnH <finn@hillengass.de>
Date: Fri, 23 Feb 2024 16:33:06 +0100
Subject: [PATCH] Update data preprocessing instructions

---
 project/data/README.md                    | 25 +++--------------------
 project/data/data_preprocessing/README.md | 17 ++++++++++++---
 2 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/project/data/README.md b/project/data/README.md
index 436db2d..0bb33b1 100644
--- a/project/data/README.md
+++ b/project/data/README.md
@@ -21,27 +21,8 @@ Dataset Acquisition:
 Extraction:
 - Once downloaded, extract the dataset into the data folder within the project's directory (`project/data/Fruit-262`).
 
-**Step 2: Running the Preprocessing Script**
+Your data is downloaded! :sparkles:
 
-Run the `fruit_dataset_splitter.py` script found [here](data_preprocessing/fruit_dataset_splitter.py). This will filter the dataset to the 30 selected fruit classes and divide the data into training, validation, and test sets. First create a virtual environment and install the required packages:
+Now you can proceed to preparing the data for training and evaluation.
 
-```bash
-cd data_preprocessing
-python3.11 -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
-```
-
-And then run the script:
-
-```bash
-python fruit_dataset_splitter.py
-```
-
-Your data is ready! :sparkles:
-
-To find out more about the data, running the [`fruit_dataset_analyze`](data_preprocessing/fruit_dataset_analyze.py) script generates a histogram and counts the datapoints:
-
-```bash
-python fruit_dataset_analyze.py
-```
\ No newline at end of file
+Next Step: [Data Preprocessing](data_preprocessing/README.md)
\ No newline at end of file
diff --git a/project/data/data_preprocessing/README.md b/project/data/data_preprocessing/README.md
index 023bbeb..9310952 100644
--- a/project/data/data_preprocessing/README.md
+++ b/project/data/data_preprocessing/README.md
@@ -1,5 +1,16 @@
 # Data Preprocessing
 
+First create a virtual environment and install the required packages:
+
+```bash
+python3.11 -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
+### And then run the script:
+
+
 ## `fruit_dataset_splitter.py`
 
 The script `fruit_dataset_splitter.py` is designed to split an image dataset into training, development, and test subsets. It filters the dataset to include only specified fruit classes, and then randomly divides the images into the three subsets.
@@ -17,9 +28,10 @@ Each run of the script will:
 - Copy images into these directories, maintaining the class-wise structure.
 - Ensure reproducibility by using a fixed random state in splitting.
 
-This script is particularly useful for preparing datasets for machine learning tasks where a specific subset of classes is needed.
+---
+
 
----
+### Find out more about the data:
 
 ## `fruit_dataset_analyze.py`
 
@@ -41,5 +53,4 @@ Each run of the script will:
 
 The script will also print the DataFrame, the total number of images in the dataset, and the file path of the saved histogram.
 
----
 
-- 
GitLab