diff --git a/data/README.md b/data/README.md
index c2d57ba7bc65b37dbde505e8afae2944953744e9..172a3eede485c359875ce286a69baea461aa897e 100644
--- a/data/README.md
+++ b/data/README.md
@@ -1,13 +1,10 @@
 # Data Folder README
 ## Overview
-This folder contains the original and supplementary data used in our project. It includes a diverse set of resources ranging from dialogue datasets to geographic information, and custom-generated content. The data is organized into distinct sub-folders, each with its own specific README for detailed descriptions.
+This folder contains the original and supplementary data used in our project. It includes the MultiWOZ dialogue dataset, the content of our self-crafted database, and all custom-generated dialogues and annotations. The data is organized into distinct sub-folders, each with its own specific README for detailed descriptions.
 ## Contents
 **multiwoz/** - Contains the MultiWOZ dialogue dataset, a rich source of conversational exchanges across multiple domains.
-**osm/** - Houses data extracted from OpenStreetMap. This data provides geographic and locational context relevant to our project. (Refer to the separate README in the osm/ folder for more details).
+**osm/** - Data extracted from OpenStreetMap. This data provides the content of our own database.
-**own_data/** - Includes dialogues and domain knowledge created by our team, specifically tailored for our project's needs. (Refer to the separate README in the own_data/ folder for more details).
-
-## MultiWOZ Dialogue Dataset
-The MultiWOZ dataset is a large-scale, multi-domain Wizard-of-Oz style dataset. It's widely used for training and evaluating dialogue systems. This dataset includes conversations spanning various topics and scenarios.
\ No newline at end of file
+**own_data/** - Includes domain knowledge, dialogues, and annotations created by us. These are the results of our one-shot and multiagent approaches.
diff --git a/data/osm/README.md b/data/osm/README.md
index 6c349017649ab9846452829da4d4220cd0f5e472..b631aced2da9ee07aa4cf0dc4b3a0ed354af2ca9 100644
--- a/data/osm/README.md
+++ b/data/osm/README.md
@@ -14,7 +14,7 @@ This folder contains OpenStreetMap (OSM) data specifically curated for Heidelber
 - attractions/ - Unannotated data for attractions.
 ## Usage
-The data in these folders populate a database integral to our dialogue system. This structured information allows for the generation of contextually rich and accurate dialogues, leveraging real-world data about Heidelberg.
+The data in these folders populates a database integral to our dialogue system. This structured information is used to generate and retrieve facts for dialogue creation, based on real-world data about Heidelberg.
 ## Data Format
 Each JSON file in this folder corresponds to different aspects (restaurants, hotels, attractions) of Heidelberg. The files are structured to include various data points, such as names, locations, and other unique characteristics relevant to the category.
diff --git a/data/own_data/README.md b/data/own_data/README.md
index 5b3891913508ab2d3595b7497fd697eedf245d1a..ac19cd0c861ec35c464301b447ccbd31441e1312 100644
--- a/data/own_data/README.md
+++ b/data/own_data/README.md
@@ -1,18 +1,7 @@
 # Own Data
-## Overview
-This folder, own_data, is a central repository for our project-specific data. It includes two primary types of content: custom-generated dialogues and domain knowledge, both derived and reformatted from the MultiWOZ dataset.
-This data is essential for our dialogue system, providing unique content and context that enhance its functionality.
-## Contents
-generated_dialogues/ - Contains dialogues generated by our team. These dialogues are crafted to simulate realistic and diverse conversational scenarios.
-domain_knowledge/ - Includes structured domain knowledge extracted and reformatted from the MultiWOZ dataset. This knowledge base covers various subjects and is pivotal for our dialogue system's contextual understanding.
-Generated Dialogues
-The generated_dialogues folder contains dialogues created by our team. These dialogues are designed to reflect a wide range of conversational contexts and styles, aiding the system in handling diverse interactions.
-
-## Usage
-These dialogues are utilized for training and fine-tuning our dialogue system. They provide practical examples of conversational flows, helping to improve the system's ability to engage in natural and contextually relevant discussions.
+This folder contains all data created by us. These are the results of our one-shot and multiagent approaches.
-## Domain Knowledge
-The domain_knowledge folder houses a curated set of information, reformatted and extracted from the MultiWOZ dataset. This knowledge base encompasses various domains, offering a rich source of facts and details.
-This domain knowledge is crucial for the dialogue system to provide accurate and relevant information in conversations. It enhances the system's capability to understand and respond to queries based on real-world data and scenarios.
-
-## Data Format
+## Contents
+- one-shot/ - Contains the domain knowledge, dialogues, and annotations created by the one-shot approach.
+- multiagent/ - (TODO)
\ No newline at end of file
diff --git a/data/own_data/one-shot/README.md b/data/own_data/one-shot/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f0d838e9dc3484fab9756085101ef16cf84850dd
--- /dev/null
+++ b/data/own_data/one-shot/README.md
@@ -0,0 +1,12 @@
+# one-shot data
+
+This folder contains the domain knowledge, dialogues, and annotations created by the one-shot approach.
+
+## Contents
+- domain_knowledge/ - Contains the domain knowledge extracted from the MultiWOZ dataset and used to give context to the dialogues.
+- dialogues/ - Contains the dialogues created by the one-shot approach.
+- annotations/ - Contains the annotations created for the dialogues.
+- dst_manual_annotations/ - Contains manual annotations for some dialogues, used to check the correctness of the annotation generation.
+
+## Data Format
+The domain knowledge, the dialogues, and the annotations are all stored as DataFrames and saved as CSV files. The DataFrames are structured to include data points such as dialogue IDs, the original dialogues from the MultiWOZ dataset, the extracted domain knowledge, the generated dialogues, and the generated annotations. This format allows for easy access and manipulation of the data and keeps it organized and consistent. A minimal loading example is sketched below.
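+
+The following sketch shows one way these CSV files could be loaded and inspected with pandas; the file name used here is an illustrative assumption, not guaranteed to match the exact file on disk:
+
+```python
+import pandas as pd
+
+# Load one of the generated-dialogue CSV files (file name is an assumption).
+df = pd.read_csv("dialogues/output_dialogues_100.csv")
+
+# List the available columns, e.g. dialogue IDs and generated dialogues.
+print(df.columns.tolist())
+
+# Show the first entry, with columns printed as rows for readability.
+print(df.head(1).T)
+```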
\ No newline at end of file
diff --git a/src/README.md b/src/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..78545b8067946a77781a56b8f66cc18bfbee40ff
--- /dev/null
+++ b/src/README.md
@@ -0,0 +1,6 @@
+# Scripts
+
+This directory is organized into two main approaches for handling dialogues and annotations, each located in its respective subfolder: one-shot and multiagent.
+
+## ✔ Setup and Usage
+Each script in the respective folders comes with its own detailed usage instructions and requirements. Please refer to the individual documentation within both folders for specific setup and execution details.
\ No newline at end of file
diff --git a/src/one-shot/README.md b/src/one-shot/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..32134082b385555613dd0dd650cde2a47aede1a6
--- /dev/null
+++ b/src/one-shot/README.md
@@ -0,0 +1,110 @@
+# Scripts one-shot approach
+
+[[_TOC_]]
+
+## 📠 Overview
+
+This directory contains the scripts used for the one-shot approach of the project:
+
+- `extract_domain_knowledge.py` - Extracts domain knowledge from the MultiWOZ data and saves it to a DataFrame.
+- `generate_dialogues.py` - Generates dialogues from the domain knowledge and saves them to a DataFrame.
+- `generate_annotations.py` - Generates annotations for the dialogues and saves them to a DataFrame.
+  - `schema.py` - Contains the schema for the annotations.
+
+## ✔ Setup
+
+Tested with Python 3.9 (other versions might work as well).
+- Create a virtual environment and install the packages from the requirements file:
+
+  ```bash
+  python -m venv .venv
+  source .venv/bin/activate
+  pip install -r requirements.txt
+  ```
+
+- The generation scripts (dialogues and annotations) require a running API. We are using vLLM with outlines. The API can be started with the following command:
+
+  ```bash
+  python -m outlines.serve.serve --model <model_name>
+  ```
+
+  For more detailed information on the API, please refer to the [outlines documentation](https://outlines-dev.github.io/outlines/reference/vllm/).
+
+## `extract_domain_knowledge.py`
+
+The script `extract_domain_knowledge.py` processes dialogues from the MultiWOZ dataset. It reads JSON files containing dialogue data, extracts and formats dialogue information including service details, and writes the processed information to a new JSON file.
+
+### 💻 Usage
+
+```bash
+python extract_domain_knowledge.py [--quantity <number or 'FULL'>]
+```
+
+| Parameter | Description |
+|-----------|-------------|
+| `--quantity`, `-q` | Number of dialogues to process. Use 'FULL' to process all dialogues, or an integer for a specific number. Max: 8192. |
+
+### 📊 Outputs
+
+- **Processed Dialogue Information**: The script processes dialogues and generates detailed information for each, including the dialogue ID, the complete dialogue text, and service details.
+- **JSON Output File**: The output is a JSON file named `domain_knowledge_<quantity>.json`, where `<quantity>` is either the specified number of dialogues or 'ALL' for all dialogues.
+- **File Location**: The JSON file is saved to `../../data/own_data/one-shot/domain_knowledge`.
+- **Data Content**: Each entry in the output file includes the dialogue ID, the full dialogue, and structured information about the services involved, such as actions and preferences related to each service. A sketch of a possible entry layout follows this list.
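+
+The sketch below illustrates what a single domain-knowledge entry might look like; every field name here is a hypothetical assumption based on the description above, not necessarily the script's actual keys:
+
+```python
+import json
+
+# Hypothetical structure of one domain-knowledge entry (field names assumed).
+entry = {
+    "dialogue_id": "PMUL0698.json",          # dialogue ID taken over from MultiWOZ
+    "dialogue": "USER: I need a hotel ...",  # complete dialogue text
+    "services": {
+        "hotel": {
+            "actions": ["inform", "request"],   # actions related to the service
+            "preferences": {"area": "centre"},  # user preferences for the service
+        }
+    },
+}
+
+# Entries like this are collected and written to domain_knowledge_<quantity>.json.
+print(json.dumps(entry, indent=2))
+```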
+
+## `generate_dialogues.py`
+
+The script `generate_dialogues.py` generates dialogues based on domain knowledge. It takes the number of dialogues to generate as input and verifies that the corresponding domain knowledge file exists. If the file is not present, the script generates it using the domain knowledge extraction script. Finally, it processes and saves the dialogues in a structured CSV format.
+
+### 💻 Usage
+
+```bash
+python generate_dialogues.py --host <host> --port <port> --num_dialogues <number> [--stream]
+```
+
+| Parameter | Description |
+|-----------|-------------|
+| `--host` | Hostname of the server where the dialogue generation API is running. Default is `localhost`. |
+| `--port` | Port number for the dialogue generation API. Default is `8000`. |
+| `--num_dialogues` | Number of dialogues to generate. Domain knowledge must be available for this number. |
+| `--stream` | Optional. Use this flag to enable streaming of responses from the API. |
+
+### 📊 Outputs
+
+- **CSV File with Dialogues**: The script saves the generated dialogues in a CSV file, structured with dialogue IDs, original dialogues, prompts, domain knowledge, and both separated and joined utterances.
+- **File Location**: The output CSV file is named `output_dialogues_<num_dialogues>.csv` and is stored in the directory `../../data/own_data/one-shot/dialogues`.
+- **Dialogue Structure**: Each dialogue in the CSV file includes detailed information, such as the dialogue ID, the complete original dialogue, the prompt used for generation, structured domain knowledge, and the processed dialogue in both separated and joined formats.
+
+## `generate_annotations.py` with `schema.py`
+
+The script `generate_annotations.py` annotates dialogues using an API and depends on a schema defined in `schema.py`. This schema outlines the structure of annotations based on the MultiWOZ dataset.
+
+### 💻 Usage
+
+```bash
+python generate_annotations.py --host <host> --port <port> --input_name <input_csv> --output_name <output_csv>
+```
+
+| Parameter | Description |
+|-----------|-------------|
+| `--host` | Hostname of the server where the annotation API is running. Default is `localhost`. |
+| `--port` | Port number for the annotation API. Default is `8000`. |
+| `--input_name` | Name of the input CSV file containing the dialogues to be annotated. |
+| `--output_name` | Name of the output CSV file where the annotated dialogues will be saved. |
+
+### 📚 Schema (`schema.py`)
+
+The `schema.py` file defines a `MultiWOZ` schema class for annotations, based on the MultiWOZ dataset. The class includes optional fields for various entities such as hotels, restaurants, taxis, trains, and attractions. Each field can take specific values or types:
+
+- Hotel-related fields (e.g., `hotelArea`, `hotelName`, `hotelStars`)
+- Restaurant-related fields (e.g., `restaurantArea`, `restaurantName`, `restaurantFood`)
+- Taxi-related fields (e.g., `taxiDeparture`, `taxiDestination`, `taxiArriveby`)
+- Train-related fields (e.g., `trainDay`, `trainDeparture`, `trainDestination`)
+- Attraction-related fields (e.g., `attractionArea`, `attractionName`, `attractionType`)
+
+Each field in the schema is optional and includes specific types or predefined literal values to ensure data consistency; a sketch of how such a class might look is shown below.
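+
+The following is a minimal sketch of such a schema class, assuming it is a pydantic model (which outlines can consume for structured generation); the exact fields and literal values in `schema.py` may differ:
+
+```python
+from typing import Literal, Optional
+
+from pydantic import BaseModel
+
+
+class MultiWOZ(BaseModel):
+    # Hotel-related fields; the literal values shown are illustrative.
+    hotelArea: Optional[Literal["north", "south", "east", "west", "centre"]] = None
+    hotelName: Optional[str] = None
+    hotelStars: Optional[int] = None
+    # Restaurant-related fields.
+    restaurantArea: Optional[Literal["north", "south", "east", "west", "centre"]] = None
+    restaurantName: Optional[str] = None
+    restaurantFood: Optional[str] = None
+    # Taxi-related fields.
+    taxiDeparture: Optional[str] = None
+    taxiDestination: Optional[str] = None
+    taxiArriveby: Optional[str] = None
+```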
+
+### 📊 Outputs
+
+- **Annotated Dialogues CSV**: The script generates annotations for each dialogue in the input CSV and saves the results in the output CSV file.
+- **Output File Location**: The annotated dialogues are saved to `../../data/own_data/one-shot/annotations/<output_name>.csv`.
+- **Annotation Details**: Each dialogue in the output file is annotated with original and generated annotations, saved in separate columns.
+- **Progress Tracking**: The script provides progress updates and saves checkpoints every 50 iterations to ensure data is not lost in case of interruption. A sketch of this checkpointing pattern is shown below.
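+
+The checkpointing behaviour could look roughly like the following; this is an illustrative sketch rather than the actual implementation, and the `annotate` helper, the file names, and the column names are all assumptions:
+
+```python
+import pandas as pd
+
+
+def annotate(dialogue: str) -> str:
+    """Placeholder for the API call that returns an annotation string."""
+    return "{}"
+
+
+df = pd.read_csv("input_dialogues.csv")
+generated = []
+
+for i, dialogue in enumerate(df["dialogue"], start=1):
+    generated.append(annotate(dialogue))
+    # Save a checkpoint every 50 iterations so progress survives interruptions.
+    if i % 50 == 0:
+        checkpoint = df.iloc[:i].assign(generated_annotation=generated)
+        checkpoint.to_csv("annotations_checkpoint.csv", index=False)
+        print(f"Checkpoint saved after {i} dialogues")
+
+df["generated_annotation"] = generated
+df.to_csv("annotations_final.csv", index=False)
+```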