This project explores various named entity recognition (NER) approaches, focusing on
named entity classification (NEC).
These methods include:
- Natural language inference (NLI) with T5 (see the sketch below)
- Two different approaches to masked language modeling (MLM) with entity and class masking
- Classification based on Word2Vec embeddings
- LLM prompting
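
To make the first idea concrete: an entity mention can be classified by turning each candidate class into an entailment hypothesis and keeping the class whose hypothesis the NLI model accepts. Below is a minimal sketch of that idea using a generic `t5-base` checkpoint and T5's MNLI prompt format; the model, prompt wording, class set, and decision rule are illustrative assumptions, not the project's actual implementation (see [`/models`](src/models) and the report for that).

```python
# Hypothetical sketch of NEC via NLI with T5; the project's actual prompts,
# checkpoints, and scoring differ and live in src/models.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

sentence = "Angela Merkel gave a speech in Berlin."
entity = "Berlin"
candidate_classes = ["person", "location", "organization"]


def nli_label(premise: str, hypothesis: str) -> str:
    # t5-base was multi-task trained on MNLI; the exact prefix wording here
    # follows the published T5 task format but is an assumption in this context.
    prompt = f"mnli hypothesis: {hypothesis} premise: {premise}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Keep the first class whose hypothesis is judged as entailed; a real
# implementation would compare entailment scores across all classes instead.
for cls in candidate_classes:
    if "entailment" in nli_label(sentence, f"{entity} is a {cls}."):
        print(f"Predicted class for '{entity}': {cls}")
        break
```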
More information about these methods can be found in our project report, which is included in this repository.
## Project Structure
Test cases for implemented models, datasets, and individual approaches (useful for debugging) can be found in [`/tests`](tests).
Models are defined in [`/models`](src/models) and are accessed via the [`common_interface`](src/common_interface.py).
Datasets can be found in [`/data`](data) and are accessed via the [`data manager`](data/data_manager.py).
Scripts for executing code on the Computerlinguistik cluster or BwUniCluster are located in [`/scripts`](scripts).
The experiments conducted as part of this project and some of their results are located in [`/src/experiments`](src/experiments).
## Setup and Requirements
Note: A CUDA-enabled GPU is required to run the finetuning and LLM-based experiments.
1. Ensure you have Python 3.8 or newer installed.
2. Run `pip install -r requirements.txt` to install the necessary dependencies.
3. If you want to use DeepSeek-R1 (via HuggingFace), follow the instructions in [`.env-example`](.env-example) (a rough sketch of how such a token is typically loaded follows this list).
4. Run the desired experiment either locally (see "Running locally") or on the Computerlinguistik cluster or BwUniCluster (see "Running on a cluster"). The required models and datasets will be downloaded and preprocessed automatically.
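
For orientation, token handling along these lines is common in HuggingFace-based projects; the sketch below assumes `python-dotenv` and a variable named `HF_TOKEN`, which may not match this project's actual setup, so defer to [`.env-example`](.env-example) for the real names.

```python
# Rough sketch only; the actual variable name(s) are defined in .env-example,
# and the use of python-dotenv is an assumption here.
import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()                          # read key=value pairs from a local .env file
login(token=os.environ["HF_TOKEN"])    # "HF_TOKEN" is an assumed variable name
```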
## Running locally
For correct module loading, all experiments must be run from the project's root folder.
### Example: Running a test
The following command executes the "test_NEC" test, which tests all implemented models with an example sentence. The required models will be downloaded automatically.
`python3 -m src.tests.test_NEC`
### Example: Using a finetuned model
1. First, start the finetuning for the desired model using one of the experiments under [`/src/experiments/finetune_T5`](src/experiments/finetune_T5/).
2. Model checkpoints will be saved in a subfolder under [`/models`](src/models), depending on the model.
3. Modify the last line of the model implementation in [`/models`](src/models) to load the desired checkpoint instead of the base model (see the comment in the source and the sketch below).
4. Running the desired experiment will now use the finetuned model.
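
For illustration, swapping the base model for a checkpoint usually only changes the path passed to `from_pretrained`. The model class and checkpoint path below are made-up examples; the exact line to change is marked by a comment in the model source.

```python
# Sketch only; the actual model class and checkpoint path depend on the model
# and finetuning run (the path below is a hypothetical example).
from transformers import T5ForConditionalGeneration

# Before: the last line of the model implementation loads the base model.
# model = T5ForConditionalGeneration.from_pretrained("t5-base")

# After: point from_pretrained at the saved checkpoint directory instead.
model = T5ForConditionalGeneration.from_pretrained(
    "src/models/T5/checkpoint-1000"  # hypothetical example path
)
```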
## Running on a cluster
Slurm scripts are provided for most of the experiments and tests in the [`/scripts`](scripts) folder.
The scripts ending in "_cl" are intended for execution on the Computerlinguistik cluster and the scripts ending in "_bwuni" are intended for execution on the BwUniCluster.
The scripts must be executed from the project's root folder.
### Example command
The following command submits the "NEC_evaluation" experiment, which computes NEC prediction accuracies for all implemented model and dataset combinations, to the Computerlinguistik cluster:
`sbatch scripts/NEC_eval_cl.sh`