# Evaluation of generative AI using human judgment and automatic metrics
## Project Description
Three models were used to generate texts: GPT2, OPT and GPT4o. Texts were generated in three categories: poems, science-related topics and sport summaries. Similar prompts were used for all models.
In a survey, these texts were compared to human-written texts. The poems were taken from the PoetryFoundation dataset, the science-related texts from Wikipedia and the sport summaries from BBC Sport.
Participants were asked to identify the text generated by the LLM and to rate both the human-written and the LLM-generated texts on four criteria: Coherence, Conciseness, Creativity and Clarity of Concept. Creativity was rated only for the poems and Clarity of Concept only for the science-related texts.
The automatic metrics used were Flesch Reading Ease (FRE), pointwise mutual information (PMI), TF-IDF and type-token ratio (TTR).
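
To illustrate what two of these metrics measure, here is a minimal Python sketch of TTR and FRE. It is not the code from the "src" folder: the function names, the regex tokenizer and the vowel-group syllable counter are assumptions made for this example, and a dedicated readability library would count syllables more accurately.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """TTR: number of distinct words divided by total number of words."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def count_syllables(word):
    """Naive syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """FRE = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


if __name__ == "__main__":
    sample = "The cat sat on the mat."
    print("TTR:", type_token_ratio(sample))
    print("FRE:", flesch_reading_ease(sample))
```

A higher TTR indicates a more varied vocabulary, and a higher FRE score indicates text that is easier to read. TTR depends on text length, so it is most meaningful when the compared texts are of similar length.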
## Folder Descriptions
The "Data" folder contains .txt files with the outputs created by the models used for this project, together with the prompts, as well as the extracted human texts. Inside the .txt files, some lines are marked with an "X". This marks the text that was used in the survey.
The "Results" folder contains the unprocessed survey data as a .csv file and the processed data in the form of .png files.
The "src" folder contains all the code used for this project.