Commit 0810b7fe authored by chrysanthopoulou

Add some metadata calculations

parent f9fe301a
Branches master
# ignore venv
stylo_venv/
fanfic_venv/
mic_venv/
\ No newline at end of file
mic_venv/
working_venv/
venv/
\ No newline at end of file
Source diff could not be displayed: it is stored in LFS. Options to address this: view the blob.
#!/bin/bash
#
#SBATCH --job-name=metadata
#SBATCH --output=metadata.out
#SBATCH --time=20:00
#SBATCH --mem=64000mb
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=single
#SBATCH --mail-user=chrysanthopoulou@cl.uni-heidelberg.de
#SBATCH --mail-type=ALL
python3 code/acquire_metadata.py
\ No newline at end of file
from general_funcs_for_stylo import stylo_funcs
import os

universes = ['call_me_by_your_name', 'cosmere', 'divergent', 'grishaverse', 'maze_runner', 'murderbot', 'percy', 'red_white_royal_blue', 'school_for_good_and_evil', 'simonverse', 'song_of_achilles', 'throne_of_glass']
#universes = ['call_me_by_your_name']

# Total number of cleaned tokens across the canon works of each universe.
canon_tokens = {}
for universe in universes:
    canon_dir = f"universes/{universe}/data/canon_works"
    num_tokens_texts = 0
    for filename in os.listdir(canon_dir):
        # Read one canon work and add its cleaned token count to the running total.
        with open(os.path.join(canon_dir, filename), "r", encoding='utf-8') as f:
            text = f.read()
        clean_tokens = stylo_funcs.tokenize_and_clean_text(text)
        num_tokens_texts += len(clean_tokens)
    canon_tokens[universe] = num_tokens_texts

# Write the per-universe totals once all universes have been processed.
with open("canon_tokens.txt", "w", encoding='utf-8') as f:
    f.write(str(canon_tokens))
\ No newline at end of file
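The script above stores the counts with `str(canon_tokens)`, which is readable but awkward to load back. A minimal sketch of the same per-universe counting with JSON output follows. It is not part of the commit: the function names, the `canon_tokens.json` file name, and the use of `str.split` as a stand-in tokenizer are assumptions made for illustration (the real cleaning is done by `stylo_funcs.tokenize_and_clean_text`, whose behaviour is not shown here).

```python
import json
import tempfile
from pathlib import Path


def count_tokens_per_universe(root, tokenize):
    """Return {universe: total token count} for each directory under root.

    Assumes the layout used by the script above:
    root/<universe>/data/canon_works/<text files>.
    `tokenize` is any callable mapping a string to a list of tokens.
    """
    counts = {}
    for universe_dir in sorted(Path(root).iterdir()):
        canon_dir = universe_dir / "data" / "canon_works"
        if not canon_dir.is_dir():
            continue  # skip anything that is not a universe directory
        total = 0
        for path in sorted(canon_dir.iterdir()):
            if path.is_file():
                total += len(tokenize(path.read_text(encoding="utf-8")))
        counts[universe_dir.name] = total
    return counts


def save_counts(counts, out_path):
    """Write the counts as JSON so they can be reloaded without parsing a dict repr."""
    Path(out_path).write_text(
        json.dumps(counts, indent=2, sort_keys=True), encoding="utf-8"
    )


def load_counts(path):
    """Read counts previously written by save_counts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Demo on a throwaway directory; str.split is only a stand-in tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    works = Path(tmp) / "universes" / "demo" / "data" / "canon_works"
    works.mkdir(parents=True)
    (works / "a.txt").write_text("one two three", encoding="utf-8")
    (works / "b.txt").write_text("four five", encoding="utf-8")

    demo_counts = count_tokens_per_universe(Path(tmp) / "universes", str.split)
    out_file = Path(tmp) / "canon_tokens.json"
    save_counts(demo_counts, out_file)
    reloaded = load_counts(out_file)
    print(reloaded)
```

Passing the tokenizer as an argument keeps the counting loop independent of the project's own cleaning function, so it could be swapped in without other changes.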
File added
File added
File added
File added
[nltk_data] Downloading package punkt to
[nltk_data] /home/hd/hd_hd/hd_vm255/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/hd/hd_hd/hd_vm255/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
============================= JOB FEEDBACK =============================
NodeName=uc2n405
Job ID: 25088837
Cluster: uc2
User/Group: hd_vm255/hd_hd
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:01:41
CPU Efficiency: 34.35% of 00:04:54 core-walltime
Job Wall-clock time: 00:02:27
Memory Utilized: 153.11 MB
Memory Efficiency: 0.24% of 62.50 GB
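The efficiency figures in the job feedback can be checked by hand. The sketch below recomputes them from the values reported above, assuming that core-walltime is cores times wall-clock time and that the 62.50 GB shown corresponds to the 64000 MB requested via `--mem` (1 GB taken as 1024 MB). How the cluster's feedback tool actually derives these numbers is not stated in the log, so treat the formulas as an assumption that happens to reproduce the reported values.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the job feedback above.
cores = 2
wall_seconds = hms_to_seconds("00:02:27")      # 147 s
cpu_used_seconds = hms_to_seconds("00:01:41")  # 101 s
mem_used_mb = 153.11
mem_requested_mb = 64000                       # from --mem=64000mb

# Assumed formula: core-walltime = cores * wall-clock time.
core_walltime = cores * wall_seconds           # 294 s, i.e. 00:04:54
cpu_efficiency = 100 * cpu_used_seconds / core_walltime
mem_efficiency = 100 * mem_used_mb / mem_requested_mb

print(f"CPU efficiency: {cpu_efficiency:.2f}%")     # 34.35%
print(f"Memory efficiency: {mem_efficiency:.2f}%")  # 0.24%
```

On these numbers the job used about 153 MB of the 64000 MB requested and 2:27 of the 20:00 time limit, so a considerably smaller `--mem` request would likely have been enough for this run.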