Allzweckmesser-master/.gitignore
# Project-specific files
azm.db
morpheus
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
### Python Patch ###
.venv/
### Python.VirtualEnv Stack ###
# Virtualenv
# http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
pyvenv.cfg
pip-selfcheck.json
# End of https://www.gitignore.io/api/python
# Compiled source #
*.com
*.class
*.dll
*.exe
*.o
*.so
# Logs and databases #
*.log
*.sql
*.sqlite
# OS generated files #
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
# Editor swap and backup files #
*.swp
*~
Allzweckmesser-master/LICENSE
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
Allzweckmesser
Copyright (C) 2018 Messerschleifer
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
{signature of Ty Coon}, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.
Allzweckmesser-master/README.md
# Allzweckmesser
A Tool for Measuring Latin Verse
# Installation
For development, install the package with the `--editable` option. If
you only want to use the package, you can omit this option.
```
pip install --editable /path/to/Allzweckmesser # The repo dir, not the package dir
```
## Test installation
If the package was installed correctly, you should be able to get
usage information when calling the package with the `-h` option:
```
python -m allzweckmesser -h
```
## Running Tests
When you’re in the repository root, you can run the unit tests like this:
```
python setup.py test
```
A better way is to install `pytest` and `pytest-env` and then run the
tests using the `pytest` executable directly:
```
pytest
```
The advantage of this is that the test requirements (`pytest` and
`pytest-env`) don’t have to be downloaded every time the tests are
executed.
# Notes
* Gladius by parkjisun from the Noun Project
## Interface
- text
- u/v correct or unknown
- Old Latin
- set of allowed meters
- author
Allzweckmesser-master/allzweckmesser/__init__.py
from . import (config, corpus, db, features, meters, model, scan, scanner,
wordlist)
Allzweckmesser-master/allzweckmesser/__main__.py
from . import scan
scan.main()
Allzweckmesser-master/allzweckmesser/config.py
import os
ROOT = os.path.dirname(os.path.dirname(__file__))
DATABASE = {
'dialect': 'sqlite',
'file': os.path.join(ROOT, 'azm.db')
}
MACRONS_FILE = os.environ.get('MACRONS_FILE',
os.path.join(ROOT, 'macrons.txt'))
MORPHEUS_DIR = os.environ.get('MORPHEUS_DIR',
os.path.join(ROOT, 'morpheus'))
MODE = os.environ.get('AZM_MODE', 'run')
POPULATE_DATABASE = os.environ.get('AZM_POPULATE_DATABASE', False)
RANKING_MODEL_PATH = os.environ.get('RANKING_MODEL_PATH',
os.path.join(ROOT,
'tree_classifier.joblib'))
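# Usage note (illustrative values, not part of the configuration): the
# defaults above can be overridden through the environment before the
# package is imported, e.g.
#   MACRONS_FILE=/data/macrons.txt AZM_MODE=run python -m allzweckmesser -h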
Allzweckmesser-master/allzweckmesser/corpus.py
# -*- coding: utf-8 -*-
import logging
import re
import os.path
import sys
import traceback
from bs4 import BeautifulSoup
from unidecode import unidecode
from .model import Reading, Syllable, Token, Verse
BASE_HTML = """
"""
def get_reading_from_line_element(element):
tokens = []
span_begin = 0
idx = 0
for token_tag in element.find_all(name='span', class_='word'):
syllables = []
token_text = token_tag.text
token = Token(
token=unidecode(token_text),
span=[span_begin, span_begin + len(token_text)]
)
for syllable_tag in token_tag.find_all(name='span', class_='syll'):
syllable_text = syllable_tag.text
if 'long' in syllable_tag.attrs['class']:
syllable_length = 2
elif 'short' in syllable_tag.attrs['class']:
syllable_length = 1
elif 'elided' in syllable_tag.attrs['class']:
syllable_length = 0
else:
raise ValueError(
'Could not determine syllable length of syllable {!r}'
.format(syllable_tag)
)
syllable = Syllable(
idx=idx,
syllable=unidecode(syllable_text),
span=[span_begin, span_begin + len(syllable_text)],
syllable_length=syllable_length,
vowel_length=None
)
idx += 1
syllables.append(syllable)
span_begin += len(syllable_text)
# The + 1 is for simulating a space between tokens.
span_begin += 1
token.syllables = syllables
tokens.append(token)
return Reading(tokens=tokens)
def separate_punctuation(tokens):
i = 0
while i < len(tokens):
token = tokens[i]
        m = re.match(r'^(?P<pre_punct>[\W_]*)(?P<non_punct>\w*)'
                     r'(?P<post_punct>[\W_]*)$',
token.text)
if m:
pre = m.group('pre_punct')
post = m.group('post_punct')
# Create tokens for the punctuation before a token.
span_begin = token.span[0]
for c in pre:
tokens.insert(i, Token(c, [span_begin, span_begin + 1]))
span_begin += 1
i += 1
# Create tokens for the punctuation after a token.
span_begin = token.span[1] - len(post)
for c in m.group('post_punct'):
tokens.insert(i + 1,
Token(c, [span_begin, span_begin + 1]))
span_begin += 1
i += 1
# Remove the punctuation from the original token and
# from its syllables.
token.text = m.group('non_punct')
span_begin = token.span[0] + m.start('non_punct')
span_end = token.span[1] - len(post)
token.span = [span_begin, span_end]
if pre:
token.syllables[0].text = token.syllables[0].text[len(pre):]
token.syllables[0].span[0] = span_begin
if post:
token.syllables[-1].text = (token.syllables[-1].
text[:-len(post)])
token.syllables[-1].span[1] = span_end
else:
            logging.warning('{!r} does not match the punctuation regex.'
                            .format(token))
i += 1
return tokens
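# Worked example (token contents assumed for illustration): a token
# '"arma,' with span [0, 6] is split into the three tokens
#   '"' [0, 1], 'arma' [1, 5] and ',' [5, 6],
# and the first and last syllable spans of 'arma' are trimmed accordingly.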
def reconstruct_verse_text_from_reading(reading):
try:
codepoints = [' ' for _ in range(reading.tokens[-1].span[1])]
for token in reading.tokens:
codepoints[token.span[0]:token.span[1]] = token.text
except Exception:
print('ERROR reconstructing verse from reading {!r}'
.format(reading), file=sys.stderr)
traceback.print_exc()
codepoints = []
return ''.join(codepoints)
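# Example: two tokens 'arma' [0, 4] and 'virum' [5, 10] reconstruct to
# 'arma virum'; index 4, covered by no token span, stays a space.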
class HypotacticLine:
def __init__(self, element):
self.element = element
reading = get_reading_from_line_element(element)
reading.tokens = separate_punctuation(reading.tokens)
text = reconstruct_verse_text_from_reading(reading)
self.verse = Verse(verse=text, readings=[reading])
class HypotacticDocument:
def __init__(self, file_path, parser='lxml'):
with open(file_path) as f:
try:
self.root = BeautifulSoup(f, parser)
self.title = self.root.title.text
except Exception as e:
print('Exception {!r} when parsing file {!r}'
.format(e, file_path))
self.title = None
def get_poems(self, filters=tuple()):
yield from (
poem
for poem in self.root.find_all(name='div', class_='poem')
if all(fil(poem) for fil in filters)
)
def get_lines(self, line_filters=tuple()):
yield from (
line
for line in self.root.find_all(name='div', class_='line')
if all(fil(line) for fil in line_filters)
)
def get_lines_with_meter(self, meters):
filters = [lambda tag: any((meter in tag.attrs['class'])
for meter in meters)]
if self.root.find(name='div', class_='poem'):
yield from (
line
for poem in self.get_poems(filters)
for line in poem.find_all(name='div', class_='line')
)
        else:
            yield from self.get_lines(filters)
class HypotacticCorpus:
def __init__(self, file_paths, parser='lxml'):
self.file_paths = file_paths
self.parser = parser
self.documents = [HypotacticDocument(p, parser=parser)
for p in file_paths]
@classmethod
def from_directory(cls, directory, *args, **kwargs):
file_paths = [os.path.abspath(os.path.join(directory, basename))
for basename in os.listdir(directory)]
return cls(file_paths, *args, **kwargs)
def get_poems(self, filters=tuple()):
yield from (
poem
for doc in self.documents
for poem in doc.get_poems(filters)
)
def get_lines(self, line_filters=tuple()):
yield from (
line
for doc in self.documents
for line in doc.get_lines(line_filters)
)
def get_lines_with_meter(self, meters):
yield from (
line
for doc in self.documents
for line in doc.get_lines_with_meter(meters)
)
def save_html_tags(self, file_handle, tags, title='Saved Poems',
base_html=BASE_HTML, pretty=False):
soup = BeautifulSoup(base_html, self.parser)
title_tag = soup.new_tag('title')
title_tag.string = title
soup.find(name='head').append(title_tag)
latin = soup.new_tag('div')
latin.attrs['class'] = 'latin'
for tag in tags:
latin.append(tag)
soup.find(name='body').append(latin)
if pretty:
output = soup.prettify()
else:
output = str(soup)
file_handle.write(output)
Allzweckmesser-master/allzweckmesser/db.py
# -*- coding: utf-8 -*-
from .config import DATABASE as db_config
from .config import MODE
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
def get_db_uri(db_config: dict):
"""Get a URI for connecting to a database.
This function supports SQLite, PostgreSQL, MySQL and probably some
other server-based RDBMS.
:param db_config: A dict containing what is necessary for creating
the URI.
"""
if db_config['dialect'] == 'sqlite':
uri_pattern = '{db[dialect]}:///{db[file]}'
else:
if 'port' in db_config:
uri_pattern = ('{db[dialect]}://{db[user]}:{db[password]}'
'@{db[host]}:{db[port]}/{db[database]}')
else:
uri_pattern = ('{db[dialect]}://{db[user]}:{db[password]}'
'@{db[host]}/{db[database]}')
uri = uri_pattern.format(db=db_config)
return uri
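# Illustrative values (assumed for demonstration):
#   get_db_uri({'dialect': 'sqlite', 'file': '/tmp/azm.db'})
#   -> 'sqlite:////tmp/azm.db'
#   get_db_uri({'dialect': 'postgresql', 'user': 'azm', 'password': 'secret',
#               'host': 'localhost', 'port': 5432, 'database': 'azm'})
#   -> 'postgresql://azm:secret@localhost:5432/azm'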
BASE = declarative_base()
class FormAnalysis(BASE):
"""A FormAnalysis holds represents an analysis produced by Morpheus."""
__tablename__ = 'FormAnalysis'
id = Column(Integer, primary_key=True, autoincrement=True)
form = Column(String(50), nullable=False, index=True)
# LDT tags have length 9, PROIEL tags have length 12
morphtag = Column(String(12), nullable=True)
lemma = Column(String(50), nullable=True)
accented = Column(String(60), nullable=True)
# Ideally, this table would have a joined UNIQUE constraint for
# form, morphtag, lemma and accented, but this would make
# insertions cumbersome. It’s easier to insert everything and to
# delete the duplicates afterwards. See the WordList object for
# how that is done.
def __repr__(self):
        return ('<FormAnalysis form={f.form!r} morphtag={f.morphtag!r}'
                ' lemma={f.lemma!r} accented={f.accented!r}>').format(f=self)
def __str__(self):
return repr(self)
def __eq__(self, other):
return (isinstance(other, FormAnalysis)
and ((self.form, self.morphtag, self.lemma, self.accented)
== (other.form, other.morphtag,
other.lemma, other.accented)))
def __hash__(self):
return hash((self.form, self.morphtag, self.lemma, self.accented))
ENGINE = create_engine(get_db_uri(db_config))
SESSION_FACTORY = sessionmaker(bind=ENGINE)
if 'run' in MODE:
BASE.metadata.create_all(ENGINE)
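# Typical use (sketch): obtain a session from the factory and query the
# stored analyses, e.g.
#   session = SESSION_FACTORY()
#   analyses = session.query(FormAnalysis).filter_by(form='arma').all()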
Allzweckmesser-master/allzweckmesser/dev.py
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import argparse
import json
import random
import sys
import traceback
from typing import List
from unidecode import unidecode
from .style import mark_correct
from .model import Verse
from .scanner import Scanner
def dev(reference_verses, number=10, randomize=False) -> List[Verse]:
"""Scan verses and compare them with their correct reference version."""
scanner = Scanner()
all_analyses = []
correct = 0
errors = 0
if randomize:
sample = random.sample(reference_verses, number)
else:
sample = reference_verses[:number]
for ref in sample:
ref_reading = ref.readings[0]
try:
analysis = scanner.scan_verses([unidecode(ref.text)])[0]
except Exception:
errors += 1
print('ERROR at verse {}'.format(ref.text),
file=sys.stderr)
traceback.print_exc()
continue
all_analyses.append(analysis)
correct_schema = ref_reading.get_schema()
analysis_correctnesses = [r.get_schema() == correct_schema
for r in analysis.readings]
this_correct = any(analysis_correctnesses)
if this_correct:
correct += 1
print('{ref} ({n} readings)'
.format(ref=mark_correct(ref_reading),
n=len(analysis.readings)))
else:
print('{ref} ({n} readings)'
.format(ref=ref_reading, n=len(analysis.readings)))
for reading in analysis.readings:
print(' {}'.format(reading.format_differences(ref_reading)))
print('Correct: {}/{} ({:.2f})\n{} program errors'
.format(correct, len(sample),
correct / len(sample),
errors))
return all_analyses
def parse_args() -> argparse.Namespace:
"""Parse arguments from the commandline.
:return: An argparse Namespace holding the arguments.
"""
d = 'Identify errors in verse parsing.'
parser = argparse.ArgumentParser(prog='allzweckmesser', description=d)
parser.add_argument('infile', help=('A JSON file containing verses'
' with one reading each.'))
parser.add_argument('--number', '-n', default=10, type=int,
help='Number of verses to analyze')
parser.add_argument('--randomize', '-r', default=False,
action='store_true',
help=('Randomize what verses are analyzed. If this is'
' not set, the first {number} verses are'
' analyzed.'))
args = parser.parse_args()
return args
def main():
"""Parse CLI arguments then read and scan verses."""
args = vars(parse_args())
args['reference_verses'] = [Verse.from_json(verse)
for verse
in json.load(open(args['infile']))]
del args['infile']
verse_analyses = dev(**args)
if __name__ == '__main__':
main()
Allzweckmesser-master/allzweckmesser/emenda.txt
Iulus is analyzed in Morpheus as Iu_lus, Julus. It must, however, also be analyzable as I^u_lus, Iulus.
Allzweckmesser-master/allzweckmesser/features.py
# -*- coding: utf-8 -*-
from enum import Enum
class ReadingFeature(Enum):
MCL_TRIGGERS_PL = 0
SYNIZESIS = 1
S_ELISION = 2
HIAT = 3
class ReadingMeterFeatures(Enum):
DOES_NOT_FIT_METER = 10
NECESSARY_CHANGES_TO_MAKE_IT_FIT = 11
METER = 12
NO_USUAL_BREAK_PRESENT = 13
HEXAMETER_BRIDGE_VIOLATED = 14
class CombinedFeatures(Enum):
# The values have to start at 0 and be contiguous.
MCL_TRIGGERS_PL = 0
SYNIZESIS = 1
DOES_NOT_FIT_METER = 2
NO_USUAL_BREAK_PRESENT = 3
METER_RULES_VIOLATED = 4
def combine_features(reading_features, reading_meter_features):
features = [0 for _ in CombinedFeatures]
for rf, val in reading_features.items():
if hasattr(CombinedFeatures, rf.name):
features[CombinedFeatures[rf.name].value] = val
meter_rules_violated = 0
for rmf, val in reading_meter_features.items():
if hasattr(CombinedFeatures, rmf.name):
features[CombinedFeatures[rmf.name].value] = val
elif 'VIOLATED' in rmf.name.upper():
meter_rules_violated += 1
features[
CombinedFeatures.METER_RULES_VIOLATED.value] = meter_rules_violated
return features
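# Worked example (feature values assumed): combining
#   reading_features = {ReadingFeature.SYNIZESIS: 1}
#   reading_meter_features = {
#       ReadingMeterFeatures.DOES_NOT_FIT_METER: 1,
#       ReadingMeterFeatures.HEXAMETER_BRIDGE_VIOLATED: 1}
# yields [0, 1, 1, 0, 1]: SYNIZESIS and DOES_NOT_FIT_METER are copied over
# by name, and the bridge violation is tallied into METER_RULES_VIOLATED.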
Allzweckmesser-master/allzweckmesser/import.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import json
import sys
from .model import *
def parse_args() -> argparse.Namespace:
"""Parse arguments from the commandline.
:return: An argparse Namespace holding the arguments.
"""
d = 'Import annotated verses from .json file.'
parser = argparse.ArgumentParser(prog='allzweckmesser', description=d)
parser.add_argument('--json',
help='A file containing the verses that are to be imported.')
args = parser.parse_args()
return args
def json_from_file(filename: str) -> list:
    """Convert a JSON file to a Python object.

    :return: The content of the JSON file as a Python object.
    """
with open(filename, 'r') as json_file:
return json.loads(json_file.read())
if __name__ == '__main__':
args = parse_args()
Allzweckmesser-master/allzweckmesser/meters.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import itertools
import re
from .features import ReadingMeterFeatures
from .model import Reading, Position
def bridge(position_spec, feature):
def get_feature(meter: Meter, reading: Reading):
        position = Position.after(position_spec[0], reading,
                                  position_spec[1], meter)
if position and position.word_boundary:
return None
else:
return feature
return get_feature
class Meter:
def __init__(self, name: str, schema: str, breaks: list = None,
conditions: list = None, short_name: str = None,
id: int = None):
self.name = name
self.schema = schema
self.break_specs = breaks
# Convert condition functions to instance-bound methods.
self.conditions = ([cond.__get__(self) for cond in conditions]
if conditions else [])
self.short_name = short_name
self.id = id
def match_reading(self, reading: Reading):
return re.match(self.schema, reading.get_schema())
def collect_condition_features(self, reading: Reading):
features = []
for cond in self.conditions:
feature = cond(reading)
if feature:
features.append(feature)
return features
def reading_has_usual_breaks(self, reading: Reading):
if self.break_specs:
for breaks in self.break_specs:
satisfied = True
for b in breaks:
position = Position.after(b[0], reading, b[1], self)
if not (hasattr(position, 'word_boundary')
and position.word_boundary):
satisfied = False
break
if satisfied:
return True
else:
return False
else:
return True
AEOLIC_BASE = r'(?:(–)(–)|(–)(⏑)|(⏑)(–))'
ALL_METERS = {
'hexameter': Meter(
'Catalectic Dactylic Hexameter',
r'(–)(⏑⏑|–)(–)(⏑⏑|–)(–)(⏑⏑|–)(–)(⏑⏑|–)(–)(⏑⏑|–)(⏑|–)',
conditions=[
bridge(('mora', 15, 'Hermann’s Bridge'),
ReadingMeterFeatures.HEXAMETER_BRIDGE_VIOLATED)
],
breaks=[
[('mora', 6, 'Trithemimeral'), ('mora', 14, 'Hephthemimeral')],
[('mora', 10, 'Penthemimeral')],
[('mora', 16, 'Bucolic Diaeresis')]
],
short_name='hexameter',
id=0
),
'pentameter': Meter(
'Dactylic Pentameter',
r'(–)(⏑⏑|–)(–)(⏑⏑|–)(–)(–)(⏑⏑)(–)(⏑⏑)(⏑|–)',
breaks=[[('mora', 5, 'Middle diaeresis')]],
short_name='pentameter',
id=1
),
'ia6': Meter(
'Iambic Trimeter',
r'(⏑|⏑⏑|–)(⏑⏑|–)(⏑)(⏑⏑|–)(⏑|⏑⏑|–)(⏑⏑|–)(⏑)(⏑⏑|–)(⏑|⏑⏑|–)(⏑⏑|–)(⏑)(⏑|–)',
breaks=[
[('element', 4, 'After fourth element')],
[('element', 8, 'After eighth element')]
],
short_name='ia6',
id=2
),
'senarii': Meter(
'Iambic Senarius',
r'(⏑|⏑⏑|–)(⏑⏑|–)(⏑|⏑⏑|–)(⏑⏑|–)(⏑|⏑⏑|–)(⏑⏑|–)(⏑|⏑⏑|–)(⏑⏑|–)(⏑|⏑⏑|–)(⏑⏑|–)(⏑)(⏑|–)',
short_name='senarii',
id=3
),
'sap hen': Meter(
'Sapphic Hendecasyllable',
r'(–)(–|⏑)(–)(–|⏑)(–)(⏑)(⏑)(–)(⏑)(–)(⏑|–)',
conditions={},
short_name='sap hen',
id=4
),
'adoneus': Meter(
'Adoneus',
r'(–)(⏑⏑)(–)(⏑|–)',
short_name='adoneus',
id=5
),
'hendecasyllables': Meter(
'Phalaecian Hendecasyllable',
AEOLIC_BASE + r'(–)(⏑)(⏑)(–)(⏑)(–)(⏑)(–)(⏑|–)',
breaks=[[('syllable', 6, 'After sixth syllable')]],
short_name='hendecasyllables',
id=6
),
'scazon': Meter(
'Choliamb',
r'(⏑|–)(–)(⏑)(–)(⏑|–)(–)(⏑)(–)(⏑)(–)(–)(⏑|–)',
breaks=[[('syllable', 5, 'After fifth syllable')],
[('syllable', 7, 'After seventh syllable')]],
short_name='scazon',
id=7
),
}
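# Note on matching: Meter.match_reading anchors the schema regex at the
# start of a reading's schema via re.match, one capture group per metrical
# element.  For example, the fully spondaic schema '–––––––––––' matches
# the 'hexameter' pattern above, each '(⏑⏑|–)' biceps consuming a single
# longum.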
ALL_METER_NAMES = list(ALL_METERS.keys())
def get_reading_meter_combinations(readings,
                                   meters=tuple(ALL_METERS.values())):
reading_meter_rmfeatures = [
[reading, meter, {}]
for reading, meter
in itertools.product(readings, meters)
]
for reading, meter, rmfeatures in reading_meter_rmfeatures:
rmfeatures[ReadingMeterFeatures.DOES_NOT_FIT_METER] = int(
meter.match_reading(reading) is None)
# XXX: Implement this.
rmfeatures[ReadingMeterFeatures.NECESSARY_CHANGES_TO_MAKE_IT_FIT] = 0
rmfeatures[ReadingMeterFeatures.METER] = meter.id
        rmfeatures[ReadingMeterFeatures.NO_USUAL_BREAK_PRESENT] = int(
            not meter.reading_has_usual_breaks(reading))
for feature in meter.collect_condition_features(reading):
rmfeatures[feature] = 1
return reading_meter_rmfeatures
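# Sketch of intended use (inputs assumed): given readings from the scanner,
#   combos = get_reading_meter_combinations(readings,
#                                           [ALL_METERS['hexameter']])
# yields [reading, meter, rmfeatures] triples whose feature dicts record
# meter fit, usual breaks and any violated bridge conditions.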
Allzweckmesser-master/allzweckmesser/model.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from collections import defaultdict
import itertools
import json
import os
import re
from typing import Dict, List, Set
from .style import (mark_long, mark_wrong_length, mark_wrong_syllables,
mark_syllables_provider)
def check_format(json_file, check_for=dict):
if isinstance(json_file, check_for):
return json_file
elif isinstance(json_file, str):
if os.path.exists(json_file):
with open(json_file, 'r') as jf:
return json.load(jf)
else:
return json.loads(json_file)
else:
raise TypeError('Input not convertible.')
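# Example: check_format accepts a dict (returned unchanged), a path to an
# existing JSON file (opened and parsed) or a raw JSON string (parsed),
# e.g. check_format('{"verse": "arma"}') -> {'verse': 'arma'}.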
def from_json(json_file):
if hasattr(json_file, 'read'):
verses = json.loads(json_file.read())
elif isinstance(json_file, str) and os.path.exists(json_file):
verses = json.loads(open(json_file).read())
    else:
        raise TypeError('Input not convertible.')
return [Verse.from_json(verse) for verse in verses]
def minimal(full_dict: dict):
    """Recursively drop empty dicts and None values from full_dict."""
    result_dict = dict()
    for key, value in full_dict.items():
        if value == {}:
            pass
        elif isinstance(value, dict):
            result_dict.update({key: minimal(value)})
        elif value is not None:
            result_dict.update({key: value})
    return result_dict
class Syllable:
def __init__(self, syllable: str, span: List[int], idx: int = None,
syllable_length: int = 1, vowel_length: int = 1,
phenomena: dict = None):
if len(syllable) != span[1] - span[0]:
raise ValueError('Syllable length does not match syllable span.')
else:
self.text = syllable
self.span = span
self.id = idx if idx is not None else None
self.syllable_length = syllable_length
self.vowel_length = vowel_length
self.phenomena = phenomena or dict()
@classmethod
def from_json(cls, json_file):
raw = check_format(json_file)
idx = raw['id'] if 'id' in raw else 0
span = raw['span']
text = raw['syllable']
syllable_length = raw.get('syllable_length')
vowel_length = raw.get('vowel_length')
syllable = cls(text, span, idx, syllable_length, vowel_length)
if 'phenomena' in raw:
syllable.phenomena = dict()
for phenomenon in raw['phenomena'].items():
syllable.phenomena[phenomenon[0]] = Phenomenon.from_json(phenomenon[1])
return syllable
def to_dict(self):
        features = dict()
        features.update({'id': self.id})
        features.update({'span': self.span})
        features.update({'syllable': self.text})
        features.update({'syllable_length': self.syllable_length})
        features.update({'vowel_length': self.vowel_length})
        features.update({'phenomena': minimal(
            {key: value.to_dict() for key, value in self.phenomena.items()})})
        return minimal(features)
def to_json(self):
return json.dumps(self.to_dict())
def __repr__(self):
return (
'Syllable(text={s.text!r}, span={s.span!r}, id={s.id!r},'
' syllable_length={s.syllable_length!r},'
' vowel_length={s.vowel_length!r},'
' phenomena={s.phenomena!r})'
).format(s=self)
def __str__(self):
return self.text
class Phenomenon:
def __init__(self, caused_by=None, overruled_by=None,
chars=None, typus=None, omitted=None):
self.caused_by = caused_by
self.overruled_by = overruled_by
self.chars = chars
self.typus = typus
self.omitted = omitted
#@classmethod
#def positional_lengthening(cls, chars: str, caused_by=None,
#overruled_by=None):
#phenomenon = cls('positional lengthening', caused_by, overruled_by)
#phenomenon.chars = chars
#return phenomenon
#@classmethod
#def iambic_shortening(cls, typus: str, caused_by=None, overruled_by=None):
#phenomenon = cls('iambic shortening', caused_by, overruled_by)
#phenomenon.typus = typus
#return phenomenon
#@classmethod
#def s_elision(cls, caused_by=None, overruled_by=None):
#phenomenon = cls('s-elision', caused_by, overruled_by)
#phenomenon.omitted = 's'
#return phenomenon
#@classmethod
#def verse_end(cls, caused_by=None, overruled_by=None):
#phenomenon = cls('verse end', caused_by, overruled_by)
#return phenomenon
@classmethod
def from_json(cls, json_file):
raw = check_format(json_file)
phenomenon = cls()
if 'caused_by' in raw:
phenomenon.caused_by = raw['caused_by']
if 'overruled_by' in raw:
phenomenon.overruled_by = raw['overruled_by']
if 'chars' in raw:
phenomenon.chars = raw['chars']
if 'typus' in raw:
phenomenon.typus = raw['typus']
if 'omitted' in raw:
phenomenon.omitted = raw['omitted']
return phenomenon
def to_dict(self):
features = dict()
        if self.caused_by is not None:
            features.update({'caused_by': self.caused_by})
        if self.overruled_by is not None:
            features.update({'overruled_by': self.overruled_by})
        if self.chars is not None:
            features.update({'chars': self.chars})
        if self.typus is not None:
            features.update({'typus': self.typus})
        if self.omitted is not None:
            features.update({'omitted': self.omitted})
return minimal(features)
def to_json(self):
return json.dumps(self.to_dict())
def __repr__(self):
return (
'Phenomenon(caused_by={p.caused_by!r},'
' overruled_by={p.overruled_by!r}, chars={p.chars!r},'
' typus={p.typus!r}, omitted={p.omitted!r})'
).format(p=self)
class MultisyllablePhenomenon(Phenomenon):
def __init__(self, beginning:int, end:int, caused_by=None,
overruled_by=None, chars=None, typus=None, omitted=None):
Phenomenon.__init__(self, caused_by, overruled_by,
chars, typus, omitted)
self.beginning = beginning
self.end = end
#def apheresis(self, beginning, end, caused_by=None, overruled_by=None):
#MultisyllablePhenomenon.__init__(self, 'apheresis', beginning, end,
#caused_by, overruled_by)
#def synizesis(self, beginning, end, caused_by=None, overruled_by=None):
#MultisyllablePhenomenon.__init__(self, 'synizesis', beginning, end,
#caused_by, overruled_by)
@classmethod
def from_json(cls, json_file):
raw = check_format(json_file)
beginning = raw['beginning']
end = raw['end']
phenomenon = cls(beginning, end)
if 'caused_by' in raw:
phenomenon.caused_by = raw['caused_by']
if 'overruled_by' in raw:
phenomenon.overruled_by = raw['overruled_by']
if 'chars' in raw:
phenomenon.chars = raw['chars']
if 'typus' in raw:
phenomenon.typus = raw['typus']
if 'omitted' in raw:
phenomenon.omitted = raw['omitted']
return phenomenon
def to_dict(self):
        features = dict()
        features.update({'beginning': self.beginning})
        features.update({'end': self.end})
        if self.caused_by is not None:
            features.update({'caused_by': self.caused_by})
        if self.overruled_by is not None:
            features.update({'overruled_by': self.overruled_by})
        if self.chars is not None:
            features.update({'chars': self.chars})
        if self.typus is not None:
            features.update({'typus': self.typus})
        if self.omitted is not None:
            features.update({'omitted': self.omitted})
return minimal(features)
def to_json(self):
return json.dumps(self.to_dict())
def __repr__(self):
return (
'MultiSyllablePhenomenon(caused_by={p.caused_by!r},'
' overruled_by={p.overruled_by!r}, chars={p.chars!r},'
' typus={p.typus!r}, omitted={p.omitted!r},'
' beginning={p.beginning!r}, end={p.end!r})'
).format(p=self)
class Token:
def __init__(self, token: str, span: List[int],
syllables: List[Syllable] = None, clitic: str = None,
accented: str = None,
lemma_to_morphtags: Dict[str, Set[str]] = None,
syllables_provider=None):
if len(token) != span[1]-span[0]:
raise ValueError('Length of token {} does not match span {}.'
.format(token, span))
else:
self.text = token
self.span = span
self.syllables = syllables or list()
self.clitic = clitic
self.accented = accented
self.lemma_to_morphtags = lemma_to_morphtags
self.syllables_provider = syllables_provider
@classmethod
def from_json(cls, json_file):
raw = check_format(json_file)
text = raw['token']
span = raw['span']
token = cls(text, span)
if 'clitic' in raw:
token.clitic = raw['clitic']
if 'syllables' in raw:
for syllable in raw['syllables']:
token.syllables.append(Syllable.from_json(syllable))
return token
def to_dict(self):
features = dict()
features.update({'token': self.text})
features.update({'span': self.span})
features.update({'clitic': self.clitic})
if self.syllables:
            features.update({'syllables': [syllable.to_dict()
                                           for syllable in self.syllables]})
return minimal(features)
def to_json(self):
return json.dumps(self.to_dict())
def is_punct(self):
        return bool(re.match(r'^[\W_]+$', self.text))
def __repr__(self):
return (
'Token(token={t.text!r}, span={t.span!r},'
' syllables={t.syllables!r}, clitic={t.clitic!r},'
' accented={t.accented!r})'
).format(t=self)
def __str__(self):
return self.text
class Reading:
def __init__(self, tokens: List[Token] = None, phenomena: dict = None,
meter=None):
self.tokens = tokens or list()
self.phenomena = phenomena or dict()
self.features = defaultdict(lambda: 0)
self.meter = meter
@classmethod
def from_json(cls, json_file):
raw = check_format(json_file)
tokens = list()
for token in raw["tokens"]:
tokens.append(Token.from_json(token))
reading = cls(tokens)
if 'phenomena' in raw:
for phenomenon in raw['phenomena'].items():
key, value = phenomenon
for v in value:
if key in reading.phenomena:
reading.phenomena[key].append(MultisyllablePhenomenon.from_json(v))
else:
reading.phenomena[key] = [MultisyllablePhenomenon.from_json(v)]
return reading
def get_schema(self):
schema_list = []
for token in self.tokens:
for syllable in token.syllables:
if syllable.syllable_length == 1:
schema_list.append('⏑')
elif syllable.syllable_length == 2:
schema_list.append('–')
# If length == 0, don’t append a symbol.
return ''.join(schema_list)
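    # Example: syllable lengths [2, 1, 1, 2, 0, 2] yield the schema
    # '–⏑⏑––'; elided syllables (length 0) contribute no symbol.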
def to_dict(self):
features = dict()
features.update({'tokens': [token.to_dict() for token in self.tokens]})
        phenomena = {key: [minimal(v.to_dict()) for v in value]
                     for key, value in self.phenomena.items()}
features.update({'phenomena': phenomena})
return minimal(features)
def to_json(self):
return json.dumps(self.to_dict())
def __len__(self):
return len(self.tokens)
def append_token(self, token: Token):
self.tokens.append(token)
def __repr__(self):
return ('Reading(tokens={r.tokens!r}, phenomena={r.phenomena!r})'
.format(r=self))
def __str__(self):
forms = [
t.accented if t.accented is not None else t.text
for t in self.tokens
]
return ' '.join(forms)
def format_differences(self, reference, mark_long=mark_long,
mark_wrong_length=mark_wrong_length,
mark_wrong_syllables=mark_wrong_syllables,
mark_syllables_provider=mark_syllables_provider,
syllable_joiner='-', token_joiner=' '):
formatted_tokens = []
for token, ref_token in zip(self.tokens, reference.tokens):
formatted_syllables = []
sylls = token.syllables
ref_sylls = ref_token.syllables
if all(syll and ref_syll # TODO: and syll == ref_syll
for syll, ref_syll
in itertools.zip_longest(sylls, ref_sylls)):
for syll, ref_syll in itertools.zip_longest(sylls, ref_sylls):
fsyll = (mark_long(syll)
if syll.syllable_length == 2
else syll.text)
if syll.syllable_length != ref_syll.syllable_length:
fsyll = mark_wrong_length(fsyll)
formatted_syllables.append(fsyll)
else:
formatted_syllables = [mark_wrong_syllables(syll)
for syll in sylls]
if not token.is_punct():
formatted_token = mark_syllables_provider(
syllable_joiner.join(formatted_syllables),
token.syllables_provider
)
formatted_tokens.append(formatted_token)
formatted = token_joiner.join(formatted_tokens)
return formatted
class Verse:
def __init__(self, verse: str, source: dict = None,
readings: List[Reading] = None):
self.text = verse
self.source = source
self.readings = readings or list()
@classmethod
def from_plain_verse(cls, plain_verse):
verse = cls(plain_verse)
# TODO: Generate readings.
pass
return verse
@classmethod
def from_json(cls, json_file):
raw = check_format(json_file)
text = raw['verse']
source = dict()
if 'source' in raw:
source['author'] = raw['source']['author']
source['work'] = raw['source']['work']
source['place'] = raw['source']['place']
verse = cls(text, source=source)
for reading in raw['readings']:
verse.readings.append(Reading.from_json(reading))
return verse
def to_dict(self):
features = dict()
        features.update({'verse': self.text})
        features.update({'source': self.source})
        features.update({'readings': [reading.to_dict()
                                      for reading in self.readings]})
return minimal(features)
def to_json(self):
return json.dumps(self.to_dict())
def __repr__(self):
return (
'Verse(text={v.text!r}, source={v.source!r},'
' readings={v.readings!r})'
).format(v=self)
def __str__(self):
s = 'Verse: {verse}\n{reading_num} Readings:\n{readings}'
readings_str = '\n'.join(str(r) for r in self.readings)
return s.format(verse=self.text, reading_num=len(self.readings),
readings=readings_str)
class Position:
def __init__(self, reading: Reading, mora: int, word_boundary: bool,
token: Token, syllable: Syllable, punctuation: str = None,
meter: '.meters.Meter' = None, element: int = None):
self.reading = reading
self.mora = mora
self.word_boundary = word_boundary
self.token = token
self.syllable = syllable
self.punctuation = punctuation
self.meter = meter
self.element = element
@classmethod
def after_mora(cls, reading: Reading, mora: int) -> 'Position':
morae = 0
punctuation = ''
for token in reading.tokens:
if token.is_punct():
punctuation += token.text
for i, syllable in enumerate(token.syllables):
word_boundary = i == 0
if morae == mora and syllable.syllable_length > 0:
position = cls(
reading=reading, mora=mora, token=token,
syllable=syllable, word_boundary=word_boundary,
punctuation=punctuation
)
return position
else:
morae += syllable.syllable_length
punctuation = ''
else:
# The position has not been found. There are two possibilities:
if morae == mora:
# The position is at the end of the sentence.
position = cls(
reading=reading, mora=mora, token=None, syllable=None,
word_boundary=True, punctuation=punctuation
)
return position
else:
# There is no syllable boundary at the given mora.
return None
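    # Example: for a reading whose first token has syllable lengths
    # [2, 1, 1], after_mora(reading, 2) returns the Position at the second
    # syllable (mora=2, word_boundary=False), while after_mora(reading, 1)
    # returns None, since no syllable boundary falls after one mora.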
@classmethod
def after_element(cls, reading: Reading, meter: '.meters.Meter',
element: int) -> 'Position':
# TODO: Implement this.
pass
@classmethod
def after_syllable(cls, reading: Reading, syll_num: int) -> 'Position':
morae = 0
syllables = 0
punctuation = ''
for token in reading.tokens:
if token.is_punct():
punctuation += token.text
for i, syllable in enumerate(token.syllables):
word_boundary = i == 0
if syllables == syll_num and syllable.syllable_length > 0:
position = cls(
reading=reading, mora=morae, token=token,
syllable=syllable, word_boundary=word_boundary,
punctuation=punctuation
)
return position
else:
morae += syllable.syllable_length
if syllable.syllable_length > 0:
syllables += 1
punctuation = ''
else:
# The position has not been found. There are two possibilities:
if syllables == syll_num:
# The position is at the end of the sentence.
position = cls(
reading=reading, mora=morae, token=None, syllable=None,
word_boundary=True, punctuation=punctuation
)
return position
else:
# There is no syllable boundary at the given syllable.
return None
@classmethod
def after(cls, type: str, reading: Reading, position_number: int,
meter: '.meters.Meter') -> 'Position':
if type == 'mora':
return cls.after_mora(reading, position_number)
elif type == 'syllable':
return cls.after_syllable(reading, position_number)
elif type == 'element':
return cls.after_element(reading, meter, position_number)
else:
raise ValueError(
                'The after type has to be "mora", "syllable" or "element",'
                ' but is {!r}'
.format(type)
)
def __repr__(self):
return 'Position(' + ', '.join(
'{key}={val!r}'.format(key=attr, val=getattr(self, attr))
for attr in ['reading', 'mora', 'word_boundary', 'token',
'syllable', 'punctuation', 'meter', 'element']
) + ')'
Allzweckmesser-master/allzweckmesser/postags.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright 2015 Johan Winge
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see .
from __future__ import unicode_literals
import re
featMap = {}
PART_OF_SPEECH = "pos"
NOUN = "noun"
VERB = "verb"
ADJECTIVE = "adj"
ARTICLE = "article"
PARTICLE = "particle"
ADVERB = "adv"
ADVERBIAL = "adverbial"
CONJUNCTION = "conj"
PREPOSITION = "prep"
PRONOUN = "pron"
NUMERAL = "numeral"
INTERJECTION = "interj"
EXCLAMATION = "exclam"
PUNCTUATION = "punc"
featMap[PART_OF_SPEECH] = [NOUN, VERB, ADJECTIVE, ADVERB, ADVERBIAL, CONJUNCTION,
PREPOSITION, PRONOUN, NUMERAL, INTERJECTION, EXCLAMATION, PUNCTUATION]
PERSON = "person"
FIRST_PERSON = "1st"
SECOND_PERSON = "2nd"
THIRD_PERSON = "3rd"
featMap[PERSON] = [FIRST_PERSON, SECOND_PERSON, THIRD_PERSON]
NUMBER = "number"
SINGULAR = "sg"
PLURAL = "pl"
featMap[NUMBER] = [SINGULAR, PLURAL]
TENSE = "tense"
PRESENT = "pres"
IMPERFECT = "imperf"
PERFECT = "perf"
PLUPERFECT = "plup"
FUTURE_PERFECT = "futperf"
FUTURE = "fut"
featMap[TENSE] = [PRESENT, IMPERFECT, PERFECT, PLUPERFECT, FUTURE_PERFECT, FUTURE]
MOOD = "mood"
INDICATIVE = "ind"
SUBJUNCTIVE = "subj"
INFINITIVE = "inf"
IMPERATIVE = "imperat"
GERUNDIVE = "gerundive"
SUPINE = "supine"
GERUND = "gerund"
PARTICIPLE = "part"
featMap[MOOD] = [INDICATIVE, SUBJUNCTIVE, INFINITIVE, IMPERATIVE, GERUNDIVE,
SUPINE, GERUND, PARTICIPLE]
VOICE = "voice"
ACTIVE = "act"
PASSIVE = "pass"
featMap[VOICE] = [ACTIVE, PASSIVE]
GENDER = "gender"
MASCULINE = "masc"
FEMININE = "fem"
NEUTER = "neut"
featMap[GENDER] = [MASCULINE, FEMININE, NEUTER]
CASE = "case"
NOMINATIVE = "nom"
GENITIVE = "gen"
DATIVE = "dat"
ACCUSATIVE = "acc"
ABLATIVE = "abl"
VOCATIVE = "voc"
LOCATIVE = "loc"
featMap[CASE] = [NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE, ABLATIVE, VOCATIVE, LOCATIVE]
DEGREE = "degree"
POSITIVE = "pos"
COMPARATIVE = "comp"
SUPERLATIVE = "superl"
featMap[DEGREE] = [POSITIVE, COMPARATIVE, SUPERLATIVE]
REGULARITY = "regularity"
REGULAR = "reg"
IRREGULAR = "irreg"
featMap[REGULARITY] = [REGULAR, IRREGULAR]
LEMMA = "lemma"
ACCENTEDFORM = "accentedform"
def ldt_to_parse(ldt_tag):
parse = {}
if ldt_tag[0] == '-':
pass
elif ldt_tag[0] == 'n':
parse[PART_OF_SPEECH] = NOUN
elif ldt_tag[0] == 'v':
parse[PART_OF_SPEECH] = VERB
elif ldt_tag[0] == 't':
# parse[PART_OF_SPEECH] = PARTICIPLE
parse[PART_OF_SPEECH] = VERB
parse[MOOD] = PARTICIPLE
print("Note: 'participle' used as POS")
elif ldt_tag[0] == 'a':
parse[PART_OF_SPEECH] = ADJECTIVE
elif ldt_tag[0] == 'd':
parse[PART_OF_SPEECH] = ADVERB
elif ldt_tag[0] == 'c':
parse[PART_OF_SPEECH] = CONJUNCTION
elif ldt_tag[0] == 'r':
parse[PART_OF_SPEECH] = PREPOSITION
elif ldt_tag[0] == 'p':
parse[PART_OF_SPEECH] = PRONOUN
elif ldt_tag[0] == 'm':
parse[PART_OF_SPEECH] = NUMERAL
elif ldt_tag[0] == 'i':
parse[PART_OF_SPEECH] = INTERJECTION
elif ldt_tag[0] == 'e':
parse[PART_OF_SPEECH] = EXCLAMATION
elif ldt_tag[0] == 'u':
parse[PART_OF_SPEECH] = PUNCTUATION
else:
print("Warning: unknown part of speech:", ldt_tag[0])
if ldt_tag[1] == '-':
pass
elif ldt_tag[1] == '1':
parse[PERSON] = FIRST_PERSON
elif ldt_tag[1] == '2':
parse[PERSON] = SECOND_PERSON
elif ldt_tag[1] == '3':
parse[PERSON] = THIRD_PERSON
else:
print("Warning: unknown person:", ldt_tag[1])
if ldt_tag[2] == '-':
pass
elif ldt_tag[2] == 's':
parse[NUMBER] = SINGULAR
elif ldt_tag[2] == 'p':
parse[NUMBER] = PLURAL
else:
print("Warning: unknown number:", ldt_tag[2])
if ldt_tag[3] == '-':
pass
elif ldt_tag[3] == 'p':
parse[TENSE] = PRESENT
elif ldt_tag[3] == 'i':
parse[TENSE] = IMPERFECT
elif ldt_tag[3] == 'r':
parse[TENSE] = PERFECT
elif ldt_tag[3] == 'l':
parse[TENSE] = PLUPERFECT
elif ldt_tag[3] == 't':
parse[TENSE] = FUTURE_PERFECT
elif ldt_tag[3] == 'f':
parse[TENSE] = FUTURE
else:
print("Warning: unknown tense:", ldt_tag[3])
if ldt_tag[4] == '-':
pass
elif ldt_tag[4] == 'i':
parse[MOOD] = INDICATIVE
elif ldt_tag[4] == 's':
parse[MOOD] = SUBJUNCTIVE
elif ldt_tag[4] == 'n':
parse[MOOD] = INFINITIVE
elif ldt_tag[4] == 'm':
parse[MOOD] = IMPERATIVE
elif ldt_tag[4] == 'p':
parse[MOOD] = PARTICIPLE
elif ldt_tag[4] == 'd':
parse[MOOD] = GERUND
elif ldt_tag[4] == 'g':
parse[MOOD] = GERUNDIVE
elif ldt_tag[4] == 'u':
parse[MOOD] = SUPINE
else:
print("Warning: unknown mood:", ldt_tag[4])
if ldt_tag[5] == '-':
pass
elif ldt_tag[5] == 'a':
parse[VOICE] = ACTIVE
elif ldt_tag[5] == 'p':
parse[VOICE] = PASSIVE
else:
print("Warning: unknown voice:", ldt_tag[5])
if ldt_tag[6] == '-':
pass
elif ldt_tag[6] == 'm':
parse[GENDER] = MASCULINE
elif ldt_tag[6] == 'f':
parse[GENDER] = FEMININE
elif ldt_tag[6] == 'n':
parse[GENDER] = NEUTER
else:
print("Warning: unknown gender:", ldt_tag[6])
if ldt_tag[7] == '-':
pass
elif ldt_tag[7] == 'n':
parse[CASE] = NOMINATIVE
elif ldt_tag[7] == 'g':
parse[CASE] = GENITIVE
elif ldt_tag[7] == 'd':
parse[CASE] = DATIVE
elif ldt_tag[7] == 'a':
parse[CASE] = ACCUSATIVE
elif ldt_tag[7] == 'b':
parse[CASE] = ABLATIVE
elif ldt_tag[7] == 'v':
parse[CASE] = VOCATIVE
elif ldt_tag[7] == 'l':
parse[CASE] = LOCATIVE
else:
print("Warning: unknown case:", ldt_tag[7])
if ldt_tag[8] == '-':
pass
elif ldt_tag[8] == 'c':
parse[DEGREE] = COMPARATIVE
elif ldt_tag[8] == 's':
parse[DEGREE] = SUPERLATIVE
# POSITIVE not in use? (default)
else:
print("Warning: unknown degree:", ldt_tag[8])
return parse
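# Illustrative example (not in the original source): decoding a 9-slot LDT
# tag for a 3rd person singular present indicative active verb:
#   ldt_to_parse('v3spia---')
#   == {'pos': 'verb', 'person': '3rd', 'number': 'sg',
#       'tense': 'pres', 'mood': 'ind', 'voice': 'act'}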
def parse_to_ldt(parse):
ldt_tag = ""
if parse.get(PART_OF_SPEECH, '') == NOUN:
ldt_tag += 'n'
elif parse.get(PART_OF_SPEECH, '') == VERB:
ldt_tag += 'v'
# elif parse.get(PART_OF_SPEECH, '') == PARTICIPLE:
# LDTtag += 't'
elif parse.get(PART_OF_SPEECH, '') == ADJECTIVE:
ldt_tag += 'a'
elif parse.get(PART_OF_SPEECH, '') == ADVERB or parse.get(PART_OF_SPEECH, '') == ADVERBIAL:
ldt_tag += 'd'
elif parse.get(PART_OF_SPEECH, '') == CONJUNCTION:
ldt_tag += 'c'
elif parse.get(PART_OF_SPEECH, '') == PREPOSITION:
ldt_tag += 'r'
elif parse.get(PART_OF_SPEECH, '') == PRONOUN:
ldt_tag += 'p'
elif parse.get(PART_OF_SPEECH, '') == NUMERAL:
ldt_tag += 'm'
elif parse.get(PART_OF_SPEECH, '') == INTERJECTION:
ldt_tag += 'i'
elif parse.get(PART_OF_SPEECH, '') == EXCLAMATION:
ldt_tag += 'e'
elif parse.get(PART_OF_SPEECH, '') == PUNCTUATION:
ldt_tag += 'u'
else:
ldt_tag += '-'
if parse.get(PERSON, '') == FIRST_PERSON:
ldt_tag += '1'
elif parse.get(PERSON, '') == SECOND_PERSON:
ldt_tag += '2'
elif parse.get(PERSON, '') == THIRD_PERSON:
ldt_tag += '3'
else:
ldt_tag += '-'
if parse.get(NUMBER, '') == SINGULAR:
ldt_tag += 's'
elif parse.get(NUMBER, '') == PLURAL:
ldt_tag += 'p'
else:
ldt_tag += '-'
if parse.get(TENSE, '') == PRESENT:
ldt_tag += 'p'
elif parse.get(TENSE, '') == IMPERFECT:
ldt_tag += 'i'
elif parse.get(TENSE, '') == PERFECT:
ldt_tag += 'r'
elif parse.get(TENSE, '') == PLUPERFECT:
ldt_tag += 'l'
elif parse.get(TENSE, '') == FUTURE_PERFECT:
ldt_tag += 't'
elif parse.get(TENSE, '') == FUTURE:
ldt_tag += 'f'
else:
if parse.get(MOOD, '') == GERUNDIVE or parse.get(MOOD, '') == GERUND:
ldt_tag += 'p'
else:
ldt_tag += '-'
if parse.get(MOOD, '') == INDICATIVE:
ldt_tag += 'i'
elif parse.get(MOOD, '') == SUBJUNCTIVE:
ldt_tag += 's'
elif parse.get(MOOD, '') == INFINITIVE:
ldt_tag += 'n'
elif parse.get(MOOD, '') == IMPERATIVE:
ldt_tag += 'm'
elif parse.get(MOOD, '') == GERUNDIVE:
ldt_tag += 'g'
elif parse.get(MOOD, '') == SUPINE:
ldt_tag += 'u'
elif parse.get(MOOD, '') == GERUND:
ldt_tag += 'd'
elif parse.get(MOOD, '') == PARTICIPLE:
ldt_tag += 'p'
else:
ldt_tag += '-'
if parse.get(VOICE, '') == ACTIVE:
ldt_tag += 'a'
elif parse.get(VOICE, '') == PASSIVE:
ldt_tag += 'p'
else:
        if ((parse.get(TENSE, '') == PRESENT and parse.get(MOOD, '') == PARTICIPLE)
                or parse.get(MOOD, '') == GERUND):
ldt_tag += 'a'
        elif ((parse.get(TENSE, '') == PERFECT and parse.get(MOOD, '') == PARTICIPLE)
                or parse.get(MOOD, '') == GERUNDIVE):
ldt_tag += 'p'
else:
ldt_tag += '-'
if parse.get(GENDER, '') == MASCULINE:
ldt_tag += 'm'
elif parse.get(GENDER, '') == FEMININE:
ldt_tag += 'f'
elif parse.get(GENDER, '') == NEUTER:
ldt_tag += 'n'
else:
ldt_tag += '-'
if parse.get(CASE, '') == NOMINATIVE:
ldt_tag += 'n'
elif parse.get(CASE, '') == GENITIVE:
ldt_tag += 'g'
elif parse.get(CASE, '') == DATIVE:
ldt_tag += 'd'
elif parse.get(CASE, '') == ACCUSATIVE:
ldt_tag += 'a'
elif parse.get(CASE, '') == ABLATIVE:
ldt_tag += 'b'
elif parse.get(CASE, '') == VOCATIVE:
ldt_tag += 'v'
elif parse.get(CASE, '') == LOCATIVE:
ldt_tag += 'l'
else:
ldt_tag += '-'
if parse.get(DEGREE, '') == POSITIVE:
ldt_tag += '-'
elif parse.get(DEGREE, '') == COMPARATIVE and parse.get(REGULARITY, '') != IRREGULAR and ldt_tag[0] != 'd':
# Irregular forms are not marked for degree in LDT, nor adverbs (with few exceptions)!
ldt_tag += 'c'
elif parse.get(DEGREE, '') == SUPERLATIVE and parse.get(REGULARITY, '') != IRREGULAR and ldt_tag[0] != 'd':
ldt_tag += 's'
else:
ldt_tag += '-'
return ldt_tag
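# Illustrative example (not in the original source): encoding a feminine
# nominative singular noun yields the tag 'n-s---fn-':
#   parse_to_ldt({'pos': 'noun', 'number': 'sg',
#                 'gender': 'fem', 'case': 'nom'}) == 'n-s---fn-'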
def unicodeaccents(txt):
for source, replacement in [("a_", "ā"), ("e_", "ē"), ("i_", "ī"), ("o_", "ō"), ("u_", "ū"), ("y_", "ȳ"),
("A_", "Ā"), ("E_", "Ē"), ("I_", "Ī"), ("O_", "Ō"), ("U_", "Ū"), ("Y_", "Ȳ"),
("ä_", "ā"), ("ë_", "ē"), ("ï_", "ī"), ("ö_", "ō"), ("ü_", "ū"), ("ÿ_", "ȳ"),
("æ_", "æ"), ("œ_", "œ"), ("Æ_", "Æ"), ("Œ_", "Œ")]:
txt = txt.replace(source, replacement)
return txt
def escape_macrons(txt):
for source, replacement in [("ā", "a_"), ("ē", "e_"), ("ī", "i_"), ("ō", "o_"), ("ū", "u_"), ("ȳ", "y_"),
("Ā", "A_"), ("Ē", "E_"), ("Ī", "I_"), ("Ō", "O_"), ("Ū", "U_"), ("Ȳ", "Y_")]:
txt = txt.replace(source, replacement)
return txt
def removemacrons(txt):
for source, replacement in [("ā", "a"), ("ē", "e"), ("ī", "i"), ("ō", "o"), ("ū", "u"), ("ȳ", "y"),
("Ā", "A"), ("Ē", "E"), ("Ī", "I"), ("Ō", "O"), ("Ū", "U"), ("Ȳ", "Y")]:
txt = txt.replace(source, replacement)
return txt
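# Illustrative round trip (not in the original source):
#   unicodeaccents('vi_ta') == 'vīta'
#   escape_macrons('vīta') == 'vi_ta'
#   removemacrons('vīta') == 'vita'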
def filter_accents(accented):
accented = accented.replace("^_", "_^")
accented = re.sub("_\^([bcdfgpt][lr])", "^\\1", accented)
accented = re.sub("u_m$", "um", accented)
accented = re.sub("([AEIOUYaeiouy])n([sfx]|ct)", "\\1_n\\2", accented)
return accented
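# Illustrative examples (not in the original source): the macron on a final
# 'um' is dropped, and a vowel before n+s/f/x or n+ct is marked long:
#   filter_accents('bellu_m') == 'bellum'
#   filter_accents('infe_lix') == 'i_nfe_lix'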
def morpheus_to_parses(wordform, nl):
"""Based on CruncherToXML.java in Perseus Hopper"""
parse = {}
nl = nl.replace("irreg_comp", "irreg comp")
nl = nl.replace("irreg_superl", "irreg superl")
morph_codes = nl.split()
accented = morph_codes[1]
lemma = None
if accented.count(",") == 0:
lemma = accented
accented = wordform
elif accented.count(",") == 1:
lemma = accented.split(",")[1]
accented = accented.split(",")[0]
assert lemma is not None
parse[LEMMA] = lemma
parse[ACCENTEDFORM] = filter_accents(accented)
last_morph_code = morph_codes[-1]
pos_abbrev = morph_codes[0]
if last_morph_code == "adverb":
parse[PART_OF_SPEECH] = ADVERB
elif last_morph_code == "article":
parse[PART_OF_SPEECH] = ARTICLE
elif last_morph_code == "particle":
parse[PART_OF_SPEECH] = PARTICLE
elif last_morph_code == "conj":
parse[PART_OF_SPEECH] = CONJUNCTION
elif last_morph_code == "prep":
parse[PART_OF_SPEECH] = PREPOSITION
elif last_morph_code in ["pron1", "pron2", "pron3", "relative", "demonstr", "indef", "interrog"]:
parse[PART_OF_SPEECH] = PRONOUN
elif last_morph_code == "numeral":
parse[PART_OF_SPEECH] = NUMERAL
elif last_morph_code == "exclam":
parse[PART_OF_SPEECH] = EXCLAMATION
elif last_morph_code == "alphabetic":
parse[PART_OF_SPEECH] = IRREGULAR
elif morph_codes[2] == "adverbial":
parse[PART_OF_SPEECH] = ADVERBIAL
elif pos_abbrev == "V":
parse[PART_OF_SPEECH] = VERB
elif pos_abbrev == "P":
# parse[PART_OF_SPEECH] = PARTICIPLE
parse[PART_OF_SPEECH] = VERB
parse[MOOD] = PARTICIPLE
elif pos_abbrev == "N":
if last_morph_code in ["us_a_um", "0_a_um", "er_ra_rum", "er_era_erum", "ius_ia_ium", "is_e", "er_ris_re",
"ans_adj", "ens_adj", "us_ius_adj", "0_ius_adj", "ior_ius_comp", "or_us_comp", "ax_adj",
"0_adj3", "peLs_pedis_adj", "ox_adj", "ix_adj", "s_tis_adj", "ex_icis_adj", "s_dis_adj",
"irreg_adj3", "irreg_adj1", "irreg_adj2", "pron_adj1", "pron_adj3"]:
parse[PART_OF_SPEECH] = ADJECTIVE
elif "pp4" in last_morph_code: # This is not in CruncherToXML...
if 'supine' in morph_codes:
parse[PART_OF_SPEECH] = VERB # ? Supine attribute is not used in LDT
else:
parse[PART_OF_SPEECH] = ADJECTIVE # Past participles in the comparative or superlative. But what about "amantior"?
else:
parse[PART_OF_SPEECH] = NOUN
else:
print("Warning: Unknown Morpheus Part-of-Speech tag: " + pos_abbrev)
def setfeature(parse, code, overwrite=False):
featfound = False
for feature, possiblevalues in featMap.items():
if code in possiblevalues:
if parse.get(feature) is None or overwrite:
parse[feature] = code
featfound = True
elif parse.get(feature) == code:
featfound = True
else:
print("Warning: Feature", feature, "already set! Old:", parse.get(feature), "New:", code)
if not featfound:
pass
# print("Warning: Code", code, "not mapped to feature!")
# enddef
grouped_parses = [parse]
for i in range(2, len(morph_codes)-1):
code = morph_codes[i]
if code.count('/') > 0:
code_components = code.split('/')
new_parses = []
for existingParse in grouped_parses:
for code_component in code_components:
dup_parse = existingParse.copy()
setfeature(dup_parse, code_component)
new_parses.append(dup_parse)
grouped_parses = new_parses
else:
for group_parse in grouped_parses:
setfeature(group_parse, code)
# Morpheus does not report gerunds, only gerundives. So for those gerundives which look like gerunds, add alternative parses.
# Similarly, many third declension nomina which can be of any gender are not marked for gender at all.
final_parses = []
for parse in grouped_parses:
if parse.get(MOOD, '') == GERUNDIVE and parse.get(NUMBER, '') == SINGULAR \
and parse.get(GENDER, '') == NEUTER and parse.get(CASE, '') != NOMINATIVE:
new_parse = parse.copy()
setfeature(new_parse, GERUND, overwrite=True)
final_parses.append(new_parse)
elif parse.get(GENDER, '') == '' and parse.get(CASE, '') != '':
new_parse = parse.copy()
setfeature(new_parse, MASCULINE)
final_parses.append(new_parse)
new_parse = parse.copy()
setfeature(new_parse, FEMININE)
final_parses.append(new_parse)
setfeature(parse, NEUTER)
# endif
final_parses.append(parse)
return final_parses
def parse_to_proiel_tag(parse):
tag = ""
if parse.get(PART_OF_SPEECH, '') == NOUN:
tag += 'Nb'
elif parse.get(PART_OF_SPEECH, '') == VERB:
tag += 'V-'
# elif parse.get(PART_OF_SPEECH, '') == PARTICIPLE:
# tag += 't'
elif parse.get(PART_OF_SPEECH, '') == ADJECTIVE:
tag += 'A-'
elif parse.get(PART_OF_SPEECH, '') == ADVERB or parse.get(PART_OF_SPEECH, '') == ADVERBIAL:
tag += 'Df'
elif parse.get(PART_OF_SPEECH, '') == CONJUNCTION:
tag += 'C-'
elif parse.get(PART_OF_SPEECH, '') == PREPOSITION:
tag += 'R-'
elif parse.get(PART_OF_SPEECH, '') == PRONOUN:
tag += 'Pp'
elif parse.get(PART_OF_SPEECH, '') == NUMERAL:
tag += 'Ma'
elif parse.get(PART_OF_SPEECH, '') == INTERJECTION:
tag += 'I-'
elif parse.get(PART_OF_SPEECH, '') == EXCLAMATION:
tag += 'I-'
elif parse.get(PART_OF_SPEECH, '') == PUNCTUATION:
tag += 'X-'
else:
tag += 'F-'
if parse.get(PERSON, '') == FIRST_PERSON:
tag += '1'
elif parse.get(PERSON, '') == SECOND_PERSON:
tag += '2'
elif parse.get(PERSON, '') == THIRD_PERSON:
tag += '3'
else:
tag += '-'
if parse.get(NUMBER, '') == SINGULAR:
tag += 's'
elif parse.get(NUMBER, '') == PLURAL:
tag += 'p'
else:
tag += '-'
if parse.get(TENSE, '') == PRESENT:
tag += 'p'
elif parse.get(TENSE, '') == IMPERFECT:
tag += 'i'
elif parse.get(TENSE, '') == PERFECT:
tag += 'r'
elif parse.get(TENSE, '') == PLUPERFECT:
tag += 'l'
elif parse.get(TENSE, '') == FUTURE_PERFECT:
tag += 't'
elif parse.get(TENSE, '') == FUTURE:
tag += 'f'
else:
tag += '-'
if parse.get(MOOD, '') == INDICATIVE:
tag += 'i'
elif parse.get(MOOD, '') == SUBJUNCTIVE:
tag += 's'
elif parse.get(MOOD, '') == INFINITIVE:
tag += 'n'
elif parse.get(MOOD, '') == IMPERATIVE:
tag += 'm'
elif parse.get(MOOD, '') == GERUNDIVE:
tag += 'g'
elif parse.get(MOOD, '') == SUPINE:
tag += 'u'
elif parse.get(MOOD, '') == GERUND:
tag += 'd'
elif parse.get(MOOD, '') == PARTICIPLE:
tag += 'p'
else:
tag += '-'
if parse.get(VOICE, '') == ACTIVE:
tag += 'a'
elif parse.get(VOICE, '') == PASSIVE:
tag += 'p'
else:
if parse.get(TENSE, '') == PRESENT and parse.get(MOOD, '') == PARTICIPLE:
tag += 'a'
elif parse.get(TENSE, '') == PERFECT and parse.get(MOOD, '') == PARTICIPLE:
tag += 'p'
else:
tag += '-'
if parse.get(GENDER, '') == MASCULINE:
tag += 'm'
elif parse.get(GENDER, '') == FEMININE:
tag += 'f'
elif parse.get(GENDER, '') == NEUTER:
tag += 'n'
else:
tag += '-'
if parse.get(CASE, '') == NOMINATIVE:
tag += 'n'
elif parse.get(CASE, '') == GENITIVE:
tag += 'g'
elif parse.get(CASE, '') == DATIVE:
tag += 'd'
elif parse.get(CASE, '') == ACCUSATIVE:
tag += 'a'
elif parse.get(CASE, '') == ABLATIVE:
tag += 'b'
elif parse.get(CASE, '') == VOCATIVE:
tag += 'v'
elif parse.get(CASE, '') == LOCATIVE:
tag += 'l'
else:
tag += '-'
if parse.get(DEGREE, '') == POSITIVE:
tag += 'p'
elif parse.get(DEGREE, '') == COMPARATIVE:
tag += 'c'
elif parse.get(DEGREE, '') == SUPERLATIVE:
tag += 's'
else:
if parse.get(PART_OF_SPEECH, '') == ADJECTIVE:
tag += 'p'
else:
tag += '-'
tag += '-'
if tag[2:] == "---------":
tag += 'n'
else:
tag += 'i'
return tag
def parses_to_proiel_tags(parses):
tags = []
for parse in parses:
tags.append(parse_to_proiel_tag(parse))
tagswithgender = {}
for tag in tags:
withoutgender = tag[0:7]+tag[8:12]
tagswithgender[withoutgender] = tagswithgender.get(withoutgender, set()) | {tag[7]}
for withoutgender in tagswithgender:
genders = tagswithgender[withoutgender]
if 'm' in genders and 'n' in genders:
tags.append(withoutgender[0:7]+'o'+withoutgender[7:11])
if 'm' in genders and 'f' in genders:
tags.append(withoutgender[0:7]+'p'+withoutgender[7:11])
if 'm' in genders and 'f' in genders and 'n' in genders:
tags.append(withoutgender[0:7]+'q'+withoutgender[7:11])
if 'f' in genders and 'n' in genders:
tags.append(withoutgender[0:7]+'r'+withoutgender[7:11])
for tag in tags:
if tag[0:2] == "Df":
if tag == "Df---------n":
tags.append("Df-------p-i")
tags.append("Dq"+tag[2:])
tags.append("Du"+tag[2:])
elif tag[0:2] == "Ma":
tags.append("Mo"+tag[2:])
elif tag[0:2] == "Pp":
tags.append("Pc"+tag[2:])
tags.append("Pd"+tag[2:])
tags.append("Pi"+tag[2:])
tags.append("Pk"+tag[2:])
tags.append("Pr"+tag[2:])
tags.append("Ps"+tag[2:])
tags.append("Pt"+tag[2:])
tags.append("Px"+tag[2:])
elif tag[0:2] == "Nb":
tags.append("Ne"+tag[2:])
# elif tag[0:8] == "V--s-g-m":
# tags.append("V----d--"+tag[8:])
# elif tag[0:7] == "V--sppa":
# tags.append("A--s---"+tag[7:9]+"p-i")
# elif tag[0:7] == "V--pppa":
# tags.append("A--p---"+tag[7:9]+"p-i")
# elif tag[0:7] == "V--srpp":
# tags.append("A--s---"+tag[7:9]+"p-i")
# elif tag[0:7] == "V--prpp":
# tags.append("A--p---"+tag[7:9]+"p-i")
return tags
def tag_distance(tag1, tag2):
"""To help select the best alternative, define a measure to compare how similar tags are."""
if not (len(tag1) == len(tag2) == 9 or len(tag1) == len(tag2) == 12):
print("Warning: Strange or mismatching tags!", tag1, tag2)
exit(0)
def is_nomen(tag):
        if (tag[0] == 'n' or tag[0] == 'a'
                or (tag[0] == 'v' and (tag[3:6] == 'rpp' or tag[3:6] == 'ppa'))):
            return True
        elif (tag[0] == 'N' or tag[0] == 'A'
                or (tag[0] == 'V' and (tag[4:7] == 'rpp' or tag[4:7] == 'ppa'))):
            return True
return False
# enddef
dist = 0
bothnomenbutdifferent = False
if is_nomen(tag1) and is_nomen(tag2) and tag1[0] != tag2[0]:
bothnomenbutdifferent = True
for i in range(0, len(tag1)):
if bothnomenbutdifferent and (len(tag1) == 9 and i in [3, 4, 5] or len(tag1) == 12 and i in [4, 5, 6]):
continue
else:
if tag1[i] != tag2[i]:
dist += 1
return dist
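# Illustrative example (not in the original source): two 9-character LDT
# tags differing only in the case slot have distance 1:
#   tag_distance('n-s---fn-', 'n-s---fa-') == 1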
# enddef
Allzweckmesser-master/allzweckmesser/scan.py 0000775 0000000 0000000 00000005276 13354251234 0021712 0 ustar 00root root 0000000 0000000 #!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import sys
from sklearn.externals import joblib
from typing import List
from .config import RANKING_MODEL_PATH
from .features import combine_features
from .meters import ALL_METERS, ALL_METER_NAMES, get_reading_meter_combinations
from .model import Verse
from .scanner import Scanner
def scan(plain_verses: List[str], meters=ALL_METER_NAMES,
**options) -> List[Verse]:
"""Scan Latin verses."""
meters = [ALL_METERS[m] for m in meters if m in ALL_METERS]
scanner = Scanner()
scanned_verses = scanner.scan_verses(plain_verses)
model = joblib.load(RANKING_MODEL_PATH)
for verse in scanned_verses:
reading_meter_combinations = (
get_reading_meter_combinations(
verse.readings, meters
)
)
vectors = []
for reading, meter, rmfeatures in reading_meter_combinations:
reading.meter = meter
vectors.append(combine_features(reading.features, rmfeatures))
probs = model.predict_proba(vectors)
sorted_probs = sorted(
[(probs[i], reading) for i in range(len(probs))],
key=lambda x: x[0][0]
)
first_one = sorted_probs[:1]
verse.readings = [prob_reading[1] for prob_reading in first_one]
return scanned_verses
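# Example usage (sketch, not in the original source; assumes a trained
# ranking model at RANKING_MODEL_PATH and 'hexameter' being one of the keys
# in ALL_METER_NAMES):
#   verses = scan(['Arma virumque cano, Troiae qui primus ab oris'],
#                 meters=['hexameter'])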
def parse_args() -> argparse.Namespace:
"""Parse arguments from the commandline.
:return: An argparse Namespace holding the arguments.
"""
d = 'Scan Latin verses.'
parser = argparse.ArgumentParser(prog='allzweckmesser', description=d)
parser.add_argument('--infile', help=('A file containing the verses that'
' are to be scanned.'))
parser.add_argument('--meters', '-m', nargs='+', help=('The considered'
' meters.'))
args = parser.parse_args()
return args
def get_plain_verses(infile: str = None) -> List[str]:
"""Read verses that are to be scanned.
If infile is None the verses are read from the standard input.
:params infile: A file containing one verse per line.
:return: A list of the verses.
"""
if infile:
with open(infile) as f:
plain_verses = [line.strip() for line in f.readlines()]
else:
plain_verses = [line.strip() for line in sys.stdin.readlines()]
return plain_verses
def main():
"""Parse CLI arguments then read and scan verses."""
args = vars(parse_args())
args['plain_verses'] = get_plain_verses(args['infile'])
del args['infile']
scanned_verses = scan(**args)
for v in scanned_verses:
print(v)
if __name__ == '__main__':
main()
Allzweckmesser-master/allzweckmesser/scanner.py 0000664 0000000 0000000 00000064150 13354251234 0022410 0 ustar 00root root 0000000 0000000 # -*- coding: utf-8 -*-
import copy
import re
from typing import Dict, List, Set, Tuple
from itertools import product
from .db import FormAnalysis
from .model import Reading, Syllable, Token, Verse, Phenomenon
from .features import ReadingFeature
from .wordlist import WordList
CLITICS = ['que', 'qve', 'ue', 've', 'ne']
SPECIAL_CASES = {
# Positional lengthening for historical reasons
# TODO: But there are also hīc and hōc, which should have vowel_length=2.
'hic': lambda token: [Syllable("hic", token.span,
syllable_length=2, vowel_length=1)],
'hoc': lambda token: [Syllable("hoc", token.span,
syllable_length=2, vowel_length=1)],
# Forms with diphthong.
'ceu': lambda token: [Syllable("ceu", token.span,
syllable_length=2, vowel_length=2)],
'cui': lambda token: [Syllable("cui", token.span,
syllable_length=2, vowel_length=2)],
'ei': lambda token: [Syllable("ei", token.span,
syllable_length=2, vowel_length=2)],
'hei': lambda token: [Syllable("hei", token.span,
syllable_length=2, vowel_length=2)],
'heic': lambda token: [Syllable("heic", token.span,
syllable_length=2, vowel_length=2)],
'heus': lambda token: [Syllable("heus", token.span,
syllable_length=2, vowel_length=2)],
'heu': lambda token: [Syllable("heu", token.span,
syllable_length=2, vowel_length=2)],
'huic': lambda token: [Syllable("huic", token.span,
syllable_length=2, vowel_length=2)],
'hui': lambda token: [Syllable("hui", token.span,
syllable_length=2, vowel_length=2)],
'neu': lambda token: [Syllable("neu", token.span,
syllable_length=2, vowel_length=2)],
'seu': lambda token: [Syllable("seu", token.span,
syllable_length=2, vowel_length=2)],
'cuiquam': (lambda token:
[Syllable("cui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("quam", [token.span[0] + 3, token.span[0] + 7],
syllable_length=1, vowel_length=1)]),
'cuiqvam': (lambda token:
[Syllable("cui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("quam", [token.span[0] + 3, token.span[0] + 7],
syllable_length=1, vowel_length=1)]),
'cuique': (lambda token:
[Syllable("cui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("que", [token.span[0] + 3, token.span[0] + 6],
syllable_length=2, vowel_length=2)]),
'cuius': (lambda token:
[Syllable("cui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("us", [token.span[0] + 3, token.span[0] + 5],
syllable_length=2, vowel_length=2)]),
'cujus': (lambda token:
[Syllable("cui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("us", [token.span[0] + 3, token.span[0] + 5],
syllable_length=2, vowel_length=2)]),
'deinde': (lambda token:
[Syllable("deind", [token.span[0], token.span[0] + 5],
syllable_length=2, vowel_length=2),
Syllable("e", [token.span[0] + 5, token.span[0] + 6],
syllable_length=1, vowel_length=1)]),
'huius': (lambda token:
[Syllable("hui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("us", [token.span[0] + 3, token.span[0] + 5],
syllable_length=2, vowel_length=2)]),
'hujus': (lambda token:
[Syllable("hui", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=2),
Syllable("us", [token.span[0] + 3, token.span[0] + 5],
syllable_length=2, vowel_length=2)]),
'proinde': (lambda token:
[Syllable("proind", [token.span[0], token.span[0] + 6],
syllable_length=2, vowel_length=2),
Syllable("e", [token.span[0] + 6, token.span[0] + 7],
syllable_length=1, vowel_length=1)]),
'necnon': (lambda token:
[Syllable("nec", [token.span[0] + 0, token.span[0] + 3],
syllable_length=2, vowel_length=1),
Syllable("non", [token.span[0] + 3, token.span[0] + 6],
syllable_length=2, vowel_length=2)]),
}
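# Example (not in the original source): SPECIAL_CASES['cui'](token) returns
# a single long syllable spanning the whole token, overriding the default
# syllabification for this irregular form.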
def get_clitic(token: str) -> Tuple[str, str]:
"""Split a clitic from the token if possible.
:param token: A token that may contain a clitic.
:return: A tuple of token without clitic and clitic, if a clitic
was found. Or a tuple of the original token and None if no
clitic was found.
"""
for clitic in CLITICS:
if token.endswith(clitic):
return token[:-len(clitic)], clitic
else:
return token, None
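# Illustrative examples (not in the original source):
#   get_clitic('virumque') == ('virum', 'que')
#   get_clitic('arma') == ('arma', None)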
def multiply_readings(readings: List[Reading],
n: int) -> List[Reading]:
"""Copy the readings n - 1 times.
:param readings: The readings that are to be multiplied.
:param n: The number with which to multiply.
:return: n times as many readings as they were before.
"""
orig_readings_len = len(readings)
for _ in range(n - 1):
for i in range(orig_readings_len):
# TODO: Think about moving this to Reading in model.py
new_reading = Reading(
[copy.deepcopy(token) for token in readings[i].tokens]
)
readings.append(new_reading)
return readings
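# Illustrative example (not in the original source): multiplying two
# readings with n=3 yields six readings; the two originals are followed by
# two rounds of deep copies of them.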
def tokenize(plain_verse: str) -> List[Token]:
"""Tokenize a verse.
This function first splits on whitespace and then further on
punctuation. Punctuation marks are regarded as tokens and are
therefore included in the list of returned tokens.
:param plain_verse: The verse that is to be tokenized.
:return: A list of the found tokens.
"""
tokens = []
i = 0 # Index into the whole verse.
for token in re.split(r'\s', plain_verse):
if token:
# Add Tokens for the punctuation before a token.
            pre_punct_match = re.search(r'^\W+', token)
if pre_punct_match:
for c in pre_punct_match.group():
tokens.append(Token(c, (i, i + 1)))
i += 1
pre_punct_end = pre_punct_match.end()
else:
pre_punct_end = 0
            post_punct_match = re.search(r'[\W_]+$', token)
if post_punct_match:
# Add a Token for the word itself.
word = token[pre_punct_end:post_punct_match.start()]
tokens.append(Token(word, (i, i + len(word))))
i += len(word)
# Add Tokens for the punctuation after a token.
for c in post_punct_match.group():
tokens.append(Token(c, (i, i + 1)))
i += 1
else:
# Add a Token for the word itself.
word = token[pre_punct_end:]
tokens.append(Token(word, (i, i + len(word))))
i += len(word)
i += 1
return tokens
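# Illustrative example (not in the original source):
#   tokenize('Arma virumque cano,') yields the Tokens
#   'Arma' (0, 4), 'virumque' (5, 13), 'cano' (14, 18) and ',' (18, 19).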
def blow_up_accented(accented):
matches = list(re.finditer(r'[_^]{2}', accented))
if matches:
# Generate blueprint.
blueprint = [accented[:matches[0].start()]]
for m in matches:
blueprint.append('{}')
blueprint.append(accented[matches[-1].end():])
blueprint = ''.join(blueprint)
# Fill blueprint with variants of accented form.
combinations = product([0, 1], repeat=len(matches))
blown_up = []
for combi in combinations:
format_args = ['_' if i == 1 else '^'
for i in combi]
blown_up.append(blueprint.format(*format_args))
else:
        # The accented form is unambiguous.
blown_up = [accented]
return blown_up
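# Illustrative example (not in the original source): an ambiguous '_^' mark
# is expanded into both variants:
#   blow_up_accented('volu_^cris') == ['volu^cris', 'volu_cris']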
def condense_analyses(
analyses: Set[FormAnalysis]) -> Dict[str, Dict[str, Set[str]]]:
"""Condense analyses objects into a nested dict representation.
:param analyses: The analyses that are to be condensed.
:return: A condensed version of the analyses. The keys in the
outer dict are the accented forms, the keys in the inner dict
are lemmas and the strings in the set are the morphtags.
"""
condensed = {}
for a in analyses:
for accented in blow_up_accented(a.accented):
if accented in condensed:
if a.lemma in condensed[accented]:
condensed[accented][a.lemma].add(a.morphtag)
else:
condensed[accented][a.lemma] = {a.morphtag}
else:
condensed[accented] = {a.lemma: {a.morphtag}}
return condensed
def lemmatize(word_list: WordList, reading: Reading) -> List[Reading]:
"""Find different possible readings by analyzing the word forms.
This function analyzes the word forms in the verse and creates
readings for all possible combinations of accented versions of the
words. E.g. if two words occur with more than one accented
version, say one with two accented versions and the other with
three accented versions, a total of six readings will be
generated.
:param word_list: The word list to look up the word forms.
:param reading: A basic reading of a verse that is to be analyzed.
:return: A list of readings of the verse that differ with respect
to the accented versions for the forms.
"""
token_alternatives = []
for token in reading.tokens:
if token.is_punct():
analyses = None
else:
analyses = word_list.analyze(token.text)
if not analyses:
bare, clitic = get_clitic(token.text)
if clitic:
token.clitic = clitic
analyses = word_list.analyze(bare)
alternatives = []
if analyses:
condensed_analyses = condense_analyses(analyses)
for accented, lemma_to_morphtags in condensed_analyses.items():
# The token should not have any syllables at this
# point so that the question of copy vs deepcopy
# does not even arise.
t = copy.copy(token)
t.accented = accented
t.lemma_to_morphtags = lemma_to_morphtags
alternatives.append(t)
else:
alternatives.append(token)
token_alternatives.append(alternatives)
readings = [Reading()]
for alternatives in token_alternatives:
orig_readings_len = len(readings)
readings = multiply_readings(readings, len(alternatives))
for i, token in enumerate(alternatives):
start = i * orig_readings_len
for reading in readings[start:start+orig_readings_len]:
reading.append_token(token)
return readings
def get_syllables_for_accented_form(token):
syllables = []
    # NOTE: The original regex and most of the surrounding loop were lost in
    # extraction; the following is a hedged reconstruction. The accented
    # form is split into vowel chunks (a vowel or diphthong, optionally
    # followed by a length mark '_' or '^'; 'u' after 'q' counts as a
    # consonant) and consonant chunks, and consonants are attached to the
    # syllable of the preceding vowel.
    regex = r'((?<![qQ])(?:ae|au|oe|eu|[AEIOUYaeiouy])[_^]?)'
    syll_text = ''
    syll_start = token.span[0]
    syll_vowel_length = 1
    for c in re.split(regex, token.accented):
        if not c:
            continue
        if re.match('[AEIOUYaeiouy]', c):
            if re.search('[AEIOUYaeiouy]', syll_text):
                # The open syllable already contains a vowel: close it.
                syll = Syllable(syllable=syll_text,
                                span=[syll_start,
                                      syll_start + len(syll_text)],
                                idx=None,
                                vowel_length=syll_vowel_length,
                                syllable_length=syll_vowel_length)
                syllables.append(syll)
                syll_start += len(syll_text)
                syll_text = ''
            syll_text += c.rstrip('_^')
            # Long if the chunk carries a macron or is a diphthong.
            syll_vowel_length = (
                2 if len(c) > 1 and c[1] in 'AEIOUYaeiouy_' else 1
            )
else:
syll_text += c.rstrip('_^')
if syll_text:
# Add the last syllable.
syll = Syllable(syllable=syll_text,
span=[syll_start, syll_start + len(syll_text)],
idx=None,
vowel_length=syll_vowel_length,
syllable_length=syll_vowel_length)
syllables.append(syll)
return syllables
def get_syllables_for_unknown_form(token):
"""Stolen from Jonathan (insert proper citation here)
ee
"""
strng = token.text
    # Remember the original casing before lowercasing; the original code
    # tested isupper() on the already lowercased string, which can never
    # be true.
    is_upper = strng.isupper()
    strng = strng.lower()
    if is_upper:
chunks = [
chunk
for chunk
in re.split("(ae|oe|au|eu|yi|[aeiouy])", strng.lower())
if chunk != ""
]
else:
        chunks = [
chunk
for chunk
in re.split("(ae|au|oe|[aeiouy])", strng.lower())
if chunk != ""
]
y = []
    # Counter j: for even j, consonants are appended to y;
    # for odd j, vowels are appended to the consonants.
    # Also to consider: does the word start with a vowel?
j = -1
fluff = 0
for ch in chunks:
j += 1
if j == 0:
if re.match("[^aeiou]", chunks[0]):
fluff = 1
y.append(ch)
else:
y.append(ch)
j += 1
elif j == 1 and fluff == 1:
y[0] += chunks[1]
else:
if j % 2 == 0:
if re.match("[^aeiou]", ch):
y[-1] += ch
else:
y.append(ch)
j += 1
else:
y.append(ch)
res = list()
length = token.span[0]
for x in y:
res.append(Syllable(x, [length, length+len(x)]))
length += (len(x))
# special cases again
if re.search("oen?$", strng) and strng.isupper():
res[-1] = Syllable("o", [res[-1].span[0], res[-1].span[0]+1])
if strng.endswith("n"):
res.append(Syllable("en", [res[-1].span[0] + 1, res[-1].span[1]]))
else:
res.append(Syllable("e", [res[-1].span[0] + 1, res[-1].span[1]]))
for syll in res:
if re.search(r'[aeiuoy]{2}', syll.text):
syll.vowel_length = 2
syll.syllable_length = 2
return res
def join_u_diphthong_syllables(token, syllables):
i = 0
while i < len(syllables) - 1:
this_syllable = syllables[i]
next_syllable = syllables[i + 1]
if (token.text[:this_syllable.span[1] - token.span[0]].endswith('ngu')
and next_syllable.text[0] in 'aeioy'):
this_syllable.text += next_syllable.text
this_syllable.span[1] = next_syllable.span[1]
if next_syllable.vowel_length == 2:
this_syllable.vowel_length = 2
this_syllable.syllable_length = 2
syllables.pop(i + 1)
i += 1
i += 1
return syllables
def generate_synizesis(reading):
syn_list = list()
for token in reading.tokens:
for i, syl in enumerate(token.syllables[:-1]):
other_syl = token.syllables[i+1]
new_text = syl.text+other_syl.text
        match = re.search(r'[aeiouy][aeiouy]', new_text)  # pattern reconstructed; the original was lost in extraction
        # NOTE: A span of the original file was lost in extraction here,
        # comprising the remainder of generate_synizesis as well as the
        # definitions of get_syllables, make_elisions and the head of
        # parse_verse. The code below belongs to parse_verse: 'abstract' is
        # a string of syllable lengths containing '{}' placeholders for
        # ambiguous muta-cum-liquida syllables, and 'mcl_count' is the
        # number of such placeholders.
        if mcl_count > 0:
new_abstracts = list()
combinations = list(product(['1', '2'], repeat=mcl_count))
for combi in combinations:
new_abstracts.append(abstract.format(*combi))
            # One reading copy per length combination (2**mcl_count many).
            reading_copies = multiply_readings([reading], 2**mcl_count)
else:
new_abstracts = [abstract]
reading_copies = [reading]
for i in range(len(new_abstracts)):
blueprint = new_abstracts[i]
new_reading = reading_copies[i]
syll_id = 0
for token in new_reading.tokens:
for s in token.syllables:
if blueprint[syll_id] == '1':
s.syllable_length = 1
if ('positional lengthening' in s.phenomena
and 'muta cum liquida' in s.phenomena):
(s.phenomena['positional lengthening']
.overruled_by) = 'muta cum liquida'
elif blueprint[syll_id] == '2':
s.syllable_length = 2
if (s.vowel_length < 2
and 'muta cum liquida' in s.phenomena):
reading.features[
ReadingFeature.MCL_TRIGGERS_PL] += 1
syll_id += 1
new_readings.append(copy.deepcopy(new_reading))
verse.readings = new_readings
return verse
class Scanner:
def __init__(self, old=False):
self.word_list = WordList()
self.old = old
def scan_verses(self, plain_verses: List[str]):
base_readings = [Reading(tokens=tokenize(v)) for v in plain_verses]
verses = [
Verse(verse=v, readings=lemmatize(self.word_list, br))
for v, br in zip(plain_verses, base_readings)
]
for verse in verses:
new_readings = list()
for reading in verse.readings:
new_readings.extend(get_syllables(reading))
verse.readings = new_readings
parse_verse(verse)
            make_elisions(verse, self.old)  # Old Latin is not fully implemented yet.
return verses
Allzweckmesser-master/allzweckmesser/style.py 0000664 0000000 0000000 00000001600 13354251234 0022106 0 ustar 00root root 0000000 0000000 # -*- coding: utf-8 -*-
from colorama import init, Back, Fore, Style
init()
def mark_long(text):
return ('{Style.BRIGHT}{text}{Style.NORMAL}'
.format(Style=Style, text=text))
def mark_wrong_length(text):
return ('{Fore.RED}{text}{Fore.RESET}'
.format(Fore=Fore, text=text))
def mark_wrong_syllables(text):
return ('{Back.RED}{text}{Back.RESET}'
.format(Back=Back, text=text))
def mark_correct(text):
return ('{Fore.GREEN}{text}{Fore.RESET}'
.format(Fore=Fore, text=text))
def mark_syllables_provider(text, provider):
if provider == 'get_syllables_for_accented_form':
return text
elif provider == 'SPECIAL_CASES':
return '{}{}'.format(text, '₀')
elif provider == 'get_syllables_for_unknown_form':
return '{}{}'.format(text, '₁')
else:
return '{}{}'.format(text, 'ₑ')
Allzweckmesser-master/allzweckmesser/wordlist.py 0000664 0000000 0000000 00000026361 13354251234 0022630 0 ustar 00root root 0000000 0000000 # -*- coding: utf-8 -*-
"""This module provides the WordList class, which serves to look up
form analyses produced by the Morpheus tool.
"""
from collections import defaultdict
import os
import subprocess
from typing import Dict, List, Set, Union
from sqlalchemy import and_, or_
from sqlalchemy.orm import sessionmaker
from . import postags
from .config import MACRONS_FILE, MORPHEUS_DIR, POPULATE_DATABASE
from .db import SESSION_FACTORY, FormAnalysis
def clean_lemma(lemma):
# TODO: Find out what this is for and write a docstring for it.
return (lemma.replace("#", "").replace("1", "").replace(" ", "+")
.replace("-", "").replace("^", "").replace("_", ""))
class WordList:
"""Mapping from forms to Morpheus analyses of the forms.
A WordList stores FormAnalysis objects in the `form_analyses`
attribute and forms that are unknown to Morpheus in the set of
`unknown_forms`.
Use the functions :func: `get_morphtags`, :func: `get_lemmas`,
:func: `get_accenteds` and the more general :func: `analyze` to
look up information about a form. Use the :func:
`populate_database` function to initially populate the database.
"""
def __init__(self, form_analyses: Dict[str, Set[FormAnalysis]] = None,
unknown_forms: Set[str] = None,
session_factory: sessionmaker = SESSION_FACTORY,
populate_database: bool = POPULATE_DATABASE) -> None:
"""Initialize a WordList.
:param form_analyses: Mapping of forms to form analyses.
:param unknown_forms: Words unknown to morpheus.
:param session_factory: The sqlalchemy sessionmaker.
"""
self.form_analyses = form_analyses or defaultdict(set)
self.unknown_forms = unknown_forms or set()
self.session_factory = session_factory
self._session = self.session_factory()
if populate_database:
self.populate_database()
def get_morphtags(self, form: str) -> Set[str]:
"""Get the morphtags of a form.
:param form: The form that is to be analyzed.
:return: The morphtags of the form.
"""
analyses = self.analyze(form)
return {a.morphtag for a in analyses} if analyses else set()
def get_lemmas(self, form: str) -> Set[str]:
"""Get the lemmas of a form.
:param form: The form that is to be analyzed.
:return: The lemmas of the form.
"""
analyses = self.analyze(form)
return {a.lemma for a in analyses} if analyses else set()
def get_accenteds(self, form: str) -> Set[str]:
"""Get the accented versions of a form.
:param form: The form that is to be analyzed.
:return: The accented versions of the form.
"""
analyses = self.analyze(form)
return {a.accented for a in analyses} if analyses else set()
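    # Example usage (sketch, not in the original source; actual results
    # depend on the Morpheus data):
    #   wl = WordList()
    #   wl.get_lemmas('cano')     # e.g. {'cano'}
    #   wl.get_accenteds('cano')  # e.g. {'cano_'}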
def populate_database(self, macrons_file: str = MACRONS_FILE) -> None:
"""Populate database with form analyses from `macrons_file`.
`macrons_file` has to consist of lines that are either
1. a form, a morphtag, a lemma and an accented version
separated by spaces or
2. a comment starting with a number sign (#)
:param `macrons_file`: A text file containing the analysis
info.
"""
with open(macrons_file) as f:
for line in f:
line = line.strip()
if not line.startswith('#'):
form, morphtag, lemma, accented = line.split()
analysis = FormAnalysis(form=form, morphtag=morphtag,
lemma=lemma, accented=accented)
self._session.add(analysis)
self._session.commit()
def analyze(self, form: str) -> Set[FormAnalysis]:
"""Get a list of analyses for `form`.
The function first attempts to get the analyses from the saved
analyses, then attempts to load them from the database, then
attempts to analyze it by giving it to the Morpheus cruncher.
:param form: The form that is to be analyzed.
:return: The analyses or an empty list if the form is unknown.
"""
if form not in self.form_analyses:
if form in self.unknown_forms:
return set()
elif not self.load_from_db(form):
morpheus_analyses = self.analyze_with_morpheus([form])
if morpheus_analyses:
self.cache_analyses(morpheus_analyses)
else:
self.unknown_forms.add(form)
if not self.form_analyses[form] and form[0].isupper():
# Try to look up the non-capitalized version of the form.
analyses = self.analyze(form.lower())
if analyses:
self.cache_analyses({form: analyses})
return self.form_analyses[form]
def load_from_db(self, form: str) -> Set[FormAnalysis]:
"""Load analyses of `form` from the database.
:param form: The form that is to be analyzed.
:return: The analyses or an empty list if the form is unknown.
"""
analyses = set(self._session.query(FormAnalysis)
.filter_by(form=form).all())
if analyses:
self.cache_analyses({form: analyses})
return analyses
def load_all_from_db(self) -> Set[FormAnalysis]:
"""Load all analyses from the database.
:return: The analyses in the database
"""
analyses = set(self._session.query(FormAnalysis).all())
for analysis in analyses:
self.form_analyses[analysis.form].add(analysis)
return analyses
def analyze_with_morpheus(self, forms: Union[List[str], str],
update_db: bool = True,
morpheus_dir: str = MORPHEUS_DIR) -> Dict[
str, Set[FormAnalysis]]:
"""Start a morpheus process to analyze several forms.
:param forms: The forms that are to be analyzed.
:param update_db: Whether to update the database with the
analyses.
:param morpheus_dir: Directory where the Morpheus tool is
installed
:return: The analyses or an empty dict if all forms are unknown.
"""
env = os.environ.copy()
env['MORPHLIB'] = os.path.join(morpheus_dir, 'stemlib')
cruncher = os.path.join(morpheus_dir, 'bin', 'cruncher')
args = [cruncher, '-L']
proc = subprocess.run(args, env=env, universal_newlines=True,
input='\n'.join(forms), stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
if proc.returncode != 0:
raise RuntimeError(
'Failed executing morpheus with these args: {}\nStderr: "{}"'
.format(args, proc.stderr)
)
else:
out_lines = proc.stdout.split('\n')
analyzed_forms = {}
final_analyses = defaultdict(set)
unknown_forms = set()
for i in range(len(out_lines)):
form = out_lines[i]
            if (i < len(out_lines) - 1
                    and out_lines[i+1].startswith('<NL>')):
# Next line has NL analyses, collect them.
nls = out_lines[i+1].strip()
analyzed_forms[form] = analyzed_forms.get(form, '') + nls
            elif not form.startswith('<NL>'):
                # form is actually a word form, but since the next line
                # does not start with <NL>, the form must be unknown to
                # morpheus.
# TODO: Don’t add the empty string here.
unknown_forms.add(form)
for form, nls in analyzed_forms.items():
parses = []
for nl in nls.split(""):
nl = nl.replace("", "")
nlparts = nl.split()
if len(nlparts) > 0:
parses += postags.morpheus_to_parses(form, nl)
lemmatag_to_accenteds = defaultdict(list)
for parse in parses:
lemma = clean_lemma(parse[postags.LEMMA])
parse[postags.LEMMA] = lemma
accented = parse[postags.ACCENTEDFORM]
parse[postags.ACCENTEDFORM] = accented
tag = postags.parse_to_ldt(parse)
lemmatag_to_accenteds[(lemma, tag)].append(accented)
            if len(lemmatag_to_accenteds) == 0:
                print('Warning: no parses could be collected for form', form)
                continue
for (lemma, tag), accenteds in lemmatag_to_accenteds.items():
# Sometimes there are multiple accented forms;
# prefer 'volvit' to 'voluit',
# but keep 'Ju_lius' as well as 'I^u_lius'.
bestaccented = sorted(accenteds,
key=lambda x: x.count('v'))[-1]
lemmatag_to_accenteds[(lemma, tag)] = bestaccented
for (lemma, tag), accented in lemmatag_to_accenteds.items():
analysis = FormAnalysis(form=form, morphtag=tag,
lemma=lemma, accented=accented)
final_analyses[form].add(analysis)
if update_db:
self._session.add(analysis)
if update_db:
self._session.commit()
for form in unknown_forms:
self.unknown_forms.add(form)
if update_db:
self._session.add(FormAnalysis(form=form))
if update_db:
self._session.commit()
self._delete_duplicates_from_db()
return final_analyses
def _delete_duplicates_from_db(self) -> None:
"""Delete duplicate lines from the database."""
fa1 = FormAnalysis
fa2 = self._session.query(FormAnalysis).subquery('fa2')
tbd = (self._session.query(fa1)
.filter(fa1.form == fa2.c.form)
.filter(or_(fa1.morphtag == fa2.c.morphtag,
and_(fa1.morphtag.is_(None),
fa2.c.morphtag.is_(None))))
.filter(or_(fa1.lemma == fa2.c.lemma,
and_(fa1.lemma.is_(None), fa2.c.lemma.is_(None))))
.filter(or_(fa1.accented == fa2.c.accented,
and_(fa1.accented.is_(None),
fa2.c.accented.is_(None))))
.filter(fa1.id > fa2.c.id)
.all())
for a in tbd:
self._session.delete(a)
self._session.commit()
def depopulate_database(self) -> None:
"""Delete all form analyses from the database."""
self._session.query(FormAnalysis).delete()
self._session.commit()
def cache_analyses(self, analyses: Dict[str, Set[FormAnalysis]]) -> None:
"""Store some form analyses in self.form_analyses.
:param analyses: The analyses to cache.
"""
for form, ana_set in analyses.items():
self.form_analyses[form].update(ana_set)
if ana_set and form in self.unknown_forms:
self.unknown_forms.remove(form)
Allzweckmesser-master/blog.html 0000664 0000000 0000000 00000053472 13354251234 0017170 0 ustar 00root root 0000000 0000000
Simon Will, Victor Zimmermann & Christoph Schaller
We provide a tool for measuring Latin verse, as well as a web application that highlights the results and provides helpful annotations of the phenomena that lead to the chosen scansion.
[GitLab] [WebApp]
Introduction
One defining aspect of most poetry in opposition to prose is that poetry is “bound speech”, i.e. speech bound by constraints on the form instead of the content. These constraints can take various forms: For example, they can concern rhyme patterns or the way a text is laid out on paper.
This piece of work, however, focuses on metric constraints, i.e. constraints that concern the rhythmic structure of the text in question: Each line of verse has to follow a certain sequence of marked and unmarked syllables. Most poetry in the Germanic languages is bound by accentuating (or qualitative) metrics, which means that accented syllables (as determined by either loudness or pitch) are considered marked and non-accented syllables are considered unmarked.
In contrast, Ancient Greek and Latin metrics were governed by a quantitative principle, meaning that long syllables are considered marked and short syllables are considered unmarked. There are about a dozen different meters that are frequently used in Latin verse and some more that occur less frequently. The hexameter is by far the most frequent one and is used primarily in epic poetry such as Virgil’s Aeneid and in didactic poetry such as Lucretius’s On the Nature of Things. For this reason, when automatic processing of Latin metrics is attempted, other meters are often overlooked in favor of the hexameter.
The aim of this work was to conceive of a system that automatically determines the quantities of a line’s syllables (a process called scanning) without overly focusing on one specific meter. In addition to building a library to do this, we wanted a web interface that presents the result in an easily digestible way, also detailing how the system arrived at its result, in order to help learners of Latin better understand the process.
Latin Metrics
Metric Principles
As mentioned above, scanning a line means determining its syllables’ quantities. There are two common ways a syllable can be counted as long:
- The syllable contains a long vowel or a diphthong. In this case, the syllable is said to be long “by nature.”
- The syllable’s vowel is followed by two or more consonants (that may well be part of the next syllable or even the next word). In this case, the syllable is considered to be long “by position.”
All other syllables are considered short. However, as is often the case with language, a number of phenomena can occur that change the way the text is read, which also affects the scanning of the line:
If a syllable can be considered long by position but the causing consonant cluster consists only of a muta (b, d, g, p, t, c/k) followed by a liquida (l, r), it may indeed be considered long by position, but more often the lengthening does not occur. For example, the second syllable of “volucris” (“bird”) can be considered long or short.
If one word ends with a vowel or an “m” (which is only nasalized) and the succeeding word begins with a vowel or an “h” (which is only an aspiration mark), the last syllable of the first word is elided. This phenomenon is called elision. For example, in “quare habe” the second syllable of “quare” is elided, resulting in the reading “quar(h)abe”. If the elision is not actually carried out, this is called a hiatus.
If the above situation occurs, but the second word is a certain form of the auxiliary “esse” (“to be”), e.g. “est” or “estis”, the first syllable of the form of “esse” is elided instead of the last syllable of the first word. For example, “pressa est“ is read as “pressast”. This is called apheresis.
If, inside a word, one syllable ends with a vowel and the next begins with a vowel, the first vowel is usually short (in Latin: vocalis ante vocalem corripitur). However, sometimes the two syllables are blended into one, which is then long by nature of containing a diphthong. This is called synizesis. For example, “eorum”, which usually has three syllables (e-o-rum), can be read with two (eo-rum).
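To make the two length criteria concrete, the following minimal sketch (illustrative only, not the code of our system; it assumes the vowel quantities are already known and long vowels are marked with a trailing underscore) checks both conditions:

DIPHTHONGS = {'ae', 'au', 'oe', 'eu', 'ei', 'ui'}

def syllable_is_long(nucleus: str, following_consonants: str) -> bool:
    # Long by nature: a long vowel (e.g. 'a_') or a diphthong.
    if nucleus.endswith('_') or nucleus in DIPHTHONGS:
        return True
    # Long by position: two or more consonants follow the vowel,
    # possibly across a syllable or word boundary.
    return len(following_consonants) >= 2

For instance, syllable_is_long('a', 'rm') is True for the first syllable of “arma”, although its vowel is short.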
Meters
There are various Latin meters (the Hypotactic corpus counts 273 meters, of which only 40 occur in more than 50 lines and only 13 occur in more than 500 lines), and this is not the place for a detailed summary of them. For illustration purposes, however, the schema of one simple and popular meter, the Phalaecian hendecasyllable, is shown below:
x x – ⏑ ⏑ – ⏑ – ⏑ – –
– marks a long syllable, ⏑ marks a short syllable and x marks a syllaba anceps, meaning that it can be either long or short. The last element of any meter is always marked as long in schemas. However, the poet has the license of putting a short syllable in this long element (brevis in longo).
An example of a line fitting the above schema is the first (and any other) line of Catullus 1:
Cui dono lepidum novum libellum?
(To whom do I give this dainty new little book?)
In contrast, this is the hexameter schema:
– ⏕ – ⏕ – ⏕ – ⏕ – ⏕ – –
⏕ means that either one long or two short syllables are possible. This grants the poet significantly more leeway to fit a line to the meter than in the case of the hendecasyllable, but it also makes the line harder to scan. The meters used in Latin comedy, such as the iambic senarius, contain even more uncertainties than the hexameter.
Related Work
Latin prosody has of course been studied for more than 2000 years, and extensively and systematically so since the 19th century. We omit all the fundamental research in this area that brought to light the principles described above and concentrate instead on attempts to treat Latin metrics using computational methods.
As hinted at above, there is quite some previous work on the hexameter. The online tool Arma by Dylan Holmes is a scanner that is limited to scanning hexameters. It uses a purely rule-based approach without any knowledge about vowel lengths that is explained on the site itself. This approach is feasible here because the tool is limited to one meter, the hexameter, which does not have many uncertainties in it.
There is also the practice website hexameter.co, which has the user scan hexameters and tells them whether their scansion is correct.
Winge (2015) introduced a tool called latin-macronizer, which can be used on any Latin text (i.e. also on prose) to mark which vowels are long. The tool is based on the Latin morphology tool Morpheus, which provides analyses of Latin word forms including lemmatization and vowel lengths, as well as on a parser; combining these tools, latin-macronizer determines what form is present in the text and what quantity the vowels in this form have.
There are also two closed-source tools for scanning verse: Pede certo can scan hexameters and pentameters, but frequently enters an “error” state, yielding no scansion at all. However, the same website provides a way to search for word forms in Latin verse; this is very useful for finding out how a word is used in verse and proved a valuable tool for us.
Besides that, there is the Metronom tool, the result of a Master’s thesis by Jacek Tomaszewski. It is a Polish interface for scanning Latin and Greek verse and is by far the most sophisticated attempt at scanning verse that is known to us. It supports a large number of meters and works more reliably than the other tools. It has two shortcomings, however:
- It does not know about vowel lengths and often assumes that vowels can be either short or long, resulting in frequent false positives, i.e. it says a line scans as a hexameter when it actually does not because some vowel quantity contradicts this analysis.
- It does not show how it arrived at its results, i.e. it does not mark elision, positional lengthening, etc.
System
Motivation
After a review of the existing tools, we concluded that we wanted to build a tool that more closely resembles the human scanning process: When a human reads a line of verse, they know from experience which vowels are long, rendering those syllables long by nature, and can, with practice, also spot syllables that are long by position. The vowel lengths that are ignored by the other scanning systems can be determined using a dictionary, similar to the way it is done by Winge (2015), and we consider using these vowel lengths to be a more natural way of beginning the scansion of a verse.
Moreover, we wanted to build a tool that annotates its result instead of only providing the syllable lengths, which would make it more comprehensible and more meaningful for students of Latin.
Approach
At a high level, our approach consists of three basic steps:
- Generate all possible ways a line can be read, called the readings of a line, while storing information about how they came about.
- For each reading, combine it with every possible meter, judging how well it fits the meter.
- Rank the reading-meter combinations using an SVM or a decision tree.
The first step consists of several sub-steps described in the next section.
Scansion Steps
Our system begins by splitting a line into tokens and looking them up in the Morpheus morphology dictionary. In case there is more than one configuration of vowel lengths for the form (e.g. puellā vs. puellă), all of them are considered.
Afterwards, the tokens are split into syllables and positional lengths are determined. In case of muta cum liquida, both the lengthening and the non-lengthening variant are considered.
Elisions and aphereses are applied, and synizeses are considered, where applicable.
After this process, a list of readings has been generated. For example, for a line that contains one form that has three analyses, one muta cum liquida and one synizesis possibility, there exist 3 * 2 * 2 = 12 readings.
Feature Extraction
After the readings have been generated, they are paired with the meters, and for every possible reading-meter combination, the following five features are extracted:
- Number of muta cum liquida appearances that trigger a length by position (X[0] in the tree below)
- Number of synizeses applied (X[1] in the tree below)
- Whether (1) or not (0) the reading matches the meter in question (X[2] in the tree below)
- Whether (0) or not (1) a usual configuration of breaks is present in the verse (X[3] in the tree below)
- Number of other meter rules that are violated (X[4] in the tree below)
Each of these features can be interpreted as a penalty because a completely usual reading will have 0 for all or most of them.
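For instance, a hexameter line read with one muta cum liquida lengthening and no other peculiarities would, for the hexameter, get the feature vector [1, 0, 1, 0, 0].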
Ranking
We consider a decision tree, a random forest and an SVM to rank the reading-meter combinations. The top-ranked combination is supposed to be the reading-meter combination that is most probably correct for the given line.
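As a hedged illustration of this step (the feature vectors, labels and model setup here are made up for the example; the real system trains on the dataset described below):

from sklearn.tree import DecisionTreeClassifier

# Five penalty features per reading-meter combination, as described above.
X_train = [[0, 0, 1, 0, 0],  # plausible reading that matches the meter
           [1, 1, 0, 1, 2]]  # implausible reading
y_train = [1, 0]             # 1 = correct (gold) combination

model = DecisionTreeClassifier().fit(X_train, y_train)

candidates = [[0, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
# Rank candidate combinations by the predicted probability of being gold.
ranked = sorted(candidates,
                key=lambda c: model.predict_proba([c])[0][1],
                reverse=True)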
Data
We use parts of the Hypotactic dataset created by David Chamberlain to train our ranking algorithms. Due to limited time, we chose to include only four meters in our sub-dataset: the hexameter, the pentameter, the hendecasyllable and the scazon. We created a dataset with about 10000 instances in the train set (it is actually larger, but we only used the first 10000), about 10000 in the development set and about 15000 instances in the test set.
Chamberlain does not guarantee the correctness of the analyses in the Hypotactic dataset, but from our qualitative assessment, virtually all of the verses are entered correctly.
Evaluation
To evaluate our ranking algorithms, we use the above-mentioned splits of our Hypotactic verses. We annotate as gold each reading that matches the scansion and meter of the Hypotactic verse and train a number of machine learning classifiers on the resulting data set. The list of readings is then ranked by the predicted probability of a gold classification.
The tables below show, for each rank n, the proportion of instances in which the correct reading-meter combination is among the top n+1 ranked combinations. E.g. for the decision tree (dev), in 8433 out of 10005 instances, the correct reading-meter combination was one of the first two combinations.
DecisionTree (dev)
0 7338/10005 0.7334332833583208
1 8433/10005 0.8428785607196402
2 8801/10005 0.8796601699150425
3 8869/10005 0.8864567716141929
4 8976/10005 0.8971514242878561
5 8988/10005 0.8983508245877061
6 9008/10005 0.9003498250874563
7 9012/10005 0.9007496251874063
8 9038/10005 0.9033483258370815
9 9042/10005 0.9037481259370315
10 9044/10005 0.9039480259870065
11 9044/10005 0.9039480259870065
12 9048/10005 0.9043478260869565
13 9050/10005 0.9045477261369316
14 9050/10005 0.9045477261369316
15 9050/10005 0.9045477261369316
16 9054/10005 0.9049475262368816
SupportVectorMachine (dev)
0 7340/10005 0.7336331834082959
1 8407/10005 0.840279860069965
2 8797/10005 0.8792603698150925
3 8865/10005 0.8860569715142429
4 8981/10005 0.8976511744127936
5 8994/10005 0.8989505247376312
6 9011/10005 0.9006496751624188
7 9015/10005 0.9010494752623688
8 9039/10005 0.903448275862069
9 9043/10005 0.903848075962019
10 9045/10005 0.904047976011994
11 9045/10005 0.904047976011994
12 9048/10005 0.9043478260869565
13 9050/10005 0.9045477261369316
14 9050/10005 0.9045477261369316
15 9050/10005 0.9045477261369316
16 9054/10005 0.9049475262368816
RandomForest (dev)
0 7201/10005 0.7197401299350324
1 8257/10005 0.825287356321839
2 8736/10005 0.8731634182908545
3 8804/10005 0.879960019990005
4 8921/10005 0.8916541729135432
5 8934/10005 0.8929535232383808
6 8973/10005 0.8968515742128935
7 8977/10005 0.8972513743128436
8 9003/10005 0.8998500749625188
9 9007/10005 0.9002498750624688
10 9022/10005 0.9017491254372814
11 9022/10005 0.9017491254372814
12 9025/10005 0.9020489755122438
13 9027/10005 0.9022488755622189
14 9028/10005 0.9023488255872064
15 9028/10005 0.9023488255872064
16 9033/10005 0.9028485757121439
17 9034/10005 0.9029485257371315
18 9039/10005 0.903448275862069
19 9039/10005 0.903448275862069
20 9040/10005 0.9035482258870565
21 9040/10005 0.9035482258870565
22 9046/10005 0.9041479260369815
23 9046/10005 0.9041479260369815
24 9047/10005 0.904247876061969
25 9047/10005 0.904247876061969
DecisionTree (test)
0 10934/14945 0.731615925058548
1 12602/14945 0.8432251589160255
2 13180/14945 0.8819003011040482
3 13273/14945 0.8881231180996989
4 13418/14945 0.8978253596520576
5 13439/14945 0.8992305118768819
6 13471/14945 0.9013716962194714
7 13480/14945 0.9019739043158247
8 13508/14945 0.9038474406155905
9 13511/14945 0.9040481766477083
10 13512/14945 0.9041150886584142
11 13513/14945 0.9041820006691201
12 13525/14945 0.9049849447975912
13 13525/14945 0.9049849447975912
14 13527/14945 0.905118768819003
15 13527/14945 0.905118768819003
16 13534/14945 0.9055871528939444
17 13535/14945 0.9056540649046504
18 13535/14945 0.9056540649046504
19 13535/14945 0.9056540649046504
SupportVectorMachine (test)
0 10934/14945 0.731615925058548
1 12587/14945 0.8422214787554366
2 13183/14945 0.8821010371361659
3 13276/14945 0.8883238541318167
4 13417/14945 0.8977584476413516
5 13437/14945 0.89909668785547
6 13470/14945 0.9013047842087655
7 13479/14945 0.9019069923051187
8 13506/14945 0.9037136165941787
9 13510/14945 0.9039812646370023
10 13512/14945 0.9041150886584142
11 13512/14945 0.9041150886584142
12 13524/14945 0.9049180327868852
13 13525/14945 0.9049849447975912
14 13526/14945 0.9050518568082971
15 13526/14945 0.9050518568082971
16 13534/14945 0.9055871528939444
17 13535/14945 0.9056540649046504
18 13535/14945 0.9056540649046504
19 13535/14945 0.9056540649046504
RandomForest (test)
n correct/total recall
0 10818/14945 0.7238541318166611
1 12464/14945 0.8339913014386082
2 13120/14945 0.8778855804616928
3 13216/14945 0.8843091334894614
4 13359/14945 0.8938775510204081
5 13380/14945 0.8952827032452325
6 13433/14945 0.8988290398126464
7 13443/14945 0.8994981599197056
8 13472/14945 0.9014386082301773
9 13475/14945 0.9016393442622951
10 13487/14945 0.9024422883907661
11 13488/14945 0.9025092004014721
12 13500/14945 0.9033121445299431
13 13501/14945 0.9033790565406491
14 13504/14945 0.9035797925727668
15 13504/14945 0.9035797925727668
16 13511/14945 0.9040481766477083
17 13513/14945 0.9041820006691201
18 13519/14945 0.9045834727333556
19 13519/14945 0.9045834727333556
20 13520/14945 0.9046503847440616
21 13520/14945 0.9046503847440616
22 13522/14945 0.9047842087654734
23 13522/14945 0.9047842087654734
24 13524/14945 0.9049180327868852
25 13524/14945 0.9049180327868852
26 13527/14945 0.905118768819003
27 13527/14945 0.905118768819003
28 13527/14945 0.905118768819003
29 13527/14945 0.905118768819003
30 13529/14945 0.9052525928404148
We observe similar behaviour across the three classifiers: each achieves roughly 72-73 % recall for the top-ranked combination and converges to about 90 % recall by the fifth place of our ranking. The missing 10 % of recall is due to cases where no suitable reading was generated in the first place.
The decision tree highlights the importance of the correct scansion of a given verse (X[2]): if the syllable lengths do not match, the reading is definitely false. The other features appear to be more ambiguous, with X[0] and X[4] not even being considered for classification. This suggests that, regardless of the machine learning approach, additional features may be needed for better performance.
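Which features a fitted tree actually relies on can be inspected via scikit-learn's `feature_importances_` attribute. The sketch below assumes a fitted classifier `tree_clf` and five features, as suggested by the indices X[0] to X[4] above; the feature names are placeholders.

```python
feature_names = ["X[0]", "X[1]", "X[2] (scansion matches meter)", "X[3]", "X[4]"]
for name, importance in zip(feature_names, tree_clf.feature_importances_):
    print(f"{name}: {importance:.3f}")  # X[0] and X[4] would come out as 0.0
```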
Web Application
We wrote a web application as a frontend for our tool; it is hosted at checkmyprosody.com. It presents the three top-ranked reading-meter combinations in an easily digestible way.
Shortcomings and Future Work
We demonstrated that our approach is a feasible way to build a system that jointly scans a line and predicts its meter. However, both the reading generation and the ranking yield results that are not yet wholly satisfactory.
The generated readings contain the correct reading in 90 % of the cases. The errors are mostly due to uncommon forms such as proper names, especially Greek ones, which the morphology tool Morpheus does not handle well. We also noticed some errors in the vowel lengths stored in Morpheus. One way to handle this is to manually allow more plausible forms whenever a proper name is detected (e.g. via capitalization), as sketched below.
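A minimal sketch of such a fallback, assuming a `morpheus_lookup` callable as a stand-in for the real Morpheus interface; the permissive analysis and its encoding are illustrative assumptions.

```python
VOWELS = set("aeiouy")

def permissive_analysis(token):
    # Hypothetical fallback: mark every vowel as ambiguous (both short
    # and long allowed) and let the ranking pick what fits the meter.
    return ["ambiguous" if ch in VOWELS else None for ch in token.lower()]

def analyses_for(token, morpheus_lookup):
    analyses = morpheus_lookup(token)
    if not analyses and token[:1].isupper():  # likely a proper name
        return [permissive_analysis(token)]
    return analyses
```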
Moreover, there are some phenomena that we have not yet considered. These include hiatus and iambic shortening (brevis brevians), where a long syllable directly after a short one can itself become short in special configurations; a possible way to generate such variants is sketched below.
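As an illustration, additional reading variants for iambic shortening could be generated along the following lines, assuming quantities are encoded as a string with "u" for short and "-" for long; the project's actual representation may differ.

```python
def iambic_shortening_variants(quantities):
    # Brevis brevians: a long syllable directly after a short one may
    # scan short, so emit one variant per such position.
    for i in range(len(quantities) - 1):
        if quantities[i] == "u" and quantities[i + 1] == "-":
            yield quantities[:i + 1] + "u" + quantities[i + 2:]

# list(iambic_shortening_variants("u-u-")) == ["uuu-", "u-uu"]
```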
As for the ranking, all of the machine learning approaches worked similarly well, but to improve them, more and better features need to be incorporated. For example, there are established rules about double breves and other rarely occurring quantity sequences, such as Ritschl’s rule and the Hermann-Lachmann rule.
Another area for improvement is the number of meters the tool is able to analyze. For the tests we only used four meters, but there are many more; they can fairly easily be incorporated into the system by adding them to our list of meters, as illustrated below.
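A hypothetical sketch of what such a meter entry might look like, assuming meters are stored as patterns over syllable quantities ("-" long, "u" short, "x" anceps); the project's actual meter format is not reproduced here.

```python
# Phalaecian hendecasyllable: x x - u u - u - u - x
HENDECASYLLABLE = {
    "name": "Phalaecian hendecasyllable",
    "pattern": ["x", "x", "-", "u", "u", "-", "u", "-", "u", "-", "x"],
}

def matches(meter, quantities):
    """quantities: one "-" or "u" per syllable of the reading."""
    pattern = meter["pattern"]
    return len(quantities) == len(pattern) and all(
        slot in ("x", q) for slot, q in zip(pattern, quantities))
```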
To bridge the gap caused by imperfect reading generation, one could identify readings that almost match a meter and adjust their quantities to make them match. This way, situations where some peculiarity (such as an unknown Greek proper name) prevents generating a correct reading can be remedied; a possible repair step is sketched below.
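A minimal sketch of this repair step, using the same quantity encoding as above; the edit threshold of one quantity is an arbitrary assumption.

```python
def repair(quantities, pattern, max_edits=1):
    """Flip up to max_edits quantities so the reading fits the pattern;
    return the repaired quantity string, or None on too many mismatches."""
    if len(quantities) != len(pattern):
        return None
    edits, repaired = 0, []
    for q, slot in zip(quantities, pattern):
        if slot == "x" or slot == q:
            repaired.append(q)
        else:
            edits += 1
            repaired.append(slot)
    return "".join(repaired) if edits <= max_edits else None
```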
Conclusion
We presented a system that jointly scans a line of Latin verse and predicts the meter it satisfies. Our approach is distinctive in that it is fundamentally not limited to any specific meters; to make up for the added complexity of the task, we incorporated knowledge about vowel lengths from an external morphology tool.
In order to make the system more useful for learners of Latin, we built a web frontend for our scansion system that annotates special phenomena in the verse and explains their effects.
While we have shown that our approach works in principle, there are several rough edges in our system. More work needs to be done to make it less reliant on the correctness of the morphology tool and to enhance the feature extraction process in order to improve the ranking.
Acknowledgements
We want to thank Jonathan Geiger, Johan Winge, David Chamberlain and Jacek Tomaszewski for their precious advice and their willingness to answer any questions we had about their tools.
References
- Boldrini, Sandro: Prosodie und Metrik der Römer. 1999.
- Crusius, Friedrich: Römische Metrik. 1986.
- Drexler, Hans: Einführung in die römische Metrik. 1967.
- Winge, Johan: Automatic annotation of Latin vowel length. Bachelor’s thesis, Uppsala University. 2015.