averell 1.2.2
Averell is a Python library and command-line interface that facilitates working
with existing repositories of annotated poetry. Averell can download an annotated corpus and reconcile different
TEI entities to provide a unified JSON output at the desired granularity.
For their investigations, some researchers
might need the entire poem, poems split line by line,
or even word by word if that is available. Averell lets you specify the
granularity of the final generated dataset, which is a combined JSON file with all
the entities in the selected corpora.
Each corpus in the catalog must specify the parser to produce the expected data format.
Free software: Apache Software License 2.0
Available corpora (version 1.1.0)
id    name                                         lang  size    docs    words      granularity                   license
----  -------------------------------------------  ----  ------  ------  ---------  ----------------------------  ------------------
1     Disco V2.1 (disco2_1)                        es    22M     4088    381539     stanza, line                  CC-BY
2     Disco V3 (disco3)                            es    28M     4080    377978     stanza, line                  CC-BY
3     Sonetos Siglo de Oro (adso)                  es    6.8M    5078    466012     stanza, line                  CC-BY-NC 4.0
4     ADSO 100 poems corpus (adso100)              es    128K    100     9208       stanza, line                  CC-BY-NC 4.0
5     Poesía Lírica Castellana Siglo de Oro (plc)  es    3.8M    475     299402     stanza, line, word, syllable  CC-BY-NC 4.0
6     Gongocorpus (gongo)                          es    9.2M    481     99079      stanza, line, word, syllable  CC-BY-NC-ND 3.0 FR
7     Eighteenth Century Poetry Archive (ecpa)     en    2400M   3084    2063668    stanza, line, word            CC BY-SA 4.0
8     For Better For Verse (4b4v)                  en    39.5M   103     41749      stanza, line                  Unknown
9     Métrique en Ligne (mel)                      fr    183M    5081    1850222    stanza, line                  Unknown
10    Biblioteca Italiana (bibit)                  it    242M    25341   7121246    stanza, line, word            Unknown
11    Corpus of Czech Verse (czverse)              cs    4100M   66428   12636867   stanza, line, word            CC-BY-SA
12    Stichotheque (stichopt)                      pt    11.8M   1702    168411     stanza, line                  Unknown
Documentation
https://averell.readthedocs.io/
Installation
To install averell, run this command in your terminal:
pip install averell
This is the preferred method to install averell, as it will always install
the most recent stable release.
If you don’t have pip installed, the Python installation guide can walk
you through the process.
Usage
To show averell help:
averell --help
To list all available corpora:
averell list
Visualization example of one of the available corpora:
id   name                lang   size   docs   words   granularity   license
---- ------------------- ------ ------ ------ ------- ------------- -----------
1    Disco V2.1          es     22M    4088   381539  stanza        CC-BY
                                                      line
download
Download the desired corpora (here, corpora 2 and 3) into the “my_corpora” folder:
averell download 2 3 --corpora-folder my_corpora
Example of poem in TEI format obtained from one of the corpora:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title> Spanish Metrical Patterns Bank: Golden Age Sonnets.</title>
<principal>Borja Navarro Colorado</principal>
<respStmt>
<name>María Ribes Lafoz</name>
<name>Noelia Sánchez López</name>
<name>Borja Navarro Colorado</name>
<resp>Metrical patterns annotation</resp>
</respStmt>
</titleStmt>
<publicationStmt>
<publisher>Natural Language Processing Group. Department of Software and Computing Systems. University of Alicante (Spain)</publisher>
</publicationStmt>
<sourceDesc>
<bibl><title>Sonetos</title> de <author>Garcilaso de La Vega</author>. <publisher>Biblioteca Virtual Miguel de Cervantes</publisher>, edición de <editor role="editor">Ramón García González</editor>.</bibl>
</sourceDesc>
</fileDesc>
<encodingDesc>
<metDecl xml:id="bncolorado" type="met" pattern="((\+|\-)+)*">
<metSym value="+">stressed syllable</metSym>
<metSym value="-">unstressed syllable</metSym>
</metDecl>
<metDecl>
<p>All metrical patterns have been manually checked.</p>
</metDecl>
</encodingDesc>
</teiHeader>
<text>
<body>
<head>
<title>-XX-</title>
</head>
<lg type="cuarteto">
<l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
<l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
<l n="3" met="--+--+---+-">que cortaron mis tiernos pensamientos</l>
<l n="4" met="+----++--+-">luego que sobre mí fueron mostrados.</l>
</lg>
<lg type="terceto">
<l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
<l n="6" met="---+-----+-">en salvo de estos acontecimientos,</l>
<l n="7" met="-++--+---+-">que son duros, y tienen fundamentos</l>
</lg>
</body>
</text>
</TEI>
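As a rough illustration of what reading such a file involves, the sketch below extracts stanzas and lines from a trimmed copy of the poem above using only the Python standard library. It is not averell's own parser: the function name parse_stanzas and the inline sample are assumptions made for this example, and the output keys simply mirror the JSON shown in this README.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root <TEI> element of the example above.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed copy of the example poem, inlined so the sketch is self-contained.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
<body>
<head><title>-XX-</title></head>
<lg type="cuarteto">
<l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
<l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
</lg>
<lg type="terceto">
<l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
</lg>
</body>
</text>
</TEI>"""


def parse_stanzas(tei_xml):
    """Hypothetical helper: return one dict per <lg> element, with its <l> lines."""
    root = ET.fromstring(tei_xml)
    stanzas = []
    for number, lg in enumerate(root.iterfind(".//tei:lg", TEI_NS), start=1):
        lines = [
            {
                "line_number": line_el.get("n"),
                "line_text": (line_el.text or "").strip(),
                "metrical_pattern": line_el.get("met"),
            }
            for line_el in lg.iterfind("tei:l", TEI_NS)
        ]
        stanzas.append({
            "stanza_number": str(number),
            "stanza_type": lg.get("type"),
            "lines": lines,
            "stanza_text": "\n".join(line["line_text"] for line in lines),
        })
    return stanzas


stanzas = parse_stanzas(SAMPLE_TEI)
print(stanzas[0]["stanza_type"], len(stanzas[0]["lines"]))  # prints: cuarteto 2
```

Every element lookup has to carry the TEI namespace, otherwise ElementTree finds no lg or l elements at all.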
Example JSON file generated from the input XML/TEI poem and written to
my_corpora/{corpus}/averell/parser/{author_name}/{poem_name}.json
{
"manually_checked": true,
"poem_title": "-XX-",
"author": "Garcilaso de La Vega",
"stanzas": [
{
"stanza_number": "1",
"stanza_type": "cuarteto",
"lines": [
{
"line_number": "1",
"line_text": "Con tal fuerza y vigor son concertados",
"metrical_pattern": "-++--++--+-"
},
{
"line_number": "2",
"line_text": "para mi perdición los duros vientos,",
"metrical_pattern": "-----+-+-+-"
},
{
"line_number": "3",
"line_text": "que cortaron mis tiernos pensamientos",
"metrical_pattern": "--+--+---+-"
},
{
"line_number": "4",
"line_text": "luego que sobre mí fueron mostrados.",
"metrical_pattern": "+----++--+-"
}
],
"stanza_text": "Con tal fuerza y vigor son concertados\npara mi perdición los duros vientos,\nque cortaron mis tiernos pensamientos\nluego que sobre mí fueron mostrados."
},
{
"stanza_number": "2",
"stanza_type": "terceto",
"lines": [
{
"line_number": "5",
"line_text": "El mal es que me quedan los cuidados",
"metrical_pattern": "-++--+---+-"
},
{
"line_number": "6",
"line_text": "en salvo de estos acontecimientos,",
"metrical_pattern": "---+-----+-"
},
{
"line_number": "7",
"line_text": "que son duros, y tienen fundamentos",
"metrical_pattern": "-++--+---+-"
}
],
"stanza_text": "El mal es que me quedan los cuidados\nen salvo de estos acontecimientos,\nque son duros, y tienen fundamentos"
}
]
}
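To make the idea of granularity concrete, the following sketch flattens a poem in the format above into one record per line, copying the stanza-level and poem-level fields onto each line. It is only an approximation of what a line-level export might contain, not averell's implementation; flatten_to_lines is a hypothetical name, and the poem is abbreviated to two lines.

```python
# Abbreviated copy of the generated poem above, as a Python dict.
# A real file could be read with json.load() instead.
poem = {
    "manually_checked": True,
    "poem_title": "-XX-",
    "author": "Garcilaso de La Vega",
    "stanzas": [
        {
            "stanza_number": "1",
            "stanza_type": "cuarteto",
            "lines": [
                {
                    "line_number": "1",
                    "line_text": "Con tal fuerza y vigor son concertados",
                    "metrical_pattern": "-++--++--+-",
                },
                {
                    "line_number": "2",
                    "line_text": "para mi perdición los duros vientos,",
                    "metrical_pattern": "-----+-+-+-",
                },
            ],
            "stanza_text": "Con tal fuerza y vigor son concertados\n"
                           "para mi perdición los duros vientos,",
        }
    ],
}


def flatten_to_lines(poem):
    """Hypothetical helper: one flat record per line, with stanza and poem fields attached."""
    poem_fields = {k: v for k, v in poem.items() if k != "stanzas"}
    records = []
    for stanza in poem["stanzas"]:
        stanza_fields = {k: v for k, v in stanza.items() if k != "lines"}
        for line in stanza["lines"]:
            records.append({**line, **stanza_fields, **poem_fields})
    return records


records = flatten_to_lines(poem)
print(len(records))  # prints: 2
```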
export
Now we can combine and join these corpora through “granularity” selection:
averell export 2 3 --granularity line --corpora-folder my_corpora --filename export_1
It produces a single JSON file with information about all the lines in
those corpora. Example of two random lines in the file my_corpora/export_1.json:
{
"line_number": "5",
"line_text": "¿Has visto que en el mismo lugar donde",
"metrical_pattern": "++---+--++-",
"stanza_number": "2",
"manually_checked": false,
"poem_title": " - II - ",
"author": "Mira de Amescua",
"stanza_text": "¿Has visto que en el mismo lugar donde\nbordado estuvo el cristalino velo\nun bordado terliz de escarcha y hielo\nhace que el campo de verdor se monde?",
"stanza_type": "cuarteto"
}
{
"line_number": "10",
"line_text": "el que a lo cierto no a lo incierto mira,",
"metrical_pattern": "---+-+-+-+-",
"stanza_number": "3",
"manually_checked": false,
"poem_title": "- VIII - Considerando un sepulcro y los que están en él ",
"author": "Lope de Zarate",
"stanza_text": "De aquí si que consigue el ser dichoso\nel que a lo cierto no a lo incierto mira,\npues le adorna lo eterno fastuoso;",
"stanza_type": "terceto"
}
By default, export will download corpora if needed. To avoid this behaviour, the flag --no-download can be passed in.
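Line-level records like the two shown above lend themselves to simple metrical statistics. The sketch below counts how often each syllable position is stressed, reading "+" as a stressed syllable as declared in the TEI header of the example poem. The records are abbreviated copies of the two example lines, and the helper name stressed_positions is an assumption for illustration.

```python
from collections import Counter

# Abbreviated copies of the two exported lines shown above.
records = [
    {
        "line_number": "5",
        "metrical_pattern": "++---+--++-",
        "author": "Mira de Amescua",
        "stanza_type": "cuarteto",
    },
    {
        "line_number": "10",
        "metrical_pattern": "---+-+-+-+-",
        "author": "Lope de Zarate",
        "stanza_type": "terceto",
    },
]


def stressed_positions(pattern):
    """Return the 1-based syllable positions marked '+' (stressed)."""
    return [i for i, mark in enumerate(pattern, start=1) if mark == "+"]


# How many lines stress each syllable position.
stress_counts = Counter()
for record in records:
    stress_counts.update(stressed_positions(record["metrical_pattern"]))

# How many lines each author contributes.
lines_per_author = Counter(record["author"] for record in records)

print(stressed_positions("++---+--++-"))  # prints: [1, 2, 6, 9, 10]
print(stress_counts[10])                  # prints: 2
```

Both hendecasyllables stress the tenth syllable, which is what the count for position 10 reflects.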
Exported corpora can be easily loaded into Pandas:
averell export adso100 --filename adso100.json
import pandas as pd
adso100 = pd.read_json("adso100.json")
A note on IDs
IDs can be numeric identifiers from the averell list output, corpus shortcodes (shown in parentheses), the special literal all to refer to all corpora, or two-letter ISO language codes to refer to the available corpora in a specific language.
For example, the command averell export 1 bibit fr will export Disco V2.1, the Biblioteca Italiana poetry corpus, and all corpora tagged with the French language tag into a single file, splitting poems line by line.
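The matching rules described above can be sketched as a small resolver over the corpus table. This is an illustration of the documented behaviour only, not averell's actual code: the CATALOG structure and the resolve_ids function are hypothetical, with ids, shortcodes, and language codes copied from the list of available corpora.

```python
# (id, shortcode, language) rows copied from the "Available corpora" table.
CATALOG = [
    (1, "disco2_1", "es"), (2, "disco3", "es"), (3, "adso", "es"),
    (4, "adso100", "es"), (5, "plc", "es"), (6, "gongo", "es"),
    (7, "ecpa", "en"), (8, "4b4v", "en"), (9, "mel", "fr"),
    (10, "bibit", "it"), (11, "czverse", "cs"), (12, "stichopt", "pt"),
]


def resolve_ids(ids):
    """Hypothetical resolver: map mixed identifiers to sorted, de-duplicated corpus ids."""
    selected = set()
    for raw in ids:
        token = str(raw).lower()
        if token == "all":
            # The special literal selects every corpus in the catalog.
            return [corpus_id for corpus_id, _, _ in CATALOG]
        # A token may be a numeric id, a shortcode, or a language code.
        matches = [
            corpus_id
            for corpus_id, slug, lang in CATALOG
            if token in (str(corpus_id), slug, lang)
        ]
        if not matches:
            raise ValueError(f"Unknown corpus identifier: {raw!r}")
        selected.update(matches)
    return sorted(selected)


print(resolve_ids(["1", "bibit", "fr"]))  # prints: [1, 9, 10]
```

Collecting matches in a set means a corpus named twice, for example by id and by shortcode, is only selected once.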
Changelog
1.2.1 (2021-07-14)
Added two new readers:
Stichotheque Portuguese corpus
Corpus of Czech Verse
export_filename is also returned as an output of export_corpora
Fix writing function so as not to duplicate information
Change name key to corpus for clarity
Fix path split on Windows systems
Add corpus name to averell output files
1.1.0 (2020-09-18)
Added Biblioteca Italiana (bibit) reader
Added Archivio Metrico Italiano info to Biblioteca Italiana reader
Reduced fixtures file size
Adding a tmp file to git ignore
Adding languages and some other cosmetic changes
Fixing an error with the expected output of the averell list command
Adding slugs, langs, and ‘all’ to download and export
Fixing coverage
Adding documentation and fixing a test
1.0.3 (2020-09-03)
Added export --filename option
Added two new readers:
For better for verse
Métrique en ligne
1.0.2 (2020-06-23)
Added two new readers:
ECPA corpus
Gongocorpus
Minor bug fixes
1.0.1 (2020-05-18)
Setting up bumpversion
Integration with Zenodo
1.0.0 (2020-04-29)
Remove commits-since code block
Adding automated deployments to PyPI on tag releases
Added menu
Remove comments and cleaner code fixes
Fix sorted output of tests
Added proper documentation and coverage tests
Added tests for export function
Added export function
Added TEI_NAMESPACE as a constant
Fixed docs. Fixed loads with Path. Fixed logging errors
Added tests
0.0.1 (2020-01-08)
First release on PyPI.