The Digital Humanities have introduced new quantitative, more objective and more scientific approaches to the study of the literary record. Poetry is a special case, as far fewer quantitative studies and tools are available for this genre. As a result, collecting the poetic corpora available on the Web is a hard and time-consuming process.
To bridge this gap, the POSTDATA team has developed Averell, an open-source tool that downloads poetry corpora from different sources and in different formats and combines them into JSON and CSV.
All the documentation needed to install it can be found in the POSTDATA GitHub repository.
How to use it?
After installing it, the first step is to look at the corpora available in the catalogue and decide which ones we want to work with. To do so, we enter the command “averell list” in the terminal.
The catalogue currently contains 5 corpora.
We select and download corpora with the “download” command.
Example: “averell download 2 3 4 --corpora-folder mycorpora”
This command downloads the corpora with IDs 2, 3 and 4 into the “mycorpora” folder and generates one JSON file per poem. Within each corpus folder, these JSON files are located under “averell/parser”, inside a subfolder named after the author of the corresponding poem.
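To check what a download produced, the folder layout described above can be walked with a few lines of Python. This is a minimal sketch rather than part of Averell: the function name `count_poems_by_author` is hypothetical, and it assumes that every poem file sits at `<corpus>/averell/parser/<author>/<poem>.json` inside the corpora folder.

```python
from collections import Counter
from pathlib import Path


def count_poems_by_author(corpora_folder):
    """Count per-poem JSON files, grouped by the name of their author folder."""
    counts = Counter()
    root = Path(corpora_folder)
    if not root.is_dir():
        # Nothing downloaded yet (or wrong path): return an empty count.
        return counts
    for json_path in root.rglob("*.json"):
        # Assumed layout: .../averell/parser/<author>/<poem>.json
        # Files elsewhere (for example exported datasets) are skipped.
        if json_path.parent.parent.name == "parser":
            counts[json_path.parent.name] += 1
    return counts


if __name__ == "__main__":
    for author, total in sorted(count_poems_by_author("mycorpora").items()):
        print(f"{author}: {total}")
```

Authors with the same folder name in different corpora are counted together here; keying on the full parent path instead would keep them apart.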
Averell lets you choose the granularity of the resulting dataset, which is a single JSON file containing all the information about the entities of the selected corpora. For instance:
averell export 2 3 --granularity line --corpora-folder mycorpora
This produces the file “line_2_3.json” inside the “mycorpora” folder. It contains the information for every line of every poem in the corpora with IDs 2 and 3.
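Once the export exists, a quick way to inspect it is to load it with Python's standard `json` module. The sketch below is illustrative only: `summarize_export` is a hypothetical helper, and it assumes the file holds a JSON array of objects, one per line of verse. It makes no assumption about field names and simply reports the ones it finds.

```python
import json
from pathlib import Path


def summarize_export(path):
    """Return (number of records, sorted field names) for an exported JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the export is a JSON array of objects; wrap anything else.
    records = data if isinstance(data, list) else [data]
    fields = set()
    for record in records:
        if isinstance(record, dict):
            fields.update(record.keys())
    return len(records), sorted(fields)


if __name__ == "__main__":
    export_path = Path("mycorpora") / "line_2_3.json"
    if export_path.is_file():
        total, fields = summarize_export(export_path)
        print(f"{total} records with fields: {', '.join(fields)}")
```

Listing the field names first is a cheap way to learn the schema before writing any analysis that depends on it.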
Extract of some lines of this dataset:
Averell is part of the PoetryLab suite by the POSTDATA Project, and it is open to adding further public corpora to its repertoire. We are proud to release this new tool and share it with the whole Digital Humanities community.