At POSTDATA we use spaCy, a natural language processing library for Python. It has become an industry standard but, although it generally works well, there’s still some polishing to be done on the Spanish support.

One problem we had with this library was that it doesn’t reliably detect some Spanish pronouns, specifically those attached to the end of a verb, as in “dímelo”, “piérdete” or “hazme”. This is because the Spanish data model that spaCy uses has not been properly trained for this type of word. To solve this problem we have been working on a spaCy extension that identifies and separates both the root word and its suffixes. This open source tool has been released and can be installed from the command line with a simple “pip install spacy_affixes”.
On the POSTDATA GitHub page you can find all the necessary documentation:
https://github.com/linhd-postdata/spacy-affixes
How does it work?
The operation is very simple: we download the affix rule files from the Freeling tool ( http://nlp.lsi.upc.edu/freeling/index.php/node/1 ), implement their rules in Python, and add this new behavior to the spaCy pipeline. Thanks to this, we get much more accurate results on this task than spaCy does on its own.
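To make the idea of rule-based suffix splitting more concrete, here is a minimal sketch in plain Python. It is not the spacy-affixes implementation and does not read Freeling's rule files: the clitic list, the three-word lexicon, and the names `strip_acute` and `split_clitics` are all assumptions made up for illustration. The sketch peels pronouns off the end of a word and accepts the split only if the remaining root, once its written accent is removed, is a known verb form.

```python
import unicodedata

# Hypothetical, deliberately tiny data: real rule files are far richer.
CLITICS = ("nos", "los", "las", "les", "me", "te", "se", "lo", "la", "le", "os")
KNOWN_VERB_FORMS = {"di", "pierde", "haz"}


def strip_acute(text):
    """Remove acute accents only, so that ñ and ü are left untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def split_clitics(word, lexicon=KNOWN_VERB_FORMS, max_clitics=2):
    """Return (root, [clitics]) or (word, []) when no rule applies."""

    def search(rest, found):
        root = strip_acute(rest)
        # Accept only if at least one pronoun was removed and the root is known.
        if found and root in lexicon:
            return root, found
        if len(found) == max_clitics:
            return None
        for clitic in CLITICS:
            if rest.endswith(clitic) and len(rest) > len(clitic):
                result = search(rest[: -len(clitic)], [clitic] + found)
                if result:
                    return result
        return None

    result = search(word.lower(), [])
    return result if result else (word, [])


for example in ("dímelo", "piérdete", "hazme", "pelo"):
    print(example, split_clitics(example))
```

The lexicon check is what keeps an ordinary noun such as “pelo” from being split into “pe” and “lo”; in this sketch it stands in for the dictionary lookup that a real rule set would need.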
It is a key piece of the tools developed by the POSTDATA team within the PoetryLab suite, and we are very proud to be able to release it and share it with the rest of the NLP community. It is our contribution to a field in which there are very few free resources for Spanish.