preprocess-corpora 0.1.1

Last updated: September 28, 2024

Creator: bradpython12

Languages

Python

Description:

preprocess-corpora
This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora.
It relies heavily on Uplug and, to a lesser extent, on TreeTagger.
Installation
First, make sure that Uplug and TreeTagger are installed.
Then, install the requirements via:
$ pip install -r requirements.txt

Finally, create the executables preprocess and align via:
$ pip install --editable .

Usage
Preprocessing
The preprocess script lets you preprocess raw text and then tokenize and tag it in the XML format used in OPUS.
Run preprocess to process all unformatted .txt-files in a folder.
Usage:
preprocess [OPTIONS] FOLDER_IN FOLDER_OUT [de|en|es|fr|it|nl|ru|ca|sv|pt]
Options:

--from_word to use .docx files as input, rather than .txt files.
--tokenize to tokenize the files (requires Uplug, with support for the chosen language).
--tag to tag the files (requires TreeTagger, with support for the chosen language).
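
When preprocess is called from another Python script, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of this repository: the helper name build_preprocess_command and the folder names raw/en and xml/en are hypothetical, and it assumes the preprocess executable created during installation is on the PATH. Only the flags and language codes listed above are used.

```python
import shlex

# Language codes copied from the usage line above.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt")


def build_preprocess_command(folder_in, folder_out, language,
                             from_word=False, tokenize=False, tag=False):
    """Return the argument list for one preprocess run (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    command = ["preprocess"]
    # Each flag is added only when requested.
    if from_word:
        command.append("--from_word")
    if tokenize:
        command.append("--tokenize")
    if tag:
        command.append("--tag")
    command += [str(folder_in), str(folder_out), language]
    return command


# Placeholder folders; replace them with your own input and output paths.
command = build_preprocess_command("raw/en", "xml/en", "en",
                                   tokenize=True, tag=True)
print(shlex.join(command))  # prints: preprocess --tokenize --tag raw/en xml/en en
```

The resulting list can be passed to subprocess.run(command, check=True) once Uplug and TreeTagger are set up for the chosen language.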

Alignment
Run align to sentence-align the .xml files in a working directory. This requires Uplug to be installed.
Usage:
align [OPTIONS] WORKING_DIR [[de|en|es|fr|it|nl|ru|ca|sv|pt]]...
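
The align call can be scripted in the same spirit. This is a rough sketch under stated assumptions: build_align_command and run_align are hypothetical helpers, the working directory name xml is a placeholder, and no align options are shown because the usage line above does not list any.

```python
import shutil
import subprocess

# Language codes copied from the usage line above.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt")


def build_align_command(working_dir, languages):
    """Return the argument list for one align run (hypothetical helper)."""
    unknown = [code for code in languages if code not in LANGUAGES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return ["align", str(working_dir), *languages]


def run_align(working_dir, languages):
    """Run align, failing early if the executable is not installed."""
    if shutil.which("align") is None:
        raise FileNotFoundError(
            "align not found; run 'pip install --editable .' first")
    subprocess.run(build_align_command(working_dir, languages), check=True)


# Placeholder working directory with three languages to align.
print(build_align_command("xml", ["en", "de", "fr"]))
# prints: ['align', 'xml', 'en', 'de', 'fr']
```

Checking for the executable before calling it gives a clearer error than a failed subprocess when the installation step was skipped.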
Supported languages
Full support

German (de)
English (en)
Spanish (es) (+ variants Rioplatense (ar) and Mexican (mx) Spanish)
French (fr)
Italian (it)
Dutch (nl)
Russian (ru)
Portuguese (pt)

Limited support

Breton (br) [not supported in Uplug/TreeTagger]
Catalan (ca) [not supported in Uplug/TreeTagger]
Swedish (sv) [not supported in Uplug/TreeTagger]

Tests
Run the tests via:
$ python -m unittest discover

In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests.
This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian.
The source files are available through Project Gutenberg.

License:

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
