cutters 0.1.4

Creator: coderz1093

Last updated:

Add to Cart

Description:

cutters 0.1.4

cutters

A rule based sentence segmentation library.
Python bindings for the cutters library written in Rust.












🚧 This library is experimental. 🚧

Features

Full UTF-8 support.
Robust parsing.
Language specific rules (each defined by its own PEG).
Fast and memory efficient parsing via the pest library.
Sentences can contain quotes which can contain subsentences.

Supported languages

Croatian (standard)
English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by UTF-8. Its intended use is for benchmarking.
Example
After installing the cutters package with pip, usage is simple (note that the language is defined via ISO 639-1 two letter language codes).
import cutters

text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
""";

sentences = cutters.cut(text, "hr");

print(sentences);

This results in the following output (note that the str struct fields are &str).
[Sentence {
str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
quotes: [],
}, Sentence {
str: "St. Louis 9LX je događaj u svijetu šaha.",
quotes: [],
}, Sentence {
str: "To je prof.dr.sc. Ivan Horvat.",
quotes: [],
}, Sentence {
str: "Volim rock, punk, funk, pop itd.",
quotes: [],
}, Sentence {
str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
quotes: [
Quote {
str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
sentences: [
"Sve sretne obitelji nalik su jedna na drugu.",
"Svaka nesretna obitelj nesretna je na svoj način.",
],
},
],
}]

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.