cutters 0.1.4
A rule-based sentence segmentation library.
Python bindings for the cutters library written in Rust.
🚧 This library is experimental. 🚧
Features
Full UTF-8 support.
Robust parsing.
Language-specific rules (each defined by its own PEG).
Fast and memory efficient parsing via the pest library.
Sentences can contain quotes which can contain subsentences.
Supported languages
Croatian (standard)
English (standard)
There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
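To make the contrast with the rule-based languages concrete, here is a minimal sketch of what splitting purely on sentence terminals looks like. It is not the library's Baseline implementation: the function name baseline_split and the small hand-picked terminal set are assumptions made for illustration only.

```python
import re
from typing import List

# Assumption: a tiny, hand-picked set of sentence terminals (. ! ? …).
# A real baseline would cover many more terminal characters.
_SENTENCE = re.compile(r"[^.!?…]+[.!?…]*")


def baseline_split(text: str) -> List[str]:
    """Split text after every run of terminal characters (hypothetical helper)."""
    pieces = (match.group().strip() for match in _SENTENCE.finditer(text))
    return [piece for piece in pieces if piece]


print(baseline_split("St. Louis 9LX je događaj u svijetu šaha. Volim rock itd."))
# ['St.', 'Louis 9LX je događaj u svijetu šaha.', 'Volim rock itd.']
```

Note how the abbreviation "St." is cut off as its own sentence. Avoiding this kind of false split is what the language-specific rules are for.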
Example
After installing the cutters package with pip, usage is simple (note that the language is specified with an ISO 639-1 two-letter language code).
import cutters
text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
"""
sentences = cutters.cut(text, "hr")
print(sentences)
This results in the following output (note that the str fields of the structs are of type &str).
[Sentence {
    str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
    quotes: [],
}, Sentence {
    str: "St. Louis 9LX je događaj u svijetu šaha.",
    quotes: [],
}, Sentence {
    str: "To je prof.dr.sc. Ivan Horvat.",
    quotes: [],
}, Sentence {
    str: "Volim rock, punk, funk, pop itd.",
    quotes: [],
}, Sentence {
    str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
    quotes: [
        Quote {
            str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
            sentences: [
                "Sve sretne obitelji nalik su jedna na drugu.",
                "Svaka nesretna obitelj nesretna je na svoj način.",
            ],
        },
    ],
}]
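The nesting in the output (sentences hold quotes, quotes hold subsentences) can be mirrored in plain Python for experimentation. The sketch below is not the API of the cutters bindings: the dataclasses, the field name text (standing in for str), and the helper all_sentences are hypothetical and only model the shape of the printed result.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Quote:
    text: str  # stands in for the printed "str" field
    sentences: List[str] = field(default_factory=list)


@dataclass
class Sentence:
    text: str  # stands in for the printed "str" field
    quotes: List[Quote] = field(default_factory=list)


def all_sentences(sentences: List[Sentence]) -> List[str]:
    """Return each sentence's text followed by the subsentences of its quotes."""
    result = []
    for sentence in sentences:
        result.append(sentence.text)
        for quote in sentence.quotes:
            result.extend(quote.sentences)
    return result


# Hand-built data modelled on the last two entries of the output above.
example = [
    Sentence("Volim rock, punk, funk, pop itd."),
    Sentence(
        'Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. '
        'Svaka nesretna obitelj nesretna je na svoj način."',
        quotes=[
            Quote(
                "Sve sretne obitelji nalik su jedna na drugu. "
                "Svaka nesretna obitelj nesretna je na svoj način.",
                sentences=[
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            )
        ],
    ),
]

for line in all_sentences(example):
    print(line)
```

Keeping the quoted subsentences separate from the enclosing sentence lets a caller choose between sentence-level and quote-level granularity.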