anyks-sc 1.2.6

Last updated:

0 purchases

anyks-sc 1.2.6 Image
anyks-sc 1.2.6 Images
Add to Cart

Description:

anykssc 1.2.6

ANYKS Spell-checker (ASC)
Project description
There are a lot of typo and text error correction systems out there. Each one of those systems has its pros and cons, and each system has the right to live and will find its own user base. I would like to present my own version of the typo correction system with its own unique features.
List of features

Correction of mistakes in words with a Levenshtein distance of up to 4;
Correction of different types of typos in words: insertion, deletion, substitution, rearrangement of character;
Ё-fication of a word given the context (letter 'ё' is commonly replaced by letter 'е' in russian typed text);
Context-based word capitalization for proper names and titles;
Context-based splitting for words that are missing the separating space character;
Text analysis without correcting the original text;
Searching the text for errors, typos, incorrect context.

Requirements

Zlib
Bloom
OpenSSL
hnswlib
HandyPack
GperfTools
Python3
NLohmann::json
BigInteger
ALM

Install PyBind11
$ python3 -m pip install pybind11

Ready-to-use dictionaries



Dictionary name
Size (GB)
RAM (GB)
N-gram order
Language




wittenbell-3-big.asc
1.97
15.6
3
RU


wittenbell-3-middle.asc
1.24
9.7
3
RU


mkneserney-3-middle.asc
1.33
9.7
3
RU


wittenbell-3-single.asc
0.772
5.14
3
RU


wittenbell-5-single.asc
1.37
10.7
5
RU



Testing
To test the system, we used data from the 2016 "spelling correction" competition organized by Dialog21.
The trained binary dictionary that was used for testing: wittenbell-3-middle.asc



Mode
Precision
Recall
FMeasure




Typo correction
76.97
62.71
69.11


Error correction
73.72
60.53
66.48



I think it is unnecessary to add any other data. Anyone can repeat the test if they wish (all files used for testing are attached below).
Files used for testing

test.txt - Text used for testing;
correct.txt - File with correct text;
evaluate.py - Python3 script for correction result evaluation.


Description of Methods
Methods:

idw - Word ID retrieval method
idt - Token ID retrieval method
ids - Sequence ID retrieval method

Example:
>>> import asc
>>>
>>> asc.idw("hello")
313191024
>>>
>>> asc.idw("<s>")
1
>>>
>>> asc.idw("</s>")
22
>>>
>>> asc.idw("<unk>")
3
>>>
>>> asc.idt("1424")
2
>>>
>>> asc.idt("hello")
0
>>>
>>> asc.idw("Living")
13268942501
>>>
>>> asc.idw("in")
2047
>>>
>>> asc.idw("the")
83201
>>>
>>> asc.idw("USA")
72549
>>>
>>> asc.ids([13268942501, 2047, 83201, 72549])
16314074810955466382

Description



Name
Description




〈s〉
Sentence beginning token


〈/s〉
Sentence end token


〈url〉
URL-address token


〈num〉
Number (arabic or roman) token


〈unk〉
Unknown word token


〈time〉
Time token (15:44:56)


〈score〉
Score count token (4:3 ¦ 01:04)


〈fract〉
Fraction token (5/20 ¦ 192/864)


〈date〉
Date token (18.07.2004 ¦ 07/18/2004)


〈abbr〉
Abbreviation token (1-й ¦ 2-е ¦ 20-я ¦ p.s ¦ p.s.)


〈dimen〉
Dimensions token (200x300 ¦ 1920x1080)


〈range〉
Range of numbers token (1-2 ¦ 100-200 ¦ 300-400)


〈aprox〉
Approximate number token (~93 ¦ 95.86 ¦ 1020)


〈anum〉
Pseudo-number token (combination of numbers and other symbols) (T34 ¦ 895-M-86 ¦ 39km)


〈pcards〉
Symbols of the play cards (♠ ¦ ♣ ¦ ♥ ¦ ♦ )


〈punct〉
Punctuation token (. ¦ , ¦ ? ¦ ! ¦ : ¦ ; ¦ … ¦ ¡ ¦ ¿)


〈route〉
Direction symbols (arrows) (← ¦ ↑ ¦ ↓ ¦ ↔ ¦ ↵ ¦ ⇐ ¦ ⇑ ¦ ⇒ ¦ ⇓ ¦ ⇔ ¦ ◄ ¦ ▲ ¦ ► ¦ ▼)


〈greek〉
Symbols of the Greek alphabet (Α ¦ Β ¦ Γ ¦ Δ ¦ Ε ¦ Ζ ¦ Η ¦ Θ ¦ Ι ¦ Κ ¦ Λ ¦ Μ ¦ Ν ¦ Ξ ¦ Ο ¦ Π ¦ Ρ ¦ Σ ¦ Τ ¦ Υ ¦ Φ ¦ Χ ¦ Ψ ¦ Ω)


〈isolat〉
Isolation/quotation token (( ¦ ) ¦ [ ¦ ] ¦ { ¦ } ¦ " ¦ « ¦ » ¦ „ ¦ “ ¦ ` ¦ ⌈ ¦ ⌉ ¦ ⌊ ¦ ⌋ ¦ ‹ ¦ › ¦ ‚ ¦ ’ ¦ ′ ¦ ‛ ¦ ″ ¦ ‘ ¦ ” ¦ ‟ ¦ ' ¦〈 ¦ 〉)


〈specl〉
Special character token (_ ¦ @ ¦ # ¦ № ¦ © ¦ ® ¦ & ¦ § ¦ æ ¦ ø ¦ Þ ¦ – ¦ ‾ ¦ ‑ ¦ — ¦ ¯ ¦ ¶ ¦ ˆ ¦ ˜ ¦ † ¦ ‡ ¦ • ¦ ‰ ¦ ⁄ ¦ ℑ ¦ ℘ ¦ ℜ ¦ ℵ ¦ ◊ ¦ \ )


〈currency〉
Symbols of world currencies ($ ¦ € ¦ ₽ ¦ ¢ ¦ £ ¦ ₤ ¦ ¤ ¦ ¥ ¦ ℳ ¦ ₣ ¦ ₴ ¦ ₸ ¦ ₹ ¦ ₩ ¦ ₦ ¦ ₭ ¦ ₪ ¦ ৳ ¦ ƒ ¦ ₨ ¦ ฿ ¦ ₫ ¦ ៛ ¦ ₮ ¦ ₱ ¦ ﷼ ¦ ₡ ¦ ₲ ¦ ؋ ¦ ₵ ¦ ₺ ¦ ₼ ¦ ₾ ¦ ₠ ¦ ₧ ¦ ₯ ¦ ₢ ¦ ₳ ¦ ₥ ¦ ₰ ¦ ₿ ¦ ұ)


〈math〉
Mathematical operation token (+ ¦ - ¦ = ¦ / ¦ * ¦ ^ ¦ × ¦ ÷ ¦ − ¦ ∕ ¦ ∖ ¦ ∗ ¦ √ ¦ ∝ ¦ ∞ ¦ ∠ ¦ ± ¦ ¹ ¦ ² ¦ ³ ¦ ½ ¦ ⅓ ¦ ¼ ¦ ¾ ¦ % ¦ ~ ¦ · ¦ ⋅ ¦ ° ¦ º ¦ ¬ ¦ ƒ ¦ ∀ ¦ ∂ ¦ ∃ ¦ ∅ ¦ ∇ ¦ ∈ ¦ ∉ ¦ ∋ ¦ ∏ ¦ ∑ ¦ ∧ ¦ ∨ ¦ ∩ ¦ ∪ ¦ ∫ ¦ ∴ ¦ ∼ ¦ ≅ ¦ ≈ ¦ ≠ ¦ ≡ ¦ ≤ ¦ ≥ ¦ ª ¦ ⊂ ¦ ⊃ ¦ ⊄ ¦ ⊆ ¦ ⊇ ¦ ⊕ ¦ ⊗ ¦ ⊥ ¦ ¨)




Methods:

setZone - User zone set method

Example:
>>> import asc
>>>
>>> asc.setZone("com")
>>> asc.setZone("ru")
>>> asc.setZone("org")
>>> asc.setZone("net")


Methods:

clear - Method clear all data
setAlphabet - Method set alphabet
getAlphabet - Method get alphabet

Example:
>>> import asc
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя'
>>>
>>> asc.clear()
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'


Methods:

setUnknown - Method set unknown word
getUnknown - Method extraction unknown word

Example:
>>> import asc
>>>
>>> asc.setUnknown("word")
>>>
>>> asc.getUnknown()
'word'


Methods:

infoIndex - Method for print information about the dictionary
token - Method for determining the type of the token words
addText - Method of adding text for estimate
collectCorpus - Training method of assembling the text data for ASC [curpus = filename or dir, smoothing = wittenBell, modified = False, prepares = False, mod = 0.0, status = Null]
pruneVocab - Dictionary pruning method
buildArpa - Method for build ARPA
writeWords - Method for writing these words to a file
writeVocab - Method for writing dictionary data to a file
writeNgrams - Method of writing data to NGRAMs files
writeMap - Method of writing sequence map to file
writeSuffix - Method for writing data to a suffix file for digital abbreviations
writeAbbrs - Method for writing data to an abbreviation file
getSuffixes - Method for extracting the list of suffixes of digital abbreviations
writeArpa - Method of writing data to ARPA file
setThreads - Method for setting the number of threads used in work (0 - all available threads)
setStemmingMethod - Method for setting external stemming function
loadIndex - Binary index loading method
spell - Method for performing spell-checker
analyze - Method for analyze text
addAlt - Method for add a word/letter with an alternative letter
setAlphabet - Method for set Alphabet
setPilots - Method for set pilot words
setSubstitutes - Method for set letters to correct words from mixed alphabets
addAbbr - Method add abbreviation
setAbbrs - Method set abbreviations
getAbbrs - Method for extracting the list of abbreviations
addGoodword - Method add good word
addBadword - Method add bad word
addUWord - Method for add a word that always starts with a capital letter
setUWords - Method for add a list of identifiers for words that always start with a capital letter
readArpa - Method for reading an ARPA file, language model
readVocab - Method of reading the dictionary
setEmbedding - Method for set embedding
buildIndex - Method for build spell-checker index
setAdCw - Method for set dictionary characteristics (cw - count all words in dataset, ad - count all documents in dataset)
setCode - Method for set code language
addLemma - Method for add a Lemma to the dictionary
setNSWLibCount - Method for set the maximum number of options for analysis

Example:
>>> import asc
>>>
>>> asc.infoIndex("./wittenbell-3-single.asc")

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* Code: RU

* Version: 1.0.0

* Dictionary name: Russian - single

* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz

* Build date: 09/08/2020 15:39:31

* Encrypted: NO

* ALM type: ALMv1

* Allow apostrophe: NO

* Count words: 106912195
* Count documents: 263998

* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES

* Count upper words: 841915
* Count pilots words: 15
* Count bad words: 108790
* Count good words: 124
* Count substitutes: 14
* Count abbreviations: 16532

* Alternatives: е => ё
* Count alternatives words: 58138

* Size embedding: 28

* Length n-gram: 3
* Count n-grams: 6710202

* Author: Yuriy Lobarev

* Contacts: site: https://anyks.com, e-mail: [email protected]

* Copyright ©: Yuriy Lobarev

* License type: GPLv3
* License text:
The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.

For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.

Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users.

Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free.

The precise terms and conditions for copying, distribution and modification follow.

URL: https://www.gnu.org/licenses/gpl-3.0.ru.html

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Example:
>>> import asc
>>> import spacy
>>> import pymorphy2
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.stemming)
>>>
>>> morphRu = pymorphy2.MorphAnalyzer()
>>> morphEn = spacy.load('en', disable=['parser', 'ner'])
>>>
>>> def status(text, status):
... print(text, status)
...
>>>
>>> def eng(word):
... global morphEn
... words = morphEn(word)
... word = ''.join([token.lemma_ for token in words]).strip()
... if word[0] != '-' and word[len(word) - 1] != '-':
... return word
... else:
... return ""
...
>>>
>>> def rus(word):
... global morphRu
... if morphRu != None:
... word = morphRu.parse(word)[0].normal_form
... return word
... else:
... return ""
...
>>>
>>> def run(word, lang):
... if lang == "ru":
... return rus(word.lower())
... elif lang == "en":
... return eng(word.lower())
...
>>>
>>> asc.setStemmingMethod(run)
>>>
>>> asc.loadIndex("./wittenbell-3-single.asc", "", status)
Loading dictionary 1
Loading dictionary 2
Loading dictionary 3
Loading dictionary 4
Loading dictionary 5
Loading dictionary 6
Loading dictionary 7
Loading dictionary 8
...
Loading Bloom filter 100
Loading stemming 0
Loading stemming 1
Loading stemming 2
Loading stemming 3
...
Loading language model 6
Loading language model 12
Loading language model 18
Loading language model 25
Loading language model 31
Loading language model 37
...
Loading alternative words 1
Loading alternative words 2
Loading alternative words 3
Loading alternative words 4
Loading alternative words 5
Loading alternative words 6
Loading alternative words 7
...
Loading substitutes letters 7
Loading substitutes letters 14
Loading substitutes letters 21
Loading substitutes letters 28
Loading substitutes letters 35
Loading substitutes letters 42
...
>>>
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажёг по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажёг'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]

Example:
>>> import asc
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("Легкий", "Лёгкий")
...
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.idw("Сбербанк")
13236490857
asc.idw("Совкомбанк")
22287680895
>>>
>>> asc.token("Сбербанк")
'<word>'
>>> asc.token("совкомбанк")
'<word>'
>>>
>>> asc.setAbbrs({13236490857, 22287680895})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>>
>>> asc.token("Сбербанк")
'<abbr>'
>>> asc.token("совкомбанк")
'<abbr>'
>>> asc.token("сша")
'<abbr>'
>>> asc.token("СБЕР")
'<abbr>'
...
>>> asc.getAbbrs()
{13236490857, 189243, 22287680895, 26938511}
>>>
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
...
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
...
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
...
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusVocab(status):
... print("Read vocab", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
Read arpa 7
Read arpa 8
...
>>> asc.readVocab("./words.vocab", statusVocab)
Read vocab 0
Read vocab 1
Read vocab 2
Read vocab 3
Read vocab 4
Read vocab 5
Read vocab 6
...
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
...
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажег по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажег'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]

Example:
>>> import asc
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("зажег", "зажёг")
>>> asc.addAlt("Легкий", "Лёгкий")
...
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
...
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
...
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
...
>>> asc.idw("Москва")
50387419219
>>> asc.idw("Санкт-Петербург")
68256898625
>>>
>>> asc.setUWords({50387419219: 1, 68256898625: 1})
>>>
...
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
Read arpa 7
Read arpa 8
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
...
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажёг по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажёг'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]

Example:
>>> import asc
>>> import spacy
>>> import pymorphy2
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.stemming)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("зажег", "зажёг")
>>> asc.addAlt("Легкий", "Лёгкий")
...
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
...
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
...
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
...
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
...
>>> morphRu = pymorphy2.MorphAnalyzer()
>>> morphEn = spacy.load('en', disable=['parser', 'ner'])
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def statusStemming(status):
... print("Build stemming", status)
...
>>> def eng(word):
... global morphEn
... words = morphEn(word)
... word = ''.join([token.lemma_ for token in words]).strip()
... if word[0] != '-' and word[len(word) - 1] != '-':
... return word
... else:
... return ""
...
>>> def rus(word):
... global morphRu
... if morphRu != None:
... word = morphRu.parse(word)[0].normal_form
... return word
... else:
... return ""
...
>>> def run(word, lang):
... if lang == "ru":
... return rus(word.lower())
... elif lang == "en":
... return eng(word.lower())
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
Read arpa 7
Read arpa 8
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.setCode("ru")
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
...
>>> asc.setStemmingMethod(run)
>>>
>>> asc.buildStemming(statusStemming)
Build stemming 0
Build stemming 1
Build stemming 2
Build stemming 3
Build stemming 4
Build stemming 5
...
>>> asc.addLemma("говорил")
>>> asc.addLemma("ходить")
...
>>> asc.setNSWLibCount(50000)
>>>
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажёг по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажёг'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]


Methods:

setOption - Library options setting method
unsetOption - Disable module option method

Example:
>>> import asc
>>>
>>> asc.unsetOption(asc.options_t.debug)
>>> asc.unsetOption(asc.options_t.mixDicts)
>>> asc.unsetOption(asc.options_t.onlyGood)
>>> asc.unsetOption(asc.options_t.confidence)
...

Description



Options
Description




debug
Flag debug mode


bloom
Flag allowed to use Bloom filter to check words


uppers
Flag that allows you to correct the case of letters


stemming
Flag for stemming activation


onlyGood
Flag allowing to consider words from the white list only


mixDicts
Flag allowing the use of words consisting of mixed dictionaries


allowUnk
Flag allowing to unknown word


resetUnk
Flag to reset the frequency of an unknown word


allGrams
Flag allowing accounting of all collected n-grams


onlyTypos
Flag to only correct typos


lowerCase
Flag allowing to case-insensitive


confidence
Flag arpa file loading without pre-processing the words


tokenWords
Flag that takes into account when assembling N-grams, only those tokens that match words


interpolate
Flag allowing to use interpolation in estimating


ascSplit
Flag to allow splitting of merged words


ascAlter
Flag that allows you to replace alternative letters in words


ascESplit
Flag to allow splitting of misspelled concatenated words


ascRSplit
Flag that allows you to combine words separated by a space


ascUppers
Flag that allows you to correct the case of letters


ascHyphen
Flag to allow splitting of concatenated words with hyphens


ascSkipUpp
Flag to skip uppercase words


ascSkipLat
Flag allowing words in the latin alphabet to be skipped


ascSkipHyp
Flag to skip hyphenated words


ascWordRep
Flag that allows you to remove duplicate words




Methods:

erratum - Method for search typos in text
token - Method for determining the type of the token words
split - Method for performing a split of clumped words
splitByHyphens - Method for performing a split of clumped words by hyphens
check - Method for checking a word for its existence in the dictionary

Example:
>>> import asc
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> def status(text, status):
... print(text, status)
...
>>>
>>> asc.loadIndex("./wittenbell-3-single.asc", "", status)
Loading dictionary 1
Loading dictionary 2
Loading dictionary 3
Loading dictionary 4
Loading dictionary 5
Loading dictionary 6
Loading dictionary 7
Loading dictionary 8
...
Loading Bloom filter 100
Loading stemming 100
Loading language model 6
Loading language model 12
Loading language model 18
Loading language model 25
Loading language model 31
Loading language model 37
...
Loading alternative words 1
Loading alternative words 2
Loading alternative words 3
Loading alternative words 4
Loading alternative words 5
Loading alternative words 6
Loading alternative words 7
...
Loading substitutes letters 7
Loading substitutes letters 14
Loading substitutes letters 21
Loading substitutes letters 28
Loading substitutes letters 35
Loading substitutes letters 42
...
>>>
asc.erratum("начальнег зажёг павзрослому")
['начальнег', 'павзрослому']
>>>
asc.token("word")
'<word>'
>>> asc.token("12")
'<num>'
>>> asc.token("127.0.0.1")
'<url>'
>>> asc.token("14-33")
'<range>'
>>> asc.token("14:44:22")
'<time>'
>>> asc.token("08/02/2020")
'<date>'
>>>
>>> asc.split("приветкакдела")
'привет как Дела'
>>> asc.split("былмастеромпрятатьсянонемогвоспользоватьсясвоимиталантамипотому")
'был мастером прятаться но не мог воспользоваться своими талантами потому'
>>> asc.split("Ябинатакойсоставбысходилеслиб")
'я б и на такой состав бы сходил если б'
>>> asc.split("летчерезXVIретроспективнопросматриватьэтобудет")
'лет через XVI ретроспективно просматривать это будет'
>>>
>>> asc.splitByHyphens("привет-как-дела")
'привет как дела'
>>> asc.splitByHyphens("как-то-так")
'как то так'
>>> asc.splitByHyphens("как-то")
'как-то'
>>>
>>> asc.check("hello")
True
>>> asc.check("Шварценеггер")
True
>>> asc.check("прывет")
False


Methods:

setSize - Method for set size N-gram
setAlmV2 - Method for set the language model type ALMv2
unsetAlmV2 - Method for unset the language model type ALMv2
setLocale - Method set locale (Default: en_US.UTF-8)
setCode - Method for set code language
setLictype - Method for set dictionary license information type
setName - Method for set dictionary name
setAuthor - Method for set the dictionary author
setCopyright - Method for set copyright on a dictionary
setLictext - Method for set license information dictionary
setContacts - Method for set contact details of the dictionary author
pruneArpa - Language model pruning method
addWord - Method for add a word to the dictionary
generateEmbedding - Method for generation embedding
setSizeEmbedding - Method for set the embedding size

Description



Smoothing




wittenBell


addSmooth


goodTuring


constDiscount


naturalDiscount


kneserNey


modKneserNey



Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setAlmV2()
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("зажег", "зажёг")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusMap(status):
... print("Write map", status)
...
>>> def statusArpa1(status):
... print("Build arpa", status)
...
>>> def statusArpa2(status):
... print("Write arpa", status)
...
>>> def statusWords(status):
... print("Write words", status)
...
>>> def statusVocab(status):
... print("Write vocab", status)
...
>>> def statusAbbrs(status):
... print("Write abbrs", status)
...
>>> def statusPrune(status):
... print("Prune vocab", status)
...
>>> def statusNgram(status):
... print("Write ngram", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.addText("The future is now", 0)
>>>
>>> asc.collectCorpus("./corpus/text.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
Read text corpora 0
Read text corpora 1
Read text corpora 2
Read text corpora 3
Read text corpora 4
Read text corpora 5
Read text corpora 6
...
>>> asc.pruneVocab(-15.0, 0, 0, statusPrune)
Prune vocab 0
Prune vocab 1
Prune vocab 2
Prune vocab 3
Prune vocab 4
Prune vocab 5
Prune vocab 6
...
# Prune VOCAB or prune ARPA example
>>> asc.pruneArpa(0.015, 3, statusPrune)
Prune arpa 0
Prune arpa 1
Prune arpa 2
Prune arpa 3
Prune arpa 4
Prune arpa 5
Prune arpa 6
...
>>> asc.buildArpa(statusArpa1)
Build arpa 0
Build arpa 1
Build arpa 2
Build arpa 3
Build arpa 4
Build arpa 5
Build arpa 6
...
>>> asc.writeMap("./words.map", statusMap)
Write map 0
Write map 1
Write map 2
Write map 3
Write map 4
Write map 5
Write map 6
...
>>> asc.writeArpa("./words.arpa", statusArpa2)
Write arpa 0
Write arpa 1
Write arpa 2
Write arpa 3
Write arpa 4
Write arpa 5
Write arpa 6
...
>>> asc.writeWords("./words.txt", statusWords)
Write words 0
Write words 1
Write words 2
Write words 3
Write words 4
Write words 5
Write words 6
...
>>> asc.writeVocab("./words.vocab", statusVocab)
Write vocab 0
Write vocab 1
Write vocab 2
Write vocab 3
Write vocab 4
Write vocab 5
Write vocab 6
...
>>> asc.writeAbbrs("./words1.abbr", statusAbbrs)
Write abbrs 50
Write abbrs 100
>>>
>>> asc.writeSuffix("./words2.abbr", statusAbbrs)
Write abbrs 10
Write abbrs 20
Write abbrs 30
Write abbrs 40
Write abbrs 50
Write abbrs 60
...
>>> asc.writeNgrams("./words.ngram", statusNgram)
Write ngram 0
Write ngram 1
Write ngram 2
Write ngram 3
Write ngram 4
Write ngram 5
Write ngram 6
...
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: [email protected]")
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>> asc.saveIndex("./3-wittenbell.asc", "", 128, status)
Read words 1
Read words 2
Read words 3
Read words 4
Read words 5
Read words 6
...
Train dictionary 0
Train dictionary 1
Train dictionary 2
Train dictionary 3
Train dictionary 4
Train dictionary 5
Train dictionary 6
...
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* Code: RU

* Version: 1.0.0

* Dictionary name: Russian

* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz

* Build date: 09/14/2020 01:39:50

* Encrypted: NO

* ALM type: ALMv2

* Allow apostrophe: NO

* Count words: 38120
* Count documents: 13

* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES

* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 12

* Alternatives: е => ё
* Count alternatives words: 1

* Size embedding: 28

* Length n-gram: 1

* Author: You name

* Contacts: site: https://example.com, e-mail: [email protected]

* Copyright ©: You company LLC

* License type: MIT
* License text:
... License text ...

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusVocab(status):
... print("Read vocab", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
...
>>> asc.readVocab("./words.vocab", statusVocab)
Read vocab 0
Read vocab 1
Read vocab 2
Read vocab 3
Read vocab 4
Read vocab 5
Read vocab 6
...
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: [email protected]")
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
Build index 5
Build index 6
...
>>> asc.saveIndex("./3-wittenbell.asc", "", 128, status)
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* Code: RU

* Version: 1.0.0

* Dictionary name: Russian

* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz

* Build date: 09/14/2020 01:58:52

* Encrypted: NO

* ALM type: ALMv1

* Allow apostrophe: NO

* Count words: 38120
* Count documents: 13

* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES

* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 2

* Alternatives: е => ё
* Count alternatives words: 1

* Size embedding: 28

* Length n-gram: 3
* Count n-grams: 353

* Author: You name

* Contacts: site: https://example.com, e-mail: [email protected]

* Copyright ©: You company LLC

* License type: MIT
* License text:
... License text ...

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setAlmV2()
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def statusPrune(status):
... print("Prune arpa", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.addWord("министерство")
>>> asc.addWord("возмездие", 0, 1)
>>> asc.addWord("возражение", asc.idw("возражение"), 2)
...
>>>
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: [email protected]")
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
Build index 5
Build index 6
...
>>> asc.saveIndex("./3-wittenbell.asc", "password", 128, status)
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* Code: RU

* Version: 1.0.0

* Dictionary name: Russian

* Build date: 09/14/2020 02:09:38

* Encrypted: YES

* ALM type: ALMv2

* Allow apostrophe: NO

* Count words: 38120
* Count documents: 13

* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES

* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 2

* Alternatives: е => ё
* Count alternatives words: 1

* Size embedding: 28

* Length n-gram: 3
* Count n-grams: 353

* Author: You name

* Contacts: site: https://example.com, e-mail: [email protected]

* Copyright ©: You company LLC

* License type: MIT
* License text:
... License text ...

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setAlmV2()
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def statusPrune(status):
... print("Prune arpa", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.addWord("министерство")
>>> asc.addWord("возмездие", 0, 1)
>>> asc.addWord("возражение", asc.idw("возражение"), 2)
...
>>>
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: [email protected]")
>>>
>>> asc.setSizeEmbedding(32)
>>> asc.generateEmbedding()
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
Build index 5
Build index 6
...
>>> asc.saveIndex("./3-wittenbell.asc", "password", 128, status)
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* Code: RU

* Version: 1.0.0

* Dictionary name: Russian

* Build date: 09/14/2020 02:09:38

* Encrypted: YES

* ALM type: ALMv2

* Allow apostrophe: NO

* Count words: 38120
* Count documents: 13

* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES

* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 2

* Alternatives: е => ё
* Count alternatives words: 1

* Size embedding: 28

* Length n-gram: 3
* Count n-grams: 353

* Author: You name

* Contacts: site: https://example.com, e-mail: [email protected]

* Copyright ©: You company LLC

* License type: MIT
* License text:
... License text ...

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


Methods:

size - Method of obtaining the size of the N-gram

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.size()
3


Methods:

damerauLevenshtein - Determination of the Damerau-Levenshtein distance in phrases
distanceLevenshtein - Determination of Levenshtein distance in phrases
tanimoto - Method for determining Jaccard coefficient (quotient - Tanimoto coefficient)
needlemanWunsch - Word stretching method

Example:
>>> import asc
>>> asc.damerauLevenshtein("привет", "приветик")
2
>>>
>>> asc.damerauLevenshtein("приевтик", "приветик")
1
>>>
>>> asc.distanceLevenshtein("приевтик", "приветик")
2
>>>
>>> asc.tanimoto("привет", "приветик")
0.7142857142857143
>>>
>>> asc.tanimoto("привеитк", "приветик")
0.4
>>>
>>> asc.needlemanWunsch("привеитк", "приветик")
4
>>>
>>> asc.needlemanWunsch("привет", "приветик")
2
>>>
>>> asc.damerauLevenshtein("acre", "car")
2
>>> asc.distanceLevenshtein("acre", "car")
3
>>>
>>> asc.damerauLevenshtein("anteater", "theatre")
4
>>> asc.distanceLevenshtein("anteater", "theatre")
5
>>>
>>> asc.damerauLevenshtein("banana", "nanny")
3
>>> asc.distanceLevenshtein("banana", "nanny")
3
>>>
>>> asc.damerauLevenshtein("cat", "crate")
2
>>> asc.distanceLevenshtein("cat", "crate")
2
>>>
>>> asc.mulctLevenshtein("привет", "приветик")
4
>>>
>>> asc.mulctLevenshtein("приевтик", "приветик")
1
>>>
>>> asc.mulctLevenshtein("acre", "car")
3
>>>
>>> asc.mulctLevenshtein("anteater", "theatre")
5
>>>
>>> asc.mulctLevenshtein("banana", "nanny")
4
>>>
>>> asc.mulctLevenshtein("cat", "crate")
4


Methods:

textToJson - Method to convert text to JSON
isAllowApostrophe - Apostrophe permission check method
switchAllowApostrophe - Method for permitting or denying an apostrophe as part of a word

Example:
>>> import asc
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.isAllowApostrophe()
False
>>> asc.switchAllowApostrophe()
>>>
>>> asc.isAllowApostrophe()
True
>>> asc.textToJson("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", callbackFn)
[["«","On","nous","dit","qu'aujourd'hui","c'est","le","cas",",","encore","faudra-t-il","l'évaluer","»","l'astronomie"]]


Methods:

jsonToText - Method to convert JSON to text

Example:
>>> import asc
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.jsonToText('[["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"]]', callbackFn)
«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie


Methods:

restore - Method for restore text from context

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.uppers)
>>>
>>> asc.restore(["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"])
"«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie"


Methods:

allowStress - Method for allow using stress in words
disallowStress - Method for disallow using stress in words

Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> asc.jsonToText('[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Белая стрела» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
>>>
>>> asc.allowStress()
>>> asc.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Бе́лая","стрела́","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> asc.jsonToText('[["«","Бе́лая","стрела́","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
>>>
>>> asc.disallowStress()
>>> asc.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> asc.jsonToText('[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Белая стрела» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].


Methods:

addBadword - Method add bad word
setBadwords - Method set words to blacklist
getBadwords - Method get words in blacklist

Example:
>>> import asc
>>>
>>> asc.setBadwords(["hello", "world", "test"])
>>>
>>> asc.getBadwords()
{1554834897, 2156498622, 28307030}
>>>
>>> asc.addBadword("test2")
>>>
>>> asc.getBadwords()
{5170183734, 1554834897, 2156498622, 28307030}

Example:
>>> import asc
>>>
>>> asc.setBadwords({24227504, 1219922507, 1794085167})
>>>
>>> asc.getBadwords()
{24227504, 1219922507, 1794085167}
>>>
>>> asc.clear(asc.clear_t.badwords)
>>>
>>> asc.getBadwords()
{}


Methods:

addGoodword - Method add good word
setGoodwords - Method set words to whitelist
getGoodwords - Method get words in whitelist

Example:
>>> import asc
>>>
>>> asc.setGoodwords(["hello", "world", "test"])
>>>
>>> asc.getGoodwords()
{1554834897, 2156498622, 28307030}
>>>
>>> asc.addGoodword("test2")
>>>
>>> asc.getGoodwords()
{5170183734, 1554834897, 2156498622, 28307030}
>>>
>>> asc.clear(asc.clear_t.goodwords)
>>>
>> asc.getGoodwords()
{}

Example:
>>> import asc
>>>
>>> asc.setGoodwords({24227504, 1219922507, 1794085167})
>>>
>>> asc.getGoodwords()
{24227504, 1219922507, 1794085167}


Methods:

setUserToken - Method for adding user token
getUserTokens - User token list retrieval method
getUserTokenId - Method for obtaining user token identifier
getUserTokenWord - Method for obtaining a custom token by its identifier

Example:
>>> import asc
>>>
>>> asc.setUserToken("usa")
>>>
>>> asc.setUserToken("russia")
>>>
>>> asc.getUserTokenId("usa")
5759809081
>>>
>>> asc.getUserTokenId("russia")
9910674734
>>>
>>> asc.getUserTokens()
['usa', 'russia']
>>>
>>> asc.getUserTokenWord(5759809081)
'usa'
>>>
>>> asc.getUserTokenWord(9910674734)
'russia'
>>>
>> asc.clear(asc.clear_t.utokens)
>>>
>>> asc.getUserTokens()
[]


Methods:

findNgram - N-gram search method in text
word - "Method to extract a word by its identifier"

Example:
>>> import asc
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.idw("привет")
2487910648
>>> asc.word(2487910648)
'привет'
>>>
>>> asc.findNgram("Особое место занимает чудотворная икона Лобзание Христа Иудою", callbackFn)
<s> Особое
Особое место
место занимает
занимает чудотворная
чудотворная икона
икона Лобзание
Лобзание Христа
Христа Иудою
Иудою </s>


>>>


Methods:

setUserTokenMethod - Method for set a custom token processing function

Example:
>>> import asc
>>>
>>> def fn(token, word):
... if token and (token == "<usa>"):
... if word and (word.lower() == "usa"):
... return True
... elif token and (token == "<russia>"):
... if word and (word.lower() == "russia"):
... return True
... return False
...
>>> asc.setUserToken("usa")
>>>
>>> asc.setUserToken("russia")
>>>
>>> asc.setUserTokenMethod("usa", fn)
>>>
>>> asc.setUserTokenMethod("russia", fn)
>>>
>>> asc.idw("usa")
5759809081
>>>
>>> asc.idw("russia")
9910674734
>>>
>>> asc.getUserTokenWord(5759809081)
'usa'
>>>
>>> asc.getUserTokenWord(9910674734)
'russia'


Methods:

setWordPreprocessingMethod - Method for set the word preprocessing function

Example:
>>> import asc
>>>
>>> def run(word, context):
... if word == "возле": word = "около"
... return word
...
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.setWordPreprocessingMethod(run)
>>>
>>> a = asc.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
info: <s> Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор <punct> <punct> <punct> </s>

info: p( неожиданно | <s> ) = [2gram] 0.00038931 [ -3.40969900 ] / 0.99999991
info: p( из | неожиданно ...) = [2gram] 0.10110741 [ -0.99521700 ] / 0.99999979
info: p( подворотни | из ...) = [2gram] 0.00711798 [ -2.14764300 ] / 1.00000027
info: p( в | подворотни ...) = [2gram] 0.51077661 [ -0.29176900 ] / 1.00000021
info: p( олега | в ...) = [2gram] 0.00082936 [ -3.08125500 ] / 0.99999974
info: p( ударил | олега ...) = [2gram] 0.25002820 [ -0.60201100 ] / 0.99999978
info: p( яркий | ударил ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( прожектор | яркий ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( патрульный | прожектор ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( трактор | патрульный ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( <punct> | трактор ...) = [OOV] 0.00000000 [ -inf ] / 0.99999973
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993

info: 1 sentences, 13 words, 0 OOVs
info: 3 zeroprobs, logprob= -12.97624000 ppl= 8.45034200 ppl1= 9.95800426

info: <s> С лязгом выкатился и остановился около мальчика <punct> <punct> <punct> <punct> </s>

info: p( с | <s> ) = [2gram] 0.00642448 [ -2.19216200 ] / 0.99999991
info: p( лязгом | с ...) = [2gram] 0.00195917 [ -2.70792700 ] / 0.99999999
info: p( выкатился | лязгом ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( и | выкатился ...) = [2gram] 0.51169951 [ -0.29098500 ] / 1.00000024
info: p( остановился | и ...) = [2gram] 0.00143382 [ -2.84350600 ] / 0.99999975
info: p( около | остановился ...) = [1gram] 0.00011358 [ -3.94468000 ] / 1.00000003
info: p( мальчика | около ...) = [1gram] 0.00003932 [ -4.40541100 ] / 1.00000016
info: p( <punct> | мальчика ...) = [OOV] 0.00000000 [ -inf ] / 0.99999990
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993

info: 1 sentences, 11 words, 0 OOVs
info: 4 zeroprobs, logprob= -17.93030200 ppl= 31.20267541 ppl1= 42.66064865
>>> print(a.logprob)
-30.906542


Methods:

setLogfile - Method of set the file for log output
setOOvFile - Method set file for saving OOVs words

Example:
>>> import asc
>>>
>>> asc.setLogfile("./log.txt")
>>>
>>> asc.setOOvFile("./oov.txt")


Methods:

perplexity - Perplexity calculation
pplConcatenate - Method of combining perplexia
pplByFiles - Method for reading perplexity calculation by file or group of files

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> a = asc.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
>>>
>>> print(a.logprob)
-30.906542
>>>
>>> print(a.oovs)
0
>>>
>>> print(a.words)
24
>>>
>>> print(a.sentences)
2
>>>
>>> print(a.zeroprobs)
7
>>>
>>> print(a.ppl)
17.229063831108224
>>>
>>> print(a.ppl1)
19.398698060810077
>>>
>>> b = asc.pplByFiles("./text.txt")
>>>
>>> c = asc.pplConcatenate(a, b)
>>>
>>> print(c.ppl)
7.384123548831112

Description



Name
Description




ppl
The meaning of perplexity without considering the beginning of the sentence


ppl1
The meaning of perplexion taking into account the beginning of the sentence


oovs
Count of oov words


words
Count of words in sentence


logprob
Word sequence frequency


sentences
Count of sequences


zeroprobs
Count of zero probs




Methods:

tokenization - Method for breaking text into tokens

Example:
>>> import asc
>>>
>>> def tokensFn(word, context, reset, stop):
... print(word, " => ", context)
... return True
...
>>> asc.switchAllowApostrophe()
>>>
>>> asc.tokenization("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", tokensFn)
« => []
On => ['«']
nous => ['«', 'On']
dit => ['«', 'On', 'nous']
qu'aujourd'hui => ['«', 'On', 'nous', 'dit']
c'est => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui"]
le => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est"]
cas => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le']
, => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas']
encore => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',']
faudra-t-il => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore']
l => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l']
' => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l']
évaluer => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'"]
» => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer']
l => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»']
' => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»', 'l']
astronomie => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»', 'l', "'"]


Methods:

setTokenizerFn - Method for set the function of an external tokenizer

Example:
>>> import asc
>>>
>>> def tokenizerFn(text, callback):
... word = ""
... context = []
... for letter in text:
... if letter == " " and len(word) > 0:
... if not callback(word, context, False, False): return
... context.append(word)
... word = ""
... elif letter == "." or letter == "!" or letter == "?":
... if not callback(word, context, True, False): return
... word = ""
... context = []
... else:
... word += letter
... if len(word) > 0:
... if not callback(word, context, False, True): return
...
>>> def tokensFn(word, context, reset, stop):
... print(word, " => ", context)
... return True
...
>>> asc.setTokenizerFn(tokenizerFn)
>>>
>>> asc.tokenization("Hello World today!", tokensFn)
Hello => []
World => ['Hello']
today => ['Hello', 'World']


Methods:

sentences - Sentences generation method
sentencesToFile - Method for assembling a specified number of sentences and writing to a file

Example:
>>> import asc
>>>
>>> def sentencesFn(text):
... print("Sentences:", text)
... return True
...
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.sentences(sentencesFn)
Sentences: <s> В общем </s>
Sentences: <s> С лязгом выкатился и остановился возле мальчика </s>
Sentences: <s> У меня нет </s>
Sentences: <s> Я вообще не хочу </s>
Sentences: <s> Да и в общем </s>
Sentences: <s> Не могу </s>
Sentences: <s> Ну в общем </s>
Sentences: <s> Так что я вообще не хочу </s>
Sentences: <s> Потому что я вообще не хочу </s>
Sentences: <s> Продолжение следует </s>
Sentences: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор </s>
>>>
>>> asc.sentencesToFile(5, "./result.txt")


Methods:

fixUppers - Method for correcting registers in the text
fixUppersByFiles - Method for correcting text registers in a text file

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.fixUppers("неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
'Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? С лязгом выкатился и остановился возле мальчика....'
>>>
>>> asc.fixUppersByFiles("./corpus", "./result.txt", "txt")


Methods:

checkHypLat - Hyphen and latin character search method

Example:
>>> import asc
>>>
>>> asc.checkHypLat("Hello-World")
(True, True)
>>>
>>> asc.checkHypLat("Hello")
(False, True)
>>>
>>> asc.checkHypLat("Привет")
(False, False)
>>>
>>> asc.checkHypLat("так-как")
(True, False)


Methods:

getUppers - Method for extracting registers for each word
countLetter - Method for counting the amount of a specific letter in a word

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.idw("Living")
10493385932
>>>
>>> asc.idw("in")
3301
>>>
>>> asc.idw("the")
217280
>>>
>>> asc.idw("USA")
188643
>>>
>>> asc.getUppers([10493385932, 3301, 217280, 188643])
[1, 0, 0, 7]
>>>
>>> asc.countLetter("hello-world", "-")
1
>>>
>>> asc.countLetter("hello-world", "l")
3


Methods:

urls - Method for extracting URL address coordinates in a string

Example:
>>> import asc
>>>
>>> asc.urls("This website: example.com was designed with ...")
{14: 25}
>>>
>>> asc.urls("This website: https://a.b.c.example.net?id=52#test-1 was designed with ...")
{14: 52}
>>>
>>> asc.urls("This website: https://a.b.c.example.net?id=52#test-1 and 127.0.0.1 was designed with ...")
{14: 52, 57: 66}


Methods:

roman2Arabic - Method for translating Roman numerals to Arabic

Example:
>>> import asc
>>>
>>> asc.roman2Arabic("XVI")
16


Methods:

rest - Method for correction and detection of words with mixed alphabets
setSubstitutes - Method for set letters to correct words from mixed alphabets
getSubstitutes - Method of extracting letters to correct words from mixed alphabets

Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.getSubstitutes()
{'a': 'а', 'b': 'в', 'c': 'с', 'e': 'е', 'h': 'н', 'k': 'к', 'm': 'м', 'o': 'о', 'p': 'р', 't': 'т', 'x': 'х'}
>>>
>>> str = "ПPИBETИК"
>>>
>>> str.lower()
'пpиbetик'
>>>
>>> asc.rest(str)
'приветик'


Methods:

setTokensDisable - Method for set the list of forbidden tokens
setTokensUnknown - Method for set the list of tokens cast to 〈unk〉
setTokenDisable - Method for set the list of unidentifiable tokens
setTokenUnknown - Method of set the list of tokens that need to be identified as 〈unk〉
getTokensDisable - Method for retrieving the list of forbidden tokens
getTokensUnknown - Method for extracting a list of tokens reducible to 〈unk〉
setAllTokenDisable - Method for set all tokens as unidentifiable
setAllTokenUnknown - The method of set all tokens identified as 〈unk〉

Example:
>>> import asc
>>>
>>> asc.idw("<date>")
6
>>>
>>> asc.idw("<time>")
7
>>>
>>> asc.idw("<abbr>")
5
>>>
>>> asc.idw("<math>")
9
>>>
>>> asc.setTokenDisable("date|time|abbr|math")
>>>
>>> asc.getTokensDisable()
{9, 5, 6, 7}
>>>
>>> asc.setTokensDisable({6, 7, 5, 9})
>>>
>>> asc.setTokenUnknown("date|time|abbr|math")
>>>
>>> asc.getTokensUnknown()
{9, 5, 6, 7}
>>>
>>> asc.setTokensUnknown({6, 7, 5, 9})
>>>
>>> asc.setAllTokenDisable()
>>>
>>> asc.getTokensDisable()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23}
>>>
>>> asc.setAllTokenUnknown()
>>>
>>> asc.getTokensUnknown()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23}


Methods:

countAlphabet - Method of obtaining the number of letters in the dictionary

Example:
>>> import asc
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> asc.countAlphabet()
26
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.countAlphabet()
59


Methods:

countBigrams - Method get count bigrams
countTrigrams - Method get count trigrams
countGrams - Method get count N-gram by lm size

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.countBigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
12
>>>
>>> asc.countTrigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>>
>>> asc.size()
3
>>>
>>> asc.countGrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>>
>>> asc.idw("неожиданно")
3263936167
>>>
>>> asc.idw("из")
5134
>>>
>>> asc.idw("подворотни")
12535356101
>>>
>>> asc.idw("в")
53
>>>
>>> asc.idw("Олега")
2824508300
>>>
>>> asc.idw("ударил")
24816796913
>>>
>>> asc.countBigrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
5
>>>
>>> asc.countTrigrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
4
>>>
>>> asc.countGrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
4


Methods:

arabic2Roman - Convert arabic number to roman number

Example:
>>> import asc
>>>
>>> asc.arabic2Roman(23)
'XXIII'
>>>
>>> asc.arabic2Roman("33")
'XXXIII'


Methods:

setThreads - Method for set the number of threads (0 - all threads)

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.setThreads(3)
>>>
>>> a = asc.pplByFiles("./text.txt")
>>>
>>> print(a.logprob)
-48201.29481399994


Methods:

fti - Method for removing the fractional part of a number

Example:
>>> import asc
>>>
>>> asc.fti(5892.4892)
5892489200000
>>>
>>> asc.fti(5892.4892, 4)
58924892


Methods:

context - Method for assembling text context from a sequence

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.idw("неожиданно")
3263936167
>>>
>>> asc.idw("из")
5134
>>>
>>> asc.idw("подворотни")
12535356101
>>>
>>> asc.idw("в")
53
>>>
>>> asc.idw("Олега")
2824508300
>>>
>>> asc.idw("ударил")
24816796913
>>>
>>> asc.context([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
'Неожиданно из подворотни в Олега ударил'


Methods:

isAbbr - Method of checking a word for compliance with an abbreviation
isSuffix - Method for checking a word for a suffix of a numeric abbreviation
isToken - Method for checking if an identifier matches a token
isIdWord - Method for checking if an identifier matches a word

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.addAbbr("США")
>>>
>>> asc.isAbbr("сша")
True
>>>
>>> asc.addSuffix("1-я")
>>>
>>> asc.isSuffix("1-я")
True
>>>
>>> asc.isToken(asc.idw("США"))
True
>>>
>>> asc.isToken(asc.idw("1-я"))
True
>>>
>>> asc.isToken(asc.idw("125"))
True
>>>
>>> asc.isToken(asc.idw("<s>"))
True
>>>
>>> asc.isToken(asc.idw("Hello"))
False
>>>
>>> asc.isIdWord(asc.idw("https://anyks.com"))
True
>>>
>>> asc.isIdWord(asc.idw("Hello"))
True
>>>
>>> asc.isIdWord(asc.idw("-"))
False


Methods:

findByFiles - Method search N-grams in a text file

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.findByFiles("./text.txt", "./result.txt")
info: <s> Кукай
сари кукай
сари японские
японские каллиграфы
каллиграфы я
я постоянно
постоянно навещал
навещал их
их тайно
тайно от
от людей
людей </s>


info: <s> Неожиданно из
Неожиданно из подворотни
из подворотни в
подворотни в Олега
в Олега ударил
Олега ударил яркий
ударил яркий прожектор
яркий прожектор патрульный
прожектор патрульный трактор
патрульный трактор

<s> С лязгом
С лязгом выкатился
лязгом выкатился и
выкатился и остановился
и остановился возле
остановился возле мальчика
возле мальчика


Methods:

checkSequence - Sequence Existence Method
existSequence - Method for checking the existence of a sequence, excluding non-word tokens
checkByFiles - Method for checking if a sequence exists in a text file

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.addAbbr("США")
>>>
>>> asc.isAbbr("сша")
>>>
>>> asc.checkSequence("Неожиданно из подворотни в олега ударил")
True
>>>
>>> asc.checkSequence("Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором")
True
>>>
>>> asc.checkSequence("Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором", True)
True
>>>
>>> asc.checkSequence("в Олега ударил яркий")
True
>>>
>>> asc.checkSequence("в Олега ударил яркий", True)
True
>>>
>>> asc.checkSequence("от госсекретаря США")
True
>>>
>>> asc.checkSequence("от госсекретаря США", True)
True
>>>
>>> asc.checkSequence("Неожиданно из подворотни в олега ударил", 2)
True
>>>
>>> asc.checkSequence(["Неожиданно","из","подворотни","в","олега","ударил"], 2)
True
>>>
>>> asc.existSequence("<s> Сегодня сыграл и в, Олега ударил яркий прожектор, патрульный трактор - с корпоративным сектором </s>", 2)
(True, 0)
>>>
>>> asc.existSequence(["<s>","Сегодня","сыграл","и","в",",","Олега","ударил","яркий","прожектор",",","патрульный","трактор","-","с","корпоративным","сектором","</s>"], 2)
(True, 2)
>>>
>>> asc.idw("от")
6086
>>>
>>> asc.idw("госсекретаря")
51273912082
>>>
>>> asc.idw("США")
5
>>>
>>> asc.checkSequence([6086, 51273912082, 5])
True
>>>
>>> asc.checkSequence([6086, 51273912082, 5], True)
True
>>>
>>> asc.checkSequence(["от", "госсекретаря", "США"])
True
>>>
>>> asc.checkSequence(["от", "госсекретаря", "США"], True)
True
>>>
>>> asc.checkByFiles("./text.txt", "./result.txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались

info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.

info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>>
>>> asc.checkByFiles("./corpus", "./result.txt", False, "txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались

info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.

info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>>
>>> asc.checkByFiles("./corpus", "./result.txt", True, "txt")
info: 2000 | NO | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 2001 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.

info: 2002 | NO | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 2004 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 2005 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

All texts: 2007
Exists texts: 0
Not exists texts: 2007


Methods:

check - String Check Method
match - String Matching Method
addAbbr - Method add abbreviation
addSuffix - Method add number suffix abbreviation
setSuffixes - Method set number suffix abbreviations
readSuffix - Method for reading data from a file of suffixes and abbreviations

Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.check("Дом-2", asc.check_t.home2)
True
>>>
>>> asc.check("Дом2", asc.check_t.home2)
False
>>>
>>> asc.check("Дом-2", asc.check_t.latian)
False
>>>
>>> asc.check("Hello", asc.check_t.latian)
True
>>>
>>> asc.check("прiвет", asc.check_t.latian)
True
>>>
>>> asc.check("Дом-2", asc.check_t.hyphen)
True
>>>
>>> asc.check("Дом2", asc.check_t.hyphen)
False
>>>
>>> asc.check("Д", asc.check_t.letter)
True
>>>
>>> asc.check("$", asc.check_t.letter)
False
>>>
>>> asc.check("-", asc.check_t.letter)
False
>>>
>>> asc.check("просtоквaшино", asc.check_t.similars)
True
>>>
>>> asc.match("my site http://example.ru, it's true", asc.match_t.url)
True
>>>
>>> asc.match("по вашему ip адресу 46.40.123.12 проводится проверка", asc.match_t.url)
True
>>>
>>> asc.match("мой адрес в формате IPv6: http://[2001:0db8:11a3:09d7:1f34:8a2e:07a0:765d]/", asc.match_t.url)
True
>>>
>>> asc.match("13-я", asc.match_t.abbr)
True
>>>
asc.match("13-я-й", asc.match_t.abbr)
False
>>>
asc.match("т.д", asc.match_t.abbr)
True
>>>
asc.match("т.п.", asc.match_t.abbr)
True
>>>
>>> asc.match("С.Ш.А.", asc.match_t.abbr)
True
>>>
>>> asc.addAbbr("сша")
>>> asc.match("США", asc.match_t.abbr)
True
>>>
>>> asc.addSuffix("15-летия")
>>> asc.match("15-летия", asc.match_t.abbr)
True
>>>
>>> asc.getSuffixes()
{3139900457}
>>>
>>> asc.idw("лет")
328041
>>>
>>> asc.idw("тых")
352214
>>>
>>> asc.setSuffixes({328041, 352214})
>>>
>>> asc.getSuffixes()
{328041, 352214}
>>>
>>> def status(status):
... print(status)
...
>>> asc.readSuffix("./suffix.abbr", status)
>>>
>>> asc.match("15-лет", asc.match_t.abbr)
True
>>>
>>> asc.match("20-тых", asc.match_t.abbr)
True
>>>
>>> asc.match("15-летия", asc.match_t.abbr)
False
>>>
>>> asc.match("Hello", asc.match_t.latian)
True
>>>
>>> asc.match("прiвет", asc.match_t.latian)
False
>>>
>>> asc.match("23424", asc.match_t.number)
True
>>>
>>> asc.match("hello", asc.match_t.number)
False
>>>
>>> asc.match("23424.55", asc.match_t.number)
False
>>>
>>> asc.match("23424", asc.match_t.decimal)
False
>>>
>>> asc.match("23424.55", asc.match_t.decimal)
True
>>>
>>> asc.match("23424,55", asc.match_t.decimal)
True
>>>
>>> asc.match("-23424.55", asc.match_t.decimal)
True
>>>
>>> asc.match("+23424.55", asc.match_t.decimal)
True
>>>
>>> asc.match("+23424.55", asc.match_t.anumber)
True
>>>
>>> asc.match("15T-34", asc.match_t.anumber)
True
>>>
>>> asc.match("hello", asc.match_t.anumber)
False
>>>
>>> asc.match("hello", asc.match_t.allowed)
True
>>>
>>> asc.match("évaluer", asc.match_t.allowed)
False
>>>
>>> asc.match("13", asc.match_t.allowed)
True
>>>
>>> asc.match("Hello-World", asc.match_t.allowed)
True
>>>
>>> asc.match("Hello", asc.match_t.math)
False
>>>
>>> asc.match("+", asc.match_t.math)
True
>>>
>>> asc.match("=", asc.match_t.math)
True
>>>
>>> asc.match("Hello", asc.match_t.upper)
True
>>>
>>> asc.match("hello", asc.match_t.upper)
False
>>>
>>> asc.match("hellO", asc.match_t.upper)
False
>>>
>>> asc.match("a", asc.match_t.punct)
False
>>>
>>> asc.match(",", asc.match_t.punct)
True
>>>
>>> asc.match(" ", asc.match_t.space)
True
>>>
>>> asc.match("a", asc.match_t.space)
False
>>>
>>> asc.match("a", asc.match_t.special)
False
>>>
>>> asc.match("±", asc.match_t.special)
False
>>>
>>> asc.match("[", asc.match_t.isolation)
True
>>>
>>> asc.match("a", asc.match_t.isolation)
False
>>>
>>> asc.match("a", asc.match_t.greek)
False
>>>
>>> asc.match("Ψ", asc.match_t.greek)
True
>>>
>>> asc.match("->", asc.match_t.route)
False
>>>
>>> asc.match("⇔", asc.match_t.route)
True
>>>
>>> asc.match("a", asc.match_t.letter)
True
>>>
>>> asc.match("!", asc.match_t.letter)
False
>>>
>>> asc.match("!", asc.match_t.pcards)
False
>>>
>>> asc.match("♣", asc.match_t.pcards)
True
>>>
>>> asc.match("p", asc.match_t.currency)
False
>>>
>>> asc.match("$", asc.match_t.currency)
True
>>>
>>> asc.match("€", asc.match_t.currency)
True
>>>
>>> asc.match("₽", asc.match_t.currency)
True
>>>
>>> asc.match("₿", asc.match_t.currency)
True


Methods:

delInText - Method for delete letter in text

Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", asc.wdel_t.punct)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>>
>>> asc.delInText("hello-world-hello-world", asc.wdel_t.hyphen)
'helloworldhelloworld'
>>>
>>> asc.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", asc.wdel_t.broken)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>>
>>> asc.delInText("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", asc.wdel_t.broken)
"On nous dit qu'aujourd'hui c'est le cas encore faudra-t-il l'valuer l'astronomie"


Methods:

countsByFiles - Method for counting the number of n-grams in a text file

Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.countsByFiles("./text.txt", "./result.txt", 3)
info: 0 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 0 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

Counts 3grams: 471
>>>
>>> asc.countsByFiles("./corpus", "./result.txt", 2, "txt")
info: 19 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 10 | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 27 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

Counts 2grams: 20270

Description



N-gram size
Description




1
language model size


2
bigram


3
trigram

License:

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.