cypunct 0.1.1

Creator: bradpython12

Cypunct is designed to solve the problem of quickly splitting a Unicode
string based on a set of characters.
Cypunct is designed to work on Python 2.6, 2.7, and 3.3+. Because
Cypunct is a Cython extension, it will (probably) only work in the CPython
runtime.
On Python 2.6 and 2.7, Cypunct only runs if the CPython runtime was
compiled with the flag --enable-unicode=ucs4. Cypunct raises an
exception if your Python 2 runtime was not compiled with UCS-4.
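If you are unsure how your interpreter was built, a quick check along these lines may help. This is only a sketch: sys.maxunicode is part of the standard library, but the helper name is_wide_unicode_build is made up for illustration, and this is not necessarily how Cypunct itself detects the build.

```python
import sys


def is_wide_unicode_build():
    """Return True when the interpreter can store any code point in one unit.

    Narrow (UCS-2) Python 2 builds report sys.maxunicode == 0xFFFF.
    Wide (UCS-4) Python 2 builds and every Python 3.3+ report 0x10FFFF.
    """
    return sys.maxunicode == 0x10FFFF


if __name__ == "__main__":
    print("wide Unicode build:", is_wide_unicode_build())
```

On Python 3.3 and later this should always report a wide build, so the check only matters for Python 2 interpreters.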

Installation
Installation is easiest with pip. Just run
pip install cypunct


Usage
Cypunct takes a Unicode string and a frozenset of delimiter characters,
and splits the string based on that set. Every delimiter character
should be a single Unicode code point, so len(char) should be 1.
A simple example with a small frozenset is shown below.
>>> from cypunct import split
>>> split("James Mishra is the... best human ever, or so I think.", frozenset({' ', '.', ','}))
['James', 'Mishra', 'is', 'the', 'best', 'human', 'ever', 'or', 'so', 'I', 'think', '']
However, if you only need to split on whitespace characters, str.split() offers
much better performance. If you only need to split on one character,
str.split(char) will also be much faster.
Cypunct really shines when you need to split on many possible characters,
such as an entire Unicode character category.
The example below splits on all Unicode punctuation and nothing else.
>>> from cypunct.unicode_classes import P
>>> split("James Mishra is the... best human ever, or so I think.", P)
['James Mishra is the', ' best human ever', ' or so I think', '']
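For comparison, or for testing without the extension installed, the two outputs above can be reproduced with the standard re module. This is a pure-Python sketch, not Cypunct's implementation: the name regex_split is hypothetical, and its behaviour on edge cases the examples do not cover (for instance a string that starts with a delimiter) is an assumption that may differ from Cypunct.

```python
import re


def regex_split(text, delimiters):
    """Split text on runs of any character in delimiters (pure Python)."""
    if not delimiters:
        # An empty character class is not a valid pattern, so return early.
        return [text]
    # Sort for a deterministic pattern; re.escape keeps characters such as
    # ']', '^', '-' and '\' literal inside the character class.
    char_class = "".join(re.escape(ch) for ch in sorted(delimiters))
    return re.split("[" + char_class + "]+", text)


if __name__ == "__main__":
    sentence = "James Mishra is the... best human ever, or so I think."
    print(regex_split(sentence, frozenset({" ", ".", ","})))
    print(regex_split(sentence, frozenset({".", ","})))
```

Like the examples above, consecutive delimiters are treated as one separator and a trailing delimiter leaves an empty string at the end of the list.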
The following Unicode classes are available as sets:


Category  Description
--------  --------------------------
C         Other
Cc        Other, Control
Cf        Other, Format
Co        Other, Private Use
Cs        Other, Surrogate
L         Letter
Ll        Letter, Lowercase
Lm        Letter, Modifier
Lo        Letter, Other
Lt        Letter, Titlecase
Lu        Letter, Uppercase
M         Mark
Mc        Mark, Spacing Combining
Me        Mark, Enclosing
Mn        Mark, Nonspacing
N         Number
Nd        Number, Decimal Digit
Nl        Number, Letter
No        Number, Other
P         Punctuation
Pc        Punctuation, Connector
Pd        Punctuation, Dash
Pe        Punctuation, Close
Pf        Punctuation, Final Quote
Pi        Punctuation, Initial Quote
Po        Punctuation, Other
Ps        Punctuation, Open
S         Symbol
Sc        Symbol, Currency
Sk        Symbol, Modifier
Sm        Symbol, Math
So        Symbol, Other
Z         Separator
Zl        Separator, Line
Zp        Separator, Paragraph
Zs        Separator, Space


cypunct.unicode_classes.COMMON_SEPARATORS is the union of the C, P, S, and Z
frozensets. I have found it personally useful when splitting text for natural
language processing applications.
If you don’t specify a frozenset for Cypunct to use, then Cypunct will
default to COMMON_SEPARATORS.
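To get a feel for which characters COMMON_SEPARATORS covers, the standard unicodedata module can classify a character by its general category. The sketch below is an approximation under stated assumptions: is_common_separator and tokens are hypothetical names, the interpreter's Unicode tables may be a different version from the data Cypunct ships, and unicodedata reports unassigned code points as Cn, which a module generated from UnicodeData.txt may not include.

```python
import unicodedata

# Major categories that COMMON_SEPARATORS is described as combining.
SEPARATOR_MAJOR_CLASSES = frozenset("CPSZ")


def is_common_separator(ch):
    """Return True if ch has a general category starting with C, P, S or Z."""
    return unicodedata.category(ch)[0] in SEPARATOR_MAJOR_CLASSES


def tokens(text):
    """Yield maximal runs of characters that are not common separators."""
    run = []
    for ch in text:
        if is_common_separator(ch):
            if run:
                yield "".join(run)
                run = []
        else:
            run.append(ch)
    if run:
        yield "".join(run)


if __name__ == "__main__":
    print(list(tokens("Price: $5\n(approx.)")))
```

Unlike the split examples earlier, this generator never yields empty strings, which is often what a tokenizer for natural language processing wants.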


Updating Unicode data
Currently, cypunct.unicode_classes is a Python module autogenerated from a
UnicodeData.txt file. The autogeneration script exists in
make_punctuation_file.py.
Most Cypunct users will not need to concern themselves with this, but this is important
to know if you are experiencing Unicode bugs or want to contribute to Cypunct.
The current UnicodeData.txt is from ftp://ftp.unicode.org/Public/10.0.0/ucd/UnicodeData.txt.
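For readers curious what such generation involves, here is a minimal sketch of turning UnicodeData.txt lines into per-category frozensets. It is not the code in make_punctuation_file.py: the function name parse_unicode_data is hypothetical and the output layout is an assumption. It relies only on the published file format, where fields are separated by semicolons, field 0 is the code point in hex, field 2 is the general category, and large blocks appear as a pair of "<..., First>" and "<..., Last>" lines. It targets Python 3.

```python
from collections import defaultdict


def parse_unicode_data(lines):
    """Map category names (such as 'P' and 'Po') to frozensets of characters."""
    classes = defaultdict(set)
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the block starts; the next line closes it.
            range_start = code_point
            continue
        if name.endswith(", Last>") and range_start is not None:
            code_points = range(range_start, code_point + 1)
            range_start = None
        else:
            code_points = (code_point,)
        for cp in code_points:
            ch = chr(cp)
            classes[category].add(ch)     # two-letter class, e.g. 'Po'
            classes[category[0]].add(ch)  # major class, e.g. 'P'
    return {key: frozenset(chars) for key, chars in classes.items()}


SAMPLE = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
"""

if __name__ == "__main__":
    parsed = parse_unicode_data(SAMPLE.splitlines())
    print(sorted(parsed))
```

The sample lines follow the layout of the real file; to process a full copy you would pass the open file object in place of SAMPLE.splitlines().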


Frequently Asked Questions (FAQ)
Q: I got an installation error involving
“pkg_resources.VersionConflict (setuptools xx.xx.xx”.
How do I fix this?
You have a very old version of setuptools, and we won’t be able to
compile our Cython extension with it. Run
pip install --upgrade setuptools and try installing Cypunct again.
Q: Wouldn’t this be way faster if it were written in pure C?
Yes, it would. I’m too lazy to hand-code a C extension for CPython, but it’s on my to-do list.
Right now, Cypunct is “fast enough”, and I can move on to other things in my
daily life.
However, if you want to take on the challenge of rewriting Cypunct in C with
exactly the same functionality as the current Cython version, I’ll send you $100 USD.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
