pycld3 0.22

pycld3
Python bindings to the Compact Language Detector v3 (CLD3).







Newer Alternative: gcld3
Note: Since the original publication of pycld3, Google's cld3 authors have published the Python package gcld3, which provides official Python bindings built with pybind11. Please check that project out, as it is part of the canonical cld3 repository and will likely stay in closer lockstep with cld3 changes over time.
Overview
This package contains Python bindings (via Cython) to Google's CLD3 library.
>>> import cld3
>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

The library outputs BCP-47-style language codes. For some languages, output is differentiated by script. Language and script names are taken from Unicode CLDR. The library supports over 100 languages/scripts; see the full list of supported languages/scripts in Google's CLD3 documentation.
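Because some codes carry a script suffix, downstream code may want the language and script as separate values. The following is a minimal sketch, not part of pycld3: split_language_code is a hypothetical helper, and it assumes a script-differentiated code has the form language-Script (for instance bg-Latn).

```python
def split_language_code(code):
    """Split a BCP-47-style code into (language, script).

    Hypothetical helper, not part of pycld3. Assumes at most one
    "-Script" suffix, e.g. "bg-Latn"; script is None when absent.
    """
    language, _, script = code.partition("-")
    return language, (script or None)


print(split_language_code("zh"))
print(split_language_code("bg-Latn"))
```

With pycld3 installed, the language attribute of a prediction could be passed straight to this helper.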
Installing with Wheels: Supported Versions and Platforms
This project supports CPython versions 3.6 through 3.9.
We publish wheels for the following matrix:

MacOS: CPython 3.6 through 3.9
Linux: CPython 3.6 through 3.9 (manylinux1)

The wheels for both MacOS and manylinux1 include the external protobuf library, copied into the wheel itself via auditwheel (Linux) or delocate (MacOS), so that you won't need to install any extra non-PyPI dependencies.
If you are installing on one of the variants listed above, you should not need to have protoc or libprotobuf installed:
python -m pip install -U pycld3
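If you are unsure whether your environment falls inside the matrix above, a rough check can be scripted. This is only a sketch: wheel_likely_available is a hypothetical helper, and it checks just the interpreter, Python version, and operating system. It does not check CPU architecture or libc, so it will still return True on hosts such as Alpine Linux or i686 where a source build is needed.

```python
import platform
import sys


def wheel_likely_available(version=None, system=None, implementation=None):
    """Return True if the interpreter matches the wheel matrix above.

    Hypothetical helper. Arguments default to the running interpreter;
    they can be overridden to test other combinations.
    """
    if version is None:
        version = sys.version_info
    if system is None:
        system = platform.system()
    if implementation is None:
        implementation = platform.python_implementation()
    major_minor = tuple(version[:2])
    return (
        implementation == "CPython"
        and (3, 6) <= major_minor <= (3, 9)
        and system in {"Darwin", "Linux"}  # MacOS reports "Darwin"
    )


print(wheel_likely_available())
```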

Installing from Source: Prerequisites
If you are not on a platform variant that is eligible to use a wheel, you may still be able to use pycld3 via its source distribution (tar.gz), but a bit more work is required to install.
Namely, you'll also need:

the Protobuf compiler (the protoc executable)
the Protobuf development headers and libprotoc library
a compiler, preferably g++

Please consult the official protobuf repository for information on installing Protobuf.
That repository contains an installation README that covers installation on Windows and Unix.
If for whatever reason you are on a Unix host but unable to use the wheels (for instance, if you have an i686 architecture), here is a quick-and-dirty guide to installing.
Debian/Ubuntu
sudo apt-get update -y
sudo apt-get install -y --no-install-recommends \
g++ \
protobuf-compiler \
libprotobuf-dev
python -m pip install -U pycld3

Alpine Linux
Note: Alpine Linux does not support PyPI wheels as of April 2020. The steps below are mandatory on Alpine Linux because you will need to install from the source distribution. If the situation permits, using a Debian distro should be much easier (and faster).
apk --update add g++ protobuf protobuf-dev
python -m pip install -U pycld3

CentOS/RHEL
Install from source, as root/UID 0:
sudo su -
set -ex
pushd /opt
PROTOBUF_VERSION='3.11.4'
yum update -y
yum install -y autoconf automake gcc-c++ glibc-headers gzip libtool make python3-devel zlib-devel
curl -Lo /opt/protobuf.tar.gz \
"https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOBUF_VERSION}/protobuf-cpp-${PROTOBUF_VERSION}.tar.gz"
tar -xzvf protobuf.tar.gz
rm -f protobuf.tar.gz
pushd "protobuf-${PROTOBUF_VERSION}"
./configure --with-zlib --disable-debug && make && make install && ldconfig --verbose
popd && rm -rf "protobuf-${PROTOBUF_VERSION}" && popd && set +ex

python -m pip install -U pycld3

Note: the steps above are for CentOS 8. For earlier versions, you may need to replace:

gcc-c++ with g++
python3-devel with python-devel

MacOS/Homebrew
brew update
brew upgrade protobuf || brew install -v protobuf
python -m pip install -U pycld3

Windows
Please consult Protobuf's
C++ Installation - Windows
section for help with installing Protobuf on Windows.
If you would like to help contribute Windows wheels (preferably as a job within the project's
CI/CD pipelines), please file an issue.
Usage
cld3 exports two module-level functions, get_language() and get_frequent_languages():
>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
... "This piece of text is in English. Този текст е на Български.",
... num_langs=3
... ):
... print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
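The predictions can be post-processed like any other sequence. The sketch below assumes the fields shown in the output above (language, probability, is_reliable, proportion) are available as attributes; to stay runnable without pycld3 installed, it uses a stand-in namedtuple named Prediction populated with the values printed above. Both Prediction and reliable_languages are illustrative names, not part of the library.

```python
from collections import namedtuple

# Stand-in mirroring the fields of the LanguagePrediction output above.
Prediction = namedtuple(
    "Prediction", ["language", "probability", "is_reliable", "proportion"]
)


def reliable_languages(predictions):
    """Return language codes of reliable predictions, largest proportion first."""
    reliable = [p for p in predictions if p.is_reliable]
    reliable.sort(key=lambda p: p.proportion, reverse=True)
    return [p.language for p in reliable]


# Values copied from the get_frequent_languages() output above.
sample = [
    Prediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    Prediction("en", 0.9999790191650391, True, 0.4146341383457184),
]
print(reliable_languages(sample))
```

With pycld3 installed, the result of cld3.get_frequent_languages(text, num_langs=3) could be passed in place of sample.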

FAQ
cld3 incorrectly detects my input. How can I fix this?
A first resort is to preprocess (clean) your input text based on conditions specific to your program.
A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2)
does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.
Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:
>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme
(http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
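The same idea can be wrapped in a reusable cleaning step. The sketch below is one possible variation, not the project's own code: clean_for_detection is a hypothetical helper, the URL pattern is a variant of the one above that stops at the next whitespace instead of consuming the rest of the line, and the email pattern is a deliberately loose assumption rather than a full address validator.

```python
import re

# Variant of the URL regex above: the trailing part stops at whitespace
# so that text following the URL is kept.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)
# Loose email pattern (an assumption; not a complete validator).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_for_detection(text):
    """Remove email addresses and URLs, then collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
print(clean_for_detection(s))
```

The cleaned string can then be passed to cld3.get_language() as in the example above.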
In some other cases, you cannot fix the incorrect detection.
Language detection algorithms in general may perform poorly with very short inputs.
Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless
of what library you are using.
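One way to act on this limitation is to refuse to report a language for inputs that are too short or for predictions that are not marked reliable. The following is a sketch under stated assumptions: trusted_language is a hypothetical helper, the min_chars and min_probability thresholds are arbitrary illustrative values, and the detector is passed in as a callable so the example runs without pycld3 installed.

```python
from collections import namedtuple

# Stand-in mirroring the fields of the LanguagePrediction output shown earlier.
Prediction = namedtuple(
    "Prediction", ["language", "probability", "is_reliable", "proportion"]
)


def trusted_language(text, detect, min_chars=20, min_probability=0.9):
    """Return a language code only when the prediction looks trustworthy.

    `detect` is any callable returning an object with `language`,
    `probability`, and `is_reliable` attributes. The thresholds are
    illustrative assumptions, not recommendations from CLD3.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None  # too short to trust any detector
    prediction = detect(stripped)
    # Defensive check in case the detector returns nothing.
    if prediction is None or not prediction.is_reliable:
        return None
    if prediction.probability < min_probability:
        return None
    return prediction.language


def fake_detect(text):
    # Fake detector used only for demonstration.
    return Prediction("en", 0.99, True, 1.0)


print(trusted_language("hi", fake_detect))
print(trusted_language("This is a reasonably long sentence.", fake_detect))
```

With pycld3 installed, cld3.get_language could be passed as the detect argument.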
Please remember that, at the end of the day, this project is just a Python wrapper around the CLD3 C++ library, which does the actual heavy lifting.
I'm seeing an error during pip installation. How can I fix this?
First, please make sure that you have read the installation section and that you have
installed Protobuf if necessary.
If that doesn't help, please file an issue in this repository.
The build process for this project is somewhat complex because it involves both Cython and Protobuf, but I do my best
to make it work everywhere possible.
Protobuf is installed, but I'm still seeing "cannot open shared object file"
If you've installed Protobuf, but are seeing an error such as:
ImportError: libprotobuf.so.22: cannot open shared object file: No such file or directory

This likely means that Python is not finding the libprotobuf shared object,
possibly because ldconfig didn't do what it was supposed to.
You may need to tell it where to look.
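You can find the directory that holds the Protobuf libraries by searching for libprotoc.so, which is normally installed alongside libprotobuf: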
$ find /usr -name 'libprotoc.so' \( -type l -o -type f \)
/usr/local/lib/libprotoc.so

Then, you can add the directory containing this file to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH="$(dirname "$(find /usr -name 'libprotoc.so' \( -type l -o -type f \) | head -n 1)"):$LD_LIBRARY_PATH"

You can quickly test that this worked:
$ python -c 'import cld3; print(cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度"))'
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
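To check whether the dynamic loader can locate libprotobuf at all, independently of cld3, the standard library's ctypes.util.find_library can be used. This is only a diagnostic sketch: find_protobuf_library is a hypothetical helper name, and a None result means the lookup failed rather than identifying the cause.

```python
import ctypes.util


def find_protobuf_library(name="protobuf"):
    """Return the library name the loader resolves for `name`, or None.

    Thin wrapper around ctypes.util.find_library; on Linux the lookup
    also consults LD_LIBRARY_PATH if other methods fail (Python 3.6+).
    """
    return ctypes.util.find_library(name)


result = find_protobuf_library()
if result is None:
    print("libprotobuf not found by the loader")
else:
    print("loader resolves libprotobuf as:", result)
```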

Authors
This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at
LICENSES/CLD3_LICENSE.
This repository is a combination of changes introduced by various forks of google/cld3 by the following people:

Johannes Baiter (@jbaiter)
Elizabeth Myers (@Elizafox)
Witold Bołt (@houp)
Alfredo Luque (@iamthebot)
WISESIGHT (@wisesight)
RNogales (@RNogales94)
Brad Solomon (@bsolomon1124)

License:

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
