extremetext 0.8.4

extremeText is an extension of the fastText library for multi-label
classification, including extreme cases with hundreds of thousands or
millions of labels.
extremeText implements:

Probabilistic Labels Tree (PLT) loss for extreme multi-label classification with top-down hierarchical clustering (k-means) for tree building,
sigmoid loss for multi-label classification,
L2 regularization and FOBOS update for all losses,
ensemble of loss layers with bagging,
calculation of hidden (document) vector as a weighted average of the word vectors,
calculation of TF-IDF weights for words.
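To make the last two items concrete, here is a minimal pure-Python sketch of a TF-IDF weighted average of word vectors. It is a generic illustration, not extremeText's implementation: the function names `idf_weights` and `weighted_document_vector` are hypothetical, and the smoothed IDF formula is an assumption chosen for the example.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed inverse document frequency per word (formula is an assumption)."""
    n_docs = len(documents)
    # Count in how many documents each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def weighted_document_vector(doc, word_vectors, idf):
    """Average the word vectors of `doc`, weighting each word by tf * idf."""
    dim = len(next(iter(word_vectors.values())))
    hidden = [0.0] * dim
    total_weight = 0.0
    for word, tf in Counter(doc).items():
        if word not in word_vectors:
            continue  # skip out-of-vocabulary words in this sketch
        weight = tf * idf.get(word, 1.0)
        total_weight += weight
        for i, value in enumerate(word_vectors[word]):
            hidden[i] += weight * value
    if total_weight == 0.0:
        return hidden
    return [h / total_weight for h in hidden]


# Hypothetical toy corpus and 2-dimensional word vectors.
docs = [["a", "b"], ["a", "c"]]
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hidden = weighted_document_vector(docs[0], vectors, idf_weights(docs))
```

In this toy corpus the rarer word "b" receives a larger weight than "a", so the document vector leans towards the vector of "b".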

Requirements
extremeText builds on modern macOS and Linux distributions. Since it uses
C++11 features, it requires a compiler with good C++11 support. These include:

(gcc-4.8 or newer) or (clang-3.3 or newer)

You will need:

Python version 2.7 or >=3.4
NumPy & SciPy
pybind11

Installing extremeText
The easiest way to get extremeText is to use pip.
$ pip install extremetext
Installing on macOS may require setting MACOSX_DEPLOYMENT_TARGET=10.9 first:
$ export MACOSX_DEPLOYMENT_TARGET=10.9
$ pip install extremetext
The latest version of extremeText can be built from source using pip or,
alternatively, setuptools.
$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
$ pip install .
(or) $ python setup.py install
Now you can import this library with:
import extremeText

Examples
In general it is assumed that the reader already has good knowledge of
fastText/extremeText. For background, consult the main README and the
tutorials on the fastText website.
We recommend you look at the examples within the doc folder.
As with any package, you can get help on any Python function using the
help function.
For example:
>>> import extremeText
>>> help(extremeText.ExtremeText)

Help on module extremeText.ExtremeText in extremeText:

NAME
extremeText.ExtremeText

DESCRIPTION
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree. An additional grant
# of patent rights can be found in the PATENTS file in the same directory.

FUNCTIONS
load_model(path)
Load a model given a filepath and return a model object.

tokenize(text)
Given a string of text, tokenize it and return a list of tokens
[...]

IMPORTANT: Preprocessing data / encoding conventions
In general it is important to properly preprocess your data. Example
scripts in the root folder do this.
extremeText, like fastText, assumes UTF-8 encoded text. All text must be
unicode for Python 2 and str for Python 3.
The passed text will be encoded as UTF-8 by pybind11 before being passed to
the extremeText C++ library. This means it is important to use UTF-8 encoded
text when building a model. On Unix-like systems you can convert text using
iconv.
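Where iconv is not available, the same conversion can be sketched with the Python 3 standard library alone. The helper name `convert_to_utf8` and the file names in the usage lines are hypothetical, and the source encoding has to be known in advance; this sketch does not try to detect it.

```python
def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    # newline="" disables newline translation on both sides, so "\n" and
    # "\r\n" are written exactly as they were read.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Usage (hypothetical file names, source file assumed to be ISO-8859-2):
# convert_to_utf8("train.latin2.txt", "train.utf8.txt", "iso-8859-2")
```

A wrong `source_encoding` either raises `UnicodeDecodeError` or silently produces mojibake, so it is worth checking a few lines of the output by eye.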
extremeText will tokenize (split text into pieces) based on the
following ASCII characters (bytes). In particular, it is not aware of
UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of the following symbols as appropriate.

space
tab
vertical tab
carriage return
formfeed
the null character
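One way to follow this advice is to map every non-ASCII whitespace character to a plain space before handing text to the library. The sketch below uses Python 3's `str.isspace` to decide what counts as whitespace; the helper name `normalize_whitespace` is hypothetical, and characters that Python does not classify as whitespace (for example the zero-width space) are left untouched.

```python
# The ASCII separators listed above; they are already understood by the tokenizer.
ASCII_SEPARATORS = " \t\v\r\f\0"


def normalize_whitespace(text):
    """Replace whitespace other than the ASCII separators and newline with a space."""
    out = []
    for ch in text:
        if ch != "\n" and ch.isspace() and ch not in ASCII_SEPARATORS:
            out.append(" ")  # e.g. no-break space (U+00A0), em space (U+2003)
        else:
            out.append(ch)   # newline is kept because it delimits lines
    return "".join(out)


cleaned = normalize_whitespace("deep\u00a0learning\u2003rocks\n")
# cleaned == "deep learning rocks\n"
```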

The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX_LINE_SIZE constant as defined in the Dictionary header.
This means that if you have text that is not separated by newlines, such as
the fil9 dataset, it will be broken into chunks of MAX_LINE_SIZE tokens and
the EOS token will not be appended.
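The line-splitting behaviour described above can be imitated in a few lines of Python. This is a simplified illustration, not the library's C++ code: the function name `split_into_lines` is hypothetical, the values `"</s>"` and `1024` are assumptions about the EOS token and MAX_LINE_SIZE (check the Dictionary header for the real ones), and the exact boundary condition at the size limit may differ from the library.

```python
import re

EOS = "</s>"          # assumed spelling of the EOS token
MAX_LINE_SIZE = 1024  # assumed value of the constant

# The ASCII separators the tokenizer splits on (newline is handled separately).
_SEPARATORS = re.compile(r"[ \t\v\r\f\x00]+")


def split_into_lines(text, max_line_size=MAX_LINE_SIZE):
    """Group tokens into lines: a newline ends a line with EOS, the size limit without."""
    lines, current = [], []
    # The capturing group keeps each "\n" as its own piece.
    for piece in re.split(r"(\n)", text):
        if piece == "\n":
            current.append(EOS)
            lines.append(current)
            current = []
            continue
        for token in _SEPARATORS.split(piece):
            if not token:
                continue
            current.append(token)
            if len(current) >= max_line_size:
                lines.append(current)  # cut here, no EOS appended
                current = []
    if current:
        lines.append(current)
    return lines


chunks = split_into_lines("a b\nc d e f g", max_line_size=3)
# chunks == [["a", "b", "</s>"], ["c", "d", "e"], ["f", "g"]]
```

Only the first chunk ends with the EOS token, because it is the only one closed by a newline.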
The length of a token is its number of UTF-8 characters, counted by
inspecting the leading two bits of each byte to identify the continuation
bytes of a multi-byte sequence.
Knowing this is especially important when choosing the minimum and
maximum length of subwords. Further, the EOS token (as specified in the
Dictionary header) is considered a character and will not be broken into
subwords.
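The counting rule can be reproduced in Python to check how long a token is from the library's point of view. In UTF-8, continuation bytes have the form `10xxxxxx`, so every byte whose two leading bits are not `10` starts a new character. The helper name `utf8_length` is hypothetical.

```python
def utf8_length(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


n_chars = utf8_length("naïve")               # 5 characters
n_bytes = len("naïve".encode("utf-8"))       # 6 bytes, "ï" takes two
```

The character count, not the byte count, is what matters when choosing the minimum and maximum subword length.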

Reference
Please cite the work below if you use this package for extreme classification.
M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński.
*A no-regret generalization of hierarchical softmax to extreme multi-label classification*.
Advances in Neural Information Processing Systems 31, 2018.