cleanlab-cli 0.1.14

Creator: coderz1093


cleanlab-cli
Command line interface for all things Cleanlab Studio.
It currently supports generating dataset schemas, uploading
datasets to Cleanlab Studio, and downloading cleansets from Cleanlab Studio.
Installation
You can install the Cleanlab Studio CLI from PyPI with:
pip install cleanlab-cli

If you already have the CLI installed and wish to upgrade to the latest version, run:
pip install --upgrade cleanlab-cli

Workflow
Uploading datasets to Cleanlab Studio is a two-step process.

1. The CLI generates a schema describing the dataset and its data and feature types, which you then verify.
2. Based on this schema, the dataset is parsed and uploaded to Cleanlab Studio.

Upload a dataset
To upload a dataset without first generating a schema (i.e. Cleanlab will suggest one for you):
cleanlab dataset upload -f [dataset filepath]
You will be asked to "Specify your dataset modality (text, tabular):".

Enter text to only find label errors based on a single column of text in your dataset.
Enter tabular to find data and label issues based on any subset of the column features.

To upload a dataset with a schema:
cleanlab dataset upload -f [dataset filepath] -s [schema filepath]
To resume uploading a dataset whose upload was interrupted:
cleanlab dataset upload -f [dataset filepath] --id [dataset ID]
A dataset ID is generated and printed to the terminal the first time the dataset is uploaded. It can also be accessed by
visiting https://app.cleanlab.ai/datasets and selecting 'Resume' for the relevant dataset.
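If you script uploads rather than typing them, the three invocations above can be assembled by one small helper. The sketch below only builds the argument list and does not run the CLI. The function name `build_upload_command` and the file names are hypothetical, while the flags (-f, -s, --id) are the ones shown above. This README does not say whether -s and --id may be combined, so treat that combination as untested.

```python
import shlex
from typing import List, Optional


def build_upload_command(
    dataset_path: str,
    schema_path: Optional[str] = None,
    dataset_id: Optional[str] = None,
) -> List[str]:
    """Assemble a `cleanlab dataset upload` argument list (hypothetical helper)."""
    command = ["cleanlab", "dataset", "upload", "-f", dataset_path]
    if schema_path is not None:
        # Upload against a schema that was generated and reviewed earlier.
        command += ["-s", schema_path]
    if dataset_id is not None:
        # Resume an interrupted upload using the ID printed on the first attempt.
        command += ["--id", dataset_id]
    return command


if __name__ == "__main__":
    # Placeholder file names, not files shipped with the CLI.
    print(shlex.join(build_upload_command("reviews.csv", schema_path="schema.json")))
```

The returned list can be passed to `subprocess.run` once you are happy with the arguments.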
Generate dataset schema
To generate a dataset schema (prior to uploading your dataset):
cleanlab dataset schema generate -f [dataset filepath]

For "Id column: ", enter the name of the column in your dataset that contains the ID of each row.
For "Modality (text, tabular): ", enter text to only find label errors based on a single column of text;
otherwise enter tabular to find data and label issues based on any subset of the column features.

To validate an existing schema, i.e. check that it is complete, well-formatted, and
has data types with sensible feature types:
cleanlab dataset schema validate -s [schema filepath]
You may then wish to inspect the generated schema to check that the fields and metadata are correct.
Download clean labels
To download clean labels (i.e. labels that have been fixed through the Cleanlab Studio interface):
cleanlab cleanset download --id [cleanset ID]
To download clean labels and combine them with your local dataset:
cleanlab cleanset download --id [cleanset ID] -f [dataset filepath]
Commands
cleanlab login authenticates you
Authenticates you when uploading datasets to Cleanlab Studio. Pass in your API key using --key [API key]. Your API key
can be accessed at https://app.cleanlab.ai/upload.
cleanlab dataset schema generate generates dataset schemas
Generates a schema based on your dataset. Specify your target dataset with --filepath [dataset filepath]. You will be
prompted to save the generated schema JSON and to specify a save location. The save location can also be supplied up
front using --output [output filepath].
cleanlab dataset schema validate validates a schema JSON file
Validates a schema JSON file, checking that a schema is complete, well-formatted, and
has data types with sensible feature types. Specify your target schema
with --schema [schema filepath].
You may also validate an existing schema with respect to a dataset (-d [dataset filepath]), i.e. all previously
mentioned checks and the additional check that all fields in the schema are present in the dataset.
cleanlab dataset upload uploads your dataset
Uploads your dataset to Cleanlab Studio. Specify your target dataset with --filepath [dataset filepath]. You will be
prompted for further details about the dataset's modality and ID column. These may be supplied to the command
with --modality [modality] and --id-column [name of ID column]. You may also specify a custom dataset name
with --name [custom dataset name].
After uploading your dataset, you will be prompted to save the list of dataset issues (if any) encountered during the
upload process. These issues include missing IDs, duplicate IDs, missing values, and values whose types do not match the
schema. You may specify the save location with --output [output filepath].
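To catch some of these issues before uploading, you can run a rough local check. The sketch below is not the CLI's own logic: the function name `find_row_issues`, the issue keys, and the rule that empty strings count as missing are all assumptions made for illustration. It covers missing IDs, duplicate IDs, and missing values, but not type mismatches against a schema.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def find_row_issues(rows: List[Row], id_column: str) -> Dict[str, List[int]]:
    """Group 0-based row indices by issue kind (hypothetical helper)."""
    issues: Dict[str, List[int]] = {
        "missing_id": [],
        "duplicate_id": [],
        "missing_value": [],
    }
    # Count only non-empty IDs so that blank IDs are reported as missing, not duplicated.
    id_counts = Counter(row.get(id_column) for row in rows if row.get(id_column))
    for index, row in enumerate(rows):
        row_id = row.get(id_column)
        if not row_id:
            issues["missing_id"].append(index)
        elif id_counts[row_id] > 1:
            issues["duplicate_id"].append(index)
        # Assumption: None and empty strings in non-ID columns are missing values.
        if any(value in (None, "") for key, value in row.items() if key != id_column):
            issues["missing_value"].append(index)
    return issues
```

Rows read with `csv.DictReader` can be passed in directly, since it yields one dictionary per row.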
cleanlab cleanset download downloads Cleanlab columns from your cleanset
Cleansets are initialized through the Cleanlab Studio interface. In a cleanset, users can inspect their dataset and
verify their labels. Clean labels are the labels after this set of manual fixes has been applied.
This command downloads the clean labels and saves them locally as a .csv, .xls/.xlsx, or .json file with the columns id
and clean_label. Include --filepath [dataset filepath] to combine the clean labels with the original dataset as
a new column clean_label; the combined dataset is written to --output [output filepath]. Include the --all flag to
include all Cleanlab columns (issue, label quality, suggested label, and clean label) instead of only the clean
label column.
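If you download only the id and clean_label columns and want to join them to your dataset yourself, the combination step amounts to a lookup on the ID column. The following is a minimal sketch of that idea, not the CLI's implementation. The function name `add_clean_labels` is hypothetical, and leaving clean_label empty for rows without a match is a choice made here; the CLI may behave differently.

```python
from typing import Dict, List


def add_clean_labels(
    dataset_rows: List[Dict[str, str]],
    clean_rows: List[Dict[str, str]],
    id_column: str,
) -> List[Dict[str, str]]:
    """Return copies of the dataset rows with a clean_label column attached."""
    # The downloaded file is described as having the columns id and clean_label.
    clean_by_id = {row["id"]: row["clean_label"] for row in clean_rows}
    merged = []
    for row in dataset_rows:
        combined = dict(row)  # copy, so the caller's rows are left unchanged
        # Assumption: rows without a matching id get an empty clean_label.
        combined["clean_label"] = clean_by_id.get(row[id_column], "")
        merged.append(combined)
    return merged
```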
Dataset format
Cleanlab currently only supports text and tabular dataset modalities.
(If your dataset contains both text and tabular data, treat it as tabular.)
The accepted dataset file types are: .csv, .json, and .xls/.xlsx.
Below are some examples of how to format your dataset depending on modality and file type.
Every dataset must have an ID column (i.e. a column containing identifiers that uniquely identify each row) and a
label column (for the prediction task).
Apart from the reserved column name clean_label, you are free to name the columns in your dataset in any way you
want.
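The column requirements above can be checked against a file's header before you upload. This is a minimal sketch under stated assumptions: the function name `check_columns` and its messages are hypothetical, and the CLI may apply additional checks of its own.

```python
from typing import List

RESERVED_COLUMN = "clean_label"  # the reserved name mentioned above


def check_columns(columns: List[str], id_column: str, label_column: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if id_column not in columns:
        problems.append(f"ID column {id_column!r} is missing")
    if label_column not in columns:
        problems.append(f"label column {label_column!r} is missing")
    if RESERVED_COLUMN in columns:
        problems.append(f"column name {RESERVED_COLUMN!r} is reserved")
    return problems
```

For a CSV file, the `columns` argument can be taken from `csv.DictReader(...).fieldnames`.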

Tabular


.csv, .xls/.xlsx



| flower_id | width | length | color | species |
|-----------|-------|--------|-------|---------|
| flower_01 | 4     | 3      | red   | rose    |
| flower_02 | 7     | 2      | white | lily    |





.json
{
  "rows": [
    {
      "flower_id": "flower_01",
      "width": 4,
      "length": 3,
      "color": "red",
      "species": "rose"
    },
    {
      "flower_id": "flower_02",
      "width": 7,
      "length": 2,
      "color": "white",
      "species": "lily"
    }
  ]
}
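If your data lives in memory as a list of records, both layouts can be produced with the Python standard library. The sketch below assumes every record has the same keys. The helper names `rows_to_json` and `rows_to_csv` are hypothetical and not part of the CLI.

```python
import csv
import io
import json
from typing import Any, Dict, List


def rows_to_json(rows: List[Dict[str, Any]]) -> str:
    """Wrap the records in a top-level "rows" list, as in the .json example."""
    return json.dumps({"rows": rows}, indent=2)


def rows_to_csv(rows: List[Dict[str, Any]]) -> str:
    """Write one header line followed by one line per record."""
    buffer = io.StringIO()
    # Column order follows the key order of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    flowers = [
        {"flower_id": "flower_01", "width": 4, "length": 3, "color": "red", "species": "rose"},
        {"flower_id": "flower_02", "width": 7, "length": 2, "color": "white", "species": "lily"},
    ]
    print(rows_to_csv(flowers))
    print(rows_to_json(flowers))
```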




Text


.csv, .xls/.xlsx



| review_id | review                       | sentiment |
|-----------|------------------------------|-----------|
| review_1  | The sales rep was fantastic! | positive  |
| review_2  | He was a bit wishy-washy.    | negative  |





.json
{
  "rows": [
    {
      "review_id": "review_1",
      "review": "The sales rep was fantastic!",
      "sentiment": "positive"
    },
    {
      "review_id": "review_2",
      "review": "He was a bit wishy-washy.",
      "sentiment": "negative"
    }
  ]
}



Schema
To specify the column types in your dataset, create a JSON file named schema.json. We recommend
using cleanlab dataset schema generate to generate an initial schema and editing from there.
Your schema file should be formatted as follows:
{
  "metadata": {
    "id_column": "tweet_id",
    "modality": "text",
    "name": "Tweets.csv"
  },
  "fields": {
    "tweet_id": {
      "data_type": "string",
      "feature_type": "identifier"
    },
    "sentiment": {
      "data_type": "string",
      "feature_type": "categorical"
    },
    "sentiment_confidence": {
      "data_type": "float",
      "feature_type": "numeric"
    },
    "retweet_count": {
      "data_type": "integer",
      "feature_type": "numeric"
    },
    "text": {
      "data_type": "string",
      "feature_type": "text"
    },
    "is_retweet": {
      "data_type": "boolean",
      "feature_type": "boolean"
    },
    "tweet_created": {
      "data_type": "string",
      "feature_type": "datetime"
    }
  },
  "version": "0.1.12"
}

This is the schema of a hypothetical dataset Tweets.csv that contains tweets, where the column tweet_id contains a
unique identifier for each record. Each column in the dataset is specified under fields with its data type and feature
type.
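After hand-editing a schema, a quick structural check can catch slips before you run cleanlab dataset schema validate. The sketch below is not that command's logic. The function name `check_schema_structure` is hypothetical, and treating metadata, fields, and version as required top-level keys is an assumption based on the example above.

```python
from typing import Any, Dict, List


def check_schema_structure(schema: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    # Assumption: these three top-level keys are required, as in the example schema.
    for key in ("metadata", "fields", "version"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    fields = schema.get("fields", {})
    id_column = schema.get("metadata", {}).get("id_column")
    if id_column not in fields:
        problems.append(f"id_column {id_column!r} is not listed under fields")
    for name, spec in fields.items():
        for attribute in ("data_type", "feature_type"):
            if attribute not in spec:
                problems.append(f"field {name!r} lacks {attribute}")
    return problems
```

To check a file, load it with `json.load` and pass the resulting dictionary to the function. Note that `json.load` rejects trailing commas but silently keeps only the last of any duplicated keys.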
Data types and Feature types
Data type refers to the type of the field's values: string, integer, float, or boolean.
Note that the integer type is strict, meaning floats will be rejected. In contrast, the float type is lenient,
meaning integers are accepted. Users should select the float type if the field may include float values. Note too that
integers can have categorical and identifier feature types, whereas floats cannot.
For booleans, the accepted values are: true/false, t/f, yes/no, and 1/0.
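The strict and lenient rules above can be illustrated with a small checker for raw string values, such as those read from a CSV file. This is a sketch of the stated rules, not the CLI's parser. The function name `matches_data_type` is hypothetical, and ignoring letter case and surrounding whitespace is an assumption.

```python
BOOLEAN_VALUES = {"true", "false", "t", "f", "yes", "no", "1", "0"}


def matches_data_type(value: str, data_type: str) -> bool:
    """Check a raw string value against one of the four data types."""
    text = value.strip()
    if data_type == "string":
        return True
    if data_type == "integer":
        # Strict: "4.5" fails because int() rejects float literals.
        try:
            int(text)
        except ValueError:
            return False
        return True
    if data_type == "float":
        # Lenient: "4" passes because float() accepts integer literals.
        try:
            float(text)
        except ValueError:
            return False
        return True
    if data_type == "boolean":
        # Assumption: matching is case-insensitive.
        return text.lower() in BOOLEAN_VALUES
    raise ValueError(f"unknown data type: {data_type}")
```

For example, `matches_data_type("4.5", "integer")` is False while `matches_data_type("4", "float")` is True, which is why a column that may contain any float values should be declared as float.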
Feature type refers to the secondary type of the field, relating to how it is used in a machine learning model, such
as whether it is:

a categorical value
a numeric value
a datetime value
a boolean value
text
an identifier — a string / integer that identifies some entity

Some feature types can only correspond to specific data types. The list of possible feature types for each data type is
shown below:



| Data type | Feature type                               |
|-----------|--------------------------------------------|
| string    | text, categorical, datetime, identifier    |
| integer   | categorical, datetime, identifier, numeric |
| float     | datetime, numeric                          |
| boolean   | boolean                                    |
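The pairings in this table translate directly into a lookup, which can be handy when reviewing a hand-edited schema. The sketch below copies the table into a dictionary. The names `ALLOWED_FEATURE_TYPES` and `is_compatible` are hypothetical and not part of the CLI.

```python
from typing import Dict, Set

# Allowed feature types for each data type, copied from the table above.
ALLOWED_FEATURE_TYPES: Dict[str, Set[str]] = {
    "string": {"text", "categorical", "datetime", "identifier"},
    "integer": {"categorical", "datetime", "identifier", "numeric"},
    "float": {"datetime", "numeric"},
    "boolean": {"boolean"},
}


def is_compatible(data_type: str, feature_type: str) -> bool:
    """Return True if the feature type is allowed for the data type."""
    # Unknown data types have no allowed feature types.
    return feature_type in ALLOWED_FEATURE_TYPES.get(data_type, set())
```

For instance, `is_compatible("float", "categorical")` is False, matching the earlier note that floats cannot have categorical or identifier feature types.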



The datetime type should be used for datetime strings, e.g. "2015-02-24 11:35:52 -0800", and Unix timestamps (which
will be integers or floats). Datetime values must be parsable
by pandas.to_datetime().
version indicates the version of the Cleanlab CLI package used to generate the schema. The current Cleanlab
schema version is 0.1.14.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
