auto-gptq 0.7.1

Last updated:

0 purchases

auto-gptq 0.7.1 Image
auto-gptq 0.7.1 Images
Add to Cart

Description:

autogptq 0.7.1

AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).










English |
δΈ­ζ–‡


News or Update

2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with Marlin int4*fp16 matrix multiplication kernel support, with the argument use_marlin=True when loading models.
2023-08-23 - (News) - πŸ€— Transformers, optimum and peft have integrated auto-gptq, so now running and training GPTQ models can be more available to everyone! See this blog and it's resources for more details!

For more histories please turn to here
Performance Comparison
Inference Speed

The result is generated using this script, batch size of input is 1, decode strategy is beam search and enforce the model to generate 512 tokens, speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the setup that can gain the fastest inference speed.




model
GPU
num_beams
fp16
gptq-int4




llama-7b
1xA100-40G
1
18.87
25.53


llama-7b
1xA100-40G
4
68.79
91.30


moss-moon 16b
1xA100-40G
1
12.48
15.25


moss-moon 16b
1xA100-40G
4
OOM
42.67


moss-moon 16b
2xA100-40G
1
06.83
06.78


moss-moon 16b
2xA100-40G
4
13.10
10.80


gpt-j 6b
1xRTX3060-12G
1
OOM
29.55


gpt-j 6b
1xRTX3060-12G
4
OOM
47.36



Perplexity
For perplexity comparison, you can turn to here and here
Installation
AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:



CUDA/ROCm version
Installation
Built against PyTorch




CUDA 11.8
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
2.2.0+cu118


CUDA 12.1
pip install auto-gptq
2.2.0+cu121


ROCm 5.7
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
2.2.0+rocm5.7



AutoGPTQ can be installed with the Triton dependency with pip install auto-gptq[triton] in order to be able to use the Triton backend (currently only supports linux, no 3-bits quantization).
For older AutoGPTQ, please refer to the previous releases installation table.
Install from source
Clone the source code:
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

A few packages are required in order to build from source: pip install numpy gekko pandas.
Then, install locally from source:
pip install -vvv -e .

You can set BUILD_CUDA_EXT=0 to disable pytorch extension building, but this is strongly discouraged as AutoGPTQ then falls back on a slow python implementation.
On ROCm systems
To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:
ROCM_VERSION=5.6 pip install -vvv -e .

The compilation can be speeded up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.
For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
Quick Tour
Quantization and Inference

warning: this is just a showcase of the usage of basic apis in AutoGPTQ, which uses only one sample to quantize a much small model, quality of quantized model using such little samples may not good.

Below is an example for the simplest use of auto_gptq to quantize a model and inference after quantization:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
tokenizer(
"auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
)
]

quantize_config = BaseQuantizeConfig(
bits=4, # quantize model to 4-bit
group_size=128, # it is recommended to set the value to 128
desc_act=False, # set to False can significantly speed up inference but the perplexity may slightly bad
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, Login first via huggingface-cli login.
# or pass explcit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please reference to this script
Customize Model

Below is an example to extend `auto_gptq` to support `OPT` model, as you will see, it's very easy:
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
# chained attribute name of transformer layer block
layers_block_name = "model.decoder.layers"
# chained attribute names of other nn modules that in the same level as the transformer layer block
outside_layer_modules = [
"model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
"model.decoder.project_in", "model.decoder.final_layer_norm"
]
# chained attribute names of linear layers in transformer layer module
# normally, there are four sub lists, for each one the modules in it can be seen as one operation,
# and the order should be the order when they are truly executed, in this case (and usually in most cases),
# they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
inside_layer_modules = [
["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
["self_attn.out_proj"],
["fc1"],
["fc2"]
]

After this, you can use OPTGPTQForCausalLM.from_pretrained and other methods as shown in Basic.

Evaluation on Downstream Tasks
You can use tasks defined in auto_gptq.eval_tasks to evaluate model's performance on specific down-stream task before and after quantization.
The predefined tasks support all causal-language-models implemented in πŸ€— transformers and in this project.

Below is an example to evaluate `EleutherAI/gpt-j-6b` on sequence-classification task using `cardiffnlp/tweet_sentiment_multilingual` dataset:
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
0: "negative",
1: "neutral",
2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
text_data = samples["text"]
label_data = samples["label"]

new_samples = {"prompt": [], "label": []}
for text, label in zip(text_data, label_data):
prompt = TEMPLATE.format(labels=LABELS, text=text)
new_samples["prompt"].append(prompt)
new_samples["label"].append(ID2LABEL[label])

return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
model=model,
tokenizer=tokenizer,
classes=LABELS,
data_name_or_path=DATASET,
prompt_col_name="prompt",
label_col_name="label",
**{
"num_samples": 1000, # how many samples will be sampled to evaluation
"sample_max_len": 1024, # max tokens for each sample
"block_max_len": 2048, # max tokens for each data block
# function to load dataset, one must only accept data_name_or_path as input
# and return datasets.Dataset
"load_fn": partial(datasets.load_dataset, name="english"),
# function to preprocess dataset, which is used for datasets.Dataset.map,
# must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
"preprocess_fn": ds_refactor_fn,
# truncate label when sample's length exceed sample_max_len
"truncate_prompt": False
}
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
task.run(
generation_config=GenerationConfig(
num_beams=3,
num_return_sequences=3,
do_sample=True
)
)
)


Learn More
tutorials provide step-by-step guidance to integrate auto_gptq with your own project and some best practice principles.
examples provide plenty of example scripts to use auto_gptq in different ways.
Supported Models

you can use model.config.model_type to compare with the table below to check whether the model you use is supported by auto_gptq.
for example, model_type of WizardLM, vicuna and gpt4all are all llama, hence they are all supported by auto_gptq.




model type
quantization
inference
peft-lora
peft-ada-lora
peft-adaption_prompt




bloom
βœ…
βœ…
βœ…
βœ…



gpt2
βœ…
βœ…
βœ…
βœ…



gpt_neox
βœ…
βœ…
βœ…
βœ…
βœ…requires this peft branch


gptj
βœ…
βœ…
βœ…
βœ…
βœ…requires this peft branch


llama
βœ…
βœ…
βœ…
βœ…
βœ…


moss
βœ…
βœ…
βœ…
βœ…
βœ…requires this peft branch


opt
βœ…
βœ…
βœ…
βœ…



gpt_bigcode
βœ…
βœ…
βœ…
βœ…



codegen
βœ…
βœ…
βœ…
βœ…



falcon(RefinedWebModel/RefinedWeb)
βœ…
βœ…
βœ…
βœ…




Supported Evaluation Tasks
Currently, auto_gptq supports: LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more Tasks will come soon!
Running tests
Tests can be run with:
pytest tests/ -s

FAQ
Which kernel is used by default?
AutoGPTQ defaults to using exllamav2 int4*fp16 kernel for matrix multiplication.
How to use Marlin kernel?
Marlin is an optimized int4 * fp16 kernel was recently proposed at https://github.com/IST-DASLab/marlin. This is integrated in AutoGPTQ when loading a model with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
Acknowledgement

Special thanks Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing GPTQ algorithm and open source the code, and for releasing Marlin kernel for mixed precision computation.
Special thanks qwopqwop200, for code in this project that relevant to quantization are mainly referenced from GPTQ-for-LLaMa.
Special thanks to turboderp, for releasing Exllama and Exllama v2 libraries with efficient mixed precision kernels.

License:

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.