identitychain 0.1.0


IdentityChain

The IdentityChain Framework for Code Large Language Models (Code LLMs) Evaluation. Official implementation of the ICLR 2024 paper Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain.
The IdentityChain Framework evaluates the NL-to-PL (Code Generation) Accuracy, PL-to-NL (Code Summarization) Accuracy, and the Self-Consistency across the two tasks. It also provides a fine-grained analysis of the model's performance, so you can pinpoint the exact step and problem at which the model violates self-consistency.
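At a high level, an identity chain alternates the two tasks: starting from a natural-language specification, the model generates a program, summarizes that program back into natural language, regenerates a program from the new summary, and so on for a fixed chain length; self-consistency is then judged by comparing the behavior (for example, test-case outputs) of the programs along the chain. The snippet below is only a conceptual sketch of that loop, not the package's actual API; nl_to_pl and pl_to_nl stand for the model-specific generation functions described under Usage.

def identity_chain(nl_spec, nl_to_pl, pl_to_nl, chain_length=5):
    """Conceptual sketch only: alternate code generation and code summarization."""
    nl, chain = nl_spec, []
    for _ in range(chain_length):
        pl = nl_to_pl(nl)   # NL-to-PL: generate a program from the current description
        nl = pl_to_nl(pl)   # PL-to-NL: summarize the program back into a description
        chain.append((pl, nl))
    return chain            # programs along the chain are then compared step by step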

Installation
Create and Activate a Conda Environment.
conda create -n idchain python=3.10
conda activate idchain

Install from PyPI with all Dependencies.
pip3 install identitychain
pip3 install -r requirements.txt

Install from Source with all Dependencies.
git clone https://github.com/marcusm117/IdentityChain.git
cd IdentityChain
make develop

Usage
Before the self-consistency evaluation, you need to make sure that one of the following conditions is satisfied:

Your model is an Instruction-tuned Code LLM, and it's trained on both NL-to-PL and PL-to-NL tasks.
Your model is a Foundation Code LLM, and it's trained on both completion and fill-in-the-middle (FIM) tasks (an illustrative FIM prompt is sketched below).
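For foundation models, the PL-to-NL step can be cast as a fill-in-the-middle problem, e.g. asking the model to fill in the docstring of an otherwise complete function. The snippet below is a rough illustration using StarCoder-style FIM tokens; the exact special tokens and prompt layout differ from model to model.

# Illustrative only: the model is asked to generate the missing docstring
# between the prefix and the suffix of a complete function.
fim_prompt = (
    '<fim_prefix>def add(a, b):\n    """'
    '<fim_suffix>"""\n    return a + b\n'
    '<fim_middle>'
)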

To evaluate your model using IdentityChain, you need to prepare the following:

An evaluation dataset from one of the following (or your own dataset in the same format):

EvalPlus-Mini-v0.1.6_reformatted.jsonl
EvalPlus-Mini-v0.1.10_reformatted.jsonl
MBPP-S_test_reformatted.jsonl


An NL-to-PL prompt for your model
A PL-to-NL prompt for your model
An NL-to-PL generation function for your model
A PL-to-NL generation function for your model (a minimal sketch of these two generation functions follows this list)
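The two generation functions are simply callables that take a prompt string and return the model's completion as a string. Below is a minimal sketch using the OpenAI Python client; the function names, model choice, and signatures are illustrative only and are not an interface required by the package (see the example scripts below for complete implementations).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def nl_to_pl_generate(prompt: str) -> str:
    # Code generation: natural-language specification -> program
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def pl_to_nl_generate(prompt: str) -> str:
    # Code summarization: program -> natural-language specification
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content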

See run_identity_chain_openai.py for an example of how to use IdentityChain to evaluate OpenAI models.
See run_identity_chain_google.py for an example of how to use IdentityChain to evaluate Google models.
See run_identity_chain_huggingface.py for an example of how to use IdentityChain to evaluate HuggingFace open-source models. This example script already includes the following models (a rough generation-function sketch follows the list):

CodeLlama-Instruct-hf (7B, 13B, 34B, 70B)
CodeLlama-hf (7B, 13B, 34B, 70B)
StarChat-Beta
StarCoder
StarCoderPlus
StarCoderBase (1B, 3B, 7B, 15B)
DeepSeekCoder-Instruct (1.3B, 6.7B, 33B, 7B-v1.5)
DeepSeekCoder (1.3B, 6.7B, 33B, 7B-v1.5)
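As a rough indication of how one of these checkpoints could be wired into a greedy-decoding generation function with transformers (illustrative only; the example script ships its own implementations, and the checkpoint name below is just one of those listed above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase-1b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # return only the newly generated tokens, not the prompt
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )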

Example
Use run_identity_chain.sh to execute run_identity_chain_openai.py or run_identity_chain_huggingface.py, which conducts several IdentityChain evaluations in a batch. Make sure that you modify the following before running the script:

export CUDA_VISIBLE_DEVICES=0 to specify the local GPU device you want to use
export HF_HOME=YOUR_OWN_PATH/huggingface to specify your own huggingface home path, where the model checkpoints will be cached
export IDENTITY_CHAIN_HOME=YOUR_OWN_PATH/IdentityChain to specify your own IdentityChain home path
other parameters in the script according to your own needs (example exports are shown below)
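For example, a typical setup might look like this (the paths are placeholders; adjust them to your machine):

export CUDA_VISIBLE_DEVICES=0
export HF_HOME=/home/you/huggingface
export IDENTITY_CHAIN_HOME=/home/you/IdentityChain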

Then run the script:
cd examples
bash run_identity_chain.sh

This script will create a temporary folder tmp under your IdentityChain home path and store the results of the IdentityChain evaluation there as a jsonl file, for example tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl.
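Since the output is a jsonl file, each line is one JSON record and can be inspected directly before running the analysis scripts, for example (using the path above):

import json

path = "tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]
print(f"{len(records)} samples evaluated")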
Use analyze_results.py to analyze the results of the IdentityChain evaluation. It generates an xlsx file that contains the following information:

The SC (Self-Consistency) and SSC (Strong Self-Consistency) scores of the model at each self-iteration step. Note that SSC_0 is just Pass@1.
The aggregated TOM score (also BLEU and CodeBLEU) information at each step for the following 4 types of results: Pass-Pass, Pass-Fail, Fail-Fail, Fail-Pass.
The TOM score (also BLEU and CodeBLEU) trajectory at each self-iteration step for each sample in the evaluation set.
The raw test case outputs at each self-iteration step.

cd ../scripts
python analyze_results.py --input_path ../tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl --chain_length 5

The analyzed results will give you a sense of the model's overall performance, and the TOM score trajectory will help you pinpoint the exact step where the model makes a mistake.
Use browse_results.py to browse the results of IdentityChain evaluation. You can use this script to manually examine and study the mistakes made by the model for specific samples.
cd ../scripts
python browse_results.py --input_path ../tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl --chain_length 5 --start 0

Linting & Testing
We use a Makefile as a command registry:

make format: autoformat this library with black
make lint: perform static analysis of this library with black and flake8
make annotate: run type checking using mypy
make test: run automated tests
make check: check assets for packaging

Make sure that make lint, make test, and make check all pass locally before submitting a Pull Request.

