    ialacol (l-o-c-a-l-a-i)

    Docker Repository on Quay

    Introduction

ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.

It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ models with optional CUDA/Metal acceleration.

    ialacol is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.

    Features

• Compatible with the OpenAI APIs and with langchain.
• Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
• Streaming first, for better UX.
• Optional CUDA acceleration.
• Compatible with the GitHub Copilot VSCode extension, see Copilot.

    Supported Models

See Recipes below for deployment instructions.

All LLMs supported by ctransformers are also supported.

    Blogs

    Quick Start

    Kubernetes

ialacol offers first-class support for Kubernetes, meaning everything can be automated and configured, compared to running it without Kubernetes.

    To quickly get started with ialacol on Kubernetes, follow the steps below:

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install llama-2-7b-chat ialacol/ialacol

By default, it deploys Meta's Llama 2 Chat model quantized by TheBloke.

    Port-forward

    kubectl port-forward svc/llama-2-7b-chat 8000:8000

    Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl

    curl -X POST \
         -H 'Content-Type: application/json' \
         -d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
         http://localhost:8000/v1/chat/completions

Alternatively, use the openai CLI that ships with OpenAI's client library (see more examples in the examples/openai folder).

    openai -k "sk-fake" \
         -b http://localhost:8000/v1 -vvvvv \
         api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
         -g user "Hello world!"

    Configuration

All configuration is done via environment variables:

    • DEFAULT_MODEL_HG_REPO_ID: The Hugging Face repo id to download the model. Default: None. Example: TheBloke/orca_mini_3B-GGML
    • DEFAULT_MODEL_HG_REPO_REVISION: The Hugging Face repo revision. Default: main. Example: gptq-4bit-32g-actorder_True
    • DEFAULT_MODEL_FILE: The file name to download from the repo; optional for GPTQ models. Default: None. Example: orca-mini-3b.ggmlv3.q4_0.bin
    • MODEL_TYPE: Model type to override the automatic model type detection. Default: None. Examples: gptq, gpt_bigcode, llama, mpt, replit, falcon, gpt_neox, gptj
    • LOGGING_LEVEL: Logging level. Default: INFO. Example: DEBUG
    • TOP_K: top-k for sampling. Default: 40. Example: Integers
    • TOP_P: top-p for sampling. Default: 1.0. Example: Floats
    • REPETITION_PENALTY: Repetition penalty for sampling. Default: 1.1. Example: Floats
    • LAST_N_TOKENS: The number of last tokens considered for the repetition penalty. Default: 64. Example: Integers
    • SEED: The seed for sampling. Default: -1. Example: Integers
    • BATCH_SIZE: The batch size for evaluating tokens; only for GGUF/GGML models. Default: 8. Example: Integers
    • THREADS: Number of threads, overriding the auto-detected value (CPU count / 2); set to 1 for GPTQ models. Default: Auto. Example: Integers
    • MAX_TOKENS: The maximum number of tokens to generate. Default: 512. Example: Integers
    • STOP: The token that stops the generation. Default: None. Example: <|endoftext|>
    • CONTEXT_LENGTH: Override the auto-detected context length. Default: 512. Example: Integers
    • GPU_LAYERS: The number of layers to offload to the GPU. Default: 0. Example: Integers
    • TRUNCATE_PROMPT_LENGTH: Truncate the prompt to this length, if set. Default: 0. Example: Integers

Sampling parameters including TOP_K, TOP_P, REPETITION_PENALTY, LAST_N_TOKENS, SEED, MAX_TOKENS, and STOP can be overridden per request via the request body, for example:

    curl -X POST \
         -H 'Content-Type: application/json' \
         -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
         http://localhost:8000/v1/chat/completions

will use temperature=2, top_p=1 and top_k=0 for this request.
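
    The same per-request overrides can be sent from Python. A minimal sketch, again assuming the pre-1.0 openai package; stream=True exercises the streaming-first behaviour, and non-standard parameters such as top_k are assumed to be forwarded as-is in the JSON request body:

    import openai

    openai.api_key = "sk-fake"
    openai.api_base = "http://localhost:8000/v1"

    # Per-request sampling overrides; extra keyword arguments (e.g. top_k)
    # are included in the request body sent to ialacol.
    stream = openai.ChatCompletion.create(
        model="llama-2-7b-chat.ggmlv3.q4_0.bin",
        messages=[{"role": "user", "content": "Tell me a story."}],
        temperature=2,
        top_p=1.0,
        top_k=0,
        stream=True,
    )
    for chunk in stream:
        # Each streamed chunk carries a delta with a piece of the reply.
        delta = chunk.choices[0].delta
        print(delta.get("content", ""), end="", flush=True)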

    Run in Container

    Image from Github Registry

There is an image hosted on ghcr.io (with CUDA11, CUDA12, Metal and GPTQ variants).

    docker run --rm -it -p 8000:8000 \
         -e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
         -e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
         ghcr.io/chenhunghan/ialacol:latest
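
    Once the container is up, a quick smoke test can be run against it. A minimal sketch using only the Python standard library, posting the same body as the curl example in the Quick Start:

    import json
    import urllib.request

    # Same payload as the curl example above.
    body = json.dumps({
        "messages": [{"role": "user", "content": "How are you?"}],
        "model": "llama-2-7b-chat.ggmlv3.q4_0.bin",
        "stream": False,
    }).encode("utf-8")

    request = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    print(reply["choices"][0]["message"]["content"])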

    From Source

    For developers/contributors

    Python
    python3 -m venv .venv
    source .venv/bin/activate
    python3 -m pip install -r requirements.txt
    DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREAD=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999
    Docker

    Build image

    docker build --file ./Dockerfile -t ialacol .

    Run container

    export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
    export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
    docker run --rm -it -p 8000:8000 \
         -e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
         -e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol

    GPU Acceleration

To enable GPU/CUDA acceleration, you need to use the container image built for GPU and set the GPU_LAYERS environment variable. GPU_LAYERS is determined by the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.

    CUDA 11

    • deployment.image = ghcr.io/chenhunghan/ialacol-cuda11:latest
• deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

    CUDA 12

    • deployment.image = ghcr.io/chenhunghan/ialacol-cuda12:latest
• deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

Only llama, falcon, mpt and gpt_bigcode (StarCoder/StarChat) models support CUDA.

    Llama with CUDA12

    helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml

Deploys the Llama 2 7B Chat model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

    StarCoderPlus with CUDA12

    helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml

Deploys the Starcoderplus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

    CUDA Driver Issues

If you see CUDA driver version is insufficient for CUDA runtime version when making a request, you are likely using an NVIDIA driver that is not compatible with the CUDA version.

Upgrade the driver manually on the node (see here if you are using CUDA11 + AMI), or try a different version of CUDA.

    Metal

To enable Metal support, use the ialacol-metal image built for Metal.

    • deployment.image = ghcr.io/chenhunghan/ialacol-metal:latest

    For example

helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml

    GPTQ

To use GPTQ, you must set:

    • deployment.image = ghcr.io/chenhunghan/ialacol-gptq:latest
    • deployment.env.MODEL_TYPE = gptq

    For example

helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml
    kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
    openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"

    Tips

    Copilot

ialacol can be used as a Copilot backend, since GitHub Copilot's API is almost identical to the OpenAI completion API.

However, a few things need to be kept in mind:

1. The Copilot client sends a lengthy prompt that includes all the related context for code completion (see copilot-explorer), which puts a heavy load on the server. If you are trying to run ialacol locally, opt in to the TRUNCATE_PROMPT_LENGTH environment variable to truncate the prompt from the beginning and reduce the workload.

2. Copilot sends requests in parallel. To increase throughput, you probably need a queue such as text-inference-batcher.

    Start two instances of ialacol:

    gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
    LOGGING_LEVEL="DEBUG"
    THREAD=2
    DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
    DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
    TRUNCATE_PROMPT_LENGTH=100 # optional
    uvicorn main:app --host 0.0.0.0 --port 9998
    uvicorn main:app --host 0.0.0.0 --port 9999

    Start tib, pointing to upstream ialacol instances.

    gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
    UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start

Configure the VSCode GitHub Copilot extension to use tib.

    "github.copilot.advanced": {
         "debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
         "debug.testOverrideProxyUrl": "http://localhost:8000",
         "debug.overrideProxyUrl": "http://localhost:8000"
    }

Creative vs. Conservative

LLMs are known to be sensitive to sampling parameters: a higher temperature leads to more "randomness", making the LLM more "creative"; top_p and top_k also contribute to this "randomness".

If you want to make the LLM more creative:

    curl -X POST \
         -H 'Content-Type: application/json' \
         -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
         http://localhost:8000/v1/chat/completions

If you want to make the LLM more consistent and generate the same result for the same input:

    curl -X POST \
         -H 'Content-Type: application/json' \
         -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
         http://localhost:8000/v1/chat/completions

    Roadmap

    Star History

    Star History Chart

Recipes

    Llama-2

    Deploy Meta's Llama 2 Chat model quantized by TheBloke.

    7B Chat

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml

    13B Chat

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml

    70B Chat

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml

    OpenLM Research's OpenLLaMA Models

    Deploy OpenLLaMA 7B model quantized by rustformers.

    ℹ️ This is a base model, likely only useful for text completion.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml

VMware's OpenLLaMA 13B Open Instruct

    Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml

    Mosaic's MPT Models

    Deploy MosaicML's MPT-7B model quantized by rustformers. ℹ️ This is a base model, likely only useful for text completion.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml

    Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml

    Falcon Models

    Deploy Uncensored Falcon 7B model quantized by TheBloke.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml

    Deploy Uncensored Falcon 40B model quantized by TheBloke.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml

StarCoder Models (starcoder, starchat, starcoderplus, WizardCoder)

    Deploy starchat-beta model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml

    Deploy WizardCoder model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml

    Pythia Models

Deploy the lightweight pythia-70m model, with only 70 million parameters (~40MB), quantized by rustformers.

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml

    RedPajama Models

    Deploy RedPajama 3B model

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml

    StableLM Models

Deploy the StableLM 7B model

    helm repo add ialacol https://chenhunghan.github.io/ialacol
    helm repo update
    helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml

    Development

    python3 -m venv .venv
    source .venv/bin/activate
    python3 -m pip install -r requirements.txt
pip freeze > requirements.txt # update requirements.txt after adding new dependencies