Unverified commit d944d150 authored by Henry Chen, committed by GitHub

Add support for VSCode Github Copilot (#62)

parent a0ee699e
@@ -12,10 +12,11 @@ ialacol is inspired by other similar projects like [LocalAI](https://github.com/
## Features
- Compatibility with OpenAI APIs; works with frameworks built on top of them, such as [langchain](https://github.com/hwchase17/langchain).
- Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
- Streaming first! For better UX.
- Optional CUDA acceleration.
- Compatible with the [GitHub Copilot VSCode extension](https://marketplace.visualstudio.com/items?itemName=GitHub.copilot); see [Copilot](#copilot).
## Supported Models
@@ -96,6 +97,17 @@ docker run --rm -it -p 8000:8000 \
For developers/contributors
##### Python
```bash
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREAD=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999
```
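Once uvicorn is running, a quick smoke test is to hit the model-listing route. This assumes the OpenAI-style `GET /v1/models` endpoint and port 9999 from the command above:

```bash
curl http://localhost:9999/v1/models
# expects a JSON object like {"object": "list", "data": [...]}
```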
##### Docker
Build image
@@ -182,6 +194,46 @@ openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.crea
## Tips
### Copilot
`ialacol` can be used as a Copilot backend, as GitHub Copilot's API is almost identical to the OpenAI completion API.
However, a few things need to be kept in mind:
1. The Copilot client sends a lengthy prompt that includes all the related context for code completion (see [copilot-explorer](https://github.com/thakkarparth007/copilot-explorer)), which puts a heavy load on the server. If you are trying to run `ialacol` locally, opt in to the `TRUNCATE_PROMPT_LENGTH` environment variable to truncate the prompt from the beginning and reduce the workload (see the sketch after this list).
2. The Copilot client sends requests in parallel. To increase throughput, you probably need a queue such as [text-inference-batcher](https://github.com/ialacol/text-inference-batcher).
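To make the effect of `TRUNCATE_PROMPT_LENGTH` concrete, here is a minimal sketch of the trimming behaviour (the prompt and length are made up): with the default trim-from-the-beginning mode, only the last N characters of the prompt are kept, so the most recent code context survives.

```bash
# illustrative only: a long Copilot-style prompt and a truncation length
prompt="$(printf 'x%.0s' {1..500}) def fibonacci(n):"
TRUNCATE_PROMPT_LENGTH=100
# keep only the last $TRUNCATE_PROMPT_LENGTH characters of the prompt
echo "${prompt: -$TRUNCATE_PROMPT_LENGTH}"
```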
Start two instances of ialacol:
```bash
gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
export LOGGING_LEVEL="DEBUG"
export THREAD=2
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
export DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
export TRUNCATE_PROMPT_LENGTH=100 # optional
# start both instances in the background (or run each in its own terminal)
uvicorn main:app --host 0.0.0.0 --port 9998 &
uvicorn main:app --host 0.0.0.0 --port 9999 &
```
Start [tib](https://github.com/ialacol/text-inference-batcher), pointing to upstream ialacol instances.
```bash
gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start
```
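Optionally, before wiring up Copilot, check that tib is reachable and forwarding to the upstream instances. This assumes tib exposes the same OpenAI-style routes and listens on port 8000, the port the Copilot configuration below points at:

```bash
curl http://localhost:8000/v1/models
# should respond once at least one upstream ialacol instance is up
```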
Configure VSCode GitHub Copilot to use [tib](https://github.com/ialacol/text-inference-batcher):
```json
"github.copilot.advanced": {
"debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
"debug.testOverrideProxyUrl": "http://localhost:8000",
"debug.overrideProxyUrl": "http://localhost:8000"
}
```
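To test the whole chain without VSCode, you can hand-craft the kind of request the Copilot client sends. The sketch below posts to the new `/v1/engines/{engine}/completions` route through tib on port 8000 (the URL configured above); the prompt and `max_tokens` value are illustrative, and it assumes tib forwards the path to an upstream unchanged:

```bash
curl -X POST \
  http://localhost:8000/v1/engines/stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def fibonacci(n):", "max_tokens": 32, "stream": false}'
```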
### Creative vs. Conservative
LLMs are known to be sensitive to parameters: a higher `temperature` leads to more "randomness", hence the LLM becomes more "creative"; `top_p` and `top_k` also contribute to the "randomness".
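For example, the same chat request can be steered toward either end purely through these parameters. The sketch below assumes an ialacol instance on port 8000 serving the model file named in the request (swap in whatever model you have loaded), and that `top_p`/`top_k` are accepted alongside `temperature`:

```bash
# conservative: low temperature, narrow sampling
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin", "messages": [{"role": "user", "content": "Name a colour."}], "temperature": 0.1, "top_p": 0.2, "top_k": 10}'

# creative: high temperature, broad sampling
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin", "messages": [{"role": "user", "content": "Name a colour."}], "temperature": 1.2, "top_p": 0.95, "top_k": 80}'
```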
apiVersion: v2
appVersion: 0.11.0
description: A Helm chart for ialacol
name: ialacol
type: application
version: 0.11.0
@@ -7,7 +7,6 @@ from log import log
THREADS = int(get_env("THREADS", str(get_default_thread())))
def get_config(
    body: CompletionRequestBody | ChatCompletionRequestBody,
) -> Config:
@@ -23,6 +23,7 @@ from streamers import chat_completions_streamer, completions_streamer
from model_generate import chat_model_generate, model_generate
from get_env import get_env
from log import log
from truncate import truncate
DEFAULT_MODEL_HG_REPO_ID = get_env(
    "DEFAULT_MODEL_HG_REPO_ID", "TheBloke/Llama-2-7B-Chat-GGML"
@@ -44,7 +45,8 @@ def set_downloading_model(boolean: bool):
    """
    globals()["DOWNLOADING_MODEL"] = boolean
    log.debug("DOWNLOADING_MODEL set to %s", globals()["DOWNLOADING_MODEL"])
def set_loading_model(boolean: bool):
    """_summary_
@@ -102,7 +104,11 @@ async def startup_event():
        gpu_layers=GPU_LAYERS,
    )
    model_type = get_model_type(DEFAULT_MODEL_FILE)
    log.info(
        "Creating llm singleton with model_type: %s for DEFAULT_MODEL_FILE %s",
        model_type,
        DEFAULT_MODEL_FILE,
    )
    set_loading_model(True)
    llm = LLM(
        model_path=f"{os.getcwd()}/models/{DEFAULT_MODEL_FILE}",
@@ -137,7 +143,6 @@ async def models():
        "object": "list",
    }
@app.post("/v1/completions", response_model=CompletionResponseBody)
async def completions(
    body: Annotated[CompletionRequestBody, Body()],
@@ -177,6 +182,40 @@ async def completions(
        )
    return model_generate(prompt, model_name, llm, config)
@app.post("/v1/engines/{engine}/completions")
async def engine_completions(
    # Can't use a typed body param, as FastAPI requires the correct content-type header
    # but the Copilot client may not send such a header
    request: Request,
    # the engine path parameter is the only request param the Copilot client sends
    engine: str,
):
    """Similar to https://platform.openai.com/docs/api-reference/completions
    but with an engine param and under the /v1/engines prefix

    Args:
        request (Request): raw request, parsed manually as JSON
        engine (str): the model file name, used as the engine

    Returns:
        StreamingResponse: streaming response
    """
    if DOWNLOADING_MODEL is True:
        raise HTTPException(status_code=503, detail="Downloading model")
    json = await request.json()
    log.debug("Body:%s", str(json))
    body = CompletionRequestBody(**json, model=engine)
    prompt = truncate(body.prompt)
    config = get_config(body)
    llm = request.app.state.llm
    if body.stream is True:
        log.debug("Streaming response from %s", engine)
        return StreamingResponse(
            completions_streamer(prompt, engine, llm, config),
            media_type="text/event-stream",
        )
    return model_generate(prompt, engine, llm, config)
@app.post("/v1/chat/completions", response_model=ChatCompletionResponseBody)
async def chat_completions(
from get_env import get_env_or_none


def truncate(string, beginning=True):
    """Shorten the given string to TRUNCATE_PROMPT_LENGTH characters.

    The maximum length is read from the TRUNCATE_PROMPT_LENGTH environment
    variable; if it is not set, the string is returned unchanged.

    :Parameters:
        string (str) = The string to truncate.
        beginning (bool) = Trim starting chars, else trim ending chars.

    :Return:
        (str)

    ex. with TRUNCATE_PROMPT_LENGTH=4, truncate('12345678') returns '5678'
    """
    TRUNCATE_PROMPT_LENGTH = get_env_or_none("TRUNCATE_PROMPT_LENGTH")
    if TRUNCATE_PROMPT_LENGTH is None:
        return string
    length = int(TRUNCATE_PROMPT_LENGTH)
    if len(string) > length:
        # trim starting chars
        if beginning:
            string = string[-length:]
        # trim ending chars
        else:
            string = string[:length]
    return string