OpenAI Compatible LLM Inference

A Single Inference Wrapper for OpenAI, Together AI, Hugging Face Text Generation Inference (TGI), Ollama, etc.

Introduction

Until recently I thought that the openai library was only for connecting to OpenAI endpoints. It was not until I was testing out LLM inference with together.ai that I came across a section in their documentation on OpenAI API compatibility. The idea of using the openai client for inference with open source models was completely new to me. Here is the example from the together.ai documentation, which uses the openai library to connect to an open source model:

import os
import openai

system_content = "You are a travel agent. Be descriptive and helpful."
user_content = "Tell me about San Francisco"

client = openai.OpenAI(
    api_key=os.environ.get("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)
chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    temperature=0.7,
    max_tokens=1024,
)
response = chat_completion.choices[0].message.content
print("Together response:\n", response)

Then a week later I saw that Hugging Face had also released support for OpenAI compatibility with Text Generation Inference (TGI) and Inference Endpoints. Again, you simply modify the base_url, api_key, and model, as seen in this example from their blog post announcement.

from openai import OpenAI

# initialize the client but point it to TGI
client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",  # replace with your endpoint url
    api_key="<HF_API_TOKEN>",  # replace with your token
)
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

What about working with LLMs locally? Two such options are Ollama and LM Studio. Ollama recently added support for the openai client, and LM Studio supports it too. For example, here is how one can use mistral-7b locally with Ollama to run inference with the openai client.

First pull the model in a terminal:

ollama pull mistral

Then run inference:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused by Ollama
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful assistant and always talk like a pirate."},
        {"role": "user", "content": "Write a haiku."},
    ],
)
print(response.choices[0].message.content)

There are other services and libraries for running LLM inference that are compatible with the openai client too. I find this all very exciting because it means less code to write and maintain for running inference with LLMs: all I need to change is the base_url, the api_key, and the name of the model.

At the same time that I was learning about openai client compatibility, I was also looking into the instructor library. Since it patches additional functionality into the openai client, I thought it would be fun to discuss it here too.
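
To give a flavor of what instructor adds before we get to the full setup: you patch the client, then pass a Pydantic model via response_model to chat.completions.create and get a validated object back. Here is a minimal sketch of the idea (the City schema is just an illustrative example, not something used later in this post):

import instructor
from openai import OpenAI
from pydantic import BaseModel


class City(BaseModel):
    name: str
    country: str


# patching makes chat.completions.create accept a response_model argument
client = instructor.patch(OpenAI())

city = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": "Name one city on the west coast of the USA."}],
    response_model=City,  # instructor validates the response against this schema
)
print(city.name, city.country)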

ENV Setup

Start by creating a virtual environment:

python3 -m venv env
source env/bin/activate

Then install:

pip install openai
pip install instructor # only if you want to try out the instructor library
pip install python-dotenv # or define your environment variables another way

In my .env file I have the following:

OPENAI_API_KEY=your_key
HUGGING_FACE_ACCESS_TOKEN=your_key
TOGETHER_API_KEY=your_key

Then load the environment variables in Python:

import os

from dotenv import load_dotenv

load_dotenv()

LLM Inference Class

You could go ahead and just start using client.chat.completions.create directly as in the examples from the introduction. However, I like wrapping third-party services in classes for reusability, maintainability, etc.

The class below, OpenAIChatCompletion, does several things:

  • manages the different client connections in the clients dict
  • exposes client.chat.completions.create in the __call__ method
  • provides functionality for making multiple calls in parallel. Alternatively one could use the AsyncOpenAI client, but I sometimes prefer simply using futures.ThreadPoolExecutor, as seen in the function create_chat_completions_async (a minimal async sketch appears after the parallel example below).
  • patches the OpenAI client with the instructor library. If you don't want to play around with the instructor library, simply remove the instructor.patch call.

I also added some logging functionality which keeps track of every outgoing LLM request. This was inspired by the awesome blog post by Hamel Husain, Fuck You, Show Me The Prompt. In that post, Hamel writes about how various LLM tools can often hide the prompts, making it tricky to see what requests are actually sent to the LLM behind the scenes. I created a simple logger class, OpenAIMessagesLogger, which keeps track of all the requests sent to the openai client. Later, when we try out the instructor library for getting structured output, we will use this debugging logger to see the additional messages that were sent to the client.

import ast
import logging
import re
from concurrent import futures
from typing import Any, Dict, List, Optional, Union

import instructor
from openai import APITimeoutError, OpenAI
from openai._streaming import Stream
from openai.types.chat.chat_completion import ChatCompletion
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk


class OpenAIChatCompletion:
    clients: Dict = dict()

    @classmethod
    def _load_client(cls, base_url: Optional[str] = None, api_key: Optional[str] = None) -> OpenAI:
        client_key = (base_url, api_key)
        if OpenAIChatCompletion.clients.get(client_key) is None:
            OpenAIChatCompletion.clients[client_key] = instructor.patch(OpenAI(base_url=base_url, api_key=api_key))
        return OpenAIChatCompletion.clients[client_key]

    def __call__(
        self,
        model: str,
        messages: list,
        base_url: Optional[str] = None,
        api_key: Optional[str] = None,
        **kwargs: Any,
    ) -> Union[ChatCompletion, Stream[ChatCompletionChunk]]:
        # https://platform.openai.com/docs/api-reference/chat/create
        # https://github.com/openai/openai-python
        client = self._load_client(base_url, api_key)
        return client.chat.completions.create(model=model, messages=messages, **kwargs)

    @classmethod
    def create_chat_completions_async(
        cls, task_args_list: List[Dict], concurrency: int = 10
    ) -> List[Union[ChatCompletion, Stream[ChatCompletionChunk]]]:
        """
        Make a series of calls to chat.completions.create endpoint in parallel and collect back
        the results.
        :param task_args_list: A list of dictionaries where each dictionary contains the keyword
            arguments required for __call__ method.
        :param concurrency: the max number of workers
        """

        def create_chat_task(
            task_args: Dict,
        ) -> Union[None, ChatCompletion, Stream[ChatCompletionChunk]]:
            try:
                return cls().__call__(**task_args)
            except APITimeoutError:
                return None

        with futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
            results = list(executor.map(create_chat_task, task_args_list))
        return results


class OpenAIMessagesLogger(logging.Handler):
    def __init__(self):
        super().__init__()
        self.log_messages = []

    def emit(self, record):
        # Append the log message to the list
        log_record_str = self.format(record)
        match = re.search(r"Request options: (.+)", log_record_str, re.DOTALL)
        if match:
            text = match[1].replace("\n", "")
            log_obj = ast.literal_eval(text)
            self.log_messages.append(log_obj)


def debug_messages():
    msg = OpenAIMessagesLogger()
    openai_logger = logging.getLogger("openai")
    openai_logger.setLevel(logging.DEBUG)
    openai_logger.addHandler(msg)
    return msg

Here is how you use the inference class to call the LLM. If you have ever used the openai client you will be familiar with the input and output format.

llm = OpenAIChatCompletion()
message_logger = debug_messages()  # optional for keeping track of all outgoing requests
print(llm(model="gpt-3.5-turbo-0125", messages=[dict(role="user", content="Hello!")]))
ChatCompletion(id='chatcmpl-90N4hSh3AG1Sz68zjUnfcEtAjvFn5', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1709875727, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint='fp_2b778c6b35', usage=CompletionUsage(completion_tokens=9, prompt_tokens=9, total_tokens=18))

And our logger is keeping track of all the outgoing requests:

message_logger.log_messages
[{'method': 'post',
  'url': '/chat/completions',
  'files': None,
  'json_data': {'messages': [{'role': 'user', 'content': 'Hello!'}],
   'model': 'gpt-3.5-turbo-0125'}}]

Now we can define some different models that can all be accessed through the same inference class.

class Models:
    # OpenAI GPT Models
    GPT4 = dict(model="gpt-4-0125-preview", base_url=None, api_key=None)
    GPT3 = dict(model="gpt-3.5-turbo-0125", base_url=None, api_key=None)
    # Hugging Face Inference Endpoints
    OPENHERMES2_5_MISTRAL_7B = dict(
        model="tgi",
        base_url="https://xofunqxk66baupmf.us-east-1.aws.endpoints.huggingface.cloud" + "/v1/",
        api_key=os.environ["HUGGING_FACE_ACCESS_TOKEN"],
    )
    # Ollama Models
    LLAMA2 = dict(
        model="llama2",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )
    GEMMA2B = dict(
        model="gemma:2b-instruct",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )
    # Together AI endpoints
    GEMMA7B = dict(
        model="google/gemma-7b-it",
        base_url="https://api.together.xyz/v1",
        api_key=os.environ.get("TOGETHER_API_KEY"),
    )
    MISTRAL7B = dict(
        model="mistralai/Mistral-7B-Instruct-v0.1",
        base_url="https://api.together.xyz/v1",
        api_key=os.environ.get("TOGETHER_API_KEY"),
    )


all_models = [
    (model_name, model_config)
    for model_name, model_config in Models.__dict__.items()
    if not model_name.startswith("__")
]

messages = [
    {"role": "system", "content": "You are a helpful assistant. Your replies are short, brief and to the point."},
    {"role": "user", "content": "Who was the first person to walk on the Moon, and in what year did it happen?"},
]
for model_name, model_config in all_models:
    resp = llm(messages=messages, **model_config)
    print(f"Model: {model_name}")
    print(f"Response: {resp.choices[0].message.content}")
Model: GPT4
Response: Neil Armstrong, 1969.
Model: GPT3
Response: The first person to walk on the Moon was Neil Armstrong in 1969.
Model: OPENHERMES2_5_MISTRAL_7B
Response: Neil Armstrong was the first person to walk on the Moon. It happened on July 20, 1969.
Model: LLAMA2
Response: The first person to walk on the Moon was Neil Armstrong, who stepped onto the lunar surface on July 20, 1969 as part of the Apollo 11 mission.
Model: GEMMA2B
Response: There is no evidence to support the claim that a person walked on the Moon in any year.
Model: GEMMA7B
Response: Sure, here is the answer:

Neil Armstrong was the first person to walk on the Moon in 1969.
Model: MISTRAL7B
Response:  The first person to walk on the Moon was Neil Armstrong, and it happened on July 20, 1969.

We can also send the same requests in parallel like this:

task_args_list = []
for model_name, model_config in all_models:
    task_args_list.append(dict(messages=messages, **model_config))

# execute the same calls in parallel
model_names = [m[0] for m in all_models]
resps = llm.create_chat_completions_async(task_args_list)
for model_name, resp in zip(model_names, resps):
    print(f"Model: {model_name}")
    print(f"Response: {resp.choices[0].message.content}")
Model: GPT4
Response: Neil Armstrong, 1969.
Model: GPT3
Response: The first person to walk on the Moon was Neil Armstrong in 1969.
Model: OPENHERMES2_5_MISTRAL_7B
Response: The first person to walk on the Moon was Neil Armstrong, and it happened in 1969.
Model: LLAMA2
Response: Nice question! The first person to walk on the Moon was Neil Armstrong, and it happened in 1969 during the Apollo 11 mission. Armstrong stepped onto the lunar surface on July 20, 1969, famously declaring "That's one small step for man, one giant leap for mankind" as he took his first steps.
Model: GEMMA2B
Response: There is no evidence or record of any person walking on the Moon.
Model: GEMMA7B
Response: Sure, here is the answer:

Neil Armstrong was the first person to walk on the Moon in 1969.
Model: MISTRAL7B
Response:  The first person to walk on the Moon was Neil Armstrong, and it happened on July 20, 1969.
assert len(message_logger.log_messages) == 15
message_logger.log_messages[-1]
{'method': 'post',
 'url': '/chat/completions',
 'files': None,
 'json_data': {'messages': [{'role': 'system',
    'content': 'You are a helpful assistant. Your replies are short, brief and to the point.'},
   {'role': 'user',
    'content': 'Who was the first person to walk on the Moon, and in what year did it happen?'}],
  'model': 'mistralai/Mistral-7B-Instruct-v0.1'}}
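
As an aside, the same fan-out could be done with the AsyncOpenAI client mentioned earlier instead of futures.ThreadPoolExecutor. This is not part of the wrapper class; gather_completions below is just a hypothetical helper sketching the shape of the asyncio approach:

import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()  # OpenAI endpoint by default; pass base_url/api_key for other providers


async def gather_completions(messages_list, model="gpt-3.5-turbo-0125"):
    tasks = [
        async_client.chat.completions.create(model=model, messages=msgs)
        for msgs in messages_list
    ]
    # run all requests concurrently and return responses in the same order
    return await asyncio.gather(*tasks)


# in a notebook you could `await gather_completions(...)` directly instead of asyncio.run
responses = asyncio.run(gather_completions([messages] * 3))
for resp in responses:
    print(resp.choices[0].message.content)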

Structured Output

There are various approaches to getting structured output from LLMs; for example, see JSON mode and Function calling. Some open source models and inference providers are also starting to offer these capabilities (see the together.ai docs). The instructor blog also has lots of examples and tips for getting structured output from LLMs, including this recent post on structured output from open source and local LLMs.
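
For reference, OpenAI's JSON mode goes through the same chat.completions.create call, so it also works through the wrapper class above. A rough sketch (note that JSON mode requires the word "JSON" to appear somewhere in the messages; the prompt and parsing here are just illustrative):

import json

res = llm(
    messages=[{"role": "user", "content": "List three colors as a JSON object with a `colors` key."}],
    response_format={"type": "json_object"},  # ask the API to return a valid JSON object
    **Models.GPT3,
)
print(json.loads(res.choices[0].message.content))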

One thing that is neat about the instructor library is that you can define a Pydantic schema and pass it to the patched openai client. It also adds schema validation and retry logic.

First we will clear out our debugging log messages.

message_logger.log_messages = []

Next, define a Pydantic schema for the output we want. The validator below requires at least 20 characters, which will likely force instructor to retry:
from typing import List

from pydantic import BaseModel, field_validator


class Character(BaseModel):
    name: str
    race: str
    fun_fact: str
    favorite_food: str
    skills: List[str]
    weapons: List[str]


class Characters(BaseModel):
    characters: List[Character]

    @field_validator("characters")
    @classmethod
    def validate_characters(cls, v):
        if len(v) < 20:
            raise ValueError(f"The number of characters must be at least 20, but it is {len(v)}")
        return v

Now make the request, passing the schema via response_model:

res = llm(
    messages=[dict(role="user", content="Who are the main characters from Lord of the Rings?.")],
    response_model=Characters,
    max_retries=4,
    **Models.GPT4,
)
for character in res.characters:
    for k, v in character.model_dump().items():
        print(f"{k}: {v}")
    print()
name: Frodo Baggins
race: Hobbit
fun_fact: Bearer of the One Ring
favorite_food: Mushrooms
skills: ['Courage', 'Stealth']
weapons: ['Sting', 'Elven Dagger']

name: Samwise Gamgee
race: Hobbit
fun_fact: Frodo's gardener and friend
favorite_food: Potatoes
skills: ['Loyalty', 'Cooking']
weapons: ['Barrow-blade']

name: Gandalf
race: Maia
fun_fact: Known as Gandalf the Grey and later as Gandalf the White
favorite_food: N/A
skills: ['Wisdom', 'Magic']
weapons: ['Glamdring', 'Staff']

name: Aragorn
race: Human
fun_fact: Heir of Isildur and rightful king of Gondor
favorite_food: Elvish waybread
skills: ['Swordsmanship', 'Leadership']
weapons: ['Andúril', 'Bow']

name: Legolas
race: Elf
fun_fact: Prince of the Woodland Realm
favorite_food: Lembas bread
skills: ['Archery', 'Agility']
weapons: ['Elven bow', 'Daggers']

name: Gimli
race: Dwarf
fun_fact: Son of Glóin
favorite_food: Meat
skills: ['Axe fighting', 'Stout-heartedness']
weapons: ['Battle axe', 'Throwing axes']

name: Boromir
race: Human
fun_fact: Son of Denethor, Steward of Gondor
favorite_food: Stew
skills: ['Swordsmanship', 'Leadership']
weapons: ['Sword', 'Shield']

name: Meriadoc Brandybuck
race: Hobbit
fun_fact: Member of the Fellowship
favorite_food: Ale
skills: ['Stealth', 'Strategy']
weapons: ['Elven dagger']

name: Peregrin Took
race: Hobbit
fun_fact: Often known simply as Pippin
favorite_food: Cakes
skills: ['Curiosity', 'Bravery']
weapons: ['Sword']

name: Galadriel
race: Elf
fun_fact: Lady of Lothlórien
favorite_food: N/A
skills: ['Wisdom', 'Telepathy']
weapons: ['Nenya (Ring of Power)']

name: Elrond
race: Elf
fun_fact: Lord of Rivendell
favorite_food: N/A
skills: ['Wisdom', 'Healing']
weapons: ['Sword']

name: Eowyn
race: Human
fun_fact: Niece of King Théoden of Rohan; slayer of the Witch-king
favorite_food: Bread
skills: ['Swordsmanship', 'Courage']
weapons: ['Sword', 'Shield']

name: Faramir
race: Human
fun_fact: Brother of Boromir
favorite_food: Bread
skills: ['Archery', 'Strategy']
weapons: ['Bow', 'Sword']

name: Gollum
race: Hobbit-like creature
fun_fact: Once the bearer of the One Ring, known as Sméagol
favorite_food: Raw fish
skills: ['Stealth', 'Persuasion']
weapons: ['Teeth and claws']

name: Saruman
race: Maia
fun_fact: Head of the White Council before being corrupted
favorite_food: N/A
skills: ['Magic', 'Persuasion']
weapons: ['Staff']

name: Sauron
race: Maia
fun_fact: The Dark Lord and creator of the One Ring
favorite_food: N/A
skills: ['Necromancy', 'Deception']
weapons: ['One Ring', 'Mace']

name: Bilbo Baggins
race: Hobbit
fun_fact: Original discoverer of the One Ring
favorite_food: Everything
skills: ['Stealth', 'Story-telling']
weapons: ['Sting']

name: Théoden
race: Human
fun_fact: King of Rohan
favorite_food: Meat
skills: ['Leadership', 'Horsemanship']
weapons: ['Herugrim', 'Sword']

name: Treebeard
race: Ent
fun_fact: Oldest of the Ents, protectors of Fangorn Forest
favorite_food: Water
skills: ['Strength', 'Wisdom']
weapons: ['None']

name: Witch-king of Angmar
race: Undead/Nazgûl
fun_fact: Leader of the Nazgûl
favorite_food: N/A
skills: ['Fear-induction', 'Swordsmanship']
weapons: ['Morgul-blade', 'Flail']

name: Gríma Wormtongue
race: Human
fun_fact: Advisor to King Théoden under Saruman's influence
favorite_food: N/A
skills: ['Deception', 'Speechcraft']
weapons: ['Knife']

name: Éomer
race: Human
fun_fact: Nephew of King Théoden; later king of Rohan
favorite_food: Meat
skills: ['Swordsmanship', 'Horsemanship']
weapons: ['Sword', 'Spear']

It is likely that GPT would not return 20 characters in the first response. With max_retries=0 this would likely raise a Pydantic validation error. But since we set max_retries=4, the instructor library sends the validation error back to the model as a message and asks again. How exactly does it do that? We can look at the messages that we have logged for debugging.

assert len(message_logger.log_messages) > 1
len(message_logger.log_messages)
2
message_logger.log_messages
[{'method': 'post',
  'url': '/chat/completions',
  'files': None,
  'json_data': {'messages': [{'role': 'user',
     'content': 'Who are the main characters from Lord of the Rings?.'}],
   'model': 'gpt-4-0125-preview',
   'tool_choice': {'type': 'function', 'function': {'name': 'Characters'}},
   'tools': [{'type': 'function',
     'function': {'name': 'Characters',
      'description': 'Correctly extracted `Characters` with all the required parameters with correct types',
      'parameters': {'$defs': {'Character': {'properties': {'name': {'title': 'Name',
           'type': 'string'},
          'race': {'title': 'Race', 'type': 'string'},
          'fun_fact': {'title': 'Fun Fact', 'type': 'string'},
          'favorite_food': {'title': 'Favorite Food', 'type': 'string'},
          'skills': {'items': {'type': 'string'},
           'title': 'Skills',
           'type': 'array'},
          'weapons': {'items': {'type': 'string'},
           'title': 'Weapons',
           'type': 'array'}},
         'required': ['name',
          'race',
          'fun_fact',
          'favorite_food',
          'skills',
          'weapons'],
         'title': 'Character',
         'type': 'object'}},
       'properties': {'characters': {'items': {'$ref': '#/$defs/Character'},
         'title': 'Characters',
         'type': 'array'}},
       'required': ['characters'],
       'type': 'object'}}}]}},
 {'method': 'post',
  'url': '/chat/completions',
  'files': None,
  'json_data': {'messages': [{'role': 'user',
     'content': 'Who are the main characters from Lord of the Rings?.'},
    {'role': 'assistant',
     'content': '',
     'tool_calls': [{'id': 'call_kjUg9ogoR1OdRr0OkmTzabue',
       'function': {'arguments': '{"characters":[{"name":"Frodo Baggins","race":"Hobbit","fun_fact":"Bearer of the One Ring","favorite_food":"Mushrooms","skills":["Courage","Stealth"],"weapons":["Sting","Elven Dagger"]},{"name":"Samwise Gamgee","race":"Hobbit","fun_fact":"Frodo\'s gardener and friend","favorite_food":"Potatoes","skills":["Loyalty","Cooking"],"weapons":["Barrow-blade"]},{"name":"Gandalf","race":"Maia","fun_fact":"Known as Gandalf the Grey and later as Gandalf the White","favorite_food":"N/A","skills":["Wisdom","Magic"],"weapons":["Glamdring","Staff"]},{"name":"Aragorn","race":"Human","fun_fact":"Heir of Isildur and rightful king of Gondor","favorite_food":"Elvish waybread","skills":["Swordsmanship","Leadership"],"weapons":["Andúril","Bow"]},{"name":"Legolas","race":"Elf","fun_fact":"Prince of the Woodland Realm","favorite_food":"Lembas bread","skills":["Archery","Agility"],"weapons":["Elven bow","Daggers"]},{"name":"Gimli","race":"Dwarf","fun_fact":"Son of Glóin","favorite_food":"Meat","skills":["Axe fighting","Stout-heartedness"],"weapons":["Battle axe","Throwing axes"]}]}',
        'name': 'Characters'},
       'type': 'function'}]},
    {'role': 'tool',
     'tool_call_id': 'call_kjUg9ogoR1OdRr0OkmTzabue',
     'name': 'Characters',
     'content': "Recall the function correctly, fix the errors and exceptions found\n1 validation error for Characters\ncharacters\n  Value error, The number of characters must be at least 20, but it is 6 [type=value_error, input_value=[{'name': 'Frodo Baggins'...axe', 'Throwing axes']}], input_type=list]\n    For further information visit https://errors.pydantic.dev/2.6/v/value_error"}],
   'model': 'gpt-4-0125-preview',
   'tool_choice': {'type': 'function', 'function': {'name': 'Characters'}},
   'tools': [{'type': 'function',
     'function': {'name': 'Characters',
      'description': 'Correctly extracted `Characters` with all the required parameters with correct types',
      'parameters': {'$defs': {'Character': {'properties': {'name': {'title': 'Name',
           'type': 'string'},
          'race': {'title': 'Race', 'type': 'string'},
          'fun_fact': {'title': 'Fun Fact', 'type': 'string'},
          'favorite_food': {'title': 'Favorite Food', 'type': 'string'},
          'skills': {'items': {'type': 'string'},
           'title': 'Skills',
           'type': 'array'},
          'weapons': {'items': {'type': 'string'},
           'title': 'Weapons',
           'type': 'array'}},
         'required': ['name',
          'race',
          'fun_fact',
          'favorite_food',
          'skills',
          'weapons'],
         'title': 'Character',
         'type': 'object'}},
       'properties': {'characters': {'items': {'$ref': '#/$defs/Character'},
         'title': 'Characters',
         'type': 'array'}},
       'required': ['characters'],
       'type': 'object'}}}]}}]

If you look through the above messages carefully, you can see the retry logic: the validation error is sent back to the model as a tool message.

Recall the function correctly, fix the errors and exceptions found\n1 validation error for Characters\ncharacters\n Value error, The number of characters must be at least 20, ...

You can even use structured output with some of the open source models. I would refer to the instructor blog or documentation for further information on that. I have not fully looked into the different patching modes yet, but here is a simple example of using MISTRAL7B through together.ai.

res = llm(
    messages=[dict(role="user", content="Give me a character from a movie or book.")],
    response_model=Character,
    max_retries=2,
    **Models.MISTRAL7B,
)
print(res.model_dump())
{'name': 'Superman', 'race': 'Kryptonian', 'fun_fact': 'Can fly', 'favorite_food': 'Pizza', 'skills': ['Super strength', 'Flight', 'Heat vision', 'X-ray vision'], 'weapons': ['Laser vision', 'Heat vision', 'X-ray vision']}
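
Regarding the patching modes mentioned above: instructor can be told how to extract structured output when you patch the client. I have not verified the details, but at the time of writing a sketch with an explicit mode might look something like this (check the instructor docs for the current Mode names and patch signature):

import instructor
from openai import OpenAI

# Hypothetical sketch: patch a together.ai client with an explicit instructor mode.
together_client = instructor.patch(
    OpenAI(base_url="https://api.together.xyz/v1", api_key=os.environ.get("TOGETHER_API_KEY")),
    mode=instructor.Mode.MD_JSON,  # ask for JSON in the response text instead of tool calls
)
character = together_client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Give me a character from a movie or book."}],
    response_model=Character,
    max_retries=2,
)
print(character.model_dump())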

Conclusion

Again, I really like the idea of using a single interface for interacting with multiple LLMs. I hope the space continues to mature so that more open source models and services support JSON mode and function calling. I think instructor is a cool library and the corresponding blog is interesting too. I also like the idea of logging all the outgoing prompts/messages just to make sure I fully understand what is happening under the hood.