    How to Migrate from OpenAI to Cerebrium for Cost-Predictable AI Inference

    If you’re building an AI application, you probably started with OpenAI’s convenient APIs. However, as your application scales, you’ll need more control over costs, models, and infrastructure.

    Cerebrium is a serverless AI infrastructure platform that lets you run open-source models on dedicated hardware with predictable, time-based pricing instead of token-based billing.

    This guide will show you how to build a complete chat application with OpenAI, migrate it to Cerebrium by changing just two lines of code, and add performance and cost tracking to compare the two approaches to AI inference using real data. When you’re done, you’ll have a working chat application that demonstrates the practical differences between token-based and compute-based pricing models, and the insights you need to choose the right approach for your use case.

    Prerequisites

    To follow along with this guide, you’ll need Python 3.10 or higher installed on your system. You’ll also need the following (all free):

    • OpenAI API key.
    • Cerebrium account (includes free tier access to test GPU instances up to A10 level).
    • Hugging Face token (free account required).
    • Llama 3.1 model access on Hugging Face. Visit meta-llama/Meta-Llama-3.1-8B-Instruct and click “Request access” to get approval from Meta (typically takes a few minutes to a few hours).

    Familiarity with Python and API calls is helpful, but we’ll walk through each step in detail.

    Creating an OpenAI Chatbot

    We’ll start by building a complete chat application on OpenAI’s API as our foundation, then enhance it throughout the tutorial without ever modifying the core chat logic.

    Create a new directory for the project and set up the basic structure:

    mkdir openai-cerebrium-migration
    cd openai-cerebrium-migration
    

    Install the dependencies:

    pip install openai==1.55.0 python-dotenv==1.0.0 art==6.1 colorama==0.4.6
    

    Create a .env file to store API credentials:

    OPENAI_API_KEY=your_openai_api_key_here
    
    CEREBRIUM_API_KEY=your_cerebrium_api_key_here
    
    CEREBRIUM_ENDPOINT_URL=your_cerebrium_endpoint_url_here
    

    Replace your_openai_api_key_here with your actual OpenAI API key.

    Now we’ll build the chat.py file step by step.

    Start by creating the file and adding the imports:

    import os
    import time
    from dotenv import load_dotenv
    from openai import OpenAI
    from art import text2art
    from colorama import init, Fore, Style
    

    These imports handle environment variables, OpenAI client creation, ASCII art generation, and colored terminal output.

    Add the initialization below the imports:

    load_dotenv()
    
    init(autoreset=True)
    

    Add this display_intro function:

    def display_intro(use_cerebrium, endpoint_name):
        print("n")
    
        if use_cerebrium:
            ascii_art = text2art("Cerebrium", font="tarty1")
            print(f"{Fore.MAGENTA}{ascii_art}{Style.RESET_ALL}")
        else:
            ascii_art = text2art("OpenAI", font="tarty1")
            print(f"{Fore.WHITE}{ascii_art}{Style.RESET_ALL}")
    
        print(f"Connected to: {Fore.CYAN}{endpoint_name}{Style.RESET_ALL}")
        print("nType 'quit' or 'exit' to end the chatn")
    

    This function provides visual feedback when we switch between endpoints.

    Add the main function that handles the chat logic:

    def main():    
        # OpenAI endpoint
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        model = "gpt-4o-mini"
    
        endpoint_name = "OpenAI (GPT-4o-mini)"
        use_cerebrium = False
    
        display_intro(use_cerebrium, endpoint_name)
    
        conversation = []
    
        while True:
            user_input = input("You: ").strip()
    
            if user_input.lower() in ['quit', 'exit', 'bye']:
                print("Goodbye!")
                break
    
            if not user_input:
                continue
    
            conversation.append({"role": "user", "content": user_input})
    

    This function sets up the endpoint configuration and handles the basic chat loop.

    Add the response handling logic inside the main function’s while loop:

            try:
                print("Bot: ", end="", flush=True)
    
                chat_completion = client.chat.completions.create(
                    messages=conversation,
                    model=model,
                    stream=True,
                    stream_options={"include_usage": True},
                    temperature=0.7
                )
    
                bot_response = ""
                for chunk in chat_completion:
    
                    if chunk.choices[0].delta.content:
                        content = chunk.choices[0].delta.content
                        print(content, end="", flush=True)
                        bot_response += content
    
                print()
    
                conversation.append({"role": "assistant", "content": bot_response})
    
            except Exception as e:
                print(f"❌ Error: {e}")
                conversation.pop()
    

    Finally, add the script execution guard at the end of the file:

    if __name__ == "__main__":
        main()
    

    Test the chatbot by running:
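
    python chat.py
    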

    You’ll see the OpenAI ASCII art, and you can start chatting with GPT-4o mini. Ask a question to verify that the app works correctly. Responses will stream in real-time.

    Deploying a Cerebrium Endpoint With vLLM and Llama 3.1

    Now we’ll create a Cerebrium endpoint that serves the same OpenAI-compatible interface using vLLM and an open-source model. When we’re done, we’ll be able to switch to a self-hosted open-source model endpoint by changing just two lines of code.

    Configuring Hugging Face Access for Llama 3.1

    First, make sure you have access to the Llama 3.1 model on Hugging Face. If you haven’t already requested access, visit meta-llama/Meta-Llama-3.1-8B-Instruct and click “Request access”.

    Next, create a Hugging Face token by going to Hugging Face settings, clicking “New token”, and selecting “Read” permissions.

    Add your Hugging Face token to your Cerebrium project secrets. Go to your Cerebrium dashboard, select your project, and add HF_AUTH_TOKEN with your Hugging Face token as the value.

    Setting Up a Cerebrium Account and API Access

    Create a free Cerebrium account and navigate to your dashboard. In the “API Keys” section, copy your session token and save it for later – you’ll need it to authenticate with the deployed endpoint.

    Add the session token to the .env file as a CEREBRIUM_API_KEY variable:

    OPENAI_API_KEY=your_openai_api_key_here
    
    CEREBRIUM_API_KEY=your_cerebrium_api_key_here
    
    CEREBRIUM_ENDPOINT_URL=your_cerebrium_endpoint_url_here
    

    Building the OpenAI-Compatible vLLM Endpoint

    Start by installing the Cerebrium CLI and creating a new project:

    pip install cerebrium
    cerebrium login
    cerebrium init openai-compatible-endpoint
    cd openai-compatible-endpoint
    

    We’ll build the main.py file step by step to understand each component.

    Start with the imports and authentication:

    from vllm import SamplingParams, AsyncLLMEngine
    from vllm.engine.arg_utils import AsyncEngineArgs
    from pydantic import BaseModel
    from typing import Any, List, Optional, Union, Dict
    import time
    import json
    import os
    from huggingface_hub import login
    
    login(token=os.environ.get("HF_AUTH_TOKEN"))
    

    These imports provide the vLLM async engine for model inference, Pydantic models for data validation, and Hugging Face authentication for model access.

    Add the vLLM engine configuration:

    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        gpu_memory_utilization=0.9,  # Set GPU memory utilization
        max_model_len=8192  # Set max model length
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    

    This configuration uses 90% of available GPU memory and sets an 8K-token context window, optimizing for throughput while maintaining reasonable memory usage.

    Now add the Pydantic models that define the OpenAI-compatible response format:

    class Message(BaseModel):
        role: str
        content: str
    
    class ChoiceDelta(BaseModel):
        content: Optional[str] = None
        function_call: Optional[Any] = None
        refusal: Optional[Any] = None
        role: Optional[str] = None
        tool_calls: Optional[Any] = None
    
    class Choice(BaseModel):
        delta: ChoiceDelta
        finish_reason: Optional[str] = None
        index: int
        logprobs: Optional[Any] = None
    
    class Usage(BaseModel):
        completion_tokens: int = 0
        prompt_tokens: int = 0
        total_tokens: int = 0
    
    class ChatCompletionResponse(BaseModel):
        id: str
        object: str
        created: int
        model: str
        choices: List[Choice]
        service_tier: Optional[str] = "default"
        system_fingerprint: Optional[str] = "fp_cerebrium_vllm"
        usage: Optional[Usage] = None
    

    These models ensure the Cerebrium endpoint returns the same JSON structure as OpenAI’s API, enabling drop-in compatibility.

    Add the chat template formatting function:

    def format_llama_chat_prompt(messages: list) -> str:
        formatted_prompt = "<|begin_of_text|>"
    
        for message in messages:
            msg = Message(**message)
    
            formatted_prompt += f"<|start_header_id|>{msg.role}<|end_header_id|>nn"
    
            formatted_prompt += f"{msg.content}<|eot_id|>"
    
        formatted_prompt += "<|start_header_id|>assistant<|end_header_id|>nn"
    
        return formatted_prompt
    

    This function converts OpenAI’s message format to Llama 3.1’s specific chat template.
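
    For example, a conversation containing a single user message of "Hello!" produces the following prompt (with the \n\n escapes rendered as actual newlines):

    <|begin_of_text|><|start_header_id|>user<|end_header_id|>

    Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    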

    Add the main inference function:

    async def run(
        messages: list, 
        model: str, 
        run_id: str, 
        stream: bool = True, 
        temperature: float = 0.7, 
        top_p: float = 0.95,
        max_tokens: Optional[int] = None,
        max_completion_tokens: Optional[int] = None,
        **kwargs
    ):
        formatted_prompt = format_llama_chat_prompt(messages)
    
        effective_max_tokens = max_tokens or max_completion_tokens or 2000
    
        stop_tokens = [
            "",                  
            "<|eot_id|>",            
            "<|start_header_id|>",   
            "<|end_header_id|>",     
        ]
    
        sampling_params = SamplingParams(
            temperature=temperature,
            top_p=top_p,
            max_tokens=effective_max_tokens,
            stop=stop_tokens,
            skip_special_tokens=True,
        )
    
        results_generator = engine.generate(formatted_prompt, sampling_params, run_id)
        previous_text = ""
        token_count = 0
    
        async for output in results_generator:
            outputs = output.outputs
            new_text = outputs[0].text[len(previous_text):]
            previous_text = outputs[0].text
            token_count += len(new_text.split())  
    
            is_final_chunk = outputs[0].finish_reason is not None
    
            choice = Choice(
                delta=ChoiceDelta(content=new_text),
                finish_reason=outputs[0].finish_reason if is_final_chunk else None,
                index=0
            )
    
            usage = None
            if is_final_chunk and stream:
                usage = Usage(
                    completion_tokens=token_count,
                    prompt_tokens=len(formatted_prompt.split()),
                    total_tokens=token_count + len(formatted_prompt.split())
                )
    
            response = ChatCompletionResponse(
                id=run_id,
                object="chat.completion.chunk",
                created=int(time.time()),
                model=model,
                choices=[choice],
                usage=usage
            )
    
            print(response.model_dump())
            yield f"data: {json.dumps(response.model_dump())}nn"
    
        yield "data: [DONE]nn"
    

    This function handles the core inference logic, streaming responses in OpenAI’s response format while using vLLM’s async engine for efficient processing.

    Replace the contents of the cerebrium.toml configuration file with this configuration:

    [cerebrium.deployment]
    name = "1-openai-compatible-endpoint"
    python_version = "3.10"
    docker_base_image_url = "debian:bookworm-slim"
    disable_auth = true
    include = ['./*', 'main.py', 'cerebrium.toml']
    exclude = ['.*']
    
    [cerebrium.hardware]
    cpu = 2
    memory = 12.0
    compute = "AMPERE_A10"
    
    [cerebrium.dependencies.pip]
    vllm = "latest"
    pydantic = "latest"
    
    [cerebrium.scaling]
    min_replicas = 0
    max_replicas = 5
    cooldown = 30
    replica_concurrency = 1
    response_grace_period = 900
    scaling_metric = "concurrency_utilization"
    scaling_target = 100
    scaling_buffer = 0
    roll_out_duration_seconds = 0
    

    This configuration specifies an A10 GPU with 2 CPU cores and 12GB of memory, providing a good balance of performance and cost for most applications.

    Deploy the endpoint:
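
    cerebrium deploy
    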

    When the app has successfully deployed, you should see a message like this:

    ╭─────────────────────────────────  openai-compatible-endpoint is now live!   ──────────────────────────────────╮
    │ App Dashboard: https://dashboard.cerebrium.ai/projects/p-your-project-id/apps/p-your-project-id-openai-compatible-endpoint  │
    │                                                                                                                 │
    │ Endpoints:                                                                                                      │
    │ POST https://api.cortex.cerebrium.ai/v4/p-your-project-id/openai-compatible-endpoint/{function_name}                 │
    ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
    

    After deployment, copy the endpoint URL and add it to the .env file, replacing {function_name} with run:

    OPENAI_API_KEY=your_openai_api_key_here
    
    CEREBRIUM_ENDPOINT_URL=https://api.cortex.cerebrium.ai/v4/p-your-project-id/openai-compatible-endpoint/run
    
    CEREBRIUM_API_KEY=your_jwt_token_here
    

    Migrating From OpenAI to Cerebrium by Changing Just Two Lines of Code

    Now you can migrate from OpenAI to Cerebrium by changing just two lines in the chat.py file. Navigate back to the main project directory and open chat.py.

    Replace the current endpoint:

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    model = "gpt-4o-mini"
    

    With this:

    client = OpenAI(
        base_url=os.getenv("CEREBRIUM_ENDPOINT_URL"),
        api_key=os.getenv("CEREBRIUM_API_KEY"),
    )
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    

    To migrate to Cerebrium, you only need to make two changes:

    1. Add the base_url parameter to the OpenAI client.
    2. Update the model name.

    Next, update endpoint_name and use_cerebrium to ensure the CLI provides visual feedback for the change.

    Replace these lines:

    endpoint_name = "OpenAI (GPT-4o-mini)"
    use_cerebrium = False
    

    With these updated lines:

    endpoint_name = "Cerebrium vLLM (Llama 3.1)"
    use_cerebrium = True
    

    Run the application again:
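
    python chat.py
    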

    You’ll see the same streaming interface, but now it’s using the Cerebrium endpoint with Llama 3.1 instead of OpenAI with GPT-4o-mini. The chat functionality remains identical – same streaming, same interface – but now it’s running on your infrastructure.

    Note: You may notice a delay of 30-60 seconds on your first prompt as Cerebrium spins up the GPU instance. This is called a “cold start,” which occurs because min_replicas is set to 0 in the configuration, meaning the instance shuts down when not in use to save costs. Cerebrium doesn’t charge for cold start time – you only pay once the model starts processing your request. For production applications with consistent traffic, you can set min_replicas = 1 to keep an instance always running and eliminate cold starts.

    Implementing the Automatic Cost and Performance Comparison

    To add comprehensive cost and performance tracking, we’ll create a pricing.py file that automatically enhances the existing chat application without requiring any changes to chat.py.

    Start by creating the pricing.py file with the pricing constants:

    import re
    from colorama import Fore, Style
    
    OPENAI_PRICING = {
        "gpt-4o-mini": {
            "input_per_1m_tokens": 0.15,
            "output_per_1m_tokens": 0.60,
            "cached_input_per_1m_tokens": 0.075
        }
    }
    
    CEREBRIUM_PRICING = {
        "AMPERE_A10": {
            "gpu_per_second": 0.000306,
            "cpu_per_vcpu_per_second": 0.00000655,
            "memory_per_gb_per_second": 0.00000222
        }
    }
    
    CEREBRIUM_HARDWARE = {
        "cpu_vcores": 2,
        "memory_gb": 12.0,
        "gpu_type": "AMPERE_A10"
    }
    

    These constants define the pricing models from both services.

    Next, add the utility functions for text analysis:

    def estimate_tokens(text):
        """Rough token estimation: ~4 characters per token"""
        return len(text) // 4
    
    def count_words(text):
        """Count words in text"""
        return len(re.findall(r'\b\w+\b', text))
    

    These functions collect word count and estimate token count (we’ll use these to calculate costs later).
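
    For example, estimate_tokens("Hello, world!") returns 3 (13 characters // 4), and count_words("Hello, world!") returns 2.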

    Add the OpenAI cost calculation function:

    def calculate_openai_stats(bot_response, response_time, chunks, final_usage, model):
        """Calculate OpenAI stats from response data"""
        pricing = OPENAI_PRICING[model]
    
        # Use actual usage data
        prompt_tokens = final_usage.prompt_tokens
        completion_tokens = final_usage.completion_tokens
        total_tokens = final_usage.total_tokens
    
        # Handle cached tokens if available
        cached_tokens = 0
        if hasattr(final_usage, 'prompt_tokens_details') and final_usage.prompt_tokens_details:
            cached_tokens = getattr(final_usage.prompt_tokens_details, 'cached_tokens', 0)
    
        regular_input_tokens = prompt_tokens - cached_tokens
    
        # Calculate costs
        input_cost = (regular_input_tokens / 1_000_000) * pricing["input_per_1m_tokens"]
        cached_cost = (cached_tokens / 1_000_000) * pricing["cached_input_per_1m_tokens"]
        output_cost = (completion_tokens / 1_000_000) * pricing["output_per_1m_tokens"]
        total_cost = input_cost + cached_cost + output_cost
    
        return {
            'prompt_tokens': prompt_tokens,
            'completion_tokens': completion_tokens,
            'total_tokens': total_tokens,
            'cached_tokens': cached_tokens,
            'regular_input_tokens': regular_input_tokens,
            'input_cost': input_cost,
            'cached_cost': cached_cost,
            'output_cost': output_cost,
            'total_cost': total_cost,
            'cost_per_token': total_cost / max(total_tokens, 1),
            'response_time': response_time,
            'chunks': chunks,
            'bot_response': bot_response
        }
    

    This function calculates a response cost estimate using OpenAI’s token-based pricing, including separate costs for input, cached input, and output tokens.
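
    For example, a response that used 16 input tokens and 582 output tokens costs (16 / 1,000,000) × $0.15 + (582 / 1,000,000) × $0.60 ≈ $0.000352, which matches the total shown in the comparison later in this guide.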

    Add the Cerebrium cost calculation function:

    def calculate_cerebrium_stats(bot_response, response_time, chunks):
        if response_time == 0:
            return None
    
        hardware = CEREBRIUM_HARDWARE
        pricing = CEREBRIUM_PRICING[hardware["gpu_type"]]
    
        # Calculate per-second costs
        gpu_cost_per_second = pricing["gpu_per_second"]
        cpu_cost_per_second = pricing["cpu_per_vcpu_per_second"] * hardware["cpu_vcores"]
        memory_cost_per_second = pricing["memory_per_gb_per_second"] * hardware["memory_gb"]
        total_cost_per_second = gpu_cost_per_second + cpu_cost_per_second + memory_cost_per_second
    
        # Calculate total costs
        gpu_cost = response_time * gpu_cost_per_second
        cpu_cost = response_time * cpu_cost_per_second
        memory_cost = response_time * memory_cost_per_second
        total_cost = response_time * total_cost_per_second
    
        # Estimate tokens and words
        estimated_tokens = estimate_tokens(bot_response)
        word_count = count_words(bot_response)
    
        return {
            'response_time': response_time,
            'chunks': chunks,
            'gpu_cost': gpu_cost,
            'cpu_cost': cpu_cost,
            'memory_cost': memory_cost,
            'total_cost': total_cost,
            'estimated_tokens': estimated_tokens,
            'word_count': word_count,
            'cost_per_token': total_cost / max(estimated_tokens, 1),
            'cost_per_word': total_cost / max(word_count, 1),
            'tokens_per_second': estimated_tokens / max(response_time, 1),
            'hardware': hardware
        }
    

    This function calculates time-based costs for Cerebrium, breaking down GPU, CPU, and memory costs separately.
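
    For example, a 2.13-second response on this hardware costs 2.13 × ($0.000306 + 2 × $0.00000655 + 12 × $0.00000222) ≈ $0.000737, regardless of how many tokens were generated.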

    Add these functions to format and display the findings:

    def create_aligned_box(lines, title="Response Stats"):
        if not lines:
            return ""
    
        max_content_width = max(len(line) for line in lines) + 2
        title_width = len(f"─ {title} ") + 4
        box_width = max(max_content_width, title_width)
    
        top_border = f"─ {title} " + "─" * (box_width - len(f"─ {title} "))
        bottom_border = "─" * box_width
    
        padded_lines = []
        for line in lines:
            padded_lines.append(f"  {line}")
    
        return "n".join([top_border] + padded_lines + [bottom_border])
    
    def display_openai_stats(stats):
        if not stats:
            return
    
        word_count = count_words(stats['bot_response'])
        tokens_per_second = stats['completion_tokens'] / max(stats['response_time'], 1)
        cost_per_word = stats['total_cost'] / max(word_count, 1)
    
        lines = [
            f"{Fore.CYAN}🚀 Speed: {stats['response_time']:.2f}s | {tokens_per_second:.1f} tokens/sec | {stats['chunks']} chunks",
            f"{Fore.YELLOW}💰 Cost Breakdown:",
            f"{Fore.YELLOW}   • Input ({stats['regular_input_tokens']} tokens): ${stats['input_cost']:.6f}",
        ]
    
        if stats['cached_tokens'] > 0:
            lines.append(f"{Fore.YELLOW}   • Cached input ({stats['cached_tokens']} tokens): ${stats['cached_cost']:.6f}")
    
        lines.extend([
            f"{Fore.YELLOW}   • Output ({stats['completion_tokens']} tokens): ${stats['output_cost']:.6f}",
            f"{Fore.YELLOW}   • Total: ${stats['total_cost']:.6f}",
            f"{Fore.GREEN}📊 Efficiency: ${stats['cost_per_token']:.8f}/token | ${cost_per_word:.6f}/word",
            f"{Fore.MAGENTA}🔧 Method: Token-based pricing (15¢/1M input, 60¢/1M output)",
            f"{Fore.WHITE}   Model: gpt-4o-mini | Total tokens: {stats['total_tokens']}"
        ])
    
        box = create_aligned_box(lines, "OpenAI Response Stats")
        print(f"n{box}")
    
    def display_cerebrium_stats(stats):
        if not stats:
            return
    
        lines = [
            f"{Fore.CYAN}🚀 Speed: {stats['response_time']:.2f}s | {stats['chunks']} chunks | ~{stats['tokens_per_second']:.0f} tokens/sec",
            f"{Fore.YELLOW}💰 Cost Breakdown:",
            f"{Fore.YELLOW}   • A10 GPU: ${stats['gpu_cost']:.6f} ({stats['response_time']:.2f}s × $0.000306/s)",
            f"{Fore.YELLOW}   • CPU (2 cores): ${stats['cpu_cost']:.6f} ({stats['response_time']:.2f}s × $0.0000131/s)",
            f"{Fore.YELLOW}   • Memory (12GB): ${stats['memory_cost']:.6f} ({stats['response_time']:.2f}s × $0.0000267/s)",
            f"{Fore.YELLOW}   • Total: ${stats['total_cost']:.6f}",
            f"{Fore.GREEN}📊 Efficiency: ~${stats['cost_per_token']:.6f}/token | ${stats['cost_per_word']:.6f}/word",
            f"{Fore.MAGENTA}🔧 Method: Time-based pricing - you pay for compute seconds",
            f"{Fore.WHITE}   Hardware: AMPERE_A10 + 2 vCPUs + 12GB RAM",
            f"{Fore.WHITE}   Estimated: {stats['estimated_tokens']} tokens | {stats['word_count']} words"
        ]
    
        box = create_aligned_box(lines, "Cerebrium Response Stats")
        print(f"n{box}")
    

    These functions create a formatted output that displays the different pricing models and performance metrics.

    Finally, add the entry point function that will be used in the chat.py file:

    def calculate_and_display_stats(bot_response, response_time, chunks, final_usage, use_cerebrium, model):
        if use_cerebrium:
            stats = calculate_cerebrium_stats(bot_response, response_time, chunks)
            display_cerebrium_stats(stats)
        else:
            stats = calculate_openai_stats(bot_response, response_time, chunks, final_usage, model)
            display_openai_stats(stats)
    

    This function automatically detects which endpoint is being used and displays the appropriate statistics.

    Adding the Cost and Performance Analysis to the Chatbot

    Let’s integrate the pricing module with the chat application by including performance tracking in chat.py.

    First, add the pricing module import below the existing imports:

    from pricing import calculate_and_display_stats
    

    Next, add performance tracking variables to the response handling section. Find the line that starts the response handling:

    print("Bot: ", end="", flush=True)
    

    Add tracking variables right after it:

    print("Bot: ", end="", flush=True)
    
    start_time = time.time()
    chunks = 0
    final_usage = None
    

    These variables track response time, number of streaming chunks, and token usage data.

    Now update the streaming loop to capture the tracking data. Find the current streaming loop:

    for chunk in chat_completion:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            bot_response += content
    

    Replace it with this enhanced version:

    for chunk in chat_completion:
        chunks += 1
    
        if hasattr(chunk, 'usage') and chunk.usage:
            final_usage = chunk.usage
    
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            bot_response += content
    

    This captures the number of chunks and token usage information during streaming.

    Finally, add the cost and performance analysis right after the response completes. Find the line that prints a newline after the response:
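
    print()
    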

    Add the analysis call right after this line:

    response_time = time.time() - start_time
    calculate_and_display_stats(
        bot_response, response_time, chunks, final_usage, 
        use_cerebrium, model
    )
    

    This calculates the total response time and calls our pricing analysis function with all the collected data.

    Now run your chat application again:
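
    python chat.py
    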

    You’ll see that detailed cost and performance statistics automatically appear after each response.

    Comparing OpenAI and Cerebrium Performance With Real Data

    Now that we have comprehensive tracking in place, let’s examine actual performance data from identical prompts sent to both services. The results show clear differences in both speed and cost that impact infrastructure decisions.

    What the Numbers Tell Us

    For simple questions like “What is the capital of France?”:

    OpenAI:

    ─ OpenAI Response Stats ──────────────────────────────────────
      🚀 Speed: 1.62s | 4.3 tokens/sec | 10 chunks
      💰 Cost Breakdown:
         • Input (14 tokens): $0.000002
         • Output (7 tokens): $0.000004
         • Total: $0.000006
      📊 Efficiency: $0.00000030/token | $0.000001/word
      🔧 Method: Token-based pricing (15¢/1M input, 60¢/1M output)
         Model: gpt-4o-mini | Total tokens: 21
    ─────────────────────────────────────────────────────────────────
    

    Cerebrium:

    ─ Cerebrium Response Stats ──────────────────────────────────────
      🚀 Speed: 2.13s | 8 chunks | ~3 tokens/sec
      💰 Cost Breakdown:
         • A10 GPU: $0.000652 (2.13s × $0.000306/s)
         • CPU (2 cores): $0.000028 (2.13s × $0.0000131/s)
         • Memory (12GB): $0.000057 (2.13s × $0.0000267/s)
         • Total: $0.000737
      📊 Efficiency: ~$0.000105/token | $0.000123/word
      🔧 Method: Time-based pricing - you pay for compute seconds
         Hardware: AMPERE_A10 + 2 vCPUs + 12GB RAM
         Estimated: 7 tokens | 6 words
    ─────────────────────────────────────────────────────────────────
    

    For longer responses like “Explain the difference between machine learning and deep learning”:

    OpenAI:

    ─ OpenAI Response Stats ──────────────────────────────────────
      🚀 Speed: 10.69s | 54.5 tokens/sec | 585 chunks
      💰 Cost Breakdown:
         • Input (16 tokens): $0.000002
         • Output (582 tokens): $0.000349
         • Total: $0.000352
      📊 Efficiency: $0.00000059/token | $0.000001/word
      🔧 Method: Token-based pricing (15¢/1M input, 60¢/1M output)
         Model: gpt-4o-mini | Total tokens: 598
    ─────────────────────────────────────────────────────────────────
    

    Cerebrium:

    ─ Cerebrium Response Stats ──────────────────────────────────────
      🚀 Speed: 34.86s | 541 chunks | ~21 tokens/sec
      💰 Cost Breakdown:
         • A10 GPU: $0.010668 (34.86s × $0.000306/s)
         • CPU (2 cores): $0.000457 (34.86s × $0.0000131/s)
         • Memory (12GB): $0.000929 (34.86s × $0.0000267/s)
         • Total: $0.012054
      📊 Efficiency: ~$0.000017/token | $0.000029/word
      🔧 Method: Time-based pricing - you pay for compute seconds
         Hardware: AMPERE_A10 + 2 vCPUs + 12GB RAM
         Estimated: 730 tokens | 414 words
    ─────────────────────────────────────────────────────────────────
    

    These numbers are expected – OpenAI has heavily optimized infrastructure running at massive scale, while our Cerebrium deployment uses default settings on a single A10 GPU.

    The Advantages of Self-Hosting

    Despite the initial performance gap, self-hosting with Cerebrium offers advantages that don’t show up in these raw numbers:

    Hardware control: You can upgrade from an A10 to an H100 GPU and see 3-5 times faster inference speeds. OpenAI’s hardware is fixed – you have no control over the underlying infrastructure.

    Cost predictability: OpenAI’s costs scale unpredictably with output length. A chatbot that generates long responses during peak hours can blow through budgets. Cerebrium’s time-based pricing gives you precise cost control.

    Model flexibility: This example runs Llama 3.1 8B, but it could deploy Llama 3.1 70B for better quality, or switch to specialized models for coding, math, or other domains. OpenAI limits users to pre-selected models.

    Optimization potential: These Cerebrium numbers represent an unoptimized deployment. You can tune GPU memory usage, implement request batching, adjust inference parameters, and optimize for your specific use case.

    Data privacy: Your data never leaves your infrastructure. For applications handling sensitive information, this control is often a legal requirement rather than a preference.

    This guide compares OpenAI’s production-optimized service against a basic Cerebrium deployment. The real question is whether the control and optimization potential justify the initial performance difference.

    Optimizing a Cerebrium Deployment

    The performance gap between OpenAI and Cerebrium reveals significant optimization potential: you can improve a Cerebrium deployment’s performance by changing its configuration and upgrading its hardware. Let’s explore how vLLM optimization can close this gap.

    Memory and Context Optimization

    The gpu_memory_utilization=0.9 setting in main.py allocates 90% of available GPU memory to maximize throughput. For applications with varying load patterns, reduce it to 0.7 to leave headroom for memory spikes:

    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        gpu_memory_utilization=0.7,  # More conservative memory usage
        max_model_len=8192
    )
    

    The max_model_len=8192 parameter controls the context window. Decrease it for faster responses when you don’t need long conversations:

    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        gpu_memory_utilization=0.9,
        max_model_len=4096  # Faster responses, shorter context
    )
    

    Reducing the context window from 8192 to 4096 tokens typically improves response times by 20-30% while using less GPU memory.

    Batch Processing for High-Volume Applications

    For production workloads, implement batching by adjusting the replica_concurrency setting in cerebrium.toml:

    [cerebrium.scaling]
    replica_concurrency = 4  # Allow 4 concurrent requests per replica
    

    This setting allows multiple requests to share GPU resources simultaneously. Instead of processing requests one at a time, vLLM can batch them together, dramatically improving cost efficiency. A single A10 GPU can handle 4-8 concurrent requests with minimal performance impact per request.

    Hardware Upgrades for Maximum Performance

    Upgrading your GPU hardware provides the biggest performance improvement. The example in this guide runs Cerebrium’s entry-level A10 deployment option, but more powerful GPUs offer significantly improved inference times.

    Update cerebrium.toml to upgrade your GPU:

    [cerebrium.hardware]
    cpu = 4
    memory = 24.0
    compute = "AMPERE_A100_40GB"  # Upgrade from A10
    

    Hardware upgrade options:

    • A10 → L40s: 2-3x faster inference, better for production workloads.
    • A10 → A100 (40GB): 3-4x faster inference, ideal for high-throughput applications.
    • A10 → H100: 5-8x faster inference, matches OpenAI’s performance levels.

    An H100 upgrade would likely bring response times from 34.86s to under 8s for long responses, making Cerebrium competitive with OpenAI’s speed while maintaining cost predictability.

    The hardware upgrade path gives you control that OpenAI’s hosted service can’t match. Scale performance based on your specific needs rather than hoping for infrastructure improvements from a third party.

    Conclusion

    Migrating from OpenAI to Cerebrium requires changing just two lines of code, but the decision involves more than technical convenience. Our hands-on testing revealed clear trade-offs:

    • OpenAI excels at: Speed, cost for short responses, and ease of use.
    • Cerebrium excels at: Cost predictability, model choice, data control, and optimization potential.

    The real value proposition depends on your specific requirements. If you need cost predictability, model flexibility, data privacy, or performance optimization, Cerebrium offers compelling advantages. If you prioritize speed and cost-efficiency for short responses, OpenAI remains competitive.

    Migration has never been easier thanks to OpenAI-compatible endpoints. Change two lines of code, and your application runs on self-hosted infrastructure with the same API it already uses.

    Try Cerebrium today with $30 credit available on the free tier, plus step-by-step tutorials that walk through setting up and optimizing today’s top-performing models. Take control of AI infrastructure before the next OpenAI bill surprises you.

    Further Reading

    Ready to explore self-hosting? These tutorials will give you hands-on experience with the tools and techniques covered in this article:

    • Deploy Mistral 7B with vLLM – Start with a popular open-source model and the vLLM inference engine.
    • Create an OpenAI-compatible endpoint with vLLM – Build drop-in replacements for OpenAI API calls.
    • Benchmarking vLLM, SGLang and TensorRT for Llama 3.1 API – Compare performance across different inference engines.
    • Running Llama 3 8B with TensorRT-LLM – Achieve maximum performance with NVIDIA’s optimized serving engine.
