LLM AI Clients Technical Documentation

1. Overview

Purpose: The LLM client module of the AIECS system provides a unified large language model interface, supporting three major AI service providers: OpenAI, Google Vertex AI, and xAI (Grok). Through abstract base class design, this module implements a unified calling interface for multiple vendors, solving critical issues such as complex AI service integration, vendor lock-in, and cost control.

Core Value:

  • Unified Interface: Provides consistent API calling methods, shielding differences between vendors

  • Multi-Vendor Support: Simultaneously supports three major services: OpenAI, Vertex AI, and xAI

  • Cost Control: Built-in token cost estimation and usage statistics

  • High Availability: Integrated retry mechanisms and error handling

  • Streaming Support: Supports real-time streaming text generation

2. Problem Background & Design Motivation

2.1 Business Pain Points

The following challenges are faced in AI application development:

  1. Vendor Lock-in Risk: Single vendor dependency leads to business risks

  2. Inconsistent Interfaces: Different vendor APIs vary greatly, high development costs

  3. Uncontrollable Costs: Lack of unified cost monitoring and optimization mechanisms

  4. Availability Guarantee: Single point of failure affects business continuity

  5. Performance Optimization: Lack of unified performance monitoring and tuning

2.2 Design Motivation

Based on the above pain points, a unified LLM client architecture was designed:

  • Abstraction Design: Define unified interface through BaseLLMClient

  • Multi-Vendor Adaptation: Implement dedicated clients for each vendor

  • Cost Transparency: Built-in token cost calculation and statistics

  • Fault Tolerance: Integrated retry and degradation strategies

3. Architecture Positioning & Context

3.1 System Architecture Diagram

graph TB
    subgraph "Business Layer"
        A[Chat Service] --> B[Document Processing]
        B --> C[Code Generation]
    end
    
    subgraph "LLM Client Layer"
        D[BaseLLMClient] --> E[OpenAIClient]
        D --> F[VertexAIClient]
        D --> G[XAIClient]
    end
    
    subgraph "External Services"
        H[OpenAI API] --> I[GPT-4/GPT-3.5]
        J[Google Vertex AI] --> K[Gemini 2.5]
        L[xAI API] --> M[Grok Series]
    end
    
    subgraph "Infrastructure"
        N[Configuration Management] --> O[API Keys]
        P[Monitoring System] --> Q[Cost Statistics]
    end
    
    A --> D
    E --> H
    F --> J
    G --> L
    D --> N
    D --> P

3.2 Upstream and Downstream Dependencies

Upstream Callers:

  • Business service layer (chat, document processing, etc.)

  • Callback processors (token statistics, monitoring, etc.)

Downstream Dependencies:

  • Vendor API services

  • Configuration management system

  • Monitoring and logging systems

4. Core Features & Use Cases

4.1 OpenAI Client

Core Features:

  • Supports GPT-4, GPT-3.5, and other models

  • Built-in cost estimation

  • Retry mechanism and error handling

  • Streaming text generation

Usage Scenarios:

from aiecs.llm.openai_client import OpenAIClient
from aiecs.llm.base_client import LLMMessage

# Create client
client = OpenAIClient()

# Basic text generation
messages = [
    LLMMessage(role="user", content="Explain quantum computing")
]
response = await client.generate_text(messages, model="gpt-4")
print(f"Response: {response.content}")
print(f"Cost: ${response.cost_estimate:.4f}")

# Streaming generation
async for chunk in client.stream_text(messages):
    print(chunk, end="", flush=True)

4.2 Vertex AI Client

Core Features:

  • Supports Gemini 2.5 series models

  • Google Cloud authentication integration

  • Safety filter configuration

  • Asynchronous execution support

Usage Scenarios:

from aiecs.llm.vertex_client import VertexAIClient

# Create client
client = VertexAIClient()

# Multi-turn conversation
messages = [
    LLMMessage(role="user", content="Hello"),
    LLMMessage(role="assistant", content="Hello! How can I help you?"),
    LLMMessage(role="user", content="Please write a poem")
]
response = await client.generate_text(messages, model="gemini-2.5-pro")

4.3 xAI Client

Core Features:

  • Supports Grok series models

  • OpenAI API format compatibility

  • Model mapping and alias support

  • Long timeout configuration

Usage Scenarios:

from aiecs.llm.xai_client import XAIClient

# Create client
client = XAIClient()

# Use different Grok models
models = ["grok-4", "grok-3-reasoning", "grok-3-mini"]
for model in models:
    response = await client.generate_text(messages, model=model)
    print(f"{model}: {response.content[:100]}...")

5. API Reference

5.1 BaseLLMClient Abstract Base Class

Constructor

def __init__(self, provider_name: str)

Abstract Methods

generate_text
async def generate_text(
    self,
    messages: List[LLMMessage],
    model: Optional[str] = None,
    temperature: float = 0.7,
    max_tokens: Optional[int] = None,
    **kwargs
) -> LLMResponse

Parameters:

  • messages: Message list

  • model: Model name (optional)

  • temperature: Temperature parameter (0.0-1.0)

  • max_tokens: Maximum token count

  • **kwargs: Additional parameters

Returns: LLMResponse object

stream_text
async def stream_text(
    self,
    messages: List[LLMMessage],
    model: Optional[str] = None,
    temperature: float = 0.7,
    max_tokens: Optional[int] = None,
    **kwargs
) -> AsyncGenerator[str, None]
close
async def close(self)

5.2 OpenAIClient

Constructor

def __init__(self)

Features

  • Supports GPT-4, GPT-3.5, and other models

  • Built-in cost estimation

  • Retry mechanism (3 attempts, exponential backoff)

  • Streaming support

Cost Estimation

token_costs = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

5.3 VertexAIClient

Constructor

def __init__(self)

Features

  • Supports Gemini 2.5 series

  • Google Cloud authentication

  • Safety filter configuration

  • Asynchronous execution

Safety Settings

safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}

5.4 XAIClient

Constructor

def __init__(self)

Features

  • Supports Grok series models

  • OpenAI API compatibility

  • Model mapping support

  • Long timeout configuration (360 seconds)

Model Mapping

model_map = {
    "grok-4": "grok-4",
    "grok-3": "grok-3",
    "grok-3-reasoning": "grok-3-reasoning",
    "grok-3-mini": "grok-3-mini",
    # ... more models
}

6. Technical Implementation Details

6.1 Asynchronous Processing Mechanism

Design Principles:

  • All API calls are asynchronous

  • Use asyncio for concurrency control

  • Support both streaming and non-streaming modes

Implementation Details:

# Asynchronous text generation
async def generate_text(self, messages, **kwargs):
    client = self._get_client()
    response = await client.chat.completions.create(
        model=model,
        messages=openai_messages,
        **kwargs
    )
    return self._process_response(response)

6.2 Retry Mechanism

Strategy:

  • Use tenacity library for retry implementation

  • Exponential backoff strategy: 1s, 2s, 4s

  • Maximum 3 retries

  • Retry for specific exception types

Configuration:

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type((httpx.RequestError, RateLimitError))
)

6.3 Error Handling

Layered Processing:

  1. API Level: Handle HTTP errors and API limits

  2. Client Level: Handle authentication and configuration errors

  3. Business Level: Handle content filtering and response errors

Exception Types:

  • ProviderNotAvailableError: Service unavailable

  • RateLimitError: Rate limiting

  • LLMClientError: General client error

6.4 Cost Estimation

OpenAI Costs:

def _estimate_cost(self, model, input_tokens, output_tokens, token_costs):
    if model in token_costs:
        costs = token_costs[model]
        return (input_tokens * costs["input"] + output_tokens * costs["output"]) / 1000
    return 0.0

Vertex AI Costs:

  • Gemini 2.5 Pro: $0.00125/1K input, $0.00375/1K output

  • Gemini 2.5 Flash: $0.000075/1K input, $0.0003/1K output

6.5 Streaming Processing

Implementation:

async def stream_text(self, messages, **kwargs):
    stream = await client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        **kwargs
    )
    
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

7. Configuration & Deployment

7.1 Environment Variable Configuration

OpenAI Configuration:

OPENAI_API_KEY=sk-...

Vertex AI Configuration:

VERTEX_PROJECT_ID=your-project-id
VERTEX_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

xAI Configuration:

XAI_API_KEY=xai-...
# Or backward compatible
GROK_API_KEY=xai-...

7.2 Dependency Management

Core Dependencies:

# requirements.txt
openai>=1.0.0
google-cloud-aiplatform>=1.0.0
tenacity>=8.0.0
httpx>=0.24.0

7.3 Deployment Configuration

Docker Configuration:

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
CMD ["python", "-m", "aiecs.llm"]

Kubernetes Configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  OPENAI_API_KEY: "sk-..."
  VERTEX_PROJECT_ID: "your-project"
  XAI_API_KEY: "xai-..."

8. Maintenance & Troubleshooting

8.1 Monitoring Metrics

Key Metrics:

  • API call success rate

  • Response time distribution

  • Token usage

  • Cost statistics

  • Error rate

Monitoring Configuration:

from prometheus_client import Counter, Histogram

api_calls_total = Counter('llm_api_calls_total', 'Total API calls', ['provider', 'model'])
api_duration = Histogram('llm_api_duration_seconds', 'API call duration', ['provider'])
api_errors = Counter('llm_api_errors_total', 'API errors', ['provider', 'error_type'])

8.2 Common Issues and Solutions

8.2.1 API Key Issues

Symptoms:

  • ProviderNotAvailableError: API key not configured

  • Authentication failure errors

Solution:

# Check environment variables
echo $OPENAI_API_KEY
echo $XAI_API_KEY

# Validate key format
python -c "import os; print(len(os.getenv('OPENAI_API_KEY', '')))"

8.2.2 Rate Limiting

Symptoms:

  • RateLimitError: Rate limit exceeded

  • HTTP 429 errors

Solution:

# Increase retry delay
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=2, min=4, max=60)
)

8.2.3 Content Filtering

Symptoms:

  • Vertex AI returns empty content

  • Safety filters block responses

Solution:

# Adjust safety settings
safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    # ... other settings
}

8.3 Performance Optimization

Caching Strategy:

from functools import lru_cache

@lru_cache(maxsize=128)
def _get_client(self):
    return AsyncOpenAI(api_key=self.api_key)

Connection Pool Configuration:

client = AsyncOpenAI(
    api_key=api_key,
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(max_keepalive_connections=20, max_connections=100)
    )
)

9. Visualizations

9.1 Architecture Flow Diagram

sequenceDiagram
    participant App as Application
    participant Client as LLM Client
    participant API as External API
    participant Monitor as Monitoring System

    App->>Client: generate_text(messages)
    Client->>Client: Validate configuration
    Client->>API: Send request
    API->>Client: Return response
    Client->>Client: Process response
    Client->>Monitor: Record metrics
    Client->>App: Return result

9.2 Cost Analysis Diagram

pie title "Token Cost Distribution"
    "GPT-4" : 45
    "GPT-3.5" : 30
    "Gemini 2.5" : 20
    "Grok" : 5

9.3 Performance Monitoring Diagram

xychart-beta
    title "API Response Time Trend"
    x-axis ["Jan", "Feb", "Mar", "Apr", "May"]
    y-axis "Response Time (ms)" 0 --> 5000
    line [1000, 1200, 800, 900, 1100]

10. Version History

v1.0.0 (2024-01-15)

New Features:

  • Implement BaseLLMClient abstract base class

  • Support OpenAI GPT series models

  • Integrate cost estimation and retry mechanism

  • Add streaming text generation support

v1.1.0 (2024-02-01)

New Features:

  • Add Vertex AI client support

  • Integrate Google Cloud authentication

  • Support Gemini 2.5 series models

  • Add safety filter configuration

v1.2.0 (2024-03-01)

New Features:

  • Add xAI (Grok) client support

  • Implement model mapping and alias system

  • Support all Grok series models

  • Add long timeout configuration

v1.3.0 (2024-04-01) [Planned]

Planned Features:

  • Add Anthropic Claude support

  • Implement intelligent model selection

  • Add caching mechanism

  • Support batch processing


Appendix

B. Example Code

C. Technical Support

  • Technical Documentation: https://docs.aiecs.com

  • Issue Reporting: https://github.com/aiecs/issues