De POC à Production : Industrialiser votre Prompt Engineering

17 min de lecture 3483 mots

tl;dr: Industrialisez vos prompts : architecture en couches (API→Prompt Manager→LLM Gateway), versioning YAML avec tests automatisés, CI/CD GitHub Actions, monitoring Prometheus + alertes, et best practices production (graceful degradation, rate limiting, caching Redis, circuit breakers). De POC à scale.

Votre prompt fonctionne en local… mais comment le déployer à l’échelle ? Gérer les versions ? Monitorer ? Rollback si problème ?

Cet article couvre le passage de POC à Production : architecture, CI/CD, monitoring, et best practices pour opérer des prompts comme du code classique.

Exemples visuels de l’utilisation de prompts en environnement de production pour optimiser les interactions avec les modèles de langage

POC vs Production : Le Gouffre

Aspect	POC (Prototype)	Production
Code	Script Python unique	Architecture multi-couches
Prompts	Hardcodés dans code	Versionnés, externalisés (YAML)
Tests	Manuels	Automatisés (CI/CD)
Déploiement	Copy-paste	Pipeline automatisé
Monitoring	`print()` statements	Prometheus + Grafana
Erreurs	Crash complet	Graceful degradation + fallbacks
Versions	Aucune	Semantic versioning + changelog
Coûts	Inconnus	Trackés temps réel

⚠️ Warning
La vallée de la mort : 78% des POCs IA n’atteignent jamais la production. La raison #1 : sous-estimation de l’infrastructure nécessaire (versioning, monitoring, fallbacks). Planifiez 3-5x plus de temps pour l’industrialisation que pour le POC initial.

Objectif : Transformer vos prompts en actif industriel scalable et maintenable.

Architecture Production

Stack Technique Recommandée

┌─────────────────────────────────────────────────┐
│                   Frontend                       │
│        (React, Vue, Mobile App, etc.)           │
└──────────────────┬──────────────────────────────┘
                   │ HTTPS / WSS
┌──────────────────▼──────────────────────────────┐
│                API Layer                         │
│  (FastAPI, Express, Django)                     │
│  • Authentication (JWT, OAuth)                  │
│  • Rate Limiting (per user/IP)                  │
│  • Input Validation                             │
│  • Logging (requests)                           │
└──────────────────┬──────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────┐
│            Prompt Manager                        │
│  • Versioning (YAML/JSON)                       │
│  • Template Engine (Jinja2)                     │
│  • A/B Testing (traffic split)                  │
│  • Validation (schema)                          │
└──────────────────┬──────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────┐
│             LLM Gateway                          │
│  (LiteLLM, Portkey, Custom)                     │
│  • Model Routing (intelligent)                  │
│  • Load Balancing (multi-providers)             │
│  • Caching (Redis)                              │
│  • Fallbacks (GPT-4→GPT-3.5→Claude)             │
│  • Circuit Breakers                             │
└──────────────────┬──────────────────────────────┘
                   │
     ┌─────────────┼─────────────┬─────────────┐
     │             │             │             │
┌────▼────┐  ┌────▼────┐  ┌────▼────┐  ┌────▼────┐
│ OpenAI  │  │Anthropic│  │ Google  │  │ Local   │
│   API   │  │   API   │  │ Gemini  │  │  Model  │
└─────────┘  └─────────┘  └─────────┘  └─────────┘

┌─────────────────────────────────────────────────┐
│            Observability Layer                   │
│  • Logging (LangSmith, Phoenix)                 │
│  • Metrics (Prometheus)                         │
│  • Tracing (OpenTelemetry)                      │
│  • Dashboards (Grafana)                         │
│  • Alerting (PagerDuty, Slack)                  │
└─────────────────────────────────────────────────┘

Implémentation API Layer (FastAPI)

from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel, Field
from typing import Optional, Dict
import time
from datetime import datetime

app = FastAPI(title="Prompt Engineering API", version="1.0.0")

# Models
class PromptRequest(BaseModel):
    prompt_id: str = Field(..., description="ID du prompt (ex: 'summarization')")
    version: str = Field(default="latest", description="Version (ex: '1.2.0' ou 'latest')")
    variables: Dict = Field(..., description="Variables pour template")
    user_id: str = Field(..., description="ID utilisateur pour analytics")
    model: Optional[str] = Field(default=None, description="Modèle LLM (auto si None)")
    stream: bool = Field(default=False, description="Streaming response")

class PromptResponse(BaseModel):
    output: str
    model_used: str
    version_used: str
    latency_ms: float
    tokens: Dict[str, int]
    cost_usd: float
    cache_hit: bool

# Dépendances
def verify_api_key(x_api_key: str = Header(...)):
    """Vérifier clé API"""
    # En production: vérifier en DB/cache
    valid_keys = {"sk_test_123", "sk_prod_456"}
    if x_api_key not in valid_keys:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key

# Singletons (initialisés au startup)
prompt_manager = None
llm_gateway = None
analytics = None

@app.on_event("startup")
async def startup():
    """Initialiser services"""
    global prompt_manager, llm_gateway, analytics

    prompt_manager = PromptManager(prompts_dir="./prompts")
    llm_gateway = LLMGateway(cache_enabled=True)
    analytics = PromptAnalytics(db_path="analytics.db")

# Endpoints
@app.post("/v1/prompt/execute", response_model=PromptResponse)
async def execute_prompt(
    request: PromptRequest,
    api_key: str = Depends(verify_api_key)
):
    """
    Exécuter un prompt géré.

    Workflow :
    1. Charger template (avec version)
    2. Formater avec variables
    3. Exécuter via LLM Gateway (routing + caching + fallbacks)
    4. Logger analytics
    5. Retourner résultat
    """
    start_time = time.time()

    try:
        # 1. Charger prompt template
        prompt = prompt_manager.get(
            prompt_id=request.prompt_id,
            version=request.version
        )

        # 2. Valider variables
        required_vars = [v['name'] for v in prompt.variables if v.get('required', True)]
        missing = set(required_vars) - set(request.variables.keys())
        if missing:
            raise HTTPException(
                status_code=400,
                detail=f"Missing required variables: {missing}"
            )

        # 3. Formater template
        formatted_prompt = prompt.template.format(**request.variables)

        # 4. Exécuter via Gateway
        result = await llm_gateway.execute(
            prompt=formatted_prompt,
            model=request.model,  # None = auto-routing
            user_id=request.user_id,
            stream=request.stream
        )

        # 5. Logger analytics
        latency_ms = (time.time() - start_time) * 1000

        analytics.log_execution(PromptExecution(
            prompt_id=request.prompt_id,
            prompt_version=prompt.version,
            model=result['model_used'],
            input_text=formatted_prompt,
            output_text=result['output'],
            input_tokens=result['tokens']['input'],
            output_tokens=result['tokens']['output'],
            latency_ms=latency_ms,
            cost_usd=result['cost'],
            user_id=request.user_id,
            timestamp=datetime.now(),
            metadata={
                'cache_hit': result['cache_hit'],
                'api_key': api_key[:10] + "..."
            }
        ))

        # 6. Retourner
        return PromptResponse(
            output=result['output'],
            model_used=result['model_used'],
            version_used=prompt.version,
            latency_ms=latency_ms,
            tokens=result['tokens'],
            cost_usd=result['cost'],
            cache_hit=result['cache_hit']
        )

    except Exception as e:
        # Logger erreur
        logger.error(f"Error executing prompt: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/v1/prompts")
async def list_prompts(api_key: str = Depends(verify_api_key)):
    """Lister tous les prompts disponibles"""
    prompts = prompt_manager.list_all()
    return {"prompts": prompts}

@app.get("/v1/prompts/{prompt_id}/versions")
async def list_versions(prompt_id: str, api_key: str = Depends(verify_api_key)):
    """Lister versions d'un prompt"""
    versions = prompt_manager.list_versions(prompt_id)
    return {"prompt_id": prompt_id, "versions": versions}

@app.get("/health")
async def health_check():
    """Health check pour load balancer"""
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "version": "1.0.0"
    }

Gestion des Versions

Structure de Répertoire

🔎 Tip
Semantic versioning obligatoire : Utilisez MAJOR.MINOR.PATCH (ex: 2.1.3). MAJOR = breaking change, MINOR = nouvelle feature rétrocompatible, PATCH = bugfix. Un versioning clair évite 90% des incidents de déploiement (rollback instantané si problème).

prompts/
├── README.md
├── summarization/
│   ├── v1.0.0.yaml  (initial release)
│   ├── v1.1.0.yaml  (minor: improved instructions)
│   ├── v1.2.0.yaml  (minor: added examples)
│   ├── v2.0.0.yaml  (major: complete rewrite)
│   └── CHANGELOG.md
├── classification/
│   ├── v1.0.0.yaml
│   ├── v1.1.0.yaml
│   └── CHANGELOG.md
└── translation/
    ├── v1.0.0.yaml
    └── CHANGELOG.md

Format YAML (Standardisé)

# prompts/summarization/v1.2.0.yaml

# Metadata
id: summarization
version: 1.2.0
description: "Summarize articles concisely while preserving key information"
author: "[email protected]"
created_at: "2025-01-15"
tags: ["text-processing", "summarization", "content"]

# Performance metadata (from tests)
metadata:
  tested: true
  test_date: "2025-01-15"
  performance_score: 0.82  # Quality score
  avg_cost_usd: 0.015
  avg_latency_ms: 1200
  recommended_model: "gpt-4-turbo"

# Template (Jinja2 syntax)
template: |
  You are an expert summarizer with a talent for extracting key information.

  Your task is to summarize the following article in {{max_words}} words or less.
  Focus on:
  - Main arguments and conclusions
  - Key data points and statistics
  - Important context

  {% if tone %}
  Tone: {{tone}}
  {% endif %}

  Article:
  {{article_text}}

  Summary:

# Variables definition
variables:
  - name: article_text
    type: string
    required: true
    description: "Full text of the article to summarize"

  - name: max_words
    type: integer
    required: false
    default: 100
    description: "Maximum words for summary"
    validation:
      min: 20
      max: 500

  - name: tone
    type: string
    required: false
    default: null
    description: "Desired tone (formal, casual, technical)"
    enum: ["formal", "casual", "technical"]

# Few-shot examples (optional)
examples:
  - input:
      article_text: "Artificial intelligence is transforming healthcare..."
      max_words: 50
    output: "AI is revolutionizing healthcare through improved diagnostics, personalized treatment plans, and drug discovery. Machine learning models can now detect diseases earlier and more accurately than traditional methods."

# Automated tests
tests:
  - name: "output_length"
    description: "Summary should respect max_words constraint"
    input:
      article_text: "Long article text here..."
      max_words: 100
    assertions:
      - type: "max_word_count"
        value: 120  # Allow 20% margin

  - name: "key_terms"
    description: "Summary should contain key terms from article"
    input:
      article_text: "Article about quantum computing and AI..."
      max_words: 50
    assertions:
      - type: "contains_keywords"
        keywords: ["quantum", "AI", "computing"]
        min_matches: 2

Prompt Manager (Python)

from dataclasses import dataclass
from typing import Dict, List, Any, Optional
import yaml
from pathlib import Path
from jinja2 import Template

@dataclass
class PromptVariable:
    name: str
    type: str
    required: bool
    default: Any
    description: str

@dataclass
class PromptVersion:
    id: str
    version: str
    description: str
    template: str
    variables: List[PromptVariable]
    metadata: Dict
    tests: List[Dict]

    def format(self, **kwargs) -> str:
        """Format template with variables using Jinja2"""
        jinja_template = Template(self.template)
        return jinja_template.render(**kwargs)

class PromptManager:
    """Gestionnaire de prompts avec versioning"""

    def __init__(self, prompts_dir: str = "./prompts"):
        self.prompts_dir = Path(prompts_dir)
        self._cache = {}

    def get(self, prompt_id: str, version: str = "latest") -> PromptVersion:
        """
        Charger prompt par ID et version.

        Args:
            prompt_id: ID du prompt (ex: 'summarization')
            version: Version spécifique ou 'latest'

        Returns:
            PromptVersion
        """
        # Check cache
        cache_key = f"{prompt_id}:{version}"
        if cache_key in self._cache:
            return self._cache[cache_key]

        # Trouver fichier version
        prompt_dir = self.prompts_dir / prompt_id

        if not prompt_dir.exists():
            raise ValueError(f"Prompt '{prompt_id}' not found")

        if version == "latest":
            # Trouver version la plus récente
            version_files = sorted(
                prompt_dir.glob("v*.yaml"),
                key=lambda p: self._parse_version(p.stem)
            )
            if not version_files:
                raise ValueError(f"No versions found for '{prompt_id}'")
            version_file = version_files[-1]
        else:
            version_file = prompt_dir / f"v{version}.yaml"
            if not version_file.exists():
                raise ValueError(f"Version {version} not found for '{prompt_id}'")

        # Charger YAML
        with open(version_file, 'r', encoding='utf-8') as f:
            data = yaml.safe_load(f)

        # Parser variables
        variables = [
            PromptVariable(
                name=v['name'],
                type=v['type'],
                required=v.get('required', True),
                default=v.get('default'),
                description=v.get('description', '')
            )
            for v in data.get('variables', [])
        ]

        # Créer PromptVersion
        prompt = PromptVersion(
            id=data['id'],
            version=data['version'],
            description=data['description'],
            template=data['template'],
            variables=variables,
            metadata=data.get('metadata', {}),
            tests=data.get('tests', [])
        )

        # Cache
        self._cache[cache_key] = prompt

        return prompt

    def list_versions(self, prompt_id: str) -> List[str]:
        """Lister toutes les versions d'un prompt"""
        prompt_dir = self.prompts_dir / prompt_id
        if not prompt_dir.exists():
            return []

        versions = [
            self._parse_version(f.stem)
            for f in sorted(prompt_dir.glob("v*.yaml"))
        ]
        return versions

    def list_all(self) -> List[Dict]:
        """Lister tous les prompts disponibles"""
        prompts = []
        for prompt_dir in self.prompts_dir.iterdir():
            if prompt_dir.is_dir() and not prompt_dir.name.startswith('.'):
                versions = self.list_versions(prompt_dir.name)
                if versions:
                    latest = self.get(prompt_dir.name, "latest")
                    prompts.append({
                        'id': prompt_dir.name,
                        'description': latest.description,
                        'versions': versions,
                        'latest_version': versions[-1]
                    })
        return prompts

    @staticmethod
    def _parse_version(version_str: str) -> tuple:
        """Parser version string vers tuple (major, minor, patch)"""
        # "v1.2.3" → (1, 2, 3)
        version_str = version_str.lstrip('v')
        parts = version_str.split('.')
        return tuple(int(p) for p in parts)

# Utilisation
pm = PromptManager("./prompts")

# Charger latest version
prompt = pm.get("summarization", version="latest")

# Formater avec variables
formatted = prompt.format(
    article_text="Long article about AI...",
    max_words=100,
    tone="technical"
)

print(formatted)

CI/CD Pipeline

Tests Automatisés (Pytest)

# tests/test_prompts.py
import pytest
from pathlib import Path
import yaml
from prompt_manager import PromptManager
from llm_gateway import LLMGateway

@pytest.fixture
def prompt_manager():
    return PromptManager("./prompts")

@pytest.fixture
def llm_gateway():
    return LLMGateway(cache_enabled=False)  # Pas de cache pour tests

class TestPromptValidity:
    """Tests de validité des fichiers YAML"""

    def test_all_prompts_valid_yaml(self):
        """Tous les fichiers YAML doivent être parsables"""
        for yaml_file in Path("./prompts").rglob("*.yaml"):
            with open(yaml_file) as f:
                data = yaml.safe_load(f)

            # Assertions basiques
            assert 'id' in data, f"{yaml_file}: missing 'id'"
            assert 'version' in data, f"{yaml_file}: missing 'version'"
            assert 'template' in data, f"{yaml_file}: missing 'template'"
            assert 'variables' in data, f"{yaml_file}: missing 'variables'"

    def test_version_format(self):
        """Versions doivent suivre semver (x.y.z)"""
        import re
        version_pattern = re.compile(r'^\d+\.\d+\.\d+$')

        for yaml_file in Path("./prompts").rglob("v*.yaml"):
            with open(yaml_file) as f:
                data = yaml.safe_load(f)

            version = data.get('version')
            assert version_pattern.match(version), \
                f"{yaml_file}: invalid version '{version}'"

class TestPromptFunctionality:
    """Tests fonctionnels des prompts"""

    def test_summarization_prompt(self, prompt_manager, llm_gateway):
        """Test du prompt de summarization"""

        # Charger prompt
        prompt = prompt_manager.get("summarization", version="latest")

        # Cas de test
        test_article = """
        Artificial intelligence is rapidly transforming the healthcare industry.
        Machine learning algorithms can now diagnose diseases with accuracy
        rivaling human experts. Recent studies show AI can detect cancer in
        medical imaging with 95% accuracy, compared to 88% for radiologists.
        """

        # Formater
        formatted = prompt.format(
            article_text=test_article,
            max_words=50
        )

        # Exécuter (utiliser modèle cheap pour tests)
        result = llm_gateway.execute(
            prompt=formatted,
            model="gpt-3.5-turbo"
        )

        output = result['output']

        # Assertions
        word_count = len(output.split())
        assert word_count <= 60, f"Summary too long: {word_count} words"
        assert word_count >= 20, f"Summary too short: {word_count} words"

        # Keywords présents
        assert 'AI' in output or 'artificial intelligence' in output.lower()
        assert 'healthcare' in output.lower() or 'medical' in output.lower()

    @pytest.mark.parametrize("prompt_id,version", [
        ("summarization", "latest"),
        ("classification", "latest"),
    ])
    def test_prompt_automated_tests(self, prompt_manager, llm_gateway, prompt_id, version):
        """Exécuter tests automatisés définis dans YAML"""

        prompt = prompt_manager.get(prompt_id, version)

        for test in prompt.tests:
            # Formater prompt avec inputs de test
            formatted = prompt.format(**test['input'])

            # Exécuter
            result = llm_gateway.execute(formatted, model="gpt-3.5-turbo")
            output = result['output']

            # Vérifier assertions
            for assertion in test.get('assertions', []):
                if assertion['type'] == 'max_word_count':
                    word_count = len(output.split())
                    assert word_count <= assertion['value'], \
                        f"Test '{test['name']}' failed: {word_count} > {assertion['value']}"

                elif assertion['type'] == 'contains_keywords':
                    keywords = assertion['keywords']
                    min_matches = assertion.get('min_matches', len(keywords))

                    matches = sum(1 for kw in keywords if kw.lower() in output.lower())
                    assert matches >= min_matches, \
                        f"Test '{test['name']}' failed: only {matches}/{min_matches} keywords found"

GitHub Actions Workflow

# .github/workflows/test-prompts.yml
name: Test Prompts

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'tests/**'
      - 'src/**'
  push:
    branches:
      - main

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install yamllint
        run: pip install yamllint

      - name: Lint YAML files
        run: yamllint prompts/ --strict

  test:
    runs-on: ubuntu-latest
    needs: lint

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Cache dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pytest tests/ -v --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

  estimate-costs:
    runs-on: ubuntu-latest
    needs: test
    if: github.event_name == 'pull_request'

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Estimate cost impact
        run: |
          python scripts/estimate_costs.py \
            --base-ref origin/${{ github.base_ref }} \
            --head-ref ${{ github.sha }} \
            --output costs_report.json

      - name: Comment PR with costs
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const costs = JSON.parse(fs.readFileSync('costs_report.json', 'utf8'));

            const body = `## 💰 Cost Impact Analysis

            **Changes detected in prompts:**
            ${costs.changed_prompts.map(p => `- \`${p.id}\` (v${p.version})`).join('\n')}

            **Estimated monthly cost impact:**
            - Current: $${costs.current_monthly_cost.toFixed(2)}
            - Projected: $${costs.projected_monthly_cost.toFixed(2)}
            - Delta: ${costs.delta >= 0 ? '+' : ''}$${costs.delta.toFixed(2)} (${costs.delta_percent.toFixed(1)}%)

            ${costs.delta > 50 ? '⚠️ **Warning:** Significant cost increase detected!' : '✅ Cost impact acceptable'}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  deploy:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: github.ref == 'refs/heads/main'

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Deploy prompts to S3
        run: |
          aws s3 sync ./prompts s3://company-prompts/prod/ \
            --delete \
            --cache-control max-age=300

      - name: Invalidate API cache
        run: |
          curl -X POST https://api.company.com/v1/admin/cache/invalidate \
            -H "Authorization: Bearer ${{ secrets.API_ADMIN_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d '{"scope": "prompts"}'

      - name: Create GitHub Release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: prompts-${{ github.sha }}
          release_name: Prompt Release $(date +'%Y-%m-%d')
          body: |
            Automated prompt deployment

            Commit: ${{ github.sha }}
            Author: ${{ github.actor }}

Monitoring et Alerting

Métriques Prometheus

from prometheus_client import Counter, Histogram, Gauge, Info
import time

# Metrics
prompt_executions_total = Counter(
    'prompt_executions_total',
    'Total number of prompt executions',
    ['prompt_id', 'version', 'model', 'status']  # Labels
)

prompt_latency_seconds = Histogram(
    'prompt_latency_seconds',
    'Prompt execution latency in seconds',
    ['prompt_id', 'model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

prompt_cost_usd = Histogram(
    'prompt_cost_usd',
    'Cost per prompt execution in USD',
    ['prompt_id', 'model'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)

prompt_cache_hits_total = Counter(
    'prompt_cache_hits_total',
    'Total cache hits',
    ['prompt_id']
)

active_prompts_count = Gauge(
    'active_prompts_count',
    'Number of active prompt templates'
)

# Instrumentation
async def execute_prompt_monitored(
    prompt_id: str,
    version: str,
    model: str,
    variables: dict
):
    """Execute prompt with Prometheus monitoring"""

    start_time = time.time()
    status = 'success'

    try:
        # Execute
        result = await execute_prompt(prompt_id, version, model, variables)

        # Record cache hit
        if result.get('cache_hit'):
            prompt_cache_hits_total.labels(prompt_id=prompt_id).inc()

        # Record cost
        prompt_cost_usd.labels(
            prompt_id=prompt_id,
            model=result['model_used']
        ).observe(result['cost'])

        return result

    except Exception as e:
        status = 'error'
        raise

    finally:
        # Record latency
        latency = time.time() - start_time
        prompt_latency_seconds.labels(
            prompt_id=prompt_id,
            model=model
        ).observe(latency)

        # Record execution
        prompt_executions_total.labels(
            prompt_id=prompt_id,
            version=version,
            model=model,
            status=status
        ).inc()

# Expose metrics endpoint
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint"""
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )

Alertes (Prometheus Rules)

# prometheus/alerts.yml
groups:
  - name: prompts
    interval: 30s
    rules:
      - alert: HighPromptErrorRate
        expr: |
          (
            rate(prompt_executions_total{status="error"}[5m])
            /
            rate(prompt_executions_total[5m])
          ) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for prompt {{ $labels.prompt_id }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighPromptLatency
        expr: |
          histogram_quantile(0.95,
            rate(prompt_latency_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency for {{ $labels.prompt_id }}"
          description: "P95 latency is {{ $value }}s (threshold: 5s)"

      - alert: PromptCostSpike
        expr: |
          rate(prompt_cost_usd_sum[1h]) > 100
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "High cost rate detected"
          description: "Spending ${{ $value }}/hour (threshold: $100/h)"

      - alert: LowCacheHitRate
        expr: |
          (
            rate(prompt_cache_hits_total[15m])
            /
            rate(prompt_executions_total[15m])
          ) < 0.2
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }} (target: >20%)"

Best Practices Production

Graceful Degradation (Fallbacks)

async def execute_with_fallback(
    prompt: str,
    models: List[str] = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-haiku"],
    timeout: int = 10
) -> Dict:
    """
    Try multiple models until success.

    Args:
        prompt: Formatted prompt
        models: List of models to try (order of preference)
        timeout: Timeout per model (seconds)

    Returns:
        Result dict
    """
    last_error = None

    for i, model in enumerate(models):
        try:
            logger.info(f"Trying model {model} (attempt {i+1}/{len(models)})")

            result = await call_llm(
                model=model,
                prompt=prompt,
                timeout=timeout
            )

            logger.info(f"Success with model {model}")
            return result

        except TimeoutError as e:
            logger.warning(f"Model {model} timeout ({timeout}s)")
            last_error = e
            continue

        except RateLimitError as e:
            logger.warning(f"Model {model} rate limited")
            last_error = e
            continue

        except Exception as e:
            logger.error(f"Model {model} failed: {e}")
            last_error = e
            continue

    # All models failed
    raise Exception(
        f"All {len(models)} models failed. Last error: {last_error}"
    )

Rate Limiting

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/prompt/execute")
@limiter.limit("100/minute")  # 100 requests per minute per IP
async def execute_prompt(
    request: Request,
    prompt_request: PromptRequest
):
    ...

# Rate limit différencié par tier
@limiter.limit("1000/hour", key_func=lambda: get_user_tier())
async def execute_prompt_premium(...):
    ...

Caching Intelligent (Redis)

import redis
import hashlib
import json
from typing import Optional

redis_client = redis.Redis(
    host='localhost',
    port=6379,
    db=0,
    decode_responses=True
)

def get_cache_key(prompt: str, model: str) -> str:
    """Generate deterministic cache key"""
    content = f"{model}:{prompt}"
    return hashlib.sha256(content.encode()).hexdigest()

async def cached_llm_call(
    prompt: str,
    model: str,
    ttl: int = 3600  # 1 hour
) -> Dict:
    """
    Call LLM with caching.

    Args:
        prompt: Formatted prompt
        model: Model name
        ttl: Time to live in seconds

    Returns:
        Result dict with 'cache_hit' boolean
    """
    cache_key = get_cache_key(prompt, model)

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        logger.info(f"Cache HIT for {cache_key[:8]}...")
        result = json.loads(cached)
        result['cache_hit'] = True
        return result

    logger.info(f"Cache MISS for {cache_key[:8]}...")

    # Call LLM
    result = await call_llm(model, prompt)
    result['cache_hit'] = False

    # Store in cache
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(result)
    )

    return result

Circuit Breaker

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60, expected_exception=Exception)
async def call_llm_with_circuit_breaker(model: str, prompt: str):
    """
    Call LLM with circuit breaker pattern.

    - If 5 consecutive failures → circuit opens for 60s
    - During open state → immediately raise CircuitBreakerError
    - After 60s → try again (half-open state)
    """
    return await call_llm(model, prompt)

# Usage
try:
    result = await call_llm_with_circuit_breaker("gpt-4-turbo", prompt)
except CircuitBreakerError:
    # Circuit is open → use fallback
    result = await call_llm("gpt-3.5-turbo", prompt)

6Documentation et Handoff

Template Documentation

Chaque prompt doit avoir sa documentation :

# Prompt: Summarization v1.2.0

## Description {#description}
Summarizes articles concisely (20-500 words) while preserving key information.
Optimized for news articles, blog posts, and research papers.

## Use Cases {#use-cases}
- Blog post summaries for social media
- Newsletter digest generation
- Research paper abstracts
- Content curation

## Parameters {#parameters}

### Required
- `article_text` (string): Full text of the article to summarize

### Optional
- `max_words` (integer, default=100): Maximum summary length
  - Min: 20, Max: 500
- `tone` (string, default=null): Desired tone
  - Options: "formal", "casual", "technical"

## Example Usage {#example-usage}

\`\`\`python
from prompt_manager import PromptManager

pm = PromptManager()
prompt = pm.get("summarization", version="1.2.0")

result = prompt.format(
    article_text="Long article about AI...",
    max_words=75,
    tone="technical"
)

output = llm.execute(result, model="gpt-4-turbo")
\`\`\`

## Performance (tested on 500 articles) {#performance-tested-on-500-articles}
- **Quality Score**: 0.82 (semantic similarity vs human summaries)
- **Average Latency**: 1200ms (GPT-4 Turbo)
- **Average Cost**: $0.015
- **Cache Hit Rate**: 35%

## Recommended Models {#recommended-models}
1. **GPT-4 Turbo** (best quality)
2. **GPT-3.5 Turbo** (good quality, 5x cheaper)
3. **Claude 3 Haiku** (fastest, cheapest)

## Known Limitations {#known-limitations}
- May struggle with highly technical jargon (medical, legal)
- Optimal for articles 500-5000 words
- Not suitable for poetry or creative writing
- Occasional repetition for very short articles (<200 words)

## Changelog {#changelog}
- **v1.0.0** (2025-01-01): Initial release
- **v1.1.0** (2025-01-10): Improved instructions clarity (+8% quality)
- **v1.2.0** (2025-01-15): Added few-shot examples (+12% quality)

## Maintenance {#maintenance}
- **Owner**: Data Science Team
- **Reviewers**: [email protected], [email protected]
- **Next Review**: 2025-04-01

Checklist Production-Ready

Avant le déploiement

Architecture : Multi-layer (API→Manager→Gateway→LLM)
Versioning : Prompts externalisés (YAML), semantic versioning
Tests : Automatisés (pytest), >80% coverage
CI/CD : GitHub Actions avec lint + test + deploy
Monitoring : Prometheus + Grafana + alertes
Sécurité : API keys, rate limiting, input validation
Performance : Caching (Redis), fallbacks, circuit breakers
Documentation : README par prompt, runbooks, architecture diagram

Opérations continues

Monitoring quotidien : Dashboard Grafana, check alertes
Review hebdomadaire : Coûts, latence, erreurs
Review mensuelle : Qualité (A/B tests), optimisations
Updates trimestriels : Nouveaux modèles, best practices

Points Clés à Retenir

Architecture en couches : Séparation API / Prompt Manager / LLM Gateway
Versioning sémantique : YAML externalisés, changelog, tests automatisés
CI/CD automatisé : Lint → Test → Cost estimate → Deploy → Monitor
Observability complète : Prometheus metrics + alertes + dashboards
Resilience : Graceful degradation, fallbacks, circuit breakers, caching
Documentation : Chaque prompt documenté, runbooks, architecture visible

Conclusion de la Série

Vous avez maintenant tous les outils pour maîtriser le Prompt Engineering :

Fondamentaux : Zero-Shot, Few-Shot, Chain-of-Thought
Techniques avancées : ToT, ReAct, Self-Consistency, Constitutional AI
Cas d’usage : Templates prêts pour copywriting, code, data, support
Outils : LangChain, marketplaces, versioning, testing, optimisation auto
Sécurité : Prompt injection, jailbreaking, guardrails, PII protection
Performance : Métriques, A/B testing, optimisation coûts (-60%), latence
Production : Architecture, CI/CD, monitoring, best practices

Prochaine étape : Appliquer ces techniques à vos projets, mesurer les résultats, itérer et partager vos learnings !

Retour : Mesurer et Optimiser Performance (KPIs, A/B Testing, Coûts) Index de la série : Prompt Engineering : Guide Complet 2025