KPIs, A/B Testing, Coûts : Optimiser ROI de vos Prompts
Vos prompts fonctionnent… mais à quel prix ? Quelle qualité réelle ? Comment s’améliorer ?
Sans métriques, vous pilotez à l’aveugle. Cet article couvre les 3 piliers de la performance : Qualité, Coût et Latence, avec frameworks pratiques pour mesurer et optimiser.

Les 3 Dimensions de la Performance
| Dimension | Impact | Levier Principal |
|---|---|---|
| Qualité | Satisfaction utilisateur, taux résolution | Prompt engineering, modèle |
| Coût | Budget, scalabilité | Routing, caching, compression |
| Latence | UX, conversion | Modèle rapide, streaming, async |
Le triangle d’or : Impossible d’optimiser les 3 simultanément. Définissez votre priorité : qualité maximale (GPT-4), coût minimal (Haiku + caching), ou latence ultra-basse (GPT-3.5 + streaming). Amazon a réduit ses coûts LLM de 60% en routant 70% des requêtes simples vers Claude Haiku.
Trade-off : GPT-4 (haute qualité, coût élevé, lent) vs GPT-3.5-Turbo (qualité OK, cheap, rapide).
Objectif : Trouver le sweet spot pour votre use case.
Métriques de Performance
Métriques Qualité (Précision)
| Métrique | Description | Calcul | Cible | Use Case |
|---|---|---|---|---|
| Accuracy | % réponses correctes | Correct / Total | >90% | Classification, QA |
| Precision | % prédictions positives correctes | TP / (TP+FP) | >85% | Détection spam, modération |
| Recall | % vrais positifs détectés | TP / (TP+FN) | >85% | Extraction d’entités |
| F1 Score | Moyenne harmonique P & R | 2×P×R/(P+R) | >0.85 | Équilibre P/R |
| BLEU | Similarité n-grams (traduction) | 0-1 | >0.5 | Traduction, génération |
| ROUGE-L | Overlap avec référence (résumé) | 0-1 | >0.6 | Résumés |
| Semantic Similarity | Cosine embeddings | 0-1 | >0.8 | Paraphrase, QA |
| Human Satisfaction | Note utilisateur | 1-5 | >4.0 | Chatbot, support |
Implémentation :
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
def calculate_quality_metrics(predictions: List[str], references: List[str]) -> dict:
"""
Calculer métriques qualité.
Args:
predictions: Sorties modèle
references: Réponses attendues
Returns:
Dict avec toutes les métriques
"""
# Classification metrics (si applicable)
if all(p in ['positive', 'negative', 'neutral'] for p in predictions):
accuracy = accuracy_score(references, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
references,
predictions,
average='weighted'
)
return {
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1_score': f1
}
# Semantic similarity (pour génération texte)
from openai import OpenAI
client = OpenAI()
similarities = []
for pred, ref in zip(predictions, references):
# Embeddings
emb_pred = client.embeddings.create(
input=pred,
model="text-embedding-3-small"
).data[0].embedding
emb_ref = client.embeddings.create(
input=ref,
model="text-embedding-3-small"
).data[0].embedding
# Cosine similarity
dot_product = np.dot(emb_pred, emb_ref)
norm_pred = np.linalg.norm(emb_pred)
norm_ref = np.linalg.norm(emb_ref)
similarity = dot_product / (norm_pred * norm_ref)
similarities.append(similarity)
return {
'avg_semantic_similarity': np.mean(similarities),
'min_similarity': np.min(similarities),
'max_similarity': np.max(similarities)
}
# Utilisation
predictions = ["Le chat dort.", "Il pleut aujourd'hui.", "Python est génial."]
references = ["Le chat se repose.", "Temps pluvieux.", "J'adore Python."]
metrics = calculate_quality_metrics(predictions, references)
print(metrics)
# {'avg_semantic_similarity': 0.87, 'min_similarity': 0.82, ...}
Métriques Business
| Métrique | Description | Calcul | Cible |
|---|---|---|---|
| Resolution Rate | % requêtes résolues sans humain | Résolues auto / Total | >70% |
| Time to Resolution | Médiane temps résolution | Médiane(temps) | <2 min |
| CSAT | Customer Satisfaction Score | % satisfaits | >80% |
| NPS | Net Promoter Score | % Promoteurs - Détracteurs | >30 |
| Conversion Rate | % utilisateurs convertis | Conversions / Visiteurs | Dépend |
| Cost per Interaction | Coût moyen/requête | Coût total / Requêtes | <$0.01 |
| Deflection Rate | % tickets évités | Tickets évités / Total | >50% |
Implémentation :
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
@dataclass
class InteractionMetrics:
"""Métriques d'une interaction chatbot"""
interaction_id: str
timestamp: datetime
resolved_automatically: bool # True si pas d'escalade humaine
resolution_time_seconds: float
user_rating: Optional[int] # 1-5 ou None
conversion: bool # True si utilisateur a acheté/inscrit
cost_usd: float
def calculate_business_metrics(interactions: List[InteractionMetrics]) -> dict:
"""Calculer métriques business"""
total = len(interactions)
# Resolution rate
auto_resolved = sum(1 for i in interactions if i.resolved_automatically)
resolution_rate = auto_resolved / total if total > 0 else 0
# Time to resolution (médiane)
times = [i.resolution_time_seconds for i in interactions]
median_time = np.median(times)
# CSAT (% ratings >= 4)
ratings = [i.user_rating for i in interactions if i.user_rating is not None]
csat = sum(1 for r in ratings if r >= 4) / len(ratings) if ratings else 0
# NPS (% 9-10 promoteurs - % 0-6 détracteurs)
if ratings:
promoters = sum(1 for r in ratings if r >= 9) / len(ratings)
detractors = sum(1 for r in ratings if r <= 6) / len(ratings)
nps = (promoters - detractors) * 100
else:
nps = 0
# Conversion rate
conversions = sum(1 for i in interactions if i.conversion)
conversion_rate = conversions / total if total > 0 else 0
# Cost per interaction
total_cost = sum(i.cost_usd for i in interactions)
cost_per_interaction = total_cost / total if total > 0 else 0
return {
'total_interactions': total,
'resolution_rate': resolution_rate,
'median_resolution_time_s': median_time,
'csat': csat,
'nps': nps,
'conversion_rate': conversion_rate,
'cost_per_interaction': cost_per_interaction,
'total_cost': total_cost
}
# Exemple
interactions = [
InteractionMetrics(
interaction_id="1",
timestamp=datetime.now(),
resolved_automatically=True,
resolution_time_seconds=45,
user_rating=5,
conversion=True,
cost_usd=0.008
),
# ... plus d'interactions
]
metrics = calculate_business_metrics(interactions)
print(f"Resolution Rate: {metrics['resolution_rate']:.1%}")
print(f"CSAT: {metrics['csat']:.1%}")
print(f"Cost/Interaction: ${metrics['cost_per_interaction']:.4f}")
Métriques Techniques
| Métrique | Description | Cible |
|---|---|---|
| Latence P50/P95/P99 | Temps de réponse (percentiles) | P95 <2s |
| Tokens/requête | Input + output moyen | Minimiser |
| Coût/requête | Tokens × prix | <$0.01 |
| Error Rate | % appels API échoués | <1% |
| Uptime | % disponibilité service | >99.5% |
| Cache Hit Rate | % requêtes servies par cache | >30% |
Implémentation Tracking
Logger chaque Exécution
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Dict
import json
import sqlite3
@dataclass
class PromptExecution:
"""Track single prompt execution"""
prompt_id: str
prompt_version: str
model: str
input_text: str
output_text: str
input_tokens: int
output_tokens: int
latency_ms: float
cost_usd: float
user_id: Optional[str]
timestamp: datetime
metadata: Dict # A/B variant, quality score, etc.
class PromptAnalytics:
"""Système analytics pour prompts"""
def __init__(self, db_path: str = "prompt_analytics.db"):
self.db_path = db_path
self._init_db()
def _init_db(self):
"""Créer tables si nécessaire"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS prompt_executions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id TEXT,
prompt_version TEXT,
model TEXT,
input_text TEXT,
output_text TEXT,
input_tokens INTEGER,
output_tokens INTEGER,
latency_ms REAL,
cost_usd REAL,
user_id TEXT,
timestamp TEXT,
metadata TEXT
)
""")
conn.commit()
conn.close()
def log_execution(self, execution: PromptExecution):
"""Logger une exécution"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("""
INSERT INTO prompt_executions
(prompt_id, prompt_version, model, input_text, output_text,
input_tokens, output_tokens, latency_ms, cost_usd, user_id,
timestamp, metadata)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
execution.prompt_id,
execution.prompt_version,
execution.model,
execution.input_text,
execution.output_text,
execution.input_tokens,
execution.output_tokens,
execution.latency_ms,
execution.cost_usd,
execution.user_id,
execution.timestamp.isoformat(),
json.dumps(execution.metadata)
))
conn.commit()
conn.close()
def get_metrics(self, prompt_id: str, days: int = 7) -> Dict:
"""Calculer métriques pour un prompt"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
# Récupérer exécutions
cursor.execute("""
SELECT input_tokens, output_tokens, latency_ms, cost_usd
FROM prompt_executions
WHERE prompt_id = ?
AND timestamp > datetime('now', '-' || ? || ' days')
""", (prompt_id, days))
rows = cursor.fetchall()
conn.close()
if not rows:
return {}
# Calculer métriques
input_tokens = [r[0] for r in rows]
output_tokens = [r[1] for r in rows]
latencies = [r[2] for r in rows]
costs = [r[3] for r in rows]
return {
'total_executions': len(rows),
'avg_input_tokens': np.mean(input_tokens),
'avg_output_tokens': np.mean(output_tokens),
'avg_latency_ms': np.mean(latencies),
'p50_latency_ms': np.percentile(latencies, 50),
'p95_latency_ms': np.percentile(latencies, 95),
'p99_latency_ms': np.percentile(latencies, 99),
'total_cost_usd': sum(costs),
'avg_cost_per_execution': np.mean(costs),
}
# Utilisation
analytics = PromptAnalytics()
# Logger exécution
execution = PromptExecution(
prompt_id="marketing_ad_v2",
prompt_version="2.1",
model="gpt-4-turbo",
input_text="Créer pub Facebook pour app méditation",
output_text="[Output généré]",
input_tokens=1200,
output_tokens=450,
latency_ms=2340,
cost_usd=0.025,
user_id="user_12345",
timestamp=datetime.now(),
metadata={"ab_variant": "B", "quality_score": 0.87}
)
analytics.log_execution(execution)
# Récupérer métriques
metrics = analytics.get_metrics("marketing_ad_v2", days=7)
print(f"Avg latency: {metrics['avg_latency_ms']:.0f}ms")
print(f"P95 latency: {metrics['p95_latency_ms']:.0f}ms")
print(f"Total cost (7d): ${metrics['total_cost_usd']:.2f}")
A/B Testing de Prompts
Framework Complet
from typing import Dict, List, Callable
import random
from scipy import stats
import numpy as np
class PromptABTest:
"""A/B testing framework pour prompts"""
def __init__(
self,
variants: Dict[str, str],
traffic_split: Dict[str, float] = None,
analytics: PromptAnalytics = None
):
"""
Args:
variants: {"A": prompt_a, "B": prompt_b, "C": prompt_c}
traffic_split: {"A": 0.5, "B": 0.3, "C": 0.2}
analytics: Instance PromptAnalytics pour logging
"""
self.variants = variants
self.traffic_split = traffic_split or {
k: 1.0 / len(variants) for k in variants
}
self.analytics = analytics or PromptAnalytics()
self.results = {k: [] for k in variants}
def assign_variant(self, user_id: str) -> str:
"""
Assigner variant de manière cohérente (même user = même variant).
Args:
user_id: ID utilisateur
Returns:
Variant assigné ('A', 'B', etc.)
"""
# Hash user_id pour assignment déterministe
hash_val = hash(user_id) % 100
cumulative = 0
for variant, prob in self.traffic_split.items():
cumulative += prob * 100
if hash_val < cumulative:
return variant
return list(self.variants.keys())[0]
def execute(
self,
user_id: str,
input_data: Dict,
model: str = "gpt-4-turbo"
) -> tuple:
"""
Exécuter prompt avec variant assigné.
Returns:
(variant, output)
"""
import time
from openai import OpenAI
variant = self.assign_variant(user_id)
prompt = self.variants[variant].format(**input_data)
# Appel LLM avec timing
client = OpenAI()
start = time.time()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
latency_ms = (time.time() - start) * 1000
output = response.choices[0].message.content
# Logger
execution = PromptExecution(
prompt_id=f"ab_test_{variant}",
prompt_version="1.0",
model=model,
input_text=prompt,
output_text=output,
input_tokens=response.usage.prompt_tokens,
output_tokens=response.usage.completion_tokens,
latency_ms=latency_ms,
cost_usd=calculate_cost(
model,
response.usage.prompt_tokens,
response.usage.completion_tokens
),
user_id=user_id,
timestamp=datetime.now(),
metadata={"ab_variant": variant}
)
self.analytics.log_execution(execution)
# Stocker pour analyse
self.results[variant].append({
'user_id': user_id,
'output': output,
'latency_ms': latency_ms,
'cost': execution.cost_usd
})
return variant, output
def analyze(self, metric_fn: Callable[[str], float]) -> Dict:
"""
Analyser résultats A/B test.
Args:
metric_fn: Fonction qui prend output et retourne score (0-1)
Returns:
Résultats avec stats et signif
"""
analysis = {}
for variant, results in self.results.items():
if not results:
continue
scores = [metric_fn(r['output']) for r in results]
analysis[variant] = {
'n': len(scores),
'mean': np.mean(scores),
'std': np.std(scores),
'median': np.median(scores),
'ci_95': stats.t.interval(
0.95,
len(scores) - 1,
loc=np.mean(scores),
scale=stats.sem(scores)
) if len(scores) > 1 else (0, 0)
}
# Signification statistique (si 2 variants)
if len(self.variants) == 2:
variants_list = list(self.variants.keys())
a, b = variants_list[0], variants_list[1]
if self.results[a] and self.results[b]:
scores_a = [metric_fn(r['output']) for r in self.results[a]]
scores_b = [metric_fn(r['output']) for r in self.results[b]]
# T-test
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
# Effect size (Cohen's d)
pooled_std = np.sqrt(
(np.std(scores_a)**2 + np.std(scores_b)**2) / 2
)
effect_size = (np.mean(scores_a) - np.mean(scores_b)) / pooled_std
analysis['significance'] = {
't_statistic': t_stat,
'p_value': p_value,
'significant': p_value < 0.05,
'effect_size': effect_size,
'winner': a if np.mean(scores_a) > np.mean(scores_b) else b,
'lift': (max(np.mean(scores_a), np.mean(scores_b)) /
min(np.mean(scores_a), np.mean(scores_b)) - 1) * 100
}
return analysis
# Utilisation
ab_test = PromptABTest(
variants={
"A": "Résume ce texte concisément : {text}",
"B": "Tu es un expert en synthèse. Crée un résumé clair et bref de : {text}",
"C": "Extrais les points clés de : {text}\n\nPoints clés :\n-"
},
traffic_split={"A": 0.4, "B": 0.4, "C": 0.2}
)
# Simuler 300 utilisateurs
for i in range(300):
user_id = f"user_{i}"
text = f"Article {i} à résumer..." # Simulé
variant, summary = ab_test.execute(
user_id=user_id,
input_data={"text": text}
)
# Définir métrique qualité
def quality_score(summary: str) -> float:
"""Score qualité simple (0-1)"""
# Longueur appropriée (50-200 chars)
length_score = 1.0 if 50 < len(summary) < 200 else 0.5
# Pas trop de répétitions
words = summary.lower().split()
uniqueness = len(set(words)) / len(words) if words else 0
return (length_score + uniqueness) / 2
# Analyser
results = ab_test.analyze(quality_score)
print("=" * 60)
print("A/B TEST RESULTS")
print("=" * 60)
for variant, data in results.items():
if variant == 'significance':
continue
print(f"\n{variant}:")
print(f" Sample size: {data['n']}")
print(f" Mean score: {data['mean']:.3f}")
print(f" 95% CI: [{data['ci_95'][0]:.3f}, {data['ci_95'][1]:.3f}]")
if 'significance' in results:
sig = results['significance']
print(f"\n{'='*60}")
print(f"STATISTICAL SIGNIFICANCE")
print(f"{'='*60}")
print(f"Winner: {sig['winner']}")
print(f"P-value: {sig['p_value']:.4f} ({'significant' if sig['significant'] else 'NOT significant'})")
print(f"Effect size (Cohen's d): {sig['effect_size']:.2f}")
print(f"Lift: +{sig['lift']:.1f}%")
if sig['significant']:
print(f"\n✅ RECOMMENDATION: Roll out variant {sig['winner']}")
else:
print(f"\n⚠️ Pas de différence significative. Continuer test ou garder variant A.")
Optimisation des Coûts
Tarifs 2025 (indicatifs)
# Prix au 1M tokens (USD)
PRICING = {
"gpt-4-turbo": {"input": 10, "output": 30},
"gpt-4": {"input": 30, "output": 60},
"gpt-3.5-turbo": {"input": 0.5, "output": 1.5},
"claude-3-opus": {"input": 15, "output": 75},
"claude-3-sonnet": {"input": 3, "output": 15},
"claude-3-haiku": {"input": 0.25, "output": 1.25},
"gemini-1.5-pro": {"input": 3.5, "output": 10.5},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""Calculer coût en USD"""
pricing = PRICING[model]
cost = (
(input_tokens / 1_000_000) * pricing["input"] +
(output_tokens / 1_000_000) * pricing["output"]
)
return cost
# Exemple : Requête typique (1000 tokens in, 500 out)
print(f"GPT-4 Turbo: ${calculate_cost('gpt-4-turbo', 1000, 500):.6f}") # $0.025
print(f"GPT-3.5 Turbo: ${calculate_cost('gpt-3.5-turbo', 1000, 500):.6f}") # $0.0013
print(f"Claude Haiku: ${calculate_cost('claude-3-haiku', 1000, 500):.6f}") # $0.00088
# Économie annuelle (10k requêtes/jour)
annual_requests = 10_000 * 365
cost_gpt4 = calculate_cost('gpt-4-turbo', 1000, 500) * annual_requests
cost_gpt35 = calculate_cost('gpt-3.5-turbo', 1000, 500) * annual_requests
cost_haiku = calculate_cost('claude-3-haiku', 1000, 500) * annual_requests
print(f"\nCoût annuel (10k req/jour) :")
print(f" GPT-4 Turbo: ${cost_gpt4:,.0f}")
print(f" GPT-3.5 Turbo: ${cost_gpt35:,.0f}")
print(f" Claude Haiku: ${cost_haiku:,.0f}")
print(f"\nÉconomie GPT-4→Haiku : ${cost_gpt4 - cost_haiku:,.0f}/an (-{(1-cost_haiku/cost_gpt4)*100:.0f}%)")
Stratégies d’Optimisation
Routing Intelligent (Cascade)
def estimate_complexity(query: str) -> float:
"""
Estimer complexité requête (0-1).
Simple heuristique : longueur + mots techniques
"""
# Longueur
length_score = min(len(query.split()) / 100, 1.0)
# Mots techniques
technical_words = [
'algorithm', 'quantum', 'differential', 'thermodynamics',
'neurologique', 'synthèse', 'algorithme', 'complexité'
]
tech_count = sum(1 for word in query.lower().split() if word in technical_words)
tech_score = min(tech_count / 3, 1.0)
return (length_score + tech_score) / 2
def smart_routing(query: str) -> tuple:
"""
Router vers modèle optimal selon complexité.
Returns:
(model, response, cost)
"""
complexity = estimate_complexity(query)
# Choix modèle selon complexité
if complexity < 0.3:
# Simple → cheapest
model = "claude-3-haiku"
elif complexity < 0.7:
# Medium → mid-tier
model = "gpt-3.5-turbo"
else:
# Complex → best
model = "gpt-4-turbo"
# Appel
response = call_llm(model, query)
# Calculer coût (estimation)
cost = calculate_cost(model, len(query.split()) * 1.3, 500)
return model, response, cost
# Test
queries = [
"Quelle heure est-il ?", # Simple
"Explique-moi le concept de RAG en IA.", # Medium
"Résous cette équation différentielle : dy/dx + 2y = e^x" # Complex
]
for q in queries:
model, _, cost = smart_routing(q)
complexity = estimate_complexity(q)
print(f"Query: {q[:40]}...")
print(f" Complexity: {complexity:.2f} → Model: {model} (${cost:.6f})\n")
Économie estimée : -40% coût (vs GPT-4 partout).
Caching (Requêtes Identiques)
Caching = argent gratuit : Pour les FAQ et support, 30-50% des questions sont répétitives. Un cache bien configuré peut économiser 40% des coûts sans aucune perte de qualité. Spotify économise 180k$/an grâce au caching de ses résumés musicaux.
from functools import lru_cache
import hashlib
import pickle
class LLMCache:
"""Cache intelligent pour réponses LLM"""
def __init__(self, cache_file: str = "llm_cache.pkl"):
self.cache_file = cache_file
try:
with open(cache_file, 'rb') as f:
self.cache = pickle.load(f)
except FileNotFoundError:
self.cache = {}
def _hash_prompt(self, prompt: str, model: str) -> str:
"""Créer hash unique pour prompt+model"""
key = f"{model}:{prompt}"
return hashlib.md5(key.encode()).hexdigest()
def get(self, prompt: str, model: str):
"""Récupérer du cache"""
key = self._hash_prompt(prompt, model)
return self.cache.get(key)
def set(self, prompt: str, model: str, response: str):
"""Ajouter au cache"""
key = self._hash_prompt(prompt, model)
self.cache[key] = response
# Sauvegarder
with open(self.cache_file, 'wb') as f:
pickle.dump(self.cache, f)
def stats(self) -> dict:
"""Statistiques cache"""
return {
'cache_size': len(self.cache),
'memory_mb': len(pickle.dumps(self.cache)) / 1024 / 1024
}
# Wrapper avec cache
cache = LLMCache()
cache_hits = 0
cache_misses = 0
def call_llm_with_cache(prompt: str, model: str = "gpt-4-turbo") -> str:
"""Appeler LLM avec cache"""
global cache_hits, cache_misses
# Check cache
cached = cache.get(prompt, model)
if cached:
cache_hits += 1
return cached
# Cache miss → appel LLM
cache_misses += 1
response = call_llm(model, prompt)
# Store in cache
cache.set(prompt, model, response)
return response
# Utilisation
for _ in range(5):
# Même requête répétée
call_llm_with_cache("Qu'est-ce que le RAG en IA ?")
print(f"Cache hits: {cache_hits}") # 4
print(f"Cache misses: {cache_misses}") # 1
print(f"Hit rate: {cache_hits/(cache_hits+cache_misses):.0%}") # 80%
print(f"Cost savings: ~{cache_hits/(cache_hits+cache_misses):.0%}")
Économie : -30-50% selon taux de répétition (FAQ, support).
Prompt Compression
def compress_prompt(prompt: str, target_reduction: float = 0.25) -> str:
"""
Compresser prompt (réduire tokens).
Args:
prompt: Prompt original
target_reduction: % tokens à supprimer (0.25 = -25%)
Returns:
Prompt compressé
"""
import re
# Étape 1 : Supprimer mots superflus
filler_words = [
r'\bvraiment\b', r'\btrès\b', r'\bextrêmement\b',
r'\bassez\b', r'\bplutôt\b', r'\ben fait\b',
r'\ben réalité\b', r'\bbasiquement\b'
]
compressed = prompt
for pattern in filler_words:
compressed = re.sub(pattern, '', compressed, flags=re.IGNORECASE)
# Étape 2 : Abréviations courantes
abbreviations = {
r"c'est-à-dire": "i.e.",
r"par exemple": "ex.",
r"et cetera": "etc.",
r"s'il vous plaît": "svp"
}
for full, abbr in abbreviations.items():
compressed = re.sub(full, abbr, compressed, flags=re.IGNORECASE)
# Étape 3 : Nettoyer espaces multiples
compressed = re.sub(r'\s+', ' ', compressed).strip()
# Vérifier réduction
original_tokens = len(prompt.split())
compressed_tokens = len(compressed.split())
actual_reduction = 1 - (compressed_tokens / original_tokens)
print(f"Compression: {original_tokens} → {compressed_tokens} tokens ({actual_reduction:.0%})")
return compressed
# Test
original = """
Tu es vraiment un assistant très utile et extrêmement compétent.
Ta tâche est en fait de résumer, c'est-à-dire de créer un résumé
concis du texte suivant. Par exemple, tu dois extraire les points
clés et cetera.
S'il vous plaît, assure-toi que le résumé soit clair.
"""
compressed = compress_prompt(original)
print(f"\nOriginal ({len(original)} chars):\n{original}")
print(f"\nCompressed ({len(compressed)} chars):\n{compressed}")
Économie : -15-30% tokens.
Output Length Control
# Dans prompt
prompt = f"""
Réponds en MAXIMUM 100 mots. Sois concis.
Question : {question}
"""
# Ou via API (hard limit)
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": prompt}],
max_tokens=150 # Hard limit (empêche dépassement)
)
# Économie : -30-50% tokens output
Optimisation Latence
Techniques
Streaming (UX)
from openai import OpenAI
client = OpenAI()
def stream_response(prompt: str):
"""Afficher réponse progressivement (UX)"""
stream = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": prompt}],
stream=True # ← Streaming activé
)
full_response = ""
for chunk in stream:
if chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
print(content, end="", flush=True)
full_response += content
print() # Newline
return full_response
# Test
stream_response("Explique le concept de RAG en 100 mots.")
# Affiche mot par mot (impression de rapidité)
Bénéfice : Temps perçu -50% (utilisateur voit réponse se construire).
Parallel Calls
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI()
async def parallel_llm_calls(prompts: List[str], model: str = "gpt-4-turbo"):
"""Appeler LLM en parallèle"""
async def call_one(prompt: str):
response = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
# Lancer tous en parallèle
tasks = [call_one(p) for p in prompts]
results = await asyncio.gather(*tasks)
return results
# Utilisation
prompts = [
"Résume l'article A",
"Résume l'article B",
"Résume l'article C"
]
# Sequential : 3 × 2s = 6s
# Parallel : max(2s, 2s, 2s) = 2s !
results = asyncio.run(parallel_llm_calls(prompts))
Gain : Latence totale = max(latences) au lieu de sum(latences).
Model Selection
# Latence moyenne (ms, estimation 2025)
LATENCY_MS = {
"gpt-4-turbo": 2500,
"gpt-3.5-turbo": 700,
"claude-3-opus": 3500,
"claude-3-sonnet": 1400,
"claude-3-haiku": 450,
}
def select_by_latency(required_latency_ms: int, min_quality: float = 0.7):
"""
Choisir modèle le plus performant respectant contrainte latence.
Args:
required_latency_ms: Latence max acceptable
min_quality: Score qualité min (0-1)
Returns:
Modèle recommandé
"""
# Qualité estimée (subjectif)
QUALITY = {
"gpt-4-turbo": 0.95,
"claude-3-opus": 0.96,
"claude-3-sonnet": 0.88,
"gpt-3.5-turbo": 0.80,
"claude-3-haiku": 0.75,
}
# Filtrer par qualité
candidates = {
model: latency
for model, latency in LATENCY_MS.items()
if QUALITY[model] >= min_quality
}
# Filtrer par latence
valid = {
model: latency
for model, latency in candidates.items()
if latency <= required_latency_ms
}
if not valid:
return None # Aucun modèle ne respecte contraintes
# Retourner le meilleur (plus rapide parmi ceux de qualité suffisante)
return min(valid, key=valid.get)
# Tests
print(select_by_latency(1000, min_quality=0.7)) # claude-3-haiku
print(select_by_latency(2000, min_quality=0.85)) # claude-3-sonnet
print(select_by_latency(500, min_quality=0.9)) # None (impossible)
Dashboard et Reporting
Dashboard Streamlit
import streamlit as st
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import pandas as pd
def create_dashboard():
"""Dashboard interactif performance prompts"""
st.set_page_config(page_title="Prompt Performance", layout="wide")
st.title("📊 Prompt Performance Dashboard")
# Sidebar filters
with st.sidebar:
st.header("Filters")
prompt_id = st.selectbox(
"Prompt",
["marketing_ad_v2", "support_chatbot_v1", "summarizer_v3"]
)
date_range = st.date_input(
"Date Range",
value=(datetime.now() - timedelta(days=7), datetime.now())
)
# Fetch data (simulé ici)
analytics = PromptAnalytics()
metrics = analytics.get_metrics(prompt_id, days=7)
# KPIs Row
st.header("Key Metrics")
col1, col2, col3, col4 = st.columns(4)
with col1:
st.metric(
"Total Executions",
f"{metrics.get('total_executions', 0):,}",
delta="12% vs last week"
)
with col2:
st.metric(
"Avg Latency",
f"{metrics.get('avg_latency_ms', 0):.0f}ms",
delta="-8% (better)"
)
with col3:
st.metric(
"Total Cost (7d)",
f"${metrics.get('total_cost_usd', 0):.2f}",
delta="-15% (saved)"
)
with col4:
quality = 0.87 # Exemple
st.metric(
"Quality Score",
f"{quality:.2f}",
delta="+0.05"
)
# Charts
st.header("Trends")
# Latency over time (simulé)
dates = pd.date_range(start=date_range[0], end=date_range[1], freq='D')
latency_data = pd.DataFrame({
'date': dates,
'p50': [2100 + i*10 for i in range(len(dates))],
'p95': [3200 + i*15 for i in range(len(dates))],
'p99': [4500 + i*20 for i in range(len(dates))]
})
fig_latency = go.Figure()
fig_latency.add_trace(go.Scatter(
x=latency_data['date'], y=latency_data['p50'],
name='P50', mode='lines+markers'
))
fig_latency.add_trace(go.Scatter(
x=latency_data['date'], y=latency_data['p95'],
name='P95', mode='lines+markers'
))
fig_latency.add_trace(go.Scatter(
x=latency_data['date'], y=latency_data['p99'],
name='P99', mode='lines+markers', line=dict(dash='dash')
))
fig_latency.update_layout(
title="Latency Percentiles Over Time",
xaxis_title="Date",
yaxis_title="Latency (ms)"
)
st.plotly_chart(fig_latency, use_container_width=True)
# Cost breakdown
st.header("Cost Breakdown")
cost_data = pd.DataFrame({
'Model': ['GPT-4 Turbo', 'GPT-3.5 Turbo', 'Claude Haiku'],
'Cost': [45.2, 12.8, 5.3],
'Requests': [1200, 3500, 2100]
})
col1, col2 = st.columns(2)
with col1:
fig_cost_pie = px.pie(
cost_data,
values='Cost',
names='Model',
title="Cost by Model"
)
st.plotly_chart(fig_cost_pie, use_container_width=True)
with col2:
fig_cost_bar = px.bar(
cost_data,
x='Model',
y='Cost',
title="Cost per Model (USD)"
)
st.plotly_chart(fig_cost_bar, use_container_width=True)
# Quality distribution
st.header("Quality Distribution")
quality_scores = pd.DataFrame({
'score': [0.92, 0.88, 0.95, 0.87, 0.91, 0.89, 0.93, 0.86, 0.90, 0.94] * 50
})
fig_quality = px.histogram(
quality_scores,
x='score',
nbins=20,
title="Quality Score Distribution"
)
fig_quality.add_vline(x=0.90, line_dash="dash", line_color="red",
annotation_text="Target: 0.90")
st.plotly_chart(fig_quality, use_container_width=True)
# Recommendations
st.header("🎯 Recommendations")
if metrics.get('p95_latency_ms', 0) > 3000:
st.warning("⚠️ P95 latency > 3s. Consider using faster model (GPT-3.5, Claude Haiku).")
if metrics.get('total_cost_usd', 0) > 100:
st.info("💡 High costs detected. Try caching, routing, or compression.")
if quality < 0.85:
st.error("❌ Quality below target (0.85). Review prompt or use better model.")
# Lancer dashboard
if __name__ == "__main__":
create_dashboard()
Lancer : streamlit run dashboard.py
Amélioration Continue
Cycle DMAIC
1. DEFINE (Définir)
→ Objectif : Réduire coût de 30% sans impact qualité
→ KPI : Coût/requête < $0.01, qualité > 0.85
2. MEASURE (Mesurer)
→ Baseline actuelle : $0.025/requête, qualité 0.87
→ Collecter 1000+ samples
3. ANALYZE (Analyser)
→ Identifier : 70% requêtes sont simples (utilisent GPT-4 inutilement)
→ Opportunité : Routing intelligent
4. IMPROVE (Améliorer)
→ Implémenter routing (Haiku pour simple, GPT-4 pour complexe)
→ A/B test sur 20% traffic
5. CONTROL (Contrôler)
→ Monitor metrics
→ Si succès → roll out 100%
→ Sinon → itérer
Checklist Optimisation
Avant production :
- Métriques définies (qualité, business, tech)
- Tracking implémenté (logger chaque exécution)
- Dashboard de monitoring actif
- A/B testing framework en place
Optimisation coûts :
- Routing intelligent testé
- Caching implémenté (hit rate >20%)
- Compression prompts évaluée
- Output length contrôlé
Optimisation latence :
- Streaming activé (UX)
- Calls parallélisés (si applicable)
- Modèle rapide pour use cases simples
Amélioration continue :
- Revue hebdomadaire des métriques
- A/B tests trimestriels
- Documentation learnings
- Partage best practices équipe
Points Clés à Retenir
Mesurer 3 dimensions : Qualité (>0.85), Coût (<$0.01/req), Latence (<2s P95)
A/B testing rigoureux : Min 100 samples/variant, p-value <0.05 pour signif
Optimiser coûts : Routing (-40%), caching (-30%), compression (-25%)
Optimiser latence : Streaming (UX), parallel calls, modèles rapides
Dashboard : Visualiser KPIs temps réel, identifier bottlenecks
Amélioration continue : Cycle DMAIC mensuel/trimestriel
Suite de la Série
Article 8 : Prompt Engineering en Production : CI/CD, Architecture, Ops Dernière étape : industrialiser et opérer vos prompts à l’échelle
Retour : Sécurité et Robustesse (Injection, Jailbreaking, Guardrails)