API Documentation
SemaCache is an OpenAI-compatible caching proxy. Use any OpenAI SDK; just change the base URL and API key. All responses include cache metadata headers.
Quick Start
Get started in under 60 seconds. Create an API key in the dashboard, then point your OpenAI client at SemaCache.
from openai import OpenAI
client = OpenAI(
api_key="sc-your-key",
base_url="https://api.semacache.io/v1"
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "What is semantic caching?"}
]
)
print(response.choices[0].message.content)
How it works: SemaCache checks for an exact or semantically similar cached response. On a hit, it returns the cached result in ~5–20ms. On a miss, it forwards the request to the upstream LLM, caches the response, and returns it. Your app sees a standard OpenAI response either way.
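The flow just described can be sketched as illustrative Python (a conceptual model, not SemaCache's actual code; `similarity` stands in for a real embedding-based comparison):

```python
# Conceptual sketch of the three-tier lookup: exact match, then semantic
# match above the threshold, then passthrough to the upstream LLM.
def lookup(query, cache, threshold=0.95, similarity=None):
    # Tier 1: exact match on the query text.
    if query in cache:
        return cache[query], "EXACT"
    # Tier 2: semantic match above the similarity threshold.
    if similarity is not None:
        for cached_query, response in cache.items():
            if similarity(query, cached_query) >= threshold:
                return response, "SEMANTIC"
    # Tier 3: miss. The proxy would forward to the upstream LLM,
    # store the fresh response, and return it.
    return None, "NATIVE"
```

The match type each request resolved to is reported back in the `x-semcache-match-type` response header (see Response Headers below).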
Authentication
All API requests require a SemaCache API key in the Authorization header. Create keys in the dashboard.
Authorization: Bearer sc-your-key
Upstream LLM Keys
SemaCache proxies requests to your LLM provider using your upstream API key. Two options:
Option A: Save in Dashboard (recommended)
Go to Dashboard → Settings → Add your OpenAI or Gemini key. Keys are AES-256 encrypted at rest. The key is auto-resolved on every request.
Option B: Pass per-request
Send x-upstream-api-key header. This takes priority over stored keys.
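The priority rule is simple enough to sketch (a hypothetical helper, not part of any SDK):

```python
# Sketch of upstream-key resolution: a per-request x-upstream-api-key
# header takes priority over a key stored in the dashboard.
def resolve_upstream_key(request_headers, stored_key=None):
    per_request = request_headers.get("x-upstream-api-key")
    if per_request:
        return per_request
    return stored_key  # may be None if neither is configured
```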
curl -X POST https://api.semacache.io/v1/chat/completions \
-H "Authorization: Bearer sc-your-key" \
-H "x-upstream-api-key: sk-your-openai-key" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "messages": [...]}'
Chat Completions
OpenAI-compatible chat completions with three-tier caching: exact match → semantic match → LLM passthrough.
/v1/chat/completions
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | required | Model ID: gpt-5.4, gpt-4o, gemini-3.1-pro-preview, gemini-2.5-flash, grok-4.20, grok-3, or any registered custom model. |
| messages | array | required | Array of message objects with role and content. Supports text and multimodal (image URLs). |
| temperature | number | optional | Sampling temperature (0–2). Passed to upstream LLM on cache miss. |
| max_tokens | integer | optional | Maximum tokens to generate. Passed to upstream LLM on cache miss. |
| stream | boolean | optional | Not yet supported. Requests with stream=true are passed through without caching. |
Supported Models
OpenAI
gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o3-mini, o4-mini
Google Gemini
gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-3.1-flash-lite-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
xAI Grok
grok-4.20, grok-4, grok-4-fast, grok-3, grok-3-mini, grok-3-fast
Response
Standard OpenAI-compatible response format, regardless of upstream provider.
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Semantic caching stores responses..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 85,
"total_tokens": 97
}
}
Image Generation
Generate images with caching. Supports OpenAI GPT Image, Google Imagen, and xAI Grok Imagine. Generated images are rehosted to cloud storage for permanent URLs.
/v1/images/generations
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | required | Text description of the image to generate. |
| model | string | optional | gpt-image-1 (default), gpt-image-1.5, gpt-image-1-mini, imagen-4.0-generate-001, imagen-4.0-ultra-generate-001, imagen-4.0-fast-generate-001, grok-imagine-image, or grok-imagine-image-pro. |
| n | integer | optional | Number of images (default: 1). |
| size | string | optional | Image size: 1024x1024 (default), 1024x1792, 1792x1024. |
| quality | string | optional | Quality: standard (default) or hd. |
response = client.images.generate(
model="gpt-image-1",
prompt="A cat wearing a top hat, oil painting",
size="1024x1024"
)
image_url = response.data[0].url
Video Generation
Generate videos with Google Veo or xAI Grok Imagine Video. Videos are rehosted to cloud storage. Requires an upstream API key (Gemini for Veo, xAI for Grok).
/v1/videos/generations
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | required | Text description of the video to generate. |
| model | string | optional | veo-2.0-generate-001 (default), veo-3.1-generate-preview, veo-3.1-fast-generate-preview, veo-3.1-lite-generate-preview, veo-3.0-generate-001, veo-3.0-fast-generate-001, or grok-imagine-video. |
| duration_seconds | integer | optional | Video duration: 5–8 seconds (default: 8). |
| aspect_ratio | string | optional | 16:9 (default) or 9:16. |
| n | integer | optional | Number of videos (default: 1). |
curl -X POST https://api.semacache.io/v1/videos/generations \
-H "Authorization: Bearer sc-your-key" \
-H "Content-Type: application/json" \
-d '{
"model": "veo-2.0-generate-001",
"prompt": "A drone shot over a mountain lake at sunrise",
"duration_seconds": 8,
"aspect_ratio": "16:9"
}'
Cache Control
Control caching behavior per-request using HTTP headers. All headers are passed via the OpenAI SDK's extra_headers (Python) or headers (TypeScript) parameter.
Request Headers
| Header | Behavior | Example |
|---|---|---|
| x-cache-ttl | Custom TTL in seconds for this cache entry | x-cache-ttl: 3600 |
| Cache-Control: max-age=N | Standard HTTP TTL (same effect as x-cache-ttl) | Cache-Control: max-age=86400 |
| Cache-Control: no-cache | Bypass cache read — always call the LLM, still store the result | Cache-Control: no-cache |
| Cache-Control: no-store | Bypass both read and write — ephemeral request, nothing cached | Cache-Control: no-store |
| x-similarity-threshold | Override the semantic similarity threshold for this request (0.50–1.00) | x-similarity-threshold: 0.90 |
Cache Duration (TTL) — Priority Order
The cache TTL (time-to-live) determines how long a cached response is stored before it expires. There are three ways to set it (two headers and a server default); the highest-priority source wins:
Per-request header (highest priority)
x-cache-ttl: 3600 or Cache-Control: max-age=3600. If both are present, x-cache-ttl wins. Value is clamped to 1 second – 90 days.
Server default
If no header is sent, the server default applies: 7 days (604,800 seconds). Configured via the CACHE_TTL_SECONDS environment variable on the cache service.
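The priority order above can be sketched as (an illustrative helper; header names, clamp range, and default as documented):

```python
# Sketch of TTL resolution: x-cache-ttl beats Cache-Control: max-age,
# which beats the server default (7 days). Values are clamped to
# 1 second .. 90 days.
MIN_TTL = 1
MAX_TTL = 90 * 24 * 3600      # 90 days
DEFAULT_TTL = 7 * 24 * 3600   # 604,800 seconds

def resolve_ttl(headers):
    raw = headers.get("x-cache-ttl")
    if raw is None:
        # Fall back to a max-age directive in Cache-Control, if present.
        for part in headers.get("Cache-Control", "").split(","):
            part = part.strip()
            if part.startswith("max-age="):
                raw = part[len("max-age="):]
                break
    if raw is not None:
        try:
            return max(MIN_TTL, min(int(raw), MAX_TTL))
        except ValueError:
            pass  # unparsable values fall through to the default
    return DEFAULT_TTL
```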
Examples
# Custom 1-hour TTL
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"x-cache-ttl": "3600"}
)
# Force fresh response (bypass cache read, still store)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"Cache-Control": "no-cache"}
)
# Ephemeral — don't read or store
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Sensitive question"}],
extra_headers={"Cache-Control": "no-store"}
)
Similarity Threshold
The similarity threshold controls how closely a new query must match an existing cached query to count as a semantic cache hit. A value of 1.00 means only exact semantic matches; a lower value like 0.85 accepts looser paraphrases, increasing hit rate but risking false positives.
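For intuition: semantic matching typically scores query pairs by cosine similarity between embedding vectors. The toy 3-dimensional vectors below are purely illustrative (SemaCache's actual embedding model is not specified here):

```python
import math

# Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
# The threshold decides which scores still count as a cache hit.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Identical toy vectors score 1.0; partially overlapping ones score
# around 0.8, which a 0.95 threshold would reject but 0.75 would accept.
```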
Three Ways to Set It — Priority Order
The threshold is resolved per-request. The highest-priority source that is set wins:
Per-request header (highest priority)
Send x-similarity-threshold: 0.90 to override for a single request. Useful for A/B testing or when a specific call needs tighter/looser matching. Value is clamped to 0.50 – 1.00. Invalid values are silently ignored.
Per-user setting (Dashboard)
Go to Dashboard → Settings → Cache Configuration and drag the slider. This value is stored in your account and applies to all requests that don't include the header. Range: 0.50 – 1.00.
Server default
If neither the header nor the dashboard setting is configured, the server default applies: 0.95. Configured via the SIMILARITY_THRESHOLD environment variable on the cache service.
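The resolution order can be sketched as (an illustrative helper; clamping and silent-ignore behavior as documented above):

```python
# Sketch of threshold resolution: per-request header beats the dashboard
# setting, which beats the server default (0.95). Header values are
# clamped to 0.50 .. 1.00; non-numeric values are silently ignored.
DEFAULT_THRESHOLD = 0.95

def resolve_threshold(headers, user_setting=None):
    raw = headers.get("x-similarity-threshold")
    if raw is not None:
        try:
            return min(max(float(raw), 0.50), 1.00)
        except ValueError:
            pass  # invalid header values are silently ignored
    if user_setting is not None:
        return user_setting
    return DEFAULT_THRESHOLD
```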
Resolution summary
x-similarity-threshold header → Dashboard per-user setting → Server default (0.95)
Choosing a Value
| Range | Label | Behavior |
|---|---|---|
| 0.97 – 1.00 | Very strict | Nearly identical queries only. Lowest false-positive risk, fewest cache hits. |
| 0.93 – 0.96 | Balanced | Catches close paraphrases (e.g., "What is caching?" ≈ "Explain caching"). Good default for most apps. |
| 0.85 – 0.92 | Aggressive | More cache hits, but some false positives for loosely related queries. |
| 0.50 – 0.84 | Very aggressive | High false-positive risk. Only recommended for very narrow, domain-specific use cases. |
Examples
# Set threshold globally for all requests on this client
client = OpenAI(
api_key="sc-your-key",
base_url="https://api.semacache.io/v1",
default_headers={"x-similarity-threshold": "0.90"},
)
# Or override per-request
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"x-similarity-threshold": "0.98"},
)
# Combine with cache control
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={
"x-similarity-threshold": "0.85",
"x-cache-ttl": "3600",
},
)
Applies to all endpoints: The similarity threshold affects chat completions, image generation, and video generation identically. The same priority order (header → dashboard → server default) applies to all three.
Custom Models
Available on Pro and Enterprise plans.
Register any OpenAI-compatible endpoint (vLLM, Ollama, Together AI, Groq, Fireworks, etc.) and SemaCache will cache responses the same way it handles built-in models.
Register via Dashboard
Go to Dashboard → Custom Models → Add Model. Provide the base URL, model name, and authentication details. Once registered, use the model alias in any request.
Register via API
/dashboard/custom-models
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_alias | string | required | Name you'll use in requests (e.g., "my-llama"). |
| base_url | string | required | Provider base URL (e.g., "https://api.together.xyz"). |
| downstream_model | string | optional | Actual model name sent to the provider (e.g., "meta-llama/Llama-3-70b"). |
| api_path | string | optional | API path (default: "/v1/chat/completions"). |
| auth_header | string | optional | Auth header name (default: "Authorization"). |
| auth_prefix | string | optional | Auth prefix (default: "Bearer"). |
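Assuming the endpoint accepts a JSON POST with the fields above, a registration payload can be sketched like this (the Together AI values are illustrative, taken from the table's own examples):

```python
import json

# Illustrative payload builder for POST /dashboard/custom-models.
# Only model_alias and base_url are required; the remaining fields
# fall back to the defaults listed in the table above.
def build_registration(model_alias, base_url, **overrides):
    payload = {
        "model_alias": model_alias,
        "base_url": base_url,
        "api_path": "/v1/chat/completions",
        "auth_header": "Authorization",
        "auth_prefix": "Bearer",
    }
    payload.update(overrides)
    return payload

body = json.dumps(build_registration(
    "my-llama",
    "https://api.together.xyz",
    downstream_model="meta-llama/Llama-3-70b",
))
```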
Usage
After registration, use your custom model alias like any built-in model:
# Store your custom model's API key in the dashboard first
# Then use the alias like any other model
response = client.chat.completions.create(
model="my-llama",
messages=[{"role": "user", "content": "Hello!"}]
)
# SemaCache auto-detects it's a custom model and routes accordingly
Response Headers
Every response includes cache metadata headers so you can observe caching behavior.
| Header | Description | Example Values |
|---|---|---|
| x-semcache-match-type | How the response was resolved | EXACT, SEMANTIC, NATIVE |
| x-semcache-latency | Total processing time in milliseconds | 4.2, 18.7, 1240.5 |
| x-semcache-confidence | Semantic similarity score (SEMANTIC matches only) | 0.9712 |
response = client.chat.completions.with_raw_response.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.headers["x-semcache-match-type"]) # EXACT | SEMANTIC | NATIVE
print(response.headers["x-semcache-latency"]) # 4.2
print(response.headers["x-semcache-confidence"]) # 0.9712 (semantic only)
Error Handling
Errors follow the OpenAI error format for compatibility with existing error handling code.
| Status | Type | Cause |
|---|---|---|
| 400 | invalid_request_error | Missing required fields, invalid model, or bad upstream key |
| 401 | authentication_error | Invalid or missing SemaCache API key |
| 403 | permission_error | Feature not available on your plan (e.g., custom models on Free tier) |
| 429 | rate_limit_error | Monthly quota or burst rate limit exceeded. Includes Retry-After header for burst limits. |
| 500 | server_error | Internal error or upstream provider failure |
{
"error": {
"message": "Monthly request limit exceeded. Upgrade to Pro for 50,000 requests/month.",
"type": "rate_limit_error"
}
}
Rate Limits
Rate limits are enforced at two levels: monthly quotas (total requests per billing period) and burst limits (requests per second / per minute) to prevent abuse and protect upstream providers.
Monthly Quotas
| Plan | Requests / month | API Keys | Custom Models | Audit Log |
|---|---|---|---|---|
| Free | 1,000 | 1 | No | 7 days |
| Pro — $9/mo | 50,000 | 5 | Yes | 30 days |
| Enterprise — $39/mo | 500,000 | Unlimited | Yes | 90 days |
Burst Rate Limits
Burst limits use a sliding window to prevent short-duration spikes. These protect both you and the upstream LLM providers from runaway loops or accidental floods.
| Plan | Per second | Per minute |
|---|---|---|
| Free | 3 | 20 |
| Pro | 20 | 200 |
| Enterprise | 100 | 2,000 |
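The sliding-window idea can be sketched as (an illustrative model, not SemaCache's server implementation):

```python
from collections import deque

# Sketch of a sliding-window rate limiter: keep timestamps of recent
# requests and reject a new one when the window is full.
class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False
```

Unlike a fixed-window counter, a sliding window never admits a double-size burst straddling a window boundary.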
Rate Limit Response
When a rate limit is exceeded, the API returns 429 with a Retry-After header (for burst limits). Implement exponential backoff or respect the header value.
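A minimal client-side delay policy, honoring Retry-After when present and otherwise falling back to exponential backoff with jitter (a sketch; tune `base` and `cap` for your app):

```python
import random

# Delay before retry attempt N: use the server's Retry-After value when
# it was sent, else exponential backoff (base * 2^attempt, capped) with
# jitter to avoid synchronized retries.
def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + random.random() / 2)  # jitter in [0.5x, 1x)
```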
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Content-Type: application/json
{
"error": {
"message": "Rate limit exceeded: 20 requests/second. Try again shortly.",
"type": "rate_limit_error"
}
}
Cache hits count toward your monthly request limit. Upgrade anytime from the billing page. Need higher limits? Contact us for custom enterprise pricing.