API Documentation

v1

SemaCache is an OpenAI-compatible caching proxy. Use any OpenAI SDK — just change the base URL and API key. All responses include cache metadata headers.

Quick Start

Get started in under 60 seconds. Create an API key in the dashboard, then point your OpenAI client at SemaCache.

from openai import OpenAI

client = OpenAI(
    api_key="sc-your-key",
    base_url="https://api.semacache.io/v1"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What is semantic caching?"}
    ]
)

print(response.choices[0].message.content)

How it works: SemaCache checks for an exact or semantically similar cached response. On a hit, it returns the cached result in ~5–20ms. On a miss, it forwards to the upstream LLM, caches the response, and returns it. Your app sees a standard OpenAI response either way.
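The lookup flow above can be sketched as a toy three-tier cache. This is an illustration only: difflib.SequenceMatcher stands in for real embedding similarity, and the 0.95 threshold mirrors the server default described later.

```python
from difflib import SequenceMatcher

def lookup(cache, prompt, call_llm, threshold=0.95):
    """Toy three-tier lookup: exact -> semantic -> LLM passthrough."""
    # Tier 1: exact match on the prompt string.
    if prompt in cache:
        return cache[prompt], "EXACT"
    # Tier 2: semantic match. SequenceMatcher is only a stand-in for
    # real embedding similarity.
    for cached_prompt, cached_response in cache.items():
        if SequenceMatcher(None, prompt, cached_prompt).ratio() >= threshold:
            return cached_response, "SEMANTIC"
    # Tier 3: miss -- call the upstream LLM, store, and return.
    response = call_llm(prompt)
    cache[prompt] = response
    return response, "NATIVE"
```

The three return labels correspond to the x-semcache-match-type values (EXACT, SEMANTIC, NATIVE) documented under Response Headers.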

Authentication

All API requests require a SemaCache API key in the Authorization header. Create keys in the dashboard.

HTTP
Authorization: Bearer sc-your-key

Upstream LLM Keys

SemaCache proxies requests to your LLM provider using your upstream API key. Two options:

Option A: Save in Dashboard (recommended)

Go to Dashboard → Settings → Add your OpenAI or Gemini key. Keys are AES-256 encrypted at rest. The key is auto-resolved on every request.

Option B: Pass per-request

Send the x-upstream-api-key header with the request. It takes priority over any stored key.

Per-request upstream key
curl -X POST https://api.semacache.io/v1/chat/completions \
  -H "Authorization: Bearer sc-your-key" \
  -H "x-upstream-api-key: sk-your-openai-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [...]}'

Chat Completions

OpenAI-compatible chat completions with three-tier caching: exact match → semantic match → LLM passthrough.

POST /v1/chat/completions

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | required | Model ID: gpt-5.4, gpt-4o, gemini-3.1-pro-preview, gemini-2.5-flash, grok-4.20, grok-3, or any registered custom model. |
| messages | array | required | Array of message objects with role and content. Supports text and multimodal (image URLs). |
| temperature | number | optional | Sampling temperature (0–2). Passed to the upstream LLM on cache miss. |
| max_tokens | integer | optional | Maximum tokens to generate. Passed to the upstream LLM on cache miss. |
| stream | boolean | optional | Not yet supported. Requests with stream=true are passed through without caching. |

Supported Models

OpenAI

gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o3-mini, o4-mini

Google Gemini

gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-3.1-flash-lite-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite

xAI Grok

grok-4.20, grok-4, grok-4-fast, grok-3, grok-3-mini, grok-3-fast

Response

Standard OpenAI-compatible response format, regardless of upstream provider.

Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Semantic caching stores responses..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 85,
    "total_tokens": 97
  }
}

Image Generation

Generate images with caching. Supports OpenAI GPT Image, Google Imagen, and xAI Grok Imagine. Generated images are rehosted to cloud storage for permanent URLs.

POST /v1/images/generations

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | required | Text description of the image to generate. |
| model | string | optional | gpt-image-1 (default), gpt-image-1.5, gpt-image-1-mini, imagen-4.0-generate-001, imagen-4.0-ultra-generate-001, imagen-4.0-fast-generate-001, grok-imagine-image, or grok-imagine-image-pro. |
| n | integer | optional | Number of images (default: 1). |
| size | string | optional | Image size: 1024x1024 (default), 1024x1792, 1792x1024. |
| quality | string | optional | Quality: standard (default) or hd. |

Python
response = client.images.generate(
    model="gpt-image-1",
    prompt="A cat wearing a top hat, oil painting",
    size="1024x1024"
)

image_url = response.data[0].url

Video Generation

Generate videos with Google Veo or xAI Grok Imagine Video. Videos are rehosted to cloud storage. Requires an upstream API key (Gemini for Veo, xAI for Grok).

POST /v1/videos/generations

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | required | Text description of the video to generate. |
| model | string | optional | veo-2.0-generate-001 (default), veo-3.1-generate-preview, veo-3.1-fast-generate-preview, veo-3.1-lite-generate-preview, veo-3.0-generate-001, veo-3.0-fast-generate-001, or grok-imagine-video. |
| duration_seconds | integer | optional | Video duration: 5–8 seconds (default: 8). |
| aspect_ratio | string | optional | 16:9 (default) or 9:16. |
| n | integer | optional | Number of videos (default: 1). |
curl
curl -X POST https://api.semacache.io/v1/videos/generations \
  -H "Authorization: Bearer sc-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "veo-2.0-generate-001",
    "prompt": "A drone shot over a mountain lake at sunrise",
    "duration_seconds": 8,
    "aspect_ratio": "16:9"
  }'

Cache Control

Control caching behavior per-request using HTTP headers. All headers are passed via the OpenAI SDK's extra_headers (Python) or headers (TypeScript) parameter.

Request Headers

| Header | Behavior | Example |
| --- | --- | --- |
| x-cache-ttl | Custom TTL in seconds for this cache entry | x-cache-ttl: 3600 |
| Cache-Control: max-age=N | Standard HTTP TTL (same effect as x-cache-ttl) | Cache-Control: max-age=86400 |
| Cache-Control: no-cache | Bypass cache read — always call the LLM, still store the result | Cache-Control: no-cache |
| Cache-Control: no-store | Bypass both read and write — ephemeral request, nothing cached | Cache-Control: no-store |
| x-similarity-threshold | Override the semantic similarity threshold for this request (0.50–1.00) | x-similarity-threshold: 0.90 |

Cache Duration (TTL) — Priority Order

The cache TTL (time-to-live) determines how long a cached response is stored before expiring. There are three ways to set it. The highest-priority source wins:

1. Per-request header (highest priority): x-cache-ttl: 3600 or Cache-Control: max-age=3600. If both are present, x-cache-ttl wins. The value is clamped to 1 second – 90 days.

2. Server default: if no header is sent, the server default of 7 days (604,800 seconds) applies. Configured via the CACHE_TTL_SECONDS environment variable on the cache service.
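The resolution order above can be sketched as a small helper. The clamp bounds (1 second to 90 days) and the 7-day default come from this section; the function itself is an illustration, not the server code.

```python
DEFAULT_TTL = 604_800          # 7 days, the server default
MAX_TTL = 90 * 24 * 3600       # clamp ceiling: 90 days

def resolve_ttl(headers):
    """Resolve the cache TTL from request headers, highest priority first."""
    raw = headers.get("x-cache-ttl")
    if raw is None:
        # Fall back to Cache-Control: max-age=N
        for part in headers.get("cache-control", "").split(","):
            key, _, value = part.strip().partition("=")
            if key == "max-age":
                raw = value
    if raw is None:
        return DEFAULT_TTL
    try:
        ttl = int(raw)
    except ValueError:
        return DEFAULT_TTL
    # Clamp to the documented 1 second - 90 days range.
    return max(1, min(ttl, MAX_TTL))
```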

Examples

# Custom 1-hour TTL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-cache-ttl": "3600"}
)

# Force fresh response (bypass cache read, still store)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"Cache-Control": "no-cache"}
)

# Ephemeral — don't read or store
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Sensitive question"}],
    extra_headers={"Cache-Control": "no-store"}
)

Similarity Threshold

The similarity threshold controls how closely a new query must match an existing cached query to count as a semantic cache hit. A value of 1.00 means only exact semantic matches; a lower value like 0.85 accepts looser paraphrases, increasing hit rate but risking false positives.

Three Ways to Set It — Priority Order

The threshold is resolved per-request. The highest-priority source that is set wins:

1. Per-request header (highest priority): send x-similarity-threshold: 0.90 to override for a single request. Useful for A/B testing or when a specific call needs tighter or looser matching. The value is clamped to 0.50 – 1.00. Invalid values are silently ignored.

2. Per-user setting (Dashboard): go to Dashboard → Settings → Cache Configuration and drag the slider. This value is stored in your account and applies to all requests that don't include the header. Range: 0.50 – 1.00.

3. Server default: if neither the header nor the dashboard setting is configured, the server default of 0.95 applies. Configured via the SIMILARITY_THRESHOLD environment variable on the cache service.

Resolution summary

x-similarity-threshold header → Dashboard per-user setting → Server default (0.95)
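The same resolution can be expressed in a few lines. The clamp range (0.50–1.00), the silent handling of invalid header values, and the 0.95 default all come from this section; the function is a sketch, not the server implementation.

```python
DEFAULT_THRESHOLD = 0.95  # server default

def resolve_threshold(headers, user_setting=None):
    """Header -> dashboard per-user setting -> server default."""
    raw = headers.get("x-similarity-threshold")
    if raw is not None:
        try:
            # Clamp to the documented 0.50 - 1.00 range.
            return min(max(float(raw), 0.50), 1.00)
        except ValueError:
            pass  # invalid values are silently ignored
    if user_setting is not None:
        return user_setting
    return DEFAULT_THRESHOLD
```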

Choosing a Value

| Range | Label | Behavior |
| --- | --- | --- |
| 0.97 – 1.00 | Very strict | Nearly identical queries only. Lowest false-positive risk, fewest cache hits. |
| 0.93 – 0.96 | Balanced | Catches close paraphrases (e.g., "What is caching?" ≈ "Explain caching"). Good default for most apps. |
| 0.85 – 0.92 | Aggressive | More cache hits, but some false positives for loosely related queries. |
| 0.50 – 0.84 | Very aggressive | High false-positive risk. Only recommended for very narrow, domain-specific use cases. |

Examples

# Set threshold globally for all requests on this client
client = OpenAI(
    api_key="sc-your-key",
    base_url="https://api.semacache.io/v1",
    default_headers={"x-similarity-threshold": "0.90"},
)

# Or override per-request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-similarity-threshold": "0.98"},
)

# Combine with cache control
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={
        "x-similarity-threshold": "0.85",
        "x-cache-ttl": "3600",
    },
)

Applies to all endpoints: The similarity threshold affects chat completions, image generation, and video generation identically. The same priority order (header → dashboard → server default) applies to all three.

Custom Models

Available on Pro and Enterprise plans.

Register any OpenAI-compatible endpoint (vLLM, Ollama, Together AI, Groq, Fireworks, etc.) and SemaCache will cache responses the same way it handles built-in models.

Register via Dashboard

Go to Dashboard → Custom Models → Add Model. Provide the base URL, model name, and authentication details. Once registered, use the model alias in any request.

Register via API

POST /dashboard/custom-models

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model_alias | string | required | Name you'll use in requests (e.g., "my-llama"). |
| base_url | string | required | Provider base URL (e.g., "https://api.together.xyz"). |
| downstream_model | string | optional | Actual model name sent to the provider (e.g., "meta-llama/Llama-3-70b"). |
| api_path | string | optional | API path (default: "/v1/chat/completions"). |
| auth_header | string | optional | Auth header name (default: "Authorization"). |
| auth_prefix | string | optional | Auth prefix (default: "Bearer"). |
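A registration call from Python might look like the sketch below. The payload keys mirror the table above; the host (api.semacache.io) is an assumption carried over from the proxy base URL, so check your dashboard for the actual API origin.

```python
import json
from urllib import request

# Payload keys mirror the registration table; the values are examples.
# api_path, auth_header, and auth_prefix are omitted here, so they take
# their documented defaults ("/v1/chat/completions", "Authorization", "Bearer").
payload = {
    "model_alias": "my-llama",
    "base_url": "https://api.together.xyz",
    "downstream_model": "meta-llama/Llama-3-70b",
}

def register_custom_model(api_key, host="https://api.semacache.io"):
    # NOTE: the host is an assumption -- confirm the dashboard API origin.
    req = request.Request(
        host + "/dashboard/custom-models",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```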

Usage

After registration, use your custom model alias like any built-in model:

Python
# Store your custom model's API key in the dashboard first
# Then use the alias like any other model
response = client.chat.completions.create(
    model="my-llama",
    messages=[{"role": "user", "content": "Hello!"}]
)
# SemaCache auto-detects it's a custom model and routes accordingly

Response Headers

Every response includes cache metadata headers so you can observe caching behavior.

| Header | Description | Example Values |
| --- | --- | --- |
| x-semcache-match-type | How the response was resolved | EXACT, SEMANTIC, NATIVE |
| x-semcache-latency | Total processing time in milliseconds | 4.2, 18.7, 1240.5 |
| x-semcache-confidence | Semantic similarity score (SEMANTIC matches only) | 0.9712 |

Inspect headers (Python)
response = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.headers["x-semcache-match-type"])  # EXACT | SEMANTIC | NATIVE
print(response.headers["x-semcache-latency"])      # 4.2
print(response.headers["x-semcache-confidence"])   # 0.9712 (semantic only)
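Since every response carries x-semcache-match-type, you can aggregate it to track your effective hit rate. A minimal sketch, counting EXACT and SEMANTIC as hits:

```python
from collections import Counter

def hit_rate(match_types):
    """Fraction of requests served from cache (EXACT or SEMANTIC)."""
    counts = Counter(match_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["EXACT"] + counts["SEMANTIC"]) / total
```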

Error Handling

Errors follow the OpenAI error format for compatibility with existing error handling code.

| Status | Type | Cause |
| --- | --- | --- |
| 400 | invalid_request_error | Missing required fields, invalid model, or bad upstream key |
| 401 | authentication_error | Invalid or missing SemaCache API key |
| 403 | permission_error | Feature not available on your plan (e.g., custom models on Free tier) |
| 429 | rate_limit_error | Monthly quota or burst rate limit exceeded. Includes Retry-After header for burst limits. |
| 500 | server_error | Internal error or upstream provider failure |

Error response format
{
  "error": {
    "message": "Monthly request limit exceeded. Upgrade to Pro for 50,000 requests/month.",
    "type": "rate_limit_error"
  }
}

Rate Limits

Rate limits are enforced at two levels: monthly quotas (total requests per billing period) and burst limits (requests per second / per minute) to prevent abuse and protect upstream providers.

Monthly Quotas

| Plan | Requests / month | API Keys | Custom Models | Audit Log |
| --- | --- | --- | --- | --- |
| Free | 1,000 | 1 | No | 7 days |
| Pro — $9/mo | 50,000 | 5 | Yes | 30 days |
| Enterprise — $39/mo | 500,000 | Unlimited | Yes | 90 days |

Burst Rate Limits

Burst limits use a sliding window to prevent short-duration spikes. These protect both you and the upstream LLM providers from runaway loops or accidental floods.

| Plan | Per second | Per minute |
| --- | --- | --- |
| Free | 3 | 20 |
| Pro | 20 | 200 |
| Enterprise | 100 | 2,000 |
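A sliding-window check like the one described above can be sketched in a few lines. This illustrates the technique only; it is not SemaCache's implementation.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` events per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = deque()  # timestamps of accepted events

    def allow(self, now):
        # Drop events that have slid out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

For example, the Free tier's 3 requests/second limit would be SlidingWindowLimiter(limit=3, window=1.0).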

Rate Limit Response

When a rate limit is exceeded, the API returns 429 with a Retry-After header (for burst limits). Implement exponential backoff or respect the header value.

429 Response
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: 20 requests/second. Try again shortly.",
    "type": "rate_limit_error"
  }
}
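A retry loop that honors Retry-After and falls back to exponential backoff might look like this. The sleep function is injected so the sketch is easy to test, and RateLimited is a stand-in for whatever 429 error your HTTP client raises.

```python
import time

class RateLimited(Exception):
    """Stand-in for a 429 error; carries the Retry-After value if present."""

    def __init__(self, retry_after=None):
        self.retry_after = retry_after

def with_backoff(call, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on rate limits, preferring the server's Retry-After."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited as err:
            if attempt == max_retries - 1:
                raise  # out of retries -- surface the error
            # Prefer the server's Retry-After; otherwise back off exponentially.
            delay = (err.retry_after if err.retry_after is not None
                     else base_delay * 2 ** attempt)
            sleep(delay)
```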

Cache hits count toward your monthly request limit. Upgrade anytime from the billing page. Need higher limits? Contact us for custom enterprise pricing.