API Documentation
SemaCache is an OpenAI-compatible caching proxy. Use any OpenAI SDK; just change the base URL and API key. All responses include cache metadata headers.
Quick Start
Get started in under 60 seconds. Create an API key in the dashboard, then point your OpenAI client at SemaCache.
from openai import OpenAI
client = OpenAI(
api_key="sc-your-key",
base_url="https://api.semacache.io/v1"
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "What is semantic caching?"}
]
)
print(response.choices[0].message.content)
How it works: SemaCache checks for an exact or semantically similar cached response. On a hit, it returns the cached result in ~5–20ms. On a miss, it forwards the request to the upstream LLM, caches the response, and returns it. Your app sees a standard OpenAI response either way.
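The flow just described can be sketched as illustrative Python (a conceptual model, not SemaCache's actual code; `similarity` stands in for a real embedding-based comparison):

```python
# Conceptual sketch of the three-tier lookup: exact match, then semantic
# match above the threshold, then passthrough to the upstream LLM.
def lookup(query, cache, threshold=0.95, similarity=None):
    # Tier 1: exact match on the query text.
    if query in cache:
        return cache[query], "EXACT"
    # Tier 2: semantic match above the similarity threshold.
    if similarity is not None:
        for cached_query, response in cache.items():
            if similarity(query, cached_query) >= threshold:
                return response, "SEMANTIC"
    # Tier 3: miss. The proxy would forward to the upstream LLM,
    # store the fresh response, and return it.
    return None, "NATIVE"
```

The match type each request resolved to is reported back in the `x-semcache-match-type` response header (see Response Headers below).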
Authentication
All API requests require a SemaCache API key in the Authorization header. Create keys in the dashboard.
Authorization: Bearer sc-your-key
Upstream LLM Keys
SemaCache proxies requests to your LLM provider using your upstream API key. Two options:
Option A: Save in Dashboard (recommended)
Go to Dashboard → Settings → Add your OpenAI or Gemini key. Keys are AES-256 encrypted at rest. The key is auto-resolved on every request.
Option B: Pass per-request
Send x-upstream-api-key header. This takes priority over stored keys.
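The priority rule is simple enough to sketch (a hypothetical helper, not part of any SDK):

```python
# Sketch of upstream-key resolution: a per-request x-upstream-api-key
# header takes priority over a key stored in the dashboard.
def resolve_upstream_key(request_headers, stored_key=None):
    per_request = request_headers.get("x-upstream-api-key")
    if per_request:
        return per_request
    return stored_key  # may be None if neither is configured
```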
curl -X POST https://api.semacache.io/v1/chat/completions \
-H "Authorization: Bearer sc-your-key" \
-H "x-upstream-api-key: sk-your-openai-key" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "messages": [...]}'
Chat Completions
OpenAI-compatible chat completions with three-tier caching: exact match → semantic match → LLM passthrough.
/v1/chat/completions
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | required | Model ID: gpt-5.4, gpt-4o, gemini-3.1-pro-preview, gemini-2.5-flash, grok-4.20, grok-3, or any registered custom model. |
| messages | array | required | Array of message objects with role and content. Supports text and multimodal (image URLs). |
| temperature | number | optional | Sampling temperature (0–2). Passed to upstream LLM on cache miss. |
| max_tokens | integer | optional | Maximum tokens to generate. Passed to upstream LLM on cache miss. |
| stream | boolean | optional | Not yet supported. Requests with stream=true are passed through without caching. |
Supported Models
OpenAI
gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o3-mini, o4-mini
Google Gemini
gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-3.1-flash-lite-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
xAI Grok
grok-4.20, grok-4, grok-4-fast, grok-3, grok-3-mini, grok-3-fast
Response
Standard OpenAI-compatible response format, regardless of upstream provider.
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Semantic caching stores responses..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 85,
"total_tokens": 97
}
}
Image Generation
Generate images with caching. Supports OpenAI GPT Image, Google Imagen, and xAI Grok Imagine. Generated images are rehosted to cloud storage for permanent URLs.
/v1/images/generations
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | required | Text description of the image to generate. |
| model | string | optional | gpt-image-1 (default), gpt-image-1.5, gpt-image-1-mini, imagen-4.0-generate-001, imagen-4.0-ultra-generate-001, imagen-4.0-fast-generate-001, grok-imagine-image, or grok-imagine-image-pro. |
| n | integer | optional | Number of images (default: 1). |
| size | string | optional | Image size: 1024x1024 (default), 1024x1792, 1792x1024. |
| quality | string | optional | Quality: standard (default) or hd. |
response = client.images.generate(
model="gpt-image-1",
prompt="A cat wearing a top hat, oil painting",
size="1024x1024"
)
image_url = response.data[0].url
Video Generation
Generate videos with Google Veo or xAI Grok Imagine Video. Videos are rehosted to cloud storage. Requires an upstream API key (Gemini for Veo, xAI for Grok).
/v1/videos/generations
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | required | Text description of the video to generate. |
| model | string | optional | veo-2.0-generate-001 (default), veo-3.1-generate-preview, veo-3.1-fast-generate-preview, veo-3.1-lite-generate-preview, veo-3.0-generate-001, veo-3.0-fast-generate-001, or grok-imagine-video. |
| duration_seconds | integer | optional | Video duration: 5–8 seconds (default: 8). |
| aspect_ratio | string | optional | 16:9 (default) or 9:16. |
| n | integer | optional | Number of videos (default: 1). |
curl -X POST https://api.semacache.io/v1/videos/generations \
-H "Authorization: Bearer sc-your-key" \
-H "Content-Type: application/json" \
-d '{
"model": "veo-2.0-generate-001",
"prompt": "A drone shot over a mountain lake at sunrise",
"duration_seconds": 8,
"aspect_ratio": "16:9"
}'
Cache Control
Control caching behavior per-request using HTTP headers. All headers are passed via the OpenAI SDK's extra_headers (Python) or headers (TypeScript) parameter.
Request Headers
| Header | Behavior | Example |
|---|---|---|
| x-cache-ttl | Custom TTL in seconds for this cache entry | x-cache-ttl: 3600 |
| Cache-Control: max-age=N | Standard HTTP TTL (same effect as x-cache-ttl) | Cache-Control: max-age=86400 |
| Cache-Control: no-cache | Bypass cache read — always call the LLM, still store the result | Cache-Control: no-cache |
| Cache-Control: no-store | Bypass both read and write — ephemeral request, nothing cached | Cache-Control: no-store |
| x-similarity-threshold | Override the semantic similarity threshold for this request (0.50–1.00) | x-similarity-threshold: 0.90 |
Cache Duration (TTL) — Priority Order
The cache TTL (time-to-live) determines how long a cached response is stored before it expires. There are three ways to set it (two headers and a server default); the highest-priority source wins:
Per-request header (highest priority)
x-cache-ttl: 3600 or Cache-Control: max-age=3600. If both are present, x-cache-ttl wins. Value is clamped to 1 second – 90 days.
Server default
If no header is sent, the server default applies: 7 days (604,800 seconds). Configured via the CACHE_TTL_SECONDS environment variable on the cache service.
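The priority order above can be sketched as (an illustrative helper; header names, clamp range, and default as documented):

```python
# Sketch of TTL resolution: x-cache-ttl beats Cache-Control: max-age,
# which beats the server default (7 days). Values are clamped to
# 1 second .. 90 days.
MIN_TTL = 1
MAX_TTL = 90 * 24 * 3600      # 90 days
DEFAULT_TTL = 7 * 24 * 3600   # 604,800 seconds

def resolve_ttl(headers):
    raw = headers.get("x-cache-ttl")
    if raw is None:
        # Fall back to a max-age directive in Cache-Control, if present.
        for part in headers.get("Cache-Control", "").split(","):
            part = part.strip()
            if part.startswith("max-age="):
                raw = part[len("max-age="):]
                break
    if raw is not None:
        try:
            return max(MIN_TTL, min(int(raw), MAX_TTL))
        except ValueError:
            pass  # unparsable values fall through to the default
    return DEFAULT_TTL
```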
Examples
# Custom 1-hour TTL
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"x-cache-ttl": "3600"}
)
# Force fresh response (bypass cache read, still store)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"Cache-Control": "no-cache"}
)
# Ephemeral — don't read or store
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Sensitive question"}],
extra_headers={"Cache-Control": "no-store"}
)
Similarity Threshold
The similarity threshold controls how closely a new query must match an existing cached query to count as a semantic cache hit. A value of 1.00 means only exact semantic matches; a lower value like 0.85 accepts looser paraphrases, increasing hit rate but risking false positives.
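For intuition: semantic matching typically scores query pairs by cosine similarity between embedding vectors. The toy 3-dimensional vectors below are purely illustrative (SemaCache's actual embedding model is not specified here):

```python
import math

# Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
# The threshold decides which scores still count as a cache hit.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Identical toy vectors score 1.0; partially overlapping ones score
# around 0.8, which a 0.95 threshold would reject but 0.75 would accept.
```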
Three Ways to Set It — Priority Order
The threshold is resolved per-request. The highest-priority source that is set wins:
Per-request header (highest priority)
Send x-similarity-threshold: 0.90 to override for a single request. Useful for A/B testing or when a specific call needs tighter/looser matching. Value is clamped to 0.50 – 1.00. Invalid values are silently ignored.
Per-user setting (Dashboard)
Go to Dashboard → Settings → Cache Configuration and drag the slider. This value is stored in your account and applies to all requests that don't include the header. Range: 0.50 – 1.00.
Server default
If neither the header nor the dashboard setting is configured, the server default applies: 0.95. Configured via the SIMILARITY_THRESHOLD environment variable on the cache service.
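The resolution order can be sketched as (an illustrative helper; clamping and silent-ignore behavior as documented above):

```python
# Sketch of threshold resolution: per-request header beats the dashboard
# setting, which beats the server default (0.95). Header values are
# clamped to 0.50 .. 1.00; non-numeric values are silently ignored.
DEFAULT_THRESHOLD = 0.95

def resolve_threshold(headers, user_setting=None):
    raw = headers.get("x-similarity-threshold")
    if raw is not None:
        try:
            return min(max(float(raw), 0.50), 1.00)
        except ValueError:
            pass  # invalid header values are silently ignored
    if user_setting is not None:
        return user_setting
    return DEFAULT_THRESHOLD
```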
Resolution summary
x-similarity-threshold header → Dashboard per-user setting → Server default (0.95)
Choosing a Value
| Range | Label | Behavior |
|---|---|---|
| 0.97 – 1.00 | Very strict | Nearly identical queries only. Lowest false-positive risk, fewest cache hits. |
| 0.93 – 0.96 | Balanced | Catches close paraphrases (e.g., "What is caching?" ≈ "Explain caching"). Good default for most apps. |
| 0.85 – 0.92 | Aggressive | More cache hits, but some false positives for loosely related queries. |
| 0.50 – 0.84 | Very aggressive | High false-positive risk. Only recommended for very narrow, domain-specific use cases. |
Examples
# Set threshold globally for all requests on this client
client = OpenAI(
api_key="sc-your-key",
base_url="https://api.semacache.io/v1",
default_headers={"x-similarity-threshold": "0.90"},
)
# Or override per-request
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"x-similarity-threshold": "0.98"},
)
# Combine with cache control
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={
"x-similarity-threshold": "0.85",
"x-cache-ttl": "3600",
},
)
Applies to all endpoints: The similarity threshold affects chat completions, image generation, and video generation identically. The same priority order (header → dashboard → server default) applies to all three.
Custom Models
Available on Pro and Enterprise plans.
Register any OpenAI-compatible endpoint (vLLM, Ollama, Together AI, Groq, Fireworks, etc.) and SemaCache will cache responses the same way it handles built-in models.
Register via Dashboard
Go to Dashboard → Custom Models → Add Model. Provide the base URL, model name, and authentication details. Once registered, use the model alias in any request.
Register via API
/dashboard/custom-models
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_alias | string | required | Name you'll use in requests (e.g., "my-llama"). |
| base_url | string | required | Provider base URL (e.g., "https://api.together.xyz"). |
| downstream_model | string | optional | Actual model name sent to the provider (e.g., "meta-llama/Llama-3-70b"). |
| api_path | string | optional | API path (default: "/v1/chat/completions"). |
| auth_header | string | optional | Auth header name (default: "Authorization"). |
| auth_prefix | string | optional | Auth prefix (default: "Bearer"). |
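Assuming the endpoint accepts a JSON POST with the fields above, a registration payload can be sketched like this (the Together AI values are illustrative, taken from the table's own examples):

```python
import json

# Illustrative payload builder for POST /dashboard/custom-models.
# Only model_alias and base_url are required; the remaining fields
# fall back to the defaults listed in the table above.
def build_registration(model_alias, base_url, **overrides):
    payload = {
        "model_alias": model_alias,
        "base_url": base_url,
        "api_path": "/v1/chat/completions",
        "auth_header": "Authorization",
        "auth_prefix": "Bearer",
    }
    payload.update(overrides)
    return payload

body = json.dumps(build_registration(
    "my-llama",
    "https://api.together.xyz",
    downstream_model="meta-llama/Llama-3-70b",
))
```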
Usage
After registration, use your custom model alias like any built-in model:
# Store your custom model's API key in the dashboard first
# Then use the alias like any other model
response = client.chat.completions.create(
model="my-llama",
messages=[{"role": "user", "content": "Hello!"}]
)
# SemaCache auto-detects it's a custom model and routes accordingly
Response Headers
Every response includes cache metadata headers so you can observe caching behavior.
| Header | Description | Example Values |
|---|---|---|
| x-semcache-match-type | How the response was resolved | EXACT, SEMANTIC, NATIVE |
| x-semcache-latency | Total processing time in milliseconds | 4.2, 18.7, 1240.5 |
| x-semcache-confidence | Semantic similarity score (SEMANTIC matches only) | 0.9712 |
response = client.chat.completions.with_raw_response.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.headers["x-semcache-match-type"]) # EXACT | SEMANTIC | NATIVE
print(response.headers["x-semcache-latency"]) # 4.2
print(response.headers["x-semcache-confidence"]) # 0.9712 (semantic only)
Error Handling
Errors follow the OpenAI error format for compatibility with existing error handling code.
| Status | Type | Cause |
|---|---|---|
| 400 | invalid_request_error | Missing required fields, invalid model, or bad upstream key |
| 401 | authentication_error | Invalid or missing SemaCache API key |
| 403 | permission_error | Feature not available on your plan (e.g., custom models on Free tier) |
| 429 | rate_limit_error | Monthly quota or burst rate limit exceeded. Includes Retry-After header for burst limits. |
| 500 | server_error | Internal error or upstream provider failure |
{
"error": {
"message": "Monthly request limit exceeded. Upgrade to Pro for 50,000 requests/month.",
"type": "rate_limit_error"
}
}
Rate Limits
Rate limits are enforced at two levels: monthly quotas (total requests per billing period) and burst limits (requests per second / per minute) to prevent abuse and protect upstream providers.
Monthly Quotas
| Plan | Requests / month | API Keys | Custom Models | Audit Log |
|---|---|---|---|---|
| Free | 1,000 | 1 | No | 7 days |
| Pro — $9/mo | 50,000 | 5 | Yes | 30 days |
| Enterprise — $39/mo | 500,000 | Unlimited | Yes | 90 days |
Burst Rate Limits
Burst limits use a sliding window to prevent short-duration spikes. These protect both you and the upstream LLM providers from runaway loops or accidental floods.
| Plan | Per second | Per minute |
|---|---|---|
| Free | 3 | 20 |
| Pro | 20 | 200 |
| Enterprise | 100 | 2,000 |
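The sliding-window idea can be sketched as (an illustrative model, not SemaCache's server implementation):

```python
from collections import deque

# Sketch of a sliding-window rate limiter: keep timestamps of recent
# requests and reject a new one when the window is full.
class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False
```

Unlike a fixed-window counter, a sliding window never admits a double-size burst straddling a window boundary.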
Rate Limit Response
When a rate limit is exceeded, the API returns 429 with a Retry-After header (for burst limits). Implement exponential backoff or respect the header value.
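A minimal client-side delay policy, honoring Retry-After when present and otherwise falling back to exponential backoff with jitter (a sketch; tune `base` and `cap` for your app):

```python
import random

# Delay before retry attempt N: use the server's Retry-After value when
# it was sent, else exponential backoff (base * 2^attempt, capped) with
# jitter to avoid synchronized retries.
def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + random.random() / 2)  # jitter in [0.5x, 1x)
```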
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Content-Type: application/json
{
"error": {
"message": "Rate limit exceeded: 20 requests/second. Try again shortly.",
"type": "rate_limit_error"
}
}
Cache hits count toward your monthly request limit. Upgrade anytime from the billing page. Need higher limits? Contact us for custom enterprise pricing.