LLM glossary
51 terms. A categorized reference for the terms that show up across model evaluations, deployment planning, pricing reviews, and AI governance decisions.
Capability Concepts
Core model behaviors that shape reasoning quality, multimodal depth, and how much orchestration you need around the model.
Context window
The maximum amount of text, measured in tokens, that a model can process in a single request.
Signal to watch: larger windows help long-document analysis, but recall quality still depends on model reasoning.
Instruction tuning
Post-training that teaches a model to follow assistant-style prompts, policies, and response formats more reliably.
Signal to watch: good tuning improves steerability, not raw knowledge freshness.
Structured output
A model's ability to return schema-constrained JSON or other predictable response formats.
Signal to watch: this matters when LLMs feed workflows, dashboards, or downstream automation.
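A minimal sketch of consuming structured output in Python. The reply string, keys, and the hand-rolled check are illustrative assumptions; production systems typically use JSON Schema validators or provider-side constrained decoding instead.

```python
import json

def parse_structured(raw: str, required_keys: set[str]) -> dict:
    """Parse a model reply expected to be JSON and check required fields."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical model reply for a ticket-triage workflow.
reply = '{"category": "billing", "priority": "high"}'
record = parse_structured(reply, {"category", "priority"})
print(record["priority"])  # high
```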
Multimodal
A system that can understand or generate across text, image, audio, or video instead of plain text alone.
Signal to watch: check which modalities are native versus routed through separate models.
Hallucination
A plausible-sounding output that is fabricated, unsupported, or materially incorrect.
Signal to watch: benchmark wins do not eliminate hallucination in domain-specific workflows.
System prompt
A hidden instruction prepended to every conversation that sets tone, constraints, and behavioral guardrails for the model.
Why it matters: system prompts are the primary lever for controlling model behavior in production apps.
Retrieval-augmented generation
An architecture that fetches relevant documents from an external store and injects them into the prompt before generation.
Signal to watch: RAG quality depends heavily on retrieval precision, not just the model's reasoning ability.
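The retrieve-then-generate loop can be sketched in a few lines. `search_index` and `call_model` below are hypothetical stand-ins for a vector-store query and an LLM API call, not real library functions.

```python
def answer_with_rag(question: str, search_index, call_model, k: int = 3) -> str:
    """Fetch top-k documents, inject them into the prompt, then generate."""
    docs = search_index(question, k=k)           # retrieval step
    context = "\n\n".join(docs)                  # inject retrieved passages
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)                    # generation step
```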
Fine-tuning
Training a pre-trained model on domain-specific data to improve accuracy or match a particular style.
Why it matters: fine-tuning trades upfront training cost for lower prompt engineering overhead and better task fit.
Embedding
A dense numerical vector that represents text semantics, enabling similarity search and clustering.
Signal to watch: embedding quality directly affects RAG recall and semantic search relevance.
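Similarity between embeddings is usually measured with cosine similarity, sketched here on toy 3-dimensional vectors (production embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```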
Chain-of-thought
A prompting strategy where the model is guided to show intermediate reasoning steps before producing a final answer.
Signal to watch: chain-of-thought improves complex reasoning but increases token usage and latency.
Temperature
A sampling parameter that controls randomness in output generation, where lower values produce more deterministic responses.
Why it matters: temperature tuning is a key trade-off between creative diversity and factual consistency.
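Under the hood, temperature divides the model's logits before the softmax, which is why low values sharpen the distribution and high values flatten it. A minimal sketch:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw logits to sampling probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.2)   # near-deterministic
flat = softmax_with_temperature(logits, 2.0)    # closer to uniform
```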
Evaluation & Benchmarks
The signals analysts use to compare models beyond marketing claims, including repeatability, speed, and failure modes.
Benchmark
A standardized test suite used to compare model performance on a defined set of tasks.
Signal to watch: benchmark fit matters more than raw leaderboard position.
Elo rating
A comparative ranking method that estimates performance from head-to-head preference outcomes.
Signal to watch: Elo captures relative preference, not absolute accuracy or cost efficiency.
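The standard Elo update from a single head-to-head preference looks like this (K-factor 32 is a conventional default, not a standard fixed by model leaderboards):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a: 1.0 if A preferred, 0.0 if B, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Equal ratings, A preferred: A gains 16 points, B loses 16.
a, b = elo_update(1500.0, 1500.0, 1.0)
print(a, b)  # 1516.0 1484.0
```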
Latency SLA
A provider commitment for how quickly responses should arrive under stated operating conditions.
Why it matters: latency commitments shape user experience and workflow feasibility.
Throughput
The amount of work a system can process over time, often expressed as requests or tokens per minute.
Signal to watch: throughput ceilings become visible during batch jobs and agent loops.
Reliability
How consistently a model or service produces usable outputs without outages, timeouts, or malformed responses.
Signal to watch: reliability often matters more than peak benchmark scores in production.
Perplexity
A measurement of how well a model predicts a sample of text, where lower values indicate a better fit to the corpus.
Signal to watch: perplexity is useful for comparing models on the same corpus, but not across domains.
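Concretely, perplexity is the exponential of the mean negative log-likelihood per token:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# If every token had probability 0.25, perplexity is 4.
print(perplexity([math.log(0.25)] * 10))  # ~4.0
```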
Human eval
A scoring process where human reviewers rate model outputs for quality, accuracy, safety, or preference.
Why it matters: human eval catches issues automated benchmarks miss, but it is expensive and slow to scale.
Safety evaluation
Testing that probes a model for harmful, biased, or policy-violating outputs under adversarial prompting.
Signal to watch: safety evals must be continuous because new attack patterns emerge with every model release.
Time to first token
The elapsed time between sending a prompt and receiving the first output token from the model.
Why it matters: TTFT drives perceived responsiveness in chat interfaces and interactive tools.
Token error rate
The frequency of malformed, truncated, or corrupted tokens in model output.
Signal to watch: elevated error rates often signal capacity pressure or serving stack instability.
Deployment & Ops
Terms that matter once a model leaves the lab and becomes part of a live product, internal tool, or enterprise workflow.
Inference endpoint
A hosted API surface where applications send prompts and receive model outputs.
Why it matters: endpoint design affects auth, observability, and rate-limit behavior.
Dynamic batching
A serving technique that groups compatible requests together to improve GPU utilization and lower cost.
Signal to watch: batching can cut cost, but it may add queueing delay under bursty traffic.
Autoscaling
Automatically adding or removing serving capacity as workload demand rises or falls.
Why it matters: poor autoscaling policies create latency spikes during launches or business-hour peaks.
Observability
The logs, traces, metrics, and prompt-level visibility needed to diagnose quality or performance problems.
Why it matters: without observability, teams struggle to trace failures back to prompts, tools, or providers.
Data residency
A deployment constraint that specifies where data is stored, processed, or allowed to transit.
Why it matters: regulated buyers often treat residency as a hard gate before pricing discussions start.
Rate limit
A cap on the number of requests or tokens a client can submit within a time window.
Why it matters: rate limits protect shared infrastructure but can throttle bursty agent workloads.
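A common client-side pattern for staying under a provider limit is a token bucket, sketched minimally here (capacity and refill rate are illustrative, and real clients would also back off on 429 responses):

```python
import time

class TokenBucket:
    """Minimal client-side token bucket for pacing requests."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should back off or queue the request

bucket = TokenBucket(capacity=2, refill_per_sec=0.5)
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())
```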
GPU utilization
The percentage of available GPU compute capacity actually used during inference.
Signal to watch: low utilization means you are paying for idle hardware; high utilization risks queuing.
Model routing
A proxy layer that directs incoming requests to different models based on task type, cost, or availability.
Why it matters: smart routing can cut costs by sending simple tasks to cheaper, faster models.
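At its simplest, routing is a lookup from task type to model with a fallback. The model names and task types below are hypothetical placeholders, not real provider SKUs:

```python
# Toy routing table keyed by task type.
ROUTES = {
    "classification": "small-fast-model",
    "extraction": "small-fast-model",
    "code_review": "large-reasoning-model",
}

def route(task_type: str, default: str = "mid-tier-model") -> str:
    """Pick a model for a request; unknown task types fall back to a default."""
    return ROUTES.get(task_type, default)

print(route("classification"))  # small-fast-model
print(route("legal_summary"))   # mid-tier-model
```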
KV cache
Stored key-value pairs from previous attention computations that speed up autoregressive token generation.
Signal to watch: KV cache memory pressure is the main constraint on concurrent users per GPU.
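A back-of-envelope estimate of KV-cache size per sequence (keys plus values, across all layers) shows why it constrains concurrency. The model shape below is hypothetical, and real servers add paging and allocator overhead:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size per sequence: 2 (K and V) x layers x heads x dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 32-layer model, 8 KV heads of dim 128, 4096-token context, fp16.
gib = kv_cache_bytes(32, 8, 128, 4096, 2) / 2**30
print(f"{gib:.2f} GiB per concurrent sequence")  # 0.50 GiB per concurrent sequence
```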
Quantization
Reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to lower memory and compute requirements.
Why it matters: quantization enables self-hosting on cheaper hardware but may degrade output quality.
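The core idea can be shown with a toy symmetric int8 round trip: each restored weight lands within one quantization step of the original, which is where the quality loss comes from. This is a sketch, not a production scheme (real quantizers work per-channel or per-group):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is within one quantization step (scale) of the original.
```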
Pricing & Procurement
Commercial terms buyers use to evaluate total cost, contract flexibility, and the true economics of sustained usage.
Per-token pricing
A billing model where cost scales with the volume of input and output tokens processed.
Why it matters: token price is the starting point, not the full cost of a workload.
Reserved capacity
Pre-committed compute or throughput purchased in exchange for guaranteed access or discounted rates.
Why it matters: reserved deals reduce uncertainty for large deployments but increase forecasting risk.
Burst pricing
Premium pricing or overage treatment applied when usage exceeds a contracted or baseline allocation.
Why it matters: burst behavior can distort unit economics during launches or agent-heavy workloads.
Service credit
A contractual remedy that compensates customers when a provider misses defined service commitments.
Why it matters: credits rarely repay business impact, but they expose how seriously SLAs are enforced.
Cost per task
The effective spend required to complete a business action such as a summary, support resolution, or code review.
Why it matters: cost per task is often more decision-useful than raw price per token.
Token budget
A spending cap or allocation measured in tokens, used to control cost across teams or features.
Why it matters: token budgets prevent runaway spend in agent loops or long-running batch jobs.
Input vs output pricing
The split between cost per input token (prompts) and cost per output token (completions), which often differ by 2-4x.
Signal to watch: output-heavy workloads like summarization cost disproportionately more than retrieval tasks.
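The split is easy to see in a per-request cost calculation. The rates below are hypothetical, not any provider's actual price sheet:

```python
# Hypothetical rates: $0.50 per million input tokens, $2.00 per million output.
INPUT_RATE = 0.50 / 1_000_000
OUTPUT_RATE = 2.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under split input/output pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A summarization call: long input, moderate output.
cost = request_cost(input_tokens=8000, output_tokens=1000)
print(f"${cost:.4f}")  # $0.0060
```

Note that the 1,000 output tokens cost a third of the total despite being an eighth of the volume, which is why output-heavy workloads dominate spend.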
Committed use discount
A price reduction offered in exchange for a contractual commitment to spend a minimum amount over a set period.
Why it matters: discounts range from 15-60%, but underuse penalties can erase the savings.
Free tier
A baseline allocation of requests or tokens provided at no charge, designed for experimentation or low-volume use.
Signal to watch: free tier limits often differ from paid tier rate limits, causing surprises at scale.
Total cost of ownership
The full lifecycle cost of running LLMs including API spend, engineering time, infrastructure, and risk mitigation.
Why it matters: API pricing is usually 30-50% of TCO; the rest is integration, evaluation, and ops overhead.
Trends & Guardrails
Emerging patterns and governance ideas that shape where the market is moving and where adoption can break down.
Agent
A system that uses a model to plan, call tools, and iterate toward a goal across multiple steps.
Signal to watch: useful agents depend on tooling, memory, and guardrails more than prompt cleverness.
Tool use
A model's ability to invoke external functions, APIs, or retrieval systems while generating an answer.
Signal to watch: tool routing quality affects whether agents stay grounded or drift into brittle loops.
Alignment
Techniques that steer a model toward preferred behavior, safety policies, and user intent.
Signal to watch: stronger alignment can improve trustworthiness but may also narrow behavior in edge cases.
Open-weight
A model whose weights are available for self-hosting, tuning, or private deployment.
Why it matters: open-weight models trade turnkey convenience for control over performance, cost, and data handling.
Privacy guardrail
A control that reduces the chance of exposing sensitive data through prompts, logs, training, or outputs.
Why it matters: privacy controls need to cover the full workflow, not only the model endpoint.
Prompt injection
An adversarial technique where hidden instructions in user input override or manipulate model behavior.
Signal to watch: prompt injection remains unsolved; defense-in-depth with input sanitization is essential.
Constitutional AI
An alignment method where a model critiques and revises its own outputs against a set of explicit principles.
Signal to watch: constitutional methods scale alignment review, but principle gaps still produce edge-case failures.
Model context protocol
An emerging standard for how applications share context, tools, and resources with language models.
Signal to watch: MCP adoption is growing fast; it may become the default integration pattern for tool-augmented LLMs.
Self-hosting
Running model inference on your own infrastructure instead of consuming a managed API.
Why it matters: self-hosting gives data control and cost predictability but requires GPU ops expertise.
Red teaming
Adversarial testing that probes a model for harmful outputs, jailbreaks, or policy violations before deployment.
Why it matters: red teaming is becoming a regulatory expectation, not just a best practice.