LLM glossary
51 terms. A categorized reference for the terms that show up across model evaluations, deployment planning, pricing reviews, and AI governance decisions.
Capability Concepts
Core model behaviors that shape reasoning quality, multimodal depth, and how much orchestration you need around the model.
Context window
The maximum amount of text, measured in tokens, that a model can process in a single request.
Signal to watch: larger windows help long-document analysis, but recall quality still depends on model reasoning.
Instruction tuning
Post-training that teaches a model to follow assistant-style prompts, policies, and response formats more reliably.
Signal to watch: good tuning improves steerability, not raw knowledge freshness.
Structured output
A model's ability to return schema-constrained JSON or other predictable response formats.
Signal to watch: this matters when LLMs feed workflows, dashboards, or downstream automation.
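A minimal sketch of consuming structured output in Python. The reply string, keys, and the hand-rolled check are illustrative assumptions; production systems typically use JSON Schema validators or provider-side constrained decoding instead.

```python
import json

def parse_structured(raw: str, required_keys: set[str]) -> dict:
    """Parse a model reply expected to be JSON and check required fields."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical model reply for a ticket-triage workflow.
reply = '{"category": "billing", "priority": "high"}'
record = parse_structured(reply, {"category", "priority"})
print(record["priority"])  # high
```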
Multimodal
A system that can understand or generate across text, image, audio, or video instead of plain text alone.
Signal to watch: check which modalities are native versus routed through separate models.
Hallucination
A plausible-sounding output that is fabricated, unsupported, or materially incorrect.
Signal to watch: benchmark wins do not eliminate hallucination in domain-specific workflows.
System prompt
A hidden instruction prepended to every conversation that sets tone, constraints, and behavioral guardrails for the model.
Why it matters: system prompts are the primary lever for controlling model behavior in production apps.
Retrieval-augmented generation
An architecture that fetches relevant documents from an external store and injects them into the prompt before generation.
Signal to watch: RAG quality depends heavily on retrieval precision, not just the model's reasoning ability.
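The retrieve-then-generate loop can be sketched in a few lines. `search_index` and `call_model` below are hypothetical stand-ins for a vector-store query and an LLM API call, not real library functions.

```python
def answer_with_rag(question: str, search_index, call_model, k: int = 3) -> str:
    """Fetch top-k documents, inject them into the prompt, then generate."""
    docs = search_index(question, k=k)           # retrieval step
    context = "\n\n".join(docs)                  # inject retrieved passages
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)                    # generation step
```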
Fine-tuning
Training a pre-trained model on domain-specific data to improve accuracy or match a particular style.
Why it matters: fine-tuning trades upfront training cost for lower prompt engineering overhead and better task fit.
Embedding
A dense numerical vector that represents text semantics, enabling similarity search and clustering.
Signal to watch: embedding quality directly affects RAG recall and semantic search relevance.
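Similarity between embeddings is usually measured with cosine similarity, sketched here on toy 3-dimensional vectors (production embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```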
Chain-of-thought
A prompting strategy where the model is guided to show intermediate reasoning steps before producing a final answer.
Signal to watch: chain-of-thought improves complex reasoning but increases token usage and latency.
Temperature
A sampling parameter that controls randomness in output generation, where lower values produce more deterministic responses.
Why it matters: temperature tuning is a key trade-off between creative diversity and factual consistency.
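Under the hood, temperature divides the model's logits before the softmax, which is why low values sharpen the distribution and high values flatten it. A minimal sketch:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw logits to sampling probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.2)   # near-deterministic
flat = softmax_with_temperature(logits, 2.0)    # closer to uniform
```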
Evaluation & Benchmarks
The signals analysts use to compare models beyond marketing claims, including repeatability, speed, and failure modes.
Benchmark
A standardized test suite used to compare model performance on a defined set of tasks.
Signal to watch: benchmark fit matters more than raw leaderboard position.
Elo rating
A comparative ranking method that estimates performance from head-to-head preference outcomes.
Signal to watch: Elo captures relative preference, not absolute accuracy or cost efficiency.
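The standard Elo update from a single head-to-head preference looks like this (K-factor 32 is a conventional default, not a standard fixed by model leaderboards):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a: 1.0 if A preferred, 0.0 if B, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Equal ratings, A preferred: A gains 16 points, B loses 16.
a, b = elo_update(1500.0, 1500.0, 1.0)
print(a, b)  # 1516.0 1484.0
```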
Latency SLA
A provider commitment for how quickly responses should arrive under stated operating conditions.
Why it matters: latency commitments shape user experience and workflow feasibility.
Throughput
The amount of work a system can process over time, often expressed as requests or tokens per minute.
Signal to watch: throughput ceilings become visible during batch jobs and agent loops.
Reliability
How consistently a model or service produces usable outputs without outages, timeouts, or malformed responses.
Signal to watch: reliability often matters more than peak benchmark scores in production.
Perplexity
A measurement of how well a model predicts a sample of text, where lower values indicate a better fit to the corpus.
Signal to watch: perplexity is useful for comparing models on the same corpus, but not across domains.
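Concretely, perplexity is the exponential of the mean negative log-likelihood per token:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# If every token had probability 0.25, perplexity is 4.
print(perplexity([math.log(0.25)] * 10))  # ~4.0
```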
Human eval
A scoring process where human reviewers rate model outputs for quality, accuracy, safety, or preference.
Why it matters: human eval catches issues automated benchmarks miss, but it is expensive and slow to scale.
Safety evaluation
Testing that probes a model for harmful, biased, or policy-violating outputs under adversarial prompting.
Signal to watch: safety evals must be continuous because new attack patterns emerge with every model release.
Time to first token
The elapsed time between sending a prompt and receiving the first output token from the model.
Why it matters: TTFT drives perceived responsiveness in chat interfaces and interactive tools.
Token error rate
The frequency of malformed, truncated, or corrupted tokens in model output.
Signal to watch: elevated error rates often signal capacity pressure or serving stack instability.
Deployment & Ops
Terms that matter once a model leaves the lab and becomes part of a live product, internal tool, or enterprise workflow.
Inference endpoint
A hosted API surface where applications send prompts and receive model outputs.
Why it matters: endpoint design affects auth, observability, and rate-limit behavior.
Dynamic batching
A serving technique that groups compatible requests together to improve GPU utilization and lower cost.
Signal to watch: batching can cut cost, but it may add queueing delay under bursty traffic.
Autoscaling
Automatically adding or removing serving capacity as workload demand rises or falls.
Why it matters: poor autoscaling policies create latency spikes during launches or business-hour peaks.
Observability
The logs, traces, metrics, and prompt-level visibility needed to diagnose quality or performance problems.
Why it matters: without observability, teams struggle to trace failures back to prompts, tools, or providers.
Data residency
A deployment constraint that specifies where data is stored, processed, or allowed to transit.
Why it matters: regulated buyers often treat residency as a hard gate before pricing discussions start.
Rate limit
A cap on the number of requests or tokens a client can submit within a time window.
Why it matters: rate limits protect shared infrastructure but can throttle bursty agent workloads.
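A common client-side pattern for staying under a provider limit is a token bucket, sketched minimally here (capacity and refill rate are illustrative, and real clients would also back off on 429 responses):

```python
import time

class TokenBucket:
    """Minimal client-side token bucket for pacing requests."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should back off or queue the request

bucket = TokenBucket(capacity=2, refill_per_sec=0.5)
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())
```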
GPU utilization
The percentage of available GPU compute capacity actually used during inference.
Signal to watch: low utilization means you are paying for idle hardware; high utilization risks queuing.
Model routing
A proxy layer that directs incoming requests to different models based on task type, cost, or availability.
Why it matters: smart routing can cut costs by sending simple tasks to cheaper, faster models.
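At its simplest, routing is a lookup from task type to model with a fallback. The model names and task types below are hypothetical placeholders, not real provider SKUs:

```python
# Toy routing table keyed by task type.
ROUTES = {
    "classification": "small-fast-model",
    "extraction": "small-fast-model",
    "code_review": "large-reasoning-model",
}

def route(task_type: str, default: str = "mid-tier-model") -> str:
    """Pick a model for a request; unknown task types fall back to a default."""
    return ROUTES.get(task_type, default)

print(route("classification"))  # small-fast-model
print(route("legal_summary"))   # mid-tier-model
```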
KV cache
Stored key-value pairs from previous attention computations that speed up autoregressive token generation.
Signal to watch: KV cache memory pressure is the main constraint on concurrent users per GPU.
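A back-of-envelope estimate of KV-cache size per sequence (keys plus values, across all layers) shows why it constrains concurrency. The model shape below is hypothetical, and real servers add paging and allocator overhead:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size per sequence: 2 (K and V) x layers x heads x dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 32-layer model, 8 KV heads of dim 128, 4096-token context, fp16.
gib = kv_cache_bytes(32, 8, 128, 4096, 2) / 2**30
print(f"{gib:.2f} GiB per concurrent sequence")  # 0.50 GiB per concurrent sequence
```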
Quantization
Reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to lower memory and compute requirements.
Why it matters: quantization enables self-hosting on cheaper hardware but may degrade output quality.
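The core idea can be shown with a toy symmetric int8 round trip: each restored weight lands within one quantization step of the original, which is where the quality loss comes from. This is a sketch, not a production scheme (real quantizers work per-channel or per-group):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is within one quantization step (scale) of the original.
```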
Pricing & Procurement
Commercial terms buyers use to evaluate total cost, contract flexibility, and the true economics of sustained usage.
Per-token pricing
A billing model where cost scales with the volume of input and output tokens processed.
Why it matters: token price is the starting point, not the full cost of a workload.
Reserved capacity
Pre-committed compute or throughput purchased in exchange for guaranteed access or discounted rates.
Why it matters: reserved deals reduce uncertainty for large deployments but increase forecasting risk.
Burst pricing
Premium pricing or overage treatment applied when usage exceeds a contracted or baseline allocation.
Why it matters: burst behavior can distort unit economics during launches or agent-heavy workloads.
Service credit
A contractual remedy that compensates customers when a provider misses defined service commitments.
Why it matters: credits rarely repay business impact, but they expose how seriously SLAs are enforced.
Cost per task
The effective spend required to complete a business action such as a summary, support resolution, or code review.
Why it matters: cost per task is often more decision-useful than raw price per token.
Token budget
A spending cap or allocation measured in tokens, used to control cost across teams or features.
Why it matters: token budgets prevent runaway spend in agent loops or long-running batch jobs.
Input vs output pricing
The split between cost per input token (prompts) and cost per output token (completions), which often differ by 2-4x.
Signal to watch: output-heavy workloads like summarization cost disproportionately more than retrieval tasks.
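The split is easy to see in a per-request cost calculation. The rates below are hypothetical, not any provider's actual price sheet:

```python
# Hypothetical rates: $0.50 per million input tokens, $2.00 per million output.
INPUT_RATE = 0.50 / 1_000_000
OUTPUT_RATE = 2.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under split input/output pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A summarization call: long input, moderate output.
cost = request_cost(input_tokens=8000, output_tokens=1000)
print(f"${cost:.4f}")  # $0.0060
```

Note that the 1,000 output tokens cost a third of the total despite being an eighth of the volume, which is why output-heavy workloads dominate spend.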
Committed use discount
A price reduction offered in exchange for a contractual commitment to spend a minimum amount over a set period.
Why it matters: discounts range from 15-60%, but underuse penalties can erase the savings.
Free tier
A baseline allocation of requests or tokens provided at no charge, designed for experimentation or low-volume use.
Signal to watch: free tier limits often differ from paid tier rate limits, causing surprises at scale.
Total cost of ownership
The full lifecycle cost of running LLMs including API spend, engineering time, infrastructure, and risk mitigation.
Why it matters: API pricing is usually 30-50% of TCO; the rest is integration, evaluation, and ops overhead.
Trends & Guardrails
Emerging patterns and governance ideas that shape where the market is moving and where adoption can break down.
Agent
A system that uses a model to plan, call tools, and iterate toward a goal across multiple steps.
Signal to watch: useful agents depend on tooling, memory, and guardrails more than prompt cleverness.
Tool use
A model's ability to invoke external functions, APIs, or retrieval systems while generating an answer.
Signal to watch: tool routing quality affects whether agents stay grounded or drift into brittle loops.
Alignment
Techniques that steer a model toward preferred behavior, safety policies, and user intent.
Signal to watch: stronger alignment can improve trustworthiness but may also narrow behavior in edge cases.
Open-weight
A model whose weights are available for self-hosting, tuning, or private deployment.
Why it matters: open-weight models trade turnkey convenience for control over performance, cost, and data handling.
Privacy guardrail
A control that reduces the chance of exposing sensitive data through prompts, logs, training, or outputs.
Why it matters: privacy controls need to cover the full workflow, not only the model endpoint.
Prompt injection
An adversarial technique where hidden instructions in user input override or manipulate model behavior.
Signal to watch: prompt injection remains unsolved; defense-in-depth with input sanitization is essential.
Constitutional AI
An alignment method where a model critiques and revises its own outputs against a set of explicit principles.
Signal to watch: constitutional methods scale alignment review, but principle gaps still produce edge-case failures.
Model context protocol
An emerging standard for how applications share context, tools, and resources with language models.
Signal to watch: MCP adoption is growing fast; it may become the default integration pattern for tool-augmented LLMs.
Self-hosting
Running model inference on your own infrastructure instead of consuming a managed API.
Why it matters: self-hosting gives data control and cost predictability but requires GPU ops expertise.
Red teaming
Adversarial testing that probes a model for harmful outputs, jailbreaks, or policy violations before deployment.
Why it matters: red teaming is becoming a regulatory expectation, not just a best practice.