Safeway AI

Agentic search and memory architecture for grocery at scale

Albertsons / Safeway · Fortune 500 · Production
Memory · Agents · Context Engineering · Evaluation · Personalization

End-to-end AI system — agentic search, memory architecture, inventory reasoning across 250k+ SKUs

The Problem

Albertsons operates over 2,200 stores under banners including Safeway, Vons, and Jewel-Osco. Their product catalog spans more than 250,000 SKUs — everything from produce with seasonal availability to store-brand variants that differ by region.

The challenge wasn't search in the traditional sense. Keyword search already existed. The problem was reasoning over inventory — answering questions that require understanding relationships between products, remembering what's been found in prior interactions, and refining queries based on what didn't work.

A customer asking "what can I make for dinner tonight that's healthy and uses what's on sale?" isn't issuing a search query. They're asking a system to reason across promotional data, nutritional information, recipe knowledge, and real-time inventory — then synthesize an answer that's specific to their store.

No static search index handles this. It requires an agent.

The Architecture

We built a system with three layers: agentic search, structured memory, and evaluation.

The search layer is not a single query. It's a multi-step process where the agent formulates retrieval strategies, executes them, inspects the results, and decides whether to refine or continue.

Decompose: break intent into constraints
Retrieve: targeted catalog queries
Inspect: evaluate result quality
Decide: refine or respond

Termination: three passes without improvement → respond with best results and explain the limitation.

The agent evaluates its own results. Different questions produce different retrieval strategies.

For a query like "gluten-free pasta alternatives under $5," the agent doesn't fire one search. It decomposes the intent into constraints — dietary restriction, product category, price ceiling — and runs targeted retrievals against the catalog. If the first pass returns too many results, it tightens. If it returns too few, it relaxes a constraint and explains what it changed.

The agent has access to a set of retrieval primitives — filtered catalog search, promotional data lookup, nutritional attribute filtering, and store-level availability checks. It composes these at runtime based on the query, rather than following a fixed retrieval pipeline. Different questions produce different retrieval strategies.

This is the Primitives over Pipelines approach in practice. The agent decides the trajectory. We define the capabilities.
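The loop above can be sketched in a few lines of TypeScript. This is a toy illustration, not the production system: the catalog is a four-item in-memory array, and the names (`Constraint`, `retrieve`, the relaxation rule) are assumptions made for the sketch.

```typescript
// Toy sketch of decompose → retrieve → inspect → decide.
// All names and the relaxation policy are hypothetical.

type Product = { name: string; price: number; glutenFree: boolean };
type Constraint = (p: Product) => boolean;

const catalog: Product[] = [
  { name: 'Rice pasta', price: 3.49, glutenFree: true },
  { name: 'Chickpea pasta', price: 4.99, glutenFree: true },
  { name: 'Wheat spaghetti', price: 1.99, glutenFree: false },
  { name: 'Quinoa fusilli', price: 6.49, glutenFree: true },
];

// Retrieval primitive: filtered catalog search over the active constraints.
const retrieve = (constraints: Constraint[]): Product[] =>
  catalog.filter((p) => constraints.every((c) => c(p)));

// "Gluten-free pasta alternatives under $5" decomposed into constraints.
let ceiling = 5;
let constraints: Constraint[] = [(p) => p.glutenFree, (p) => p.price < ceiling];

// Inspect → decide: relax the price ceiling if too few results,
// with an explicit termination bound of three passes.
let results = retrieve(constraints);
let passes = 1;
while (results.length < 2 && passes < 3) {
  ceiling += 2; // relax one constraint; the real agent would explain this change
  constraints = [(p) => p.glutenFree, (p) => p.price < ceiling];
  results = retrieve(constraints);
  passes++;
}
console.log(results.map((p) => p.name)); // → ['Rice pasta', 'Chickpea pasta']
```

The point of the sketch is the shape of the control flow: the agent inspects the result set and changes its own query, rather than returning whatever the first retrieval produced.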

Memory Architecture

The most difficult part of this system was memory. A grocery shopping interaction isn't a single turn — it's a session that builds context over time. The customer adds constraints, changes their mind, asks follow-ups that depend on earlier answers.

Session Memory (ephemeral): retrievals · recommendations · rejections · refined queries

“not that one, something cheaper” → filters its own history, doesn't re-retrieve

↓ classify & promote: durable preference? → persist

Cross-Session Memory (durable): nut allergy · prefers organic · household: 4 · brand: Safeway Select

Next session starts with context already loaded. The agent doesn't ask questions it's already answered.

Not an append-only log — structured, scoped, selective. The classification layer decides what to store and at what scope.

Two-layer memory with explicit retention policies. Most agent frameworks treat memory as a log. That breaks at scale.

We built a structured memory system with two layers:

Session memory tracks everything the agent has retrieved, recommended, and discarded in the current interaction. When the customer says "not that one, something cheaper," the agent knows what "that one" refers to because it has a record of its own prior recommendations. It doesn't re-retrieve — it filters its own history.

Cross-session memory persists across interactions. If a customer previously indicated a nut allergy or a preference for organic produce, the system retains that as a durable constraint. The next session starts with context already loaded. The agent doesn't ask questions it's already answered.
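A minimal sketch of the session-memory behavior, assuming a hypothetical `SessionMemory` class (not the real API): resolving "something cheaper" against recorded recommendations instead of issuing a new retrieval.

```typescript
// Hypothetical session memory: the agent filters its own prior
// recommendations rather than re-retrieving from the catalog.

type Recommendation = { name: string; price: number };

class SessionMemory {
  private recommended: Recommendation[] = [];
  private rejected = new Set<string>();

  record(recs: Recommendation[]): void {
    this.recommended.push(...recs);
  }

  // "not that one, something cheaper": resolve the reference against
  // history, mark it rejected, and filter remaining candidates by price.
  somethingCheaperThan(name: string): Recommendation[] {
    const anchor = this.recommended.find((r) => r.name === name);
    if (!anchor) return [];
    this.rejected.add(name);
    return this.recommended.filter(
      (r) => !this.rejected.has(r.name) && r.price < anchor.price,
    );
  }
}

const session = new SessionMemory();
session.record([
  { name: 'Quinoa fusilli', price: 6.49 },
  { name: 'Chickpea pasta', price: 4.99 },
  { name: 'Rice pasta', price: 3.49 },
]);
const cheaper = session.somethingCheaperThan('Chickpea pasta');
console.log(cheaper.map((r) => r.name)); // → ['Rice pasta']
```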

The memory architecture is selective. Not everything gets persisted — only information that the system identifies as durable preference versus session-specific context. A classification layer decides what to store and at what scope:

type MemoryEntry = {
  content: string
  scope: 'session' | 'cross-session'
  durability: 'ephemeral' | 'durable'
  category: 'dietary' | 'brand' | 'budget' | 'household' | 'preference'
  confidence: number
  expiresAt: Date | null  // null = permanent
}
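A toy version of that classification layer might look like the following. The `MemoryEntry` type is restated so the sketch is self-contained; the rule-based classifier is purely illustrative — the actual system's classification logic is not shown here, and safety-relevant thresholds are assumptions.

```typescript
// Hypothetical rule-based classifier producing MemoryEntry values.
// The real classification layer is presumably model-driven, not regexes.

type MemoryEntry = {
  content: string;
  scope: 'session' | 'cross-session';
  durability: 'ephemeral' | 'durable';
  category: 'dietary' | 'brand' | 'budget' | 'household' | 'preference';
  confidence: number;
  expiresAt: Date | null; // null = permanent
};

const oneHourFromNow = () => new Date(Date.now() + 3600_000);

function classify(content: string): MemoryEntry {
  if (/allerg|gluten|vegan|vegetarian/i.test(content)) {
    // Safety-relevant dietary facts are durable and never expire.
    return { content, scope: 'cross-session', durability: 'durable',
             category: 'dietary', confidence: 0.95, expiresAt: null };
  }
  if (/under \$|cheaper|budget/i.test(content)) {
    // Budget hints are session-specific context, not durable preference.
    return { content, scope: 'session', durability: 'ephemeral',
             category: 'budget', confidence: 0.6, expiresAt: oneHourFromNow() };
  }
  // Default: keep it scoped to the session until evidence of durability.
  return { content, scope: 'session', durability: 'ephemeral',
           category: 'preference', confidence: 0.5, expiresAt: oneHourFromNow() };
}

console.log(classify('customer has a nut allergy').scope); // → 'cross-session'
```

The design choice the sketch encodes: persistence is the exception, not the default. An entry must be affirmatively classified as durable to cross the session boundary.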

Evaluation

Every recommendation the system produces is scored before the customer sees it. The evaluation layer checks relevance against stated constraints, availability at the customer's store, consistency with memory (allergies, prior rejections), and whether the recommendation is concrete enough to act on. Recommendations that fail any check are filtered — the customer never sees a result that contradicts their own stated preferences, because the system checked against memory before responding.
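The gate described above can be sketched as a list of predicate checks that a recommendation must pass in full. The check set and data shapes here are hypothetical stand-ins for the real relevance, availability, and memory-consistency checks.

```typescript
// Hypothetical pre-response evaluation gate: a recommendation that
// fails any single check is filtered before the customer sees it.

type Rec = { name: string; inStock: boolean; containsNuts: boolean };
type Memory = { nutAllergy: boolean; rejected: string[] };

const checks: Array<(r: Rec, m: Memory) => boolean> = [
  (r) => r.inStock,                             // available at this store
  (r, m) => !(m.nutAllergy && r.containsNuts),  // consistent with memory
  (r, m) => !m.rejected.includes(r.name),       // not previously rejected
];

const gate = (recs: Rec[], m: Memory): Rec[] =>
  recs.filter((r) => checks.every((c) => c(r, m)));

const memory: Memory = { nutAllergy: true, rejected: ['Wheat spaghetti'] };
const out = gate(
  [
    { name: 'Almond granola', inStock: true, containsNuts: true },
    { name: 'Rice pasta', inStock: true, containsNuts: false },
    { name: 'Wheat spaghetti', inStock: true, containsNuts: false },
  ],
  memory,
);
console.log(out.map((r) => r.name)); // → ['Rice pasta']
```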

What We Learned

Memory is the hardest layer. Retrieval and generation are well-understood problems. Deciding what to remember, for how long, and at what level of specificity is not. Most agent frameworks treat memory as an append-only log. That breaks at scale — the context window fills with irrelevant history, and the agent's reasoning degrades. Structured, scoped memory with explicit retention policies is a different problem than "save the conversation."

Agentic search requires the agent to evaluate its own results. A retrieval pipeline returns results and trusts them. An agentic search system retrieves, inspects, and decides whether to try again. This self-evaluation loop is what makes the system adaptive — but it also means the agent can get stuck refining indefinitely. We built explicit termination conditions: if three retrieval passes don't improve the result set, the agent responds with what it has and explains the limitation.

Scale forces you to be selective about context. With 250k SKUs, you cannot load the full catalog into context. You cannot even load a meaningful subset without strategy. The retrieval primitives are designed to return narrow, pre-filtered slices — and the agent composes them rather than requesting broad sweeps. Context engineering at this scale is as much about what you exclude as what you include.

Outcome

The system is in active use across Albertsons properties, handling natural language grocery queries with multi-turn reasoning, persistent customer preferences, and real-time inventory awareness — at the scale of a Fortune 500 retailer's full 250k+ SKU catalog across 2,200 stores.