FileGram: Grounding Agent Personalization in File-System Behavioral Traces

AI agents are getting better at reading files, searching across folders, and answering questions over long contexts. But a crucial piece of personalization is still missing: agents rarely understand how a person actually works.

Most agent personalization today starts with explicit instructions. Users write a CLAUDE.md, an AGENTS.md, Cursor rules, or a long preference prompt that explains their style. This helps, but it has a structural limitation: people are not always good at describing their own habits, and even when they are, those habits change over time.

At Synvo AI, we believe the next generation of personal and enterprise agents needs a deeper form of context. It is not enough for an agent to know what documents exist in a workspace. It should also learn how a user reads, writes, organizes, revises, and navigates through that workspace.

That is the motivation behind FileGram, a research framework for grounding agent memory and personalization in file-system behavioral traces through controlled, persona-driven simulations and diagnostic evaluation.

Why File Behavior Matters

When a person works with files, they leave behind rich behavioral signals.

Figure: FileGram overview connecting FileGramEngine, FileGramBench, and FileGramOS. FileGram turns accumulated file-system behavior into memory an agent can use, reducing the need for users to repeatedly explain context.

Some users read sequentially, opening documents one by one before making edits. Others search aggressively, jump between folders, and only inspect targeted sections. Some users create detailed intermediate drafts; others keep a lean workspace and delete temporary files quickly. Some reorganize projects into deep hierarchies, while others prefer flat directories and descriptive filenames.

These patterns are not superficial preferences; they shape how an AI agent should assist. A meticulous planner may need preserved provenance, intermediate reasoning, and careful detail, while a fast scanner may need concise summaries, direct links, and minimal interruption. A personalized agent should adapt to these differences without requiring the user to manually describe every preference.

File systems are where much of this behavior becomes observable. Reads, writes, edits, moves, renames, deletions, generated artifacts, screenshots, PDFs, and multimodal documents together form a trace of how people transform information into work.

Files tell an agent what a user knows.

Behavioral traces tell an agent how a user works.

The FileGram Framework: Engine, Benchmark, and Memory

FileGram has three components, each designed to make file-system behavioral memory measurable, reproducible, and useful for agent personalization.

01 · Data engine

FileGramEngine

Generates persona- and task-conditioned file-system workflows with typed actions, content deltas, and multimodal artifacts.

02 · Evaluation

FileGramBench

Turns behavioral traces into diagnostic QA across nine sub-tasks for profiling, reasoning, drift detection, and grounding.

03 · Memory

FileGramOS

Builds user profiles from atomic file-level signals through procedural, semantic, and episodic channels.

FileGramEngine: Generating Realistic Behavioral Traces

FileGramEngine is a persona-driven data engine that synthesizes controlled file-system workflows conditioned on specific user profiles and tasks. Instead of collecting only final answers or static documents, it records fine-grained behavioral trajectories: typed file actions, content deltas, generated artifacts, and how the workspace evolves over time.
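To make "typed file actions" and "content deltas" concrete, here is a minimal sketch of what one recorded step could look like. The schema is an assumption for illustration: the class names, field names, and the choice of a unified diff as the delta format are hypothetical and are not taken from FileGramEngine itself. Only the action vocabulary (read, write, edit, move, rename, delete) comes from the text above.

```python
from dataclasses import dataclass, field
from difflib import unified_diff
from enum import Enum
from typing import List, Optional


class ActionType(Enum):
    READ = "read"
    WRITE = "write"
    EDIT = "edit"
    MOVE = "move"
    RENAME = "rename"
    DELETE = "delete"


@dataclass
class FileAction:
    """One atomic step in a trajectory (hypothetical schema)."""
    step: int
    action: ActionType
    path: str
    delta: Optional[str] = None  # unified diff for edits, None otherwise


@dataclass
class Trajectory:
    """All actions one simulated profile takes on one task (hypothetical)."""
    profile_id: str
    task_id: str
    actions: List[FileAction] = field(default_factory=list)

    def record_edit(self, path: str, before: str, after: str) -> FileAction:
        """Append an EDIT action whose delta is a unified diff of the content."""
        diff = "".join(unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=path,
            tofile=path,
        ))
        act = FileAction(len(self.actions), ActionType.EDIT, path, diff)
        self.actions.append(act)
        return act


traj = Trajectory(profile_id="p01", task_id="t07")
traj.actions.append(FileAction(0, ActionType.READ, "notes/draft.md"))
edit = traj.record_edit(
    "notes/draft.md", "Intro\nTODO\n", "Intro\nSummary of findings\n"
)
print(edit.delta)
```

Storing the delta rather than only the final file is what lets a downstream memory system reason about edit size and revision style, not just the finished content.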

Figure: FileGramEngine data generation pipeline. FileGramEngine simulates profile-isolated workflows, filters raw tool traces into behaviorally meaningful actions, and materializes aligned text, document, and visual views for downstream QA and memory evaluation.

From just 20 behavioral profiles and 32 workspace tasks (spanning understanding, creation, organization, synthesis, iteration, and maintenance, across both text-centric and multimodal scenarios), the engine produces a large, controlled corpus of 640 behavioral trajectories, together with their atomic actions, agent-generated files, and multimodal files.

This gives researchers a controlled environment for studying how file-level behavior reveals user preferences.

FileGramBench: Evaluating Behavioral Memory

FileGramBench turns behavioral traces into diagnostic questions. The benchmark asks whether a memory system can infer stable user patterns, disentangle mixed traces, detect persona drift, ground answers in files, and reason over multimodal artifacts.

The benchmark includes 4,653 QA pairs across nine sub-tasks organized into four tracks:

Understanding — attribute recognition, behavioral fingerprinting, and profile reconstruction
Reasoning — behavioral inference and trace disentanglement
Detection — anomaly detection and persona shift analysis
Multimodal grounding — file grounding and visual grounding

This setup tests a different kind of memory from standard retrieval benchmarks. The goal is not simply to find a relevant paragraph. The goal is to reconstruct how a user behaves across time.

Figure: Example FileGramBench behavioral question.

FileGramBench combines dataset scale and task diversity: 20 behavioral profiles, 32 workspace scenarios, 640 file-system trajectories, and 4,653 QA pairs across nine sub-tasks for understanding, reasoning, detection, and multimodal grounding.

FileGramOS: Bottom-Up Memory for Personalization

FileGramOS is a bottom-up, action-aware memory framework designed around behavioral traces. Rather than summarizing an entire session into a generic note, it builds profiles from atomic file-level signals through three complementary channels:

Procedural memory — how the user works. We turn atomic actions into a compact behavioral fingerprint and aggregate it across trajectories into stable habits.
Semantic memory — what the user produces. A vision-language model reads file snapshots and edits into a cross-session view of the user's style and structure.
Episodic memory — how behaviors persist or shift. We segment trajectories into episodes and flag outliers, separating a deliberate variation from a real behavioral shift.
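The procedural and episodic channels above can be sketched in a few lines. This is an illustration under stated assumptions, not the FileGramOS implementation: the fingerprint here is simply normalized action frequencies, the aggregation is a plain mean, and the outlier rule is a z-score threshold that we swap in for whatever segmentation and detection FileGramOS actually uses. All function names and the threshold value are hypothetical. The semantic channel is not covered, since it relies on a vision-language model.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List

ACTION_TYPES = ["read", "write", "edit", "move", "rename", "delete"]


def fingerprint(actions: List[str]) -> Dict[str, float]:
    """Normalize action counts into per-type frequencies for one trajectory."""
    counts = Counter(actions)
    total = sum(counts[a] for a in ACTION_TYPES) or 1
    return {a: counts[a] / total for a in ACTION_TYPES}


def aggregate(fps: List[Dict[str, float]]) -> Dict[str, float]:
    """Average fingerprints across trajectories into a habit profile."""
    return {a: mean(fp[a] for fp in fps) for a in ACTION_TYPES}


def flag_outliers(fps: List[Dict[str, float]], key: str,
                  z_threshold: float = 1.5) -> List[int]:
    """Return indices of trajectories whose `key` frequency deviates strongly."""
    values = [fp[key] for fp in fps]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


# Hypothetical sessions: four read-heavy ones and one cleanup-heavy one.
sessions = [
    ["read", "read", "read", "edit"],
    ["read", "read", "edit", "read"],
    ["read", "read", "read", "edit"],
    ["read", "read", "edit", "read"],
    ["delete", "delete", "delete", "read"],
]
fps = [fingerprint(s) for s in sessions]
profile = aggregate(fps)
outliers = flag_outliers(fps, key="read")
print(profile["read"], outliers)
```

Flagging a session is only the first half of the episodic channel described above. Deciding whether a flagged session is a deliberate variation or a lasting shift would need evidence from later sessions, which this sketch does not attempt.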

This bottom-up structure is important. Premature summarization can erase the small signals that matter for personalization. FileGramOS keeps the behavioral evidence grounded, then consolidates it into higher-level user models.

Figure: FileGramOS memory architecture. FileGramOS encodes each trajectory into an Engram, consolidates cross-session evidence across procedural, semantic, and episodic channels, then retrieves query-adaptive clues for grounded personalization.

Results: How Memory Systems Perform on FileGramBench

We evaluate FileGramOS against eleven baseline memory systems on FileGramBench under a shared two-stage protocol: every system ingests the same raw trajectories and answers with the same backbone (Gemini 2.5 Flash), so the only variable is how each one structures memory. Scores are accuracy (%) averaged across the nine sub-tasks.
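As a sketch of how a single leaderboard number could be derived from per-sub-task results, the snippet below takes an unweighted mean over the nine sub-tasks. The unweighted (macro) average is our assumption about what "averaged across the nine sub-tasks" means, and `overall_score` is a hypothetical helper. The per-task values in the demo are placeholders chosen for easy arithmetic, not reported results.

```python
from statistics import mean
from typing import Dict

# Track and sub-task names as listed in the benchmark description.
SUB_TASKS = {
    "Understanding": [
        "attribute recognition",
        "behavioral fingerprinting",
        "profile reconstruction",
    ],
    "Reasoning": ["behavioral inference", "trace disentanglement"],
    "Detection": ["anomaly detection", "persona shift analysis"],
    "Multimodal grounding": ["file grounding", "visual grounding"],
}
ALL_TASKS = [t for tasks in SUB_TASKS.values() for t in tasks]


def overall_score(per_task_acc: Dict[str, float]) -> float:
    """Unweighted mean accuracy (%) over all nine sub-tasks (assumed protocol)."""
    missing = [t for t in ALL_TASKS if t not in per_task_acc]
    if missing:
        raise ValueError(f"missing sub-tasks: {missing}")
    return round(mean(per_task_acc[t] for t in ALL_TASKS), 1)


# Placeholder accuracies for illustration only.
demo = {t: 50.0 for t in ALL_TASKS}
demo["trace disentanglement"] = 59.0
print(overall_score(demo))
```

With an unweighted mean, a sub-task with few questions counts as much as one with many, so a system cannot reach a high overall score by doing well only on the largest sub-task.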

FileGramBench leaderboard

FileGramOS leads FileGramBench against 11 baselines, +7.7 pp over the best baseline (VisRAG, 51.9).

System          Accuracy (%)
SimpleMem       32.9
Mem0            33.2
MemOS           36.2
Zep             40.2
Naive RAG       40.5
MemU            44.4
MMA             44.7
Full Context    48.0
Eager Summ.     49.5
EverMemOS       49.9
VisRAG          51.9
FileGramOS      59.6

Timing Is Everything: Why Structure Beats Summaries

FileGramOS reaches 59.6%, ahead of the strongest narrative baseline, EverMemOS (49.9%). The gap comes down to when abstraction happens. Narrative-first systems summarize each trajectory into prose at ingest time, discarding the signals that separate one worker from another (read counts, folder depth, edit size), so two very different users end up with the same adjectives ("structured", "methodical", "comprehensive"). FileGramOS keeps those raw signals and only interprets them at query time.
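A toy example makes the timing argument concrete. In the sketch below, two hypothetical users with very different raw signals collapse to the same adjectives once an ingest-time summarizer applies coarse thresholds, while the retained signals still tell them apart at query time. The signal names follow the ones mentioned above (read counts, folder depth, edit size); the thresholds, values, and function names are invented for illustration and do not describe any evaluated system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawSignals:
    read_count: int
    max_folder_depth: int
    mean_edit_size: float  # characters changed per edit


def narrative_summary(sig: RawSignals) -> str:
    """Ingest-time prose: coarse thresholds turn numbers into adjectives."""
    words = []
    if sig.read_count >= 10:
        words.append("methodical")
    if sig.max_folder_depth >= 2:
        words.append("structured")
    if sig.mean_edit_size >= 100:
        words.append("comprehensive")
    return ", ".join(words)


def higher_on(a: RawSignals, b: RawSignals, signal: str) -> str:
    """Query-time interpretation: compare a retained raw signal directly."""
    return "A" if getattr(a, signal) > getattr(b, signal) else "B"


# Hypothetical users: B reads far more, nests deeper, and makes larger edits.
user_a = RawSignals(read_count=12, max_folder_depth=2, mean_edit_size=120.0)
user_b = RawSignals(read_count=45, max_folder_depth=6, mean_edit_size=900.0)

print(narrative_summary(user_a) == narrative_summary(user_b))
print(higher_on(user_a, user_b, "max_folder_depth"))
```

Once only the summary string is stored, no later question can recover which user nests folders more deeply. Keeping the numbers defers that loss until the question is known.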

Inside the Memory: What FileGramOS Stores

Accuracy improves because FileGramOS does not only store text snippets. It keeps the behavioral evidence organized by how the user works, what they produce, and when those patterns repeat.

Representative FileGramOS memory view: grounded behavioral profile (Procedural · Semantic · Episodic)

Procedural: sequential deep reading before edits.
Semantic: structured technical writing with detailed revision.
Episodic: repeated small edits and pragmatic cleanup across sessions.

Noticing Is Easier Than Explaining

The benchmark exposes a ceiling between noticing and explaining: systems that aggregate behavior across sessions detect anomalies well (past 70%, versus 21–26% for flat memories like Mem0 and SimpleMem), but every method falls below 39% when asked which habit shifted and in which direction. Sensing a shift is far easier than naming it.

When Raw Context Wins — and When It Doesn't

Raw evidence is sometimes its own kind of memory: Full Context (48.0%) nearly matches FileGramOS on trace disentanglement (80.5% vs 80.9%) yet collapses on cross-session anomaly detection (36.8% vs 70.2%). Multimodal methods tell a parallel story: MMA (44.7%) and MemU (44.4%) never beat the best text-only systems, since a rendered page image does not capture directory depth or naming conventions.

Frugal Memory, and an Honest Gap

FileGramOS is also efficient: about 110K tokens of memory per user and 4.3K per question, versus 625K and 45.9K for full-context prompting. The honest caveat: on real human screen recordings, every method drops to single digits, which makes the sim-to-real gap, along with shift attribution and profile reconstruction, the frontier that behavioral memory has to cross next.
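The token figures above translate into rough savings factors with a line of arithmetic each. The snippet uses only the numbers quoted in this section; since the memory figure is stated as "about 110K", the resulting ratios should be read as approximate.

```python
# Token budgets quoted above (per user for memory, per question for queries).
memory_tokens = {"FileGramOS": 110_000, "Full Context": 625_000}
query_tokens = {"FileGramOS": 4_300, "Full Context": 45_900}


def savings_factor(costs: dict) -> float:
    """How many times more tokens full-context prompting uses."""
    return costs["Full Context"] / costs["FileGramOS"]


print(f"memory: {savings_factor(memory_tokens):.1f}x")  # memory: 5.7x
print(f"query:  {savings_factor(query_tokens):.1f}x")   # query:  10.7x
```

The saving is larger per question than per user, which matters most for workloads that ask many questions against the same accumulated memory.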

From File Understanding to Behavioral Context

Synvo AI has been building contextual intelligence systems that help agents understand complex, multimodal file systems. FileGram extends this direction from document understanding to behavioral understanding.

For Synvo, FileGram is not a separate research detour. It is a way to make the Contextualization Engine and local agent experience more personal: if the existing memory layer helps agents understand what is inside a workspace, FileGram studies how that workspace changes as people read, write, revise, organize, and collaborate.

Traditional file intelligence asks: What information is inside these files? FileGram asks an additional question: What do these file interactions reveal about how this person works? That shift matters for both personal and enterprise agents.

In an enterprise setting, two analysts may work with the same set of reports but have very different workflows. One might carefully compare every source before drafting. Another might quickly produce a first version, then revise through many small iterations. A useful agent should not treat these users identically.

In a developer setting, two engineers may work in the same repository but expect different collaboration styles from an AI coding assistant. One may prefer explicit plans and conservative edits; another may prefer fast implementation and compact explanations. These expectations are often visible in the filesystem long before they are written in a configuration file.

Why This Matters

These benchmark gaps matter because they point directly to the product problem: better memory should change what an agent can actually do for you.

Agents that adapt their communication style to a user's workflow
Local assistants that preserve personal working habits across tools
Enterprise agents that understand not only shared knowledge, but team-specific operating patterns
Evaluation pipelines that measure whether memory systems capture behavior, not just content

This matters most as AI agents move from isolated chat interfaces into real work environments, where they increasingly act on behalf of users and need a grounded model of intent, preference, and workflow.

Looking Ahead: Toward Behavioral Memory at Scale

FileGram is a first step toward behavioral memory for AI agents, not a claim that real-world behavior understanding is solved. The benchmark intentionally exposes hard open problems such as shift attribution, multimodal grounding, and the gap between controlled traces and noisy real-world workflows.

There are several directions we are actively exploring:

Scaling from controlled simulated trajectories and screen-recording pilots to broader opt-in user studies
Integrating behavioral profiles with local agent products
Improving privacy-preserving profiling pipelines
Extending FileGramOS to richer multimodal and temporal signals
Studying how behavioral memory improves long-running agent collaboration

We are releasing FileGram to invite researchers, builders, and product teams to study this problem with us.

If document intelligence was the first layer of contextual AI, behavioral intelligence is the next one. The agents that become truly useful will not only retrieve what we wrote. They will learn how we work.

Citation

Please cite this work as:

@article{liu2026filegram,
  title={FileGram: Grounding Agent Personalization in File-System Behavioral Traces},
  author={Liu, Shuai and Tian, Shulin and Hu, Kairui and Dong, Yuhao and Yang, Zhe and Li, Bo and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei},
  journal={arXiv preprint arXiv:2604.04901},
  year={2026}
}
