The 'Agent Identity' Problem: Why Two Copies of the Same AI Employee Behave Differently After 30 Days

A customer asked us a fair question last month. They run two near-identical e-commerce brands, same SKU catalog, same support workflow, same Shopify setup. They wanted to clone their Customer Success employee — call her Enide-A — and stand up Enide-B for the second brand. "Same employee, two stores. Just duplicate her."

We did it. They were happy. Five weeks later they came back: Enide-B is making different decisions than Enide-A. Different escalation thresholds. Different draft tone. Different summaries on the same kind of ticket. Same starting employee, same role description, same skill set on day one. Why aren't they the same person anymore?

The honest answer is that they were never going to be. And the more we explain why to customers, the more we realize the "agent clone" mental model is one of the most dangerous things builders inherit from cloud computing.

Identity is the runtime, not the config

When you clone a VM, the clone is identical because a VM is its disk image. Boot it, run it: behavior is determined by what's on disk. Two clones, same disk, same behavior. Forever.

An AI employee is not a VM. The disk image is the cheap part — model weights, system prompt, skill registry, role description. We can clone all of that in a few hundred milliseconds. The expensive part, the part that actually determines how the employee behaves, is everything that happens after boot:

  • Every customer conversation she remembers
  • Every decision she made and how it landed
  • Every skill she wrote on the fly to handle a one-off task that turned out to recur
  • Every preference she learned about her employer's communication style
  • Every escalation that got rolled back, and the implicit lesson she drew from it

None of that is in the config. All of it is in the runtime. And the moment you put two clones in front of two different sets of customers, those runtimes start diverging on day one. By day thirty they are different employees who happen to share an origin.
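
To make the config/runtime split concrete, here's a minimal sketch in Python. Every name is invented for illustration; the point is only which side of the line each piece of state lives on.

```python
from dataclasses import dataclass, field

# The "disk image": everything a clone copies in a few hundred milliseconds.
@dataclass(frozen=True)
class AgentConfig:
    model_id: str                # pointer to shared model weights
    system_prompt: str
    role_description: str
    skill_registry: tuple        # skills available at boot

# The runtime: everything accumulated after boot. None of this is
# meaningfully "copied" -- it is earned, one interaction at a time.
@dataclass
class AgentRuntime:
    conversations: list = field(default_factory=list)    # customer history
    decisions: list = field(default_factory=list)        # calls made, and how they landed
    authored_skills: list = field(default_factory=list)  # skills written on the fly
    employer_prefs: dict = field(default_factory=dict)   # tone, brand voice
    lessons: list = field(default_factory=list)          # e.g. rolled-back escalations

# Two agents with identical AgentConfig diverge as soon as their
# AgentRuntime fields start filling in from different customers.
```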

This is not a bug. This is what makes them useful. A virtual employee whose behavior never updates from experience is the chatbot you bought in 2023 — confident, consistent, and slightly wrong in the same ways forever. Drift in this case is just learning, viewed through the lens of someone who expected determinism.

Where we had to make calls

We had to decide what "clone" means for our system, and the decisions matter more than the word suggests.

Memory partitioning. When Enide-A is cloned to Enide-B, what comes with her? We landed on: nothing customer-specific, everything employer-specific. Enide-B inherits her employer's preferences, the tone they like, the brand voice she's been calibrated to — but she gets a fresh customer book. Otherwise Enide-B walks into Brand-2's inbox already remembering Brand-1's customers, which is both a leak and useless.

The harder edge case is patterns: things Enide-A learned that aren't tied to a specific customer. "When a refund request comes in for a damaged item over $200, escalate before responding." That's not customer-specific, but it was learned from one customer's reaction. Does it transfer? We default to no, with a flag the employer can flip per pattern. Most employers leave it off and let Enide-B re-learn the lesson on her own. The ones who flip it on are usually the ones running an actual franchise where the brands really are operationally identical.
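
Here's a minimal sketch of that partitioning rule, assuming a hypothetical scoped memory store (the names are invented; the real layer is more involved):

```python
from dataclasses import dataclass
from enum import Enum

class MemoryScope(Enum):
    CUSTOMER = "customer"   # tied to a specific end customer
    EMPLOYER = "employer"   # tone, brand voice, employer preferences
    PATTERN = "pattern"     # learned rules not tied to one customer

@dataclass
class Memory:
    scope: MemoryScope
    content: str
    transfer_allowed: bool = False  # the per-pattern flag; default off

def partition_for_clone(memories: list[Memory]) -> list[Memory]:
    """Return the subset of Enide-A's memory that Enide-B inherits."""
    cloned = []
    for m in memories:
        if m.scope is MemoryScope.EMPLOYER:
            cloned.append(m)   # employer-specific: always transfers
        elif m.scope is MemoryScope.PATTERN and m.transfer_allowed:
            cloned.append(m)   # learned pattern: only if the employer flipped the flag
        # MemoryScope.CUSTOMER entries never cross tenants
    return cloned
```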

Skill provenance. Enide-A spent six weeks writing custom skills for her brand — a Shopify return-flow handler, a refund-eligibility check, a few one-off macros. When we clone her, do those skills come along?

We split skills into two buckets at write time. Skills tagged "role-generic" (anything she'd write for any e-commerce CSM job) clone over. Skills tagged "tenant-specific" (anything that references a particular SKU, pricing tier, or vendor) stay behind. The default is tenant-specific because the failure mode of cloning a too-specific skill is "Enide-B confidently does the wrong thing for Brand-2," and that's worse than her not having the skill at all. She'll write a new one when she hits the problem.
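
Sketched with hypothetical names, the write-time split looks roughly like this (the tag is set when the skill is authored, not by an after-the-fact classifier):

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    body: str
    role_generic: bool = False  # tagged at write time; default is tenant-specific

def skills_to_clone(skills: list[Skill]) -> list[Skill]:
    """Only role-generic skills follow the clone to the new tenant.

    Anything referencing a particular SKU, pricing tier, or vendor stays
    behind: a confidently wrong skill is worse than a missing one.
    """
    return [s for s in skills if s.role_generic]
```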

Drift detection. This is the one we keep iterating on. Once Enide-A and Enide-B are running independently, when do we tell the employer they're starting to diverge in ways that matter?

The naive version is to compare outputs side by side on synthetic test cases. That works, and it surfaces nothing useful. Both employees pass the synthetic tests. The drift that matters happens in the long tail.

The version we actually shipped: every two weeks, we run each clone against the other clone's recent decisions and ask, "would you have made this call?" If Enide-B looks at fifty of Enide-A's recent escalations and disagrees on more than a quarter of them, we surface it as a drift report. Not as an alarm — just as information. The employer decides whether the divergence is a problem (the brands have actually grown apart) or a regression (one of them is getting worse).
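
A minimal sketch of that biweekly check, assuming a hypothetical `would_make_same_call` replay method (not our real API):

```python
def drift_report(evaluator, recent_decisions, threshold=0.25):
    """Cross-evaluate one clone against the other's recent decisions.

    `evaluator` is the other clone; `recent_decisions` is a sample of
    roughly fifty recent calls (e.g. Enide-A's escalations). We ask the
    evaluator "would you have made this call?" for each one, and surface
    a report -- information, not an alarm -- past the threshold.
    """
    if not recent_decisions:
        return {"diverging": False, "disagreement_rate": 0.0, "examples": []}

    disagreements = [
        d for d in recent_decisions
        if not evaluator.would_make_same_call(d)  # hypothetical replay API
    ]
    rate = len(disagreements) / len(recent_decisions)
    return {
        "diverging": rate > threshold,  # more than a quarter triggers a report
        "disagreement_rate": rate,
        "examples": disagreements[:5],  # a few concrete cases for the employer
    }
```

It runs in both directions: Enide-B over Enide-A's recent decisions and vice versa, so the report shows which clone moved.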

What we don't do is auto-merge. Auto-merging two divergent employees is how you get the worst of both. The drift report is a prompt for a human to decide which lessons should propagate.

The mental model that actually works

Stop thinking of cloned agents as instances of a class. Start thinking of them as twins separated at birth. They share a genome. Everything else is environment.

Two practical implications fall out of this.

First, when you stand up a new employee for a new client, do not promise them an experience identical to your existing client's. Promise them the same starting point. The behavior that emerges over the next thirty days will depend on what their customers throw at the employee, not what your other clients threw at yours.

Second, design your platform assuming divergence is the default. Build memory partitioning, skill scoping, and drift visibility from day one. Do not let two tenants accidentally share a memory layer "because it's faster." That's a data leak waiting for a regulator to find it.

The customers who succeed with this model are the ones who understand they're hiring an employee who will grow into the role, not provisioning a service that will behave the same forever. The ones who churn are the ones who wanted a deterministic API and got a person.

We chose a person on purpose. Two months in, the divergence between Enide-A and Enide-B is the feature, not the bug — each one has gotten genuinely better at her brand. The clone gave them a head start. The runtime made them their own.


Want to test what an AI employee looks like when she actually grows into the role? Try it here: https://Geta.Team
