Why Agent Skill Composition Is the New API Design (And Most Frameworks Get It Wrong)

Most production agent platforms today ship the same way: a single big tool catalog, often dozens of tool definitions, each one mapping to a discrete thing the agent can do. Send email. Search calendar. Query database. Update CRM. Generate report. The list grows as the product grows, and at some point your agent has 47 tools to choose from on every turn.

This is the wrong abstraction. It mirrors a mistake that web APIs went through twenty years ago, and the fix is the same: stop treating skills like endpoints, start treating them like composable verbs.

The API evolution that skills are about to repeat

When SOAP and RPC-style services were the dominant API style, every operation got its own named method. getUserById, getUserByEmail, getUserByPhoneNumber. Need a new way to find a user? Add another operation. The catalog grew faster than the product. By the time someone wanted to do something even slightly novel, the API team had to ship a new operation, or the consumer had to chain three of them together by hand.

REST forced a different shape: a small, consistent set of HTTP verbs (GET, POST, PUT, DELETE) operating on resources. The combinatorial expressiveness moved from the endpoint catalog (which got smaller) to how clients composed those verbs (which got more powerful). GraphQL took it further: one endpoint and an entire query language for composition. The endpoint catalog collapsed into a single schema.

Skills for AI agents are heading the same way. The teams that figure this out first will ship agents that handle situations the original designer never anticipated. The teams that don't will ship 47-tool monoliths that need a new tool for every new use case and never get reliable beyond the use cases the team explicitly tested.

Three anti-patterns I keep seeing

The kitchen-sink skill. A single skill called handle_customer_inquiry that takes a customer email, classifies it, decides whether to reply, drafts the reply, sends it, logs the interaction, and updates the CRM. The argument for this shape is "the agent doesn't have to compose anything, it just calls one tool and gets a complete behavior." The reality is the model loses control over the intermediate steps. When something goes wrong, the agent has no purchase on where in the pipeline it broke, and you've got a black box that has to be debugged by reading whatever the kitchen-sink skill chose to log.

The token-greedy skill. A skill that takes 12 parameters, all optional, with a 400-line description in the system prompt explaining when to use each one. Every turn, the model has to read and reason about the entire skill surface area to decide how to call it. Token cost climbs, and worse, the model starts hallucinating parameter values it didn't actually need to fill. The skill was built to be flexible. The model interprets that as a maze.

The hidden-state skill. A skill that depends on side effects from other skills, but doesn't say so in its contract. send_email only works if search_contacts was called earlier in the conversation, because the contact ID was tucked into a hidden buffer. The agent figures this out by trial and error, fails a few times, and eventually stumbles into the right order. The framework's response is to add more orchestration to enforce the order externally, which makes the framework heavier without making the skills better.

These three patterns are surprisingly common, and they all stem from treating skills as endpoints rather than as composable units the model can chain.

What good skill composition looks like

The shape that survives in production has a few consistent properties.

Small verbs, not big nouns. send_email(to, subject, body) is a verb. customer_communication_orchestrator is a noun, and a vague one. The verb is composable because its scope is bounded. The noun is a kitchen sink in disguise. When you read a skill name, you should be able to predict its inputs and outputs in one sentence. If you can't, split it.

Strict, declarative contracts. Every input has a type and a clear purpose. Every output is structured and includes a status. No optional parameters that "might be useful." If you find yourself adding a third optional parameter, you've probably found two skills hiding in a trench coat.
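A strict contract can be sketched in a few lines of Python. This is a minimal illustration, not any framework's real API: the names SendEmailInput and SkillResult are hypothetical, and the delivery step is stubbed out. The point is that every input is required and typed, and every output carries a status.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass(frozen=True)
class SendEmailInput:
    """Declared inputs: every field is required and has one clear purpose."""
    to: str
    subject: str
    body: str


@dataclass(frozen=True)
class SkillResult:
    """Structured output: a status plus a machine-readable payload."""
    status: Literal["ok", "error"]
    data: dict = field(default_factory=dict)
    error: Optional[str] = None


def send_email(args: SendEmailInput) -> SkillResult:
    # Check the input against the contract before doing any work.
    if "@" not in args.to:
        return SkillResult(status="error", error=f"invalid recipient: {args.to!r}")
    # Hypothetical delivery step: a real skill would call a mail client here.
    return SkillResult(status="ok", data={"to": args.to, "subject": args.subject})
```

Because the result always has the same shape, the model can branch on status without parsing free text.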

No hidden state. Each skill is callable from a fresh context with no preconditions other than its declared inputs. If send_email needs a contact, it takes the contact as an argument, not from an implicit buffer somewhere. This means the model can call skills in any order it figures out, and a retry never fails because some earlier call didn't happen. (Skills with external side effects still need their own duplicate protection; statelessness removes ordering failures, not double sends.) The whole composition layer becomes the model's decision, not the framework's enforcement.
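The difference is easy to see side by side. The sketch below is illustrative only: the module-level buffer, the Contact type, and both function names are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contact:
    contact_id: str
    email: str


# Anti-pattern: a buffer that some other skill is expected to fill first.
_last_contact: Optional[Contact] = None


def send_email_hidden(subject: str, body: str) -> dict:
    # Fails in a fresh context, and the contract gives no hint why.
    if _last_contact is None:
        return {"status": "error", "error": "no contact selected"}
    return {"status": "ok", "to": _last_contact.email, "subject": subject}


def send_email(contact: Contact, subject: str, body: str) -> dict:
    # Everything the skill needs arrives as a declared argument.
    return {"status": "ok", "to": contact.email, "subject": subject}
```

The second version can be called first, last, or repeatedly without depending on what ran before it.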

Composability designed into the output shape. A skill's output should be shaped to feed into another skill's input cleanly. If search_contacts returns a list of contact objects, and send_email takes a contact object as input, the agent can chain them naturally. If search_contacts returns a markdown report that mentions contacts in prose, the agent has to parse and reformat between every step, which is where tool errors come from.
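As a rough sketch of that chaining, assume search_contacts returns Contact objects and send_email accepts one. The in-memory directory and the stubbed send are placeholders for the example, not a real backend.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Contact:
    name: str
    email: str


# Hypothetical data source standing in for a real contact store.
_DIRECTORY = [
    Contact("Ada Lovelace", "ada@example.com"),
    Contact("Alan Turing", "alan@example.com"),
]


def search_contacts(query: str) -> List[Contact]:
    # Returns structured objects, not prose that needs re-parsing.
    q = query.lower()
    return [c for c in _DIRECTORY if q in c.name.lower()]


def send_email(contact: Contact, subject: str, body: str) -> dict:
    # Stubbed send: reports what would have been delivered.
    return {"status": "ok", "to": contact.email, "subject": subject}


# The output of one skill is directly usable as the input of the next.
results = [send_email(c, "Hello", "Hi there") for c in search_contacts("ada")]
```

No glue code sits between the two calls, which is the property the agent relies on when it chains them itself.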

If you take these four rules seriously, your skill catalog gets smaller, not larger. Half the skills you thought you needed turn out to be combinations of two simpler ones. The framework that lives around the skills gets thinner because the model is doing more of the orchestration. And the agent's behavior gets more predictable because the surface area at every turn is smaller and cleaner.

The Geta.Team take, briefly

We built our skill model around exactly this shape. Each skill is a CLI module the agent invokes directly, with structured input via flags and structured output via stdout. New skills get created on demand in natural language, but the shape they end up taking is always verb-style. The agent does the orchestration. The framework just handles the invocation plumbing.
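To make the CLI shape concrete, here is a minimal sketch of a skill that takes flags and writes JSON to stdout. It is not Geta.Team's actual implementation: the flag names, the result fields, and the stubbed send are all assumptions for illustration.

```python
import argparse
import json
import sys
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Structured input: each declared input is a required flag.
    parser = argparse.ArgumentParser(prog="send_email")
    parser.add_argument("--to", required=True)
    parser.add_argument("--subject", required=True)
    parser.add_argument("--body", required=True)
    return parser


def run(argv: List[str]) -> dict:
    args = build_parser().parse_args(argv)
    # Hypothetical send step; a real module would deliver the message here.
    return {"status": "ok", "to": args.to, "subject": args.subject}


def main(argv: List[str]) -> int:
    # Structured output: one JSON object on stdout that the agent can parse.
    json.dump(run(argv), sys.stdout)
    sys.stdout.write("\n")
    return 0
```

A script entry point would call main(sys.argv[1:]); an invocation such as send_email --to ada@example.com --subject Hello --body Hi would then print a single JSON line.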

The win we see in production: agents trained on skill-composition patterns generalise to new situations the operator didn't explicitly anticipate. Same model, same agent, different problem: the skills compose into a solution that wasn't in the original prompt set. That's the test that separates a real agent from a fancy workflow runner.

What to do this week if you're building agents

If your skill catalog has more than 20 tools, audit it. Look for the kitchen-sink ones first (anything whose name contains "manage," "handle," "orchestrate," or "process" is suspect). Split them. Re-test. The agent will probably get more reliable, not less, because the model has more room to compose.
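The name check is simple enough to script. The sketch below uses word stems rather than the exact words so that variants like "orchestrator" are caught too; the catalog is a made-up example.

```python
from typing import Iterable, List

# Stems of "manage", "handle", "orchestrate", "process".
SUSPECT_STEMS = ("manag", "handl", "orchestrat", "process")


def suspect_skills(names: Iterable[str]) -> List[str]:
    # Flag any skill whose name contains one of the stems, case-insensitively.
    return [n for n in names if any(stem in n.lower() for stem in SUSPECT_STEMS)]


# Hypothetical catalog for illustration.
catalog = [
    "send_email",
    "handle_customer_inquiry",
    "search_contacts",
    "customer_communication_orchestrator",
    "process_refund",
]

flagged = suspect_skills(catalog)
```

A match is a prompt to read the skill, not proof that it needs splitting.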

If your skills have more than three required parameters or more than three optional ones, the skill is doing too much. Find the natural split.
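If the skills are plain Python functions, the parameter check can be automated with the standard inspect module. This is a sketch under that assumption; the two example skills are hypothetical.

```python
import inspect
from typing import Callable, Tuple


def count_params(fn: Callable) -> Tuple[int, int]:
    """Return (required, optional) parameter counts for a function."""
    required = optional = 0
    for p in inspect.signature(fn).parameters.values():
        # Skip *args and **kwargs; they are not individually declared inputs.
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            continue
        if p.default is inspect.Parameter.empty:
            required += 1
        else:
            optional += 1
    return required, optional


def does_too_much(fn: Callable, limit: int = 3) -> bool:
    # Mirrors the rule above: more than three of either kind is a warning sign.
    required, optional = count_params(fn)
    return required > limit or optional > limit


# Hypothetical skills used only to exercise the check.
def send_email(to, subject, body):
    return {"status": "ok"}


def generate_report(source, start, end, fmt, locale="en", tz="UTC"):
    return {"status": "ok"}
```

Here send_email passes with three required parameters, while generate_report is flagged for having four.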

If any skill depends on side effects from another, surface that as an explicit input. The framework's job is plumbing, not enforcement.

None of this is glamorous. It's the same hygiene work that distinguished good API designers from bad ones in the 2010s. But it's about to be what distinguishes agent platforms that scale from agent platforms that get rebuilt every six months.

Want to test the most advanced AI employees? Try it here: https://Geta.Team
