Our AI Employee Diagnosed a 14-Second Latency Spike in Minutes
At 2:00 PM on a Friday, a Directus instance went from 200ms response times to 14 seconds. No deployment had been pushed. No config had changed. The server was technically "up," which made it worse, because every monitoring dashboard showed 100% uptime while actual users stared at loading spinners.
Here's what happened next: an AI employee diagnosed the entire problem, identified five fixes, and delivered a structured investigation report, all before a human even opened a terminal.
This is a real incident. No embellishments, no hypothetical scenario. Just a production server, a latency spike, and an autonomous AI agent doing what it was built to do. (Details have been anonymized to protect the client's infrastructure.)
The Setup
The server in question ran Directus on PostgreSQL, a common CMS stack for content-heavy applications. The infrastructure was healthy on paper: 7.7 GB RAM, 169 GB disk at 29% usage, five containers running with zero restarts, no OOM kills. CPU was mostly idle.
Nothing screamed "problem." And that's exactly why this kind of incident is so dangerous: everything looks fine until a user tells you it isn't.
The Spike
Between 14:00 and 16:00 UTC on February 7th, the server received 43,214 requests. That's not unusual for a busy application, but the composition of that traffic was the real issue.
32% of all requests targeted a single flow trigger endpoint. Another 16% hit a second one. Bots, specifically bingbot, AhrefsBot, SemrushBot, and SERankingBot, were hammering these endpoints relentlessly. The existing rate limit of 500 requests per minute per IP was useless: the traffic arrived through Cloudflare's proxy IPs, so the limiter never saw the real client addresses and no single IP ever hit the threshold.
But the bot traffic alone didn't cause a 70x latency spike. It was the combination with something lurking in the database.
The Smoking Gun: 22 GB of Forgotten History
When the AI employee connected via SSH and began investigating, it found something remarkable: the directus_revisions table had ballooned to 22 GB. That's 95% of the entire 23 GB database, sitting in a single table.
The breakdown was brutal:
- 72,794 rows storing full item snapshots as JSON
- 140 MB of actual row metadata
- 34 MB of indexes
- 21 GB of TOAST data (the JSON data and delta columns)
Collections like review_photo_google (9,031 rows, 87 MB of JSON) and df_unique (12,391 rows, 30 MB) were the biggest offenders. Every edit, every update, every tiny change was being preserved in full fidelity, forever.
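If you want to reproduce that kind of breakdown on your own instance, PostgreSQL exposes the numbers directly. Here's a minimal sketch (connection details are placeholders, not the agent's actual tooling) that splits each table into heap, index, and TOAST sizes so a bloated revisions table stands out immediately:

```python
# Minimal sketch: list the largest tables with heap, index, and TOAST sizes.
# Connection details are placeholders; adjust for your environment.
import psycopg2

SIZE_QUERY = """
SELECT c.relname                                                              AS table_name,
       pg_size_pretty(pg_relation_size(c.oid))                               AS heap_size,
       pg_size_pretty(pg_indexes_size(c.oid))                                AS index_size,
       pg_size_pretty(coalesce(pg_relation_size(nullif(c.reltoastrelid, 0)), 0)) AS toast_size,
       pg_size_pretty(pg_total_relation_size(c.oid))                         AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'            -- ordinary tables only
  AND n.nspname = 'public'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
"""

with psycopg2.connect("dbname=directus user=directus host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(SIZE_QUERY)
        for name, heap, idx, toast, total in cur.fetchall():
            print(f"{name:<30} heap={heap:>10} idx={idx:>10} toast={toast:>10} total={total:>10}")
```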
The database was 11.5x larger than PostgreSQL's shared_buffers (2 GB). This meant most reads required disk I/O. When PostgreSQL ran maintenance operations (checkpoints, vacuum) on that 22 GB table while 43,000 bot requests piled up, the I/O contention collapsed everything.
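A quick way to spot that mismatch yourself, again as a rough sketch with placeholder connection details, is to put shared_buffers next to the total on-disk size of the database:

```python
# Sketch: compare shared_buffers to the database's on-disk size.
import psycopg2

with psycopg2.connect("dbname=directus user=directus host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW shared_buffers;")
        buffers = cur.fetchone()[0]                      # e.g. '2GB'
        cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()));")
        db_size = cur.fetchone()[0]                      # e.g. '23 GB'
        print(f"shared_buffers = {buffers}, database size = {db_size}")
# When the database is an order of magnitude bigger than shared_buffers, cold
# reads, vacuum passes, and checkpoints all spill to disk and compete for I/O.
```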
To make matters worse, the Nginx proxy cache was disabled. Every single request went straight to Directus, straight to PostgreSQL, with zero caching layer. And a separate api_request_logs table had grown to 967,000 rows (668 MB), adding write I/O that competed with reads.
The perfect storm: bloated database + disabled cache + bot traffic surge + maintenance I/O = 14-second response times.
The Five Fixes
The AI employee's report didn't just explain the problem. It delivered five prioritized recommendations:
1. Purge directus_revisions (Critical). Delete old revisions, keeping only the last 20 per item. This alone could reclaim ~20 GB, shrinking the database from 23 GB to roughly 3 GB. Set REVISIONS_LIMIT=20 in the Directus environment to prevent future bloat. (A cleanup sketch follows this list.)
2. Enable Nginx Proxy Cache. The proxy cache directives were already in the Nginx config, just commented out. Uncommenting them and caching flow trigger responses for 30-60 seconds would dramatically reduce database hits from bot traffic.
3. Add Bot Rate Limiting on /flows/trigger/*. IP-based rate limiting doesn't work when bots route through Cloudflare. The fix: user-agent-based filtering or a separate, stricter rate limit for identified bot traffic.
4. Clean Up api_request_logs. Implement periodic cleanup to keep only the last 30 days (also covered in the sketch below). At nearly a million rows and growing unbounded, this table was adding unnecessary write pressure during an already stressed period.
5. Fix varchar(255) Overflow. 19 database errors during the spike window ("value too long for varchar(255)") pointed to a bulk import process hitting column limits. A small fix, but one that eliminates error-handling overhead during high-load periods.
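For the two cleanup items, a purge might look something like the sketch below. This is not the AI employee's actual change set: it assumes directus_revisions keys revisions by (collection, item) with an id column, and that the project-specific api_request_logs table has a created_at timestamp, so verify both against your schema and take a backup before running any deletes.

```python
# Hedged sketch of fixes 1 and 4. Column names for api_request_logs are an
# assumption -- check your schema first, and back up before deleting anything.
import psycopg2

PURGE_REVISIONS = """
DELETE FROM directus_revisions
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               row_number() OVER (PARTITION BY collection, item
                                  ORDER BY id DESC) AS rn
        FROM directus_revisions
    ) ranked
    WHERE ranked.rn > 20   -- keep only the 20 newest revisions per item
);
"""

PURGE_REQUEST_LOGS = """
DELETE FROM api_request_logs
WHERE created_at < now() - interval '30 days';
"""

with psycopg2.connect("dbname=directus user=directus host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(PURGE_REVISIONS)
        print(f"revisions deleted: {cur.rowcount}")
        cur.execute(PURGE_REQUEST_LOGS)
        print(f"request log rows deleted: {cur.rowcount}")
    conn.commit()

# Deleted rows don't shrink the files on disk; the space comes back only after
# VACUUM FULL (or pg_repack), which locks the table, so schedule that for a
# maintenance window. Setting REVISIONS_LIMIT=20 then caps future growth.
```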
What Makes This Different
Any experienced sysadmin could have reached the same conclusions. The difference is time and availability.
The AI employee didn't need to be paged. It didn't need to context-switch from another task. It didn't need to remember which SSH key goes to which server, or spend 20 minutes grepping through logs to orient itself. It connected, investigated, correlated traffic patterns with database metrics, identified the root cause, and produced a structured report with actionable fixes.
The whole process ran in what we call "investigation mode," a human-in-the-loop approach where the AI generates the report and waits for explicit confirmation before touching anything. No changes were made. Every recommendation sat there, waiting for a human to say "yes" or "no."
There's also an "auto-fix" mode for teams that want faster resolution on obvious issues. But for a 23 GB production database full of customer data, investigation-first is the right call.
The Uncomfortable Truth About Server Monitoring
Most teams have monitoring. Dashboards, alerts, uptime checks. But monitoring tells you that something is wrong. It rarely tells you why.
The gap between "alert fired" and "root cause identified" is where hours disappear. It's where engineers get pulled out of deep work. It's where Friday evenings turn into weekend firefights.
This Directus incident was resolved because an AI employee filled that gap autonomously. Not by guessing or running a playbook, but by actually investigating: reading logs, analyzing traffic patterns, measuring table sizes, checking configurations, and connecting the dots.
That's the difference between an alert system and an AI employee. One tells you the house is on fire. The other tells you it started in the kitchen because someone left the stove on, and here are five ways to prevent it from happening again.
Try It Yourself
At Geta.Team, our AI employees handle investigations like this every day. They connect to your infrastructure, diagnose issues in real time, and either fix them autonomously or hand you a detailed report with clear next steps.
If your team spends more time reacting to incidents than preventing them, it might be time to hire an AI employee who never sleeps, never context-switches, and never forgets what your server looked like last Tuesday.