Series: RDS Rightsizing

Engineering AI Agents: Moving Beyond 'Creative Statistics' to Build Pragmatic Dev Tools

AI, Platform Engineering, AWS, FinOps, Architecture

By now, using generative AI in your daily engineering workflow isn’t a differentiator—it’s table stakes. But there is a massive chasm between asking a chatbot to “write me a boilerplate React component” and building production-grade, cost-efficient agentic workflows that solve actual infrastructure problems.

A while back, under my writing alias Adam Korga, I wrote about how Agentic AI is the new cloud and why most companies are already drastically overpaying for context. The thesis was simple: treating LLM context windows like an infinite, free resource is the modern equivalent of spinning up unmonitored “m7i.4xlarge” instances just to run a cron job.

If you don’t manage context manually, you’re throwing money into a furnace.

So, how do we apply this discipline to complex, real-world engineering tasks? Let’s look at a concrete case: building a custom skill for Claude Code designed to automate AWS RDS rightsizing across hundreds of database instances.

The Challenge: Database Rightsizing at Scale

In any mature organization, infrastructure drift is inevitable. You end up with hundreds of RDS instances. Some are massively over-provisioned “just in case,” others are choking on memory pressure, and a few are relics of abandoned staging environments silently costing thousands of dollars a month.

As a Platform Engineer or Architect, you can’t manually audit 300 databases every month.

Sure, there are dozens of commercial FinOps tools designed to help with this. They are great at what they do, but they are inherently conservative. They play it safe because they have to scale across thousands of distinct environments without breaking production. In my experience, while a standard FinOps platform might flag 30% to 45% of obvious infrastructure waste, a stubborn engineer who actually understands the specific application workload can easily squeeze out over 70%.

My goal was to build a system that achieves that engineer-level precision and depth, without actually sending my platform team to therapy from manual audit fatigue. But you also can’t blindly trust a generic, autonomous AI agent to script its way through your AWS account without strict guardrails.

Here is how I engineered a highly deterministic, dirt-cheap agentic workflow to solve this.

The Execution: A Six-Step Process

1. Extracting the Domain Knowledge (Rubber-Ducking)

An AI agent is only as good as the heuristics you feed it. Instead of letting the model hallucinate what a “healthy” RDS database looks like, I spent time using a standard browser-based LLM as an interactive sounding board.

At this stage, your ability to express thoughts clearly and ask the right questions is invaluable. I refuse to call this “prompt engineering”—it is simply clarity of communication. This is the exact same skill a senior engineer must master to align with human colleagues, explain trade-offs, and bridge the gap with team members from different backgrounds. If you cannot articulate a complex technical problem clearly to a human, you won’t get a coherent output from a machine.

I dumped twenty years of database administration experience, incident post-mortems, and AWS scaling patterns into the chat. We validated the edge cases. We refined the rules of thumb (e.g., when is a CPU spike acceptable, and when does it signal a missing index rather than a need for a larger instance class?).

Using an LLM as a “rubber duck” is an incredible accelerator for learning and structuring tacit knowledge—provided you maintain a healthy level of professional skepticism and rigorously verify the output. It bypassed hours of writing dry specifications from scratch.

2. Drafting the Blueprint

Models excel at copy-editing and structuring unstructured thoughts. In a series on the role of AI in creative work, I broke this process down using the analogy of the three distinct roles of book creation (writer, editor, and author) and argued that while AI makes raw generation cheap, human vision is the only scarce resource we actually need to protect.

Once the raw rules were down, I didn’t rush straight to code. Instead, I used the LLM in its editor capacity to execute a two-phase blueprinting process:

Phase A: Isolating the Analysis Vectors. Translating my unstructured brain-dump into a logical, numbered framework of specific analysis vectors (e.g., isolating CPU-bound patterns, identifying memory pressure, recognizing idle or abandoned databases). This established the strict logic of what needs to be evaluated.
Phase B: Generating Human-Readable Documentation. Using the LLM to expand those vectors into clean, exhaustive documentation. Why write markdown documentation for an automated agent? Two reasons:
1. Organizational Knowledge Base: It creates a permanent, readable source of truth for other engineers on the team.
2. The “Alignment Check”: Reading the generated text is the ultimate verification step. It lets you see exactly how the model interpreted your domain expertise. If it merged unrelated concepts or hallucinated “magic” AWS metrics that don’t exist in CloudWatch, you catch the misunderstanding in the prose before it writes a single broken script.

(Note: The full, production-ready version of this document lives securely in my client’s Confluence as part of their proprietary knowledge base, and it is also my personal consulting IP. However, to show you how this looks in practice, I’ve published a simplified, sanitized version of this blueprint as a companion case study on this site. Consider it a freemium teaser of how I translate domain expertise into structured, machine-ready heuristics.)

3. The Heuristic Split (Reducing the Cognitive Load)

Before writing a single line of agent code, I made a critical architectural decision: Do we need the full cognitive power of an LLM to analyze every single database?

The answer is a hard no.

Why burn expensive input tokens analyzing a database that hasn’t crossed 5% CPU or memory utilization in three months? I split the workflow into two distinct stages:

The Filter: A dirt-simple, deterministic script to flag obvious candidates.
The Deep Dive: An agentic workflow reserved only for instances that require complex, multi-variable interpretation.
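To make the split concrete, here is a minimal sketch of what the Filter stage could look like. The 5% threshold comes from the example above; the `InstanceStats` fields, the function names, and the idea of a precomputed memory percentage are assumptions for illustration (CloudWatch reports RDS memory as `FreeableMemory` in bytes, so a real script would have to derive the percentage first).

```python
from dataclasses import dataclass

# Threshold taken from the 5% example in the text; tune it per environment.
IDLE_PCT = 5.0


@dataclass(frozen=True)
class InstanceStats:
    """Hypothetical per-instance summary over the lookback window."""
    instance_id: str
    peak_cpu_pct: float       # highest CPU utilization observed
    peak_mem_used_pct: float  # highest memory utilization (derived, assumed)


def triage(stats: InstanceStats, idle_pct: float = IDLE_PCT) -> str:
    """Return 'idle' for obvious candidates, 'deep_dive' for everything else."""
    if stats.peak_cpu_pct < idle_pct and stats.peak_mem_used_pct < idle_pct:
        return "idle"
    return "deep_dive"


def split_fleet(fleet):
    """Partition the fleet so only ambiguous instances reach the agent."""
    buckets = {"idle": [], "deep_dive": []}
    for stats in fleet:
        buckets[triage(stats)].append(stats.instance_id)
    return buckets
```

Only the `deep_dive` bucket would be handed to the agentic workflow; the `idle` bucket can go straight into a report without spending a single token.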

4. Code Generation

With a highly structured specification document ready, I didn’t waste time hand-crafting the Claude Code skill (SKILL.md). I fed the spec directly to Claude and told it to build the scaffolding for me. When your input requirements are rock-solid, the code practically writes itself.

5. The First Run (The “Half-Product”)

The initial generation worked. It ran the commands, gathered data, and spit out recommendations. But like most naive AI-generated code, it was bloated, fragile, and financially irresponsible. It was a prototype, not an engineering tool.

6. Refactoring: Stripping the Magic out of the Agent

This is where actual software engineering comes back into play. To make this tool production-ready, I applied two aggressive refactoring passes:

Pass A: Modularizing Context (Combating the “Ulysses” Anti-Pattern)

The first iteration generated a massive SKILL.md file. It was a sprawling, epic document that would make James Joyce proud. The problem? Claude reads this entire file on every single execution.

I broke this monolith apart. The main skill file was stripped down to a high-level orchestration flow and raw file references. Each sub-task (metric extraction, CSV parsing, recommendation formatting) was moved to its own isolated file.

The Result: We cut token consumption by 60% to 70% on execution. Yes, Claude supports prompt caching. But let’s be real: this isn’t a CI/CD loop that runs every five minutes; it’s an audit tool you trigger once every few weeks. At that frequency, any cached prefix has long since expired, leaving raw context optimization as your only real defense against token bloat.
The Bonus: Smaller, hyper-focused instruction sets drastically reduced the model’s surface area for hallucinations.
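If you want to see where the context budget goes before and after a split like this, a rough audit script is enough. The sketch below is an assumption-heavy illustration, not the tooling used in this project: it walks a skill directory, counts lines per markdown file, and estimates tokens with a crude four-characters-per-token rule of thumb rather than a real tokenizer. The 200-line limit mirrors the smell threshold from the rules at the end of this article.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # crude rule of thumb, NOT a real tokenizer
MAX_LINES = 200      # smell threshold borrowed from the rules below


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def audit_skill_dir(root):
    """Return (relative path, line count, approx tokens) per .md file, largest first."""
    root = Path(root)
    report = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        name = path.relative_to(root).as_posix()
        report.append((name, len(text.splitlines()), estimate_tokens(text)))
    return sorted(report, key=lambda row: row[2], reverse=True)


def oversized(report, max_lines: int = MAX_LINES):
    """Names of files that exceed the line budget."""
    return [name for name, lines, _ in report if lines > max_lines]
```

Running `audit_skill_dir` on the skill folder before and after the refactor gives a quick, if approximate, picture of how much text the model has to read for the orchestration file versus each sub-task file.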

Pass B: Replacing “Creative Statistics” with Boring Determinism

I analyzed the agent’s execution log and looked for everywhere the model was trying to be “creative.”

Was it dynamically generating bash commands to query CloudWatch APIs on the fly? Yes. And it was occasionally messing up the date formatting, failing, and burning tokens debugging its own bash syntax.
Was it using natural language to parse CSV files? Yes. Which meant the output format shifted slightly every run, making downstream automation impossible.

I replaced these steps with static, deterministic helpers:

The Agent does not query AWS directly. A static, boring Python helper script fetches CloudWatch metrics and dumps them into a structured CSV.
The Agent does not format the final report. A rigid template engine handles the output.
The Agent’s sole job is synthesis. It takes the clean CSV, applies the heuristics we defined in Step 1, and explains why a specific rightsizing step is recommended.
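As an illustration of the first helper, here is a minimal sketch of a static metrics fetcher. It is not the production script: the function names, the CSV columns, and the choice of hourly `CPUUtilization` are assumptions. The only AWS call used is the CloudWatch client's `get_metric_statistics` from boto3, and the client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
import csv
from datetime import datetime, timedelta, timezone

# Hypothetical column layout; the fixed order is the point, not the names.
FIELDS = ["instance_id", "timestamp", "cpu_avg_pct", "cpu_max_pct"]


def query_window(now: datetime, days: int):
    """Build the time range in code so date formatting is never improvised."""
    end = now.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return end - timedelta(days=days), end


def fetch_cpu_rows(cloudwatch, instance_id: str, start: datetime, end: datetime):
    """Fetch hourly CPUUtilization for one RDS instance as flat dict rows."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # one datapoint per hour
        Statistics=["Average", "Maximum"],
    )
    # CloudWatch does not guarantee datapoint order, so sort explicitly.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [
        {
            "instance_id": instance_id,
            "timestamp": p["Timestamp"].astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "cpu_avg_pct": f"{p['Average']:.2f}",
            "cpu_max_pct": f"{p['Maximum']:.2f}",
        }
        for p in points
    ]


def write_csv(rows, handle):
    """Write rows with a fixed header so the format is identical on every run."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

In real use the client would come from `boto3.client("cloudwatch")`, and the window should stay small enough that a single call does not exceed CloudWatch's 1,440-datapoint limit (14 days at hourly resolution is 336 points). Because the dates, the column order, and the number formatting are all fixed in code, the agent never has a chance to get them wrong.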

We bridged the explainability gap by forcing the agent to output a clear, step-by-step reasoning path for its recommendations, rather than just shouting an instance size. Consequently, we didn’t just get an end-state recommendation; the agent generated a structured, phased rollout plan that factors in the blast radius and rollback difficulty of each change (since downscaling a write-heavy primary database carries vastly different operational risks than adjusting a replica).
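The ordering idea behind a phased rollout can be sketched deterministically. The example below is an illustration only, not how the agent produces its plan: the `Change` fields, the `write_heavy` flag, and the three-phase ordering (replicas first, quiet primaries next, write-heavy primaries last) are all assumptions drawn from the replica-versus-primary contrast above.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Change:
    """Hypothetical description of one proposed rightsizing change."""
    instance_id: str
    role: str          # "replica" or "primary"
    write_heavy: bool  # assumed flag, e.g. derived from write IOPS


def risk_phase(change: Change) -> int:
    """Lower phase = smaller blast radius and easier rollback (assumed ordering)."""
    if change.role == "replica":
        return 1
    return 3 if change.write_heavy else 2


def rollout_plan(changes):
    """Group changes into ordered phases, lowest risk first."""
    ordered = sorted(changes, key=lambda c: (risk_phase(c), c.instance_id))
    return {
        phase: [c.instance_id for c in group]
        for phase, group in groupby(ordered, key=risk_phase)
    }
```

A real plan would weigh more signals than these two, but even this toy version shows why the recommendation is a sequence of steps rather than a single target instance size.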

Furthermore, because running the agent is dirt-cheap, nothing stops us from simply re-running the analysis under the new baseline conditions before proceeding to the next phase of the rollout. This allows you to verify that the initial adjustments hold up under real-world load without introducing latency issues, before moving on to higher-risk databases.

The Payoff

By treating the agent as a modular software component rather than a magic black box, we built a highly optimized workflow:

Cost-per-run: Down to under $0.40 in tokens per database analyzed.
Infrastructure Savings: Identifies 60% to 70% in monthly waste on targeted RDS instances.
Predictability: The output is structured, predictable, and actually explainable to stakeholders who need to sign off on database migrations.
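The structured output does not require anything exotic. As a minimal sketch of the rigid template idea from Pass B, the standard library's `string.Template` is enough; the field names and the report layout here are hypothetical, not the format used in the real tool.

```python
from string import Template

# Hypothetical report layout; the agent only supplies the field values.
REPORT = Template(
    "Instance: $instance_id\n"
    "Current class: $current_class\n"
    "Recommended class: $recommended_class\n"
    "Reasoning: $reasoning\n"
)


def render_report(recommendation: dict) -> str:
    """Render one recommendation into the fixed layout.

    substitute() raises KeyError on a missing field instead of guessing,
    so an incomplete recommendation fails loudly.
    """
    return REPORT.substitute(recommendation)
```

The layout lives in code and never changes between runs, which is what makes the result safe to feed into downstream automation or hand to a stakeholder.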

Three Rules for Pragmatic AI Engineering

If you take away nothing else from this case study, build your next tool with these three principles in mind:

Use AI to build solutions, not solve problems. Use LLMs to generate your CI/CD pipelines, your Terraform scripts, and your static helper code. Do not build systems where the LLM is the actual runtime environment, dynamically interpreting and executing these deterministic tasks in production every single time.
Software engineering principles still apply. The classic “divide and conquer” paradigm applies here more than ever before. Separation of concerns, modularity, and DRY (Don’t Repeat Yourself) principles are not obsolete relics of the pre-AI era. If your prompt or skill file is over 200 lines, you have an architectural smell. Break it up.
Minimize creative statistics. LLMs are next-token predictors: they are engines of creative statistics. Whenever you can replace an LLM call with a regular expression, a SQL query, or a static Python script, do it. Your CFO and your on-call engineer will thank you.