Your Network Team Is One Bad Script Away From a 2AM Incident — Here's a Better Path


There's a specific kind of pain that every infrastructure engineer knows. It's 2am. BGP is flapping on an edge router. You're SSH'd in, running show ip bgp summary manually, copy-pasting output into a Google Doc, cross-referencing your runbook, and asking yourself why — in the year we're in — this is still a one-person-at-a-keyboard problem.

The answer usually comes down to two things: the scripts that were supposed to help have become too brittle to trust, and the "AI tools" people tried were too chatty and not operational enough. Neither fills the gap between "someone with deep expertise running commands" and "lights-out automation."

This post walks through GoogleADK-NetworkAutomation, a hands-on repository built by Ashwin Joisa that tackles exactly that gap using Google's Agent Development Kit. It's a collection of 14+ working agent examples organized into a learning path — from a single-tool agent to multi-agent hierarchies deployed on Vertex AI. If you're running infrastructure for a cash-strapped university IT department, a regional MSP, or a mid-market company trying to modernize without a large automation team, this is worth your time.

What Google ADK Actually Is

Google's Agent Development Kit (ADK) is an open-source Python framework for building production-ready AI agents. It's not a chatbot builder. The core idea is that you give an LLM (Gemini, in this case) a set of tools — Python functions, API calls, external services — and it decides which tools to call, in what order, based on a goal you give it.

This is meaningfully different from a script. A script follows a fixed path. An agent figures out the path. If BGP is fine but OSPF is the culprit, the agent pivots. If it needs to check three routers concurrently instead of sequentially, it can. The flexibility lives in the reasoning layer, not the code.

The key components in ADK are:

  • Agent: The reasoning unit. Wraps an LLM with tools, a system prompt, and optional sub-agents.

  • Tool: Any Python function or external API the agent can call.

  • Runner: Executes the agent, manages the session, handles tool calls.

  • Session: Stores context across turns so multi-step conversations don't lose state.

  • Workflow agents: SequentialAgent, ParallelAgent, LoopAgent — orchestration primitives.

Getting the Repo Running

Before anything else, make sure you have uv installed. It's the package manager used throughout this repo, and it handles dependency isolation automatically: no virtual environments to set up by hand.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

You'll also need a Google Gemini API key. Get one from Google AI Studio — there's a generous free tier.

export GOOGLE_API_KEY=your-gemini-api-key-here

Clone the repo:

git clone https://github.com/ashwinjo/GoogleADK-NetworkAutomation.git
cd GoogleADK-NetworkAutomation

Each agent lives in its own numbered folder. Navigate into any of them and you'll find the same structure:

1-basic-agent/
├── README.md
├── basic_agent/
│   ├── __init__.py
│   ├── agent.py          # Core agent logic
│   ├── fast_api_app.py   # FastAPI server
│   └── app_utils/        # Utility functions and tools
├── tests/
├── pyproject.toml
├── Makefile
└── Dockerfile

The Makefile is your interface. The three commands you'll use constantly:

make install       # Install all dependencies via uv
make playground    # Launch ADK's local web UI (recommended for exploration)
make local-backend # Run the FastAPI server with hot-reload

To interact via CLI instead of the web UI:

uv run adk run basic_agent

Phase 1: Your First Network Agent

Start in 1-basic-agent. This is a Network Design Review Agent — give it a description of your topology, and it analyzes it for failure domains, redundancy gaps, and architectural risks.

After running make install && make playground, open your browser to http://localhost:8000. You'll see ADK's built-in playground — a split-pane interface with the agent on one side and execution traces on the other.

Try this prompt:

"Review my network design: We have a single core switch connecting to two access layer switches. All servers run off the access layer, and we have one uplink to our ISP."

The agent will identify the single points of failure (the core switch, the single ISP uplink), ask clarifying questions if needed, and produce structured feedback. It's not magic — it's Gemini following a carefully written system prompt — but it's also immediately useful for teams that don't have a dedicated network architect.

The underlying agent.py is worth reading. The pattern you'll see repeated across all 14 agents:

from google.adk.agents import Agent

agent = Agent(
    model="gemini-2.0-flash",
    name="network_design_reviewer",
    description="Reviews network architecture for risks and anti-patterns.",
    instruction="""You are a senior network architect. Analyze the provided
    network design and identify: failure domains, redundancy gaps, security
    exposures, and scalability constraints. Be specific and actionable.""",
    tools=[],  # No external tools needed for pure reasoning
)

The instruction field is your system prompt. This is where the expertise lives. For a university IT shop, you might tailor this to your specific topology — campus buildings, research VLANs, specific vendor equipment.
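To make that concrete, here's one way a site-specific instruction could be assembled from local details. Everything in the profile below is invented for illustration; the repo just hard-codes the instruction string:

```python
# Hypothetical site profile; adapt the details to your own environment.
SITE_PROFILE = {
    "org": "State University IT",
    "buildings": ["Science Hall", "Library", "Dorm Cluster A"],
    "research_vlans": [210, 220],
    "vendors": ["Cisco Catalyst 9300", "Palo Alto PA-440"],
}

def build_instruction(profile: dict) -> str:
    """Compose a tailored system prompt from a site profile."""
    return (
        "You are a senior network architect reviewing designs for "
        f"{profile['org']}. Campus buildings: {', '.join(profile['buildings'])}. "
        f"Research VLANs {profile['research_vlans']} carry sensitive traffic; "
        "flag any design that mixes them with general-use segments. "
        f"Assume the deployed hardware is: {', '.join(profile['vendors'])}. "
        "Identify failure domains, redundancy gaps, security exposures, "
        "and scalability constraints. Be specific and actionable."
    )

instruction = build_instruction(SITE_PROFILE)
```

The resulting string drops straight into the Agent's instruction field, and the profile can live in version control alongside the rest of your network documentation.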

Phase 2: Agents That Actually Do Things

The real power shows up in 2-basic-agent-with-tools. This is where the agent stops just reasoning and starts taking action.

The BGP Troubleshooting agent has Python functions that simulate network API calls — in a real deployment, these would point at your actual network devices via RESTCONF, NETCONF, or SSH. The agent receives a router name, decides which tools to call, and systematically diagnoses the problem.
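The simulated tools are just Python functions returning structured data. A minimal sketch of what one might look like follows; the function name, router data, and fields here are illustrative, not the repo's exact code:

```python
def get_bgp_summary(router_name: str) -> dict:
    """Simulated BGP summary tool. A real version would query the
    device via RESTCONF, NETCONF, or SSH instead of canned data."""
    simulated = {
        "r1-sea3": {
            "router": "r1-sea3",
            "local_as": 65001,
            "neighbors": [
                {"peer": "10.0.0.2", "remote_as": 65002,
                 "state": "Established", "prefixes": 4200},
                {"peer": "10.0.0.6", "remote_as": 65003,
                 "state": "Idle", "prefixes": 0},
            ],
        }
    }
    return simulated.get(router_name,
                         {"router": router_name, "error": "unknown device"})

# The agent calls this tool, spots the Idle neighbor, and drills in further.
summary = get_bgp_summary("r1-sea3")
down = [n for n in summary["neighbors"] if n["state"] != "Established"]
```

Because the return value is structured rather than raw CLI text, the model can reason over it reliably, and you can swap the canned dictionary for a live API call without changing the agent at all.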

cd 2-basic-agent-with-tools
make install && make playground

Try:

"Troubleshoot BGP on router r1-sea3"

The agent will call tools to fetch BGP neighbor summaries, check specific neighbor states, inspect route tables, and return a structured diagnosis. It doesn't just answer — it works through the problem the way a good engineer would.

This phase also introduces MCP (Model Context Protocol) tooling. One agent connects to an external subnet calculator service:

from google.adk.tools.mcp_tool.mcp_toolset import McpToolset, StdioConnectionParams
from mcp import StdioServerParameters

McpToolset(
    connection_params=StdioConnectionParams(
        server_params=StdioServerParameters(
            command="npx",
            args=["-y", "supergateway", "--sse",
                  "https://mcp-subnet-calculator.mteke.com/sse"]
        ),
        timeout=30,
    ),
)

This is the MCP pattern: your agent connects to any MCP-compatible service and gets its tools automatically. For infrastructure teams, this is significant — you can build a library of MCP servers for your specific environment (CMDB, monitoring, ticketing) and expose them to any agent.

Another variant connects to Hugging Face for model search:

McpToolset(
    connection_params=StreamableHTTPServerParams(
        url="https://huggingface.co/mcp",
        headers={"Authorization": f"Bearer {HUGGING_FACE_TOKEN}"},
    ),
)

The point isn't these specific services — it's the pattern. Any tool your team has built an API for can become available to any agent.

Phase 3: Multi-Turn Context and Human Approvals

Two of the most practically important agents live in phases 3 and 4.

Session Context (Agent 3)

Networks are stateful. "What was the BGP state on that router 10 minutes ago?" is a valid question during an incident. 3-agent-session-context demonstrates how to build a NOC assistant that maintains context across a full troubleshooting conversation.

cd 3-agent-session-context
make install && make playground

A multi-turn conversation might look like:

  • Turn 1: "Check BGP summary for router r1-core01"

  • Turn 2: "What was the state of that last neighbor?"

  • Turn 3: "Is that the same neighbor that was flapping yesterday?"

Without session state, each turn is a fresh conversation. With it, the agent carries context forward. For a NOC team handing off incidents between shifts, this changes the workflow entirely.
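Conceptually, session state is just a conversation history that travels with a session ID. ADK's Session service manages this for you; this bare-bones toy illustrates the idea and is not ADK's API:

```python
class Session:
    """Toy session store: keeps prior turns so later questions can
    resolve references like 'that last neighbor'."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.history: list[dict] = []

    def add_turn(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})

    def context_for_llm(self) -> list[dict]:
        # The full history is sent with each new turn, so the model
        # can ground "that router" in an earlier answer.
        return list(self.history)

noc = Session("shift-handoff-42")
noc.add_turn("user", "Check BGP summary for router r1-core01")
noc.add_turn("agent", "r1-core01: 3 neighbors, 10.0.0.6 is Idle")
noc.add_turn("user", "What was the state of that last neighbor?")
```

A persistent backing store (rather than this in-memory list) is what makes the shift-handoff scenario work: the incoming engineer resumes the same session ID and inherits the full context.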

Human-in-the-Loop Approvals (Agent 4)

This one is important for anyone who's ever had a script make an unauthorized change to production. 4-agent-human-in-the-loop puts a confirmation gate before any destructive action.

The implementation is clean. There are two patterns:

Pattern 1: Boolean flag on a tool

FunctionTool(
    write_router_config,
    require_confirmation=confirmation_if_not_spof_router
)

The confirmation_if_not_spof_router function returns True when the target router is critical infrastructure, False otherwise. No approval needed for lab routers; explicit approval required for anything in the path of production traffic.
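A predicate like that can be a few lines against a device inventory. The inventory below is invented for illustration, and the repo's actual logic may differ:

```python
# Hypothetical inventory of devices whose failure takes down production.
SPOF_ROUTERS = {"r1-core01", "edge-router1"}

def confirmation_if_not_spof_router(router_name: str) -> bool:
    """Return True when a human must approve the change.

    Critical (SPOF) devices require confirmation; lab gear does not.
    """
    return router_name in SPOF_ROUTERS
```

In a real deployment you'd likely source that set from your CMDB rather than a hard-coded constant, so the approval policy stays in sync with the network.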

Pattern 2: Structured approval payload

# First invocation: no approval yet, so pause and surface the request
if not tool_confirmation:
    tool_context.request_confirmation(
        hint="This will modify router config on a production device",
        payload={"ok_to_write": False}
    )
    return {"status": "pending_approval"}

# Resumed invocation: a human has responded, so act on their decision
ok_to_write = tool_confirmation.payload.get("ok_to_write")
if ok_to_write:
    return write_router_config(router_name)

The agent pauses, surfaces the request to a human, and only proceeds when explicitly approved. For a change management process that currently lives in email threads and Slack messages, this is a significant upgrade.

Phase 4: Workflow Orchestration

6-agent-workflows is where ADK shows what makes it different from a single LLM call.

Three orchestration patterns are implemented:

Sequential — for troubleshooting procedures where each step depends on the previous:

StartAgent → GatherInfo → DeviceStatus → PingTest →
Traceroute → FirewallRules → SummaryAgent

Try:

"Our BGP neighbor on router edge-router1 keeps going down. Please diagnose."

The agent walks through each stage, passing context forward. If the ping test fails, the traceroute still runs to identify where packets stop.

Parallel — for independent checks that waste time running sequentially:

StartAgent → ParallelAgent [
    check_r1_agent (cpu, memory, interface tools)
    check_r2_agent (cpu, memory, interface tools)
] → SummaryAgent

Try:

"Check the health of router1, router2, and fw1 devices"

All three checks run concurrently. For a network with dozens of devices, this isn't just faster — it's the difference between a useful tool and an unusable one.
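Under the hood this is concurrent fan-out. ADK's ParallelAgent handles the orchestration for you, but the effect is the same as this stdlib sketch, where the device names and check function are invented:

```python
import asyncio

async def check_device(name: str) -> dict:
    """Stand-in health check; a real one would await a device API call."""
    await asyncio.sleep(0.1)  # simulate network latency
    return {"device": name, "cpu": "ok", "memory": "ok", "interfaces": "ok"}

async def check_all(devices: list[str]) -> list[dict]:
    # All checks run concurrently: total wall time is roughly the
    # slowest single check, not the sum of all of them.
    return await asyncio.gather(*(check_device(d) for d in devices))

results = asyncio.run(check_all(["router1", "router2", "fw1"]))
```

With three devices at 100 ms each, the sequential version takes ~300 ms and this one ~100 ms; at fifty devices the gap is what decides whether anyone actually uses the tool.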

Loop — for remediation cycles that need to keep running until a condition is met:

StartAgent → LoopAgent (max_iterations=3) [
    MonitoringAgent (check_connectivity, check_latency)
    RemediationAgent (restart_service, fix_connectivity)
] → SummaryAgent

The max_iterations=3 guard matters. Loops need an exit condition. This pattern is appropriate for things like "keep trying to restore connectivity, but stop after 3 attempts and page a human."
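The shape of that loop, in plain Python. The monitor and remediate functions here are invented stand-ins; in the repo they are agent tool calls:

```python
MAX_ITERATIONS = 3

def monitor(state: dict) -> bool:
    """Return True when connectivity is healthy."""
    return state["healthy"]

def remediate(state: dict) -> None:
    """Attempt a fix; in this toy version, succeed on the second try."""
    state["attempts"] += 1
    if state["attempts"] >= 2:
        state["healthy"] = True

def remediation_loop(state: dict) -> str:
    for _ in range(MAX_ITERATIONS):
        if monitor(state):
            return "resolved"
        remediate(state)
    # Exit condition never met: stop looping and escalate to a human.
    return "page_human"

outcome = remediation_loop({"healthy": False, "attempts": 0})
```

The escalation branch is the important part: an agent loop without a hard iteration cap is exactly the kind of script that causes the 2am incident this post opened with.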

Phase 5: Deployment

Once you've built something useful locally, 11-agent-deployment-cloudrun and 12-agent-deployment-vtxai handle getting it into production.

For Cloud Run:

make deploy
# Equivalent to deploying a containerized FastAPI app with IAM controls
# Options: IAP=true PORT=8080

For Vertex AI Agent Engine:

make playground
# Vertex AI provides a managed runtime with built-in observability,
# trace visualization, and version management

The Vertex AI path is worth considering for teams that need audit trails. Every agent interaction is logged, traceable, and can be replayed. For environments with compliance requirements — healthcare networks, financial infrastructure, government — this matters.

Observability is addressed separately in 10-agent-observability. The default setup enables Cloud Trace automatically. To capture prompt/response content:

export LOGS_BUCKET_NAME=your-gcs-bucket
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=FULL

For the Cash-Strapped IT Team

If you're running a small IT team at a university or a nonprofit, the path here doesn't have to start with Vertex AI and Cloud Run. It can start with:

  1. A single laptop running make playground

  2. A Gemini API key (free tier handles substantial usage)

  3. One agent pointed at your most repetitive troubleshooting task

The BGP troubleshooting agent, adapted to your most common incident type and run locally, lets one engineer triage faster and more accurately. That's a meaningful outcome before you spend a dollar on cloud infrastructure.

The repo also includes 14-agent-ollama for running agents against local LLMs via Ollama. If data sovereignty or network isolation is a constraint, you can run the entire stack without any external API calls.

cd 14-agent-ollama
make install && make playground
# Runs against a local Ollama instance instead of Gemini

What to Try First

If you're new to this, here's a concrete starting path:

# 1. Clone and set up
git clone https://github.com/ashwinjo/GoogleADK-NetworkAutomation.git
export GOOGLE_API_KEY=your-key-here

# 2. Start with the basic agent
cd GoogleADK-NetworkAutomation/1-basic-agent
make install && make playground

# 3. Try a real design review
# Prompt: "Review my network design: [describe your actual topology]"

# 4. Move to tools
cd ../2-basic-agent-with-tools
make install && make playground
# Prompt: "Troubleshoot BGP on router r1-sea3"

# 5. Try the human-in-the-loop agent before touching anything production-adjacent
cd ../4-agent-human-in-the-loop
make install && make playground

The sample queries are documented in SAMPLE_QUERIES.md — each agent has 3-5 example prompts you can run immediately.

The Honest Assessment

This repo is a learning path, not a production-ready product. The tool functions simulate network calls rather than hitting real devices. Adapting them to your actual infrastructure — whether that's Cisco IOS-XE via RESTCONF, Juniper via NETCONF, or Arista via eAPI — requires writing the actual integration layer.
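As a concrete sketch of that integration layer, here's how a RESTCONF-backed replacement for a simulated tool might be built with the standard library. The YANG model path is the commonly documented IOS-XE operational model, but treat it and the field names as assumptions to verify against your platform:

```python
import base64
import json
import urllib.request

def restconf_request(host: str, path: str,
                     username: str, password: str) -> urllib.request.Request:
    """Build a RESTCONF GET request following RFC 8040 conventions."""
    url = f"https://{host}/restconf/data/{path}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={
        "Accept": "application/yang-data+json",
        "Authorization": f"Basic {token}",
    })

def get_bgp_neighbors(host: str, username: str, password: str) -> dict:
    """Real tool: fetch BGP operational state from a live device."""
    # Illustrative IOS-XE operational model path; verify on your platform.
    req = restconf_request(host, "Cisco-IOS-XE-bgp-oper:bgp-state-data",
                           username, password)
    with urllib.request.urlopen(req) as resp:  # live network call
        return json.load(resp)

req = restconf_request("r1-sea3.example.net",
                       "Cisco-IOS-XE-bgp-oper:bgp-state-data",
                       "admin", "secret")
```

The point is that the function signature the agent sees never changes; you swap the canned dictionary for this and the reasoning layer is none the wiser.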

That work isn't trivial, but it's also not the hard part anymore. The hard part was always the orchestration: how do you handle a troubleshooting workflow where step 4 depends on the result of step 2? How do you get a confirmation gate before a config change? How do you run parallel health checks without writing a threading nightmare? ADK handles all of that.

The integration layer — connecting tool functions to your actual devices — is now just the boring part. Which is progress.


Resources