Building Agent-Bando: Learning The Inner Workings with Open-Source LLMs
This project was all about learning how AI models behave from a security perspective. Agent Bando is a lightweight Flask-based web app I built to query Common Vulnerabilities and Exposures (CVEs) like CVE-2025-30400 and generate dynamic tables and AI-powered summaries, using open-source large language models (LLMs) to craft a partial SOC assistant while diving deep into prompt tuning and cost implications.
Agent Bando is a lightweight web tool for SOC analysts to fetch CVE details fast. Type a CVE ID (e.g., CVE-2024-6387), and it delivers:
Dynamic Table: Displays Severity, Impact, Affected Products, MITRE ATT&CK Techniques, and more, with handy tooltips.
AI-Generated Summary: A Markdown-rendered report covering:
Vulnerability description and attack vector
Severity and exploit details
Affected products (e.g., Cisco IOS XE, Juniper Junos OS)
MITRE mappings, threat actors, bug bounties, and SOC actions
The app runs on Flask, with agent.py handling LLM queries, main.py serving the UI, index.html rendering the interface, style.css adding polish, and logger.py logging to agent_bando.log. It’s a simple setup, but it’s a playground for AI experimentation.
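To make that layout concrete, here is a minimal sketch of how main.py could wire the pieces together. The route shape and helper names (get_cve_summary, the log object) are illustrative assumptions, not the repo's exact code.

```python
# main.py, sketched: route and helper names (get_cve_summary, log)
# are illustrative assumptions, not the repo's exact code.
from flask import Flask, render_template, request

from agent import get_cve_summary  # LLM query logic lives in agent.py
from logger import log             # writes to agent_bando.log

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    result = None
    if request.method == "POST":
        cve_id = request.form.get("cve_id", "").strip()
        log.info("Query received for %s", cve_id)
        result = get_cve_summary(cve_id)  # table fields + Markdown summary
    return render_template("index.html", result=result)

if __name__ == "__main__":
    app.run(port=5001)  # the http://localhost:5001 endpoint mentioned below
```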
Why Together AI?
Together AI powers the backend AI muscle. Here's why it's awesome:
Open-Source Model Zoo: Hosts models like Llama-4-Maverick-17B and DeepSeek V3-0324, perfect for testing without heavy infrastructure.
Blazing Fast: CVE queries return in seconds, keeping the UI snappy.
Budget-Friendly: Competitive pricing lets me experiment without draining my wallet.
Simple API: Using the together Python SDK, I hooked up model calls in agent.py with a TOGETHER_API_KEY in .env.
Model Switching: Toggling between Llama-4 and DeepSeek was a one-line change (see the sketch below).
Together AI made it easy to focus on prompt tuning and cost analysis, not server wrangling. It’s a game-changer for AI-driven security projects!
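Here is roughly what the hookup and that one-line model switch look like, sketched with the together Python SDK. The model ID strings are approximations; check Together AI's model catalog for the exact names.

```python
# agent.py, sketched with the together Python SDK. Model ID strings
# are approximations; check Together AI's catalog for exact names.
import os

from dotenv import load_dotenv
from together import Together

load_dotenv()  # reads TOGETHER_API_KEY from .env

# The one-line model switch:
MODEL = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
# MODEL = "deepseek-ai/DeepSeek-V3"

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def query_model(prompt: str, max_tokens: int = 2000) -> str:
    """Send a CVE prompt to the configured model and return its text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```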
Llama-4-Maverick-17B vs. DeepSeek V3-0324: The Experiment
Meta Llama-4-Maverick-17B-128E-Instruct-FP8
Pros: Its mixture-of-experts architecture shines for detailed text and image understanding. Summaries were verbose, with rich MITRE mappings and threat actor insights, especially with larger prompts (max_tokens=2000).
Cons: Slower inference and higher output token costs. Smaller prompts (max_tokens=1000) sometimes cut off key sections.
Output Example (for CVE-2017-0144): Check the image above
Cost Breakdown: Rough Calculation
Pricing: $0.27/1M input tokens, $0.85/1M output tokens.
Standard Query (500 input tokens, 1000 output tokens):
Input: (500 / 1M) * $0.27 = $0.000135
Output: (1000 / 1M) * $0.85 = $0.00085
Total: ~$0.000985 (~0.1 cents/query)
1000 queries: ~$0.985
Larger Prompt (1000 input tokens, 2000 output tokens):
Input: (1000 / 1M) * $0.27 = $0.00027
Output: (2000 / 1M) * $0.85 = $0.0017
Total: ~$0.00197 (~0.2 cents/query)
1000 queries: ~$1.97
DeepSeek-V3-0324
Pros: Faster inference and concise summaries, great for quick SOC queries. Reliable JSON output for tables, even with smaller prompts.
Cons: Less detailed summaries, sometimes skipping threat actors or news. Larger prompts improved output but at a higher cost.
Output Example (for CVE-2017-0144): Check the image above
Cost Breakdown: Rough Calculation
Pricing: $1.25/1M tokens (input and output).
Standard Query (500 input tokens, 1000 output tokens):
Input: (500 / 1M) * $1.25 = $0.000625
Output: (1000 / 1M) * $1.25 = $0.00125
Total: ~$0.001875 (~0.19 cents/query)
1000 queries: ~$1.875
Larger Prompt (1000 input tokens, 2000 output tokens):
Input: (1000 / 1M) * $1.25 = $0.00125
Output: (2000 / 1M) * $1.25 = $0.0025
Total: ~$0.00375 (~0.38 cents/query)
1000 queries: ~$3.75
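These back-of-the-envelope figures are easy to sanity-check in a few lines of Python; the sketch below reproduces the per-query and per-1,000-query costs for both models.

```python
# Back-of-the-envelope cost check for the figures above.
# Prices are USD per 1M tokens: (input, output).
PRICING = {
    "Llama-4-Maverick-17B": (0.27, 0.85),
    "DeepSeek-V3-0324": (1.25, 1.25),
}

SCENARIOS = {"standard": (500, 1000), "larger": (1000, 2000)}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single query at the listed per-token prices."""
    in_price, out_price = PRICING[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in PRICING:
    for label, (inp, out) in SCENARIOS.items():
        c = query_cost(model, inp, out)
        print(f"{model}, {label}: ${c:.6f}/query  ${c * 1000:.3f}/1000 queries")
```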
Prompt Tuning and Costs - $1 Starting Credit
Small Prompts: Early on, I used max_tokens=1000, but summaries were truncated (e.g., missing Threat Actors). Costs were low (~0.1 cents/query for Llama-4), but output quality suffered.
Larger Prompts: Bumping to max_tokens=2000 in agent.py produced complete summaries but doubled output token costs (e.g., $0.0017 for Llama-4, $0.0025 for DeepSeek). I'm now testing max_tokens=3000 for Llama-4 to leverage its verbose strengths, expecting ~$0.0028/query.
Optimization: I tightened prompts to request concise JSON and summaries, cutting output tokens by ~20% (e.g., from 1000 to 800). Logging token counts in agent_bando.log helped balance cost and quality (see the sketch after this list).
Key Insight: Llama-4’s lower output token cost ($0.85 vs. $1.25) makes it cheaper for large prompts, critical for security apps needing detailed outputs.
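The token logging itself is simple, since Together's chat completion responses carry an OpenAI-style usage object. A minimal sketch follows; the repo's actual logger.py setup may differ.

```python
# Token accounting, sketched; the repo's logger.py setup may differ.
import logging

logging.basicConfig(filename="agent_bando.log", level=logging.INFO)
log = logging.getLogger("agent_bando")

def log_usage(cve_id: str, response) -> None:
    """Record token counts from a Together chat completion response."""
    usage = response.usage  # OpenAI-style usage object
    log.info(
        "%s: prompt_tokens=%d completion_tokens=%d total_tokens=%d",
        cve_id, usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
    )
```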
For a SOC running 10,000 queries monthly, Llama-4 costs ~$9.85-$19.70 (depending on prompt size), while DeepSeek runs ~$18.75-$37.50. Llama-4’s cost edge and richer outputs make it tempting for Agent Bando’s next phase of testing and building.
Why I Built Agent Bando
Agent Bando was an experiment to explore AI models through a security lens. SOCs drown in CVE data, and I wanted to see if LLMs could simplify analysis while learning:
Prompt Tuning: How crafting precise prompts (e.g., Markdown + JSON; see the template sketch at the end of this section) affects output quality and cost.
Cost Implications: How token sizes impact budgets, crucial for scaling AI in security.
Model Behavior: How models like Llama-4 and DeepSeek handle structured security data.
This wasn’t about testing security features like prompt injection (that’s next!). Instead, the focus was on using AI to summarize CVEs, map MITRE techniques, and suggest SOC actions. Every prompt tweak or model switch was a lesson for me about AI’s potential and pitfalls in security contexts. Agent Bando’s simplicity belies its value as a learning tool.
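For a flavor of what "Markdown + JSON" prompting means here, below is a hypothetical template in the spirit of what agent.py sends. The field names mirror the table columns above; the wording is my illustration, not the repo's actual prompt.

```python
# Hypothetical prompt template illustrating the Markdown + JSON style;
# not the actual prompt shipped in agent.py.
PROMPT_TEMPLATE = """You are a SOC assistant. For {cve_id}:

1. Return a JSON object with keys: severity, impact,
   affected_products, mitre_techniques.
2. Then write a concise Markdown summary covering the vulnerability
   description, attack vector, exploit details, threat actors,
   and recommended SOC actions.
"""

prompt = PROMPT_TEMPLATE.format(cve_id="CVE-2024-6387")
```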
Next Steps for Agent Bando
Test Security Features: Experiment with prompt injection to probe LLM vulnerabilities, ensuring Agent Bando’s outputs are secure. (DeepSeek will probably be a good model for this)
Integrate Tools: Connect to production data (e.g., SIEM, EDR, MDM, Vuln Mgmt logs, asset inventories) for context-aware CVE insights.
Enhance UI: Add CVE autocomplete, a summary export button, or a dark mode for SOC night owls.
Optimize Costs: Fine-tune prompts further and test Llama-4 with max_tokens=3000 to maximize detail without breaking the bank. (Inference Testing)
Try It
GitHub Repository: https://github.com/Shasheen8/agent-Bando
Quick Start: Clone the repo, set your TOGETHER_API_KEY, and query CVEs at http://localhost:5001.
Together AI’s open-source models made it easy to experiment, and comparing Llama-4 and DeepSeek taught me how to balance prompt size, cost, and output quality (Inference).