Browser Workflow QA Agent
An AI agent that performs structured quality assurance on live websites. It navigates pages, inspects console state, captures desktop and mobile screenshots, and produces evidence-backed reports with PASS, WARN, or FAIL verdicts. Each run is triggered by a single natural-language message.
"Telling an agent 'it looks good' is not a QA workflow. Evidence is: screenshots, console logs, structured verdicts, and reproducible steps."
01 The Problem
AI agents are increasingly used for software quality checks. Most produce unstructured chat output: "the page looks fine", "no obvious errors". That is not QA. It is a guess with no audit trail, no reproducibility, and no way to track regressions over time.
The real problem is that agents default to summarizing what they see rather than capturing evidence of it. A QA workflow that lives only in a chat window is worthless the moment the window closes.
02 What I Built
A structured QA agent triggered by a natural-language Telegram message:
"Jarvis, QA this URL: [url] Goal: [objective]".
The agent executes a deterministic inspection workflow and saves a full evidence bundle, not just a chat reply.
Browser Inspection
Navigates URLs, inspects the rendered DOM, audits the console for errors and warnings
Screenshot Capture
Desktop and mobile viewport screenshots saved as persistent evidence artifacts
Verdict Engine
Deterministic PASS / WARN / FAIL classification with severity ratings and rationale
Report Helper
Reusable Python module for slug generation, markdown formatting, and file persistence
The workflow terminates with a structured Markdown report saved to durable storage and a concise summary sent back via Telegram. Findings persist whether or not the conversation continues.
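The Markdown formatting step of the report helper could look roughly like the sketch below. The function name `render_report`, its parameters, and the section headings are assumptions made for illustration; the actual layout produced by the helper is not shown in this write-up.

```python
def render_report(url: str, goal: str, verdict: str,
                  findings: list[tuple[str, str]],
                  screenshots: list[str]) -> str:
    """Build a Markdown report with fixed sections (hypothetical layout)."""
    lines = [
        f"# QA Report: {url}",
        "",
        f"**Goal:** {goal}",
        "",
        "## Verdict",
        "",
        verdict,
        "",
        "## Findings",
        "",
    ]
    if findings:
        # Each finding is a (severity, description) pair.
        lines += [f"- [{severity}] {text}" for severity, text in findings]
    else:
        lines.append("- None recorded")
    lines += ["", "## Evidence", ""]
    lines += [f"- {path}" for path in screenshots]
    return "\n".join(lines) + "\n"


report = render_report(
    "https://example.com",          # placeholder URL, not a real run
    "smoke test",
    "PASS",
    [("info", "robots.txt missing")],
    ["screenshots/desktop.png", "screenshots/mobile.png"],
)
```

Keeping the section order fixed is what makes "required report sections" something a unit test can check.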
03 Evidence from a Real Run
The agent ran a full QA + security review against realgradientdescent.tech, capturing desktop and mobile state, auditing security headers, and testing the AI chat feature.
Verdict: PASS with recommendations.
HTTPS valid, HSTS enabled, clickjacking protections active, content-type sniffing disabled
AI chat API resisted prompt-injection attempts; rate limiting and auth controls in place
SEO and discoverability files noted for future addition (robots.txt, sitemap.xml)
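The header checks listed above can be expressed as a pure function over response headers. This is a minimal sketch, not the agent's actual code: `audit_security_headers` is a hypothetical name, the sample headers are illustrative rather than captured from the real run, and it covers only three standard headers (`Strict-Transport-Security`, `X-Frame-Options` or a CSP `frame-ancestors` directive, and `X-Content-Type-Options`).

```python
from typing import Mapping


def audit_security_headers(headers: Mapping[str, str]) -> dict[str, bool]:
    """Return a check-name -> passed map for three common security headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    csp = h.get("content-security-policy", "").lower()
    return {
        # HSTS counts as enabled when a max-age directive is present.
        "hsts": "max-age=" in h.get("strict-transport-security", "").lower(),
        # Clickjacking protection via X-Frame-Options or CSP frame-ancestors.
        "clickjacking": (
            h.get("x-frame-options", "").upper() in {"DENY", "SAMEORIGIN"}
            or "frame-ancestors" in csp
        ),
        # Content-type sniffing is disabled only by the exact value "nosniff".
        "nosniff": h.get("x-content-type-options", "").lower() == "nosniff",
    }


# Illustrative input only.
sample = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Frame-Options": "DENY",
    "X-Content-Type-Options": "nosniff",
}
result = audit_security_headers(sample)
```

Because the function takes a plain mapping, it can be tested without any network access and fed from whatever HTTP client the agent uses.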
04 Key Design Decisions
Evidence-first, not chat-first. The workflow is designed around what gets saved, not what gets said. Every run produces a Markdown report, a screenshots bundle, and a metadata file, all independent of whether the conversation continues. QA that exists only as chat history is not QA.
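One way to persist such a bundle is sketched below. The directory layout, the file names (`report.md`, `metadata.json`, `screenshots/`), and the `save_bundle` function are assumptions for illustration; the write-up does not specify how the real bundle is organized.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_bundle(root: Path, slug: str, report_md: str,
                screenshots: dict[str, bytes], verdict: str) -> Path:
    """Write report, screenshots, and metadata under root/slug (assumed layout)."""
    run_dir = root / slug
    shots_dir = run_dir / "screenshots"
    shots_dir.mkdir(parents=True, exist_ok=True)

    (run_dir / "report.md").write_text(report_md, encoding="utf-8")

    # Screenshot keys are viewport names, values are raw PNG bytes.
    for name, data in screenshots.items():
        (shots_dir / f"{name}.png").write_bytes(data)

    metadata = {
        "slug": slug,
        "verdict": verdict,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "screenshots": sorted(f"{name}.png" for name in screenshots),
    }
    (run_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return run_dir
```

Nothing in this function depends on the chat session, which is the point: the bundle exists on disk whether or not a reply is ever sent.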
Small, testable reporting substrate. Rather than embedding report logic inside agent prompts, I extracted it into a reusable Python helper (qa_report_helper.py) with deterministic slug generation, verdict classification, and structured Markdown output. TDD discipline: tests cover filename safety, verdict logic, required report sections, and output naming conventions.
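To make the idea concrete, here is a sketch of what slug generation and verdict classification might look like, with tests in the same spirit. The function names and the severity-to-verdict rule (any high finding fails, any medium finding warns) are assumptions; the real API of qa_report_helper.py is not shown here.

```python
import re
from urllib.parse import urlparse


def make_slug(url: str) -> str:
    """Turn a URL into a lowercase, filename-safe slug (hypothetical helper)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return slug or "unnamed"


def classify_verdict(severities: list[str]) -> str:
    """Map finding severities to a verdict. The rule itself is an assumption."""
    if "high" in severities:
        return "FAIL"
    if "medium" in severities:
        return "WARN"
    return "PASS"


def test_slug_is_filename_safe():
    slug = make_slug("https://Example.com/Pricing/")
    assert slug == "example-com-pricing"
    assert re.fullmatch(r"[a-z0-9-]+", slug)


def test_verdict_logic():
    assert classify_verdict([]) == "PASS"
    assert classify_verdict(["low", "info"]) == "PASS"
    assert classify_verdict(["low", "medium"]) == "WARN"
    assert classify_verdict(["medium", "high"]) == "FAIL"


test_slug_is_filename_safe()
test_verdict_logic()
```

Both functions are pure, so the same input always yields the same file name and the same verdict, regardless of what the agent says in chat.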
Natural-language trigger, structured execution. The Telegram trigger is intentionally casual: one line, conversational. The agent's job is to translate that into a deterministic inspection checklist. The interface is human. The process is not.
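The translation from a casual message to a fixed checklist can be sketched as follows. The regular expression follows the trigger format quoted earlier; the `QARequest` type, the `parse_trigger` function, and the step names in `CHECKLIST` are hypothetical, and a real agent may rely on the language model rather than a regex for this step.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches "... QA this URL: <url> Goal: <objective>".
TRIGGER = re.compile(
    r"QA this URL:\s*(?P<url>\S+)\s+Goal:\s*(?P<goal>.+)",
    re.IGNORECASE | re.DOTALL,
)

# Assumed step names; the order is fixed so every run does the same work.
CHECKLIST = (
    "navigate",
    "inspect_dom",
    "audit_console",
    "screenshot_desktop",
    "screenshot_mobile",
    "classify_verdict",
    "save_report",
)


@dataclass(frozen=True)
class QARequest:
    url: str
    goal: str
    steps: tuple


def parse_trigger(message: str) -> Optional[QARequest]:
    """Return a structured request, or None if the message is not a trigger."""
    match = TRIGGER.search(message)
    if match is None:
        return None
    return QARequest(match["url"], match["goal"].strip(), CHECKLIST)


request = parse_trigger(
    "Jarvis, QA this URL: https://example.com Goal: check the signup form"
)
```

The message varies from run to run; the checklist attached to it does not.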
05 Challenges
The boundary between "agent observes" and "agent concludes" required deliberate design. Without explicit constraints, agents tend to collapse multi-step inspection into a single verdict, skipping the evidence capture that makes the verdict trustworthy. The workflow enforces screenshot capture and console logging as required steps, not optional ones.
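One way to enforce that constraint is a gate that refuses to issue a verdict until the required evidence exists. The sketch below assumes a hypothetical `RunEvidence` record and `require_evidence` check; it illustrates the idea rather than the workflow's actual mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

REQUIRED_VIEWPORTS = ("desktop", "mobile")


class MissingEvidenceError(RuntimeError):
    """Raised when a verdict is requested before evidence was captured."""


@dataclass
class RunEvidence:
    # viewport name -> path of the saved screenshot
    screenshots: Dict[str, str] = field(default_factory=dict)
    # None means "never captured"; an empty list means "captured, no messages"
    console_log: Optional[List[str]] = None


def require_evidence(evidence: RunEvidence) -> None:
    """Raise unless both screenshots and a console log are present."""
    missing = [
        f"screenshot:{viewport}"
        for viewport in REQUIRED_VIEWPORTS
        if viewport not in evidence.screenshots
    ]
    if evidence.console_log is None:
        missing.append("console_log")
    if missing:
        raise MissingEvidenceError(
            "cannot issue a verdict without: " + ", ".join(missing)
        )
```

Calling `require_evidence` immediately before verdict classification turns "skipped the screenshots" from a silent shortcut into a hard failure.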
Structuring findings by severity (high / medium / low / info) rather than as a flat list required judgment: what makes something actionable vs. informational? The verdict schema needed to match what a real QA engineer would care about, not just what the agent happened to notice.
06 What I Learned
Agents need output contracts
Without a defined schema for what a "completed QA run" looks like, agents improvise. Improvised QA is unreliable.
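An output contract can be as simple as a validator that lists every way a run result deviates from the expected shape. The required keys and the `contract_violations` function below are assumptions for illustration; only the verdict values and severity levels come from this write-up.

```python
ALLOWED_VERDICTS = {"PASS", "WARN", "FAIL"}
ALLOWED_SEVERITIES = {"high", "medium", "low", "info"}
# Assumed field names for a completed run.
REQUIRED_KEYS = ("url", "goal", "verdict", "findings", "screenshots")


def contract_violations(run: dict) -> list[str]:
    """Return human-readable violations; an empty list means the run conforms."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in run]
    if "verdict" in run and run["verdict"] not in ALLOWED_VERDICTS:
        errors.append(f"invalid verdict: {run['verdict']!r}")
    for i, finding in enumerate(run.get("findings", [])):
        if finding.get("severity") not in ALLOWED_SEVERITIES:
            errors.append(f"finding {i}: invalid severity")
        if not finding.get("title"):
            errors.append(f"finding {i}: missing title")
    return errors


improvised = {
    "url": "https://example.com",
    "verdict": "LGTM",
    "findings": [{"severity": "urgent"}],
}
problems = contract_violations(improvised)
```

Here `problems` flags the missing `goal` and `screenshots` keys, the free-form verdict, and the malformed finding, which is exactly the kind of improvisation a contract is meant to catch.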
TDD applies to agentic helpers
Testing the reporting substrate independently from the agent makes the whole workflow more predictable and debuggable.
Evidence changes accountability
When a QA run produces screenshots and structured findings, you can compare runs, track regressions, and hold the system accountable.
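Comparing runs becomes a set operation once findings are structured. The sketch below assumes each finding carries a stable `id` field, which is not something this write-up specifies, and `diff_findings` is a hypothetical name.

```python
def diff_findings(previous: list[dict], current: list[dict]) -> dict[str, list[str]]:
    """Classify finding ids as new, resolved, or persisting between two runs."""
    prev_ids = {finding["id"] for finding in previous}
    curr_ids = {finding["id"] for finding in current}
    return {
        "new": sorted(curr_ids - prev_ids),          # possible regressions
        "resolved": sorted(prev_ids - curr_ids),     # fixed since last run
        "persisting": sorted(curr_ids & prev_ids),   # still open
    }


# Illustrative data, not taken from a real run.
last_week = [{"id": "missing-robots-txt"}, {"id": "console-404-favicon"}]
today = [{"id": "missing-robots-txt"}, {"id": "mobile-nav-overflow"}]
changes = diff_findings(last_week, today)
```

Anything that lands in `new` is a candidate regression; none of this is possible when the previous run exists only as chat history.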
Real runs reveal real issues
Running the agent against a live production site (this portfolio) surfaced real findings that chat-based review would have missed.
07 Why This Matters
This project demonstrates practical agentic automation applied to a real engineering need: making quality assurance auditable, reproducible, and persistent. It is not a demo that shows what an agent can say. It is a workflow that shows what an agent can prove.
Natural-language triggers, deterministic execution workflows, evidence capture, structured verdicts, and testable helper substrates generalize to any domain where agents need to produce trustworthy outputs rather than impressive-sounding ones.