live Agentic QA & Automation

Browser Workflow QA Agent

An AI agent that performs structured quality assurance on live websites. Navigates pages, inspects console state, captures desktop and mobile screenshots, and produces evidence-backed reports with PASS, WARN, or FAIL verdicts. Triggered by a single natural-language message.

Python · Browser Automation · Agentic Workflows · Telegram · QA Engineering · TDD

"Telling an agent 'it looks good' is not a QA workflow. Evidence is: screenshots, console logs, structured verdicts, and reproducible steps."

01 The Problem

AI agents are increasingly used for software quality checks. Most produce unstructured chat output: "the page looks fine", "no obvious errors". That is not QA. It is a guess with no audit trail, no reproducibility, and no way to track regressions over time.

The real problem is that agents default to summarizing what they see rather than capturing evidence of it. A QA workflow that lives only in a chat window is worthless the moment the window closes.

02 What I Built

A structured QA agent triggered by a natural-language Telegram message: "Jarvis, QA this URL: [url] Goal: [objective]". The agent executes a deterministic inspection workflow and saves a full evidence bundle rather than only replying in chat.
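A minimal sketch of how that trigger line might be turned into a structured request, assuming the message always follows the quoted "URL: ... Goal: ..." shape. The names QARequest, TRIGGER_RE, and parse_trigger are illustrative and not taken from the project.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QARequest:
    """Structured form of a one-line QA trigger (hypothetical type)."""
    url: str
    goal: str


# Hypothetical pattern for the trigger phrasing quoted above.
# Case-insensitive, and tolerant of extra whitespace between the parts.
TRIGGER_RE = re.compile(
    r"^\s*jarvis,\s*qa\s+this\s+url:\s*(?P<url>https?://\S+)"
    r"\s+goal:\s*(?P<goal>.+?)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse_trigger(message: str) -> Optional[QARequest]:
    """Return a QARequest if the message matches the trigger, else None."""
    match = TRIGGER_RE.match(message)
    if match is None:
        return None
    return QARequest(url=match.group("url"), goal=match.group("goal"))
```

Returning None for anything that does not match keeps ordinary chat messages from accidentally starting a QA run.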

Browser Inspection

Navigates URLs, inspects rendered DOM, audits console for errors and warnings

Screenshot Capture

Desktop and mobile viewport screenshots saved as persistent evidence artifacts

Verdict Engine

Deterministic PASS / WARN / FAIL classification with severity ratings and rationale

Report Helper

Reusable Python module for slug generation, markdown formatting, and file persistence

The workflow terminates with a structured Markdown report saved to durable storage and a concise summary sent back via Telegram. Findings persist whether or not the conversation continues.
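As a rough illustration of that final step, the sketch below renders a small Markdown report and writes it to disk. The section layout, the timestamped filename pattern, and the function names render_report and persist_report are assumptions for illustration, not the actual qa_report_helper.py API.

```python
from datetime import datetime
from pathlib import Path


def render_report(url: str, goal: str, verdict: str,
                  findings: list[tuple[str, str]]) -> str:
    """Build a Markdown report from (severity, description) pairs."""
    lines = [
        f"# QA Report: {url}",
        "",
        f"- Goal: {goal}",
        f"- Verdict: {verdict}",
        "",
        "## Findings",
        "",
    ]
    lines += [f"- [{severity}] {text}" for severity, text in findings]
    return "\n".join(lines) + "\n"


def persist_report(markdown: str, out_dir: Path, slug: str,
                   when: datetime) -> Path:
    """Write the report under out_dir and return the path that was written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: timestamp first so files sort by run time.
    path = out_dir / f"{when:%Y%m%d-%H%M%S}-{slug}.md"
    path.write_text(markdown, encoding="utf-8")
    return path
```

Passing the timestamp in as an argument, instead of reading the clock inside the function, keeps the output filename deterministic under test.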

Figure: QA Agent workflow pipeline. Telegram ("Jarvis, QA this…") → Hermes / Jarvis (interprets objective) → Browser Automation (navigate · inspect · capture: console audit, screenshots, security checks) → qa_report_helper.py (verdict · slug · markdown · persist) → Markdown report, screenshots bundle, Telegram summary.

03 Evidence from a Real Run

The agent ran a full QA + security review against realgradientdescent.tech, capturing desktop and mobile state, auditing security headers, and testing the AI chat feature. Verdict: PASS with recommendations.

desktop · home · PASS
[Screenshot: realgradientdescent.tech homepage, desktop viewport, captured by the QA agent]
mobile · projects · PASS
[Screenshot: realgradientdescent.tech projects page, mobile viewport, captured by the QA agent]
findings summary: realgradientdescent.tech

PASS: HTTPS valid, HSTS enabled, clickjacking protections active, content-type sniffing disabled

PASS: AI chat API resisted prompt-injection attempts; rate limiting and auth controls in place

INFO: SEO and discoverability files noted for future addition (robots.txt, sitemap.xml)
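The header findings above map onto three standard response headers. The sketch below shows one way such a check could be expressed as a pure function over an already-fetched header mapping. It is an assumption about how the audit might work, not the agent's actual implementation, and audit_security_headers is a hypothetical name.

```python
from typing import Mapping


def audit_security_headers(headers: Mapping[str, str]) -> dict[str, bool]:
    """Check a response-header mapping for three common protections."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}
    xfo = h.get("x-frame-options", "")
    csp = h.get("content-security-policy", "")
    return {
        # Presence check only; a max-age of 0 would need separate handling.
        "hsts": "max-age=" in h.get("strict-transport-security", ""),
        # Either X-Frame-Options or a CSP frame-ancestors directive counts.
        "clickjacking": xfo in ("deny", "sameorigin") or "frame-ancestors" in csp,
        "nosniff": h.get("x-content-type-options", "") == "nosniff",
    }
```

Keeping the check separate from the HTTP request means it can be tested against fixed header dictionaries without touching the network.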

04 Key Design Decisions

Evidence-first, not chat-first. The workflow is designed around what gets saved, not what gets said. Every run produces a Markdown report, a screenshots bundle, and a metadata file, all independent of whether the conversation continues. QA that exists only as chat history is not QA.

Small, testable reporting substrate. Rather than embedding report logic inside agent prompts, I extracted it into a reusable Python helper (qa_report_helper.py) with deterministic slug generation, verdict classification, and structured Markdown output. TDD discipline: tests cover filename safety, verdict logic, required report sections, and output naming conventions.
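To make the "filename safety" tests concrete, here is a hedged sketch of deterministic slug generation with two accompanying tests. The function name slugify, the 60-character limit, and the "untitled" fallback are assumptions. The real qa_report_helper.py may differ.

```python
import re
import unicodedata


def slugify(text: str, max_length: int = 60) -> str:
    """Turn arbitrary text (such as a URL) into a filename-safe slug."""
    # Fold accented characters to ASCII; drop anything that cannot be folded.
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse each run of non-alphanumerics (slashes, dots, spaces) to a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    # Truncate, then trim any hyphen left dangling by the cut.
    return slug[:max_length].rstrip("-") or "untitled"


def test_slug_is_filename_safe():
    slug = slugify("https://realgradientdescent.tech/projects?x=../../etc")
    # Only lowercase alphanumerics separated by single hyphens survive.
    assert re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug)


def test_slug_is_deterministic():
    url = "https://realgradientdescent.tech/projects"
    assert slugify(url) == slugify(url) == "https-realgradientdescent-tech-projects"
```

Because path separators and dots never reach the output, a hostile or malformed URL cannot steer the report outside its output directory.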

Natural-language trigger, structured execution. The Telegram trigger is intentionally casual: one line, conversational. The agent's job is to translate that into a deterministic inspection checklist. The interface is human. The process is not.

05 Challenges

The boundary between "agent observes" and "agent concludes" required deliberate design. Without explicit constraints, agents tend to collapse multi-step inspection into a single verdict, skipping the evidence capture that makes the verdict trustworthy. The workflow enforces screenshot capture and console logging as required steps, not optional ones.
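One way to enforce that boundary in code is to refuse to emit a verdict until the required evidence exists. The sketch below is an illustration under assumed names (EvidenceBundle, MissingEvidenceError, conclude), not the project's actual mechanism.

```python
from dataclasses import dataclass
from typing import Optional


class MissingEvidenceError(RuntimeError):
    """Raised when a verdict is requested before required evidence exists."""


@dataclass
class EvidenceBundle:
    desktop_screenshot: Optional[str] = None  # path to the saved image
    mobile_screenshot: Optional[str] = None
    # None means "never captured"; an empty list means "captured, no messages".
    console_log: Optional[list[str]] = None

    def missing(self) -> list[str]:
        """Names of required evidence fields that have not been captured."""
        return [name for name, value in vars(self).items() if value is None]


def conclude(bundle: EvidenceBundle, verdict: str) -> dict:
    """Attach a verdict to the evidence, or fail if evidence is incomplete."""
    missing = bundle.missing()
    if missing:
        raise MissingEvidenceError(f"cannot conclude without: {', '.join(missing)}")
    return {"verdict": verdict, "evidence": vars(bundle).copy()}
```

Distinguishing None from an empty console log matters here: a clean console is evidence, while an unchecked console is not.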

Structuring findings by severity (high / medium / low / info) rather than as a flat list required judgment: what makes something actionable vs. informational? The verdict schema needed to match what a real QA engineer would care about, not just what the agent happened to notice.
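A sketch of how severity-rated findings could roll up into a single verdict. The thresholds here (any high finding fails the run, any medium finding warns, low and info findings still pass) are an assumption chosen for illustration. The source does not specify the exact mapping.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify(severities: Iterable[Severity]) -> str:
    """Map the worst severity in a run to PASS, WARN, or FAIL."""
    # An empty run has nothing worse than INFO, so it passes.
    worst = max(severities, default=Severity.INFO)
    if worst >= Severity.HIGH:
        return "FAIL"
    if worst >= Severity.MEDIUM:
        return "WARN"
    return "PASS"
```

Because the verdict depends only on the worst severity, the same findings always produce the same result regardless of the order in which the agent noticed them.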

06 What I Learned

Agents need output contracts

Without a defined schema for what a "completed QA run" looks like, agents improvise. Improvised QA is unreliable.

TDD applies to agentic helpers

Testing the reporting substrate independently from the agent makes the whole workflow more predictable and debuggable.

Evidence changes accountability

When a QA run produces screenshots and structured findings, you can compare runs, track regressions, and hold the system accountable.

Real runs reveal real issues

Running the agent against a live production site (this portfolio) surfaced real findings that chat-based review would have missed.
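The run-to-run comparison mentioned under "Evidence changes accountability" could look something like the sketch below, which assumes each run is stored as a mapping from check name to verdict. The data shape and the name diff_runs are hypothetical.

```python
def diff_runs(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two runs keyed by check name, with verdict strings as values."""
    rank = {"PASS": 0, "WARN": 1, "FAIL": 2}
    shared = previous.keys() & current.keys()
    return {
        # A check regressed if its verdict got strictly worse.
        "regressed": sorted(c for c in shared if rank[current[c]] > rank[previous[c]]),
        "improved": sorted(c for c in shared if rank[current[c]] < rank[previous[c]]),
        # Checks present in only one of the two runs.
        "new": sorted(current.keys() - previous.keys()),
        "dropped": sorted(previous.keys() - current.keys()),
    }
```

A comparison like this is only possible because each run leaves behind structured findings instead of free-form chat text.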

07 Why This Matters

This project demonstrates practical agentic automation applied to a real engineering need: making quality assurance auditable, reproducible, and persistent. It is not a demo that shows what an agent can say; it is a workflow that shows what an agent can prove.

The same patterns (natural-language triggers, deterministic execution workflows, evidence capture, structured verdicts, and testable helper substrates) generalize to any domain where agents need to produce trustworthy outputs rather than impressive-sounding ones.