Feb 25, 2026 · 10 min read
AI governance that actually works: A practical guide for engineering teams
AI governance is often misunderstood as a compliance exercise or a bureaucratic layer that slows teams down. In reality, it’s an engineering discipline: one that manages the unique risks of probabilistic systems that drift, hallucinate, and act autonomously.
Stefanija Tenekedjieva Haans
Content Lead

In this interview, Petar Stojanovski, Client Engineering Manager at Proxify, breaks down what effective AI governance looks like in practice: how to assess risk, design approval flows that don’t block delivery, and embed guardrails directly into engineering workflows.
When engineering teams hear “AI governance,” what do you think they get wrong most often?
Most teams assume AI governance is primarily about compliance paperwork and checklists. It’s not. It’s about controlling risk introduced by probabilistic systems that behave differently from traditional deterministic software. The biggest misconception is treating AI like another API integration instead of a dynamic system that can drift, hallucinate, or amplify bias over time.
In your experience, what are the real risks of not having any AI governance in place?
The immediate risk is silent failure. Because AI functionality is very difficult to test, AI systems can produce plausible but incorrect outputs that go unnoticed until they affect customers, decisions, or revenue. There is also a compounding risk: data leakage, regulatory exposure, reputational damage, and model drift that degrades performance gradually. Without governance, detection occurs after impact rather than before.
How do you define “good” AI governance for an engineering-led organization?
Good AI governance is operational, not bureaucratic. It embeds risk controls directly into the engineering lifecycle: clear ownership, dataset traceability, model evaluation standards, and monitoring in production. It defines acceptable risk thresholds and measurable quality metrics. Most importantly, it treats AI systems as continuously evolving components, not one-time deployments.
What’s the minimum viable AI governance you’d recommend for a team starting to use AI in production?
At minimum: documented use case intent, explicit risk classification, human-in-the-loop review for high-impact outputs, and production monitoring for drift and failure patterns. Add data provenance tracking and clear rollback mechanisms. Assign a single accountable owner for each AI system. If ownership is ambiguous, governance does not exist.
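The minimum viable checklist above can be sketched as a simple governance record per AI system. This is an illustrative sketch, not a standard schema; the field and system names are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical minimal governance record for one AI system.
# Field names are illustrative, not a standard schema.
@dataclass
class AISystemRecord:
    name: str
    use_case_intent: str          # documented intent
    risk_class: str               # e.g. "low", "medium", "high"
    owner: str                    # single accountable owner
    human_review_required: bool   # human-in-the-loop for high-impact outputs
    rollback_plan: str            # how to disable or revert the system
    data_sources: list = field(default_factory=list)  # provenance tracking

    def is_governed(self) -> bool:
        # Per the interview: if ownership (or rollback) is ambiguous,
        # governance does not exist.
        return bool(self.owner) and bool(self.rollback_plan)

record = AISystemRecord(
    name="support-summarizer",
    use_case_intent="Summarize support tickets for agents",
    risk_class="medium",
    owner="team-support-platform",
    human_review_required=True,
    rollback_plan="Feature flag: disable summaries, fall back to raw tickets",
    data_sources=["ticket-archive"],
)
print(record.is_governed())  # True: owner and rollback are both explicit
```

A record with an empty `owner` field fails the check, which is the point: the register makes ambiguous ownership visible instead of implicit.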
How do you personally assess the risk level of an AI use case? What signals matter most?
I assess impact radius first: who is affected if the model is wrong, and how severely. Then I evaluate reversibility — can mistakes be corrected quickly without lasting harm? I look at data sensitivity, automation level, and whether outputs influence decisions or execute actions directly. High autonomy combined with high impact is the strongest risk signal.
If you had to design a simple 3–4 tier AI risk model, what would clearly separate each tier?
I’d separate tiers by impact severity, autonomy, and reversibility. Low risk is advisory output with low consequence and easy rollback. Medium risk influences decisions or workflows but has guardrails, review, and monitoring. High risk affects regulated domains or sensitive data, has hard-to-reverse harm, and triggers actions (remember the story of the bot selling a car for $1?). It needs formal review, stronger controls, and explicit accountability.
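The three dimensions named above (impact, autonomy, reversibility) can be sketched as a toy classifier. The labels and cutoffs are assumptions for illustration, not a formal taxonomy.

```python
def risk_tier(impact: str, autonomy: str, reversible: bool) -> str:
    """Toy tiering by impact severity, autonomy, and reversibility.
    impact: "low" | "medium" | "high"; autonomy: "advisory" | "acts"."""
    if impact == "high" or (autonomy == "acts" and not reversible):
        # Formal review, stronger controls, explicit accountability.
        return "high"
    if autonomy == "acts" or impact == "medium" or not reversible:
        # Guardrails, review, and monitoring.
        return "medium"
    # Advisory output with low consequence and easy rollback.
    return "low"

print(risk_tier("low", "advisory", True))    # low
print(risk_tier("medium", "advisory", True)) # medium
print(risk_tier("high", "acts", False))      # high
```

Note that "acts autonomously and is hard to reverse" lands in the high tier even at moderate impact, matching the strongest risk signal described earlier.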
What’s an example of an AI use case that seems low-risk but actually isn’t?
Internal copilots that can search or summarize company knowledge often look “safe” because they’re not customer-facing. In practice, they can leak sensitive information through prompts, retrieval mistakes, or overly broad access scopes. They also shape decisions quietly: people trust summaries and stop verifying. The risk is less about output quality and more about access control and the downstream decisions that output influences. One solution is a Zero Trust strategy: don’t assume anything is safe just because it’s internal; verify.
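One way to apply that Zero Trust idea at the retrieval step is to check the requesting user's permissions per document, rather than trusting the copilot's broad service account. A minimal sketch, with hypothetical document names and groups:

```python
# Hypothetical per-document ACL: a document enters the prompt context only
# if the *requesting user* may read it, not merely the copilot's service
# account. Names and groups are illustrative.
DOC_ACL = {
    "handbook.md": {"all-employees"},
    "salaries.xlsx": {"hr-team"},
    "incident-2024-07.md": {"security-team"},
}

def retrievable(doc: str, user_groups: set) -> bool:
    allowed = DOC_ACL.get(doc, set())  # unknown docs default to no access
    return bool(allowed & user_groups)

def build_context(candidate_docs, user_groups):
    # Verify per user, per document: "internal" is not a trust boundary.
    return [d for d in candidate_docs if retrievable(d, user_groups)]

print(build_context(["handbook.md", "salaries.xlsx"], {"all-employees"}))
# ['handbook.md']
```

The default-deny behavior for unknown documents is deliberate: an overly broad access scope is exactly the failure mode described above.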
Where do teams most often overestimate risk — and where do they underestimate it?
Teams often overestimate risk in the most visible place: what the AI says to end users, worrying that a hallucinated answer will cause major damage; for example, a support copilot confidently gives the wrong return-policy detail, which is usually caught and corrected with minimal blast radius. Teams often underestimate risk in tool-action side effects: once a copilot can post to Slack or create tickets, a small ambiguity can turn into a real leak or policy breach. For example, someone asks it to “share the incident update with the team,” and it posts sensitive customer or security details into a broad channel instead of the restricted incident room because there’s no guardrail or confirmation step. Another frequently underestimated area is data risk: who can prompt what, with which context, and what gets logged.
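The missing guardrail in the Slack example can be sketched as a confirmation step on the tool action itself: sensitive content bound for a broad channel blocks and asks, rather than posting silently. The channel names and sensitivity markers are assumptions for illustration.

```python
# Hypothetical confirmation guardrail for a tool-using copilot. Posting
# sensitive content to a broad channel requires explicit human confirmation
# instead of executing silently. Markers and channels are illustrative.
SENSITIVE_MARKERS = ("customer", "credential", "vulnerability")
BROAD_CHANNELS = {"#general", "#engineering"}

def post_message(channel: str, text: str, confirmed: bool = False) -> str:
    sensitive = any(m in text.lower() for m in SENSITIVE_MARKERS)
    if sensitive and channel in BROAD_CHANNELS and not confirmed:
        # Fail safe: block the side effect rather than risk a leak.
        return "BLOCKED: needs human confirmation for broad channel"
    return f"posted to {channel}"

print(post_message("#general", "Customer data affected in incident"))
print(post_message("#incident-room", "Customer data affected in incident"))
```

The guardrail sits on the action, not on the model's wording, which is what contains the blast radius of a small ambiguity in the prompt.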
Who should own AI approval decisions in an engineering organization, and who shouldn’t?
Ownership should sit with an accountable engineering leader for the domain (e.g., the service owner), paired with a lightweight risk-reviewer function (security/privacy/legal, as needed by tier). The decision needs to be close to the system, the data, and the blast radius — not centralized in a committee detached from delivery. It shouldn’t be owned solely by legal, procurement, or an “AI council” that can’t enforce technical controls. Approval is an engineering decision with cross-functional input, not a compliance handoff.
What information do approvers actually need to make good decisions?
The same applies as in classic software development: approvers need a crisp description of the use case, the failure modes that matter, and the blast radius if it goes wrong. They also need data classification and data flow: what enters the system, what leaves it, what is stored, and who can access it. A baseline evaluation plan (how you’ll measure quality) and a production plan (monitoring, rollback, incident ownership) are essential. If you can’t explain those on a page, the system isn’t ready for approval.
What are the non-negotiable boundaries you always set for AI usage?
No uncontrolled access to sensitive data, and no training or retention on proprietary inputs without explicit agreement and visibility. No autonomous actions that change state in production without guardrails, auditing, and a clear rollback path. Outputs must be traceable to their sources when retrieval is involved, and the system must fail safely when confidence is low or context is missing. And ownership must be explicit: every AI capability has a named team and an on-call path like any other production component.
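Two of those boundaries, traceability of retrieval-backed outputs and failing safely on low confidence, can be sketched in one wrapper. The response shape and the 0.7 threshold are assumptions for illustration.

```python
# Sketch of two non-negotiables: outputs must be traceable to sources when
# retrieval is involved, and the system must fail safely when confidence is
# low. The threshold and response shape are illustrative assumptions.
def answer_with_guardrails(answer: str, sources: list, confidence: float):
    if not sources:
        # No traceable sources: refuse rather than emit an untraceable claim.
        return {"status": "refused", "reason": "no traceable sources"}
    if confidence < 0.7:
        # Low confidence: escalate to a human instead of answering.
        return {"status": "escalated", "reason": "low confidence"}
    return {"status": "ok", "answer": answer, "sources": sources}

print(answer_with_guardrails("Refunds take 14 days", ["policy.md"], 0.92))
print(answer_with_guardrails("Refunds take 14 days", [], 0.92))
print(answer_with_guardrails("Refunds take 14 days", ["policy.md"], 0.40))
```

The design choice is that both failure paths are explicit statuses an on-call owner can monitor, rather than a plausible answer with no provenance.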
How should teams think about data sensitivity when using AI tools and models?
Treat prompts and context as data exfiltration vectors, not “temporary text.” Classify input data, then map it to allowed models and tooling based on where data goes, how it’s logged, and what the vendor retains. Retrieval is the high-risk multiplier: it silently expands the amount and variety of data exposed per request. The safest default is least-privilege access to context, strict scope boundaries, and a clear policy for what can never be sent to external systems.
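Mapping input classification to allowed destinations can be as small as a default-deny lookup table. The class names and tool names here are hypothetical; the point is the shape of the policy.

```python
# Hypothetical classification-to-destination policy: each data class maps to
# the set of tools whose logging and retention profile is acceptable for it.
# Class and tool names are illustrative.
POLICY = {
    "public":       {"external-llm", "internal-llm"},
    "internal":     {"internal-llm"},
    "confidential": {"internal-llm"},
    "restricted":   set(),   # never leaves controlled systems
}

def may_send(data_class: str, destination: str) -> bool:
    # Default deny: unclassified or unknown data goes nowhere.
    return destination in POLICY.get(data_class, set())

print(may_send("public", "external-llm"))      # True
print(may_send("restricted", "internal-llm"))  # False
print(may_send("unclassified", "internal-llm"))  # False
```

In practice this table would sit in a shared library called at the point where context is assembled, so the retrieval multiplier described above is checked per request, not per project.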
When is human-in-the-loop genuinely necessary, and when is it just compliance theater?
Human-in-the-loop is genuinely necessary when a person can actually prevent harm: before high-impact actions (sending messages, changing records, triggering workflows), when sensitive data might be exposed, or when the decision is ambiguous enough that context and judgment matter. It’s also valuable during testing and smoke tests, where humans reliably catch permission leaks, unsafe tool routing, and bad defaults that automated checks miss. It becomes compliance theater when the “review” is a stamp of approval that doesn’t change outcomes, lacks context or time, or occurs after the risky step has already happened. If removing the human step wouldn’t measurably increase risk, the step is theater and the time is better spent elsewhere.
What monitoring or observability do teams most often underestimate for AI systems?
Teams most often underestimate how much monitoring AI systems need beyond “did the model respond”: they miss observability for data access (what sources were retrieved and why), tool execution (what actions were taken and with what parameters), and safety/quality drift over time (prompt changes, model updates, shifting user behavior). They also underinvest in detecting leakage and abuse patterns, like repeated queries probing for sensitive info or unusual spikes in retrieval from restricted corpora. And they forget that “old-fashioned tools” still do a lot of the work here: strong auth, least-privilege permissions, audit logs, rate limits, anomaly detection, SIEM alerts, and good dashboards/tracing are often more important than fancy model metrics.
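One of the "old-fashioned" detections named above, spotting repeated probes of restricted corpora, is a few lines over an audit log. A toy sketch; the event shape and threshold are assumptions, and a real system would feed this from audit logs into a SIEM.

```python
from collections import Counter

# Toy leakage-probe detector: flag users whose retrievals from a restricted
# corpus exceed a threshold. Event shape and threshold are illustrative; in
# production this would run over audit logs and raise a SIEM alert.
def flag_probing(events, threshold=3):
    restricted = Counter(
        e["user"] for e in events if e["corpus"] == "restricted"
    )
    return sorted(u for u, n in restricted.items() if n >= threshold)

events = [
    {"user": "alice", "corpus": "public"},
    {"user": "bob",   "corpus": "restricted"},
    {"user": "bob",   "corpus": "restricted"},
    {"user": "bob",   "corpus": "restricted"},
]
print(flag_probing(events))  # ['bob']
```

Nothing here is model-specific, which is the interview's point: audit logs, rate limits, and anomaly detection often matter more than fancy model metrics.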
If you had to write a one-page AI policy that engineers would actually read, what would it include?
If I had to write a one-page AI policy that engineers would actually read, it would start with a simple decision tree: “Can I use AI here?”—focused on what data is being shared, what the output will be used for, and who is accountable for it. It would clearly define non-negotiables like never sharing secrets or client-sensitive data and always treating AI output as untrusted until reviewed. The policy would list approved tools and provide concrete engineering examples of where AI is helpful versus risky. Most importantly, it would emphasize responsible enablement over restriction, making it clear that AI is encouraged—just with guardrails.
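The decision tree on that one page can itself be expressed as a few lines an engineer can read in seconds. The three questions mirror the policy described above; the wording and return strings are illustrative.

```python
# Sketch of the one-page "Can I use AI here?" decision tree as code.
# The questions mirror the policy above; wording is illustrative.
def can_i_use_ai(shares_secrets: bool, output_reviewed: bool,
                 owner_named: bool) -> str:
    if shares_secrets:
        return "no: never share secrets or client-sensitive data"
    if not owner_named:
        return "no: name an accountable owner first"
    if not output_reviewed:
        return "yes, but treat output as untrusted until reviewed"
    return "yes"

print(can_i_use_ai(shares_secrets=False, output_reviewed=True,
                   owner_named=True))  # yes
```

Note the default is "yes" with guardrails, not "no": the tree encodes responsible enablement rather than restriction.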
What policy rules or governance patterns have you seen backfire in practice?
One governance pattern I’ve seen consistently backfire is zero-exception policy rules with no contextual flexibility around tools, environments, or integrations. The intent is risk reduction, but in practice, when engineers are blocked from solving real problems within sanctioned paths, they route around the system. Shadow workflows emerge, unofficial integrations appear, and security becomes adversarial instead of embedded. The organization optimizes for policy compliance optics rather than actual risk management. In complex systems, edge cases are inevitable, and treating every exception as a violation rather than a learning signal prevents the policy from evolving. The strongest governance models don’t aim for zero deviation; they create structured, visible exception paths that preserve ownership while keeping risk transparent and managed.
How should AI governance change as a company scales or becomes more regulated?
You need more standardization, not more meetings. As scale grows, governance should move from ad-hoc reviews to tier-based controls, reusable templates, and audited evidence generation baked into CI/CD. In regulated environments, you formalize data lineage, evaluation documentation, and change management, and you tighten vendor and retention requirements. The key shift is treating governance outputs as artifacts you can produce reliably, not bespoke narratives recreated for every release.
How do you embed AI governance into everyday engineering workflows without slowing teams down?
Make the “right path” the easiest path: pre-approved components, shared libraries for redaction and logging, and default monitoring dashboards. Use lightweight gates in the pipeline (risk tier + checklist) rather than manual meetings, with clear auto-approve criteria and exception handling. Put governance into code review norms: prompts, retrieval scope, evaluations, and rollback plans are reviewed like any other change. Speed comes from predictable controls and automation, not from telling people to be careful.
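A lightweight pipeline gate like the one described (risk tier plus checklist, with clear auto-approve criteria) can be sketched in a few lines. The checklist items are assumptions; a real gate would read them from CI metadata.

```python
# Toy CI gate: auto-approve low-risk changes with a complete checklist;
# route the rest to risk review. Checklist items are illustrative.
REQUIRED = {"eval_plan", "rollback_plan", "monitoring"}

def pipeline_gate(risk_tier: str, checklist: set) -> str:
    missing = REQUIRED - checklist
    if missing:
        return f"blocked: missing {sorted(missing)}"
    if risk_tier == "low":
        return "auto-approved"        # the fast, predictable path
    return "needs risk review"        # human gate only where tier demands it

print(pipeline_gate("low", {"eval_plan", "rollback_plan", "monitoring"}))
print(pipeline_gate("high", {"eval_plan", "rollback_plan", "monitoring"}))
print(pipeline_gate("low", {"eval_plan"}))
```

Speed comes from the gate being deterministic: engineers can predict the outcome before opening the pull request, so governance stops being a meeting and becomes a check.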
If you could give one piece of advice to a CTO designing AI governance today, what would it be?
Design governance around operational failure modes, not around organizational anxiety. Start with a small risk model, enforce a few hard boundaries, and instrument everything so you can see what’s happening in production. Then iterate based on incidents, metrics, and real usage patterns. If you can’t measure it and own it, you can’t govern it.