Governance · Reliability · AI safety

AI you can run, audit, and explain.

Building software is half the job. The other half — observability, governance, AI safety, audit-grade trails, change management, incident response — is what makes a system you can bet the business on. This page is the substrate beneath every engagement we run.

SLO TARGET · 99.95% · 30-day rolling window, per service
DEPLOY ROLLBACK · <90s · marketing site; <5 min for platforms
AI ACTIONS · HITL · human approval on every destructive or irreversible path
AUDIT RETENTION · 7 yrs · immutable logs, cryptographically signed
01 — Why this exists

You can’t bet the business on software you can’t operate.

Most software vendors hand over a working build and disappear. That’s fine for a marketing site. It’s a disaster for the dispatch board your freight company runs on, the inspection platform 200 organizations submit AMPP reports through, or the intake system that drives signed cases at a law firm. The day-two question — “is this thing still working, who’s watching, and what happens when it isn’t?” — is where most operational software actually fails.

AI compounds the problem. An LLM that drafts an insurance demand letter, an agent that posts loads to a board, a vision model classifying coating defects — each adds a new failure surface. Hallucinations, prompt injection, model drift, vendor outages, silent degradation. The frontier labs ship models; nobody ships the operating substrate that makes those models safe in your business.

This page is that substrate. The practices below aren’t aspirational — they’re the default for every Sytepoint engagement, instrumented from day one, visible on our own production. The marketing site you’re reading right now runs on the same backbone.

02 — Observability

Instrumented on day one.

If it’s in production, it’s instrumented. Sentry for error tracking + release tagging. PostHog for product analytics + session replay + named event taxonomy. Vercel for request-level telemetry. Anthropic API metering on every AI call. This marketing site ships with that stack from the first commit; your operational platform inherits the same default.

SLOs tied to user-facing outcomes, not server metrics. “Dispatch board loads under 800ms for 99% of requests” matters; “CPU under 80%” doesn’t. Each service gets 3–5 SLOs scoped to the workflows that drive business value. Error budgets are calculated monthly and shape release pace — burn the budget, slow down releases until it recovers.
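
The budget arithmetic is simple enough to show. A minimal sketch, assuming the 99.95% target over a 30-day window; the names are illustrative, not lifted from a client codebase:

```ts
// Error-budget arithmetic for a 99.95% SLO over a 30-day window.
const SLO_TARGET = 0.9995; // 99.95% of requests must succeed
const WINDOW_DAYS = 30;

/** Fraction of the window's error budget still unspent (1 = untouched, 0 = exhausted). */
function errorBudgetRemaining(totalRequests: number, failedRequests: number): number {
  const allowedFailures = totalRequests * (1 - SLO_TARGET);
  if (allowedFailures === 0) return 1;
  return Math.max(0, 1 - failedRequests / allowedFailures);
}

// Example: 10M requests this window, 1,400 failures.
// The budget is 5,000 allowed failures, so 72% of it remains.
console.log(errorBudgetRemaining(10_000_000, 1_400)); // 0.72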

Structured logs + traces, not text grep. Every request carries a trace ID that follows it across services, queues, and AI calls. Logs are JSON-structured with stable field names. When something breaks at 3 a.m., the engineer pulls one trace ID and sees the full transaction end-to-end — not a hunting expedition.
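
A sketch of what that looks like at the emit site, with illustrative field names rather than our production schema:

```ts
// Minimal structured-log emitter with a propagated trace ID.
import { randomUUID } from "node:crypto";

type LogFields = Record<string, string | number | boolean>;

function log(traceId: string, event: string, fields: LogFields = {}): void {
  // One JSON object per line: stable field names, machine-queryable.
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    trace_id: traceId, // follows the request across services, queues, and AI calls
    event,
    ...fields,
  }));
}

const traceId = randomUUID();
log(traceId, "dispatch.board.load", { user_id: "u_123", duration_ms: 412 });
log(traceId, "ai.call", { model: "claude-sonnet", tokens_in: 1800 });
```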

operational telemetry · 24h window · live
ERROR BUDGET · 97.2%
DEPLOYS · 24H · 11 · all gated · 0 rollbacks
AUDIT ENTRIES · 24H · 1,247 · retained 7 yrs · immutable
SLO · GREEN · 99.95% / 30-day window

Recent governance events (streaming):
  • Audit log · inspection report stored
  • Deploy gate passed · main → prod
  • HITL approval · agent action released
  • Eval harness · 47/47 passing
  • Access review · least-privilege reconciled
03 — AI safety

A human approves before anything material moves.

Human-in-the-loop on every destructive or irreversible path. Agents draft, propose, summarize, route — they don’t commit. Sending a payment, deleting a record, posting an external message, accepting a contract: a human approves the agent’s proposal before it fires. The gate costs milliseconds on most flows and removes the risk of an autonomous LLM doing something you can’t undo.
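
A minimal sketch of the gate, assuming a hypothetical ApprovalQueue and illustrative action types:

```ts
// Sketch of a human-in-the-loop gate. `ApprovalQueue`, `commit`, and the
// action names are hypothetical, not our actual agent runtime.
type AgentAction =
  | { kind: "draft_email"; body: string }              // reversible: runs unattended
  | { kind: "send_payment"; amountCents: number }      // irreversible: gated
  | { kind: "delete_record"; recordId: string };       // destructive: gated

const IRREVERSIBLE = new Set<AgentAction["kind"]>(["send_payment", "delete_record"]);

interface ApprovalQueue {
  requestApproval(action: AgentAction): Promise<boolean>; // resolves when a human decides
}

async function execute(action: AgentAction, queue: ApprovalQueue): Promise<void> {
  if (IRREVERSIBLE.has(action.kind)) {
    const approved = await queue.requestApproval(action); // agent proposes, human disposes
    if (!approved) return;                                // rejected: nothing fires
  }
  await commit(action); // only reached for safe actions or approved proposals
}

declare function commit(action: AgentAction): Promise<void>;
```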

Eval harnesses before production, regression suites after. Every prompt-driven workflow ships with an eval set — ~30–100 representative inputs with expected outputs (or expected shape, where free-form). The suite runs on every change. New prompt? Re-run evals. Model version bump? Re-run evals. Regression in pass rate gates the deploy.
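
The core loop of such a harness, sketched under assumptions: a regex stands in for the "expected shape" check, and the baseline value is illustrative:

```ts
// Minimal eval-harness gate: run the suite, fail the deploy on regression.
interface EvalCase { input: string; expected: RegExp } // shape check for free-form output

async function runEvals(cases: EvalCase[], callModel: (input: string) => Promise<string>) {
  let passed = 0;
  for (const c of cases) {
    const output = await callModel(c.input);
    if (c.expected.test(output)) passed++;
  }
  return passed / cases.length;
}

// In CI: any drop below the last accepted pass rate blocks the deploy.
const BASELINE = 1.0; // e.g., 47/47 on the last accepted run

export async function gate(cases: EvalCase[], callModel: (i: string) => Promise<string>) {
  const passRate = await runEvals(cases, callModel);
  if (passRate < BASELINE) {
    throw new Error(`Eval regression: ${(passRate * 100).toFixed(1)}% < baseline`);
  }
}
```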

Prompt-injection scanning + output validation. Inbound user content runs through an injection detector before it reaches an LLM with privileged context. Outbound LLM responses are validated against the expected schema before any tool call or downstream action fires. Belt and suspenders; both are cheap.
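
The output-validation half, sketched with zod for illustration; the schema here is a made-up routing decision, not a client contract:

```ts
// The LLM's response must match the expected schema before any tool call fires.
import { z } from "zod";

const RouteDecision = z.object({
  action: z.enum(["assign", "escalate", "hold"]),
  loadId: z.string(),
  reason: z.string().max(500),
});

function validateModelOutput(raw: string) {
  // A JSON.parse failure throws here and is rejected the same way.
  const parsed = RouteDecision.safeParse(JSON.parse(raw));
  if (!parsed.success) {
    // Malformed or out-of-contract output never reaches a downstream action.
    throw new Error(`LLM output failed validation: ${parsed.error.message}`);
  }
  return parsed.data; // typed, shape-checked, safe to hand to a tool call
}
```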

Vendor diversity, model versioning, kill switches. Critical AI workflows are written against an abstraction, not a vendor. Model versions are pinned per workflow and bumped explicitly, not silently. Feature flags can disable any AI path in production in under 30 seconds if the vendor degrades or pricing shifts.
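
A sketch of the kill switch and per-workflow pinning; the flag store, workflow names, and model pins are all illustrative:

```ts
// Every AI path checks a flag before calling a vendor; pins are explicit.
const MODEL_PINS: Record<string, string> = {
  "demand-letter-draft": "claude-sonnet-4-20250514", // bumped explicitly, never silently
  "load-board-summary": "claude-3-5-haiku-20241022",
};

interface Flags { isEnabled(flag: string): Promise<boolean> }

async function callAI(workflow: string, prompt: string, flags: Flags): Promise<string | null> {
  if (!(await flags.isEnabled(`ai.${workflow}`))) {
    return null; // kill switch thrown: caller falls back to the non-AI path
  }
  const model = MODEL_PINS[workflow]; // pinned per workflow, not per vendor default
  return invokeVendor(model, prompt);
}

declare function invokeVendor(model: string, prompt: string): Promise<string>;
```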

04 — Data governance

Data classification, all the way down.

Every field in every system gets a class. The class drives encryption, retention, access policy, and what shows up in logs vs. what gets redacted. No data is stored by default “in case we need it” — every retention has a reason and an expiry.

Class · Example · Handling
Public · Service prices, location addresses · Encrypted in transit; freely logged
Internal · Build artifacts, deploy logs, metrics · RBAC + TLS 1.3; structured logs
Confidential · Client business data, intake records · AES-256 at rest; access logged; redacted from telemetry
Restricted (PII / PHI) · SSN, medical records, payment details · AES-256 + envelope encryption; least-privilege only; never in logs; BAA-covered
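
A sketch of how the class drives telemetry redaction, with an illustrative field-to-class map:

```ts
// Class-driven log redaction, following the table above.
type DataClass = "public" | "internal" | "confidential" | "restricted";

const FIELD_CLASS: Record<string, DataClass> = {
  service_price: "public",
  deploy_id: "internal",
  client_name: "confidential",
  ssn: "restricted",
};

/** Strip or mask anything that must not reach telemetry. */
function redactForLogs(record: Record<string, unknown>) {
  const out: Record<string, unknown> = {};
  for (const [field, value] of Object.entries(record)) {
    const cls = FIELD_CLASS[field] ?? "restricted"; // unknown fields fail closed
    if (cls === "restricted") continue;             // never in logs
    out[field] = cls === "confidential" ? "[redacted]" : value;
  }
  return out;
}
```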

Encryption defaults: AES-256 at rest, TLS 1.3 in transit, envelope encryption for fields in the Restricted class. Keys managed via AWS KMS or Vercel-equivalent — not in source, not on engineer laptops.
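
The envelope pattern, sketched against AWS KMS; the key alias and field layout are illustrative:

```ts
// Envelope encryption for a Restricted-class field.
import { KMSClient, GenerateDataKeyCommand } from "@aws-sdk/client-kms";
import { createCipheriv, randomBytes } from "node:crypto";

const kms = new KMSClient({});

async function encryptField(plaintext: string) {
  // 1. KMS mints a fresh data key; the master key never leaves KMS.
  const { Plaintext, CiphertextBlob } = await kms.send(
    new GenerateDataKeyCommand({ KeyId: "alias/restricted-fields", KeySpec: "AES_256" })
  );
  // 2. Encrypt the field locally with the plaintext data key (AES-256-GCM).
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", Buffer.from(Plaintext!), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // 3. Store ciphertext + wrapped data key; discard the plaintext key.
  return {
    ciphertext,
    iv,
    authTag: cipher.getAuthTag(),
    encryptedDataKey: Buffer.from(CiphertextBlob!), // only KMS can unwrap this
  };
}
```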

Retention: documented per class at engagement kick-off. Operational data persists as long as the workflow requires it. Telemetry capped at 90 days by default. Audit logs retained per regulatory minimum (typically 7 years for inspection / financial / legal work).

Subject rights: right-to-access, right-to-deletion, and right-to-correction workflows built into systems that handle consumer data. GDPR + CCPA aligned even when not strictly required, because the engineering cost is one-time and the future cost of retrofitting is enormous.

05 — Reliability & change management

Every change goes through the same gates.

No direct production writes. Even an emergency hotfix flows through the full pipeline below — with an expedited canary window, not a bypass. Audit-logged at every gate, reversible at every step.

Change · 7c4f3a1 · sytepoint-website · 5 gates · audit-logged · rollback in <90s
  • PR review · peer + automated checks
  • CI · build + test · type, lint, unit, e2e
  • Staging deploy · full prod-shape env
  • Canary · 5% · production, subset of traffic
  • Promote · 100% · production, all traffic

Blue-green & canary deploys on every production release. The canary catches what staging missed; promotion is gated by error rate + latency stability on the canary slice. Feature flags decouple deploy from release — code reaches production before users see it, which gives us a kill switch independent of the deploy system itself.
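
The promotion check itself is small. A sketch, with illustrative thresholds and metric names:

```ts
// Canary promotion gate: promote only if the 5% slice holds against baseline.
interface CanaryMetrics { errorRate: number; p99LatencyMs: number }

function canPromote(canary: CanaryMetrics, baseline: CanaryMetrics): boolean {
  const errorOk = canary.errorRate <= baseline.errorRate * 1.1;       // no more than 10% worse
  const latencyOk = canary.p99LatencyMs <= baseline.p99LatencyMs * 1.2;
  return errorOk && latencyOk; // either regression holds the rollout at 5%
}
```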

Rollback drills run quarterly in staging — actual rollbacks, actual recovery, actual stopwatch. The marketing site you’re reading rolls back in under 90 seconds; the larger platforms in under five minutes. A documented number means it’s a number we’ve actually measured.

Backups + RPO/RTO commitments. Postgres point-in-time recovery to the last 7 days, full snapshots daily, snapshots tested monthly by restoring to a fresh instance and running the smoke suite. RPO 5 minutes, RTO 30 minutes for platform workloads — calibrated to the actual business cost of data loss for each workflow.

Change management. Every change has a PR, a required reviewer, an automated check suite, and a paper trail. Direct production access is provisioned per-engagement, not per-employee, and audited quarterly. Secret rotation runs automatically; revocation on engineer departure is same-day.

06 — Compliance posture

Calibrated, not aspirational.

We say what we are and aren’t. “HIPAA-ready architecture” is not “HIPAA-compliant.” “SOC 2 trajectory” is not “SOC 2 certified.” Where there’s a gap, we say so and document the controls we have in place that map to it.

Standard · Status · What it means
SOC 2 Type II · 2027 trajectory · Aligned controls in place; formal attestation pending audit
HIPAA · Ready · BAAs signed per engagement; technical safeguards in place
GDPR · Aligned · Subject rights, data minimization, lawful basis documented
CCPA / CPRA · Aligned · Right-to-know, right-to-delete, opt-out flows in place
NIST AI RMF · Aligned · Govern / Map / Measure / Manage practices encoded
AMPP audit trail (coatings work) · In production · Audit-grade logs powering 200+ organizations through DocuPaint
07 — Incident response

When something breaks, here’s the playbook.

Severity classification. Sev-1: production down or data at risk. Sev-2: significant degradation. Sev-3: minor issue, no user impact. The classification drives paging, response SLA, and post-mortem depth.
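
The taxonomy is just configuration. A sketch; whether Sev-2 pages is an assumption, since only Sev-1 paging is committed below:

```ts
// Severity taxonomy as configuration, following the definitions above.
const SEVERITY = {
  sev1: { meaning: "production down or data at risk", pages: true,  writtenPostMortem: true },
  sev2: { meaning: "significant degradation",         pages: true,  writtenPostMortem: true },
  sev3: { meaning: "minor issue, no user impact",     pages: false, writtenPostMortem: false },
} as const;

type Severity = keyof typeof SEVERITY;
```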

Paging & response. Sev-1 pages the on-call engineer within 5 minutes via the monitoring stack. Acknowledgment SLA: 15 minutes. Communication SLA: status page update within 30 minutes, then every 30 minutes until resolved. The client is in the loop from the first acknowledgment, not after the fix.

Blameless post-mortems. Sev-1 and Sev-2 incidents get a written post-mortem within 5 business days. Root cause, contributing factors, timeline, what fixed it, what we’re changing so it doesn’t recur. Shared with the affected client. Action items tracked to completion — the post-mortem isn’t closed until they ship.

Durability over speed. The first fix is often a quick script; the second fix is the system change that makes the same incident class impossible. We track recurrence rate as a KPI — same root cause twice is the indicator we’re fixing symptoms instead of causes.

08 — FAQ

The questions buyers actually ask.

Are you SOC 2 certified?

Not yet — SOC 2 Type II attestation is on our 2027 trajectory once retainer revenue justifies the audit cost. What's in place today: SOC 2-aligned controls (access governance, change management, encryption at rest and in transit, incident response runbooks, vendor risk reviews). For clients in regulated industries, we map our existing controls to the Trust Service Criteria categories so your auditor has a clear picture even without our own attestation.

Can you build HIPAA-compliant systems?

We build HIPAA-ready architecture — meaning the technical safeguards (encryption, access controls, audit logging, secure transmission) are in place from day one. HIPAA compliance is a property of the operating organization plus its Business Associate Agreements, not the software vendor in isolation. We sign BAAs for engagements where we touch PHI, and we work with cloud providers (AWS, Vercel) under their HIPAA programs when the workload requires it.

What's your AI safety posture?

Three layered defaults. (1) Human-in-the-loop on every destructive or irreversible action — agents draft and propose; humans approve before anything material moves. (2) Eval harnesses on every prompt-driven workflow before it goes to production, with regression suites that run on every change. (3) Prompt-injection scanning on inbound user content before it reaches an LLM, plus output validation against expected shape before any downstream action fires. The goal isn't AI that never errs — it's AI where errors are caught at the gate, not in production.

How do you handle data retention and deletion?

Retention schedules are defined per data class at engagement kick-off — operational data for as long as the workflow requires it, telemetry capped at 90 days by default, audit logs retained for regulatory minimums (typically 7 years for financial / inspection work). Right-to-deletion workflows are built into systems that handle consumer data. We document the retention matrix in the engagement; nothing is retained by default "because we might want it."

How does your change-management process actually work?

Every change flows through the pipeline you see on this page: PR with required reviewer + automated checks → CI (type, lint, unit, e2e) → staging deploy with smoke tests → canary at 5% of production traffic → promote to 100%. Every gate is audit-logged. No direct production writes — even hotfixes flow through the same path with an expedited canary window. Rollback is a one-command operation that completes in under 90 seconds for the marketing site, under 5 minutes for the larger platforms.

What's your incident response process?

Sev-1 (production down) pages the on-call engineer within 5 minutes via the monitoring stack. Acknowledgment SLA is 15 minutes. Status page updates every 30 minutes until resolved. Post-mortem within 5 business days, published internally and shared with the affected client. We optimize for blameless post-mortems and root-cause durability — the same incident class shouldn't recur because the fix was a script, not a system change.

How do you protect customer data from your own team?

Least-privilege RBAC on every system; principal engineer access is provisioned per engagement, not per employee. Production data access is logged and reviewed quarterly. Engineers see synthetic / scrubbed data in staging by default. Secrets are managed via the cloud provider's secret store with rotation policies; nothing sensitive lives in source. When an engineer leaves the firm, access is revoked the same day.

Can you provide an audit trail of what your AI did?

Yes — every AI-mediated action is recorded with the input prompt, the model version, the response, any tool calls made, and the human approver (when HITL applies). Audit logs are immutable, retained per the data-class schedule, and queryable. For regulated workflows (inspection reports under AMPP, intake decisions in legal) the audit trail is part of the system's value proposition, not a logging addendum.
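
The shape of one entry, sketched with illustrative field names:

```ts
// One AI audit entry, per the answer above. Field names are illustrative.
interface AiAuditEntry {
  timestamp: string;          // ISO-8601
  workflow: string;           // e.g. "intake-triage"
  modelVersion: string;       // the pinned version actually used
  inputPrompt: string;        // full prompt, redacted per data class
  response: string;
  toolCalls: { name: string; args: unknown }[];
  approvedBy: string | null;  // human approver when HITL applies
  signature: string;          // hash chain / signature making the log tamper-evident
}
```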

09 — Begin

Want this kind of substrate under your operation?

Replies within 1 business day.
Want this kind of substrate under your operation?