Redigital — Production Readiness Assessment

Internal decision document · Prepared with AI-assisted analysis

Executive summary

The Redigital codebase (Django + React multi-tenant SaaS) is structurally complete and runs locally. However, a comprehensive audit identified 92 issues including 29 critical ones across security, data isolation, payment correctness, and performance. The system cannot safely host paying customers in its current state.

Bringing it to production-grade — secure code plus the operational infrastructure a real SaaS needs (CI/CD, monitoring, alerting, backups) — is realistically a 3-4 month effort for one engineer working with AI assistance, or about 2 months with two engineers in parallel.

Most urgent, regardless of any other decision: Live production API keys (Stripe, OpenAI, Anthropic, Postmark, Cloudflare R2, Tiptap) are hardcoded directly into the source code and committed to git history. These need to be rotated this week — independent of any remediation timeline.

Current state of the codebase

What works

What doesn't

Multi-tenancy: real by design, broken in practice

The architecture is row-level multi-tenant. Every tenant-scoped model inherits TenantModelMixin (apps/tenants/models.py:504), which adds a tenant_id foreign key. A custom TenantModelManager (apps/tenants/mixins.py:268) auto-filters queries by the active tenant. Nine apps participate in this scheme; users can belong to multiple workspaces with different roles.

The flaw is in how the active tenant is tracked. apps/tenants/utils.py stores tenant context in Python's threading.local(). Under production-style thread-pooled workers (Gunicorn/uWSGI), threads are reused across requests, and the middleware doesn't reliably clear the context at request boundaries. In single-threaded dev/test environments this bug is invisible; under real concurrency it leaks tenant data silently.

The fix is a ~4-hour refactor: replace threading.local() with contextvars.ContextVar and reset the context in middleware on every request. This is the single most important fix in the entire remediation plan.
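The refactor can be sketched as follows. This is an illustrative sketch, not the actual apps/tenants code: resolve_tenant_id and the function names are assumptions. The key properties are that each request runs in its own context (so a reused worker thread can never observe another request's tenant) and that the finally block resets the context even when the view raises.

```python
# Hedged sketch of the proposed threading.local() -> ContextVar fix.
# All names here are illustrative, not the real apps/tenants API.
from contextvars import ContextVar

_current_tenant = ContextVar("current_tenant", default=None)

def set_current_tenant(tenant_id):
    """Set the active tenant; returns a token for restoring the old value."""
    return _current_tenant.set(tenant_id)

def get_current_tenant():
    return _current_tenant.get()

def resolve_tenant_id(request):
    # Assumed helper: the real app would map subdomain / session /
    # workspace membership to the active tenant id.
    return getattr(request, "tenant_id", None)

class TenantContextMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        token = set_current_tenant(resolve_tenant_id(request))
        try:
            return self.get_response(request)
        finally:
            # Reset even on exceptions, so a reused worker thread
            # never leaks the previous request's tenant.
            _current_tenant.reset(token)
```

Non-request entry points (background jobs, management commands) would need the same set/reset discipline, since they bypass the middleware entirely.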

Workstream breakdown

Effort estimates assume one engineer working with Claude Code AI assistance. Calendar time includes review, integration testing, and discovery cycles.

| Workstream | Scope | Effort | Notes |
|---|---|---|---|
| Sprint 1 — Emergency security | Rotate credentials + purge git history, fix WebSocket auth, IDOR on 10+ endpoints, swap threading.local() for ContextVar, wrap payment/tenant/invite flows in @transaction.atomic, webhook dedup/idempotency | 3-4 days | Patterns repeat across endpoints; AI compresses well. Cred rotation cascades into CI/CD breakage — budget extra. After this sprint: safe for private beta. |
| Sprint 2 — Frontend & input security | DOMPurify in XSS sinks, open redirect fix, CSP headers, API rate limits, move Sentry DSN server-side | 3-4 days | Mostly mechanical. After this: defensible for external users. |
| Sprints 3-4 — Performance & correctness | N+1 queries, pagination, WebSocket memory leak, DB cursor leaks, circuit breakers, email rate limiting | 1.5-2 weeks | AI finds patterns fast; measurement and verification take real time. |
| Sprints 5-8 — Test coverage & edge cases | Build coverage from near-zero, edge cases from audit, dependency upgrades, soak testing after tenant refactor | 2.5-3 weeks | Tests are AI's strength but need real human review. Tenant-isolation soak is calendar-bound — you can't compress real traffic. |
| CI/CD pipeline | GitHub Actions → Fly: lint, test, build, staging promotion | 3-4 days | YAML-heavy, AI-friendly. Some .github workflows exist already. |
| Monitoring & APM | Datadog or Grafana Cloud: metrics, traces, dashboards, SLOs | ~1 week | Dashboard config compresses well; choosing what to measure is human judgment. |
| Centralized logging | Structured logs, aggregation, retention policy | 2 days | Currently stdout to Fly logs only. |
| Alerting & on-call | PagerDuty or Better Stack, alert routing, runbooks | 3-4 days | Runbook drafting compresses well with AI. |
| Backups & DR | Postgres snapshot policy, tested restore drill, RPO/RTO docs | 3 days | Restore drill is calendar-bound — must be done for real. |
| Staging environment + deploy gates | Activate fly-staging.toml, promotion flow | 2 days | Config exists; needs to be wired into the pipeline. |
| Infrastructure-as-code | Terraform for Fly + DNS + Cloudflare | 3-4 days | |
| Secrets management | Proper vault, rotation policy, dev/prod parity | 2 days | |
| Support tooling | django-hijack guardrails, audit-log search UI for support | 2-3 days | Tooling is partially installed; needs proper guardrails before use. |

Scope & timing

| Scope | What's included | What you get | 1 engineer (w/ AI) | 2 engineers parallel |
|---|---|---|---|---|
| Audit work only | The 92 audit issues (Sprints 1-8): security fixes, tenant isolation, payment correctness, XSS, performance, test coverage | Correct, secure app code — but no monitoring, logging, CI/CD, on-call, or tested backups | 7-9 weeks | 4-5 weeks |
| Audit + base infra | All of the above, plus CI/CD pipeline, monitoring & APM, centralized logging, alerting & on-call, backups + restore drill, staging environment, IaC, secrets management, support tooling | Production-grade: secure code and operational readiness — deployable, observable, recoverable | ~3-4 months | ~2 months |

Recommendation: Budget for "Audit + base infra." The audit-only scope makes the code correct on paper but leaves the team flying blind operationally — not actually safe for paying customers despite being secure.

Monthly cost — external services

| Tier | Service | Purpose | Cost (launch) |
|---|---|---|---|
| Required to run | Fly.io | App hosting + Postgres | $50-200/mo |
| | Stripe | Billing & subscriptions | 2.9% + 30¢/txn |
| | OpenAI | GPT models for content generation | $50-2,000/mo |
| | Anthropic | Claude models | $50-2,000/mo |
| | Postmark (or Resend/SendGrid) | Transactional email | $15-50/mo |
| | Cloudflare R2 (or AWS S3) | Media & file storage | $5-50/mo |
| | Tiptap Cloud | Collaborative editor backend | $50-200/mo |
| | GitHub | Source + Actions CI | $0-4/user/mo |
| | Domain registrar | DNS for redigital.ai | $1-5/mo |
| Required for safe ops | Sentry | Error tracking | $0-26/mo |
| | Datadog / Grafana Cloud / Better Stack | Metrics, APM, dashboards | $0-100/mo |
| | Better Stack / Logtail | Log aggregation (often bundled) | |
| | PagerDuty / Better Stack | On-call paging | $10-25/user/mo |
| Team operations | 1Password | Shared secrets across team | $8/user/mo |
| | Linear / GitHub Issues | Issue tracking | $0-10/user/mo |
| Other | Cloudflare | DNS, CDN, WAF (R2 lives here too) | $0/mo |

Cost bottom line

| Scenario | Monthly burn |
|---|---|
| Pre-launch — running, no users | $100-200/mo |
| Launch / private beta (light usage) | $300-600/mo |
| Growth (1k+ active users) | $1,500-5,000/mo, dominated by AI providers |
| One-time setup (domain, account verifications, etc.) | ~$100-300 |

Wildcards to plan for: AI provider bills (OpenAI, Anthropic) can spike 10× overnight with a viral feature or a runaway agent loop. Set hard budget caps on both dashboards on day one. Costs at growth scale are dominated by these two services, not infra.

Account ownership — verify week zero

If the codebase has been inherited from a prior team or vendor, ownership of the existing external accounts must be confirmed before engineering work starts. Account-recovery and ownership-transfer flows can take 1-3 weeks (MFA resets, identity verification, dispute processes) and will block deployment if not handled early.

Final recommendation

Plan for 3-4 months of engineering effort

Budget for the "Audit + base infra" scope: one experienced engineer with Claude Code assistance for 3-4 months, or two engineers working in parallel for roughly 2 months of calendar time. Milestones:

AI assistance compresses the work meaningfully versus an unassisted senior engineer (estimated 6+ months solo), but it cannot compress: soak time after the tenant-isolation refactor, real-world integration testing against Stripe / Postmark / etc., or operational readiness work that needs human judgment.