Redigital — Production Readiness Assessment

Internal decision document · Prepared with AI-assisted analysis

Executive summary

The Redigital codebase (Django + React multi-tenant SaaS) is structurally complete and runs locally. However, a comprehensive audit identified 92 issues including 29 critical ones across security, data isolation, payment correctness, and performance. The system cannot safely host paying customers in its current state.

Bringing it to production-grade — secure code plus the operational infrastructure a real SaaS needs (CI/CD, monitoring, alerting, backups) — is realistically a 3-4 month effort for one engineer working with AI assistance, or about 2 months with two engineers in parallel.

Most urgent, regardless of any other decision: Live production API keys (Stripe, OpenAI, Anthropic, Postmark, Cloudflare R2, Tiptap) are hardcoded directly into the source code and committed to git history. These need to be rotated this week — independent of any remediation timeline.

Current state of the codebase

What works

What doesn't

Multi-tenancy: real by design, broken in practice

The architecture is row-level multi-tenant. Every tenant-scoped model inherits TenantModelMixin (apps/tenants/models.py:504), which adds a tenant_id foreign key. A custom TenantModelManager (apps/tenants/mixins.py:268) auto-filters queries by the active tenant. Nine apps participate in this scheme; users can belong to multiple workspaces with different roles.

The flaw is in how the active tenant is tracked. apps/tenants/utils.py stores tenant context in Python's threading.local(). Under production-style thread-pooled workers (Gunicorn/uWSGI), threads are reused across requests, and the middleware doesn't reliably clear the context at request boundaries. In single-threaded dev/test environments this bug is invisible; under real concurrency it leaks tenant data silently.

The fix is a ~4-hour refactor: replace threading.local() with contextvars.ContextVar and reset the context in middleware on every request. This is the single most important fix in the entire remediation plan.
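The refactor can be sketched as follows. This is an illustrative sketch, not the actual apps/tenants code: resolve_tenant_id and the function names are assumptions. The key properties are that each request runs in its own context (so a reused worker thread can never observe another request's tenant) and that the finally block resets the context even when the view raises.

```python
# Hedged sketch of the proposed threading.local() -> ContextVar fix.
# All names here are illustrative, not the real apps/tenants API.
from contextvars import ContextVar

_current_tenant = ContextVar("current_tenant", default=None)

def set_current_tenant(tenant_id):
    """Set the active tenant; returns a token for restoring the old value."""
    return _current_tenant.set(tenant_id)

def get_current_tenant():
    return _current_tenant.get()

def resolve_tenant_id(request):
    # Assumed helper: the real app would map subdomain / session /
    # workspace membership to the active tenant id.
    return getattr(request, "tenant_id", None)

class TenantContextMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        token = set_current_tenant(resolve_tenant_id(request))
        try:
            return self.get_response(request)
        finally:
            # Reset even on exceptions, so a reused worker thread
            # never leaks the previous request's tenant.
            _current_tenant.reset(token)
```

Non-request entry points (background jobs, management commands) would need the same set/reset discipline, since they bypass the middleware entirely.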

Workstream breakdown

Effort estimates assume one engineer working with Claude Code AI assistance. Calendar time includes review, integration testing, and discovery cycles.

| Workstream | Scope | Effort | Notes |
|---|---|---|---|
| Sprint 1 — Emergency security | Rotate credentials + purge git history, fix WebSocket auth, IDOR on 10+ endpoints, swap threading.local() for ContextVar, wrap payment/tenant/invite flows in @transaction.atomic, webhook dedup/idempotency | 3-4 days | Patterns repeat across endpoints; AI compresses well. Cred rotation cascades into CI/CD breakage — budget extra. After this sprint: safe for private beta. |
| Sprint 2 — Frontend & input security | DOMPurify in XSS sinks, open redirect fix, CSP headers, API rate limits, move Sentry DSN server-side | 3-4 days | Mostly mechanical. After this: defensible for external users. |
| Sprints 3-4 — Performance & correctness | N+1 queries, pagination, WebSocket memory leak, DB cursor leaks, circuit breakers, email rate limiting | 1.5-2 weeks | AI finds patterns fast; measurement and verification take real time. |
| Sprints 5-8 — Test coverage & edge cases | Build coverage from near-zero, edge cases from audit, dependency upgrades, soak testing after tenant refactor | 2.5-3 weeks | Tests are AI's strength but need real human review. Tenant-isolation soak is calendar-bound — you can't compress real traffic. |
| CI/CD pipeline | GitHub Actions → Fly: lint, test, build, staging promotion | 3-4 days | YAML-heavy, AI-friendly. Some .github workflows exist already. |
| Monitoring & APM | Datadog or Grafana Cloud: metrics, traces, dashboards, SLOs | ~1 week | Dashboard config compresses well; choosing what to measure is human judgment. |
| Centralized logging | Structured logs, aggregation, retention policy | 2 days | Currently stdout to Fly logs only. |
| Alerting & on-call | PagerDuty or Better Stack, alert routing, runbooks | 3-4 days | Runbook drafting compresses well with AI. |
| Backups & DR | Postgres snapshot policy, tested restore drill, RPO/RTO docs | 3 days | Restore drill is calendar-bound — must be done for real. |
| Staging environment + deploy gates | Activate fly-staging.toml, promotion flow | 2 days | Config exists; needs to be wired into the pipeline. |
| Infrastructure-as-code | Terraform for Fly + DNS + Cloudflare | 3-4 days | |
| Secrets management | Proper vault, rotation policy, dev/prod parity | 2 days | |
| Support tooling | django-hijack guardrails, audit-log search UI for support | 2-3 days | Tooling is partially installed; needs proper guardrails before use. |

Scope & timing

| Scope | What's included | What you get | 1 engineer (w/ AI) | 2 engineers parallel |
|---|---|---|---|---|
| Audit work only | The 92 audit issues (Sprints 1-8): security fixes, tenant isolation, payment correctness, XSS, performance, test coverage | Correct, secure app code — but no monitoring, logging, CI/CD, on-call, or tested backups | 7-9 weeks | 4-5 weeks |
| Audit + base infra | All of the above, plus CI/CD pipeline, monitoring & APM, centralized logging, alerting & on-call, backups + restore drill, staging environment, IaC, secrets management, support tooling | Production-grade: secure code and operational readiness — deployable, observable, recoverable | ~3-4 months | ~2 months |

Recommendation: Budget for "Audit + base infra." The audit-only scope makes the code correct on paper but leaves the team flying blind operationally — not actually safe for paying customers despite being secure.

Monthly cost — external services

| Tier | Service | Purpose | Cost (launch) |
|---|---|---|---|
| Required to run | Fly.io | App hosting + Postgres | $50-200/mo |
| | Stripe | Billing & subscriptions | 2.9% + 30¢/txn |
| | OpenAI | GPT models for content generation | $50-2,000/mo |
| | Anthropic | Claude models | $50-2,000/mo |
| | Postmark (or Resend/SendGrid) | Transactional email | $15-50/mo |
| | Cloudflare R2 (or AWS S3) | Media & file storage | $5-50/mo |
| | Tiptap Cloud | Collaborative editor backend | $50-200/mo |
| | GitHub | Source + Actions CI | $0-4/user/mo |
| | Domain registrar | DNS for redigital.ai | $1-5/mo |
| Required for safe ops | Sentry | Error tracking | $0-26/mo |
| | Datadog / Grafana Cloud / Better Stack | Metrics, APM, dashboards | $0-100/mo |
| | Better Stack / Logtail | Log aggregation (often bundled) | |
| | PagerDuty / Better Stack | On-call paging | $10-25/user/mo |
| Team operations | 1Password | Shared secrets across team | $8/user/mo |
| | Linear / GitHub Issues | Issue tracking | $0-10/user/mo |
| Other | Cloudflare | DNS, CDN, WAF (R2 lives here too) | $0/mo |

Cost bottom line

| Scenario | Monthly burn |
|---|---|
| Pre-launch — running, no users | $100-200/mo |
| Launch / private beta (light usage) | $300-600/mo |
| Growth (1k+ active users) | $1,500-5,000/mo, dominated by AI providers |
| One-time setup (domain, account verifications, etc.) | ~$100-300 |

Wildcards to plan for: AI provider bills (OpenAI, Anthropic) can spike 10× overnight with a viral feature or a runaway agent loop. Set hard budget caps on both dashboards on day one. Costs at growth scale are dominated by these two services, not infra.

Account ownership — verify week zero

If the codebase has been inherited from a prior team or vendor, ownership of the existing external accounts must be confirmed before engineering work starts. Account-recovery and ownership-transfer flows can take 1-3 weeks (MFA resets, identity verification, dispute processes) and will block deployment if not handled early.

Final recommendation

Plan for 3-4 months of engineering effort

Budget for the "Audit + base infra" scope: one experienced engineer with Claude Code assistance for 3-4 months, or two engineers working in parallel for roughly 2 months of calendar time. Milestones:

AI assistance compresses the work meaningfully versus an unassisted senior engineer (estimated 6+ months solo), but it cannot compress: soak time after the tenant-isolation refactor, real-world integration testing against Stripe / Postmark / etc., or operational readiness work that needs human judgment.