The Redigital codebase (Django + React multi-tenant SaaS) is structurally complete and runs locally. However, a comprehensive audit identified 92 issues including 29 critical ones across security, data isolation, payment correctness, and performance. The system cannot safely host paying customers in its current state.
Bringing it to production-grade — secure code plus the operational infrastructure a real SaaS needs (CI/CD, monitoring, alerting, backups) — is realistically a 3-4 month effort for one engineer working with AI assistance, or about 2 months with two engineers in parallel.
apps/tenants/utils.py:11 uses threading.local() to track the active tenant; under thread-pooled workers, threads are reused across requests and Tenant A's context can persist into Tenant B's request. Result: User B sees User A's data. Silent, no error.

apps/live/consumers.py:37-48. Anonymous users can connect to any channel UUID and intercept real-time collaboration data.

apps/ai/services/utils/agent.py, config/settings/base.py, config/settings/production.py, fly.toml, fly-staging.toml, and assets/js/components/editor/tiptap.jsx.

The architecture is row-level multi-tenant. Every tenant-scoped model inherits TenantModelMixin (apps/tenants/models.py:504), which adds a tenant_id foreign key. A custom TenantModelManager (apps/tenants/mixins.py:268) auto-filters queries by the active tenant. Nine apps participate in this scheme; users can belong to multiple workspaces with different roles.
The flaw is in how the active tenant is tracked. apps/tenants/utils.py stores tenant context in Python's threading.local(). Under production-style thread-pooled workers (Gunicorn/uWSGI), threads are reused across requests, and the middleware doesn't reliably clear the context at request boundaries. In single-threaded dev/test environments this bug is invisible; under real concurrency it leaks tenant data silently.
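The failure mode can be reproduced in a few lines. This is a minimal sketch, not the code in apps/tenants/utils.py: the helper names (`set_current_tenant`, `get_current_tenant`, `handle_request`) are hypothetical, and a single-worker thread pool stands in for a production worker that reuses threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the thread-local tenant state described above.
_state = threading.local()


def set_current_tenant(tenant_id):
    _state.tenant_id = tenant_id


def get_current_tenant():
    return getattr(_state, "tenant_id", None)


def handle_request(tenant_id=None):
    """Simulate one request.

    tenant_id=None models a code path where the middleware neither sets
    nor clears the tenant before the view runs.
    """
    if tenant_id is not None:
        set_current_tenant(tenant_id)
    return get_current_tenant()


# One worker thread, so both "requests" are guaranteed to share it.
with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(handle_request, "tenant-a").result()
    second = pool.submit(handle_request).result()

print(first, second)  # tenant-a tenant-a: the second request inherits A's tenant
```

The second request never names a tenant, yet it observes "tenant-a" because the value was left behind on the reused thread. No exception is raised at any point, which matches the "silent" behavior described above.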
The fix is a roughly 4-hour refactor: replace threading.local() with contextvars.ContextVar and clear the context in middleware on every request. This is the single most important fix in the entire remediation plan.
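A minimal sketch of that refactor follows, under stated assumptions: `TenantContextMiddleware` and `resolve_tenant` are hypothetical names, the request is modeled as a plain dict, and the real tenant lookup is not shown in the audit. The part that matters is the `try`/`finally` reset, which restores the previous value at every request boundary. `ContextVar` additionally gives each asyncio task its own copy of the context, but on reused worker threads it is the reset that prevents the leak.

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar

_current_tenant = ContextVar("current_tenant", default=None)


def get_current_tenant():
    return _current_tenant.get()


def resolve_tenant(request):
    # Hypothetical lookup; a real one might use the subdomain, a header,
    # or the session. Here the request is just a dict.
    return request.get("tenant")


class TenantContextMiddleware:
    """Django-style middleware: set the tenant, always reset it afterwards."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        token = _current_tenant.set(resolve_tenant(request))
        try:
            return self.get_response(request)
        finally:
            # Runs even if the view raises, so nothing survives the request.
            _current_tenant.reset(token)


# The "view" simply reports which tenant it sees.
middleware = TenantContextMiddleware(lambda request: get_current_tenant())

with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(middleware, {"tenant": "tenant-a"}).result()
    second = pool.submit(middleware, {}).result()
    leftover = pool.submit(get_current_tenant).result()

print(first, second, leftover)  # tenant-a None None
```

With the same single reused thread as before, the second request no longer sees the first request's tenant, and nothing remains on the thread once a request has finished.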
Effort estimates assume one engineer working with Claude Code AI assistance. Calendar time includes review, integration testing, and discovery cycles.
| Workstream | Effort | Notes |
|---|---|---|
| Sprint 1: Emergency security. Rotate credentials + purge git history, fix WebSocket auth, IDOR on 10+ endpoints, swap threading.local() → ContextVar, wrap payment/tenant/invite flows in @transaction.atomic, webhook dedup/idempotency | 3-4 days | Patterns repeat across endpoints, AI compresses well. Cred rotation cascades into CI/CD breakage; budget extra. After this sprint: safe for private beta. |
| Sprint 2: Frontend & input security. DOMPurify in XSS sinks, open redirect fix, CSP headers, API rate limits, move Sentry DSN server-side | 3-4 days | Mostly mechanical. After this: defensible for external users. |
| Sprints 3-4: Performance & correctness. N+1 queries, pagination, WebSocket memory leak, DB cursor leaks, circuit breakers, email rate limiting | 1.5-2 weeks | AI finds patterns fast; measurement and verification take real time. |
| Sprints 5-8: Test coverage & edge cases. Build coverage from near-zero, edge cases from audit, dependency upgrades, soak testing after tenant refactor | 2.5-3 weeks | Tests are AI's strength but need real human review. Tenant-isolation soak is calendar-bound: you can't compress real traffic. |
| CI/CD pipeline. GitHub Actions → Fly: lint, test, build, staging promotion | 3-4 days | YAML-heavy, AI-friendly. Some .github workflows exist already. |
| Monitoring & APM. Datadog or Grafana Cloud: metrics, traces, dashboards, SLOs | ~1 week | Dashboard config compresses well; choosing what to measure is human judgment. |
| Centralized logging. Structured logs, aggregation, retention policy | 2 days | Currently stdout to Fly logs only. |
| Alerting & on-call. PagerDuty or Better Stack, alert routing, runbooks | 3-4 days | Runbook drafting compresses well with AI. |
| Backups & DR. Postgres snapshot policy, tested restore drill, RPO/RTO docs | 3 days | Restore drill is calendar-bound; it must be done for real. |
| Staging environment + deploy gates. Activate fly-staging.toml, promotion flow | 2 days | Config exists; needs to be wired into the pipeline. |
| Infrastructure-as-code. Terraform for Fly + DNS + Cloudflare | 3-4 days | |
| Secrets management. Proper vault, rotation policy, dev/prod parity | 2 days | |
| Support tooling. django-hijack guardrails, audit-log search UI for support | 2-3 days | Tooling is partially installed; needs proper guardrails before use. |
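The "webhook dedup/idempotency" item in Sprint 1 combines two ideas from that row: record each event ID under a uniqueness constraint, and apply the side effect in the same transaction. The sketch below illustrates the pattern with the standard-library `sqlite3` module so it runs on its own. The table names, `handle_webhook`, and the event IDs are all hypothetical. In the Django codebase the equivalent would presumably be a model with a unique field written inside `transaction.atomic`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key on event_id is what rejects a replayed delivery.
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE payments (event_id TEXT, amount_cents INTEGER)")
conn.commit()


def handle_webhook(conn, event_id, amount_cents):
    """Apply a payment event once; return False for a duplicate delivery."""
    try:
        # "with conn" commits on success and rolls back on any exception,
        # so the dedup record and the payment are written together or not at all.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
            conn.execute(
                "INSERT INTO payments (event_id, amount_cents) VALUES (?, ?)",
                (event_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The event ID was already recorded: treat as a harmless replay.
        return False


results = [
    handle_webhook(conn, "evt_1", 4900),
    handle_webhook(conn, "evt_1", 4900),  # provider retry of the same event
    handle_webhook(conn, "evt_2", 1900),
]
payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]

print(results, payment_count)  # [True, False, True] 2
```

Inserting the dedup record first means a retried delivery fails fast on the constraint before any payment row is touched.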
| Scope | What's included | What you get | 1 engineer (w/ AI) | 2 engineers parallel |
|---|---|---|---|---|
| Audit work only | The 92 audit issues (Sprints 1-8): security fixes, tenant isolation, payment correctness, XSS, performance, test coverage | Correct, secure app code — but no monitoring, logging, CI/CD, on-call, or tested backups | 7-9 weeks | 4-5 weeks |
| Audit + base infra | All of the above, plus CI/CD pipeline, monitoring & APM, centralized logging, alerting & on-call, backups + restore drill, staging environment, IaC, secrets management, support tooling | Production-grade: secure code and operational readiness — deployable, observable, recoverable | ~3-4 months | ~2 months |
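As a sanity check on the scope totals, the per-workstream effort ranges can be summed directly. This is a rough sketch that assumes a 5-day working week and reads "~1 week" as 5 days; the dictionary labels are shortened from the workstream table.

```python
DAYS_PER_WEEK = 5  # assumption: effort weeks are 5 working days

# (low, high) effort in working days, taken from the workstream table.
audit = {
    "Sprint 1": (3, 4),
    "Sprint 2": (3, 4),
    "Sprints 3-4": (1.5 * DAYS_PER_WEEK, 2 * DAYS_PER_WEEK),
    "Sprints 5-8": (2.5 * DAYS_PER_WEEK, 3 * DAYS_PER_WEEK),
}
infra = {
    "CI/CD pipeline": (3, 4),
    "Monitoring & APM": (5, 5),
    "Centralized logging": (2, 2),
    "Alerting & on-call": (3, 4),
    "Backups & DR": (3, 3),
    "Staging environment": (2, 2),
    "Infrastructure-as-code": (3, 4),
    "Secrets management": (2, 2),
    "Support tooling": (2, 3),
}


def total(items):
    """Sum the low and high ends of every (low, high) range."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


audit_total = total(audit)
infra_total = total(infra)
combined = (audit_total[0] + infra_total[0], audit_total[1] + infra_total[1])

for label, (low, high) in [
    ("Audit work", audit_total),
    ("Base infra", infra_total),
    ("Combined", combined),
]:
    print(f"{label}: {low / DAYS_PER_WEEK:.1f}-{high / DAYS_PER_WEEK:.1f} weeks of raw effort")
```

Under these assumptions the raw effort comes to roughly 5-7 weeks for the audit work and 10-12 weeks combined. The calendar figures in the table (7-9 weeks and 3-4 months) are higher because they also include review, integration testing, and discovery cycles.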
| Tier | Service | Purpose | Cost (launch) |
|---|---|---|---|
| Required to run | Fly.io | App hosting + Postgres | $50-200/mo |
| Required to run | Stripe | Billing & subscriptions | 2.9% + 30¢/txn |
| Required to run | OpenAI | GPT models for content generation | $50-2,000/mo |
| Required to run | Anthropic | Claude models | $50-2,000/mo |
| Required to run | Postmark (or Resend/SendGrid) | Transactional email | $15-50/mo |
| Required to run | Cloudflare R2 (or AWS S3) | Media & file storage | $5-50/mo |
| Required to run | Tiptap Cloud | Collaborative editor backend | $50-200/mo |
| Required to run | GitHub | Source + Actions CI | $0-4/user/mo |
| Required to run | Domain registrar | DNS for redigital.ai | $1-5/mo |
| Required for safe ops | Sentry | Error tracking | $0-26/mo |
| Required for safe ops | Datadog / Grafana Cloud / Better Stack | Metrics, APM, dashboards | $0-100/mo |
| Required for safe ops | Better Stack / Logtail | Log aggregation (often bundled) | - |
| Required for safe ops | PagerDuty / Better Stack | On-call paging | $10-25/user/mo |
| Team operations | 1Password | Shared secrets across team | $8/user/mo |
| Team operations | Linear / GitHub Issues | Issue tracking | $0-10/user/mo |
| Other | Cloudflare | DNS, CDN, WAF (R2 lives here too) | $0/mo |
| Scenario | Monthly burn |
|---|---|
| Pre-launch — running, no users | $100-200/mo |
| Launch / private beta (light usage) | $300-600/mo |
| Growth (1k+ active users) | $1,500-5,000/mo, dominated by AI providers |
| One-time setup (domain, account verifications, etc.) | ~$100-300 |
If the codebase has been inherited from a prior team or vendor, ownership of the existing external accounts must be confirmed before engineering work starts. Account-recovery and ownership-transfer flows can take 1-3 weeks (MFA resets, identity verification, dispute processes) and will block deployment if not handled early.
Budget for the "Audit + base infra" scope: one experienced engineer with Claude Code assistance for about 3-4 months, or two in parallel for about 2 months of calendar time.
AI assistance compresses the work meaningfully versus an unassisted senior engineer (estimated 6+ months solo), but it cannot compress: soak time after the tenant-isolation refactor, real-world integration testing against Stripe / Postmark / etc., or operational readiness work that needs human judgment.