# Platform Upgrade Design Spec

**Date**: 2026-04-02
**Approach**: Bottom-up (shared library → infra → services → API security → operations)

---

## Phase 1: Shared Library Hardening

### 1-1. Resilience Module (`shared/src/shared/resilience.py`)

Currently empty. Implement:

- **`retry_async()`** — tenacity-based exponential backoff + jitter decorator. Configurable max retries (default 3), base delay (1s), and max delay (30s).
- **`CircuitBreaker`** — tracks consecutive failures. Opens after N failures (default 5), stays open for a configurable cooldown (default 60s), then transitions to half-open to probe for recovery.
- **`timeout()`** — asyncio-based timeout wrapper. Raises `TimeoutError` after a configurable duration.
- All decorators are composable, e.g. `@retry_async()` stacked on `@circuit_breaker()` over `async def call_api(): ...`

### 1-2. DB Connection Pooling (`shared/src/shared/db.py`)

Add to `create_async_engine()`:

- `pool_size=20` (configurable via `DB_POOL_SIZE`)
- `max_overflow=10` (configurable via `DB_MAX_OVERFLOW`)
- `pool_pre_ping=True` (verify connections before use)
- `pool_recycle=3600` (recycle stale connections after an hour)

Add the corresponding fields to `Settings`.

### 1-3. Redis Resilience (`shared/src/shared/broker.py`)

- Add to `redis.asyncio.from_url()`: `socket_keepalive=True`, `health_check_interval=30`, `retry_on_timeout=True`
- Wrap `publish()`, `read_group()`, and `ensure_group()` with `@retry_async()` from the resilience module
- Add a `reconnect()` method for connection-loss recovery

### 1-4. Config Validation (`shared/src/shared/config.py`)

- Add `field_validator`s for business rules: `risk_max_position_size > 0`, `health_port` within 1024-65535, `log_level` in the valid set
- Change secret fields to `SecretStr`: `alpaca_api_key`, `alpaca_api_secret`, `database_url`, `redis_url`, `telegram_bot_token`, `anthropic_api_key`, `finnhub_api_key`
- Update all consumers to call `.get_secret_value()` where needed

### 1-5. Dependency Pinning

All `pyproject.toml` files: add upper bounds.
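The §1-1 primitives admit a compact sketch. This is a minimal stdlib-only illustration, with plain `asyncio`/`random` standing in for tenacity and internal details (retryable exception set, `monotonic`-based breaker state) assumed rather than mandated; the real module should use tenacity as specified above:

```python
import asyncio
import random
import time
from functools import wraps


def retry_async(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Exponential backoff with full jitter (stdlib stand-in for the tenacity version)."""
    def decorator(fn):
        @wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await fn(*args, **kwargs)
                except (ConnectionError, TimeoutError):  # transient errors only, per §3-2
                    if attempt == max_retries:
                        raise
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    await asyncio.sleep(random.uniform(0, delay))
        return wrapper
    return decorator


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; half-opens after `cooldown`."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if time.monotonic() - self._opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()
```

The `timeout()` wrapper is omitted here since `asyncio.timeout()` (3.11+) already covers it.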
Examples:

- `pydantic>=2.8,<3`
- `redis>=5.0,<6`
- `sqlalchemy[asyncio]>=2.0,<3`
- `numpy>=1.26,<3`
- `pandas>=2.1,<3`
- `anthropic>=0.40,<1`

Run `uv lock` to regenerate the lock file.

---

## Phase 2: Infrastructure Hardening

### 2-1. Docker Secrets & Environment

- Remove the hardcoded `POSTGRES_USER: trading` / `POSTGRES_PASSWORD: trading` from `docker-compose.yml`
- Reference them via `${POSTGRES_USER}` / `${POSTGRES_PASSWORD}` from `.env`
- Add comments in `.env.example` marking which variables are secrets and which are plain config

### 2-2. Dockerfile Optimization (all 7 services)

Pattern for each Dockerfile, with `<service>` standing in for the service name:

```dockerfile
# Stage 1: builder
FROM python:3.12-slim AS builder
WORKDIR /app
COPY shared/pyproject.toml shared/setup.cfg shared/
COPY shared/src/ shared/src/
RUN pip install --no-cache-dir ./shared
COPY services/<service>/pyproject.toml services/<service>/
COPY services/<service>/src/ services/<service>/src/
RUN pip install --no-cache-dir ./services/<service>

# Stage 2: runtime
FROM python:3.12-slim
RUN useradd -r -s /bin/false appuser
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
USER appuser
CMD ["python", "-m", "<service>.main"]
```

Create a root `.dockerignore`:

```
__pycache__
*.pyc
.git
.venv
.env
tests/
docs/
*.md
.ruff_cache
```

### 2-3. Database Index Migration (`003_add_missing_indexes.py`)

New Alembic migration adding:

- `idx_signals_symbol_created` on `signals(symbol, created_at)`
- `idx_orders_symbol_status_created` on `orders(symbol, status, created_at)`
- `idx_trades_order_id` on `trades(order_id)`
- `idx_trades_symbol_traded` on `trades(symbol, traded_at)`
- `idx_portfolio_snapshots_at` on `portfolio_snapshots(snapshot_at)`
- `idx_symbol_scores_symbol` (unique) on `symbol_scores(symbol)`

### 2-4. Docker Compose Resource Limits

Add to each service:

```yaml
deploy:
  resources:
    limits:
      memory: 512M
      cpus: '1.0'
```

Strategy-engine and backtester get `memory: 1G` (heavier pandas/numpy usage).
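The §2-3 migration could look roughly like this (a sketch only: the revision identifiers are placeholders and the table/column names are taken from the list above):

```python
"""Add missing indexes (003)."""
from alembic import op

revision = "003_add_missing_indexes"
down_revision = "002"  # placeholder: use the actual previous revision id
branch_labels = None
depends_on = None

# (index name, table, columns, unique) from the §2-3 list
_INDEXES = [
    ("idx_signals_symbol_created", "signals", ["symbol", "created_at"], False),
    ("idx_orders_symbol_status_created", "orders", ["symbol", "status", "created_at"], False),
    ("idx_trades_order_id", "trades", ["order_id"], False),
    ("idx_trades_symbol_traded", "trades", ["symbol", "traded_at"], False),
    ("idx_portfolio_snapshots_at", "portfolio_snapshots", ["snapshot_at"], False),
    ("idx_symbol_scores_symbol", "symbol_scores", ["symbol"], True),
]


def upgrade() -> None:
    for name, table, columns, unique in _INDEXES:
        op.create_index(name, table, columns, unique=unique)


def downgrade() -> None:
    # Drop in reverse creation order for symmetry
    for name, table, _columns, _unique in reversed(_INDEXES):
        op.drop_index(name, table_name=table)
```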
Add explicit networks:

```yaml
networks:
  internal:
    driver: bridge
  monitoring:
    driver: bridge
```

---

## Phase 3: Service-Level Improvements

### 3-1. Graceful Shutdown (all services)

Add to each service's `main()`:

```python
shutdown_event = asyncio.Event()

def _signal_handler():
    log.info("shutdown_signal_received")
    shutdown_event.set()

loop = asyncio.get_running_loop()  # main() already runs inside the event loop
loop.add_signal_handler(signal.SIGTERM, _signal_handler)
loop.add_signal_handler(signal.SIGINT, _signal_handler)
```

Main loops check `shutdown_event.is_set()` to exit gracefully. API service: add `--timeout-graceful-shutdown 30` to the uvicorn CMD.

### 3-2. Exception Specialization (all services)

Replace broad `except Exception` with layered handling:

- `ConnectionError`, `TimeoutError` → retry via the resilience module
- `ValueError`, `KeyError` → log a warning, skip the item, continue
- `Exception` → top level only, with `exc_info=True` for a full traceback plus a Telegram alert

Target: reduce the 63 broad catches to roughly 10 top-level safety nets.

### 3-3. LLM Parsing Deduplication (`stock_selector.py`)

Extract `_extract_json_from_text(text: str) -> list | dict | None`:

- Tries ```` ```json ```` code-block extraction first
- Falls back to `re.search(r"\[.*\]", text, re.DOTALL)`
- Falls back to raw `json.loads(text.strip())`

Replace the 3 duplicate parsing blocks with a single call.

### 3-4. aiohttp Session Reuse (`stock_selector.py`)

- Add `_session: aiohttp.ClientSession | None = None` to `StockSelector`
- Lazily initialize it in `_ensure_session()`, close it in `close()`
- Replace every `async with aiohttp.ClientSession()` with `self._session`

---

## Phase 4: API Security

### 4-1. Bearer Token Authentication

- Add `api_auth_token: SecretStr = SecretStr("")` to `Settings` (the default must already be a `SecretStr`, since pydantic does not validate defaults)
- Create `dependencies/auth.py` with a `verify_token()` dependency
- Apply it to all `/api/v1/*` routes via router-level `dependencies=[Depends(verify_token)]`
- If the token is an empty string → skip auth (dev mode) and log a warning on startup

### 4-2. CORS Configuration

```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.cors_origins.split(","),  # default: "http://localhost:3000"
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)
```

Add `cors_origins: str = "http://localhost:3000"` to `Settings`.

### 4-3. Rate Limiting

- Add the `slowapi` dependency
- Global default: 60 req/min per IP
- Order-related endpoints: 10 req/min per IP
- Return `429 Too Many Requests` with a `Retry-After` header

### 4-4. Input Validation

- All `limit` params: `Query(default=50, ge=1, le=1000)`
- All `days` params: `Query(default=30, ge=1, le=365)`
- Add a Pydantic `response_model` to every endpoint (enables accurate auto-generated OpenAPI docs)
- Validate the `symbol` param: uppercase, 1-5 chars, alphanumeric

---

## Phase 5: Operational Maturity

### 5-1. GitHub Actions CI/CD

File: `.github/workflows/ci.yml`

**PR trigger** (`pull_request`):

1. Install deps (`uv sync`)
2. Ruff lint + format check
3. pytest with coverage (`--cov --cov-report=xml`)
4. Upload coverage to a PR comment

**Main push** (`push: branches: [master]`):

1. Same lint + test
2. `docker compose build`
3. (Future: push to registry)

### 5-2. Ruff Rules Enhancement

```toml
[tool.ruff.lint]
select = ["E", "W", "F", "I", "B", "UP", "ASYNC", "PERF", "C4", "RUF"]
ignore = ["E501"]

[tool.ruff.lint.per-file-ignores]
"tests/*" = ["F841"]

[tool.ruff.lint.isort]
known-first-party = ["shared"]
```

Run `ruff check --fix .` and `ruff format .` to fix existing violations; commit those fixes separately.

### 5-3. Prometheus Alerting

File: `monitoring/prometheus/alert_rules.yml`

Rules:

- `ServiceDown`: `service_up == 0` for 1 min → critical
- `HighErrorRate`: `rate(errors_total[5m]) > 10` → warning
- `HighLatency`: `histogram_quantile(0.95, rate(processing_seconds_bucket[5m])) > 5` → warning (`histogram_quantile` needs the `rate` of the `_bucket` series)

Add an Alertmanager config with a Telegram webhook (reuse the existing bot token). Reference the alert rules in `monitoring/prometheus.yml`.

### 5-4. Code Coverage

Add to the root `pyproject.toml`:

```toml
[tool.pytest.ini_options]
addopts = "--cov=shared/src --cov=services --cov-report=term-missing"

[tool.coverage.run]
branch = true
omit = ["tests/*", "*/alembic/*"]

[tool.coverage.report]
fail_under = 70
```

Add `pytest-cov` to dev dependencies.

---

## Out of Scope

- Kubernetes/Helm charts (premature — Docker Compose is sufficient at current scale)
- External secrets manager (Vault, AWS SM — overkill for a single-machine deployment)
- OpenTelemetry distributed tracing (add when debugging cross-service issues)
- API versioning beyond the `/api/v1/` prefix
- Data retention/partitioning (address when data volume becomes an issue)