| author | TheSiahxyz <164138827+TheSiahxyz@users.noreply.github.com> | 2026-04-02 14:52:37 +0900 |
|---|---|---|
| committer | TheSiahxyz <164138827+TheSiahxyz@users.noreply.github.com> | 2026-04-02 14:52:37 +0900 |
| commit | 206e4269d422e0f6e3e8f9827c1e5eacbbf2137b | |
| tree | d8b851aaabb0d04aa3bfdc41c1f91b6578b7a1e3 /docs/superpowers/specs | |
| parent | 34340d5c1c3f9406c26c52b5e0bd2170e1242f49 | |
docs: add platform upgrade design spec
Bottom-up upgrade plan covering 5 phases: shared library hardening,
infrastructure improvements, service-level fixes, API security, and
operational maturity.
Diffstat (limited to 'docs/superpowers/specs')
| -rw-r--r-- | docs/superpowers/specs/2026-04-02-platform-upgrade-design.md | 257 |
1 files changed, 257 insertions, 0 deletions
diff --git a/docs/superpowers/specs/2026-04-02-platform-upgrade-design.md b/docs/superpowers/specs/2026-04-02-platform-upgrade-design.md
new file mode 100644
index 0000000..9c84e10
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-02-platform-upgrade-design.md
@@ -0,0 +1,257 @@
+# Platform Upgrade Design Spec
+
+**Date**: 2026-04-02
+**Approach**: Bottom-Up (shared library → infra → services → API security → operations)
+
+---
+
+## Phase 1: Shared Library Hardening
+
+### 1-1. Resilience Module (`shared/src/shared/resilience.py`)
+Currently empty. Implement:
+- **`retry_async()`** — tenacity-based exponential backoff + jitter decorator. Configurable max retries (default 3), base delay (1s), max delay (30s).
+- **`CircuitBreaker`** — Tracks consecutive failures. Opens after N failures (default 5), stays open for configurable cooldown (default 60s), transitions to half-open to test recovery.
+- **`timeout()`** — asyncio-based timeout wrapper. Raises `TimeoutError` after configurable duration.
+- All decorators composable: `@retry_async() @circuit_breaker() async def call_api():`
+
+### 1-2. DB Connection Pooling (`shared/src/shared/db.py`)
+Add to `create_async_engine()`:
+- `pool_size=20` (configurable via `DB_POOL_SIZE`)
+- `max_overflow=10` (configurable via `DB_MAX_OVERFLOW`)
+- `pool_pre_ping=True` (verify connections before use)
+- `pool_recycle=3600` (recycle stale connections)
+Add corresponding fields to `Settings`.
+
+### 1-3. Redis Resilience (`shared/src/shared/broker.py`)
+- Add to `redis.asyncio.from_url()`: `socket_keepalive=True`, `health_check_interval=30`, `retry_on_timeout=True`
+- Wrap `publish()`, `read_group()`, `ensure_group()` with `@retry_async()` from resilience module
+- Add `reconnect()` method for connection loss recovery
+
+### 1-4. Config Validation (`shared/src/shared/config.py`)
+- Add `field_validator` for business logic: `risk_max_position_size > 0`, `health_port` in 1024-65535, `log_level` in valid set
+- Change secret fields to `SecretStr`: `alpaca_api_key`, `alpaca_api_secret`, `database_url`, `redis_url`, `telegram_bot_token`, `anthropic_api_key`, `finnhub_api_key`
+- Update all consumers to call `.get_secret_value()` where needed
+
+### 1-5. Dependency Pinning
+All `pyproject.toml` files: add upper bounds.
+Examples:
+- `pydantic>=2.8,<3`
+- `redis>=5.0,<6`
+- `sqlalchemy[asyncio]>=2.0,<3`
+- `numpy>=1.26,<3`
+- `pandas>=2.1,<3`
+- `anthropic>=0.40,<1`
+Run `uv lock` to generate lock file.
+
+---
+
+## Phase 2: Infrastructure Hardening
+
+### 2-1. Docker Secrets & Environment
+- Remove hardcoded `POSTGRES_USER: trading` / `POSTGRES_PASSWORD: trading` from `docker-compose.yml`
+- Reference via `${POSTGRES_USER}` / `${POSTGRES_PASSWORD}` from `.env`
+- Add comments in `.env.example` marking secret vs config variables
+
+### 2-2. Dockerfile Optimization (all 7 services)
+Pattern for each Dockerfile:
+```dockerfile
+# Stage 1: builder
+FROM python:3.12-slim AS builder
+WORKDIR /app
+COPY shared/pyproject.toml shared/setup.cfg shared/
+COPY shared/src/ shared/src/
+RUN pip install --no-cache-dir ./shared
+COPY services/<name>/pyproject.toml services/<name>/
+COPY services/<name>/src/ services/<name>/src/
+RUN pip install --no-cache-dir ./services/<name>
+
+# Stage 2: runtime
+FROM python:3.12-slim
+RUN useradd -r -s /bin/false appuser
+WORKDIR /app
+COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
+COPY --from=builder /usr/local/bin /usr/local/bin
+USER appuser
+CMD ["python", "-m", "<module>.main"]
+```
+
+Create root `.dockerignore`:
+```
+__pycache__
+*.pyc
+.git
+.venv
+.env
+tests/
+docs/
+*.md
+.ruff_cache
+```
+
+### 2-3. Database Index Migration (`003_add_missing_indexes.py`)
+New Alembic migration adding:
+- `idx_signals_symbol_created` on `signals(symbol, created_at)`
+- `idx_orders_symbol_status_created` on `orders(symbol, status, created_at)`
+- `idx_trades_order_id` on `trades(order_id)`
+- `idx_trades_symbol_traded` on `trades(symbol, traded_at)`
+- `idx_portfolio_snapshots_at` on `portfolio_snapshots(snapshot_at)`
+- `idx_symbol_scores_symbol` unique on `symbol_scores(symbol)`
+
+### 2-4. Docker Compose Resource Limits
+Add to each service:
+```yaml
+deploy:
+  resources:
+    limits:
+      memory: 512M
+      cpus: '1.0'
+```
+Strategy-engine and backtester get `memory: 1G` (pandas/numpy usage).
+
+Add explicit networks:
+```yaml
+networks:
+  internal:
+    driver: bridge
+  monitoring:
+    driver: bridge
+```
+
+---
+
+## Phase 3: Service-Level Improvements
+
+### 3-1. Graceful Shutdown (all services)
+Add to each service's `main()`:
+```python
+shutdown_event = asyncio.Event()
+
+def _signal_handler():
+    log.info("shutdown_signal_received")
+    shutdown_event.set()
+
+loop = asyncio.get_event_loop()
+loop.add_signal_handler(signal.SIGTERM, _signal_handler)
+loop.add_signal_handler(signal.SIGINT, _signal_handler)
+```
+Main loops check `shutdown_event.is_set()` to exit gracefully.
+API service: add `--timeout-graceful-shutdown 30` to uvicorn CMD.
+
+### 3-2. Exception Specialization (all services)
+Replace broad `except Exception` with layered handling:
+- `ConnectionError`, `TimeoutError` → retry via resilience module
+- `ValueError`, `KeyError` → log warning, skip item, continue
+- `Exception` → top-level only, `exc_info=True` for full traceback + Telegram alert
+
+Target: reduce 63 broad catches to ~10 top-level safety nets.
+
+### 3-3. LLM Parsing Deduplication (`stock_selector.py`)
+Extract `_extract_json_from_text(text: str) -> list | dict | None`:
+- Tries ```` ```json ``` ```` code block extraction
+- Falls back to `re.search(r"\[.*\]", text, re.DOTALL)`
+- Falls back to raw `json.loads(text.strip())`
+Replace 3 duplicate parsing blocks with single call.
+
+### 3-4. aiohttp Session Reuse (`stock_selector.py`)
+- Add `_session: aiohttp.ClientSession | None = None` to `StockSelector`
+- Lazy-init in `_ensure_session()`, close in `close()`
+- Replace all `async with aiohttp.ClientSession()` with `self._session`
+
+---
+
+## Phase 4: API Security
+
+### 4-1. Bearer Token Authentication
+- Add `api_auth_token: SecretStr = ""` to `Settings`
+- Create `dependencies/auth.py` with `verify_token()` dependency
+- Apply to all `/api/v1/*` routes via router-level `dependencies=[Depends(verify_token)]`
+- If token is empty string → skip auth (dev mode), log warning on startup
+
+### 4-2. CORS Configuration
+```python
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=settings.cors_origins.split(","), # default: "http://localhost:3000"
+    allow_methods=["GET", "POST"],
+    allow_headers=["Authorization", "Content-Type"],
+)
+```
+Add `cors_origins: str = "http://localhost:3000"` to Settings.
+
+### 4-3. Rate Limiting
+- Add `slowapi` dependency
+- Global default: 60 req/min per IP
+- Order-related endpoints: 10 req/min per IP
+- Return `429 Too Many Requests` with `Retry-After` header
+
+### 4-4. Input Validation
+- All `limit` params: `Query(default=50, ge=1, le=1000)`
+- All `days` params: `Query(default=30, ge=1, le=365)`
+- Add Pydantic `response_model` to all endpoints (enables auto OpenAPI docs)
+- Add `symbol` param validation: uppercase, 1-5 chars, alphanumeric
+
+---
+
+## Phase 5: Operational Maturity
+
+### 5-1. GitHub Actions CI/CD
+File: `.github/workflows/ci.yml`
+
+**PR trigger** (`pull_request`):
+1. Install deps (`uv sync`)
+2. Ruff lint + format check
+3. pytest with coverage (`--cov --cov-report=xml`)
+4. Upload coverage to PR comment
+
+**Main push** (`push: branches: [master]`):
+1. Same lint + test
+2. `docker compose build`
+3. (Future: push to registry)
+
+### 5-2. Ruff Rules Enhancement
+```toml
+[tool.ruff.lint]
+select = ["E", "W", "F", "I", "B", "UP", "ASYNC", "PERF", "C4", "RUF"]
+ignore = ["E501"]
+
+[tool.ruff.lint.per-file-ignores]
+"tests/*" = ["F841"]
+
+[tool.ruff.lint.isort]
+known-first-party = ["shared"]
+```
+Run `ruff check --fix .` and `ruff format .` to fix existing violations, commit separately.
+
+### 5-3. Prometheus Alerting
+File: `monitoring/prometheus/alert_rules.yml`
+Rules:
+- `ServiceDown`: `service_up == 0` for 1 min → critical
+- `HighErrorRate`: `rate(errors_total[5m]) > 10` → warning
+- `HighLatency`: `histogram_quantile(0.95, processing_seconds) > 5` → warning
+
+Add Alertmanager config with Telegram webhook (reuse existing bot token).
+Reference alert rules in `monitoring/prometheus.yml`.
+
+### 5-4. Code Coverage
+Add to root `pyproject.toml`:
+```toml
+[tool.pytest.ini_options]
+addopts = "--cov=shared/src --cov=services --cov-report=term-missing"
+
+[tool.coverage.run]
+branch = true
+omit = ["tests/*", "*/alembic/*"]
+
+[tool.coverage.report]
+fail_under = 70
+```
+Add `pytest-cov` to dev dependencies.
+
+---
+
+## Out of Scope
+- Kubernetes/Helm charts (premature — Docker Compose sufficient for current scale)
+- External secrets manager (Vault, AWS SM — overkill for single-machine deployment)
+- OpenTelemetry distributed tracing (add when debugging cross-service issues)
+- API versioning beyond `/api/v1/` prefix
+- Data retention/partitioning (address when data volume becomes an issue)

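
Note on section 1-1: the `CircuitBreaker` state machine (closed, open, half-open) can be sketched as below. This is an illustration under stated assumptions, not the module's actual design. The `CircuitOpenError` name, the `state` property, the `call()` method, and the injectable `clock` parameter are all hypothetical. The defaults of 5 failures and a 60 s cooldown come from the spec.

```python
from __future__ import annotations

import time
from typing import Any, Awaitable, Callable


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when a call is rejected by an open circuit."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,   # spec default: open after 5 failures
        cooldown: float = 60.0,       # spec default: stay open for 60 s
        clock: Callable[[], float] = time.monotonic,  # injectable for tests (assumption)
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._cooldown:
            return "half-open"  # cooldown elapsed: allow a trial call
        return "open"

    def record_success(self) -> None:
        # Any success closes the circuit and resets the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        # A failed trial call re-opens the circuit and restarts the cooldown.
        if self._opened_at is not None or self._failures >= self._threshold:
            self._opened_at = self._clock()

    async def call(self, func: Callable[..., Awaitable[Any]], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

Passing the clock in as a parameter lets a test advance time deterministically instead of sleeping through the cooldown.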