author TheSiahxyz <164138827+TheSiahxyz@users.noreply.github.com> 2026-04-02 14:52:37 +0900
committer TheSiahxyz <164138827+TheSiahxyz@users.noreply.github.com> 2026-04-02 14:52:37 +0900
commit 206e4269d422e0f6e3e8f9827c1e5eacbbf2137b
tree d8b851aaabb0d04aa3bfdc41c1f91b6578b7a1e3
parent 34340d5c1c3f9406c26c52b5e0bd2170e1242f49
docs: add platform upgrade design spec
Bottom-up upgrade plan covering 5 phases: shared library hardening, infrastructure improvements, service-level fixes, API security, and operational maturity.
-rw-r--r-- docs/superpowers/specs/2026-04-02-platform-upgrade-design.md | 257
1 files changed, 257 insertions, 0 deletions
diff --git a/docs/superpowers/specs/2026-04-02-platform-upgrade-design.md b/docs/superpowers/specs/2026-04-02-platform-upgrade-design.md
new file mode 100644
index 0000000..9c84e10
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-02-platform-upgrade-design.md
@@ -0,0 +1,257 @@
+# Platform Upgrade Design Spec
+
+**Date**: 2026-04-02
+**Approach**: Bottom-Up (shared library → infra → services → API security → operations)
+
+---
+
+## Phase 1: Shared Library Hardening
+
+### 1-1. Resilience Module (`shared/src/shared/resilience.py`)
+Currently empty. Implement:
+- **`retry_async()`** — tenacity-based exponential backoff + jitter decorator. Configurable max retries (default 3), base delay (1s), max delay (30s).
+- **`CircuitBreaker`** — Tracks consecutive failures. Opens after N failures (default 5), stays open for configurable cooldown (default 60s), transitions to half-open to test recovery.
+- **`timeout()`** — asyncio-based timeout wrapper. Raises `TimeoutError` after configurable duration.
+- All decorators composable: `@retry_async() @circuit_breaker() async def call_api():`
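A minimal sketch of how the retry and circuit-breaker pieces in 1-1 might compose. The spec calls for a tenacity-based retry; this sketch swaps in a hand-rolled stdlib loop so it stays self-contained. `CircuitOpenError`, the `retry_on` tuple, and the injectable `clock` parameter are assumptions made for illustration, not names taken from the spec.

```python
import asyncio
import random
import time
from functools import wraps


def retry_async(max_retries=3, base_delay=1.0, max_delay=30.0,
                retry_on=(ConnectionError, TimeoutError)):
    """Retry an async callable with exponential backoff and full jitter."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise  # retries exhausted, surface the last error
                    cap = min(max_delay, base_delay * 2 ** attempt)
                    await asyncio.sleep(random.uniform(0, cap))
        return wrapper
    return decorator


class CircuitOpenError(Exception):
    """Hypothetical error raised when a call is rejected by an open circuit."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock      # injectable so tests can control time
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def __call__(self, func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            if self.state == "open":
                raise CircuitOpenError(func.__name__)
            try:
                result = await func(*args, **kwargs)
            except Exception:
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    # Open the circuit, or re-open it after a failed trial call.
                    self._opened_at = self._clock()
                raise
            # Success closes the circuit and resets the failure count.
            self._failures = 0
            self._opened_at = None
            return result
        return wrapper
```

Stacking `@retry_async()` above a `CircuitBreaker` instance means each retry attempt passes through the breaker. This simplified version lets every caller through while half-open; a stricter variant would admit a single trial call.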
+
+### 1-2. DB Connection Pooling (`shared/src/shared/db.py`)
+Add to `create_async_engine()`:
+- `pool_size=20` (configurable via `DB_POOL_SIZE`)
+- `max_overflow=10` (configurable via `DB_MAX_OVERFLOW`)
+- `pool_pre_ping=True` (verify connections before use)
+- `pool_recycle=3600` (recycle stale connections)
+Add corresponding fields to `Settings`.
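One way the pool settings in 1-2 could be gathered before being passed to the engine, sketched under assumptions: `pool_kwargs_from_env` is a hypothetical helper, and it reads the environment directly rather than going through `Settings` as the spec intends.

```python
import os


def pool_kwargs_from_env(env=None):
    """Build connection-pool keyword arguments, with the spec's defaults."""
    env = os.environ if env is None else env
    return {
        "pool_size": int(env.get("DB_POOL_SIZE", "20")),
        "max_overflow": int(env.get("DB_MAX_OVERFLOW", "10")),
        "pool_pre_ping": True,   # verify connections before use
        "pool_recycle": 3600,    # recycle connections older than one hour
    }
```

The resulting dict could then be unpacked into the engine factory, for example `create_async_engine(url, **pool_kwargs_from_env())`.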
+
+### 1-3. Redis Resilience (`shared/src/shared/broker.py`)
+- Add to `redis.asyncio.from_url()`: `socket_keepalive=True`, `health_check_interval=30`, `retry_on_timeout=True`
+- Wrap `publish()`, `read_group()`, `ensure_group()` with `@retry_async()` from resilience module
+- Add `reconnect()` method for connection loss recovery
+
+### 1-4. Config Validation (`shared/src/shared/config.py`)
+- Add `field_validator` for business logic: `risk_max_position_size > 0`, `health_port` in 1024-65535, `log_level` in valid set
+- Change secret fields to `SecretStr`: `alpaca_api_key`, `alpaca_api_secret`, `database_url`, `redis_url`, `telegram_bot_token`, `anthropic_api_key`, `finnhub_api_key`
+- Update all consumers to call `.get_secret_value()` where needed
+
+### 1-5. Dependency Pinning
+All `pyproject.toml` files: add upper bounds.
+Examples:
+- `pydantic>=2.8,<3`
+- `redis>=5.0,<6`
+- `sqlalchemy[asyncio]>=2.0,<3`
+- `numpy>=1.26,<3`
+- `pandas>=2.1,<3`
+- `anthropic>=0.40,<1`
+Run `uv lock` to generate lock file.
+
+---
+
+## Phase 2: Infrastructure Hardening
+
+### 2-1. Docker Secrets & Environment
+- Remove hardcoded `POSTGRES_USER: trading` / `POSTGRES_PASSWORD: trading` from `docker-compose.yml`
+- Reference via `${POSTGRES_USER}` / `${POSTGRES_PASSWORD}` from `.env`
+- Add comments in `.env.example` marking secret vs config variables
+
+### 2-2. Dockerfile Optimization (all 7 services)
+Pattern for each Dockerfile:
+```dockerfile
+# Stage 1: builder
+FROM python:3.12-slim AS builder
+WORKDIR /app
+COPY shared/pyproject.toml shared/setup.cfg shared/
+COPY shared/src/ shared/src/
+RUN pip install --no-cache-dir ./shared
+COPY services/<name>/pyproject.toml services/<name>/
+COPY services/<name>/src/ services/<name>/src/
+RUN pip install --no-cache-dir ./services/<name>
+
+# Stage 2: runtime
+FROM python:3.12-slim
+RUN useradd -r -s /bin/false appuser
+WORKDIR /app
+COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
+COPY --from=builder /usr/local/bin /usr/local/bin
+USER appuser
+CMD ["python", "-m", "<module>.main"]
+```
+
+Create root `.dockerignore`:
+```
+__pycache__
+*.pyc
+.git
+.venv
+.env
+tests/
+docs/
+*.md
+.ruff_cache
+```
+
+### 2-3. Database Index Migration (`003_add_missing_indexes.py`)
+New Alembic migration adding:
+- `idx_signals_symbol_created` on `signals(symbol, created_at)`
+- `idx_orders_symbol_status_created` on `orders(symbol, status, created_at)`
+- `idx_trades_order_id` on `trades(order_id)`
+- `idx_trades_symbol_traded` on `trades(symbol, traded_at)`
+- `idx_portfolio_snapshots_at` on `portfolio_snapshots(snapshot_at)`
+- `idx_symbol_scores_symbol` unique on `symbol_scores(symbol)`
+
+### 2-4. Docker Compose Resource Limits
+Add to each service:
+```yaml
+deploy:
+  resources:
+    limits:
+      memory: 512M
+      cpus: '1.0'
+```
+Strategy-engine and backtester get `memory: 1G` (pandas/numpy usage).
+
+Add explicit networks:
+```yaml
+networks:
+  internal:
+    driver: bridge
+  monitoring:
+    driver: bridge
+```
+
+---
+
+## Phase 3: Service-Level Improvements
+
+### 3-1. Graceful Shutdown (all services)
+Add to each service's `main()`:
+```python
+shutdown_event = asyncio.Event()
+
+def _signal_handler():
+    log.info("shutdown_signal_received")
+    shutdown_event.set()
+
+loop = asyncio.get_event_loop()
+loop.add_signal_handler(signal.SIGTERM, _signal_handler)
+loop.add_signal_handler(signal.SIGINT, _signal_handler)
+```
+Main loops check `shutdown_event.is_set()` to exit gracefully.
+API service: add `--timeout-graceful-shutdown 30` to uvicorn CMD.
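A minimal sketch of a main loop that honors the shutdown event from 3-1. `run_worker`, `process_one`, and `poll_interval` are hypothetical names; the point is that the sleep between iterations waits on the event itself, so a signal ends the loop promptly instead of after a full sleep.

```python
import asyncio


async def run_worker(shutdown_event, process_one, poll_interval=1.0):
    """Call process_one() repeatedly until shutdown_event is set."""
    iterations = 0
    while not shutdown_event.is_set():
        await process_one()
        iterations += 1
        try:
            # Sleep between iterations, but wake immediately on shutdown.
            await asyncio.wait_for(shutdown_event.wait(), timeout=poll_interval)
        except asyncio.TimeoutError:
            pass  # no shutdown requested yet, run the next iteration
    return iterations
```

The in-flight `process_one()` call always completes before the loop exits, which is what lets a service finish its current item before closing connections.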
+
+### 3-2. Exception Specialization (all services)
+Replace broad `except Exception` with layered handling:
+- `ConnectionError`, `TimeoutError` → retry via resilience module
+- `ValueError`, `KeyError` → log warning, skip item, continue
+- `Exception` → top-level only, `exc_info=True` for full traceback + Telegram alert
+
+Target: reduce 63 broad catches to ~10 top-level safety nets.
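The layering in 3-2 could look roughly like the sketch below. `retry` and `alert` are hypothetical callables standing in for the resilience module and the Telegram notifier; neither signature comes from the spec.

```python
import logging

log = logging.getLogger("worker")


def process_batch(items, handle, retry):
    """Apply handle() to each item with layered exception handling."""
    processed, skipped = 0, 0
    for item in items:
        try:
            try:
                handle(item)
            except (ConnectionError, TimeoutError):
                # Transient failure: delegate to the retry helper.
                retry(handle, item)
            processed += 1
        except (ValueError, KeyError) as exc:
            # Bad data: log a warning, skip this item, keep going.
            log.warning("skipping item %r: %s", item, exc)
            skipped += 1
    return processed, skipped


def run(items, handle, retry, alert):
    """Top-level safety net: the only place that catches bare Exception."""
    try:
        return process_batch(items, handle, retry)
    except Exception:
        log.error("unhandled error", exc_info=True)  # full traceback
        alert("unhandled error in worker")
        raise
```

Anything not listed in the inner handlers propagates to `run()`, which keeps the broad catch in exactly one place per service.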
+
+### 3-3. LLM Parsing Deduplication (`stock_selector.py`)
+Extract `_extract_json_from_text(text: str) -> list | dict | None`:
+- Tries ```` ```json ``` ```` code block extraction
+- Falls back to `re.search(r"\[.*\]", text, re.DOTALL)`
+- Falls back to raw `json.loads(text.strip())`
+Replace 3 duplicate parsing blocks with single call.
+
+### 3-4. aiohttp Session Reuse (`stock_selector.py`)
+- Add `_session: aiohttp.ClientSession | None = None` to `StockSelector`
+- Lazy-init in `_ensure_session()`, close in `close()`
+- Replace all `async with aiohttp.ClientSession()` with `self._session`
+
+---
+
+## Phase 4: API Security
+
+### 4-1. Bearer Token Authentication
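+- Add `api_auth_token: SecretStr = SecretStr("")` to `Settings` (a bare `""` default is not validated by Pydantic, so `.get_secret_value()` would fail on it)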
+- Create `dependencies/auth.py` with `verify_token()` dependency
+- Apply to all `/api/v1/*` routes via router-level `dependencies=[Depends(verify_token)]`
+- If token is empty string → skip auth (dev mode), log warning on startup
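The core check inside `verify_token()` might reduce to something like this sketch. It is framework-free on purpose: `is_authorized` is a hypothetical helper, and wiring it into a FastAPI dependency is left out.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Check an Authorization header value against the configured token.

    An empty expected_token disables auth (dev mode), as the spec describes.
    """
    if expected_token == "":
        return True
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

A dependency built on this would typically respond with `401 Unauthorized` when the function returns `False`.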
+
+### 4-2. CORS Configuration
+```python
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=settings.cors_origins.split(","),  # default: "http://localhost:3000"
+    allow_methods=["GET", "POST"],
+    allow_headers=["Authorization", "Content-Type"],
+)
+```
+Add `cors_origins: str = "http://localhost:3000"` to Settings.
+
+### 4-3. Rate Limiting
+- Add `slowapi` dependency
+- Global default: 60 req/min per IP
+- Order-related endpoints: 10 req/min per IP
+- Return `429 Too Many Requests` with `Retry-After` header
+
+### 4-4. Input Validation
+- All `limit` params: `Query(default=50, ge=1, le=1000)`
+- All `days` params: `Query(default=30, ge=1, le=365)`
+- Add Pydantic `response_model` to all endpoints (enables auto OpenAPI docs)
+- Add `symbol` param validation: uppercase, 1-5 chars, alphanumeric
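The `symbol` rule could be expressed as a small validator like the sketch below. It assumes lowercase input is normalized to uppercase rather than rejected, which the spec does not settle; `validate_symbol` is a hypothetical name.

```python
import re

_SYMBOL_RE = re.compile(r"[A-Z0-9]{1,5}")


def validate_symbol(raw):
    """Return the normalized symbol, or raise ValueError if it is invalid."""
    if not isinstance(raw, str):
        raise ValueError(f"invalid symbol: {raw!r}")
    symbol = raw.strip().upper()  # assumption: normalize instead of reject
    if _SYMBOL_RE.fullmatch(symbol) is None:
        raise ValueError(f"invalid symbol: {raw!r}")
    return symbol
```

Note that a strictly alphanumeric rule rejects class-share tickers written with a dot, such as `BRK.B`.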
+
+---
+
+## Phase 5: Operational Maturity
+
+### 5-1. GitHub Actions CI/CD
+File: `.github/workflows/ci.yml`
+
+**PR trigger** (`pull_request`):
+1. Install deps (`uv sync`)
+2. Ruff lint + format check
+3. pytest with coverage (`--cov --cov-report=xml`)
+4. Post the coverage report as a PR comment
+
+**Main push** (`push: branches: [master]`):
+1. Same lint + test
+2. `docker compose build`
+3. (Future: push to registry)
+
+### 5-2. Ruff Rules Enhancement
+```toml
+[tool.ruff.lint]
+select = ["E", "W", "F", "I", "B", "UP", "ASYNC", "PERF", "C4", "RUF"]
+ignore = ["E501"]
+
+[tool.ruff.lint.per-file-ignores]
+"tests/*" = ["F841"]
+
+[tool.ruff.lint.isort]
+known-first-party = ["shared"]
+```
+Run `ruff check --fix .` and `ruff format .` to fix existing violations, commit separately.
+
+### 5-3. Prometheus Alerting
+File: `monitoring/prometheus/alert_rules.yml`
+Rules:
+- `ServiceDown`: `service_up == 0` for 1 min → critical
+- `HighErrorRate`: `rate(errors_total[5m]) > 10` → warning
+- `HighLatency`: `histogram_quantile(0.95, sum by (le) (rate(processing_seconds_bucket[5m]))) > 5` → warning
+
+Add Alertmanager config with Telegram webhook (reuse existing bot token).
+Reference alert rules in `monitoring/prometheus.yml`.
+
+### 5-4. Code Coverage
+Add to root `pyproject.toml`:
+```toml
+[tool.pytest.ini_options]
+addopts = "--cov=shared/src --cov=services --cov-report=term-missing"
+
+[tool.coverage.run]
+branch = true
+omit = ["tests/*", "*/alembic/*"]
+
+[tool.coverage.report]
+fail_under = 70
+```
+Add `pytest-cov` to dev dependencies.
+
+---
+
+## Out of Scope
+- Kubernetes/Helm charts (premature — Docker Compose sufficient for current scale)
+- External secrets manager (Vault, AWS SM — overkill for single-machine deployment)
+- OpenTelemetry distributed tracing (add when debugging cross-service issues)
+- API versioning beyond `/api/v1/` prefix
+- Data retention/partitioning (address when data volume becomes an issue)