# Platform Upgrade Design Spec
**Date**: 2026-04-02
**Approach**: Bottom-Up (shared library → infra → services → API security → operations)
---
## Phase 1: Shared Library Hardening
### 1-1. Resilience Module (`shared/src/shared/resilience.py`)
Currently empty. Implement:
- **`retry_async()`** — tenacity-based exponential backoff + jitter decorator. Configurable max retries (default 3), base delay (1s), max delay (30s).
- **`CircuitBreaker`** — Tracks consecutive failures. Opens after N failures (default 5), stays open for configurable cooldown (default 60s), transitions to half-open to test recovery.
- **`timeout()`** — asyncio-based timeout wrapper. Raises `TimeoutError` after configurable duration.
- All decorators composable: `@retry_async() @circuit_breaker() async def call_api():`
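To make the intended semantics concrete, here is a minimal stdlib-only sketch of `retry_async()` and `timeout()` and how they compose (the real implementation would build on tenacity; the exception whitelist and jitter strategy shown are assumptions):

```python
import asyncio
import random

def retry_async(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Retry an async callable on transient errors with exponential backoff + full jitter."""
    def decorator(fn):
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await fn(*args, **kwargs)
                except (ConnectionError, TimeoutError):
                    if attempt == max_retries:
                        raise
                    # Exponential backoff capped at max_delay, with full jitter.
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    await asyncio.sleep(random.uniform(0, delay))
        return wrapper
    return decorator

def timeout(seconds: float):
    """Fail the wrapped coroutine with TimeoutError after `seconds`."""
    def decorator(fn):
        async def wrapper(*args, **kwargs):
            return await asyncio.wait_for(fn(*args, **kwargs), timeout=seconds)
        return wrapper
    return decorator

# Composition: each attempt is retried, and the whole retried call
# is bounded by the outer timeout.
@timeout(10.0)
@retry_async(max_retries=2, base_delay=0.01)
async def flaky():
    flaky.calls += 1
    if flaky.calls < 3:
        raise ConnectionError("transient")
    return "ok"

flaky.calls = 0
print(asyncio.run(flaky()))  # succeeds on the third attempt
```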
### 1-2. DB Connection Pooling (`shared/src/shared/db.py`)
Add to `create_async_engine()`:
- `pool_size=20` (configurable via `DB_POOL_SIZE`)
- `max_overflow=10` (configurable via `DB_MAX_OVERFLOW`)
- `pool_pre_ping=True` (verify connections before use)
- `pool_recycle=3600` (recycle stale connections)
Add corresponding fields to `Settings`.
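As a sketch of how those settings might flow from the environment into the engine (the helper name `engine_pool_kwargs` is hypothetical; in practice the values would live on `Settings`):

```python
import os

# Hypothetical helper: pool settings read from the environment with the
# defaults named above, passed straight through to create_async_engine().
def engine_pool_kwargs() -> dict:
    return {
        "pool_size": int(os.environ.get("DB_POOL_SIZE", "20")),
        "max_overflow": int(os.environ.get("DB_MAX_OVERFLOW", "10")),
        "pool_pre_ping": True,   # verify connections before checkout
        "pool_recycle": 3600,    # seconds before a connection is recycled
    }

# engine = create_async_engine(settings.database_url, **engine_pool_kwargs())
print(engine_pool_kwargs()["pool_size"])  # 20 unless DB_POOL_SIZE is set
```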
### 1-3. Redis Resilience (`shared/src/shared/broker.py`)
- Add to `redis.asyncio.from_url()`: `socket_keepalive=True`, `health_check_interval=30`, `retry_on_timeout=True`
- Wrap `publish()`, `read_group()`, `ensure_group()` with `@retry_async()` from resilience module
- Add `reconnect()` method for connection loss recovery
### 1-4. Config Validation (`shared/src/shared/config.py`)
- Add `field_validator` for business logic: `risk_max_position_size > 0`, `health_port` in 1024-65535, `log_level` in valid set
- Change secret fields to `SecretStr`: `alpaca_api_key`, `alpaca_api_secret`, `database_url`, `redis_url`, `telegram_bot_token`, `anthropic_api_key`, `finnhub_api_key`
- Update all consumers to call `.get_secret_value()` where needed
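The validation rules above, written framework-free so the intent is unambiguous (in `config.py` these checks would live in pydantic `@field_validator` methods; the function name and case-insensitive log-level handling are assumptions):

```python
VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}

def validate_settings(risk_max_position_size: float, health_port: int, log_level: str) -> None:
    """Raise ValueError on any rule violation; return None when all pass."""
    if risk_max_position_size <= 0:
        raise ValueError("risk_max_position_size must be > 0")
    if not 1024 <= health_port <= 65535:
        raise ValueError("health_port must be in 1024-65535")
    if log_level.upper() not in VALID_LOG_LEVELS:
        raise ValueError(f"invalid log_level: {log_level!r}")
```

For the `SecretStr` migration, the key behavior is that `str(settings.alpaca_api_key)` renders masked (`'**********'`) while `.get_secret_value()` returns the real value, which is why every consumer must be updated.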
### 1-5. Dependency Pinning
All `pyproject.toml` files: add upper bounds.
Examples:
- `pydantic>=2.8,<3`
- `redis>=5.0,<6`
- `sqlalchemy[asyncio]>=2.0,<3`
- `numpy>=1.26,<3`
- `pandas>=2.1,<3`
- `anthropic>=0.40,<1`
Run `uv lock` to generate lock file.
---
## Phase 2: Infrastructure Hardening
### 2-1. Docker Secrets & Environment
- Remove hardcoded `POSTGRES_USER: trading` / `POSTGRES_PASSWORD: trading` from `docker-compose.yml`
- Reference via `${POSTGRES_USER}` / `${POSTGRES_PASSWORD}` from `.env`
- Add comments in `.env.example` marking secret vs config variables
### 2-2. Dockerfile Optimization (all 7 services)
Pattern for each Dockerfile:
```dockerfile
# Stage 1: builder
FROM python:3.12-slim AS builder
WORKDIR /app
COPY shared/pyproject.toml shared/setup.cfg shared/
COPY shared/src/ shared/src/
RUN pip install --no-cache-dir ./shared
COPY services/<name>/pyproject.toml services/<name>/
COPY services/<name>/src/ services/<name>/src/
RUN pip install --no-cache-dir ./services/<name>
# Stage 2: runtime
FROM python:3.12-slim
RUN useradd -r -s /bin/false appuser
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
USER appuser
CMD ["python", "-m", "<module>.main"]
```
Create root `.dockerignore`:
```
__pycache__
*.pyc
.git
.venv
.env
tests/
docs/
*.md
.ruff_cache
```
### 2-3. Database Index Migration (`003_add_missing_indexes.py`)
New Alembic migration adding:
- `idx_signals_symbol_created` on `signals(symbol, created_at)`
- `idx_orders_symbol_status_created` on `orders(symbol, status, created_at)`
- `idx_trades_order_id` on `trades(order_id)`
- `idx_trades_symbol_traded` on `trades(symbol, traded_at)`
- `idx_portfolio_snapshots_at` on `portfolio_snapshots(snapshot_at)`
- `idx_symbol_scores_symbol` unique on `symbol_scores(symbol)`
### 2-4. Docker Compose Resource Limits
Add to each service:
```yaml
deploy:
resources:
limits:
memory: 512M
cpus: '1.0'
```
Strategy-engine and backtester get `memory: 1G` (pandas/numpy usage).
Add explicit networks:
```yaml
networks:
internal:
driver: bridge
monitoring:
driver: bridge
```
---
## Phase 3: Service-Level Improvements
### 3-1. Graceful Shutdown (all services)
Add to each service's `main()`:
```python
import asyncio
import signal

shutdown_event = asyncio.Event()

def _signal_handler():
    log.info("shutdown_signal_received")
    shutdown_event.set()

loop = asyncio.get_running_loop()
loop.add_signal_handler(signal.SIGTERM, _signal_handler)
loop.add_signal_handler(signal.SIGINT, _signal_handler)
```
Main loops check `shutdown_event.is_set()` to exit gracefully.
API service: add `--timeout-graceful-shutdown 30` to uvicorn CMD.
### 3-2. Exception Specialization (all services)
Replace broad `except Exception` with layered handling:
- `ConnectionError`, `TimeoutError` → retry via resilience module
- `ValueError`, `KeyError` → log warning, skip item, continue
- `Exception` → top-level only, `exc_info=True` for full traceback + Telegram alert
Target: reduce 63 broad catches to ~10 top-level safety nets.
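The layering can be sketched as follows (illustrative function names; the actual batch/item shapes come from each service):

```python
import asyncio
import logging

log = logging.getLogger(__name__)

async def process_items(items, handle):
    """Layered handling: transient errors bubble up so the resilience
    layer can retry the batch; per-item data errors are logged and
    skipped; only the service's top level keeps a broad safety net."""
    processed = 0
    for item in items:
        try:
            await handle(item)
            processed += 1
        except (ConnectionError, TimeoutError):
            raise  # let @retry_async() retry the whole batch
        except (ValueError, KeyError) as exc:
            log.warning("skipping bad item %r: %s", item, exc)
    return processed

async def demo():
    async def handle(item):
        if item < 0:
            raise ValueError("negative")
    return await process_items([1, -2, 3], handle)

print(asyncio.run(demo()))  # 2 processed, 1 skipped
```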
### 3-3. LLM Parsing Deduplication (`stock_selector.py`)
Extract `_extract_json_from_text(text: str) -> list | dict | None`:
- Tries ```` ```json ``` ```` code block extraction
- Falls back to `re.search(r"\[.*\]", text, re.DOTALL)`
- Falls back to raw `json.loads(text.strip())`
Replace 3 duplicate parsing blocks with single call.
### 3-4. aiohttp Session Reuse (`stock_selector.py`)
- Add `_session: aiohttp.ClientSession | None = None` to `StockSelector`
- Lazy-init in `_ensure_session()`, close in `close()`
- Replace all `async with aiohttp.ClientSession()` with `self._session`
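The lifecycle looks like this (a `FakeSession` stub stands in for `aiohttp.ClientSession` so the sketch is runnable without aiohttp; the real class has the same `closed`/`close()` surface):

```python
import asyncio

class FakeSession:
    """Stand-in for aiohttp.ClientSession."""
    def __init__(self):
        self.closed = False
    async def close(self):
        self.closed = True

class StockSelector:
    def __init__(self):
        self._session: FakeSession | None = None

    async def _ensure_session(self) -> FakeSession:
        # Lazy-init: one session reused across all requests,
        # recreated only if it was closed.
        if self._session is None or self._session.closed:
            self._session = FakeSession()
        return self._session

    async def close(self):
        if self._session is not None and not self._session.closed:
            await self._session.close()
```

Reusing one session keeps aiohttp's connection pool warm instead of paying TLS handshakes on every call.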
---
## Phase 4: API Security
### 4-1. Bearer Token Authentication
- Add `api_auth_token: SecretStr = SecretStr("")` to `Settings` (an explicit `SecretStr` default, since pydantic does not validate defaults into `SecretStr` unless told to)
- Create `dependencies/auth.py` with `verify_token()` dependency
- Apply to all `/api/v1/*` routes via router-level `dependencies=[Depends(verify_token)]`
- If token is empty string → skip auth (dev mode), log warning on startup
### 4-2. CORS Configuration
```python
app.add_middleware(
CORSMiddleware,
allow_origins=settings.cors_origins.split(","), # default: "http://localhost:3000"
allow_methods=["GET", "POST"],
allow_headers=["Authorization", "Content-Type"],
)
```
Add `cors_origins: str = "http://localhost:3000"` to Settings.
### 4-3. Rate Limiting
- Add `slowapi` dependency
- Global default: 60 req/min per IP
- Order-related endpoints: 10 req/min per IP
- Return `429 Too Many Requests` with `Retry-After` header
### 4-4. Input Validation
- All `limit` params: `Query(default=50, ge=1, le=1000)`
- All `days` params: `Query(default=30, ge=1, le=365)`
- Add Pydantic `response_model` to all endpoints (enables auto OpenAPI docs)
- Add `symbol` param validation: uppercase, 1-5 chars, alphanumeric
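The symbol rule as a standalone helper (`normalize_symbol` is a hypothetical name; in the API layer the same rule maps onto a `Path`/`Query` `pattern=` constraint):

```python
import re

_SYMBOL_RE = re.compile(r"^[A-Z0-9]{1,5}$")

def normalize_symbol(symbol: str) -> str:
    """Uppercase, 1-5 chars, alphanumeric; raise ValueError otherwise."""
    symbol = symbol.strip().upper()
    if not _SYMBOL_RE.fullmatch(symbol):
        raise ValueError(f"invalid symbol: {symbol!r}")
    return symbol

print(normalize_symbol(" aapl "))  # AAPL
```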
---
## Phase 5: Operational Maturity
### 5-1. GitHub Actions CI/CD
File: `.github/workflows/ci.yml`
**PR trigger** (`pull_request`):
1. Install deps (`uv sync`)
2. Ruff lint + format check
3. pytest with coverage (`--cov --cov-report=xml`)
4. Upload coverage to PR comment
**Main push** (`push: branches: [master]`):
1. Same lint + test
2. `docker compose build`
3. (Future: push to registry)
### 5-2. Ruff Rules Enhancement
```toml
[tool.ruff.lint]
select = ["E", "W", "F", "I", "B", "UP", "ASYNC", "PERF", "C4", "RUF"]
ignore = ["E501"]
[tool.ruff.lint.per-file-ignores]
"tests/*" = ["F841"]
[tool.ruff.lint.isort]
known-first-party = ["shared"]
```
Run `ruff check --fix .` and `ruff format .` to fix existing violations, commit separately.
### 5-3. Prometheus Alerting
File: `monitoring/prometheus/alert_rules.yml`
Rules:
- `ServiceDown`: `service_up == 0` for 1 min → critical
- `HighErrorRate`: `rate(errors_total[5m]) > 10` → warning
- `HighLatency`: `histogram_quantile(0.95, rate(processing_seconds_bucket[5m])) > 5` → warning (`histogram_quantile` needs the `_bucket` series rated over a window)
Add Alertmanager config with Telegram webhook (reuse existing bot token).
Reference alert rules in `monitoring/prometheus.yml`.
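The rules file might look like this (metric names taken from the spec; durations and group name are assumptions):

```yaml
groups:
  - name: trading_alerts
    rules:
      - alert: ServiceDown
        expr: service_up == 0
        for: 1m
        labels:
          severity: critical
      - alert: HighErrorRate
        expr: rate(errors_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(processing_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
```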
### 5-4. Code Coverage
Add to root `pyproject.toml`:
```toml
[tool.pytest.ini_options]
addopts = "--cov=shared/src --cov=services --cov-report=term-missing"
[tool.coverage.run]
branch = true
omit = ["tests/*", "*/alembic/*"]
[tool.coverage.report]
fail_under = 70
```
Add `pytest-cov` to dev dependencies.
---
## Out of Scope
- Kubernetes/Helm charts (premature — Docker Compose sufficient for current scale)
- External secrets manager (Vault, AWS SM — overkill for single-machine deployment)
- OpenTelemetry distributed tracing (add when debugging cross-service issues)
- API versioning beyond `/api/v1/` prefix
- Data retention/partitioning (address when data volume becomes an issue)