Alloy — Known limitations (TestFlight beta)
Things you might run into while testing. None of these are bugs in the "this should never happen" sense — they're the parts of the shipped MVP that we know are rough.
Recenter + Completion plan (2026-05-13)
The build between commits 4dfff54 (Phase 27) and a8f2e32 (RC-10)
shifted significantly per the recenter plan in
docs/plans/RECENTER_COMPLETION_PLAN.md. Surfaces hidden so far —
all behind AlloyFeatures flags or build-config gates, all
reversible with a one-line flip:
- Ask Alloy tab (RC-4). The 4th primary tab was retired. Ask is reachable only from inside the Today recommendation's Why This sheet ("Ask Alloy about this"). The AskMessage model, worker route, and Siri intent are unchanged.
- Today PR + deload cards (RC-5). The full insight cards are
hidden. The highest-confidence single signal still surfaces as a
one-line row above the recommendation (deload ≥0.5 conf >
top PR ≥0.7 > yesterday recap). Engine methods (
prCandidates,deloadSignal) still run. - Today template chips (RC-5). The 3-chip horizontal carousel
is hidden. A single quiet "Recent:
" row above the recommendation lets you repeat the most-recently-used template in one tap. Hidden when no template has a lastUsedAt. - Today tweak chips (RC-5). The always-visible row of four bias chips moved behind an "Adjust" disclosure under the recommendation. Collapsed by default; tap to reveal.
- Today yesterday recap card (RC-5). Replaced by a single tappable line in the secondary-signal row (lowest priority — only shows when there's no deload or top PR to surface).
- "Thinking…" loading card (RC-5). Replaced by a subtle shimmering placeholder with no text (D-036).
- Achievements library (RC-6). The Settings → Training → Achievements link is hidden and the post-finish CelebrationSheet no longer surfaces. The PR banner inside Active Session stays (quiet, in-the-moment recognition is the original-concept intent).
- Learned streaming banner (RC-7). The "Alloy is thinking…" / "Alloy is noticing…" live token tail is hidden. Memory refresh still runs in the background; the small toolbar spinner is the only loading indicator.
- AI Use spend dashboard + per-row cost/model/tokens (RC-8). The Spend section + model-family pill + raw model name + per-call latency / tokens / cost line all moved to Settings → Developer → AI Spend (DEBUG-only). User-facing AI Use keeps provider name, CLOUD / LOCAL / CACHED pills, response status, data shape, retention copy, ZDR, policy version.
- Cloudflare debug payload toggle in Release builds (RC-8).
Compiled out via
#if DEBUGso shipped binaries can't enable raw payload retention from the UI. Debug builds still expose it. - HKWorkout auto-write + calorie estimate (RC-9). Workout
writes are opt-in via a sub-toggle "Write workouts to Health";
default off. The prior
activeEnergyBurnedestimate (~5 cal/min) is gone entirely — no fake precision against Activity Rings.
Trust surfaces that picked up new behaviour (not hidden):
- Body weight on History (RC-9) — 30-day Swift Charts line replaced by a single "Body weight · 82.4 lb · 3 days ago" row.
- "Training pressure" in user-facing prose (RC-9) — rewritten to "your load" / "recent training load" everywhere user-facing. Internal engine + wire-format field names are unchanged.
- Bodyweight unit preference (RC-10) —
LogBodyWeightSheetreadsTrainingProfile.preferredUnit. New profiles pick.kgoutside US locales,.lbfor US. - Current Focus (RC-10) — new 4-week-block picker in
Preferences (Strength / Hypertrophy / Endurance / Maintenance /
No current focus). Threaded through the recommend payload as
profile.currentFocus; the worker prompt has a dedicated CURRENT FOCUS section explaining how to bias session selection. - Cardio progression hint (RC-10) — Active Session shows
"Last time: 10 min Moderate" for cardio-shaped exercises
(progressionType
.time/.distance). No directional suggestion — cardio without HR/RPE data is too noisy to extrapolate. - Avoid-list quick-add (RC-10) — long-press an exercise row in
the Today recommendation's "Preview exercises" disclosure to
add it to
TrainingProfile.avoidedMovements. Today refreshes automatically on the next task tick.
What's still on the same defaults (handled in later RC phases): Learned screen placement + badge, CloudKit sync default-on, SIWA in onboarding, full App Attest production hardening.
Testing infrastructure
- iOS UI tests are flake-prone in parallel-clone mode. xcodebuild
spawns multiple simulator clones for the UI test runner; one of
them periodically fails to initialise with "The test runner failed
to initialize for UI testing. (Underlying Error: Timed out waiting
for AX loaded notification)" and the suite reports
** TEST FAILED **even when every individual test case passed on the surviving clones. Mitigation: re-run UI tests in isolation (-only-testing:alloyUITests/...) and they pass cleanly. Same pattern observed in RC-2, RC-4, and RC-6. alloyUITests.testLaunchPerformance()occasionally reports "Received unexpected number of metrics: 0 in iteration with index 1" on the first run. XCUITest's metric collection drops one iteration's reading; retries pass. Treated as a flake, not a regression.
Worker safety floor
- Defensive scrubbing on
evidenceIds[]. Phase 7b addedalternatives[].titleandalternatives[].whyNotto the recommend response, and the model has been observed echoingprofile.goalverbatim intoevidenceIds[]as"goal:<freeform>". RC-2.5 extendedrecommend-validator.tsto scrub all of these; RC-6.5 tightened the recommend system prompt to instruct the model thatevidenceIds[]entries are reference keys ("lastWorkouts:5") and not free text. Belt-and-suspenders — prompt avoids the leak, validator catches anything that slips. - Hype / shame stem matching (RC-10). The prior word-boundary
regex
\b(crush|...)\bmissed noun forms like "Crusher" / "Smasher" that the model occasionally puts inalternatives[].title. Switched to stem-based\b(crush|dominat|...)\w*so all derivations (crushing, crusher, smashing, smasher, weakness, lazily, excuses) scrub. Caught live in the RC-10 eval rerun on a "Leg Day Crusher" alternative title.
AI behavior
- Every Quick Log save calls the cloud (when Cloud AI is on). D-013 (inverted 2026-05-10): the parse pipeline is cloud-first now — Grok → on-device Foundation Models → deterministic regex. So every Save in Quick Log waits on the cloud round-trip rather than the regex parsing locally. If you save a lot of workouts in one session and hit the per-day quota (50 anonymous / 200 SIWA), remaining saves fall through to the on-device + deterministic floor — they still save, but the quality is the floor's, not the AI's.
- Cloud AI latency: each call to xAI Grok takes 25–60 seconds because the model runs internal reasoning on every request. The Today screen renders the local recommendation immediately and swaps in the AI version when it returns. Quick Log shows a clear "Saving with AI…" state during the wait.
- AI memories are proposals, not facts. They're scrubbed for hype, shame, and diagnosis but the underlying model can still miss patterns or invent ones. Accept / dismiss / edit each.
- Foundation Models on-device only works on Apple Intelligence capable hardware (A17 Pro / M-series). On unsupported devices the ambiguous-parse path goes straight to manual confirmation.
Identity and sync
- Sign in with Apple token expiry: Apple's identity tokens expire about 10 minutes after sign-in. Alloy detects this on the iOS side and surfaces a "Refresh credential" button in Settings. Until you tap it, you'll be on the per-device anonymous quota rather than the higher per-account quota.
- App Attest: full device-attestation flow is scaffolded but the dev-bypass path is what's active during the beta. The production attestation flow ships before the public release.
- CloudKit sync assumes the same iCloud account on each device. Toggling sync needs an app relaunch — Settings shows a "Restart Alloy to apply" hint after a flip.
Quotas
- Per-device quota during the beta is generous (500 cloud calls / day) so you don't have to think about it. Production ships at 30/day for anonymous, 200/day for SIWA-authenticated.
- Daily $-cap kill switch trips at $10/day on staging. If you see "Cloud AI paused for the day" in Settings ▸ AI Use, that's what fired. It resets at UTC midnight.
UI polish
- The recommendation card sometimes shows "AI · LOCAL TEMPLATE" when the cloud's exercise IDs don't all map onto our seeded graph — so you get the AI's narrative, but Start launches the local Upper Strength template. This will get smoother as the seed graph expands.
- The iOS "Forget AI memories only" navigation pops back to Settings after a beat — that's intentional, but the brief flash of the destination screen is jarring.
- Confirmation sheet picker for unrecognized exercises uses an inline Menu rather than a fully searchable picker. Fine for beta, replace with the full graph picker post-launch.
Documents
- The privacy policy quotes xAI's stated retention as of the
policy version stamp at the top of
PRIVACY_POLICY.md. If xAI changes their terms, the AI Use audit row stays pinned to the version that applied at the time of each call — historical accuracy is intentional.
What we want most from beta feedback
- Where does the local parser miss? Send the input string and what you expected.
- Where does the cloud AI lean lazy / hype / speculative despite the safety validator? The safety eval suite catches the obvious cases; the long tail is what beta testers see best.
- Anything in the Settings ▸ Privacy flows that surprised you (good or bad).
- Performance on older devices — anything worse than 3 seconds to open Today is worth reporting.