armada · 968546796a867fbc9423ca4a19bc36a605b00cf4 · PLN / Tidal

perf(clap): encode the text tower once, audio-only forward per folder · 96854679

The ~88 fine descriptors were re-encoded through RoBERTa on every folder's
forward pass, a fixed cost that made batch size irrelevant. Now cache the
normalized text embeddings + logit scale at load; per-folder forwards run only
the audio tower (get_audio_features) and a matmul against the cached text
embeds. ~1.8x faster (14s -> 7.7s/folder; text was ~45% of per-folder cost).
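The caching pattern described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code from this commit: random vectors stand in for the text and audio tower outputs, and `CachedTextScorer`, `full_forward`, the folder size of 4 clips, and the logit scale of 10.0 are all hypothetical names and values chosen for the example. It assumes the logit scale is a plain scalar multiplier applied to cosine similarities.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit L2 norm."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class CachedTextScorer:
    """Hypothetical helper: normalize the text embeddings once, reuse per folder."""

    def __init__(self, raw_text_embeds, logit_scale):
        # One-time cost at load: normalized text embeddings and the scale.
        self.text_embeds = l2_normalize(raw_text_embeds)
        self.logit_scale = float(logit_scale)

    def score(self, raw_audio_embeds):
        # Per-folder cost: normalize the audio embeddings, then one matmul.
        audio = l2_normalize(raw_audio_embeds)
        return self.logit_scale * (audio @ self.text_embeds.T)


def full_forward(raw_text_embeds, raw_audio_embeds, logit_scale):
    """Reference path that redoes the text-side work on every call."""
    text = l2_normalize(raw_text_embeds)
    audio = l2_normalize(raw_audio_embeds)
    return logit_scale * (audio @ text.T)


# Stand-ins for tower outputs (assumed shapes: 88 descriptors, 512-d embeddings).
rng = np.random.default_rng(0)
raw_text = rng.normal(size=(88, 512))
scorer = CachedTextScorer(raw_text, logit_scale=10.0)

# One "folder" of 4 clips; only the audio side is processed here.
folder_audio = rng.normal(size=(4, 512))
cached = scorer.score(folder_audio)
reference = full_forward(raw_text, folder_audio, 10.0)
max_abs_diff = float(np.max(np.abs(cached - reference)))
```

Comparing `cached` against `reference` mirrors the kind of equivalence check the commit message reports: the cached path should change the cost per folder, not the logits.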

API note (transformers 5.10.2): get_text_features / get_audio_features return a
model-output object whose .pooler_output IS the projected 512-d joint embedding;
verified to match a full ClapModel forward to within 1.6e-7. No classification
change.

authored Jun 07, 2026


Name
escales
manifeste
semaphore
tasks
tide-table
ui
.gitignore
DESIGN.md
PRODUCT.md
README.md
serve.py

README.md