Commit ce494fda by PLN (Algolia)

feat(tfidf): index synths as sound sources; identify Take35 = Septième Armée

PLN: "samples and synths are the same class for fingerprinting." sample_tfidf now
indexes synthdef names (SC synthdefs/ + SCLOrkSynths quark + SuperDirt builtins)
alongside Dirt-Samples, tags each sound sample|synth, and persists kinds. vocab
871 sounds (293 samples + 63 synths used). septieme_armee signature now surfaces
moogBass/FMRhodes1/bassWarsaw — but all common (no rare tell): its identity is the
orbit-arrangement (the SNA riff), not a signature sound. L3 needs both signals.

Take35 identified by blind ear-test as Septième Armée (Seven Nation Army cover,
septieme_armee.tidal, 4:35), NOT the 38C3 "Pitbul Punk" the ±3d date-join guessed.
Corroborated by orbit→sound map (d4 bassWarsaw = the riff bass, etc.). take_gig_map
corrected; performance_notes logs the find + cover-license caveat.
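
A minimal sketch of the scoring described above, not the script itself. The document-frequency and `tf * idf` steps mirror the diff below. The idf formula is not visible in the shown hunks, so `log(n / df)` is an assumption, and the track names and sounds are made-up toy data.

```python
import math
from collections import Counter


def tfidf_signatures(docs, kind, top=6):
    """docs: {track: Counter(sound -> tf)}; kind: {sound: 'sample' | 'synth'}."""
    n = len(docs)
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())  # each track counts once per sound
    # ASSUMPTION: plain log(n / df); the real idf lives outside the shown hunks.
    idf = {s: math.log(n / d) for s, d in df.items()}
    sigs = {}
    for track, counts in docs.items():
        scored = sorted(((s, tf * idf[s]) for s, tf in counts.items()),
                        key=lambda kv: -kv[1])
        sigs[track] = [{"sample": s, "score": round(v, 3), "df": df[s],
                        "kind": kind.get(s, "?")} for s, v in scored[:top]]
    return sigs, df


# Toy corpus (hypothetical): samples and synths share one vocabulary.
docs = {
    "a.tidal": Counter({"bd": 4, "moogBass": 2}),
    "b.tidal": Counter({"bd": 3, "sn": 1}),
    "c.tidal": Counter({"bd": 1, "moogBass": 1, "rareSynth": 1}),
}
kind = {"bd": "sample", "sn": "sample", "moogBass": "synth", "rareSynth": "synth"}
sigs, df = tfidf_signatures(docs, kind)
```

In this toy corpus `bd` appears in every track, so its idf is zero and it never leads a signature. That is the same reason a track built only from common sounds has no rare tell.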
parent 78f564ff
@@ -131,10 +131,11 @@ class SampleHit(BaseModel):
     sample: str
     score: float  # tf-idf
     df: int  # document frequency across corpus
+    kind: str = "?"  # 'sample' | 'synth' — same class for fingerprinting
 class TrackSignature(BaseModel):
-    n_samples: int
+    n_samples: int  # distinct sound sources (samples ∪ synths)
     tf: dict[str, int]
     top_tfidf: list[SampleHit] = Field(default_factory=list)
@@ -143,6 +144,9 @@ class TfidfReport(BaseModel):
     corpus: str
     n_docs: int
     vocab_size: int
+    n_samples: int = 0  # distinct samples used across corpus
+    n_synths: int = 0  # distinct synths used across corpus
     df: dict[str, int]
     idf: dict[str, float]
+    kinds: dict[str, str] = Field(default_factory=dict)  # sound -> sample|synth
     tracks: dict[str, TrackSignature]
@@ -146,3 +146,28 @@ Boundaries (machine-readable, merged + conflict-resolved):
 - ⚠️ The old "Hamburg/Take87" PunkAChien was a **misidentification** (actually La Fin
   de l'Insouciance → Liquid Finale @ 39C3). Do NOT A/B against it; real second take =
   Take35 (38C3), plus Take36 (the 61:46 Toilet set) for an in-set version.
+
+---
+
+## Take35 — "Septième Armée" (Seven Nation Army cover), 2024-12-25
+
+**Blind-test identification (2026-06-05).** Hunting the 38C3 PunkAChien, the take the
+date-join labeled "Pitbul Punk/38C3 (±3d)" turned out — on PLN's ears, blind — to be
+**Septième Armée** (`live/collab/raph/septieme_armee.tidal`), a Seven Nation Army-riff
+cover at 90 BPM. Objective corroboration: the take's active orbits (1,2,3,4,5,8,9;
+6/7/10/11/12 silent) match the `.tidal` orbit→sound map exactly, incl. **d4 `bassWarsaw`**
+= the SNA bass riff (Take35 orbit-04 was 89% sub — the bass), d5 `FMRhodes1`, d9 `moogBass`.
+
+Lessons:
+- **Date-joins lie; the ear is the oracle** (3rd metadata miss caught by ear — see
+  [[feedback_locate_matrix_method]]). Take35 ≠ PunkAChien; take_gig_map corrected.
+- **Fingerprint must include SYNTHS, not just samples:** `bassWarsaw`/`moogBass`/
+  `FMRhodes1` are the identity here and TF-IDF (Dirt-Samples-only) was blind to them.
+  Fixed: sample_tfidf now indexes all sound-context tokens (samples ∪ synths).
+- **PLN reaction:** "great one — good single, or at least a SoundCloud ébauche to push."
+  ⚠️ It's a **cover** (White Stripes / Jack White) → needs a mechanical/cover license for
+  paid/DSP release; SoundCloud ébauche is lower-risk. Treat like the other covers bucket.
+
+**Open:** the real 38C3 PunkAChien ("Pitbul Punk") is still unfound — Take35 eliminated.
+Candidates left: Take36 / the 61:41 "House of Tea" set, Take37/38 (Chaos Music Club),
+or it was never recorded as a standalone. Hunt deferred behind the bleed-detector build.
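
The orbit corroboration in the Take35 note could be checked mechanically along these lines. This is a sketch under assumptions, not code from the repository: `compare_orbits` and `ORBIT_SOUNDS` are hypothetical names. Only the d4, d5 and d9 sounds are stated in the note, so the other orbits carry a `"?"` placeholder.

```python
def compare_orbits(active_orbits, orbit_sounds):
    """Compare the orbits heard in a take with the orbits a .tidal file defines."""
    active = set(active_orbits)
    expected = set(orbit_sounds)
    return {
        "match": active == expected,
        "missing": sorted(expected - active),     # defined in the file, silent in the take
        "unexpected": sorted(active - expected),  # audible in the take, absent from the file
    }


# Orbit -> sound map for septieme_armee.tidal. Only 4, 5 and 9 are named in the
# note; the remaining entries are placeholders (ASSUMPTION: dN plays on orbit N).
ORBIT_SOUNDS = {1: "?", 2: "?", 3: "?", 4: "bassWarsaw", 5: "FMRhodes1",
                8: "?", 9: "moogBass"}

take35_active = [1, 2, 3, 4, 5, 8, 9]  # orbits 6/7/10/11/12 were silent
result = compare_orbits(take35_active, ORBIT_SOUNDS)
```

An exact set match is only corroboration. Two tracks can share an orbit layout, which is why the note treats the blind ear test as the deciding evidence.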
This source diff could not be displayed because it is too large. You can view the blob instead.
@@ -25,7 +25,7 @@ _mtime≈gig date · duration: SET≥25m / track / sketch / empty(skip) · gig m
 | 2024-11-24 | Take32 | 3:51 | 13 | track | |
 | 2024-12-01 | Take33 | 2:02 | 13 | sketch | |
 | 2024-12-20 | Take34 | 2:27 | 13 | sketch | TOPLAP Solstice 2024 (±1d) |
-| 2024-12-25 | Take35 | 4:35 | 13 | track | [38C3] Secret Toilet Rave (±3d) |
+| 2024-12-25 | Take35 | 4:35 | 13 | track | **Septième Armée** (septieme_armee.tidal) — EAR-VERIFIED ✓ blind test; NOT Pitbul Punk/38C3 (±3d date-join was wrong) |
 | 2024-12-28 | Take36 | 61:46 | 12 | SET | [38C3] Secret Toilet Rave |
 | 2024-12-29 | Take37 | 11:40 | 12 | track | [38C3] Chaos Music Club |
 | 2024-12-29 | Take38 | 14:27 | 13 | track | [38C3] Chaos Music Club |
......
@@ -28,7 +28,14 @@ sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "armada" / "tide
 from models import TfidfReport  # noqa: E402
 CORPUS = Path("/home/pln/Work/Sound/Tidal")
-DIRT = Path("/home/pln/.local/share/SuperCollider/downloaded-quarks/Dirt-Samples")
+SC = Path("/home/pln/.local/share/SuperCollider")
+DIRT = SC / "downloaded-quarks" / "Dirt-Samples"
+# synthdef sources — a synth is a sound source too (PLN: same class for fingerprinting)
+SYNTH_DIRS = [SC / "synthdefs", SC / "downloaded-quarks" / "SCLOrkSynths" / "SynthDefs"]
+SUPERDIRT_BUILTIN = {"superpiano", "supermandolin", "supergong", "superpwm",
+                     "superhammond", "supersaw", "supersquare", "super808", "superchip",
+                     "superhoover", "superzow", "supernoise", "superreese", "superfork",
+                     "supercomparator", "supervibe", "soskick", "sossnare", "default"}
 OUT = CORPUS / "armada" / "tide-table" / "sample_tfidf.json"
 # split a quoted mininotation string into candidate tokens. KEEP '_' and digits
@@ -42,13 +49,24 @@ SPLIT = re.compile(r"[\s\[\](){}<>*/.~?!@|,;:%+\-]+")
 SOUND_CTX = re.compile(r'(?:\bsound\b|\bs\b|#)\s*"([^"]*)"')
-def vocab():
-    """Authoritative sample names = entries under local Dirt-Samples."""
-    return {p.name for p in DIRT.iterdir() if not p.name.startswith(".")}
+def sound_vocab():
+    """Authoritative sound-source names = Dirt-Samples folders ∪ synthdef names.
+    Returns (vocab_set, kind_map) where kind ∈ {'sample','synth'}."""
+    kind = {}
+    for p in DIRT.iterdir():
+        if not p.name.startswith("."):
+            kind[p.name] = "sample"
+    for d in SYNTH_DIRS:
+        if d.exists():
+            for p in d.rglob("*.scd"):
+                kind.setdefault(p.stem, "synth")  # don't override a sample name
+    for s in SUPERDIRT_BUILTIN:
+        kind.setdefault(s, "synth")
+    return set(kind), kind
 def samples_in(text, vocab):
-    """Multiset of sample tokens present in one .tidal, sound-context only."""
+    """Multiset of sound tokens (samples ∪ synths) in one .tidal, sound-context only."""
     counts = Counter()
     for q in SOUND_CTX.findall(text):
         for tok in SPLIT.split(q):
@@ -58,7 +76,7 @@ def samples_in(text, vocab):
 def build():
-    voc = vocab()
+    voc, kind = sound_vocab()
     files = sorted(CORPUS.rglob("*.tidal"))
     docs = {}  # rel_path -> Counter(sample -> tf)
     df = Counter()  # sample -> # docs containing it
@@ -82,12 +100,16 @@ def build():
         tfidf = {s: round(tf * idf[s], 3) for s, tf in c.items()}
         top = sorted(tfidf.items(), key=lambda kv: -kv[1])[:6]
         tracks[rel] = {"n_samples": len(c), "tf": dict(c),
-                       "top_tfidf": [{"sample": s, "score": v, "df": df[s]}
-                                     for s, v in top]}
+                       "top_tfidf": [{"sample": s, "score": v, "df": df[s],
+                                      "kind": kind.get(s, "?")} for s, v in top]}
+    used_kinds = {s: kind.get(s, "?") for s in df}
     return {
         "corpus": str(CORPUS), "n_docs": n, "vocab_size": len(voc),
+        "n_samples": sum(1 for k in used_kinds.values() if k == "sample"),
+        "n_synths": sum(1 for k in used_kinds.values() if k == "synth"),
         "df": dict(df.most_common()),
         "idf": dict(sorted(idf.items(), key=lambda kv: -kv[1])),
+        "kinds": used_kinds,
         "tracks": tracks,
     }
@@ -96,14 +118,16 @@ def report(data, args):
     n = data["n_docs"]
     df = data["df"]
     idf = data["idf"]
+    kinds = data.get("kinds", {})
+    K = lambda s: kinds.get(s, "?")
     if args.sample:
         s = args.sample
         if s not in df:
-            print(f"'{s}' not used in any .tidal (or not a Dirt-Samples name).")
+            print(f"'{s}' not used in any .tidal (or not a known sample/synth name).")
             return
         users = [(rel, t["tf"][s]) for rel, t in data["tracks"].items() if s in t["tf"]]
         users.sort(key=lambda x: -x[1])
-        print(f"\n■ '{s}' df={df[s]}/{n} docs idf={idf[s]} "
+        print(f"\n■ '{s}' [{K(s)}] df={df[s]}/{n} docs idf={idf[s]} "
               f"({'RARE tell' if df[s] <= 3 else 'common' if df[s] >= 20 else 'mid'})")
         print(f" used in {len(users)} tracks:")
         for rel, tf in users[:25]:
@@ -112,24 +136,25 @@ def report(data, args):
     if args.track:
         hits = {rel: t for rel, t in data["tracks"].items() if args.track in rel}
         for rel, t in hits.items():
-            print(f"\n■ {rel} ({t['n_samples']} distinct samples)")
+            print(f"\n■ {rel} ({t['n_samples']} distinct sounds)")
             print(" signature (TF-IDF):")
             for h in t["top_tfidf"]:
                 d = df[h["sample"]]
-                print(f" {h['score']:>7} {h['sample']:<20} (df={d}, "
+                print(f" {h['score']:>7} {h['sample']:<18} {h.get('kind','?'):<6} (df={d}, "
                       f"{'rare' if d <= 3 else 'common' if d >= 20 else 'mid'})")
         if not hits:
             print(f"no track matching '{args.track}'")
         return
-    print(f"\nCorpus: {n} .tidal docs · vocab {data['vocab_size']} sample names\n")
+    print(f"\nCorpus: {n} .tidal docs · vocab {data['vocab_size']} sound names "
+          f"({data.get('n_samples','?')} samples + {data.get('n_synths','?')} synths used)\n")
     rare = [s for s, d in df.items() if d == 1]
-    print(f"■ RARE TELLS (df=1, used in exactly one track) — {len(rare)} samples")
-    for s in list(df)[::-1][:25]:
+    print(f"■ RARE TELLS (df=1, one track only) — {len(rare)} sounds; sample of them:")
+    for s in list(df)[::-1][:22]:
         if df[s] <= 2:
-            print(f" df={df[s]} idf={idf[s]:<6} {s}")
+            print(f" df={df[s]} idf={idf[s]:<6} [{K(s):<6}] {s}")
     print(f"\n■ COMMON / ubiquitous (high df, weak for ID):")
-    for s, d in list(df.items())[:18]:
-        print(f" df={d:>4} idf={idf[s]:<6} {s}")
+    for s, d in list(df.items())[:16]:
+        print(f" df={d:>4} idf={idf[s]:<6} [{K(s):<6}] {s}")
 def main():
......