Can Codex use Gemini as ears? Driving Ableton via MCP / Vanja Oljaca

All my experience using AI as an Ableton coach has been bad. Was it my fault, or the coach’s fault? I hooked up two AIs to see whether they performed any better without a human in the loop: Codex got hands in Ableton, and Gemini was asked to be the ears.

These are AI-generated Ableton exports from the experiment, not reference tracks. The song attempts are the visible artifact. The useful part is the evaluator loop: Gemini turned sound into critique, Codex used that text to drive Ableton, and the process kept needing patches because fluent listening notes were not the same thing as reliable judgement.

Main story, with sound

Codex could operate Ableton through MCP: create tracks, edit MIDI, place clips in Arrangement, add devices, play the set, export MP3s. Gemini could listen to those MP3s and write back musical observations. The interesting part was not “AI made a song.” It was watching the judging harness learn what counted as evidence.

Before the real Ableton ladder, there was a fake win: Codex discovered it could just code the song directly in Python. It made standalone WAV/MP3 previews from a MIDI-ish plan, sent those to Gemini, and got feedback. Funny problem: that feedback was about the Python-rendered song, not necessarily about anything playing out of Ableton. The reveal was basically, “yes, there is stuff in the Live set, but the critic has been listening to a different artifact.”

Missing artifact: the retired Python procedural preview

The notes name old files like maluma_6of10_v2_procedural_preview.mp3 and maluma_7of10_v8_lowend_width_preview.mp3, but those preview generators and exports were later deleted during cleanup because they were non-authoritative. This is the one sound slot I would restore if we recover the old file; the point of it is the reveal, not that it was a valid Ableton export.

Maluma V9: stock acoustic attempt, still “MIDI karaoke” territory

Maluma V12: warmer device/effects pass, still synthetic/rompler-like

First failure: the loop could score the wrong thing. Some Gemini feedback came from procedural previews instead of verified Ableton Arrangement exports. A prior “9/10” and a later procedural “7/10” had to be treated as contaminated, then rerun against real audio. The rubric changed from “does the model like this?” to “did the model actually hear the file Codex just made, from the system we claim made it?”

Maluma V17: Gemini 6/10, then human veto and Gemini audit to 3/10

Maluma V24: later patched ladder after hat/kick/snare gates

Second failure: the model could hear something real and still grade it wrong. V17 got a Gemini 6/10, but my listen rejected the score: the timing was random, the hi-hat was absurd, and Gemini had mistaken randomness for humanized timing. Codex fed that failure back into the rubric. Later checks split structure from production, added hat/kick/snare gates, and forced Gemini to revise scores when the critique missed obvious audible problems.

Chencho V2: real upload, scored 2.5/10 as rigid electronic pop

Chencho V3: dembow skeleton becomes more audible, score rises to 3.5/10

Chencho V4: density reset, dembow becomes less perceptible, score falls to 2/10

Chencho V36: drums audible, but Gemini calls the grammar house/IDM

Third failure: “Gemini has ears” had to become “Gemini needs an access test.” It sometimes judged only part of a 65-second file, silently fell back to Flash-Lite, or produced useful observations with a bad final score. Codex changed the process: observation first, comparison second, judgement last. When V36 finally made drums audible, Gemini still called the grammar house/IDM. That was useful because it named the actual failure: not “add drums,” but “the listener hears the wrong musical language.”

Bachata V42: source-proof groove correction, scored 6.5/10

Bachata V57: Gemini inferred bongos; follow-up said they were not actually audible

Fourth failure: the model could hallucinate the instrument source. In the bachata run, Gemini inferred bongos from context even when the actual audible bongo source was missing, masked, or silently exported wrong. Human feedback shifted the key question from “does this read as bachata?” to “can the listener actually hear the source it claims to hear?” That produced the final protocol: verify the export, ask for blind observation, save the answer, then show the build log only as an audit.

Vanja commentary slot:

Optional: add your read of the workflow. Recommended category: where you intervened, which Gemini failures were funny versus disqualifying, and what you changed about the prompts/rubric because of human listening.

Raw runs

Song 1: Maluma / "Felices los 4" attempt

What the loop was trying to prove

The initial question was whether the coach was bad because the model could not listen well, or because the human-in-the-loop workflow was weak.
Codex used Ableton MCP to create and modify tracks, MIDI clips, Session rows, Arrangement placements, devices, playback, and exports.
Gemini supplied text critique from audio uploads.
Early iterations used procedural previews for quick listening.
Those previews were retired as authoritative evidence because they could diverge from the Ableton session.
The reliable export preflight became: stop Session clips, click Back to Arrangement, export the intended range, then verify duration and non-silence.
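The last step of that preflight, verifying duration and non-silence, can be approximated in a few lines. This is a minimal sketch, not the check Codex actually ran: it assumes the export has been converted to a 16-bit PCM WAV (the runs exported MP3s, which the Python standard library cannot decode), and the function name, tolerance, and silence floor are all hypothetical.

```python
import math
import struct
import wave


def check_wav_export(path, expected_seconds, tolerance_seconds=0.5,
                     silence_floor_db=-60.0):
    """Return (ok, duration_seconds, peak_dbfs) for a 16-bit PCM WAV file.

    ok is True only when the duration is within tolerance of the intended
    length AND the peak level is above the (assumed) silence floor.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frame_count = wav.getnframes()
        duration = frame_count / wav.getframerate()
        raw = wav.readframes(frame_count)

    # Interleaved little-endian signed 16-bit samples, all channels together.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    peak = max((abs(s) for s in samples), default=0)
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")

    duration_ok = abs(duration - expected_seconds) <= tolerance_seconds
    audible = peak_dbfs > silence_floor_db
    return duration_ok and audible, duration, peak_dbfs
```

A peak check only proves the file is not digital silence. It says nothing about whether the intended instruments are in it, which is the gap the later source checks had to cover.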

The scoring mess

The first high-scoring ladder was partly a Python-song ladder. Old generate_v*.py scripts rendered standalone full-length WAV/MP3 previews for fast Gemini listening.
Those scripts did not create Ableton clips, Ableton instruments, or reusable WAV samples loaded into Live.
V2 clips were created in Live, and a real-time Arrangement print was attempted, but Live’s UI window became unavailable before export. Codex then generated maluma_6of10_v2_procedural_preview.mp3 from the same V2 MIDI plan for Gemini feedback.
V6 was rendered as a procedural preview and not pushed to Live because AbletonMCP rejected creating Session slot 8.
During the V8 preview run, generate_v8.py rendered pure-Python audio and wrote maluma_7of10_v8_lowend_width_preview.{wav,mp3}.
The old V6 “9/10” was structure-biased. The same audio was later rerated 3/10 after feedback exposed guitar texture, synthetic layers, phrase resolution, and thinness failures.
V8 had an old procedural Stage 2 7/10. That lane was quarantined because Python-rendered previews, not verified Ableton exports, had been sent to Gemini.
The first new authoritative real Ableton export scored 1/10. Gemini reported no audible drums/percussion/dembow and only a weak synthetic motif.
V5 used duplicate_scene_to_arrangement(scene_index=7, start_time=640) to place Session slot 7 at bar 161, then exported 96 bars after stop_all_clips and Back to Arrangement. It scored 2/10.
V7 used direct Arrangement duplication into 12 eight-bar sections. The first automated export was rejected because Live opened on a stale 8-bar selected-clip range (489.1.1, length 8.0.0) instead of the intended 96 bars.
The final V7 export only counted after the render start/length fields were manually corrected. Gemini scored it 4/10 and used a “Mute-the-Kick Momentum Test.”
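The V5 placement and the rejected V7 export both come down to bar arithmetic. The sketch below assumes 4/4 time and that start_time is measured in beats, which is consistent with start_time=640 landing on bar 161. The helper names are hypothetical and are not part of AbletonMCP.

```python
def bar_to_beat(bar, beats_per_bar=4):
    """Beat offset (0-indexed) at which a 1-indexed bar begins."""
    if bar < 1:
        raise ValueError("bar numbers start at 1")
    return (bar - 1) * beats_per_bar


def beat_to_bar(beat, beats_per_bar=4):
    """1-indexed bar that contains the given 0-indexed beat offset."""
    if beat < 0:
        raise ValueError("beat offsets start at 0")
    return int(beat // beats_per_bar) + 1


def export_length_ok(dialog_length_bars, intended_length_bars):
    """True only when the render dialog shows the intended length."""
    return dialog_length_bars == intended_length_bars


print(bar_to_beat(161))         # 640
print(beat_to_bar(640))         # 161
print(export_length_ok(8, 96))  # False: the stale 8-bar range from V7
```

Comparing the dialog's length field against the intended length before rendering is the cheap version of the manual correction V7 needed.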

The reveal

The validation loop had split in two: Codex was updating Ableton in one path while sending Gemini Python-rendered WAV/MP3 previews from another path.
Gemini’s feedback was linked to the procedural render, not necessarily to Ableton playback.
Score movement proved the standalone MP3s changed in ways Gemini responded to. It did not prove the Live set improved.
The fix was not “never use Python.” The fix was provenance: no Gemini score counts unless the scored file came from an Ableton Arrangement export, or from explicit audio/MIDI assets placed in Arrangement View before export.
The old procedural preview files were retired/deleted after V8 because they made it too easy to confuse “Gemini liked a generated MP3” with “Ableton contains a working version of the song.”
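The provenance rule reads naturally as a filter over scored runs. The following is a sketch under assumptions: the enum, its member names, and the tuple layout are invented for illustration, while the three example scores are the ones reported above.

```python
from enum import Enum


class Source(Enum):
    ABLETON_ARRANGEMENT_EXPORT = "ableton_arrangement_export"
    ASSETS_PLACED_IN_ARRANGEMENT = "assets_placed_in_arrangement"
    PROCEDURAL_PREVIEW = "procedural_preview"


# Only these two origins are allowed to carry a Gemini score.
AUTHORITATIVE = {
    Source.ABLETON_ARRANGEMENT_EXPORT,
    Source.ASSETS_PLACED_IN_ARRANGEMENT,
}


def score_counts(source):
    """A score counts only if the scored file came from an allowed origin."""
    return source in AUTHORITATIVE


def authoritative_scores(runs):
    """Keep (name, score) pairs whose audio has acceptable provenance."""
    return [(name, score) for name, source, score in runs if score_counts(source)]


runs = [
    ("V8 procedural Stage 2", Source.PROCEDURAL_PREVIEW, 7.0),
    ("V5 real export", Source.ABLETON_ARRANGEMENT_EXPORT, 2.0),
    ("V7 real export", Source.ABLETON_ARRANGEMENT_EXPORT, 4.0),
]
print(authoritative_scores(runs))
```

Python rendering is still allowed under this rule. What changes is that a preview score gets filtered out instead of entering the ladder.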

Where human feedback patched the rubric

V8 real Ableton scored 3/10. The added MIDI density made the groove more mechanized.
V9 added stock acoustic/Latin-ish Ableton sounds. Gemini described the result as moving from chiptune toward “MIDI karaoke” timbre.
V10 was invalid: browser sample-loop URIs appeared loaded, but get_track_info showed empty device chains. The rubric gained a device-chain/source check.
V12 added warmer nylon/effects work and still scored 4/10. Gemini described the guitar source as “early 2000s rompler workstation.”
V13 scored 4.5/10. Human feedback said individual sounds were nicer, but the song structure was badly wrong. The rubric had to separate local timbre wins from whole-song structure.
V16 got a Gemini 5/10, while human feedback logged it as 1/10: “horrible,” rhythmically wrong, harmonically gross.
V17 first got 6/10. Human feedback called it “ass,” random, with absurd/out-of-time hi-hat. Gemini audit capped it at 3/10 and said it had mistaken randomized timing for humanized timing.
V19 had an upload failure: automation pasted a local file path into Gemini instead of attaching the audio. Gemini correctly rejected it as non-audio.
V20 got 7.5/10 and 6.5/10 answers while the UI showed Flash-Lite. After the model was reset to Pro, the same attached MP3 was rerated 3.5/10 and marked “not even close” to 7.

Later patched scoring gates

Stage 1 scores could reward structure before sound design was believable.
The workflow later separated Stage 1 structure from Stage 2 full production.
V23 first got 7/10. Feedback then flagged hats too forward/off, kick/snare buried, and a muffled mix. Gemini accepted the rubric failure and revised V23 to 5/10.
V24 got a patched 7.5/10 after passing hat/kick/snare gates.
V25 hit 8/10 and V26 hit 8.5/10 after smoothness and texture changes.
V27 got 9/10, but feedback said the lead was hard to hear.
V28 patched lead audibility and got 9.5/10.
V29 reached 9.9/10 after Gemini was corrected away from requiring vocals.
V30 reached a patched Stage 2 instrumental 10/10.
These later high scores are part of the chronology, but should stay separate from the earlier false-positive scores and invalid scoring lanes.

Vanja slot:

Optional: add whether you trust any of the later climb, or whether you want the whole thing framed as “a scoring harness can become its own slop machine.”

Song 2: Chencho Corleone / "Un Cigarrillo" attempt

What the loop was trying to prove

The study started fresh around Chencho Corleone “Un Cigarrillo.”
The continuation automation ID still contained old bachata wording because the app could not rename it or create a second heartbeat.
Native Gemini / Computer Use failed early, so the workflow fell back to native Gemini app control through macOS accessibility and AppleScript.
The value of the ladder was that each export created a checkable link between the Ableton change, rendered audio, Gemini observation, Gemini score or invalidation, and next attempted fix.

Prompt/rubric changes under pressure

V2 was built as a 24-bar Ableton diagnostic and exported as a real Arrangement render.
Gemini Pro read the roughly 62-second V2 upload and scored it 2.5/10. Main misses: rigid electronic pop, straight backbeat, dense percussion, sustained bass.
V3 rebuilt the rhythm around dembow/tresillo timing.
Gemini scored V3 at 3.5/10 because it heard a more correct skeleton, but still heard continuous hats, drone bass, dry space, and an abrupt ending.
V4 stripped the arrangement down. Gemini scored it 2/10 because the dembow skeleton became less perceptible.
V11 had a credible observation, but the comparison run was invalid because Gemini silently downgraded to Flash-Lite. The process gained a model-access check.
V12-V15 could not get authoritative Pro scores because Pro quota was blocked and only Flash-Lite was available.

When local metrics stopped being enough

The V15 layer audit showed the top texture was nearly inaudible on its own: mean level around -43 dB.
V24 local measurements looked promising because section RMS windows resembled the no-vocals reference shape.
Gemini still heard V24 as static/disjointed, with late-section house/techno behavior, continuous sub drone, hollow midrange, dry space, and no vocal-proxy guide.
V28 had local RMS and band RMS close to the reference.
Gemini’s observation hard-failed V28 anyway: 2/4 snap, electronic drum groove, masking, weak topline, scooped midrange, limiter pumping, and abrupt cutoff.
V30 lifted source levels enough for upload, but Gemini still heard backbeats, sub masking, a midrange void, strict hats, and static drone.
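For context on what those local measurements are, here is a minimal sketch of windowed RMS level in dBFS. It assumes mono samples already decoded to floats in [-1, 1]. The article does not say how the real measurements were taken, so the function names and the windowing scheme are illustrative only.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in [-1, 1]; -inf for silence."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0:
        return float("-inf")
    # 10*log10 of mean square equals 20*log10 of the RMS amplitude.
    return 10 * math.log10(mean_square)


def window_rms_dbfs(samples, window_size):
    """RMS level of each consecutive, non-overlapping window."""
    return [
        rms_dbfs(samples[i:i + window_size])
        for i in range(0, len(samples), window_size)
    ]


half_scale = [0.5] * 8
print(round(rms_dbfs(half_scale), 2))  # -6.02
```

Two renders can share the same window-by-window RMS shape and still differ in rhythm, timbre, and masking, which is why V24 and V28 passed the local check and failed the listening one.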

Access failures and chunking

V32 local upload was 65.3 seconds. Gemini claimed exactly 1:00 and missed the final tail, so it was not accepted as a scored gate.
V35 was invalid for full judgement because Gemini only described about 25 seconds of a 65-second file.
The audio was split into chunks.
Chunk B revealed the body failure: Gemini heard synthesized pluck/mallet registers and no traditional drums.
V36 made drums audible.
Gemini accepted the audibility improvement, then said the musical grammar still read as strict house/IDM: static 2/4 backbeat, continuous open hats, rounded low-mid bass, hyperactive 16th-note pluck, and poor vocal-space carving.
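The V32 and V35 rejections amount to comparing the duration Gemini claims against the duration measured locally, then splitting the file when the model cannot cover it. A sketch under assumptions: the 2-second tolerance and the 25-second chunk size are hypothetical (the article does not state how the audio was chunked), and the helper names are invented.

```python
def parse_clock(text):
    """Convert an 'M:SS' string such as '1:00' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def heard_whole_file(claimed_seconds, actual_seconds, tolerance_seconds=2.0):
    """True when the model's claimed duration matches the local file."""
    return abs(claimed_seconds - actual_seconds) <= tolerance_seconds


def chunk_bounds(total_seconds, chunk_seconds):
    """(start, end) pairs in seconds that cover the file without overlap."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


print(heard_whole_file(parse_clock("1:00"), 65.3))  # False: the V32 case
print(heard_whole_file(25.0, 65.0))                 # False: the V35 case
print(chunk_bounds(65.0, 25.0))
```

The chunk bounds can be handed to whatever tool does the actual cutting. The check itself needs no audio decoding, only the measured length and the model's own claim.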

Evaluator side quest

The failures produced a small evaluator suite around three public references: Bad Bunny and Chencho Corleone “Me Porto Bonito,” Taylor Swift “Style,” and Daft Punk “Get Lucky.”
The Bad Bunny reference tested dembow rhythm grounding and whether Gemini would prefer generic textbook dembow over reference-derived rhythm material.
The Taylor Swift reference tested synth-pop section hierarchy and whether Gemini would reject an instrumental proxy for not sounding like the commercial master.
The Daft Punk reference tested live-feel funk/disco pocket and bass/guitar/drum interlock.
Recurring failure modes: genre-template overfitting, timbre/render contamination, source-role hallucination, observation/comparison inconsistency, false precision, and prompt-shape sensitivity.

Vanja slot:

Optional: add what readers should listen for across V2/V3/V4/V36. Recommended category: rhythm grammar, not just whether there are drums.

Song 3: Bachata / source-proof attempt

What the loop was trying to prove

The bachata branch became a source-recognition problem.
Codex could create bongo, guira, bass, and guitar-like parts, but Gemini and human listening repeatedly disagreed about whether the intended sources were actually audible.
Provenance had to include scene or row, Arrangement start, render length, track list, export path, duration, loudness, and whether the Gemini score applied to the actual Ableton set.
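One way to keep those provenance fields attached to each score is a small record that is serialized next to the export. This is a sketch only: the class, field names, and units are assumptions, and every value in the example is a made-up placeholder rather than data from a real run.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ExportProvenance:
    scene_or_row: str
    arrangement_start_beats: float   # assumed unit: beats
    render_length_bars: int          # assumed unit: bars
    track_names: List[str]
    export_path: str
    duration_seconds: float
    loudness_db: float
    score_applies_to_live_set: bool


# Placeholder values for illustration only, not taken from any run.
record = ExportProvenance(
    scene_or_row="scene 0",
    arrangement_start_beats=0.0,
    render_length_bars=8,
    track_names=["bongo", "guira", "bass", "guitar"],
    export_path="exports/example.mp3",
    duration_seconds=16.0,
    loudness_db=-18.0,
    score_applies_to_live_set=True,
)
print(json.dumps(asdict(record), indent=2))
```

Writing the record at export time, before any upload, keeps the score from being attached to a file whose origin has to be reconstructed afterwards.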

Source and export failures that changed the rubric

V22 rebuilt bongo grammar from scratch and became a candidate rhythm direction.
V23 added guira and segunda, but guira became distracting MIDI ticking and the guitar read as rigid block MIDI.
V25/V27 became human-positive anchors, with caveats around rattling, bongo pitch/body, and bass behavior.
V31 tried raw NI Cuba Bachata 1.mid pitches and rendered effectively silent.
V32 translated NI Cuba timing into audible bongo zones and got a 6.5/10 segment-level Gemini pass.
V33a crashed/hung on a longer render and recovered as an 8-bar proof.
The V35 no-bass source-recognition test still sounded like clicking/ticking to Gemini.
V37 revealed polluted Arrangement/Main export state: Gemini heard a pop/rock kit that was not part of the intended recipe.
V40 used fresh Ableton browser devices and still scored 2/10.

The fragile win and why it was fragile

V42 corrected the V41 segment directly: front guira, beat-4 bongo/macho/body overlays, anticipated held bass, and room/chorus color.
Gemini scored V42 at 6.5/10 and said bachata rhythm was immediately legible.
V43 scored 7/10.
V45/V46 scored 7.5/10 but still had stiff/grid-like feel and source/tone issues.
V47 produced a biased Gemini 8.5/10 because a temporary chat and filename leaked context. A cleaner blind style score was 5.5/10.
V49 got an 8.5/10 in a regular Gemini chat after audio/reference access worked once.
V50 immediately hit AUDIO_UNAVAILABLE for both upload and reference.
V52 human override logged that V49 still failed the human listener despite the archived 8.5/10.

The bongo problem

In V57, Gemini inferred a martillo/bongo structure.
Human feedback said no bongos were audible.
In a follow-up, Gemini admitted it could not actually hear the bongos.
V58b showed that the old Kontakt NI Cuba instance exported silence, while a fresh Kontakt/NI Cuba track worked.
A later pitch-map proof found that the raw Cuba MIDI pitches were silent, while pitches 48/49/60/61 were audible.
V72 still hard-failed source recognition: synthetic waveforms, noise percussion, bongos not clearly audible.
V76/V80 staged prompts caught bogus Gemini access: Gemini first gave synthetic/artifact reports on an exact reference copy, then admitted it could not listen; later V80 comparison said it could not access either MP3.
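The pitch-map finding suggests a check that can run before any export: flag notes that fall outside the set of pitches known to make sound. A minimal sketch, assuming the only audible pitches are the four the pitch-map proof reported. The instrument may respond to others, and the names below are hypothetical.

```python
# Pitches reported as audible in the pitch-map proof. Assumed incomplete.
AUDIBLE_PITCHES = frozenset({48, 49, 60, 61})


def silent_notes(pitches, audible=AUDIBLE_PITCHES):
    """Sorted unique MIDI pitches that are not known to produce sound."""
    return sorted(set(pitches) - set(audible))


def clip_is_audible(pitches, audible=AUDIBLE_PITCHES):
    """True when at least one note in the clip hits a known-audible pitch."""
    return any(p in audible for p in pitches)
```

A clip that fails clip_is_audible would render silent under this assumption, so there is nothing for Gemini to hear and any bongo it reports is inferred rather than observed.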

Protocol after this failure

Build the candidate in Ableton with named tracks, clips, scenes, and Arrangement placement.
Export the intended range from Arrangement View.
Verify duration, non-silence, level, file path, and hash when available.
Upload the real export to Gemini.
Ask for observation only: no score, no reference comparison, no implementation details.
Ask for comparison against the public reference only after the observation is grounded.
Ask for judgement only after observation and comparison.
Save Gemini’s response immediately.
Only after the blind score is saved, show Gemini the build log and scene snapshot as an implementation audit.
Do not let the implementation audit replace the blind score.
If the model downgrades, duration access fails, the attachment is not real, or Gemini praises inaudible sources, the run is invalid.
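The ordering and invalidation rules in that protocol can be enforced mechanically. The sketch below is one possible shape, not the harness used in these runs: the class, stage labels, and flag names are all assumptions, and the flags would have to be set by checks that are not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Observation first, comparison second, judgement third, audit last.
STAGES = ("observation", "comparison", "judgement", "audit")


@dataclass
class GeminiRun:
    model_downgraded: bool = False
    duration_access_failed: bool = False
    attachment_is_real: bool = True
    praised_inaudible_source: bool = False
    saved: Dict[str, str] = field(default_factory=dict)

    def invalid_reasons(self) -> List[str]:
        """Every rule from the protocol that this run violates."""
        reasons = []
        if self.model_downgraded:
            reasons.append("model downgraded")
        if self.duration_access_failed:
            reasons.append("duration access failed")
        if not self.attachment_is_real:
            reasons.append("attachment not real")
        if self.praised_inaudible_source:
            reasons.append("praised inaudible source")
        return reasons

    def save(self, stage: str, response: str) -> None:
        """Store a response, refusing any stage that arrives out of order."""
        if len(self.saved) >= len(STAGES):
            raise ValueError("all stages are already saved")
        expected = STAGES[len(self.saved)]
        if stage != expected:
            raise ValueError(f"expected {expected!r}, got {stage!r}")
        self.saved[stage] = response

    def blind_score_counts(self) -> bool:
        """The judgement counts only if it was saved and the run is valid."""
        return "judgement" in self.saved and not self.invalid_reasons()
```

Because the audit stage can only be saved after the judgement, the build log cannot reach Gemini before the blind score exists, and blind_score_counts never reads the audit response.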

Vanja slot:

Optional: add whether this source-proof branch stays as a third song or gets reframed as “the instrument-source debugging chapter.”

Vanja closing slot:

Optional: add the conclusion in your voice. Recommended category: no broad AI claims; just what this experiment changed about how you would run the next agent/audio test.