Audio Un-mixing and compression

I've been playing with "music technology" for a long time. From the BEEP command on a ZX Spectrum, through early sequencers on the Commodore Amiga ("trackers" like MED and OctaMED - which somehow magically turned the Amiga's four tracks of sampled audio into eight - and a dodgy copy of MusicX), synths and effects (software and hardware), sequencers on Windows PCs and Macs, and 'hardware' sequencers (well, one - but I spent a lot of time with the MPC 2000 and a ZIP drive) - I've spent countless hours playing around with all of them.

Sometimes, you realise that there are things that you just can't do with a home audio setup. Sometimes, you learn that things you assumed were done on expensive studio hardware were actually done by some kid in their bedroom, using exactly the same computer as you.

Sometimes, you learn that certain things are simply impossible.

For example: if you've got a mixed track - a recording of drums, vocals, guitar, bass etc. - then you can't really separate the instruments out and listen to each one in isolation.

There are a few exceptions. For example, if you've got a hard-panned stereo record (eg. lots of early 1960s recordings - like the Beatles songs that have guitars in one speaker and bass in the other), then you can just listen to one earphone to cut out certain instruments.

There's also something called the 'OOPS' or 'Out Of Phase Stereo' effect; if an instrument is exactly in the middle of a stereo recording, you can hook up a speaker to the positive terminal of each channel and - if you're lucky - the centred sound cancels itself out, letting you just hear everything else.1
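Doing the same thing digitally is a one-liner: subtract one channel from the other, and anything that is identical in both channels (i.e. panned dead centre) cancels out. A minimal sketch, assuming numpy and scipy are available and using a hypothetical 16-bit stereo file called mixed.wav:

```python
# OOPS ("Out Of Phase Stereo") in software: subtract the right channel from
# the left, so anything panned dead centre (identical in both channels)
# cancels out. Assumes a 16-bit stereo WAV; "mixed.wav" is a placeholder name.
import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("mixed.wav")           # data shape: (samples, 2)
left = data[:, 0].astype(np.int32)
right = data[:, 1].astype(np.int32)

oops = left - right                              # centred content cancels out
oops = np.clip(oops, -32768, 32767).astype(np.int16)

wavfile.write("mixed_oops.wav", rate, oops)      # mono result
```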

About a decade or so ago, there was a fad in the gaming industry for "rhythm games" - games like Guitar Hero and Rock Band would have "official" recordings you'd play along to on a plastic guitar, but if you missed a note then the guitar would cut out. To make that work, they would build a new 4-track mix from the original multitrack recordings - which would then invariably leak online. Which meant that if you wanted to listen to, say, just the drums from Love Spreads, or just the guitar - then you could.2

There are various other isolated tracks circulating online, leaked from the master recordings or bootlegs. A personal favourite of mine is Eric Clapton's isolated guitar from While My Guitar Gently Weeps3. And for years, there have been people doing really cool things with these kinds of recordings - my personal favourite is Girl Talk's "All Day" which, for me, is the absolute peak of its genre.4

But these are all special cases, where specific recordings become available. In general, once different sounds have been mixed together, you can't just "extract" the bits you want. It's like cake mix - once you've beaten in the eggs and mixed in the flour and sugar, you simply can't take any of the individual ingredients back out of the batter.

Except… that isn't true any more. I mean - I think it's still true for batter; you can't take out the eggs once you've beaten them in with the flour and butter - but for mixed audio, you can pull out specific elements and either listen to them in isolation, or listen to the rest of the music without them.

Want to make yourself a karaoke version of your favourite record to sing along to? That's now a function in Apple Music - just turn down the vocals.

Want to make yourself a backing track of your favourite record with just the vocals, drums and bass, so you can play along and pretend to be Jimi Hendrix/Eric Clapton/John Squire? Just load the song into Logic Pro and use the Stem Splitter to turn the original audio track into four separate tracks that you can mute, fade, EQ or whatever. Want to pull the bass and drums out of something where the whole band was recorded through the same microphone? Go for it…

Compression Artefacts

The one caveat is that lots of isolated tracks (especially vocals - or at least, it's most noticeable with vocal tracks) have a weird, 'ghostly' feel to them, which seemed characteristic of this AI-driven sound isolation not quite working properly.

Or so I thought. What I've come to realise is that what's actually going on is an artefact of an effect called compression - where the volume gets 'flattened' down at the peaks, and/or boosted in the quieter parts.

There are various ways that compression can happen (radio broadcasts apply it, cheap speakers impose it…), but one of the easiest to understand is "optical compression"; there's a light hooked up to the sound source that shines brighter when the source is louder, and a light sensor hooked up to the output that turns the volume down when the light is shining. Play quietly and you'll hear the sound; play a loud note and the volume gets turned down - but only while you're playing the loud notes. As soon as the note fades, the volume comes back up again.
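Stripped of the light bulbs, the mechanism is just an envelope follower driving a gain control. A rough, illustrative sketch (not modelled on any particular piece of hardware), assuming a mono numpy array of samples in the range -1 to 1:

```python
import numpy as np

def compress(x, rate, threshold=0.3, ratio=4.0, attack_ms=5.0, release_ms=100.0):
    """Very rough dynamic range compressor (illustrative only).
    x: mono float samples in [-1, 1]; rate: sample rate in Hz."""
    attack = np.exp(-1.0 / (rate * attack_ms / 1000.0))
    release = np.exp(-1.0 / (rate * release_ms / 1000.0))
    env = 0.0                      # the 'light level': a smoothed measure of loudness
    out = np.empty_like(x)
    for i, s in enumerate(x):
        level = abs(s)
        # envelope follower: reacts quickly when the level rises, slowly as it falls
        coeff = attack if level > env else release
        env = coeff * env + (1.0 - coeff) * level
        if env > threshold:
            # above the threshold, only let 1/ratio of the excess through
            gain = (threshold + (env - threshold) / ratio) / env
        else:
            gain = 1.0
        out[i] = s * gain
    return out
```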

Another way compression can be applied is "side-chaining", where the compressor takes its input from one track's volume and then applies the gain reduction to a different track. If you listen to Daft Punk's One More Time and pay attention to the horn sound, you'll hear that it has a kind of 'pumping' effect where the volume seems to come down in time with the beat. What I think is going on there is sidechain compression; the kick drum plays on the beat, and that triggers the compression on the horn, creating the 'pumping' effect (and also making the kick drum on the beat sound more prominent in the mix).
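Side-chaining is the same mechanism with the detector listening to a different input: the level is measured on the trigger track (the kick), but the gain reduction is applied to the target track (the horns). Another illustrative sketch along the same lines, with hypothetical kick and horns arrays at the same sample rate:

```python
import numpy as np

def sidechain(target, trigger, rate, threshold=0.2, ratio=8.0,
              attack_ms=2.0, release_ms=150.0):
    """Duck `target` whenever `trigger` gets loud - the classic 'pumping' effect.
    Both are mono float arrays in [-1, 1], same length and sample rate."""
    attack = np.exp(-1.0 / (rate * attack_ms / 1000.0))
    release = np.exp(-1.0 / (rate * release_ms / 1000.0))
    env = 0.0
    out = np.empty_like(target)
    for i in range(len(target)):
        level = abs(trigger[i])                     # measure the kick...
        coeff = attack if level > env else release
        env = coeff * env + (1.0 - coeff) * level
        if env > threshold:
            gain = (threshold + (env - threshold) / ratio) / env
        else:
            gain = 1.0
        out[i] = target[i] * gain                   # ...turn down the horns
    return out

# e.g. ducked_horns = sidechain(horns, kick, rate=44100)
```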

So, what sounds like an "AI artefact" in these AI-isolated tracks is - I think - actually a compression artefact that is revealed by the isolation; the track has been compressed (probably as part of compression applied to the whole mix), and that compression is something you very rarely hear without hearing the rest of the track. So - the drums cause a volume spike, the compression brings the volume of everything down a little for that spike, and what you hear in the isolated vocals is a weird "shadow" of the drum sound. Not because the AI isn't doing a good enough job, but because it's doing something that we rarely got to hear before.

(And no, no chatbots are required…)

  1. Of course, this probably isn't good for your speakers - these days, it's probably much easier to get the same effect in software.

  2. What you can hear from that pretty clearly is that there are two separate guitar tracks switching in and out of the 'guitar' track - actually, switching between the 'guitar' and 'backing' tracks.

  3. Yes, it's Eric Clapton. The whole story behind it is fascinating - probably the subject of its own blog post.

  4. To the extent that I can't really listen to other mashups without comparing them, and they never stand up. It's a bit like how The Watchmen ruined all other graphic novels for me.

More backwards Stone Roses

I recently discovered that Full Fathom Five - the B-side to the Stone Roses' first single, Elephant Stone - has two versions. One, found on reissues, is a fairly boring backwards version of the original single (albeit a slightly different mix). The other is a bit more interesting.

So, in the spirit of completion, I made a stereo comparison track…

The future of 'documentation'

In the late 1990s, right back at the start of my squiggly career, I was a 'Technical Author' - writing tests for mobile phone software. (Not writing software tests - literally, instructions for the people like me who would have to run the tests, eg. "Press Menu > 1 > 1 : You should be in a 'Compose New SMS' screen".)

I liked the idea of writing - in particular, the idea of writing something that would help people make the most of consumer technology (phones - this was pre-smartphone - PCs, software etc.). User manuals at the time were typically a joke - written in technical jargon that you could only understand if you already understood how the things worked.1 So my plan at the time was to get into that side of "technical authoring"; making complicated things simple.

After a few meetings and chats with various people in the field, the first realisation I had was that the people writing the "user" documentation were spending >90% of their time writing the technical documentation for engineers; their job was literally the opposite of communicating complex things in simple ways to non-technical people.

The other thing I realised was that the future of this kind of 'documentation' wasn't going to be printed on a piece of paper in the box that the technology came in; it was going to be on the website, where it could be constantly updated, revised etc.

Well - that wasn't entirely incorrect. In the last few weeks, I've needed to find manuals for a few things; we've moved into a new house, and I've needed to understand things like an extractor fan that had stopped spinning, a water pump that controls the heating, and some flat-pack furniture from the old house I needed to reassemble. For all three, I found what I was looking for online - and for all three, it was in the form of a PDF of the printed piece of paper that presumably originally came in the box.

But still - I think it's true as a more general trend. Or at least, it has been.

"Documentation" might not have quite made the leap from the static paper-based things to a truly dynamic, searchable, interactive version - but the vast majority of the time, the web will still get you the answers to your questions. Maybe thats a Reddit thread. Maybe its an obscure electricians' forum where someone has asked for help for exactly the same extractor fan problem that I've had - and someone else has provided it.

But the thing that occurred to me this morning, as I was playing with a local LLM and asking it for help with some technical details around how to configure it, was that there is an opportunity for these models to be "self-documenting" that seems to be being missed. Meta's Llama model seemed to struggle with some questions about configuring itself (sending me off on a weird path of writing Python scripts and editing .zshrc configuration files - before I did a Google search and realised I could do what I wanted with two lines of code in the same window that I was 'talking' to the Llama model in).

A fairly small LLM, trained on the model's own documentation, should surely be able to get you to an accurate answer much faster and more easily than the current 'best option' of Google/Reddit/Stack Overflow searches - which can just as easily get you to outdated/obsolete advice as to the "right" answer.
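Training a small model on its own documentation is one way to get there; even just retrieving the right chunk of the docs and putting it in front of the model before it answers gets you most of the way. A toy sketch - the docs folder, the chunking and the keyword-overlap 'retrieval' are all placeholders for something more sensible (a proper index, embeddings), but the shape is the point:

```python
# Toy sketch of 'self-documenting' help: pull the most relevant chunk of the
# model's own documentation into the prompt before asking it anything.
# "docs" and the keyword scoring are placeholders, not a real product.
from pathlib import Path

def load_chunks(docs_dir="docs", chunk_size=60):
    chunks = []
    for path in Path(docs_dir).glob("*.md"):
        lines = path.read_text().splitlines()
        for i in range(0, len(lines), chunk_size):
            chunks.append("\n".join(lines[i:i + chunk_size]))
    return chunks

def best_chunk(question, chunks):
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

def build_prompt(question, chunks):
    context = best_chunk(question, chunks)
    return (
        "Answer using only the documentation below. If it isn't covered, say so.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# The resulting prompt goes to whatever local model you happen to be running.
print(build_prompt("How do I change the context window size?", load_chunks()))
```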

Sure - LLMs hallucinate; but only when they are trying to provide an "answer" that they don't have enough information to provide and are forced into a 'best available information' situation - which a well-trained model with a single use case should not have a problem with.

  1. Honestly - I think this is still true, for the most part. For example, the manual for a robot hoover we recently got tells you to push a button that is *not labelled on the actual robot* - only in the manual itself, in text so small I had to get my daughter to read it for me. (OK - my eyesight isn't as good as it used to be, but this was literally text on a diagram about 1mm high.) I'm sure it made perfect sense in the version on the designer's 5K screen - but the actual version that the user had to rely on was almost useless.

"Changing" TV viewing behaviour

About five years ago, I wrote about how much TV viewing has really changed;

So - 85% of viewing is definitely not watched "as a family", because nearly half is watched alone... but that said, the point here is that I'm not seeing anything here that makes me think that there has been a significant change in viewing behaviour over the last decade. (Interestingly, we watch more TV on our own in the summer months- which is when we tend to watch less TV overall.)

This is a chart of the share of viewing time spent alone - ie. with nobody else in the room, whether or not they are actually "watching" the TV;

Bar chart showing very little change in the amount of time spent watching TV alone from 2013 to 2019.

Since then - well, things have obviously changed.

Covid and lockdowns had a massive impact on the time people spent at home vs. out of home, with the longer-term legacy of working from home becoming increasingly 'normalised' (the current trend seems to be a gradual return to the office - but at least a day or two of WFH seems to be the 'normal' for what was 100% "office-based" work before).

SVOD subscriptions have rocketed; more TV viewing is to stuff that never had a TV broadcast.

Phones have got faster, and we've gone to 5G - making watching video on other devices a virtually frictionless experience.

More people are watching new platforms like TikTok.

AI is about to change the world again - and is already flooding social media feeds with growing volumes of slop.

So, how have those changes manifested themselves? What does an updated version of the chart above look like?

I've coloured 2020 - the obvious 'change point' - black, to make it clearer where all that dramatic change in behaviour really kicked off.

Bar chart showing very little change in the amount of time spent watching TV alone from 2015 to mid-2024.

Plus ça change, plus c'est la même chose…

(Footnote 1 - this data is about viewing to television sets - whatever it is that is being viewed - based on Barb Audiences' data. I've written extensively before on what I think "television" is, and in my view, it's basically "whatever is on the TV". There's a whole other conversation about where that might have come from and which "pipe" it went down to get there - but my personal interest and focus is always on the media behaviour rather than the content.)

(Footnote 2 - In case you were wondering why the data only goes up to June, it's because I wrote this post in July as a draft, forgot to click 'publish', and just spent ages trying to find the link to a thing I knew I had written but couldn't find. Whoops.)

The role of Synthetic Respondents in 'Human-centred' Research

In the growing buzz around generative AI, a new concept in research methodologies has arisen; "synthetic respondents". Instead of asking people the questions, a Large Language Model creates 'synthetic respondents' which you can ask as many questions as you like. And they will give you answers. And they will probably sound like real people. They will never get bored. They will never try to disguise their "true" thoughts and feelings (as David Ogilvy once said, “People don’t think what they feel, don’t say what they think, and don’t do what they say.”). You can get answers from thousands of them, very quickly and at very little cost.

(Also - they never leave behind a bad smell, and won't eat all of your biscuits.)

But again - so obvious as to be barely worth mentioning - they aren't real people. They are synthetic - "made up." Just like the 'actors', pretending to be the sort of people we actually want to talk to.

They will do it faster. They will do it cheaper. Will they do it better - or at least, 'good enough'? Well... that's the real question.

AI and "Apple Intelligence"

(Title based on a Threads post from Daniel Jalkut.)

What Is Apple Doing in AI? Summaries, Cloud and On-Device LLMs, OpenAI Deal - Bloomberg

But the company knows consumers will demand such a feature, and so it’s teaming up with OpenAI to add the startup’s technology to iOS 18, the next version of the iPhone’s software. The companies are preparing a major announcement of their partnership at WWDC, with Sam Altman-led OpenAI now racing to ensure it has the capacity to support the influx of users later this year.

Nah - I'm not buying it. For one - OpenAI and Microsoft are "true" partners; Microsoft's Azure compute + OpenAI's models + Microsoft's apps/OS 1 are getting deeply intertwined. An Apple + OpenAI partnership seems like a strategy to be permanently one step behind Microsoft.

But it seems inevitable that there's big Apple + AI news coming. Siri needs a significant upgrade. The new iPad Pro announcement made a big deal about having "AI compute" power2. An "AI features" announcement at WWDC 2024 seems like the safest bet in tech.

So, what might be coming?

  • Siri was a pretty early bet on the future - but possibly too early, given the advancements in machine learning/deep learning since Siri was first released. But "Siri" is more of a brand than a technology - there are "Siri" features that don't seem to have anything to do with the voice assistant.
  • Meanwhile, while Siri might seem to be stuck in a rut, Apple's own proprietary ML/AI technology has been coming along. Apple Music has had a 'karaoke' feature where you can turn down the vocals - I had wondered whether that was an AI-powered thing or just a case of them getting separate stems; the latest Logic Pro, which lets you split vocals, drums, bass and 'everything else', seems to suggest that it's AI rather than a 'special access to the masters' situation. Given Apple's insistence on owning any technology that they rely on, this seems like the most likely approach. Whether Siri as a brand is dead or not… we'll see. (Worth noting that Siri came out before the film 'Her' - while OpenAI's latest release seems clearly... 'inspired' by the film, Siri looks more like the inspiration for the film.)
  • That said - it seems that Apple needs to catch up a lot, and fast. So a partnership with a technology leader does make sense. So... with who?
    • Meta? Meta's AI push has been for 'open' models (as opposed to - ironically - OpenAI's proprietary approach) that anyone can use. Apple's strength is in using on-device computing power (because they have a fair amount of it - it also fits with their 'privacy first' approach.) Maybe Apple would be licensing these models - but I suspect that the Apple/Meta relationship isn't strong enough to make it likely that Apple would want to put their future in Mark Zuckerberg's hands.
    • Google's AI journey has been... interesting. They seem to be leaders in terms of the underlying tech - but struggling to actually execute; OpenAI/ChatGPT/Copilots seem to be conclusively winning the PR/branding fight - perhaps they need a partner who can make a better product (away from the issues of competing with Search as a business model) that can better reflect the power of the underlying tech than Google's efforts to date. The Transformer architecture that led to the LLM explosion came out of Google, as did the TensorFlow software framework. While everyone else is fighting over nVidia's GPU chips, Google have been making their own TPUs for nearly a decade. Google is arguably the only GPU-rich business at the moment, and there is a general vibe of the AI industry being Google vs. Everyone Else - having a big partner might be exactly what they need right now. Maybe most importantly, Google pays Apple a lot of money for search engine prominence. The question is how much of the old Android/iOS friction still exists on Apple's side, while Google's opportunity is to make an AI-powered "Google Android" a distinct product from... let's call it the "Samsung Android" ecosystem - which doesn't make as much sense if Apple gets benefits that cheaper Android phones don't.
    • Microsoft? If Apple could get a more... let's say "grown-up" business partner without the volatility of OpenAI, then that could be interesting... And if the 'everyone vs Google' view of the market is right, then it seems to be in Microsoft's interests too. But it seems unlikely that Microsoft would position 'Copilot+ PCs' so clearly as going head-to-head with MacBooks if that was in the pipeline.
    • There are other AI companies (eg. Anthropic) - but I'm not sure how many of them are scalable enough (read: access to the compute resources that would be needed to potentially switch on for every iPhone/iPad/Mac owner in the world overnight.)

If I had to make a bet, my money would be on a Google partnership, with something like the Gemma model running locally on iPhone/iPads etc. as 'Siri 2.0' and access to Gemini for the kind of tasks that need access to 'full fat' LLMs and more computing power.

  1. Also - GitHub Copilot

  2. Yes, iPads/iPhones/Macs have had 'neural cores' for a few years - but the new iPad seems to be stepping this up significantly, with no news yet on what it's actually going to power. Worth noting - if you're developing AI/ML/LLM-type software on a Mac, you're using the GPU, not the NPU. So far, the NPU seems to be locked away for Apple's use (which includes Apple's APIs if you're building apps for the App Store - but not if you're running something like TensorFlow in Python.)
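For what it's worth, this is easy to check. A quick sketch, assuming the tensorflow-macos and tensorflow-metal packages are installed, that lists the devices TensorFlow can actually reach - the GPU shows up, the Neural Engine doesn't:

```python
# List the accelerators TensorFlow can see on an Apple Silicon Mac.
# With tensorflow-metal installed, the Apple GPU appears as a 'GPU' device;
# the Neural Engine (NPU) never appears - it's only reachable through
# Apple's own frameworks (Core ML etc.), not through Python ML libraries.
import tensorflow as tf

print("CPUs:", tf.config.list_physical_devices("CPU"))
print("GPUs:", tf.config.list_physical_devices("GPU"))
```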

The Device is the Boring Bit

The Apple Vision Pro is now on sale. People are getting their hands on them, and sharing their opinions. People who haven't got their hands on them are sharing their opinions. There are a lot of opinions flying around.

First thing - sure, I'm interested in the headset, and the device actually getting into 'normal' people's hands (or onto their faces) is this week's news; but I'm not going to buy one, because it's ridiculously expensive, and if I had that sort of money to throw around, I probably wouldn't be driving a car that's approaching either its 18th birthday or its last trip to the scrapyard, and has done the equivalent mileage of five times around the circumference of the earth.


But what I'm really interested in is the Vision platform: the bits in the software that are going to be the same when the next headset device is launched. And, once there are a bunch of different 'Vision' devices, where they will fit in the spaces in people's lives.

Who owns Taylor Swift's voice?

Ben Evans on Threads;

It's a lot easier to understand the IP issues in 'give me this song but in Taylor Swift's voice' than 'make me a song in the style of the top ten hits of the last decade.' If a human did that, they wouldn't necessarily have to pay anyone, so why would an LLM?

There's an interesting twist with the "Taylor Swift's voice" example; Scooter Braun owns all of Taylor Swift's recordings (at least, I think, all the ones released before any ChatGPT-era training datasets were compiled) - he bought the record company, so he owns the master recordings (and all the copies of the master recordings, and the rights relating to them) - but not the songs themselves. Taylor Swift still owns those - which is why she can make her "Taylor's Version" re-recordings (which Scooter Braun doesn't get a penny out of).

So there's a key difference here; a human would copy the songs (that is, they would be working off the version of the songs that are in their heads - the idea of the songs), so Swift would get paid as the owner of the songs.

But the kind of generative AI we're talking about would be copying 100% from the recordings (ie. the training data would be the sounds, digitised and converted into a stream of numbers) - which Swift doesn't own. The AI doesn't "see" the idea of the songs - it wouldn’t “know” what the lyrics were, what key the songs were in, what chords were being played on what instrument - any more than a Large Language Model “knows” what the words in its (tokenised) training dataset or output mean.

She still owns her songs, but she’s sold her voice.

(Pre) WWDC 2023

WWDC usually isn’t one to look forward to - unless you’re the sort of person who cares about things like Xcode features - because it isn’t the venue where they talk about the new iPhones. Maybe there will be clues about new iPhone features in some new APIs or something, but the focus is generally on developers.

This year is different…