Text editing at the speed of thought

keizo · April 3, 2026

Voice as a UI/UX pattern can finally be good.

A couple years ago I built a feature in Grug Notes where you press and hold, ask for changes to a note, release, and almost instantly the change happens.

In the background, we send the audio to the server, the server sends it to Groq for a Whisper v3 transcription, we shoot that back to the client for display and also hand it to a fast LLM to make the edit, apply the changes on the server, then send an event for the client to sync. It's fast enough, under two seconds, and it puts a smile on my face every time. But man, that's a lot going on.
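The orchestration above can be sketched with each network hop stubbed as a plain function, just to make the number of moving parts visible. All names here (`transcribe_with_groq`, `edit_with_llm`, and the stub bodies) are illustrative, not real API calls:

```python
"""Sketch of the multi-hop edit pipeline. Every hop is stubbed so the
wiring is testable; the real versions are network calls."""

from dataclasses import dataclass


@dataclass
class EditResult:
    transcript: str   # shown to the client as soon as it comes back
    new_text: str     # the edited note, applied on the server


def transcribe_with_groq(audio: bytes) -> str:
    # Real version: client -> server -> Groq Whisper v3. Stubbed here.
    return "make the second sentence shorter"


def edit_with_llm(note: str, instruction: str) -> str:
    # Real version: a fast LLM rewrites the note per the instruction.
    return note.replace("sentence that rambles on", "sentence")


def run_edit_pipeline(note: str, audio: bytes) -> EditResult:
    transcript = transcribe_with_groq(audio)    # hops 1-2: audio up, text back
    new_text = edit_with_llm(note, transcript)  # hop 3: server -> LLM
    # hops 4-5: server applies the change, then emits a sync event
    return EditResult(transcript=transcript, new_text=new_text)
```

Even with fast providers at every hop, the latency budget is the sum of five round trips, which is why staying under two seconds takes effort.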

So when Google announced Gemini Flash 3.1 Live a few days ago, I wanted to check two things: can we flatten the stack to one call, and can it do the same thing but multi-turn?

The answer is yes, and it's kind of amazing. Google has ephemeral tokens that let us bypass the server almost entirely. So it boils down to one WebSocket, i.e. the client talking straight to the Google Live LLM. My mind is blown.
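The flattened flow looks roughly like this: the app server's only remaining job is minting a short-lived token, and everything else rides one socket. This is a stubbed sketch, not the real Google SDK surface; `mint_ephemeral_token`, `LiveSession`, and the TTL are all illustrative assumptions:

```python
"""Sketch of the flattened flow: server mints an ephemeral token,
the client then talks to the Live model over a single WebSocket."""

import secrets

TOKEN_TTL_SECONDS = 60  # assumption: ephemeral tokens are short-lived by design


def mint_ephemeral_token() -> dict:
    # Real version: the server calls Google's token endpoint, so the
    # long-lived API key never ships to the browser.
    return {"token": "eph-" + secrets.token_hex(8),
            "ttl": TOKEN_TTL_SECONDS}


class LiveSession:
    """Stand-in for the client's single WebSocket to the Live model."""

    def __init__(self, token: str):
        self.token = token
        self.turns: list[str] = []

    def send_audio_turn(self, audio: bytes) -> str:
        # Real version: stream audio frames over the socket; the model
        # transcribes, edits, and replies in one round trip.
        self.turns.append("<audio>")
        return "edited note text"


token = mint_ephemeral_token()
session = LiveSession(token["token"])  # one socket, no app server in the loop
result = session.send_audio_turn(b"press-and-hold audio")
```

Because the session is a single multi-turn connection, follow-up instructions ("actually, undo that last change") reuse the same socket instead of restarting the whole pipeline.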

No, it's not perfect, but it feels like a glimpse into the future. It feels like a new way to interact with a computer, and in my case, with text. A new modality, not writing or typing. It's massaging text through what you say: dictation when that's appropriate, editing when that's appropriate.

If you use AI for code, you might think this is old news: we've been molding code from pure intent for six months! But I think my excitement here comes from the fact that this is not that. This returns the thinking to me and you. It's precise.

For simple things like a routine email, this is what I want. For note editing, it will be great. Same for documentation and forms. And even for fine-grained editing on something I really care about, there might be a use for it.

It's the closest thing I've experienced to text editing at the speed of thought.

It's live as an experimental feature in Grug Notes.


Credit where credit is due: the first time I'd seen something like this was Aqua Voice, a couple of years ago. I never used it myself, but I presume it wasn't quite fluid or natural enough. This feels different, and I suspect it's going to become a commodity, especially for people like me who might not sit at a computer all day. This will also let me accurately interact with a computer while on my feet (I build canoes for a living).

  • As of this writing, Gemini Flash Live is 9 days old. And OpenAI shipped updates to their gpt realtime today.