Building an Interactive Generative AI for Music in TypeScript – Part 1

Have you ever wanted to listen to some music but couldn’t quite find what you wanted? Or have you ever wanted to make some music, but you aren’t a trained musician and the commercial music-making programs seem overwhelming?

We’re going to build an interactive music generator that attempts to solve those problems!

What is an Interactive Generative AI?

Generative AI is a special kind of automated decision model. An Automated Decision Model is a formula or algorithm that you can plug inputs into to get a decision. Automated decision models are used everywhere. Your thermostat is a simple example. If it is set to AC mode, and the temperature in the house rises above a certain value, it decides to turn on. The recommendation system in your favorite streaming platform is an automated decision model. Many financial asset trading companies use automated decision models to decide what stocks, bonds, and futures to buy and sell.

The key decision that a generative AI makes is “What value do I produce next?” The generative AI we are going to build is sequential and works with a fixed-size alphabet – a set of symbols. When we say it is sequential, we mean that it generates values in a sequence, so the values that it has emitted so far are important. A fixed-size alphabet means we can’t just generate ‘anything’ – we have to choose one of the symbols in the alphabet.
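To make the idea concrete, here is a minimal sketch of a fixed alphabet and a sequential generator in TypeScript. The names and the trivial uniform-random model are illustrative placeholders, not the final API we will build:

```typescript
// A symbol from our alphabet, e.g. "C4" or "G4".
type Note = string;

// A small fixed-size alphabet of notes.
const alphabet: Note[] = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"];

// A sequential generator: given the notes emitted so far,
// decide which symbol from the alphabet comes next.
type NextSymbol = (history: Note[]) => Note;

// The simplest possible model: ignore history, pick uniformly at random.
const uniformModel: NextSymbol = (_history) =>
  alphabet[Math.floor(Math.random() * alphabet.length)];

// Generate a sequence of n notes by repeatedly asking the model.
function generate(model: NextSymbol, n: number): Note[] {
  const out: Note[] = [];
  for (let i = 0; i < n; i++) {
    out.push(model(out));
  }
  return out;
}
```

Every note the generator emits comes from the alphabet; a smarter model only changes how `NextSymbol` uses the history, not this overall shape.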

We are building a music generator for a piano, so the alphabet is the set of notes on the piano. We are also going to start small. Instead of using the entire 88 keys from a standard grand piano, we’re going to restrict our generator to a smaller set of notes, somewhere around 20 notes. Why? It is easier to decide between 2 options than 200. Many generative decision models scale in complexity with the size of the alphabet. For example, n-grams are one kind of decision model for generating from an alphabet. If you are looking at the last 3 elements to decide which to emit next, there are alphabet_size ^ 3 possible combinations. With 88 keys, that is 681,472 possible combinations; with 20 keys, that is 8,000 – roughly two orders of magnitude smaller.
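The scaling math above is easy to verify. A small helper (the function name is ours, just for illustration) computes the number of distinct n-gram contexts:

```typescript
// Number of distinct contexts an n-gram model must track:
// one per possible combination of the last n symbols.
function ngramContexts(alphabetSize: number, n: number): number {
  return Math.pow(alphabetSize, n);
}

console.log(ngramContexts(88, 3)); // 681472 — the full piano
console.log(ngramContexts(20, 3)); // 8000 — our restricted keyboard
```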

Online vs Offline Decision Models

An offline decision model can be run or trained against a large set of data. For example, if you had all of Beethoven’s Piano Sonatas, you could train the decision model on that data, and it would (in theory) produce music that sounded like Beethoven wrote it.
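As an illustrative sketch of what offline training looks like (assuming pieces are represented as arrays of note names – a representation we have not settled on yet), a bigram model could count every note-to-note transition across the whole corpus in one pass, then be frozen:

```typescript
// Offline training: count bigram transitions over an entire corpus up front.
// Returns a nested map: counts.get(from).get(to) = number of times `to`
// followed `from` anywhere in the corpus.
function trainBigramCounts(
  corpus: string[][]
): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, number>>();
  for (const piece of corpus) {
    for (let i = 0; i + 1 < piece.length; i++) {
      const from = piece[i];
      const to = piece[i + 1];
      const row = counts.get(from) ?? new Map<string, number>();
      row.set(to, (row.get(to) ?? 0) + 1);
      counts.set(from, row);
    }
  }
  return counts;
}
```

The key property is that all the data is available at once: training happens in a batch, and the resulting model does not change afterwards.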

An online decision model is updated one input at a time. When you like or dislike a video or song on one of your media streaming apps, you are using an online decision model (at least in part; collaborative filtering and similar offline approaches are typically incorporated as well). An online decision model is a good fit for an interactive training approach. We want to tailor the decision model to the user’s preferences, so we present the user with something, ask for their feedback, and update the decision model to reflect it.

We want to build an interactive experience where the user can tell the generator what they like and dislike, so that it can start generating music specifically tailored to their tastes. An online decision model is the natural fit.
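To show what “updated one input at a time” means in code, here is a hedged sketch: a model that keeps a weight per note and nudges it on every piece of feedback. The class, field names, and learning rate are placeholders; the real model we build will look different, but the one-feedback-at-a-time update is the point:

```typescript
// A single piece of user feedback about one note.
type Feedback = { note: string; liked: boolean };

// An online model: no batch training step, just incremental updates.
class OnlineModel {
  private weights = new Map<string, number>();

  // Update immediately on one piece of feedback.
  update(fb: Feedback, learningRate = 0.1): void {
    const w = this.weights.get(fb.note) ?? 0;
    this.weights.set(fb.note, w + (fb.liked ? learningRate : -learningRate));
  }

  // Current preference weight for a note (0 if never seen).
  weightOf(note: string): number {
    return this.weights.get(note) ?? 0;
  }
}
```

Each call to `update` is cheap and independent, which is exactly what keeps the interactive loop responsive.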

Why TypeScript?

TypeScript isn’t the first language you would associate with building AI systems. So why are we using it here?

We want to deploy an interactive system easily to users. A web app is a natural choice. TypeScript is great for building web apps with interactive UIs.

An online model is not computationally intensive. It processes one item of feedback from the user at a time. It needs to be fast enough that the perceived latency to the user is within a normal human-computer interaction range. We don’t need a high performance language to meet those requirements.

What about scientific computing libraries for various mathematical operations? We are building a fairly simple decision model, so we won’t need much from those libraries.

And perhaps most importantly – we want to rapidly deploy a full system to our end users. If we hit a wall in a particular area of the system, we can pull that out into a separate component in a different language as needed. But until we get to that point, we can deliver faster by sticking with one language that is well suited to our deployment environment, the web browser.

System Design Overview

This system will consist of several components:

  • A UI to see the generated music and like or dislike it.
  • An audio player to play the music for the user.
  • The decision model that decides what to generate.
  • A persistence solution to save trained decision models for later.
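One way to sketch these components as TypeScript interfaces, before we commit to any implementation (all names here are placeholders, not the final API):

```typescript
// A symbol from our alphabet, e.g. "C4".
type Note = string;

// The decision model: generates the next note and learns from feedback.
interface DecisionModel {
  next(history: Note[]): Note;
  update(note: Note, liked: boolean): void;
}

// Plays a sequence of notes for the user.
interface AudioPlayer {
  play(notes: Note[]): Promise<void>;
}

// Saves and restores trained models for later sessions.
interface ModelStore {
  save(id: string, model: DecisionModel): Promise<void>;
  load(id: string): Promise<DecisionModel | null>;
}
```

Keeping these boundaries explicit is what lets us start with minimal in-browser implementations and swap in better ones later.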

We are going to build it in a user-experience-first manner. We start by building the user-facing components first, with minimal backing services. Over time, we fill in and improve those services. We do this so that we can get a fully working system in front of the user as fast as possible. This lets us get feedback quickly and establishes a platform that allows us to continually deploy updates.

Scaffolding the Project

The repo we will be using is cow-music-ts.

To begin with, we will build our app entirely in the browser. As we progress, we may decide we want to persist our models in a way that requires a server component. So we are going to lay out our project structure to make that easy to add later if we need it. That would look something like this…

/cow-music-ts
  /client
  /server

…so we will create our app in the client subfolder of our project to keep it isolated.

We are going to set up our project with Vite. Vite is a build tool (and more) for JavaScript-based projects. We navigate into the client directory and run the following command…

client> npm create vite@latest

…and we choose React and TypeScript. Then we follow the instructions to install the dependencies and run the app to make sure everything is set up properly.

Next – Building a Piano Roll to Visualize Playback

In Part 2, we will build a Piano Roll so we can visualize the sequences of notes generated by our model. We will also start building out our types to represent notes and keyboards.