A VQGAN is a neural network architecture that generates new images from a batch of existing images. Its outputs can then be sequenced into a video based on what it understands of the characteristics of the originals. I forced the neural network to produce a video from exactly one image - the bare minimum of input data - in order to expose the fundamental characteristics of the algorithm. The image I used was a simple circle. The video it generated was a noisy, wobbly circle. I repeated the process with a total of 8 circles of different sizes.
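The actual piece uses a real VQGAN, but the intuition - a lossy, quantized reconstruction fed back into itself drifts into a noisy, wobbly version of the input - can be sketched with a toy stand-in. Here a hand-made 8-level codebook plays the role of VQGAN's learned codebook; all names (`make_circle`, `toy_vq_reconstruct`, `generate_video`) are hypothetical and illustrative only.

```python
import numpy as np

def make_circle(size=64, radius=20):
    """Binary image of a filled circle, the minimal input used in the piece."""
    yy, xx = np.mgrid[:size, :size]
    return ((xx - size / 2) ** 2 + (yy - size / 2) ** 2 <= radius ** 2).astype(float)

def toy_vq_reconstruct(frame, codebook, noise=0.05, rng=None):
    """One lossy pass: perturb the frame, then snap each pixel to its
    nearest codebook value (a crude stand-in for vector quantization)."""
    rng = rng or np.random.default_rng(0)
    noisy = frame + rng.normal(0, noise, frame.shape)
    idx = np.abs(noisy[..., None] - codebook).argmin(axis=-1)
    return codebook[idx]

def generate_video(image, n_frames=30):
    """Feed each reconstruction back in; the accumulated quantization error
    makes the circle's edge jitter from frame to frame."""
    codebook = np.linspace(0, 1, 8)  # toy stand-in for a learned codebook
    rng = np.random.default_rng(42)
    frames, frame = [], image
    for _ in range(n_frames):
        frame = toy_vq_reconstruct(frame, codebook, rng=rng)
        frames.append(frame)
    return np.stack(frames)
```

This is not how VQGAN actually works internally (it encodes to a latent grid of codebook indices and decodes with a learned network), but it reproduces the qualitative behaviour described above: a static circle in, a wobbly circle out.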
Then, I created a program that compares every frame of the generated video with the original image and calculates a single value: the difference in pixel values. I assigned one of the notes C, D, E, F, G, A, B, C to each of the 8 circles, and added an oscillator to the program that generates these notes but distorts them depending on the difference value. The program then randomly selects two of these circles at a time to perform, one on the left channel of audio and one on the right, together generating a conversation of wobbly circles.
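The two steps above - a per-frame difference value, and an oscillator distorted by it - could be sketched as follows. This is a minimal sketch, not the piece's actual code: it assumes the difference metric is mean absolute pixel difference and the distortion is a frequency-modulation wobble, and the names `NOTE_FREQS`, `frame_difference`, and `distorted_tone` are all hypothetical.

```python
import numpy as np

# Frequencies (Hz) for the eight notes C4..C5, one per circle.
NOTE_FREQS = {"C": 261.63, "D": 293.66, "E": 329.63, "F": 349.23,
              "G": 392.00, "A": 440.00, "B": 493.88, "C'": 523.25}

def frame_difference(generated, original):
    """Collapse two frames (0-255 pixel arrays) into a single difference value:
    the mean absolute per-pixel difference (assumed metric)."""
    return float(np.mean(np.abs(generated.astype(float) - original.astype(float))))

def distorted_tone(freq, diff, duration=0.5, sample_rate=44100, depth=0.02):
    """Sine oscillator whose pitch wobbles in proportion to the frame
    difference: diff = 0 gives a pure tone, larger diff a stronger warble."""
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    wobble = depth * diff * np.sin(2 * np.pi * 5 * t)  # 5 Hz vibrato, scaled by diff
    return np.sin(2 * np.pi * freq * t + wobble)
```

Two such tone streams, one per randomly chosen circle, would then be written to the left and right audio channels to stage the "conversation".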