Story2Hallucination: converting stories to deep learning GAN hallucinated animations
Using OpenAI CLIP to generate dynamically evolving images as the text of a story is fed in
Late last year, I tried working on a method to use the text from my dynamically updating short-story website, a.ttent.io/n, to generate hallucinatory animations. Back then I was working off of earlier text-to-image methods and wasn't getting great results. But earlier this month OpenAI released a new model called CLIP, which dramatically improved the text-to-image generation results I've seen.
CLIP itself was built to match images with text rather than to generate them, and guiding standalone image generation seems to be what most people are doing with it. However, working off a CLIP-based package called Big Sleep, I found a pretty cool way to turn paragraphs into quite interesting animations that seem to have at least some connection to the text. As an example, here is an excerpt from the beginning of Franz Kafka's Metamorphosis being used to generate the visuals.
A warning: there is a decent amount of blinking and flickering in the videos that follow.
Clearly there are segments where the visualization breaks down, but surprisingly there are a couple of spots that actually map pretty well to the underlying text they are trying to describe. At the very least, the images clearly capture Kafka's surreal tone. (Dare I call it Kafkaesque?)
For the next trick, I wanted to try some more positive visual language, so I went with a familiar William Wordsworth poem. In the next clip, I show two runs side by side to illustrate how the images randomly decay into noise and need to be periodically reset.
Again, there are places where the algorithm starts spitting out total noise, but overall, the parts that do connect smoothly are vivid and very pertinent. I want to dig into the details over the next few weeks and figure out where the algorithm goes off the rails, but for now I just built in a few safeguards to "reset" it when things get too messy or it sticks on the same shape for too long. You can visit my YouTube channel to see more experiments.
a.ttent.io/n
But the real application was always meant to be my own short stories. To that end, I manually fine-tuned the parameters to see if I could get it to produce more consistent video. I used the first paragraph of a.ttent.io/n, and after many rounds of trial and error, it started to pump out artistic renditions of the real features mentioned in the text. Take a look.
Clearly it does well with any mention of an octopus, even a fictional "harp-toothed" one.
But it also renders a wall, a glass narcotics cabinet, and bottles pretty well.
"Distribution center" rendered an interesting overhead map view, it seemed.
My personal favorite, though, was the rendering of the dock.
Try it yourself
All the code needed for this is in a Google Colab notebook available here, and it's also available on GitHub if you'd like to help me make improvements. My notebook is just a modification of this one from Phil Wang that lays out how to use Big Sleep with CLIP weights.
There's a lot of hacky code that modifies parts of Big Sleep from the outside, but the only important modification is the change to dynamically update the text periodically with these lines:
model.text = all_text_list[epoch].translate(str.maketrans('', '', string.punctuation))
model.encoded_text = tokenize(model.text).cuda()
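Here, all_text_list is just the story broken into short chunks of text, one per epoch. Exactly how you split the story is up to you; as a rough sketch (build_text_list is a made-up helper, and the window sizes are arbitrary), a sliding window over the words keeps consecutive prompts overlapping so the image morphs gradually instead of jumping:

def build_text_list(story, window=8, step=2):
    # Slide a fixed-size window over the words, advancing a few
    # words per epoch so consecutive prompts share most of their
    # content and the generated image drifts rather than jumps.
    words = story.split()
    return [' '.join(words[i:i + window])
            for i in range(0, max(len(words) - window, 1), step)]

all_text_list = build_text_list(
    "One morning, when Gregor Samsa woke from troubled dreams, "
    "he found himself transformed in his bed into a horrible vermin.")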
Pretty much all of the remaining code modifications are there to get the image to robustly "reset" if it stays stuck looking too similar for too long. This became an issue because the image sometimes converges to a very stable form and stops morphing. That's not a big deal when you are generating one-off images, because you can always try again, but when you are trying to dynamically morph from one text prompt to the next, getting stuck means the whole video is lost.
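The gist of the reset logic is just a similarity check between consecutive frames. This is a simplified sketch rather than the notebook's exact code: the threshold and patience values are arbitrary, and model.reset() stands in for whatever latent re-initialization your copy of Big Sleep exposes.

import torch
import torch.nn.functional as F

SIM_THRESHOLD = 0.995  # frames more similar than this count as "stuck"
PATIENCE = 50          # consecutive stuck steps before forcing a reset

stuck_steps = 0
prev_frame = None

def maybe_reset(model, frame):
    # frame: the latest generated image as a tensor.
    # Re-randomizes the latents if the image has stopped morphing.
    global stuck_steps, prev_frame
    if prev_frame is not None:
        # Cosine similarity between flattened frames is a cheap
        # "has the image stopped changing?" signal.
        sim = F.cosine_similarity(frame.flatten(), prev_frame.flatten(), dim=0)
        stuck_steps = stuck_steps + 1 if sim > SIM_THRESHOLD else 0
    prev_frame = frame.detach().clone()
    if stuck_steps >= PATIENCE:
        model.reset()  # stand-in for re-randomizing the latent vector
        stuck_steps = 0
        return True
    return False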
You can play around with your own texts and let me know what you come up with. There are quite a few parameters to optimize to get good results.
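For reference, most of the knobs live on Big Sleep's Imagine class. Something like the following is where I'd start experimenting; the prompt and values here are illustrative rather than tuned recommendations, and argument names may differ slightly across big-sleep versions:

from big_sleep import Imagine

dream = Imagine(
    text = 'a harp-toothed octopus in a glass narcotics cabinet',
    lr = 0.07,             # higher learning rates morph faster but get noisier
    save_every = 25,       # write out a frame every N steps for the video
    save_progress = True,  # keep every saved frame instead of overwriting
)
dream()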
I'd love to see what turns up. If you reference this blog post on Twitter or Mastodon, I'll append your work in the comments section.
And have fun hallucinating!