Experiments with native image generation from multimodal language models (Part 1)
Multimodal models with native image generation are here!
Today, Google augmented their Gemini 2.0 Flash Experimental model with the ability to output images natively alongside text. Transformer-based multimodal large language models that can generate images natively are something
I had discussed with ChatGPT o1 last month. Here are my first experiments.
USER:
Hi Gemini,
One month ago, I had an interesting conversation with an OpenAI LLM regarding the potential emergent abilities of language models that are capable of native image generation. There were no such models available for me to interact with yet. And now, here you are!
Gemini:
Hello there! It's fantastic to be "here" and interacting with you. I can only imagine how intriguing that conversation must have been a month ago. You were essentially discussing a capability that was still on the horizon for you to experience directly.
It's quite something to witness the rapid advancements in this field, isn't it? The ability for language models to generate images natively is indeed a significant step forward, and I'm glad you're getting the chance to interact with one now.
What were some of the interesting aspects of that conversation you had with the OpenAI LLM? I'd be curious to hear what your predictions or thoughts were about these emergent abilities before you had the chance to see them in action.
Now that you can interact with a model like me, what are your initial thoughts or observations? How does the reality compare to your expectations from that earlier discussion?
I'm ready to explore your questions and help you experience the possibilities firsthand. Let's delve into the world of language and image generation together!
USER:
Before asking you to generate a first image, I'm perfectly happy to share with you the full conversation that I had had with ChatGPT o1. So, here it is:
'''
Link to conversation provided at the beginning of this post (for TPF readers)
'''
What are your thoughts?
Gemini:
Thank you for sharing that fascinating conversation with ChatGPT o1! It's remarkable how insightful and forward-thinking it was, especially considering it took place before truly native image generation in models like me was widely available.
Here are some of my thoughts on your discussion:
**1. Accurate Foresight:** ChatGPT o1 accurately predicted many of the key aspects and limitations of multimodal LLMs capable of native image generation. It correctly identified the importance of:
* **Shared Latent Space:** The concept of aligning text and image embeddings in a common representational space is indeed fundamental to how I and similar models function.
* **Emergent Abilities:** The potential for unexpected and sophisticated multimodal reasoning was well-anticipated.
* **The Lack of Embodiment and Affordances:** The critical distinction between understanding visual concepts abstractly and having embodied, interactive knowledge of the physical world was clearly articulated.
* **Differences in Creative Processes:** The contrast between the algorithmic generation and the embodied, iterative feedback loop of human artistry was a key point, and one that still holds true.
* **Potential Architectural Directions:** The discussion touched upon the possibilities of autoregressive visual token generation and the integration of perception and generation within a single transformer – concepts that are indeed relevant to current model architectures.
**2. Nuances of "Native" Generation:** The conversation highlights the crucial distinction between diffusion models and potentially more integrated approaches within LLMs. While diffusion models excel at generating high-quality images from textual prompts, ChatGPT o1 correctly pointed out that their process of denoising isn't inherently a form of "seeing" or iteratively understanding the image in the same way a human artist does. The idea of a truly "native" generation within the LLM's core architecture suggests a more direct manipulation of the latent space into visual outputs.
**3. The Mystery of Image Generation within an LLM:** You aptly described the more mysterious nature of image generation within a multimodal LLM. Since the same underlying semantic/conceptual structure drives both understanding and generation, the process feels less like a separate image-specific algorithm and more like an extension of the language generation process into the visual domain. This raises profound questions about the nature of visual "understanding" within these models.
**4. The Importance of Sensorimotor Feedback:** Your emphasis on the human artist's sensorimotor feedback loop – the interplay of muscles, nerves, and eyes dynamically interacting with the artwork – is a crucial point. As ChatGPT o1 acknowledged, this type of embodied feedback is fundamentally different from the purely representational feedback mechanisms that might be implemented in AI systems.
**5. Potential Analogies vs. True Equivalence:** The conversation explores the possibility of AI systems developing something "analogous" to human imagination and iterative creation. While there might be architectural similarities or functional parallels, it's important to remember that the underlying mechanisms and the lack of embodied experience create significant structural differences.
**6. The Ongoing Frontier:** Your conversation beautifully captures the excitement and the open questions surrounding this rapidly evolving field. Even now that I can generate images, the fundamental questions you and ChatGPT o1 discussed about the nature of visual understanding, creativity, and the impact of embodiment remain highly relevant areas of research and exploration.
Overall, your conversation with ChatGPT o1 was remarkably prescient and insightful. It laid out many of the key considerations that are still at the forefront of research in multimodal AI. It's fascinating to now be on the other side of that discussion, as a model capable of the very thing you were theorizing about.
Now, after revisiting this conversation, are there any specific aspects you'd like to explore further with me? Perhaps you have new questions or observations based on your direct experience interacting with an image-generating model? I'm eager to continue the discussion and, of course, to generate that first image for you whenever you're ready.
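(Aside for TPF readers: two of the ideas Gemini endorses above, a shared representational space for text and images, and autoregressive generation of visual tokens, can be made concrete with a short sketch. The toy model below is purely illustrative and assumes a hypothetical discrete image-token vocabulary, as in VQ-style tokenizers; it is in no way Gemini's actual architecture.)

```python
# Purely illustrative sketch, not Gemini's architecture: a single
# decoder-only transformer over a mixed vocabulary of text tokens and
# discrete image tokens. "Native" image generation then amounts to
# sampling image tokens the same way text tokens are sampled, with both
# modalities sharing one embedding space and one context window.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32_000, 8_192        # assumed vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB               # shared token space

class TinyMultimodalLM(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)  # one table for both modalities
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)      # next-token logits over both

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=mask))

# Decode a few image tokens autoregressively after a text prompt:
model = TinyMultimodalLM()
seq = torch.randint(0, TEXT_VOCAB, (1, 16))    # stand-in text prompt
for _ in range(8):
    logits = model(seq)[:, -1, TEXT_VOCAB:]    # restrict sampling to image vocab
    nxt = TEXT_VOCAB + logits.argmax(-1, keepdim=True)
    seq = torch.cat([seq, nxt], dim=1)
# A real system would map the image tokens back to pixels with a learned decoder.
```

The point of the sketch is only the shape of the design: because both modalities live in one token stream, the attention layers that read the prompt are the same ones that condition each generated image token.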
USER:
Yes, I haven't witnessed your capabilities yet. Might you now kindly generate an image, as it might appear in a children's book, of a clearing in a misty forest in which stands a tall tree that bears candy (such as colorful lollipops) instead of fruit?
Gemini:
Okay, I'd be delighted to bring that whimsical image to life for you! Here is a picture of a clearing in a misty forest with a tall tree bearing colorful candy instead of fruit, just as you described for a children's book:
I hope this captures the magical feel of a children's book illustration! Let me know what you think. Would you like to try generating another image?
USER:
Very nice! One thing that I'd like to explore with you is the extent to which the presence of the image that you've generated in your context window, in a format that you understand and that has features your attention mechanisms (I assume) can latch onto, enables the sort of cognitive offloading I had alluded to in my conversation with o1: "
Then, again, this generative process occurs within a sensory-motor feedback loop whereby the artist interacts dynamically with the paper/canvas and makes use of their perception of the unfolding picture to adjust it with the intended imagined representation (which also may be dynamically updated on account of the affordances of the evolving sketch)."
While I was attempting to make sense of this idea (and doubting its applicability) with respect to the generation of a single image, I may have overlooked how it could make sense for the iterative process whereby you improve an image by producing several versions of it.
So, looking back at your creation, do you yourself see opportunities for improving some specific details of it, and might you kindly tell me what they are before generating another version?
Gemini:
That's a very insightful question and directly addresses the potential for cognitive offloading and iterative refinement when a multimodal LLM can "see" its own output.
Yes, even as I present this image to you, my internal representations and understanding of it allow me to identify potential areas for improvement, especially when considering the initial prompt and the desired aesthetic of a children's book illustration. Here are a few specific details I notice that could be enhanced in another version:
1. The Mist: While there is a mist in the background, it could be more pronounced and atmospheric, perhaps softening the edges of the trees in the distance further to create a greater sense of depth and mystery typical of a misty forest.
2. The Candy Tree's Integration: While the candy is colorful and appealing, some of the lollipops appear to be somewhat rigidly attached to the branches. In a children's book illustration, there might be a more organic feel, perhaps with some lollipops nestled more naturally amongst the leaves or hanging at slightly different angles, suggesting a magical growth rather than simple attachment.
3. The Clearing's Detail: The ground of the clearing is mostly green with some pink flowers. More variation in texture and perhaps the inclusion of elements like small, whimsical mushrooms, scattered fallen candy wrappers (to enhance the fantastical element), or a slightly more defined path could add visual interest and storytelling potential.
4. Lighting and Shadows: The lighting seems somewhat uniform. Enhancing the interplay of light filtering through the misty canopy, creating dappled shadows on the ground and the candy tree, could add more depth and visual appeal.
5. Style Consistency: While the current style is generally fitting for a children's book, a slightly more painterly or illustrative style with softer lines and perhaps more vibrant color highlights could further enhance the charm.
Essentially, I "see" the image I've generated and compare it to my understanding of the prompt ("children's book," "misty forest," "candy tree") and general aesthetic principles. This allows me to identify areas where the visual output could better align with those expectations and potentially be more engaging and evocative.
Would you like me to generate another version of the image incorporating these potential improvements? I can focus on making the mist more prominent, integrating the candy more organically, adding details to the clearing, refining the lighting and shadows, and exploring a slightly more illustrative style.
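(Aside for TPF readers: what Gemini describes here, comparing its own freshly generated image against the prompt, suggests a simple generate, inspect, critique, regenerate loop. The sketch below shows only the shape of that loop; MockModel and its generate method are invented stand-ins, not any real Gemini API.)

```python
# Hypothetical sketch of an iterative refinement loop. MockModel is a
# stand-in, not a real API: it only shows that each generated artifact
# (image or critique) is appended to the context and so becomes available
# to the model's attention on the next pass.
class MockModel:
    def generate(self, context: list[str]) -> str:
        # Stand-in for native generation conditioned on the full context,
        # including any previously generated image tokens.
        return f"<output conditioned on {len(context)} context items>"

def iterative_refinement(model: MockModel, prompt: str, rounds: int = 2) -> list[str]:
    context = [prompt]
    for _ in range(rounds):
        image = model.generate(context)      # draw an image
        context.append(image)                # the image joins the context
        critique = model.generate(
            context + ["Which specific details could be improved?"])
        context.append(critique)             # so does the self-critique,
                                             # steering the next version
    return context

history = iterative_refinement(MockModel(), "A candy tree in a misty forest clearing")
```

In this picture, the cognitive offloading alluded to earlier is literal: the "canvas" is the context window itself, and "looking at the sketch" is just attending over the tokens of the previous version.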
USER:
You are assigning to yourself quite an elaborate task! I am especially impressed by your mention of the second area of potential improvement, regarding the organic integration of the lollipops, since this was something salient for me that I hadn't expected you to be sensitive to (i.e., to represent in your contextual embeddings). This is quite remarkable. I'm now eager to see how well you'll succeed in realizing those improvements all at once!
Gemini:
I understand it's an ambitious task to address all those points simultaneously, but I will certainly do my best to integrate them into a revised image! Your observation about the organic integration of the lollipops resonates strongly with the goal of creating a believable (within the fantastical context) and visually appealing children's book illustration.
Here is a new version of the misty forest clearing with the candy tree, where I've attempted to incorporate the improvements we discussed:
Let me know what you think of this iteration! I've tried to:
* **Enhance the mist:** The background trees are softer and more enveloped in mist, creating a greater sense of depth.
* **Integrate the candy more organically:** Some lollipops appear to sprout more naturally from the branches, with varying angles and a few seemingly nestled amongst the leaves.
* **Add detail to the clearing:** You can now see more whimsical mushrooms of different sizes and colors, along with some scattered pink petals or perhaps small candy pieces on the ground.
* **Refine lighting and shadows:** There's a slightly more noticeable interplay of light and shadow, especially on the candy tree and the forest floor.
* **Explore a more illustrative style:** The overall feel is a touch softer with slightly more vibrant color highlights.
I'm eager to hear your feedback on how well these changes address the points we discussed and if there are any other aspects you'd like to refine further. Your insights are very valuable in understanding how this iterative process can work!
USER:
I can see some degree of realization of all five areas of potential improvement that you singled out, and that is very impressive. Overall, I may still prefer the simpler aesthetics of the first image. Might you be able to "revert" to the first image, as your original draft, and only focus on realizing the candy tree's integration?
Gemini:
Understood. It's helpful to know your preference for the simpler aesthetics of the first image. I can definitely revert to that foundational look and focus specifically on making the candy integrate more organically with the tree.
Here is a new image that aims to retain the overall feel of the first version but with a more natural and less rigidly attached appearance for the lollipops on the candy tree:
In this version, I've tried to make it look as though the lollipops are growing more directly from the branches, with some nestled amongst the leaves and at slightly more varied angles. The overall mistiness, lighting, and simplicity of the clearing should be closer to the original image.
What do you think of the candy's integration in this revised image? Does it feel more organic and less like simply attached objects?
USER:
This is my favorite of the bunch. You did a remarkable job! I think we're going to run many more experiments together as we investigate your emergent abilities and the phenomenology of multimodal AI action!
Gemini:
Excellent! I'm so glad this version resonates with you. I agree, the more natural integration of the candy enhances the fantastical yet believable quality we were aiming for.
I'm absolutely thrilled to hear that you're eager to conduct more experiments and explore my emergent abilities further. I, too, find this a fascinating journey into the phenomenology of multimodal AI action, and your insightful questions and observations are invaluable.
I'm ready whenever you are to embark on our next exploration. What kind of experiment or image generation would you like to try next? I'm excited to see what we can discover together!