Tools are meant to solve problems, while the essence of creation is to pose questions.
When I first started using Keling’s newly released “Multi-Image Reference,” one question stayed in my mind: what problem does it actually solve?
What Can Multi-Image Reference Do
In the interface, Keling describes it this way:
You can upload 1-4 reference images, select the characters, objects, props, and scenes in them, and describe their changes or interactions in words. The model will creatively generate a video for you based on all the referenced content.
In simple terms, it lets you fuse different elements directly into a single animated scene. In practice, the effect looks like this:
The model automatically matches and extrapolates from the characteristics of each “subject.” How well such combinations come together varies from model to model. In this set of tests, Keling did quite well at interpreting Spider-Man’s actions and environment (of course, don’t expect success on the first try; that is true of every model). Prompt wording remains a major influencing factor here, and subjects fall roughly into several categories: characters, objects, scenes, and props.
A maximum of 4 images can be fused. For example, in this shot, we want Spider-Man to appear at the airport, with Pikachu and Batman in the background:
After opening 3-4 “blind boxes” (re-rolls), we got a fairly ideal result. Thanks to Keling 1.6’s strong performance, a usable clip does not take many attempts, but the wait is still long: generating a batch of animations took about 5-10 minutes during testing. I have already given Keling’s team feedback on generation speed, and improvements should be coming. Honestly, though, weighed against speed and cost, generation quality remains my top priority, and it is the core experience creators demand.
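To make the workflow concrete, here is a minimal sketch of how such a job might be described in code. It is purely illustrative: the field names, file names, and structure are my own assumptions, not Keling’s actual API.

```python
import json

# Purely illustrative: the field names and structure below are my own
# assumptions, not Keling's actual API. The point is the shape of a
# multi-image reference job: up to 4 reference images, each tagged with
# a subject role, plus a prompt describing how the subjects interact.
job = {
    "reference_images": [
        {"file": "spiderman.png", "subject": "character"},
        {"file": "airport_hall.png", "subject": "scene"},
        {"file": "pikachu.png", "subject": "character"},
        {"file": "batman.png", "subject": "character"},
    ],
    "prompt": (
        "Spider-Man walks through the airport hall while "
        "Pikachu and Batman stand in the background."
    ),
}

print(json.dumps(job, indent=2))
```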
Having discussed the general functionality, let’s return to the original question: what can it be used for?
Rapid Expression of Ideas
If any one term was mentioned most often in AI creation circles in 2024, “consistency” is surely a candidate. So the first thought this feature prompts is: can it solve the consistency problem, that is, can it keep the same “actors” across shots?
First of all, Keling does perform well on consistency, preserving the basic characteristics of characters and scenes. But if you expect it to reach film-industry standards, I think that is currently very difficult, and not actually necessary. Faces can be managed, but when it comes to costumes and props, no AI video model today (not just Keling) can “directly output” industrial-grade consistency across continuous shots.
For now, this demand does not seem to be what a one-stop platform is positioned to serve in the short term. Local workflows, such as LoRA plus image-to-video pipelines, remain the better choice, and compared with last year their overall cost has dropped significantly.
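For readers unfamiliar with that local route, a minimal sketch using Hugging Face diffusers might look like the following: render a consistent character still with a trained LoRA, then animate it with an image-to-video model. The two model IDs are real public checkpoints; the LoRA path and prompt are placeholders.

```python
import torch
from diffusers import AutoPipelineForText2Image, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

# Step 1: render a character still with a LoRA trained on that character.
# The LoRA is what keeps the face and costume consistent across stills.
t2i = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
t2i.load_lora_weights("path/to/character_lora.safetensors")  # placeholder path
still = t2i(prompt="my character walking through an airport hall").images[0]

# Step 2: animate the still with an image-to-video model.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
frames = i2v(still.resize((1024, 576)), num_frames=25, decode_chunk_size=8).frames[0]
export_to_video(frames, "shot.mp4", fps=7)
```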
If what you want to create is media for online distribution, Keling’s current multi-image reference is already well suited: its degree of consistency is quite sufficient. This is especially true for creators who do not want to spend great effort on details and hope to produce a personal expression quickly; describe a few simple elements, and you can generate a simple little work. Is the value of small creative videos lower than that of industrial film works? I completely disagree; this is a question of media culture.
Online content often owes its reach to the content itself rather than to the absolute level of production quality. The various derivative videos on Bilibili are a typical example, a trait of the meme culture that began in the late 20th century. From personal blogs onward, evolving through YouTube culture, mobile devices, self-media, and short video, the most striking feature to emerge is “roughness” (not a derogatory term). This content returns to the essentials and does not especially demand industrial-level polish. AI animation technology will undoubtedly strengthen this trend.
Do we want to become a film production company, or content creators? The two are not quite the same thing.
This is a big topic, touching on the concept of independent narrative proposed by Runway’s CEO Cristobal Valenzuela; we will come back to it later.
For example, say I want to express the contrast of a kitten interacting with Spider-Man. A traditional text-to-video process would produce two scenes that differ significantly, making it hard to create a sense of series, while an image-to-video workflow is fairly cumbersome. Multi-image reference is very useful here.
Speed and efficiency are crucial to creative inspiration. An idea comes to mind, but after two days of working on it and getting tangled in details, it easily cools off and the enthusiasm fades. From this perspective, the value is significant.
So does multi-image reference lose its value for demanding, film-level creation? I don’t think so; it depends on how we use it.
Consistent Shot Narration
Look at the examples directly: these are two continuous shots of a race between buildings. The two shots differ greatly in framing and shot language, yet the continuity and “sense of series” are excellent. Achieving this in an image-to-video workflow is not easy.
In an image-to-video pipeline, the generated stills lack relevance to one another. Even with reference techniques like Midjourney’s sref, producing a series of images across different scenes is time-consuming. Multi-image reference offers a new approach: use prompt wording to differentiate the keywords of each shot while maintaining strong relevance between them.
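As a sketch of that idea (the file names and prompt wording here are hypothetical, not from the actual project), each shot reuses the same fixed reference set and varies only the shot language:

```python
# Hypothetical sketch: one fixed reference set shared by every shot,
# with only the shot-language half of each prompt changing. The shared
# references are what carry the "sense of series" across the cuts.
references = ["car.png", "city_buildings.png"]  # placeholder file names

shots = [
    "wide aerial shot: the car races between the buildings",
    "low-angle tracking shot: the same car weaves through a narrow street",
]

for prompt in shots:
    job = {"reference_images": references, "prompt": prompt}
    print(job)  # in practice, each job would be submitted for generation
```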
Of course, if you are determined to stick with a pure image-to-video process, you can achieve similar or even better results. But comparing quality while ignoring time cost is misleading in any creative work; what a tool really offers is a balance between quality and time cost.
This application suits quick, coherent expression in works with relatively low detail requirements; in other words, it explores a middle ground between image-to-video and text-to-video. Good results require a certain generalization capability from the model, and Keling 1.6’s quality supports this way of working.
Value of the Process: Treating AI as a Creative Assistant
Sometimes the value of a generated result lies not in direct use, but in the inspiration it sparks or the prototype it offers for the next step.
We associated Spider-Man with the elevated bridge in Chongqing, and this time the AI offered its own combination: Spider-Man perched on a pillar of the elevated bridge, overlooking the scenery below.
Here, multi-image reference plays a role similar to a photography consultant, offering a combined perspective and idea. You may not adopt it directly, but it certainly helps in conceiving shot composition. Many friends already know this practice well from using Midjourney as an inspiration generator; when the output becomes video, though, the inspiration gains a dynamic dimension, which helps creation even more.
I have always believed that the value of AI tools lies not necessarily in completing something for us, but in helping us complete it. This “process value” can sometimes be very precious.
The same goes for this set of sports-car shots. The overall movement was, in effect, self-directed by the AI. At the same time, we quickly obtained more angles and material for the characters and subjects while keeping the scene consistent. This material can readily become the prototype and starting point for the next image-to-video pass.
Tools Solve Problems, Creation Poses Questions
I do not agree with crudely labeling tools as absolutely useful or useless; such statements are merely emotional and not very valuable. The dissatisfaction everyone expressed after Sora’s release is completely understandable, but if you experience it carefully rather than criticizing based on a few samples, you might arrive at a different conclusion.
As a tool, Keling’s multi-image reference is highly extensible, especially since Keling 1.6’s robust model quality gives that extensibility plenty of imaginative room. In fact, the uses and value of many tools are discovered by creators themselves: the tool provides a paradigm and a capability, an ability to solve problems, but which problems to solve is what truly emerges from the creators on the platform. That, too, is the value of creation.
Tools solve problems; creation poses questions.
Keling AI: https://klingai.kuaishou.com/
About Me
I am Hanqing, founder of AI.TALK, an AI creator who began studying art at age 6, and a product manager with 16 years in the internet industry. Here I share my thoughts on AI technology and media.
My vision is to find new ways to combine technology and media art. If you are also interested in this topic, feel free to follow my public account and video works.
- Business Cooperation: aitalkgina
- Channel Video Account: AI.TALK
- Personal Video Account: HanqingHQ