
The latest introduction to multimodal LLMs: the data and papers are packaged and ready to take away.


A comprehensive overview of multimodal large language models: the first collection of papers tracking MLLM progress has been published.

Awesome-MLLM: github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

In recent years, research on Large Language Models (LLMs) such as GPT-3, LLaMA, ChatGPT, and GPT-4 has made significant progress, and these models demonstrate excellent performance on a wide range of natural language processing (NLP) tasks.

Through pre-training on massive amounts of data, LLMs acquire rich knowledge and strong reasoning abilities. Given only a user's instructions, these models can parse the instructions, reason, and produce answers that match the user's expectations.

Typical capabilities of LLMs include:

·Perform new tasks not seen during training;

·Complete a new task with only a small number of examples;

·Perform complex reasoning tasks through reasoning chains;

·Coordinate models and tools to complete complex tasks.

There are many key ideas and techniques behind these abilities, including Instruction Tuning, In-Context Learning, and Chain of Thought.

Multimodal large language models

Although large language models have made great progress in NLP, corresponding models and techniques are less explored in the multimodal domain, and traditional vision-language models still suffer from limitations such as insufficient generalization and a lack of reasoning ability.

For this reason, many scholars have recently turned their attention to an emerging direction: multimodal large language models (MLLM).

The main idea is to use an LLM as the "brain" that integrates, reasons over, analyzes, and makes decisions about the multimodal inputs, so as to complete the tasks assigned by humans.

From the perspective of developing artificial general intelligence, MLLMs are a step forward from LLMs and have the following advantages:

·More in line with human cognition. Humans have multiple senses and receive information in multiple modalities, which are often complementary and synergistic; using multimodal information therefore generally helps understand and complete complex tasks better;

·A more powerful and user-friendly interface. With multimodal input supported, users can convey information in more flexible ways;

·Broader task support. LLMs can usually perform only NLP-related tasks, while MLLMs can handle many more tasks by taking in additional modalities.

From a system-design perspective, MLLMs can be divided into two categories:

·Cognitive reasoning systems that support multimodal inputs, with the LLM serving as the reasoning engine;

·Multi-tool collaborative systems, with the LLM serving as the planner/scheduler/decision-maker.

The former usually converts multimodal information, through a trainable multimodal interface, into a form that the LLM can receive and process directly, so that the LLM can perceive and reason over this multimodal information together with the user's instructions.
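As a concrete illustration of such a trainable interface, here is a minimal PyTorch-style sketch of projecting frozen vision-encoder features into the LLM's token embedding space. The class name, dimensions, and shapes are hypothetical assumptions for illustration, not taken from any particular MLLM.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder features into the LLM's token embedding space.

    A minimal sketch of a trainable multimodal interface; the dimensions
    (1024 -> 4096) are illustrative assumptions.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)  # (batch, num_patches, llm_dim)

# Toy usage: project patch features and prepend them to text token embeddings,
# so the LLM can attend to visual "tokens" alongside the instruction text.
projector = VisualProjector()
vision_features = torch.randn(1, 256, 1024)   # stand-in for vision-encoder output
text_embeddings = torch.randn(1, 32, 4096)    # stand-in for embedded instruction tokens

visual_tokens = projector(vision_features)
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 288, 4096])
```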

The latter usually uses the LLM as a planner/scheduler/decision-maker [1] to decompose complex tasks assigned by users into simpler subtasks, distribute them to appropriate models/tools, and finally integrate the results into the output.

We take a different perspective, focusing on the key techniques and implementation methods behind MLLMs; we survey and summarize related work and divide MLLMs into the following categories:

Multimodal Instruction Tuning

Multimodal In-Context Learning

Multimodal Chain-of-Thought

LLM-Aided Visual Reasoning

We will briefly describe these types of work below.

Multimodal Instruction Tuning

The basic approach to multimodal instruction tuning is to use a unified template to cast various types of data into a common format and describe the task requirements as instructions, forming multimodal instruction data, which is then used to fine-tune the MLLM.

Because the instruction format is consistent between training and testing, the LLM can generalize to other tasks more flexibly by virtue of its strong semantic understanding and reasoning abilities, yielding strong zero-shot capability.

The basic form of multimodal instruction data can be summarized as an (instruction, multimodal input, answer) triplet.

An intuitive way to obtain such data is to adapt existing benchmark datasets. Take image captioning as an example, as shown in Figure 1 below:

Figure 1. Example of multimodal instruction data

The original captioning sample consists of an image and a text description (the ground truth); this image-GT pairing naturally constitutes the multimodal input and answer parts of the instruction data.

The instruction part is the description of the corresponding task, which is generally written manually or generated by calling GPT.

During multimodal instruction fine-tuning, the MLLM converts the multimodal input into a form the LLM can consume and feeds it in; the LLM then predicts the answer based on the multimodal information and the instruction text.
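To make the conversion concrete, here is a minimal sketch of turning a captioning sample into an (instruction, multimodal input, answer) triplet and rendering it with a prompt template. The field names, image path, instruction text, and template are hypothetical.

```python
# A minimal sketch of building multimodal instruction data from a captioning
# sample; all names and paths below are invented for illustration.

caption_sample = {
    "image": "coco/000000123456.jpg",                    # multimodal input
    "caption": "A dog chases a frisbee on the beach.",   # ground-truth description
}

# (instruction, multimodal input, answer) triplet
instruction_sample = {
    "instruction": "Describe the image briefly.",        # hand-written or GPT-generated
    "image": caption_sample["image"],
    "answer": caption_sample["caption"],
}

# During fine-tuning the triplet is rendered into a prompt; "<image>" marks the
# position where the converted visual tokens are inserted by the interface.
PROMPT_TEMPLATE = "<image>\nInstruction: {instruction}\nAnswer: {answer}"

print(PROMPT_TEMPLATE.format(instruction=instruction_sample["instruction"],
                             answer=instruction_sample["answer"]))
```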

Multimodal In-Context Learning

The core idea of multimodal in-context learning is learning by analogy. For example, when studying we often encounter materials in the following form:

By studying worked examples, we can transfer the underlying ideas and methods by analogy when we encounter new problems, and thereby solve them.

In addition, examples can standardize the answer format, which makes it easier to obtain correct answers in the expected form.

As shown in Figure 2 below, the model is asked to predict the result of 3 × 7 from the given examples.
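A minimal sketch of how such a few-shot multimodal prompt might be assembled is given below; the interleaved "<image:...>" placeholder format, file paths, and question wording are assumptions for illustration.

```python
from typing import Optional

# Hypothetical exemplar images showing simpler multiplications, plus the query image.
exemplars = [
    {"image": "examples/2x3.png", "answer": "2 x 3 = 6"},
    {"image": "examples/4x5.png", "answer": "4 x 5 = 20"},
]
query_image = "examples/3x7.png"

def render(image: str, answer: Optional[str] = None) -> str:
    # "<image:...>" marks where the visual tokens of each picture would be inserted.
    block = f"<image:{image}>\nQ: What is the result of the multiplication shown in the picture?\n"
    return block + (f"A: {answer}\n\n" if answer is not None else "A:")

# The few-shot prompt interleaves exemplar (image, answer) pairs with the query.
few_shot_prompt = "".join(render(e["image"], e["answer"]) for e in exemplars) + render(query_image)
print(few_shot_prompt)
```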

Figure 2. Example of multimodal in-context data: examples are used to make the model predict 3 × 7

Multimodal Chain-of-Thought

A chain of thought is a series of intermediate reasoning steps [2]. The basic idea of the multimodal chain of thought is to make the model learn to output the intermediate steps one by one and finally reason out the answer, as shown in Figure 3 below:

Figure 3. Example of multimodal chain-of-thought data

Compared with outputting the answer directly, the chain of thought is:

·More consistent with human reasoning habits: based on the previous reasoning steps and their results, it gradually leads to the final answer;

·Better suited to complex reasoning tasks: solving the problem step by step improves the accuracy of the answer.
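As an illustration, here is a minimal sketch contrasting a multimodal chain-of-thought prompt (an exemplar with intermediate steps, followed by the query) with a direct-answer prompt; the image paths, questions, and exemplar rationale are invented for illustration.

```python
# Chain-of-thought prompt: a worked exemplar with intermediate steps, then the query.
cot_prompt = (
    "<image:examples/shelf.png>\n"
    "Q: How many more books than cups are on the shelf?\n"
    "A: There are 5 books on the shelf. There are 2 cups. 5 - 2 = 3. The answer is 3.\n\n"
    "<image:examples/fruit_bowl.png>\n"
    "Q: How many more apples than oranges are in the bowl?\n"
    "A:"
)

# Direct-answer prompt: the model must jump straight to the final number.
direct_prompt = (
    "<image:examples/fruit_bowl.png>\n"
    "Q: How many more apples than oranges are in the bowl?\n"
    "A:"
)

print(cot_prompt)
print(direct_prompt)
```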

LLM-Aided Visual Reasoning

LLM-aided visual reasoning uses an LLM as the decision-making and reasoning mechanism, invoking various multimodal models and tools and integrating their outputs to arrive at the final answer. Depending on how the task is completed, it can be divided into single-round and multi-round models.

The basic idea of the single-round model is that the LLM acts as a planner, scheduler, and decision-maker that coordinates various models/tools to complete the task. Generally, it needs to fulfill the following functions [1] (a toy sketch follows the list below):

·Planner: decompose the complex task into solvable subtasks;

·Scheduler: assign the subtasks to appropriate models/tools;

·Decision maker: manage the order in which subtasks are executed and integrate the subtask results to obtain the final answer.
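To make the three roles concrete, here is a toy single-round sketch in the spirit of [1]; `call_llm` and the tool registry are stand-ins rather than real model or API calls, and the prompt formats are invented for illustration.

```python
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call; here it returns canned plans/answers."""
    if "Decompose" in prompt:
        return "1. caption_image | 2. detect_objects"
    return "The image shows a dog playing with a frisbee; two objects were detected."

# Toy tool registry mapping tool names to callables (stand-ins for real vision models).
TOOLS: Dict[str, Callable[[str], str]] = {
    "caption_image": lambda image: "a dog chasing a frisbee on a beach",
    "detect_objects": lambda image: "dog, frisbee",
}

def solve(task: str, image: str) -> str:
    # Planner: decompose the task into subtasks (tool names).
    plan = call_llm(f"Decompose the task into tool calls: {task}")
    subtasks: List[str] = [step.split(".")[1].strip() for step in plan.split("|")]

    # Scheduler: dispatch each subtask to the matching tool.
    results = {name: TOOLS[name](image) for name in subtasks if name in TOOLS}

    # Decision maker: integrate the tool outputs into a final answer.
    return call_llm(f"Task: {task}\nTool outputs: {results}\nFinal answer:")

print(solve("Describe what is happening in the picture.", "examples/beach.png"))
```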

The multi-round model is based on the idea of iteration: it accumulates visual knowledge until it is confident enough to give the final answer. In this process, the LLM needs to integrate the previous steps (the questions asked and the visual information obtained) to decide whether the final answer can be output [3].
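Similarly, here is a toy sketch of the multi-round variant in the spirit of [3], in which the system keeps gathering visual evidence until a confidence check passes; every function below is a stand-in invented for illustration.

```python
from typing import List

def ask_subquestion(task: str, evidence: List[str]) -> str:
    """Stand-in: a real system would let the LLM propose the next sub-question."""
    return f"sub-question #{len(evidence) + 1} about: {task}"

def answer_with_vision_model(question: str, image: str) -> str:
    """Stand-in for a VQA/captioning model answering the sub-question."""
    return f"evidence for '{question}'"

def confident(task: str, evidence: List[str]) -> bool:
    """Stand-in: a real LLM would judge whether the evidence suffices."""
    return len(evidence) >= 3

def solve_iteratively(task: str, image: str, max_rounds: int = 5) -> str:
    evidence: List[str] = []
    for _ in range(max_rounds):
        if confident(task, evidence):
            break
        question = ask_subquestion(task, evidence)
        evidence.append(answer_with_vision_model(question, image))
    # Integrate the accumulated visual evidence into the final answer.
    return f"Final answer to '{task}' based on {len(evidence)} pieces of evidence."

print(solve_iteratively("Why is the man laughing?", "examples/scene.png"))
```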

See github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

References:

[1] Shen, Yongliang, et al. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face." arXiv preprint arXiv:2303.17580 (2023).

[2] Wei, Jason, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv preprint arXiv:2201.11903 (2022).

[3] You, Haoxuan, et al. "IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models." arXiv preprint arXiv:2305.14985 (2023).

This article comes from the WeChat Official Account: Xinzhiyuan (ID: AI_era).
