Yupan Huang12 Zaiqiao Meng23, Fangyu Liu2, Yixuan Su2, Nigel Collier2, Yutong Lu1 ¹

Sun Yat-sen University ²University of Cambridge ³University of Glasgow

{huangyp28@mail2,luyutong@mail}.sysu.edu.cn {zm324,fl399,ys484,nhc30}@cam.ac.uk

Aug 31, 2023

Abstract

Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. To support the training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model’s conversational competence across multiple images and dialogue turns. Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns. Specifically, SparklesChat outperformed MiniGPT-4 on established vision-and-language benchmarks, including the BISON binary image selection task and the NLVR2 visual reasoning task. Moreover, SparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding MiniGPT-4’s score of 3.91 and nearing GPT-4’s score of 9.26. Qualitative evaluations further demon- strate SparklesChat’s generality in handling real-world applications. All resources will be available at https://github.com/HYPJUDY/Sparkles.

Figure 1: The architecture of SparklesChat. SparklesChat integrates multiple images at the word level within the dialogue, facilitating a fine-grained and human-like multimodal interaction.

Figure 2: Comparison between our SparklesChat (left) and MiniGPT-4 [61] (right) on an example from SparklesEval. We adapt MiniGPT-4 to accept multiple images as input. SparklesChat shows conversational competence in open dialogues for image understanding and reasoning, maintaining cross-image and cross-turn coherence, and generating relevant and complete responses. In contrast, MiniGPT-4 faces challenges in these aspects, leading to difficulty following user instructions across various images and dialogue turns.

1 Introduction

Large language models (LLMs) have shown remarkable progress in zero-shot performance across a variety of tasks when fine-tuned using instruction-following data [37, 35, 47, 9, 51, 50]. In the multimodal domain, multimodal instruction-following models such as MiniGPT-4 extend these capbilities by integrating pretrained vision encoders with instruction-following LLMs using projection layers [61]. MiniGPT-4 adapts the projection layer to align vision and language domains by training on concatenated embeddings of images and their descriptions. The training occurs in two stages: first, on a large-scale collection of image-text pairs and then on a smaller dataset of detailed, human-like image descriptions [61]. With this training method, MiniGPT-4 learns alignments between individual images and sentences and performs single-image understanding and reasoning. However, models such as MiniGPT-4 struggle to capture interactions between diverse images and text. This capability is crucial for user-assistant conversations, where users often refer to multiple images with text snip- pets to convey their instructions in detail. As shown in Figure 2, MiniGPT-4 mixes up the content of multiple images, fails to establish coherence between images, and consequently falls short in following user instructions during open dialogues.

One key limitation hindering progress in this area is the lack of specialized datasets designed for multimodal dialogues that involve multiple images and fine-grained, word-level text interactions. Existing models such as Flamingo can adapt to various image understanding tasks when prompted with a few relevant examples due to their training on image-text interleaved web data [2]. However, these models often fall short in following intricate human instructions because they are trained to predict the next word on a large web dataset rather than perform the task the user wants [37].

To address these gaps, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. Unlike previous approaches such as MiniGPT-4 that takes the concatenation of a single image with sentence-level text as input (e.g., “Can you describe this image as detailed as possible?” – where denotes a single image), SparklesChat, as shown in Figure 1, integrates multiple images at the word level (e.g., “Can you link the celebration occurring in IMAGE#2331159and the dirt bike race in IMAGE#2330601?”). This innovation enables fine-grained integration of images and text, mimicking natural human communication more closely.

To support the training of SparklesChat, we introduce SparklesDialogue, the first machine-generated dialogue dataset designed for word-level interleaved multi-image and text interactions. We use OpenAI’s GPT-4 [35] to simulate user-assistant conversations with visual capabilities by leveraging detailed image descriptions. Our dataset achieves greater robustness and diversity by incorporating two subsets, namely SparklesDialogueCC and SparklesDialogueVG, constructed from different image and description sources.

Furthermore, we introduce SparklesEval, a GPT-assisted benchmark to quantitatively evaluate a model’s conversational competence in multimodal, open-ended dialogues across multiple images and dialogue turns. SparklesEval features a comprehensive and interpretable scoring system based on three distinct criteria: Image Understanding and Reasoning, Cross-Image and Cross-Turn Coherence, and Relevance and Completeness of Responses.

For quantitative evaluation, we validate the effectiveness of SparklesChat through extensive experiments. We conduct zero-shot evaluations on two standard vision-and-language tasks, including binary image selection on the BISON dataset [16] and visual reasoning on the NLVR2 dataset [45]. On the BISON dataset, SparklesChat achieved an accuracy of 56.7%, surpassing MiniGPT-4’s 46.0%. On the NLVR2 dataset, SparklesChat reached an accuracy of 58.0%, outperforming MiniGPT-4’s 51.3%. In our SparklesEval benchmark, SparklesChat scores 8.56 out of 10, significantly exceeds MiniGPT-4’s score of 3.91, and closely approaches GPT-4’s score of 9.26. Qualitative evaluations further demonstrate SparklesChat’s applicability in real-world scenarios. All resources related to this study will be made publicly available.

2 Related works

2.1 Image-text alignment

Various datasets have been proposed to facilitate image-text alignment by associating images with corresponding descriptions. These datasets include but are not limited to SBU Captions [36], MSCOCO [31], YFCC100M [46], Visual Genome [25], Conceptual Captions [42], Conceptual 12M [5], ALIGN [22], LAION [41], LAION-2B [40], COYO700M [4], and WebLI [8]. These datasets have significantly contributed to the development of multimodal models for image-and-text generation [19, 20, 39, 30, 7] and understanding [22, 21, 38, 53].

Emerging trends in this research area include datasets featuring interleaved sequences of images and text sourced from web corpora, such as M3W [2], web and Wikipedia articles [1], Common Crawl Interleaved data [17], and the Multimodal C4 dataset [62]. These datasets extend conventional image-text alignment training by incorporating multiple images and sentences in an interleaved manner. When trained on these enriched datasets, models such as Flamingo [2], OpenFlamingo [3], and Kosmos-1 [17] can be adapted to various image understanding tasks by being prompted with a few task-relevant examples.

2.2 Image-text instruction tuning

Image-text instruction tuning has grown substantially with the advent of multimodal instruction datasets. For instance, MultiInstruct [52] offers a benchmark comprising 62 diverse multimodal tasks unified in a seq-to-seq format. InstructBLIP [10] extended the scope by transforming 26 existing datasets into the instruction-tuning form. More recently, Otter [29] is trained on MIMIC- IT [28], a multi-modal in-context instruction tuning dataset constructed by grouping multiple similar instructions into a contextual example.

In addition to training on large-scale image-text pairs, MiniGPT-4 is further fine-tuned on a smaller dataset of detailed image descriptions to better align with user intentions [61]. In the same vein, PF-1M [6] is a collection of 37 vision-language datasets that rewrite image annotations in a human- like style. Furthermore, techniques such as LLaVA [32], SVIT [58], LRV-Instruction [57], and LAMM [55] have emerged. These methods leverage language-only APIs such as OpenAI’s GPT- 4 [35] and self-instruction methods [50] to generate instructions and responses. Using images represented by their associated annotations, such as image captions, region descriptions, object bounding boxes, attributes, and relationships, language models can generate responses in various forms, such as short conversations, image captioning, and visual reasoning. Models such as mPLUG- Owl [54], PandaGPT [44], LLaMAAdapter V2 [14], and Multimodal-GPT [15] further extended this area by being jointly trained with language-only and vision-and-language instruction data.

2.3 Multimodal dialogue datasets

Existing multimodal dialogue datasets broadly fall into two categories. The first group comprises datasets where conversations are heavily rooted in and driven by images. Traditional datasets of this type are primarily generated by inviting crowd workers to engage in dialogues about a common image. Notable examples include Visual Dialog [11], which emphasizes question-answering tasks within AI-human chat about visual content, and IGC [34], a compilation of dialogues featuring an image, a corresponding textual description, and a conversation centered on the image. Image- Chat [43] presents a corpus of image-grounded dialogues crafted around provided images. Recently, dialogue datasets entirely generated by LLMs in conjunction with image annotations have surfaced, e.g., LLaVA [32], SVIT [58], and LAMM [55]. Each dialogue in these datasets begins with an inquiry about image attributes or factual knowledge, with responses expected to be brief within 50 words.

The second category features datasets derived from daily human conversations, with images inter- spersed within multi-turn conversations sparsely. For example, OpenViDial 1.0 [33] and OpenViDial 2.0 [49] are sourced from dialogues in movies and TV series, whereas PhotoChat [56] is a human- human dialogue dataset developed through crowdsourcing and features photo-sharing. Other datasets, such as DialogCC [27], MultiModalDialogue [26], and IMAD [48] enhance text-only dialogues by incorporating semantically relevant images. In addition, MMChat [60] and MMDialog [13] encompass image-grounded dialogues derived from social media interactions.

In the present work, we exploit image-text alignment data to construct an instruction-tuning dataset in the form of dialogues. Our dataset, SparklesDialogue, is the first dataset explicitly crafted to explore the interactions between multiple images and word-level textual content. SparklesChat trained on it unlocks the hidden capability of aligned multimodal models that serve as helpful AI assistants capable of interpreting complex prompts involving image interactions.

Table 1: Prompt and response sequence formats used to train SparklesChat. The first and the second conversation turns are illustrated here. The model is trained to predict the assistant answers, and thus only green sequence are used to compute the loss in the auto-regressive model.

3 SparklesChat

We present a multimodal instruction-following model SparklesChat to foster interactions between users and AI assistants across multiple images and illustrate the framework in Figure 1. More implementation details can be found in Appendix B.

3.1 Model

The foundation for SparklesChat is the MiniGPT-4 architecture, which connects a pretrained vision encoder and a pretrained LLM with a projection layer [61]. The language decoder, Vicuna [9], is based on the LLaMA framework [47], which can handle diverse language tasks. For image processing, we use the visual encoder from BLIP-2 [30], combining a Vision Transformer (ViT) backbone [12] with a pretrained Q-Former [30]. In the MiniGPT-4, the input to the language model is a single image representation followed by a sentence embedding of the image description. In SparklesChat, image representations of different images are embedded between text according to their positions in dialogues. Only the projection layer is trainable in the model while other vision and language components are frozen.

3.2 Instruction-tuning

We represent an i-th T -turn dialogue as Xⁱ = (X^i,¹, X^i,¹, · · · , X^i,T , X^i,T ), where each pair of (X^i,t, X^i,t) includes a question from the user and an answer from the assistant in turn-t. For each Xⁱ, we construct T training samples by organizing each pair of questions and answers as a sequence. The prompt X^i,t and response X^i,t at the t-th turn are:

Text Box: Xprompt<SEP>Human : Xi,t<SEP>Assistant :, if t > 1.

Table 1 illustrates the unified format for two-turn dialogue training sequences. We perform instruction- tuning of the LLM on the prediction tokens using the auto-regressive training objective. Specifically, for a sequence of length L, we compute the probability of generating target responses X_response by:

where θ is the trainable parameters, X_prompt_,<l and X_response_,<l are prompt and response tokens in all turns before the current prediction token x_l, respectively. We do not compute the regression loss for the prompt X_prompt since the prompt is provided by users in real-world applications, making it unnecessary for the model to make predictions in this context.

Figure 3: The GPT-assisted data construction process of SparklesDialogue based on Dialogue Demon- strations and Candidate Image Descriptions. The process involves instructing GPT-4 to simulate dialogues between a user and an assistant, focusing on multiple images. Dialogue Demonstrations act as in-context learning examples, enabling GPT-4 to generate well-formatted and diverse dialogues. Meanwhile, Candidate Image Descriptions serve as a pool from which GPT-4 selects relevant images for discussion. Visual images are not sent to GPT-4 for data generation.

4 SparklesDialogue and SparklesEval

To augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns, we introduce two novel resources: SparklesDialogue and SparklesEval, which are multimodal dialogue datasets designed for instruction-following fine-tuning and evaluation, respectively. Our evaluation data, SparklesEval, is constructed using a method similar to that of SparklesDialogue but is only used to benchmark models for multiple images dialogue competence.

4.1 GPT-assisted data construction

We aim to construct a multimodal dialogue dataset that offers fine-grained interactions between multiple images and words, mimicking user-assistant conversations. The dialogues cover real-world concepts, objects, and entities, spanning scenarios that involve generating text materials, seeking advice, guidance, assistance, and much more. Given GPT-4’s capabilities in following complex instructions and extensive world knowledge, it is the primary tool in our dialogue data collection. The data collection process is visualized in Figure 3. We instruct GPT-4 to simulate realistic and diverse conversations between a user and an assistant with advanced image understanding and reasoning capabilities. Each conversation has two turns. In the first turn, the user sends the assistant a reasonable and creative message regarding some images. In response, the assistant offers detailed answers that provide comprehensive reasoning regarding the visual content. In the second turn, the user introduces a new image for further discussion, referencing both the newly introduced and previously discussed images. Again, we prompt the assistant to respond with highly helpful and exceptionally detailed answers that provide comprehensive reasoning to better align with human preference.

We provided GPT-4 with two crucial components to generate the dialogues: Dialogue Demonstration and Candidate Image Descriptions. The Dialogue Demonstrations serve as in-context learning examples, steering GPT-4 towards generating well-formatted and diverse responses. We initiated the creation of hundreds of demonstration dialogues with GPT-4’s assistance, using similar prompts and checking their quality. A small subset of dialogues is randomly chosen each time for demonstration purposes. The Candidate Image Descriptions serves as a candidate pool for the model to select relevant images for discussion. From the pool of image-text paired dataset, we randomly select a small subset as candidates each time. We include the IDs of images in

Table 2: Statistics of SparklesDialogue and SparklesEval.

Figure 4: Characteristics of SparklesDialogueVG.

dialogues to avoid reference ambiguity. Given that the publicly accessible GPT-4 API only accepts text input, we represent images with detailed descriptions. These descriptions, sourced from various image annotations such as image captions, bounding boxes, and region descriptions, comprehensively portray the content of images [61, 58, 32]. We will elaborate on the sources of SparklesDialogue in subsection 4.2. Finally, the generated responses are parsed, and only well-structured results adhering to our desired format are retained for the final dataset. More details, such as prompt templates and visualized examples, can be found in Appendix F and Appendix G.

4.2 Statistics and characteristics

We collect two subsets to construct a robust and diverse dataset: SparklesDialogueCC and Sparkles- DialogueVG. The respective detailed descriptions, provided in MiniGPT-4 [61] and SVIT [58], correspond to image sources from Conceptual Captions (CC) [42] and Visual Genome (VG) [25]. SparklesDialogueVG is of high quality as the VG image descriptions generated by GPT-4 benefit from human-annotated captions, objects, and regions [58]. On the other hand, SparklesDialogueCC enriches SparklesDialogue by drawing from a more extensive set of images – 3.3 million in CC com- pared to 0.1 million in VG. However, the CC image descriptions are generated by a multimodal model with image features but not human annotations and are more prone to object hallucination issues [61]. Our ablation study elaborated in section 5.3 demonstrates that combining these two subsets improves SparklesChat’s capacity for understanding and reasoning across images and text. SparklesEval emphasizes more on accuracy and is thus constructed using the same source as SparklesDialogueVG.

Table 2 provides the data statistics for SparklesDialogue and SparklesEval. SparklesDialogueCC comprises 4.5K dialogues, each consisting of at least two images spanning two conversational turns. On the other hand, SparklesDialogueVG includes 2K dialogues, each with at least three distinct images across two turns. SparklesEval includes 150 dialogues, with one-third containing two images in both the first and second conversational turns.

Figure 4 shows the characteristics of our dataset using SparklesDialogueVG as a representative subset. We explore key elements such as the root verb-noun pairs in user messages, a word cloud of assistant messages, and the length distributions. The questions from users are diverse, ranging from generating text materials to seeking advice or discussing the relationships between images, such as comparison and connection. The dialogues span various real-world topics, including the environment, nature, life, cities, etc. The high average word count in assistant messages suggests that the responses in SparklesDialogue are thorough and detailed. For details on extracting root verb-noun pairs and its visualization based on image count in each turn, please refer to Appendix D.

4.3 GPT-assisted evaluation: SparklesEval

While previous research, such as visual storytelling, has leaned toward human evaluations as superior to quantitative measures, these evaluations are often subjective, costly, and time-consuming [18]. Inspired by recent advancements in LLMs, which have shown consistency with human assessment to assess the quality of outputs [59], we devised SparklesEval, an evaluation benchmark assisted by GPT-4 in data construction and evaluation. This approach enables a quantitative assessment of a model’s conversational competence across multiple images and dialogue turns. The evaluation prompt for SparklesEval is given in Appendix E.

In our evaluation, we provide the GPT models with the complete dialogue and corresponding visual information formatted as captions. Our judge models are designed to assess a single dialogue per prompt, a strategy that eliminates position bias and improves evaluation efficiency. Position bias refers to potential favor given to certain positions when multiple dialogues are assessed within a single prompt [59]. This approach is more efficient because it avoids recalculating combined scores when evaluating multiple dialogues.

Our approach differs from prior GPT-assisted evaluations, which typically prompt the GPT models to produce a final score along with a rationale [32, 59]. Instead, we prompt the GPT models to assess the dialogue based on three distinct criteria across two turns, providing reasons and rating on a scale of 1 to 10 for each assessment. The three criteria are as follows:

(C1) Image understanding and reasoning: Assess the assistant’s proficiency in accurately identify- ing and describing objects, contexts, and relationships within and across the images.

(C2) Cross-image and cross-turn coherence: Evaluate the assistant’s ability to maintain consistent understanding across multiple images and dialogue turns.

(C3) Relevance and completeness of responses: Determine the extent to which the assistant’s re- sponses are directly related to the user’s inquiries and the images’ content, and whether the responses provide comprehensive and detailed answers.

Following this, we ask the GPT models to assign a combined score for each turn. For each model’s evaluation results, we gather scores for three criteria across two turns. First, we compute the mean scores for all criteria over evaluation samples. Next, we calculate the combined scores A1 and A2

Figure 5: Comparison between SparklesChat (left) and MiniGPT-4 (right) on examples of NLVR2 and BISON.

by averaging their respective criteria scores, namely A1 = mean(C1, C2, C3) for the first turn and A2 = mean(C1, C2, C3) for the second turn. We refrain from using the A1 and A2 scores provided by the judge models, as their calculations may be inaccurate. Ultimately, we derive a final overall score by averaging A1 and A2. Through this methodology, our evaluation is more holistic and interpretable. To encourage diversity in evaluation, SparklesEval was curated by analyzing the verb-noun distribution in user questions and selecting those that appear only once.

5 Experiments

5.1 Zero-shot evaluation on vision-and-language tasks

We chose two vision-and-language tasks, binary image selection and visual reasoning, to evaluate zero-shot understanding and reasoning capabilities over multiple images.

Table 3: Model comparison on BISON, NLVR2 and SparklesEval. For the BISON and NLVR2 benchmarks, the evaluation metric is accuracy. For SparklesEval, scores are rated from 1 to 10. MiniGPT-4* represents our re-implement of MiniGPT-4 following the same experimental setup as SparklesChat. We investigate training models on different data sources, including detailed descriptions, complex reasoning, and dialogue data. Notably, GPT-4, a text-based reference LLM, achieves high scores on SparklesEval largely due to its use of detailed ground-truth annotations.

Binary image selection on BISON The Binary Image Selection task measures a model’s ability to select the correct image from a pair given a text query that describes one of them [16]. The model’s performance is assessed in terms of binary classification accuracy. For this task, 150 examples were randomly sampled from the COCO-BISON dataset¹.

Visual reasoning with natural language on NLVR2 The evaluation of the Visual Reasoning with Natural Language task assesses the model’s ability to predict whether a sentence is true about a pair of images [45]. This task addresses the challenge of compositional visual reasoning on relations, comparisons, and quantities. The NLVR2 dataset [45] was used for this evaluation, with 150 examples randomly sampled from the public balanced test set².

Evaluation protocol and prompt design Models are evaluated on these tasks without any additional training. Inspired by [24], we used a simple prompt, “Let’s think step by step”, to facilitate step-by-step reasoning before answering each question. We used the phrase “Therefore, the answer is” to prompt the answer. Instead of using a two-stage prompting as in [24], we combined the reasoning extraction and answer extraction stages into a single prompt: “Please start your response with ’Let’s think step by step.’ and end with ’Therefore, the answer is’”. For the full evaluation prompt, please refer to Appendix E. We regenerated the response if the model failed to follow the instructions to output responses in the specified format. This approach ensures an unambiguous response and allows us to extract a potential answer from the text following the last occurrence of “Therefore”.

5.2 Comparison of model performance

Table 3 compares the performance of different models on BISON, NLVR2, and SparklesEval evaluation datasets. We adapt all models to accept multiple images as input and evaluate them under the same settings for fair comparisons. We compare our SparklesChat with MiniGPT-4. Additionally, we provide the results of our re-implemented version of MiniGPT-4, denoted as MiniGPT-4*, under the same experimental settings as SparklesChat. We investigate training these models on different data sources, including detailed description data from MiniGPT-4 [61] and LLaVA [32], complex reasoning data from LLaVA [32], and dialogue data from our SparklesDialogue. Side-by-side comparisons of example outputs for SparklesChat and MiniGPT-4 on SparklesEval, and on BISON and NLVR2 can be found in Figure 2 and Figure 5 respectively. What’s more, we provide a detailed evaluation of the performance of GPT-4, MiniGPT-4, and SparklesChat on SparklesEval with three different versions of judge models in Appendix C.

We compare our dialogue dataset SparklesDialogue with related description and reasoning datasets from LLaVA [32] using data formats similar to SparklesDialogue with interleaved images and text. We have eliminated samples from train sets that overlapped with evaluation sets. For the results in

Table 4: Ablation studies on BISON, NLVR2 and SparklesEval. We study the effects of training SparklesChat using variants of SparklesDialogue on different ratios of dialogue turns and using different subsets. For the BISON and NLVR2 benchmarks, the evaluation metric is accuracy. For SparklesEval, scores are rated from 1 to 10.

Table 3, we can see that when SparklesChat is trained on description data [32], it exhibits lower performance compared to when trained on dialogue data. However, it still outperforms MiniGPT-4* on the BISON and NLVR2 tasks. When SparklesChat is trained on reasoning data [32], it achieves improved performance over models trained on description data on all metrics. This suggests that incorporating reasoning data during training enhances the model’s performance.

SparklesChat, trained on our SparklesDialogue, outperforms all other models on BISON and NLVR2, achieving an accuracy of 56.7% and 58.0%, respectively. This demonstrates the effectiveness of SparklesChat in handling tasks that require fine-grained visual grounding and compositional visual reasoning over two images. Moreover, SparklesChat outperforms multimodal models significantly on the SparklesEval benchmark, with an overall score of 8.56 out of 10. In comparison, models trained on description data have an approximate score of 3, and models trained on reasoning data achieve a score of 6.71. SparklesChat attains the highest scores among multimodal models in both the first and second turns across all criteria. This indicates its superior ability in image understanding and reasoning, maintaining cross-image and cross-turn coherence, and generating relevant and complete responses. GPT-4 achieves the highest score of 9.26, mainly due to its utilization of detailed ground- truth annotations. SparklesChat’s score is about 92% of the GPT-4 score, underscoring SparklesChat’s conversational competence across various images and dialogue turns.

5.3 Ablation studies

We study the effect of training SparklesChat using data variants while keeping other training parameters constant. We assess performance on BISON, NLVR2, and SparklesEval benchmarks in Table 4, which mainly evaluate models’ ability to describe, reason, and converse across images.

Effect of dialogue turns in SparklesDialogue. We first train models with individual dialogue turns. The results in Table 4 show that the model trained solely on the first turn performs better than solely on the second across all metrics. Furthermore, this model outperforms those trained on the baseline datasets, demonstrating that our dataset boosts reasoning and conversational abilities even when used in isolation with just the first turn. Conversely, training only with the second dialogue turns reduces scores on BISON and NLVR2. This could stem from the extended prompts in the second turn, which includes the content of the first turn, making them less aligned with the short prompt format favored by BISON and NLVR2. Then, we train models with SparklesDialogue blending with the same ratios of samples constructed from two dialogue turns. The results are better than only training with the second turn in all metrics, while worse than only training from the first turn in the task of NLVR2. An increase in the sampling ratios of the second turn data results in a performance drop as expected. Thus, we increase the sampling ratio of the first-turn data until we cannot observe performance boosting. We finally settled on a 2:1 ratio for the first turn to the second turn as our default setting as it achieves balanced good performance across all benchmarks.

Effect of subsets of SparklesDialogue. Our model has been trained on two subsets of Sparkles- Dialogue: SparklesDialogueCC and SparklesDialogueVG. We observe from Table 4 that the model trained on SparklesDialogueVG outperforms that trained on SparklesDialogueCC in both the BISON

and SparklesEval evaluations, scoring 54.7% and 8.59, respectively, compared to 44.7% and 8.18. This enhanced performance is partly due to the higher quality of SparklesDialogueVG, which benefits from human-annotated data as discussed in subsection 4.2. It’s worth noting that SparklesDialogueVG and SparklesEval use the same sources of images and captions, which could partially account for the higher score achieved by SparklesDialogueVG on SparklesEval. Both subsets demonstrate similar efficacy on the NLVR2 test. Combining both subsets yields higher performance on the BISON and NLVR2 tests, scoring 56.7% and 58.0% respectively. This surpasses the scores achieved by using either subset alone. In addition, the model trained on the combined dataset performs comparably to SparklesDialogueVG in the SparklesEval test, scoring 8.56 versus 8.59. This suggests that combining SparklesDialogueVG’s high-quality data and SparklesDialogueCC’s diverse data results in a more robust and versatile dataset for enhancing models’ capabilities in understanding and reasoning across images and text.

5.4 Demonstrations and applications

We conducted qualitative demonstrations to showcase SparklesChat’s broad applications in free-form scenarios by asking questions such as: “Create a story that takes place in for the characters depicted in .”, “Imagine a dialogue between Harry Potter and that takes place in the scene of .”, “Create a song where the scene twists from to .”, “Create a title for this song that takes inspiration from .”. The visualization of results is shown in Appendix A.

6 Limitations, future works, and conclusion

We discuss some limitations of this work to inspire future research in this field. First, SparklesChat shares common drawbacks with large language models, such as being out-of-date in its knowledge, sometimes providing inaccurate information, and having limited context length and inference speed. Potential solutions may include regular updates to the model’s knowledge base and fine-tuning with more reliable data sources. Second, SparklesChat inherits weaknesses from vision models, such as inaccurate object recognition, people/places identification, or visual relationships reasoning. This calls for a more powerful visual perception model, and training on more well-aligned image-text datasets. Third, SparklesChat occasionally encounters difficulties maintaining multi-image and multi-turn consistency. Specifically, the model may lose the context of prior images after several dialogue turns or mix up the contents of different images. Potential solutions involve advanced model designs in position encoding and attention mechanisms to enhance the model’s consistency in recalling historical images and dialogues. Fourth, SparklesDialogue primarily concentrates on natural images, which limits its versatility in handling text-rich images such as charts, tables, and receipts, as well as domain-specific images such as medical scans, math illustrations, and satellite photos. Moreover, the dialogues in SparklesDialogue do not cover all possible user scenarios. Therefore, broadening the dataset to cover more diverse image types and user cases is a direction for future work. Fifth, the reliability of SparklesEval is tied to the capabilities of current GPT models. This limitation can be mitigated by incorporating more robust judge models and the assistance of human evaluators. Lastly, further safety considerations are needed to mitigate potential misuse of the model. Future works addressing these issues should make for a more reliable and robust system.

In conclusion, this work unlocks multimodal instruction-following models’ capabilities in open- ended dialogues involving multiple images. We introduced SparklesChat, a model designed to handle word-level text interactions in a multimodal context, offering natural conversational flow and direct context awareness. To facilitate the training of SparklesChat, we also presented SparklesDialogue, the first machine-generated dialogue dataset tailored for multi-image and word-level text interactions. Furthermore, we proposed SparklesEval, a specialized benchmark for quantitatively assessing a model’s multimodal conversational competence. Experimental results demonstrated SparklesChat’s superiority over existing models in both standard vision-and-language tasks and the newly-introduced SparklesEval benchmark. We also conducted qualitative demonstrations to showcase the model’s broad applications in free-form scenarios.

Figure 6: Demonstration of SparklesChat to create a story and a dialogue that connects places and characters.

Figure 7: Demonstration of SparklesChat to compose a song containing two scenes and generate a song title inspired by another image.

Figure 8: Demonstration of SparklesChat to describe and reason about different groups of images.

References

[1] Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.

[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, volume 35, pages 23716–23736, 2022.

[3] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

[4] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset, 2022.

[5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.

[6] Delong Chen, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. Visual instruction tuning with polite flamingo. arXiv preprint arXiv:2307.01003, 2023.

[7] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855, 2023.

[8] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.

[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.

[10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

[11] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017.

[12] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022.

[13] Jiazhan Feng, Qingfeng Sun, Can Xu, Pu Zhao, Yaming Yang, Chongyang Tao, Dongyan Zhao, and Qingwei Lin. MMDialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7348–7363, Toronto, Canada, July 2023. Association for Computational Linguistics.

[14] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.

[15] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.

[16] Hexiang Hu, Ishan Misra, and Laurens Van Der Maaten. Evaluating text-to-image matching using binary image selection (bison). In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.

[17] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.

[18] Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, pages 1233–1239, 2016.

[19] Yupan Huang, Hongwei Xue, Bei Liu, and Yutong Lu. Unifying multimodal transformer for bi-directional image and text generation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1138–1147, 2021.

[20] Yupan Huang, Zhaoyang Zeng, and Yutong Lu. Be specific, be clear: Bridging machine and human captions by scene-guided transformer. In Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding, pages 4–13, 2021.

[21] Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12976–12985, 2021.

[22] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.

[23] Nikita Kitaev and Dan Klein. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[24] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.

[25] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[26] Nyoungwoo Lee, Suwon Shin, Jaegul Choo, Ho-Jin Choi, and Sung-Hyon Myaeng. Con- structing multi-modal dialogue dataset by replacing text with semantically relevant images. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 897–906, 2021.

[27] Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, and Ho-Jin Choi. Dialogcc: Large-scale multi- modal dialogue dataset. arXiv preprint arXiv:2212.04119, 2022.

[28] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.

[29] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.

[30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.

[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

[32] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

[33] Yuxian Meng, Shuhe Wang, Qinghong Han, Xiaofei Sun, Fei Wu, Rui Yan, and Jiwei Li. Openvidial: A large-scale, open-domain dialogue dataset with visual contexts. arXiv preprint arXiv:2012.15015, 2020.

[34] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, November 2017.

[35] Openai. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[36] Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. Im2text: Describing images using 1 million captioned photographs. In NeurIPS, 2011.

[37] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.

[40] Christoph Schuhmann, Romain Beaumont, Cade W Gordon, Ross Wightman, Theo Coombes, Aarush Katta, Clayton Mullis, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

[41] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

[42] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.

[43] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Image-chat: Engaging grounded conversations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2414–2429, Online, July 2020. Association for Computational Linguistics.

[44] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.

[45] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, pages 6418–6428, 2019.

[46] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 2016.

[47] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[48] Moskvoretskii Viktor, Frolov Anton, and Kuznetsov Denis. Imad: Image-augmented multi- modal dialogue, 2023.

[49] Shuhe Wang, Yuxian Meng, Xiaoya Li, Xiaofei Sun, Rongbin Ouyang, and Jiwei Li. Openvidial 2.0: A larger-scale, open-domain dialogue generation dataset with visual contexts. arXiv preprint arXiv:2109.12761, 2021.

[50] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instruc- tions, 2022.

[51] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

[52] Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773, 2022.

[53] Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. Advances in Neural Information Processing Systems, 34:4514–4528, 2021.

[54] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

[55] Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, et al. Lamm: Language-assisted multi-modal instruction- tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023.

[56] Xiaoxue Zang, Lijuan Liu, Maria Wang, Yang Song, Hao Zhang, and Jindong Chen. PhotoChat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6142–6152, Online, August 2021. Association for Computational Linguistics.

[57] Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023.

[58] Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.

[59] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

[60] Yinhe Zheng, Guanyi Chen, Xin Liu, and Jian Sun. MMChat: Multi-modal chat dataset on social media. In Proceedings of The 13th Language Resources and Evaluation Conference. European Language Resources Association, 2022.

[61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models, 2023.

[62] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.

Appendix

A Demonstrations and applications

We conducted qualitative demonstrations to showcase the model’s wide applications in free-form scenarios in Figure 6, Figure 7, and Figure 8.

B Implementation details

We implemented SparklesChat on the MiniGPT-4 codebase [61]³, which is derived from Vicuna [9]. We refer to MiniGPT-4’s efficient fine-tuning process and tune SparklesChat using 1,500 training steps with a batch size of 8, based on MiniGPT-4’s first-stage pretrained model. Our training data of SparklesDialogue is sampled with the same ratio from SparklesDialogueCC and SparklesDialogueVG, and with sampling ratios of 2 and 1 from the first and second turns of dialogues, respectively.

During instruction-tuning, we follow MiniGPT-4 to use <Img><ImageHere></Img> to repre- sent images [61]. In practice, all tags of <ImageHere> are replaced by the ViT visual features produced by a linear projection layer. Tags of <Img> and </Img> are language to- kens that serve as signals for the start and end of images. A system message X_system is appended to the beginning of each prompt. We also append Human: and Assistant: before each user and assistant messages to equip the model with conversation capability. System, user, and assistant messages are separated by a separator <SEP>. The system mes- sage X_system = Give the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions. The separator

<SEP> = ###.

We tailored the OpenAI’s GPT-4 API (gpt-4-0613) parameters to balance diversity and quality for constructing SparklesDialogue and SparklesEval. We set the temperature and top_p parameters to 1.0, the max_tokens parameter to 2048, and both the frequency_penalty and presence_penalty parameters to 0.0. In each query to the GPT-4 API, the “system” role was allocated the default instruction You are a helpful assistant. As of July 2023, the cost for generating 1,000 tokens was $0.06 for outputs and $0.03 for inputs within an 8K context⁴, leading to a total dataset generation cost of approximately $500. The cost of evaluating a model on SparklesEval is approximately $1.4 and $14 using gpt-3.5-turbo-0613 and gpt-4-0613, respectively.

C Judging with different versions of GPT models

As of July 2023, while it is widely recognized that employing gpt-4 as a judge model outperforms alternatives such as gpt-3.5-turbo, the cost of using gpt-4 is significantly higher. Therefore, we also provide scores generated by gpt-3.5-turbo as a reader reference, although we strongly recommend utilizing gpt-4 or more advanced future models as reliable judges. We adopt the latest version gpt-4-0701* as our default judge model.

We evaluate GPT-4, MiniGPT-4, and SparklesChat using SparklesEval, leveraging three versions of judge models, as presented in Table 5. Both MiniGPT-4 and SparklesChat generate responses based on the question and accompanying visual image. At the same time, GPT-4 is a reference LLM that only uses textual information, including the question, the ground-truth bounding boxes, and captions. From the table, we observe that the more advanced judge models – gpt-4-0613 and gpt-4-0701* – provide higher scores compared to the older gpt-3.5-turbo-0613 when assessing both GPT-4 and OurModel (approximately nine versus eight). However, these advanced judge models yield considerably lower scores for MiniGPT-4 (about three versus five). GPT-4 achieves the highest score of 9.26 out of 10 when evaluated by the default gpt-4-0701* mainly due to its use of detailed ground-truth annotations. Nevertheless, it’s worth noting LLM judge models may display a self-enhancement bias, favoring the responses they generate [59]. In contrast, MiniGPT-4 performs behind with a score of just 3.91. SparklesChat achieves a score of 8.56 – about 92% of the GPT-4

Table 5: Evaluation results on SparklesEval with different judge models. The version of gpt-4-0701* refers to API version 2023-07-01-preview for the GPT-4 model.

score – demonstrating SparklesChat’s efficacy in generating responses that are not only relevant and complete but also exhibit cross-image and cross-turn coherence.

D Verb-noun distribution analysis

For verb-noun distribution, we follow Self-instruct [50] to extract the verb closest to the root and its first direct noun object and plot the top 20 most common root verbs and their top 4 direct noun objects. We use the Berkeley Neural Parser⁵ [23] to parse user messages. We mainly focus on the last sentence of each message because it usually contains the question. If we can’t extract the verb-noun pair from it, we look at the first sentence instead. For SparklesDialogueVG, we visualize the verb-noun distributions regarding different numbers of images in each turn in Figure 9.

E Evaluation Details

The prompt formats to evaluate NLVR2 and BISON datasets are presented in Table 6. The prompt format of GPT-assisted evaluation on SparklesEval is presented in Table 7.

The image source of COCO-BISON is COCO images. The image source of SparklesDialogueCC is Conceptual Captions, which should have no overlap with COCO. However, our SparklesDialogueVG originates from the Visual Genome, which includes a subset of COCO images. We carefully eliminate any overlapping images to ensure no overlap between the training and evaluation data. The images in the NLVR2 dataset are sourced from Google Images, distinct from our SparklesDialogueVG’s image source of the Visual Genome [25] and primarily feature images from Flickr.

F LLM-instructed single dialogue generation for SparklesDialogueVG

For SparklesDialogueVG, we generate one two-turn dialogue at a time, with the first turn incorporating two or three images. We derive the demonstration dialogues from SparklesDialogueCC to encourage diversity. However, to minimize redundancy, we retain only those dialogues with unique verb-noun combinations in the user questions. This results in pools of 661 and 441 demonstration dialogues for conversations incorporating two or three images in the first turn, respectively. We pull from an expansive collection of roughly 100,000 image-text pairs for this dataset. We randomly select four candidates each time, and they are not reused by excluding them from future selections.

We first present our designed prompt for LLM-instructed Single Dialogue Generation to generate SparklesDialogueCC in Table 8. Then, we show a case of the Dialogue Demonstration and Candidate Image Descriptions to construct the prompt. Finally, we show the corresponding generated dialogue using the example prompt.

Example of dialogue demonstration We visualize the images corresponding to image IDs in the dialogues in Figure 10 for reference, while these visual images were not sent to GPT-4 for data

Figure 9: Root verb-noun distributions of SparklesDialogueVG.

Table 6: Prompt Formats to evaluate NLVR2 and BISON datasets.

Figure 10: Reference images corresponding to the image IDs in the demonstration dialogues in section F. These images were not sent to GPT-4 for data generation.

Table 7: Prompt Format for SparklesEval Evaluation.

Table 8: Prompt for LLM-instructed Single Dialogue Generation.

Figure 11: Candidate images corresponding to the image IDs in the dialogues generation process in section F. These images were not sent to GPT-4 for data generation.

Dialogue Example from SparklesDialogueVG. The generated dialogue is visualized in Figure 12. The raw text is shown as follows. The image IDs in the dialogue refer to the images in Figure 11.

Figure 12: Dialogue Example from SparklesDialogueVG. Visual images were not provided to GPT-4 during data generation but will be incorporated during SparklesChat training.

G LLM-instructed multiple dialogues generation for SparklesDialogueCC

For SparklesDialogueCC, we prompt GPT-4 to generate three dialogues in a single response. These dialogues incorporate one, two, and three images in the first turn and a single image in the second. Each prompt includes three demonstration dialogues and nine candidate image descriptions to facilitate this. We curated 150 demonstration dialogues, evenly split with 50 dialogues for each type. The complete image-text dataset comprises about 3,500 pairs.

We first present our designed prompt for LLM-instructed Multiple Dialogues Generation to generate SparklesDialogueCC in Table 9. Then, we show a case of the Dialogue Demonstrations and Candidate Image Descriptions to construct the prompt. Finally, we show the corresponding generated dialogues using the example prompt.

Example of dialogue demonstrations We visualize the images corresponding to image IDs in the dialogues in Figure 13 for reference, while these visual images were not sent to GPT-4 for data generation. Note that we abbreviate the message content of the assistant in the second turn as “…” to save space, considering that the previous message contents have provided enough demonstrations.

Table 9: Prompt for LLM-instructed Multiple Dialogues Generation.

Figure 13: Reference images corresponding to the image IDs in the demonstration dialogues in section G. These images were not sent to GPT-4 for data generation.

Example of candidate images descriptions An example of Candidate Image Descriptions is shown below, and their corresponding source images are shown in Figure 14 for reference (they are not sent to GPT-4).

Figure 14: Candidate images corresponding to the image IDs in the dialogues generation process in section G. These images were not sent to GPT-4 for data generation.

Dialogue examples from SparklesDialogueCC. The generated dialogue is visualized in Figure 15. The raw text is shown as follows. The image IDs in the dialogues refer to the images in Figure 14.

Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models