InterestFM-TTE Embedding
Introduction
We introduce InterestFM-TTE (IFM-TTE), a Universal Multimodal Embedding (UME) model that supports a wide range of tasks, such as classification, retrieval, visual question answering, and visual grounding, with a single Multimodal-LLM-powered embedding model. IFM-TTE achieves SOTA performance across this diverse set of tasks as measured by the well-known Massive Multimodal Embedding Benchmark, MMEB-v2, outperforming the current SOTA model by a large margin (74.1 vs. 71.6). We focus on embeddings that not only capture simple image/text content similarity but also follow user instructions. Such embeddings unlock a wide range of applications, such as retrieving documents or images from nuanced queries and powering personalized recommendation systems.
Method
Think Then Embed: Generative Context Improves Multimodal Embedding
We observe that for retrieval tasks requiring instruction understanding, existing encoder-based methods often falter on complex instructions that demand multi-step reasoning and contextual grounding. To address this, we introduce an explicit “thinking” stage, which leverages Chain-of-Thought reasoning traces, before producing embeddings. The thinking stage is handled by an LLM reasoner ranging from 7B to 70B parameters, which provides useful context tokens for a 7B embedder. Without additional data or sophisticated training techniques, this approach achieves a 10% absolute improvement over the 7B baseline.
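To make the two-stage design concrete, here is a minimal sketch of a think-then-embed pipeline. It is text-only for brevity (the actual model is multimodal), and the model names, prompt template, and last-token pooling choice are illustrative assumptions rather than the exact IFM-TTE recipe.

```python
# Sketch of a think-then-embed pipeline: a large reasoner generates a CoT trace,
# and a smaller embedder consumes the query plus the trace to produce the embedding.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

REASONER = "meta-llama/Llama-3.1-70B-Instruct"  # placeholder teacher reasoner
EMBEDDER = "Qwen/Qwen2-7B-Instruct"             # placeholder 7B embedder backbone

r_tok = AutoTokenizer.from_pretrained(REASONER)
reasoner = AutoModelForCausalLM.from_pretrained(REASONER, torch_dtype=torch.bfloat16)
e_tok = AutoTokenizer.from_pretrained(EMBEDDER)
embedder = AutoModelForCausalLM.from_pretrained(EMBEDDER, torch_dtype=torch.bfloat16)

def think_then_embed(instruction: str, query: str) -> torch.Tensor:
    # 1) "Think": the reasoner writes a reasoning trace for the instruction + query.
    prompt = (f"Instruction: {instruction}\nQuery: {query}\n"
              "Reason step by step about what should be retrieved:")
    ids = r_tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        gen = reasoner.generate(ids, max_new_tokens=128, do_sample=False)
    trace = r_tok.decode(gen[0, ids.shape[1]:], skip_special_tokens=True)

    # 2) "Embed": the embedder sees query + trace; pool the last-token hidden state.
    enc = e_tok(f"{instruction}\n{query}\nContext: {trace}", return_tensors="pt")
    with torch.no_grad():
        out = embedder(**enc, output_hidden_states=True)
    emb = out.hidden_states[-1][:, -1, :]   # last-token pooling (one common choice)
    return F.normalize(emb, dim=-1)
```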
To improve the efficiency of TTE, we distill the reasoning traces of the larger teacher reasoner (70B) into a small student reasoner (7B). Inspired by the CAFe work, we explored a unified model that generates reasoning tokens and then directly extracts embeddings on top of the reasoner via a thin layer (Figure 1c). This approach cuts parameters by 50% and inference cost by 20%, yielding a more compact model that retains the performance of separate reasoners and embedders. More details can be found in the arXiv paper.
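The sketch below illustrates one way such a unified model could be wired up: a single backbone is trained both to reproduce teacher reasoning traces (standard next-token cross-entropy) and to produce embeddings through a thin linear head with an in-batch contrastive loss. The head design, pooling, loss weighting, and batch keys are assumptions for illustration, not the exact setup from the paper.

```python
# Sketch of a unified reasoner + embedder with a thin embedding head (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedReasonerEmbedder(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int, embed_dim: int = 1024):
        super().__init__()
        self.backbone = backbone                      # HF-style causal LM
        self.embed_head = nn.Linear(hidden_size, embed_dim)  # the "thin layer"

    def forward(self, input_ids, attention_mask, labels=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask,
                            labels=labels, output_hidden_states=True)
        pooled = out.hidden_states[-1][:, -1, :]      # last-token pooling
        emb = F.normalize(self.embed_head(pooled), dim=-1)
        return out.loss, emb                          # (LM loss if labels given, embedding)

def training_step(model, batch, temperature=0.05, alpha=1.0):
    # Distillation: student is supervised with teacher-generated trace tokens
    # ("teacher_trace_labels" is a hypothetical batch key) via next-token cross-entropy.
    lm_loss, q_emb = model(batch["q_ids"], batch["q_mask"],
                           labels=batch["teacher_trace_labels"])
    _, d_emb = model(batch["d_ids"], batch["d_mask"])

    # In-batch contrastive (InfoNCE) loss for the embedding head.
    logits = q_emb @ d_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets) + alpha * lm_loss
```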
To push performance further, we apply hard negative mining and incorporate additional high-quality video retrieval data, as described below.
Hard Negative Mining
Another major challenge in training robust embedding models with contrastive learning is the selection of hard negatives, i.e., samples that are semantically similar to the target but are not true matches. We address this with a cluster-based hard negative mining approach. First, an embedding model is trained using in-batch negatives and is then used to generate embeddings for the training pool. For each query, candidates are ranked by embedding similarity, and the top-k-ranked non-matching candidates are selected to form a rank matrix. Finally, a clustering algorithm is applied to construct batches containing hard negatives, which are used to retrain and strengthen the embedding model. We observe up to 4% performance gains across tasks with this hard negative mining approach.
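A minimal sketch of this mining pipeline is shown below. It follows the steps described above (rank candidates with the first-stage model, keep the top-k non-matches, then cluster so that queries sharing confusable candidates land in the same batch); the clustering choice (k-means over mean hard-negative embeddings) and batch construction details are assumptions for illustration.

```python
# Sketch of cluster-based hard negative mining for contrastive batch construction.
import numpy as np
from sklearn.cluster import KMeans

def mine_hard_negative_batches(query_embs, cand_embs, positive_idx, k=16, n_batches=64):
    """query_embs:  (Q, D) L2-normalized query embeddings from the first-stage model
       cand_embs:   (C, D) L2-normalized candidate embeddings
       positive_idx: (Q,)  index of the true match for each query
       Returns a list of query-index batches whose members share hard negatives.
       Assumes C is much larger than k so every query has at least k non-matches."""
    # Rank candidates by cosine similarity; keep top-k non-matching ones per query.
    sims = query_embs @ cand_embs.T                      # (Q, C)
    order = np.argsort(-sims, axis=1)
    rank_matrix = np.empty((len(query_embs), k), dtype=np.int64)
    for q, ranked in enumerate(order):
        rank_matrix[q] = ranked[ranked != positive_idx[q]][:k]   # drop the true match

    # Cluster queries by their hard-negative profile so that in-batch negatives are hard.
    profile = cand_embs[rank_matrix].mean(axis=1)        # (Q, D) mean hard-negative embedding
    labels = KMeans(n_clusters=n_batches, n_init=10).fit_predict(profile)
    return [np.where(labels == b)[0] for b in range(n_batches)]
```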
In-House Data
To further enhance video retrieval, we construct high-quality in-house video training data, which pushes SOTA further with another ~1% absolute improvement on the already-SOTA 7B model.
Evaluation
MMEB-V2
We adopt the public MMEB-v2 benchmark, which covers 78 tasks across image, video, and visual document modalities.
The following table shows the results (hits@1 for Image and Video, NDCG@5 for Visdoc; a short sketch of both metrics follows the table) of our final IFM-TTE-7B (TTE + hard negative mining + additional data) on the MMEB-v2 leaderboard. Notably, IFM-TTE-7B ranks No. 1 (74.1 overall), outperforming the second-place model by a large margin.
| Rank | Model | Embedder size (B params) | Overall | Image | Video | Visdoc |
|---|---|---|---|---|---|---|
| 1 | IFM-TTE-7B | 8.29 | 74.1 | 77.9 | 59.2 | 79.5 |
| 2 | RzenEmbed-v2-7B | 8.29 | 71.6 | 75.9 | 55.7 | 77.1 |
| 3 | Seed-1.6-embedding | unknown | 71.3 | 77.8 | 55.3 | 73.4 |
| 4 | RzenEmbed-v1-7B | 8.29 | 68.9 | 73.6 | 48.9 | 76.8 |
| 5 | Ops-MM-embedding-v1-7B | 8.29 | 67.6 | 72.7 | 53.8 | 70.3 |
| 6 | RzenEmbed-v1-2B | 2.21 | 64.4 | 68.5 | 42.6 | 74.4 |
| 7 | Ops-MM-embedding-v1-2B | 2.21 | 63.4 | 69.0 | 47.6 | 66.9 |
| 8 | interestFM-UIR-CAFe-7B | 8.03 | 60.6 | 67.6 | 42.4 | 63.9 |
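For reference, here is a minimal sketch of the two reported metrics as they are conventionally defined (per-query, then averaged over queries); this mirrors the standard definitions rather than the exact MMEB-v2 evaluation code.

```python
# Standard definitions of hits@1 and NDCG@5 for a single query.
import numpy as np

def hits_at_1(ranked_ids, relevant_id):
    """1.0 if the top-ranked candidate is the relevant one, else 0.0."""
    return float(ranked_ids[0] == relevant_id)

def ndcg_at_5(ranked_ids, relevance):
    """NDCG@5: DCG of the top-5 ranking divided by the DCG of the ideal ranking.
       `relevance` maps candidate id -> graded relevance (0 if missing)."""
    gains = [relevance.get(c, 0.0) for c in ranked_ids[:5]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:5]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```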
BibTeX
@misc{cui2025thinkembedgenerativecontext,
title={Think Then Embed: Generative Context Improves Multimodal Embedding},
author={Xuanming Cui and Jianpeng Cheng and Hong-you Chen and Satya Narayan Shukla and Abhijeet Awasthi and Xichen Pan and Chaitanya Ahuja and Shlok Kumar Mishra and Qi Guo and Ser-Nam Lim and Aashu Singh and Xiangjun Fan},
year={2025},
eprint={2510.05014},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.05014},
}