Key Highlights

  • First Natively Multimodal Embedding: Unified vector space for text, image, video, audio, and PDF
  • MTEB English #1: 68.32 score, leading in classification (+9.6), retrieval (+9.0), and clustering (+3.7)
  • Flexible Dimensions: Default 3072, supports 128–3072 truncation via Matryoshka Representation Learning (MRL); 768-dim still achieves 67.99
  • Extended Input: Up to 8192 text tokens, 6 images/request, 120-second video
  • 100+ Languages: Multilingual embeddings, MTEB Multilingual Top 5

Background

On March 10, 2026, Google officially launched Gemini Embedding 2 Preview — the first natively multimodal embedding model in the Gemini series. Unlike the text-only text-embedding-004 and gemini-embedding-001, Gemini Embedding 2 maps text, images, video, audio, and PDF documents into a single unified vector space, enabling true cross-modal semantic retrieval. This means you can search for relevant images using text, or retrieve matching documents using an image — all modalities share the same vector representation without separate processing. APIYI has launched gemini-embedding-2-preview, accessible via the OpenAI-compatible /v1/embeddings endpoint.

Detailed Analysis

Core Features

Native Multimodal

Text, image, video, audio, and PDF in a unified vector space for cross-modal semantic search and similarity

MTEB #1

English 68.32 tops the leaderboard, major leads in classification, retrieval, and clustering; Multilingual Top 5

Matryoshka Dimensions

128–3072 flexible truncation, low dimensions retain high quality, balance performance vs. storage cost

Prompt-Based Tasks

No more fixed task_type enums — describe task types with natural language prompts for more flexible, precise control

Performance Highlights

Gemini Embedding 2 Preview leads across MTEB benchmarks:
| Dimensions | MTEB English Score | Notes |
|---|---|---|
| 3072 (default) | 68.32 | #1 overall |
| 2048 | 68.16 | Near full-dimension performance |
| 1536 | 68.17 | Suitable replacement for text-embedding-3-large |
| 768 | 67.99 | A quarter of the full-dimension storage, nearly no accuracy loss |
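The storage impact of each dimension setting can be estimated directly: a float32 vector costs 4 bytes per dimension, so index size scales linearly with the dimension count. A quick back-of-the-envelope calculation for a corpus of one million vectors:

```python
# Approximate index size for 1M float32 vectors at each MRL dimension.
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 1_000_000

for dims in (3072, 2048, 1536, 768, 128):
    size_gb = dims * BYTES_PER_FLOAT32 * NUM_VECTORS / 1e9
    print(f"{dims:>5} dims: {size_gb:6.3f} GB")
```

At 768 dimensions the index is a quarter the size of the 3072-dim default, which is why it is often the sweet spot for large corpora.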
Category leads (vs. second place):
| Task Type | Lead |
|---|---|
| Classification | +9.6 points |
| Retrieval | +9.0 points |
| Clustering | +3.7 points |
Data sources: Google official blog (blog.google) and MTEB leaderboard. Gemini Embedding 2 Preview launched March 10, 2026.

Comparison with Previous Models

| Feature | text-embedding-004 | gemini-embedding-001 | gemini-embedding-2-preview |
|---|---|---|---|
| Modality | Text only | Text only | Text/Image/Video/Audio/PDF |
| Max Input | 2048 tokens | 2048 tokens | 8192 tokens |
| Default Dims | 768 | 3072 | 3072 |
| Dim Range | Limited | MRL support | 128–3072 (MRL) |
| Task Config | task_type enum | task_type enum | Prompt-based |
| MTEB English | Lower | Moderate | 68.32 (#1) |
| Languages | Limited | 100+ | 100+ |
Gemini Embedding 2’s vector space is incompatible with previous versions. You cannot mix embeddings from different model versions — migration requires regenerating all embeddings.
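Because vector spaces from different model versions are incompatible, a practical safeguard is to store the generating model's ID alongside each vector and refuse to compare across versions. A minimal sketch (the record layout and guard function are illustrative, not part of any official SDK):

```python
# Each stored record carries the model that produced its embedding,
# so mismatched vectors can be rejected before a similarity search.
def make_record(doc_id: str, embedding: list[float], model: str) -> dict:
    return {"id": doc_id, "embedding": embedding, "model": model}

def check_compatible(record: dict, query_model: str) -> None:
    # Vectors from different embedding models live in different spaces.
    if record["model"] != query_model:
        raise ValueError(
            f"Embedding from {record['model']} is incompatible with "
            f"{query_model}; re-embed the corpus before querying."
        )

rec = make_record("doc-1", [0.1, 0.2], "gemini-embedding-2-preview")
check_compatible(rec, "gemini-embedding-2-preview")  # passes silently
```

A guard like this turns a silent quality regression (comparing vectors from mixed model versions) into an explicit error at query time.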

Multimodal Input Specifications

| Input Type | Limits | Supported Formats |
|---|---|---|
| Text | Max 8192 tokens | Plain text |
| Image | Up to 6 per request | PNG, JPEG |
| Video | Up to 120 seconds | MP4, MOV |
| Audio | Native audio embedding (no transcription) | Common audio formats |
| PDF | Native support | PDF documents |
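The request limits above can be enforced client-side before calling the API. The helper below is a hypothetical pre-flight check based on the published limits (6 images, 120-second video, 8192 text tokens); the token count here is a rough whitespace approximation, not the model's real tokenizer:

```python
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_TEXT_TOKENS = 8192

def validate_request(text: str = "", num_images: int = 0,
                     video_seconds: float = 0) -> None:
    # Rough token estimate; the real limit uses the model tokenizer.
    approx_tokens = len(text.split())
    if approx_tokens > MAX_TEXT_TOKENS:
        raise ValueError(f"text too long: ~{approx_tokens} > {MAX_TEXT_TOKENS} tokens")
    if num_images > MAX_IMAGES:
        raise ValueError(f"too many images: {num_images} > {MAX_IMAGES}")
    if video_seconds > MAX_VIDEO_SECONDS:
        raise ValueError(f"video too long: {video_seconds}s > {MAX_VIDEO_SECONDS}s")

validate_request(text="a short query", num_images=3, video_seconds=90)  # OK
```

Failing fast locally avoids a round trip and an API error for requests that would be rejected anyway.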

Supported Task Types

Gemini Embedding 2 uses prompt-based task descriptions:
| Task | Description |
|---|---|
| Semantic Similarity | Assess semantic similarity between texts |
| Classification | Classify texts by preset labels |
| Clustering | Group texts by similarity |
| Retrieval (Document) | Optimize document-side search embeddings |
| Retrieval (Query) | Optimize query-side search embeddings |
| Code Retrieval | Retrieve code snippets from natural language |
| Question Answering | Generate question embeddings for QA systems |
| Fact Verification | Generate statement embeddings for verification |
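With prompt-based tasks there is no fixed enum to pass; the task is described in natural language. One plausible convention (an assumption here, since the exact wire format may differ by provider) is to prepend the instruction to the text being embedded, keeping query-side and document-side prompts distinct:

```python
def with_task(instruction: str, text: str) -> str:
    # Hypothetical convention: prepend a natural-language task
    # description to the input before embedding it.
    return f"{instruction}: {text}"

query_input = with_task(
    "Retrieve documents that answer this search query",
    "how do neural networks learn",
)
doc_input = with_task(
    "Represent this document for retrieval",
    "Backpropagation adjusts weights by gradient descent.",
)
print(query_input)
```

Using different instructions for the query side and the document side mirrors the Retrieval (Query) / Retrieval (Document) split in the table above.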

Technical Specifications

| Parameter | Gemini Embedding 2 Preview |
|---|---|
| Model ID | gemini-embedding-2-preview |
| Release Date | March 10, 2026 |
| Developer | Google |
| Input Types | Text, Image, Video, Audio, PDF |
| Output | Float vector |
| Default Dimensions | 3072 |
| Dimension Range | 128–3072 (MRL) |
| Max Text Input | 8192 tokens |
| Languages | 100+ |

Practical Applications

  1. Cross-Modal Semantic Search: Search images with text, retrieve documents with images — unified vector space enables mixed retrieval
  2. Multilingual RAG: 100+ languages for building global retrieval-augmented generation systems
  3. Document Intelligence: Embed PDFs directly without preprocessing for document retrieval
  4. Video/Audio Content Retrieval: Native video and audio embedding for media content management
  5. Clustering & Classification: +9.6 classification and +3.7 clustering advantage for large-scale content organization
  6. Code Semantic Search: Query code snippets with natural language to boost developer productivity

Code Examples

Text Embedding

from openai import OpenAI

client = OpenAI(
    api_key="your-apiyi-key",
    base_url="https://api.apiyi.com/v1"
)

response = client.embeddings.create(
    model="gemini-embedding-2-preview",
    input="What are the key features of Google's latest multimodal embedding model?",
    dimensions=768  # Optional: 128–3072
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")

Batch Text Embedding

texts = [
    "Latest trends in artificial intelligence",
    "Machine learning applications in healthcare",
    "How large language models work"
]

response = client.embeddings.create(
    model="gemini-embedding-2-preview",
    input=texts,
    dimensions=1536
)

for i, data in enumerate(response.data):
    print(f"Text {i}: {len(data.embedding)} dimensions")

Semantic Search Example

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Build document embeddings
docs = ["Quantum computing principles", "Deep learning intro", "Blockchain overview"]
doc_resp = client.embeddings.create(
    model="gemini-embedding-2-preview",
    input=docs,
    dimensions=768
)
doc_embeddings = [d.embedding for d in doc_resp.data]

# Query
query_resp = client.embeddings.create(
    model="gemini-embedding-2-preview",
    input="How do neural networks work?",
    dimensions=768
)
query_embedding = query_resp.data[0].embedding

# Calculate similarity
for i, doc_emb in enumerate(doc_embeddings):
    sim = cosine_similarity(query_embedding, doc_emb)
    print(f"{docs[i]}: {sim:.4f}")

Best Practices

  1. Choose the right dimensions: 768-dim offers the best value (67.99 score at roughly a quarter of the full-dimension storage); use 3072-dim for maximum precision
  2. Normalize after truncation: 3072-dim vectors are pre-normalized; smaller dimensions require manual normalization
  3. Use prompt instructions: Differentiate query vs. document side for retrieval to significantly improve results
  4. Don’t mix versions: Incompatible with text-embedding-004 or gemini-embedding-001 vectors — migration requires full rebuild

Pricing and Availability

Pricing

| Input Type | Price (per million tokens) |
|---|---|
| Text | $0.20 |
| Image | $0.45 (~$0.00012/image) |
| Audio | $6.50 (~$0.00016/second) |
| Video | $12.00 (~$0.00079/frame) |
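These per-unit rates translate directly into workload costs. A rough estimate (rates taken from the table above; actual billing may differ) for a job of 10 million text tokens plus 5,000 images:

```python
TEXT_PRICE_PER_M = 0.20     # USD per million text tokens
IMAGE_PRICE_EACH = 0.00012  # USD per image (approximate)

text_tokens = 10_000_000
images = 5_000

# 10M tokens at $0.20/M plus 5,000 images at ~$0.00012 each.
cost = text_tokens / 1_000_000 * TEXT_PRICE_PER_M + images * IMAGE_PRICE_EACH
print(f"${cost:.2f}")  # prints $2.60
```

Even with images included, text volume tends to dominate the bill at these rates.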

Price Comparison

| Model | Text Price/M Tokens | Dimensions | Multimodal |
|---|---|---|---|
| gemini-embedding-2-preview | $0.20 | 3072 | ✅ 5 modalities |
| text-embedding-3-large | $0.13 | 3072 | ❌ Text only |
| text-embedding-3-small | $0.02 | 1536 | ❌ Text only |
Text pricing is slightly higher than OpenAI’s text-embedding-3 series, but Gemini Embedding 2 is the only model supporting unified 5-modality embeddings — no additional models needed for cross-modal retrieval.

Deposit Bonus


APIYI offers deposit bonuses — the more you deposit, the bigger the bonus. Combined with the model’s competitive pricing, your effective cost is even lower.

Available Models

| Model Name | Description |
|---|---|
| gemini-embedding-2-preview | Native multimodal embedding, supports text/image/video/audio/PDF |

How to Access

APIYI Platform:
  • Website: apiyi.com
  • API Endpoint: https://api.apiyi.com/v1
  • Interface: /v1/embeddings (OpenAI-compatible)
  • Works with all OpenAI SDKs

Summary and Recommendations

Gemini Embedding 2 Preview is the most powerful embedding model available today and the industry’s first natively multimodal embedding model. It tops the MTEB English leaderboard while supporting unified vector representations across five modalities, opening entirely new possibilities for cross-modal retrieval.

Core Advantages:
  • Multimodal Unity: Text/image/video/audio/PDF share one vector space — one model for all retrieval
  • Performance Leader: MTEB 68.32 #1, major leads in classification, retrieval, and clustering
  • Flexible Dimensions: MRL supports 128–3072, balance precision vs. cost as needed
  • Extended Input: 8192 tokens, 4x the previous generation
Usage Recommendations:
  1. Cross-modal retrieval: Gemini Embedding 2 is the only choice — and the best one
  2. Text-only + lowest cost: text-embedding-3-small remains the cheapest option
  3. Text-only + high accuracy: Gemini Embedding 2 at 768-dim already surpasses text-embedding-3-large
  4. RAG scenarios: 8192 token input + flexible dimensions, ideal for large document chunking and retrieval
Who Should Use Gemini Embedding 2:
  • Applications requiring cross-modal search (image-to-text, text-to-image, etc.)
  • Developers building multilingual RAG systems
  • Enterprise scenarios processing PDF/video/audio content
  • Retrieval systems seeking the highest embedding quality
Sources: Google official blog (blog.google), Google AI Developer docs (ai.google.dev), MTEB leaderboard. Gemini Embedding 2 Preview launched March 10, 2026. Data retrieved: March 31, 2026.