What is an Embedding Model?
An embedding model converts text into a high-dimensional numeric vector. This vector captures the semantic meaning of the text, so that similar texts will have similar vectors—even if they use different words.
For example:
“What is your return policy?” and “How do I send an item back?” produce vectors that lie close together in the embedding space.
These embedding vectors are essential for effective semantic search, because they allow the system to understand meaning, not just keywords.
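For instance, the following minimal sketch embeds the two questions above and confirms their vectors are close. It assumes the sentence-transformers package is installed and uses all-MiniLM-L6-v2, one of the models from the comparison table below:

```python
# Minimal sketch: embed two paraphrases and measure how close they are.
# Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

embeddings = model.encode([
    "What is your return policy?",
    "How do I send an item back?",
])

print(embeddings.shape)                            # (2, 384): one vector per sentence
print(util.cos_sim(embeddings[0], embeddings[1]))  # high score despite no shared keywords
```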
How Does Vector Similarity Work?
When a user inputs a query, it is converted into a vector using the same embedding model as your documents. The system then computes the similarity between the query vector and all indexed vectors.
There are different similarity computation methods:
Cosine Similarity (default and most common): Measures the angle between two vectors. Values range from -1 to 1. A score close to 1 means high similarity.
Dot Product: Measures the projection of one vector onto another. Sensitive to both direction and magnitude.
Hamming Distance: Measures how many components differ between two binary vectors. Less common for text embeddings.
You can choose the similarity metric when you index the document collection.
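All three metrics are simple to compute directly. The sketch below uses plain NumPy on made-up toy vectors to show each one:

```python
import numpy as np

q = np.array([0.2, 0.8, 0.1])   # toy query vector
d = np.array([0.25, 0.7, 0.0])  # toy document vector

# Cosine similarity: only the angle matters; result is in [-1, 1].
cosine = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

# Dot product: sensitive to both direction and magnitude.
dot = np.dot(q, d)

# Hamming distance: number of differing components in binary vectors.
qb = np.array([1, 0, 1, 1])
db = np.array([1, 1, 1, 0])
hamming = int(np.count_nonzero(qb != db))

print(f"cosine={cosine:.3f}  dot={dot:.3f}  hamming={hamming}")
```

Note that for vectors normalized to unit length, cosine similarity and dot product produce the same ranking, which is why many indexes normalize embeddings up front.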
What is a Token?
Tokens are the units of text processed by models. A token could be a word, part of a word, or even just punctuation.
“unbelievable” is a single word but typically splits into roughly 3–4 tokens, depending on the tokenizer.
“Hi!” is about 2 tokens: “Hi” plus the exclamation mark.
Knowing token limits helps you select embedding models and chunking strategies appropriately. For example, if your model has a 512-token limit, ensure each chunk doesn’t exceed that.
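A quick way to check is to run each chunk through the tokenizer of the model you plan to use. The sketch below assumes the Hugging Face transformers library and the all-MiniLM-L6-v2 tokenizer; the 256-token limit matches that model's row in the table below:

```python
# Sketch: verify chunks fit the embedding model's token limit.
# Assumes `pip install transformers`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
MAX_TOKENS = 256  # this model's limit; see the table below

def fits(chunk: str) -> bool:
    """True if the chunk (including special tokens) stays within the limit."""
    return len(tokenizer.encode(chunk)) <= MAX_TOKENS

print(len(tokenizer.encode("unbelievable")))  # token count exceeds word count
print(fits("Hi!"))                            # True
```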
How to Choose a Good Embedding Model
When selecting a model, consider:
Language support: Choose multilingual models if you handle data in different languages.
Task match: Some models are optimized for sentence similarity, others for retrieval or document search.
Token capacity: Models with higher token limits can embed longer chunks without truncation.
Embedding size: Larger vectors (e.g. 1024 or 1536 dimensions) can capture more semantic nuance but cost more to compute, store, and search.
Refer to the embedding model table below for comparison across providers.
| Provider | Model Name | Vector Dimensions | Max Input (tokens) | Language Support | Tasks |
| --- | --- | --- | --- | --- | --- |
| Sentence Transformer | paraphrase-multilingual-mpnet-base-v2 | 768 | 512 | Multilingual | Semantic search, clustering, sentence similarity |
| Sentence Transformer | paraphrase-multilingual-MiniLM-L12-v2 | 384 | 256 | Multilingual | Paraphrase detection, semantic search |
| Sentence Transformer | paraphrase-albert-small-v2 | 768 | 128 | English | Paraphrase detection, semantic similarity |
| Sentence Transformer | paraphrase-MiniLM-L3-v2 | 384 | 256 | English | Paraphrase detection, semantic search |
| Sentence Transformer | multi-qa-mpnet-base-dot-v1 | 768 | 512 | English | Question answering, information retrieval |
| Sentence Transformer | multi-qa-distilbert-cos-v1 | 768 | 512 | English | Question answering, semantic search |
| Sentence Transformer | multi-qa-MiniLM-L6-cos-v1 | 384 | 256 | English | Question answering, semantic search |
| Sentence Transformer | distiluse-base-multilingual-cased-v2 | 512 | 512 | Multilingual | Semantic search, sentence similarity |
| Sentence Transformer | distiluse-base-multilingual-cased-v1 | 512 | 512 | Multilingual | Semantic search, sentence similarity |
| Sentence Transformer | all-mpnet-base-v2 | 768 | 512 | English | Semantic search, clustering, sentence similarity |
| Sentence Transformer | all-distilroberta-v1 | 768 | 512 | English | Semantic search, sentence similarity |
| Sentence Transformer | all-MiniLM-L6-v2 | 384 | 256 | English | Semantic search, sentence similarity |
| Sentence Transformer | all-MiniLM-L12-v2 | 384 | 256 | English | Semantic search, sentence similarity |
| Hugging Face | BAAI/bge-m3 | 1024 | 8192 | Multilingual | Retrieval-augmented generation, semantic search |
| Hugging Face | BAAI/llm-embedder | 1024 | 512 | English | Language model embedding, retrieval tasks |
| Hugging Face | BAAI/bge-large-en-v1.5 | 1024 | 512 | English | Retrieval-augmented generation, semantic search |
| Hugging Face | BAAI/bge-base-en-v1.5 | 768 | 512 | English | Retrieval-augmented generation, semantic search |
| Hugging Face | BAAI/bge-small-en-v1.5 | 384 | 512 | English | Retrieval-augmented generation, semantic search |
| Hugging Face | BAAI/bge-large-zh-v1.5 | 1024 | 512 | Chinese | Retrieval-augmented generation, semantic search |
| Hugging Face | BAAI/bge-base-zh-v1.5 | 768 | 512 | Chinese | Retrieval-augmented generation, semantic search |
| Hugging Face | BAAI/bge-small-zh-v1.5 | 384 | 512 | Chinese | Retrieval-augmented generation, semantic search |
| Hugging Face | DMetaSoul/Dmeta-embedding-zh | 768 | 512 | Chinese | Semantic search, sentence similarity |
| Hugging Face | shibing624/text2vec-base-chinese | 768 | 512 | Chinese | Sentence/document embeddings, semantic similarity |
| Hugging Face | sentence-transformers/sentence-t5-large | 768 | 512 | English | Sentence embeddings, semantic similarity |
| Hugging Face | sentence-transformers/mpnet | 768 | 512 | Multilingual | General text embeddings, semantic search |
| Hugging Face | jinaai/jina-colbert-v2 | 768 | 512 | Multilingual | Late interaction retrieval, semantic search |
| Hugging Face | jinaai/jina-embeddings-v3 | 1024 | 8192 | Multilingual | Long-context retrieval, semantic search |
| Hugging Face | jinaai/jina-embeddings-v2-base-zh | 768 | 8192 | Chinese, English | Bilingual embeddings, semantic search |
| Hugging Face | openbmb/MiniCPM-Embedding | 1024 | 512 | Chinese, English | Retrieval tasks, semantic search |
| Hugging Face | maidalun1020/bce-embedding-base_v1 | 768 | 512 | Chinese, English | Sentence embeddings, semantic similarity |
| OpenAI | text-embedding-ada-002 | 1536 | 8192 | Multilingual | General text embeddings, semantic search |
| OpenAI | text-embedding-3-small | 1536 | 8192 | Multilingual | General text embeddings, semantic search |
| OpenAI | text-embedding-3-large | 3072 | 8192 | Multilingual | High-quality embeddings, cross-lingual tasks |
| Cohere | embed-multilingual-v3.0 | 1024 | 512 | Multilingual | Semantic search, retrieval-augmented generation |
| Cohere | embed-english-light-v3.0 | 384 | 512 | English | Semantic search, retrieval-augmented generation |
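Once you have shortlisted a model from the table, you can confirm its embedding size and token limit programmatically before building an index. A small sketch, again assuming sentence-transformers and the all-MiniLM-L6-v2 model:

```python
# Sketch: inspect a shortlisted model's properties before indexing.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

print(model.get_sentence_embedding_dimension())  # embedding size, e.g. 384
print(model.max_seq_length)                      # token limit per input
```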