Databricks Mosaic AI Vector Search

Oct 30, 2024

Mosaic AI Vector Search is Databricks’ latest vector database solution, poised to change not only the architecture of your generative AI applications but also the way we think about vector databases.

To understand Mosaic AI Vector Search and how it could transform your applications, we first need to understand the core concepts surrounding vector databases.

Vector Embeddings

Embeddings are a specific type of vector that encodes complex, high-dimensional data (e.g., text, images, or items in a recommendation system) into a continuous, lower-dimensional vector space in order to capture its semantic or contextual meaning. In layman’s terms, an embedding converts complex data into a list of numbers that helps our systems understand and compare the characteristics of those objects.
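To make the idea of comparing objects through their numbers concrete, here is a minimal sketch. The three-dimensional vectors are hand-written, hypothetical values (a real embedding model would produce hundreds or thousands of dimensions), and cosine similarity is just one common choice of comparison metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, written by hand for illustration only.
embeddings = {
    "pineapple": [0.9, 0.1, 0.3],
    "mango": [0.8, 0.2, 0.4],
    "bicycle": [0.1, 0.9, 0.2],
}

fruit_score = cosine_similarity(embeddings["pineapple"], embeddings["mango"])
other_score = cosine_similarity(embeddings["pineapple"], embeddings["bicycle"])

print(f"pineapple vs mango:   {fruit_score:.3f}")
print(f"pineapple vs bicycle: {other_score:.3f}")
```

The two fruits score much closer to 1.0 than the fruit and the bicycle do, which is the sense in which embeddings let a system compare the characteristics of objects.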

Vector Index

In production-scale databases with millions of entries, each having an embedding associated with it, the brute force approach of searching through all the entries would be highly inefficient and costly. The compute resources would have to scan through all the records to generate results.
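The brute-force approach can be sketched in a few lines. The row ids and two-dimensional vectors below are hypothetical; the point is only that every query computes one distance per stored row.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, k=1):
    """Return the k closest (id, distance) pairs and the number of rows scanned."""
    scored = [(row_id, l2_distance(query, vec)) for row_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k], len(scored)

# Hypothetical table of embeddings.
table = {
    "row_1": [1.0, 1.0],
    "row_2": [5.0, 5.0],
    "row_3": [1.2, 0.9],
    "row_4": [9.0, 1.0],
}

results, rows_scanned = brute_force_search([1.1, 1.0], table, k=2)
print([row_id for row_id, _ in results], "rows scanned:", rows_scanned)
```

The number of rows scanned always equals the table size, so the cost of each query grows linearly with the number of entries.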

A Vector Index is a specialized data structure designed to efficiently organize and search through these embeddings (i.e., the numerical representation of data). Instead of arranging the vectors randomly, vector indexes employ advanced algorithms to structure them systematically, enabling faster and more efficient searches. Approximate Nearest Neighbor (ANN) search techniques take advantage of this structure, in which similar vectors sit close together, to retrieve the most relevant results quickly without scanning every entry.

Let’s take an example. In the diagram above, assume that:

Red Dots = Photos of Dragon fruits
Yellow Dots = Photos of Pineapples
Blue Dots = Photos of Blueberries

(Not the actual photos but the embeddings associated with those photos)

In the database without vector indexing, the rows are stored in no particular order. To find images similar to our input pineapple, we would have to go through the photos of blueberries and dragon fruits as well, which is highly inefficient.

In the database with vector indexing enabled, the rows are grouped together based on their embeddings using indexing algorithms. Photos of pineapples (which naturally have similar embedding values) are stored together in one bucket, and the photos of dragon fruits and blueberries are stored in their own respective buckets. When we need to generate results, we can skip the data we do not need and go straight to the relevant section, which makes result generation highly efficient.
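The bucket idea from the fruit example can be sketched as a toy index. Everything here is an illustrative assumption: the class name `BucketIndex`, the two-dimensional vectors, and the hand-picked bucket centers (a real index would learn its structure from the data). It is not how any particular product implements indexing.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class BucketIndex:
    """Toy index: each vector is stored in the bucket whose center is nearest."""

    def __init__(self, centers):
        self.centers = centers                      # bucket name -> center vector
        self.buckets = {name: [] for name in centers}

    def _nearest_bucket(self, vector):
        return min(self.centers, key=lambda name: l2_distance(vector, self.centers[name]))

    def add(self, item_id, vector):
        self.buckets[self._nearest_bucket(vector)].append((item_id, vector))

    def search(self, query, k=1):
        """Scan only the bucket nearest to the query; return ids and rows scanned."""
        bucket = self.buckets[self._nearest_bucket(query)]
        ranked = sorted(bucket, key=lambda item: l2_distance(query, item[1]))
        return [item_id for item_id, _ in ranked[:k]], len(bucket)

# Hypothetical bucket centers and photo embeddings.
index = BucketIndex({"dragon_fruit": [0.0, 0.0], "pineapple": [10.0, 0.0], "blueberry": [0.0, 10.0]})
photos = {
    "dragon_1": [0.5, 0.2], "dragon_2": [-0.3, 0.4],
    "pine_1": [9.8, 0.1], "pine_2": [10.3, -0.2], "pine_3": [9.5, 0.6],
    "blue_1": [0.2, 9.9], "blue_2": [-0.1, 10.4],
}
for photo_id, vector in photos.items():
    index.add(photo_id, vector)

matches, rows_scanned = index.search([10.0, 0.2], k=2)
print(matches, f"scanned {rows_scanned} of {len(photos)} rows")
```

Only the pineapple bucket is scanned. The trade-off is that a true nearest neighbor sitting just across a bucket boundary would be missed, which is why this family of techniques is called approximate.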

Vector Databases

A vector database is a specialized type of database optimized to store, manage, and retrieve numerical representations of data, also known as vector embeddings. It uses advanced indexing techniques to group semantically similar data (based on embedding values), which accelerates search operations and reduces the need for brute-force scans.

In traditional databases, queries typically retrieve rows where the values exactly match the query. In contrast, vector databases use similarity metrics to find vectors that are most similar to the query, rather than requiring an exact match.

Mosaic AI Vector Search

Mosaic AI Vector Search is a vector database built into the Databricks Data Intelligence Platform and integrated with its governance and productivity tools. Vector Search leverages vector embeddings to perform efficient and scalable searches across large datasets.

Key advantages of using Vector Search

This is where Mosaic AI Vector Search shows its true potential over run-of-the-mill vector databases.

Highly performant and scalable

Auto-scales with zero downtime
Can scale to hundreds of millions of vectors
Optimized for high performance at low cost
Provides best-in-class retrieval quality compared to other out-of-the-box solutions

Automated data ingestion

Vector Search makes it possible to synchronize any Delta table into a vector index with one click. There is no need for complex, custom-built data ingestion or sync pipelines.

Flexibility when it comes to providing vector embeddings

Delta Sync Index with embeddings computed by Databricks: You provide a Delta table that contains data in text format, and Databricks calculates the embeddings using a model that you specify.
Delta Sync Index with self-managed embeddings: You provide a Delta table that contains pre-calculated embeddings. As the Delta table is updated, the index stays synced with it.
Direct Vector Access Index: You must manually update the index using the REST API when the embeddings table changes.
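For the two Delta Sync options, the difference mostly comes down to which arguments you pass when creating the index. The sketch below assumes the `databricks-vectorsearch` Python package and its `VectorSearchClient.create_delta_sync_index` method as we understand them at the time of writing; all endpoint, table, column, and model names are hypothetical placeholders, and the parameter names should be checked against the current Databricks documentation before use.

```python
def managed_embeddings_args(endpoint, source_table, index_name, text_column, model_endpoint):
    """Arguments for a Delta Sync Index whose embeddings are computed by Databricks."""
    return {
        "endpoint_name": endpoint,
        "source_table_name": source_table,
        "index_name": index_name,
        "pipeline_type": "TRIGGERED",       # sync on demand rather than continuously
        "primary_key": "id",                # assumed primary key column name
        "embedding_source_column": text_column,
        "embedding_model_endpoint_name": model_endpoint,
    }

def self_managed_embeddings_args(endpoint, source_table, index_name, vector_column, dimension):
    """Arguments for a Delta Sync Index over embeddings you computed yourself."""
    return {
        "endpoint_name": endpoint,
        "source_table_name": source_table,
        "index_name": index_name,
        "pipeline_type": "TRIGGERED",
        "primary_key": "id",
        "embedding_vector_column": vector_column,
        "embedding_dimension": dimension,
    }

def connect():
    """Create a real client; needs `pip install databricks-vectorsearch` and workspace access."""
    from databricks.vector_search.client import VectorSearchClient
    return VectorSearchClient()

def create_index(client, args):
    """Create the index with whichever argument set was chosen."""
    return client.create_delta_sync_index(**args)

# Hypothetical names, for illustration only.
managed = managed_embeddings_args(
    "demo_endpoint", "main.docs.chunks", "main.docs.chunks_index",
    text_column="text", model_endpoint="my-embedding-endpoint",
)
self_managed = self_managed_embeddings_args(
    "demo_endpoint", "main.docs.chunks_embedded", "main.docs.chunks_embedded_index",
    vector_column="text_vector", dimension=1024,
)
```

In a workspace, `create_index(connect(), managed)` would create the index. Nothing in the sketch contacts Databricks until `connect()` is called.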

Ability to choose embedding model

If you choose to have Databricks compute the embeddings, you can use a pre-configured Foundation Model APIs endpoint or create a model serving endpoint to serve the embedding model of your choice.

Uses advanced indexing algorithms

Mosaic AI Vector Search uses the HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest neighbor search. Other well-known indexing approaches in the vector database space include LSH (Locality Sensitive Hashing) and IVF (Inverted File Index).
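To give a feel for graph-based indexing, here is a deliberately simplified sketch of the greedy graph walk that HNSW builds on. It is not HNSW itself: there is a single layer, the neighbor graph is built by brute force, and the one-dimensional points are hypothetical.

```python
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_knn_graph(points, k):
    """Link every point to its k nearest neighbors (ties broken by index)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (distance(p, points[j]), j),
        )
        graph[i] = others[:k]
    return graph

def greedy_search(points, graph, query, entry=0):
    """From the entry point, keep hopping to the neighbor closest to the query."""
    current = entry
    path = [current]
    while True:
        best = min(graph[current], key=lambda j: distance(query, points[j]))
        if distance(query, points[best]) >= distance(query, points[current]):
            return current, path    # no neighbor is closer: stop here
        current = best
        path.append(current)

# Hypothetical data: ten points on a line at positions 0.0 through 9.0.
points = [[float(i)] for i in range(10)]
graph = build_knn_graph(points, k=2)
nearest, path = greedy_search(points, graph, query=[7.2])
print("nearest index:", nearest, "hops:", path)
```

The walk reaches the right point but needs many short hops to get there. HNSW adds sparser upper layers that allow long jumps first and fine-grained steps last, which is where its speed comes from.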

Supports multiple data types

The platform can handle various types of data, including text, images, audio, and more, by converting them into vector embeddings.

Support for hybrid keyword similarity search

Hybrid search combines vector-based embedding search with traditional keyword-based search techniques. This approach matches exact words in the query while also using vector-based similarity search to capture the semantic relationships and context of the query.
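One common way to merge a keyword ranking with a vector-similarity ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration with hypothetical document ids; it is not meant as a description of how Vector Search combines the two result sets internally.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; k dampens the influence of top ranks."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hypothetical results for the same query from the two retrieval methods.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]   # exact word matches
vector_ranking = ["doc_b", "doc_c", "doc_a"]    # semantic similarity

hybrid_ranking = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(hybrid_ranking)
```

A document that ranks well in both lists rises to the top, even if it is not first in either one.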

Built-in governance

The unified interface defines policies on data, with fine-grained access control on embeddings. Through its built-in integration with Unity Catalog, Vector Search tracks data lineage automatically, without the need for additional tools or security policies. This helps ensure that LLMs do not expose confidential data to users who should not have access.

Mosaic AI Vector Search as an enterprise-level solution

Let’s explore how Mosaic AI Vector Search, combined with Databricks’ native features, can simplify the process of building, deploying, and governing enterprise-level solutions. Take, for example, a Retrieval-Augmented Generation (RAG) application, which heavily relies on vector databases.

One of the biggest challenges today is achieving production-level quality for generative AI applications and successfully deploying them. Additionally, organizations face hurdles such as optimizing storage and compute costs, managing embeddings, ensuring governance and security, scaling solutions, and maintaining costly pipelines.

In a traditional architecture for a RAG application, depending on the technologies you choose, you might encounter some or all of the challenges mentioned above. However, by leveraging Mosaic AI Vector Search alongside Databricks’ native features, we can address these issues seamlessly.

Unity Catalog Volumes can be used to store raw data efficiently.
Databricks Serverless Compute offers a fully-managed, auto-scaling solution that handles all your compute needs efficiently, providing high performance and cost-effectiveness without the need for manual infrastructure management.
Databricks notebooks can be used to develop and implement chunking logic, with the results stored in Unity Catalog Delta tables. This enables high performance and scalability, along with advanced features like time travel.
Embedding models can be hosted using Databricks Model Serving, streamlining the deployment process.
Delta Sync Index can automate the end-to-end management of embeddings, ensuring efficient and continuous updates.
Unity Catalog provides a range of powerful features, including centralized metadata management, fine-grained access controls, and data lineage tracking, all of which significantly streamline data governance and enhance security.

This approach eliminates the typical pain points and simplifies the development and production lifecycle for RAG applications.
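As an illustration of the chunking step in this architecture, here is a minimal sketch of word-based chunking with overlap. The function name, chunk size, and overlap are assumptions chosen for the example; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                   # this chunk already reached the end of the text
    return chunks

# Hypothetical document of twelve "words": w0 ... w11.
document = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(document, chunk_size=5, overlap=2)

# Rows shaped like what a notebook might write to a Delta table (id + text).
rows = [{"id": i, "text": chunk} for i, chunk in enumerate(chunks)]
for row in rows:
    print(row)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some words twice.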

Vector Search use cases

Recommendation Systems

E-commerce platforms can use Vector Search to provide product recommendations by representing the product you are browsing as a vector and quickly searching the database for the items with the most similar embedding values. For instance, a platform can suggest similar items of clothing based on the user’s current product page.

Image & Video Search

Vector Search can enable visual search features where users upload an image and retrieve similar images from a database by finding the closest matches. For example, in facial recognition software, we can upload a photo of a person and the system retrieves similar-looking people from the database.

LLMs & Generative AI

LLMs, like those powering ChatGPT, rely on contextual analysis of text enabled by vector representations. By mapping words, sentences, and ideas as vectors in high-dimensional space, these models can capture the relationships between them, allowing for a deeper understanding of natural language and the ability to generate coherent, contextually relevant text.

As you can see, Mosaic AI Vector Search offers a wide range of applications that can significantly enhance your business operations. Whether it’s improving product recommendations, enabling advanced visual search features, or powering generative AI models, the potential is vast.

Fractal is a Databricks partner, and we have the right expertise to help you leverage the power of Mosaic AI Vector Search. Whether you’re looking to enhance your generative AI applications, optimize your data storage and compute costs, or streamline your data governance and security, our team of experts is here to assist you.

Contact us today to learn how we can help you implement and maximize the benefits of Databricks’ advanced vector database solutions.