Open source USearch library jumpstarts ScyllaDB vector search

ScyllaDB, an open source wide-column database, recently added vector search capabilities underpinned by USearch, an open source clustering and vector search library. The addition of vector search lets organizations store vector embeddings alongside structured data attributes in the same ScyllaDB table.
Consequently, ScyllaDB's low latency now extends to vector embeddings for real-time AI applications such as generative models, fraud detection, and feature stores. Organizations are responsible for generating their own embeddings.
The architecture of the vector search capabilities complements that of ScyllaDB, one of the champions of the shard-per-core architecture, which ensures that “the data and the processes that operate on it are always done or executed in the same core,” Szymon Wasik, director of software engineering at ScyllaDB’s R&D Division, told The New Stack. ScyllaDB and ScyllaDB Vector Search run on separate nodes but share the same engine.
Thus, they can scale independently of each other and are optimized for different tasks, yet the vector store gets the same reliability, performance, and database features that ScyllaDB has always delivered. These characteristics include, “for example, backups, replication, checking the currency of the data, and the management of the data,” Wasik said.
As a result, organizations get vector search capabilities that combine the tested low latency of ScyllaDB with a reputable open source vector search library.
USearch with Rust extension
Both USearch and ScyllaDB were developed in C++, which partly contributes to their speed in low latency applications. USearch supports approximate nearest neighbor (ANN) search for vector embedding retrieval, based on the Hierarchical Navigable Small World (HNSW) algorithm. As an open source library, it’s readily embeddable in different applications. Moreover, it’s making strides as one of the resources of choice for implementing vector search.
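To make the term concrete, an approximate nearest neighbor index trades a small amount of accuracy for speed compared with scanning every stored vector. The sketch below is not USearch code and uses no USearch API; it is plain Python showing the exhaustive, exact scan that an ANN index approximates. The function names and sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Return the k keys most similar to `query` by scoring every vector.

    The cost grows linearly with the number of stored vectors, which is
    the work an ANN index avoids by visiting only part of the data.
    """
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:k]]


# Hypothetical two-dimensional embeddings keyed by row ID.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(exact_top_k([1.0, 0.0], vectors, 2))  # ['a', 'b']
```

An ANN index aims to return the same keys for most queries while comparing the query against far fewer vectors.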
“According to the benchmarks performed by USearch developers, it’s 10 times faster in many scenarios than Faiss, developed by Facebook,” Wasik said. USearch can also be extended to support numerous languages, including Python, Java, Golang, and Rust.
In addition to being renowned for its swift performance, Rust offers credible memory management and safe data types. The former is imperative for AI retrieval systems; the latter is an integral means of reinforcing security when comparing different data types. Specifically, safe data types make it “much easier to implement some standard processes or standard operation signals, like operations management or the data between different types,” Wasik explained.
Disk and in-memory
By maintaining respective nodes for ScyllaDB and ScyllaDB Vector Search, Scylla’s solution presents numerous advantages. First, these nodes are optimized for different purposes: database management (in which the priorities are compute and storage) and vector embedding retrieval (in which the priority is oftentimes memory).
“Because we’ve got separate nodes that run the vector indexes, you can select different types of instances that are memory optimized for storing vector data and different for storing data inside Scylla,” Wasik said. With this paradigm, the embeddings are stored on disk in ScyllaDB while the indices are stored in-memory in ScyllaDB Vector Search.
One engine and CDC
ScyllaDB relies on Change Data Capture (CDC) as its primary form of ingestion and as the means of updating the vector indexes with the embeddings in the database. Data for the embeddings is “in two places, because we keep them in ScyllaDB so that they can be stored, backed up; we can provide failovers for the data inside ScyllaDB,” Wasik said. “The other copy is used to index the data in-memory to provide quick access to searching that data.”
ScyllaDB’s data is stored on fast disks (SSDs), which is a more affordable storage method than keeping everything in-memory. Moreover, the fact that there is one engine for both ScyllaDB and ScyllaDB Vector Search means “you have a single API, single management interface, and single cloud offering so everything is integrated inside one engine,” Wasik added.
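The two-copy arrangement Wasik describes can be pictured with a toy model: a durable table that records every write in a change log, and a separate in-memory index that stays current by replaying that log. This is a minimal sketch under stated assumptions, not ScyllaDB's CDC implementation or API; the class names, the log format, and the `catch_up` method are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    """One entry in the (hypothetical) change log."""
    op: str                                  # "upsert" or "delete"
    key: str
    embedding: Optional[List[float]] = None


class Table:
    """Stand-in for the durable table; every write is also logged."""

    def __init__(self):
        self.rows: Dict[str, List[float]] = {}
        self.change_log: List[Change] = []

    def upsert(self, key, embedding):
        self.rows[key] = embedding
        self.change_log.append(Change("upsert", key, embedding))

    def delete(self, key):
        self.rows.pop(key, None)
        self.change_log.append(Change("delete", key))


class InMemoryIndex:
    """Stand-in for the vector index node; it only reads the change log."""

    def __init__(self):
        self.vectors: Dict[str, List[float]] = {}
        self.position = 0  # offset of the next unread change

    def catch_up(self, change_log):
        # Replay only the changes that arrived since the last call.
        for change in change_log[self.position:]:
            if change.op == "upsert":
                self.vectors[change.key] = change.embedding
            else:
                self.vectors.pop(change.key, None)
        self.position = len(change_log)


table = Table()
index = InMemoryIndex()
table.upsert("doc1", [0.1, 0.2])
table.upsert("doc2", [0.3, 0.4])
table.delete("doc1")
index.catch_up(table.change_log)
print(index.vectors)  # {'doc2': [0.3, 0.4]}
```

Because the index is derived entirely from the log, it can be rebuilt from the durable copy after a failure, which is the property the two-copy design relies on.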
Quantization and cost
Although separating the nodes for the vector indices and the vector embeddings delivers benefits, this approach is not without its inefficiencies. Preserving vector indices in-memory indefinitely can add to costs, as does replicating copies of data between ScyllaDB and ScyllaDB Vector Search. The good news is that ScyllaDB “has been optimized for very fast access to the data stored on disk because it has very good indexes, efficient caches, and efficient data organization,” Wasik noted.
In the near future, however, ScyllaDB will be looking to add quantization methods to reduce the costs of storing vector indices. At present, the vendor is exploring options for binary quantization and scalar quantization, as well as half precision quantization.
“It’s less commonly used, but we’ll offer it,” Wasik said about half precision quantization. “You can think of it like smaller floats, like 16-bit floats.” In the second half of the year, ScyllaDB intends to implement storage tiering to cost-effective object storage. Because of how the vendor’s vector search capabilities are architected, organizations will be able to tier both the vector indexes and their embeddings to, say, S3, or some other form of object storage.
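The storage savings behind the three quantization options named above can be shown with a few lines of Python. This is an illustrative sketch only: it does not reflect how ScyllaDB or USearch implement quantization, and the function names, the symmetric int8 scaling, and the sign-based binary encoding are assumptions chosen for simplicity.

```python
import struct


def to_half_precision(vector):
    """Pack each value as a 16-bit float (struct format 'e')."""
    return struct.pack(f"<{len(vector)}e", *vector)


def to_scalar_int8(vector, max_abs=1.0):
    """Scalar quantization: map [-max_abs, max_abs] onto integers in
    [-127, 127], one byte per dimension. Values outside the range are clipped."""
    scale = 127.0 / max_abs
    clipped = (min(max(x, -max_abs), max_abs) for x in vector)
    return struct.pack(f"<{len(vector)}b", *(round(x * scale) for x in clipped))


def to_binary(vector):
    """Binary quantization: keep only whether each value is positive,
    packing eight dimensions into each byte."""
    out = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


vector = [0.5, -0.25, 0.75, -1.0, 0.1, 0.0, -0.6, 0.9]
full = struct.pack(f"<{len(vector)}f", *vector)  # 32-bit floats as the baseline

print(len(full), len(to_half_precision(vector)),
      len(to_scalar_int8(vector)), len(to_binary(vector)))
# 32 16 8 1
```

For this eight-dimensional example the encodings take 32, 16, 8, and 1 bytes respectively. Each step down saves memory at the cost of precision, which is why the choice depends on how much retrieval accuracy an application can give up.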
Coupling vector embeddings with structured data
ScyllaDB’s vector search capabilities will likely arouse developer interest for multiple reasons. Both the core database and its vector search functionality are open source. USearch, which underlies the latter, is gaining credence as one of the libraries of choice for vector retrieval. Each of these constructs is designed for real-time applications that operate at the pace of contemporary AI.
The biggest gain, however, may be the fact that this architecture effectively couples embeddings and structured data assets in the same table. According to Wasik, “It provides very low latency infrastructure for storing both structured and unstructured data. We can store large amounts of structured data and, together with it, we can store vector embeddings encoded as vectors within our vector indexes.”
The post Open source USearch library jumpstarts ScyllaDB vector search appeared first on The New Stack.
