
Milvus is a vector database that has long focused on embedding-based vector search, providing accurate, fast, and highly scalable semantic search for applications such as RAG. As the era of large models has prompted exploration of new applications, the community has rediscovered the value of combining traditional exact text matching with semantic search in a hybrid setup, especially in scenarios that rely heavily on keyword matching. To meet this demand, Milvus 2.5 introduces full-text search (FTS) and combines it with the sparse vector search supported since version 2.4 and with hybrid search, so that the three capabilities reinforce one another.
Hybrid search is a method that integrates the results of multiple searches. Users can search different fields of the data in different ways, then merge and re-rank the results to obtain a single combined ranking. In today's popular RAG scenarios, a typical hybrid approach combines semantic search with lexical search: the candidates recalled by dense embeddings are merged with those returned by the lexical-matching BM25 algorithm, using RRF (Reciprocal Rank Fusion) to produce a better final ranking.
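The RRF merge described above can be sketched in a few lines of Python. This is a minimal illustration, not the Milvus implementation: the function name rrf_fuse, the document IDs, and the constant k = 60 (a commonly used default) are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k = 60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID so the
    # output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document IDs standing in for retrieved code chunks.
dense_hits = ["cpp_header", "cpp_simple", "cpp_event"]
sparse_hits = ["rust_logger", "long_string_a", "long_string_b"]

fused = rrf_fuse([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.6f}")
```

With k = 60, a chunk ranked first in only one of the two lists scores 1/61 (about 0.01639), a chunk ranked second scores 1/62 (about 0.01613), and so on. The hybrid scores listed in the cases in this article appear consistent with that arithmetic, although the exact ranker settings used are not stated here.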
In this article, we demonstrate hybrid search on a RAG dataset provided by Anthropic. The dataset pairs text queries with code snippets from nine code repositories, which resembles the now-popular AI-assisted programming scenario. Because code contains many definitions, keywords, and other identifiers, text-based search can be especially helpful here. At the same time, a dense embedding model trained on a large amount of code can capture some higher-level semantic information. Our experiments aim to show what happens when the two are combined.
To build a more concrete understanding of hybrid search, we sampled specific cases for analysis. As the baseline, we used an advanced dense embedding model (voyage-2) trained on a large amount of code, and we selected cases where the top 5 hybrid results beat the top 5 dense or sparse results, to see what characteristics lie behind them.

In addition to this fine-grained qualitative analysis of individual cases, we ran an overall quantitative evaluation using the Pass@5 metric on the dataset. This metric measures the proportion of relevant results that are successfully retrieved within the top 5 results for each query. The results show that the advanced embedding model alone sets a good baseline, and that combining it with full-text search improves on it further. Inspecting the BM25 results and tuning parameters for the specific scenario can yield even greater improvements.
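To make the metric concrete, here is a minimal sketch of how a Pass@k score could be computed. It assumes that every query comes with a non-empty list of relevant ("golden") chunk IDs and that the per-query score is the fraction of those chunks found in the top k. The function name, the data layout, and the toy IDs are hypothetical, and the evaluation code behind this article may differ in detail.

```python
def pass_at_k(retrieved, golden, k=5):
    """Average, over queries, of the fraction of golden chunks in the top k.

    retrieved: dict mapping query ID -> ranked list of chunk IDs
    golden:    dict mapping query ID -> non-empty list of relevant chunk IDs
    """
    per_query = []
    for query_id, golden_ids in golden.items():
        top_k = set(retrieved.get(query_id, [])[:k])
        found = sum(1 for chunk_id in golden_ids if chunk_id in top_k)
        per_query.append(found / len(golden_ids))
    return sum(per_query) / len(per_query)


# Toy data: q1 has two golden chunks, one of which is ranked sixth.
retrieved = {
    "q1": ["c7", "c2", "c9", "c4", "c1", "c3"],
    "q2": ["c5", "c6", "c8", "c0", "c2"],
}
golden = {"q1": ["c2", "c3"], "q2": ["c8"]}

score = pass_at_k(retrieved, golden, k=5)
print(score)  # 0.75: q1 contributes 0.5, q2 contributes 1.0
```

The same function can be run on dense, sparse, and hybrid result lists to compare the three retrieval modes on equal terms.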
01.
Case 1: Hybrid Search Outperforms Semantic Search
Question: How is the log file created?
This question asks how a log file is created; the correct answer is a piece of Rust code that creates one. The semantic search results include a log header file and C++ code for obtaining a logger, but the key to this question is the variable “logfile”. The correct chunk appears as #hybrid 0 in the hybrid search results. Because hybrid search merges the semantic and full-text results, and this chunk is absent from the semantic results, it must have come from full-text search.
Alongside it, #hybrid 2 and #hybrid 4 contain seemingly unrelated test mock code, notably the repeated line “long string to test how those are handled.” To see why, consider how the full-text search algorithm BM25 works. BM25 gives more weight to matches on low-frequency words, because high-frequency words are too common to help distinguish one document from another. In a large body of natural-language text, “how” is a very common word and therefore contributes little to the relevance score. This corpus, however, consists of code, in which “how” rarely appears, so chunks containing it receive a high weight and are retrieved prominently.
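The effect of term rarity can be illustrated with the inverse document frequency (IDF) part of BM25. The sketch below uses a commonly cited IDF variant, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of chunks and n the number of chunks containing the term. The exact formula, tokenizer, and parameters used by Milvus are not stated in this article, so treat all of these, along with the five toy chunks, as assumptions.

```python
import math
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def idf(term, docs):
    """BM25-style inverse document frequency of a term over a list of documents."""
    total = len(docs)
    containing = sum(1 for doc in docs if term in set(tokenize(doc)))
    return math.log(1 + (total - containing + 0.5) / (containing + 0.5))


# A toy "code corpus": the word "logger" is frequent, "how" is rare.
code_chunks = [
    "LoggerPtr logger = Logger::getLogger(name);",
    "logger->setLevel(Level::getError());",
    "#include <log4cxx/logger.h>",
    "let logfile = File::create(args.log_file())",
    '"long string to test how those are handled."',
]

for term in ("how", "logfile", "logger"):
    print(term, round(idf(term, code_chunks), 3))
```

In this toy corpus, “how” occurs in one chunk out of five and gets the same IDF as “logfile” (about 1.386), while “logger” occurs in three chunks and gets a much lower one (about 0.539). A query word that a person would ignore can therefore carry as much weight as the identifier that actually matters.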
GroundTruth
use {
crate::args::LogArgs,
anyhow::{anyhow, Result},
simplelog::{Config, LevelFilter, WriteLogger},
std::fs::File,
};
pub struct Logger;
impl Logger {
pub fn init(args: &impl LogArgs) -> Result<()> {
let filter: LevelFilter = args.log_level().into();
if filter != LevelFilter::Off {
let logfile = File::create(args.log_file())
.map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
WriteLogger::init(filter, Config::default(), logfile)
.map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
}
Ok(())
}
}
Semantic Search Result:
##dense 0 0.7745316028594971
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "logunit.h"
#include <log4cxx/logger.h>
#include <log4cxx/simplelayout.h>
#include <log4cxx/fileappender.h>
#include <log4cxx/helpers/absolutetimedateformat.h>
##dense 1 0.769859254360199
void simple()
{
LayoutPtr layout = LayoutPtr(new SimpleLayout());
AppenderPtr appender = FileAppenderPtr(new FileAppender(layout, LOG4CXX_STR("output/simple"), false));
root->addAppender(appender);
common();
LOGUNIT_ASSERT(Compare::compare(LOG4CXX_FILE("output/simple"), LOG4CXX_FILE("witness/simple")));
}
std::string createMessage(int i, Pool & pool)
{
std::string msg("Message ");
msg.append(pool.itoa(i));
return msg;
}
void common()
{
int i = 0;
// In the lines below, the logger names are chosen as an aid in
// remembering their level values. In general, the logger names
// have no bearing to level values.
LoggerPtr ERRlogger = Logger::getLogger(LOG4CXX_TEST_STR("ERR"));
ERRlogger->setLevel(Level::getError());
##dense 2 0.7591114044189453
log4cxx::spi::LoggingEventPtr logEvt = std::make_shared<log4cxx::spi::LoggingEvent>(LOG4CXX_STR("foo"),
Level::getInfo(),
LOG4CXX_STR("A Message"),
log4cxx::spi::LocationInfo::getLocationUnavailable());
FMTLayout layout(LOG4CXX_STR("{d:%Y-%m-%d %H:%M:%S} {message}"));
LogString output;
log4cxx::helpers::Pool pool;
layout.format( output, logEvt, pool);
##dense 3 0.7562235593795776
#include "util/compare.h"
#include "util/transformer.h"
#include "util/absolutedateandtimefilter.h"
#include "util/iso8601filter.h"
#include "util/absolutetimefilter.h"
#include "util/relativetimefilter.h"
#include "util/controlfilter.h"
#include "util/threadfilter.h"
#include "util/linenumberfilter.h"
#include "util/filenamefilter.h"
#include "vectorappender.h"
#include <log4cxx/fmtlayout.h>
#include <log4cxx/propertyconfigurator.h>
#include <log4cxx/helpers/date.h>
#include <log4cxx/spi/loggingevent.h>
#include <iostream>
#include <iomanip>
#define REGEX_STR(x) x
#define PAT0 REGEX_STR("\[[0-9A-FXx]*]\ (DEBUG|INFO|WARN|ERROR|FATAL) .* - Message [0-9]\{1,2\}")
#define PAT1 ISO8601_PAT REGEX_STR(" ") PAT0
#define PAT2 ABSOLUTE_DATE_AND_TIME_PAT REGEX_STR(" ") PAT0
#define PAT3 ABSOLUTE_TIME_PAT REGEX_STR(" ") PAT0
#define PAT4 RELATIVE_TIME_PAT REGEX_STR(" ") PAT0
#define PAT5 REGEX_STR("\[[0-9A-FXx]*]\ (DEBUG|INFO|WARN|ERROR|FATAL) .* : Message [0-9]\{1,2\}")
##dense 4 0.7557586431503296
std::string msg("Message ");
Pool pool;
// These should all log.----------------------------
LOG4CXX_FATAL(ERRlogger, createMessage(i, pool));
i++; //0
LOG4CXX_ERROR(ERRlogger, createMessage(i, pool));
i++;
LOG4CXX_FATAL(INF, createMessage(i, pool));
i++; // 2
LOG4CXX_ERROR(INF, createMessage(i, pool));
i++;
LOG4CXX_WARN(INF, createMessage(i, pool));
i++;
LOG4CXX_INFO(INF, createMessage(i, pool));
i++;
LOG4CXX_FATAL(INF_UNDEF, createMessage(i, pool));
i++; //6
LOG4CXX_ERROR(INF_UNDEF, createMessage(i, pool));
i++;
LOG4CXX_WARN(INF_UNDEF, createMessage(i, pool));
i++;
LOG4CXX_INFO(INF_UNDEF, createMessage(i, pool));
i++;
LOG4CXX_FATAL(INF_ERR, createMessage(i, pool));
i++; // 10
LOG4CXX_ERROR(INF_ERR, createMessage(i, pool));
i++;
LOG4CXX_FATAL(INF_ERR_UNDEF, createMessage(i, pool));
i++;
LOG4CXX_ERROR(INF_ERR_UNDEF, createMessage(i, pool));
i++;
Hybrid Search Result:
##hybrid 0 0.016393441706895828
use {
crate::args::LogArgs,
anyhow::{anyhow, Result},
simplelog::{Config, LevelFilter, WriteLogger},
std::fs::File,
};
pub struct Logger;
impl Logger {
pub fn init(args: &impl LogArgs) -> Result<()> {
let filter: LevelFilter = args.log_level().into();
if filter != LevelFilter::Off {
let logfile = File::create(args.log_file())
.map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
WriteLogger::init(filter, Config::default(), logfile)
.map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
}
Ok(())
}
}
##hybrid 1 0.016393441706895828
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "logunit.h"
#include <log4cxx/logger.h>
#include <log4cxx/simplelayout.h>
#include <log4cxx/fileappender.h>
#include <log4cxx/helpers/absolutetimedateformat.h>
##hybrid 2 0.016129031777381897
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
};
}
##hybrid 3 0.016129031777381897
void simple()
{
LayoutPtr layout = LayoutPtr(new SimpleLayout());
AppenderPtr appender = FileAppenderPtr(new FileAppender(layout, LOG4CXX_STR("output/simple"), false));
root->addAppender(appender);
common();
LOGUNIT_ASSERT(Compare::compare(LOG4CXX_FILE("output/simple"), LOG4CXX_FILE("witness/simple")));
}
std::string createMessage(int i, Pool & pool)
{
std::string msg("Message ");
msg.append(pool.itoa(i));
return msg;
}
void common()
{
int i = 0;
// In the lines below, the logger names are chosen as an aid in
// remembering their level values. In general, the logger names
// have no bearing to level values.
LoggerPtr ERRlogger = Logger::getLogger(LOG4CXX_TEST_STR("ERR"));
ERRlogger->setLevel(Level::getError());
##hybrid 4 0.01587301678955555
std::vector<std::string> MakeStrings() {
return {
"a", "ab", "abc", "abcd",
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
};
}
02.
Case 2: Hybrid Search Outperforms Full-Text Search
Question: How do you initialize the logger
This question is very similar to the previous one, and the correct answer is the same. This time, however, the answer was found by hybrid search (contributed by semantic search) but did not appear in the full-text search results. The reason is that the term weights derived from corpus statistics do not match our intuition: the model has no way of knowing that matching the word “how” is unimportant. It may even be that “logger” appears more often in the code than “how”, which makes “how” look like the more distinctive, and therefore more important, term.
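One possible way to act on this observation, offered here only as an illustration, is to drop question words from the query before it is sent to full-text search while leaving the dense query untouched. The word list and the function name below are assumptions, and whether such filtering helps depends on the corpus and the queries.

```python
import re

# Hypothetical list of words that rarely carry meaning in a code-search query.
QUESTION_WORDS = {"how", "do", "you", "is", "the", "what", "where", "a", "an", "to"}


def strip_question_words(query, stop_words=QUESTION_WORDS):
    """Remove question and function words from a query for lexical search.

    Falls back to the original query if nothing would be left.
    """
    tokens = re.findall(r"[A-Za-z0-9_]+", query)
    kept = [token for token in tokens if token.lower() not in stop_words]
    return " ".join(kept) if kept else query


print(strip_question_words("How do you initialize the logger"))  # initialize logger
print(strip_question_words("How is the log file created?"))      # log file created
```

The filtered string would be used only for the BM25 side of a hybrid search; the embedding model can still receive the full natural-language question.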
GroundTruth
use {
crate::args::LogArgs,
anyhow::{anyhow, Result},
simplelog::{Config, LevelFilter, WriteLogger},
std::fs::File,
};
pub struct Logger;
impl Logger {
pub fn init(args: &impl LogArgs) -> Result<()> {
let filter: LevelFilter = args.log_level().into();
if filter != LevelFilter::Off {
let logfile = File::create(args.log_file())
.map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
WriteLogger::init(filter, Config::default(), logfile)
.map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
}
Ok(())
}
}
Full-Text Search Result:
##sparse 0 10.17311954498291
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
};
}
##sparse 1 9.775702476501465
std::vector<std::string> MakeStrings() {
return {
"a", "ab", "abc", "abcd",
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
"long string to test how those are handled. Here goes more text. "
}
##sparse 2 7.638711452484131
// union ("x|y"), grouping ("(xy)"), brackets ("[xy]"), and
// repetition count ("x{5,7}"), among others.
//
// Below is the syntax that we do support. We chose it to be a
// subset of both PCRE and POSIX extended regex, so it's easy to
// learn wherever you come from. In the following: 'A' denotes a
// literal character, period (.), or a single \ escape sequence;
// 'x' and 'y' denote regular expressions; 'm' and 'n' are for
##sparse 3 7.1208391189575195
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "logunit.h"
#include <log4cxx/logger.h>
#include <log4cxx/simplelayout.h>
#include <log4cxx/fileappender.h>
#include <log4cxx/helpers/absolutetimedateformat.h>
##sparse 4 7.066349029541016
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <log4cxx/filter/denyallfilter.h>
#include <log4cxx/logger.h>
#include <log4cxx/spi/filter.h>
#include <log4cxx/spi/loggingevent.h>
#include "../logunit.h"
Hybrid Search Result:
##hybrid 0 0.016393441706895828
/** Integration tests for {@link BlobPuller}. */
public class BlobPullerIntegrationTest {
private final FailoverHttpClient httpClient = new FailoverHttpClient(true, false, ignored -> {});
@Test
public void testPull() throws IOException, RegistryException {
RegistryClient registryClient =
RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient)
.newRegistryClient();
V22ManifestTemplate manifestTemplate =
registryClient
.pullManifest(
ManifestPullerIntegrationTest.KNOWN_MANIFEST_V22_SHA, V22ManifestTemplate.class)
.getManifest();
DescriptorDigest realDigest = manifestTemplate.getLayers().get(0).getDigest();
Semantic Search Result:
##dense 0 0.7411458492279053
Mockito.doThrow(mockRegistryUnauthorizedException)
.when(mockJibContainerBuilder)
.containerize(mockContainerizer);
try {
testJibBuildRunner.runBuild();
Assert.fail();
} catch (BuildStepsExecutionException ex) {
Assert.assertEquals(
TEST_HELPFUL_SUGGESTIONS.forHttpStatusCodeForbidden("someregistry/somerepository"),
ex.getMessage());
}
}
##dense 1 0.7346029877662659
verify(mockCredentialRetrieverFactory).known(knownCredential, "credentialSource");
verify(mockCredentialRetrieverFactory).known(inferredCredential, "inferredCredentialSource");
verify(mockCredentialRetrieverFactory)
.dockerCredentialHelper("docker-credential-credentialHelperSuffix");
}
##dense 2 0.7285804748535156
when(mockCredentialRetrieverFactory.dockerCredentialHelper(anyString()))
.thenReturn(mockDockerCredentialHelperCredentialRetriever);
when(mockCredentialRetrieverFactory.known(knownCredential, "credentialSource"))
.thenReturn(mockKnownCredentialRetriever);
when(mockCredentialRetrieverFactory.known(inferredCredential, "inferredCredentialSource"))
.thenReturn(mockInferredCredentialRetriever);
when(mockCredentialRetrieverFactory.wellKnownCredentialHelpers())
.thenReturn(mockWellKnownCredentialHelpersCredentialRetriever);
##dense 3 0.7279614210128784
@Test
public void testBuildImage_insecureRegistryException()
throws InterruptedException, IOException, CacheDirectoryCreationException, RegistryException,
ExecutionException {
InsecureRegistryException mockInsecureRegistryException =
Mockito.mock(InsecureRegistryException.class);
Mockito.doThrow(mockInsecureRegistryUnauthorizedException)
.when(mockJibContainerBuilder)
.containerize(mockContainerizer);
try {
testJibBuildRunner.runBuild();
Assert.fail();
} catch (BuildStepsExecutionException ex) {
Assert.assertEquals(TEST_HELPFUL_SUGGESTIONS.forInsecureRegistry(), ex.getMessage());
}
}
##dense 4 0.724872350692749
@Test
public void testBuildImage_registryCredentialsNotSentException()
throws InterruptedException, IOException, CacheDirectoryCreationException, RegistryException,
ExecutionException {
Mockito.doThrow(mockRegistryCredentialsNotSentException)
.when(mockJibContainerBuilder)
.containerize(mockContainerizer);
try {
testJibBuildRunner.runBuild();
Assert.fail();
} catch (BuildStepsExecutionException ex) {
Assert.assertEquals(TEST_HELPFUL_SUGGESTIONS.forCredentialsNotSent(), ex.getMessage());
}
}
We can draw some conclusions. A semantic search model gives good results out of the box, but when the query contains keywords that must be matched exactly, it has no explicit way to express that requirement. Full-text search does. The trade-off is that insignificant matches can drag down overall quality, so these negative cases have to be identified from concrete results and handled specifically for the application in order to improve search quality.
We hope that the release of full-text search in Milvus 2.5 gives community users more flexibility in building RAG systems and in exploring combinations of search strategies to meet the more complex and diverse search needs of the GenAI era. To learn how to use full-text search in Milvus, see Using Milvus for Full-Text Search (https://milvus.io/docs/zh/full_text_search_with_milvus.md).
Code:
https://github.com/wxywb/milvus_fts_exps
Dataset:
https://github.com/anthropics/anthropic-cookbook/tree/main/skills/contextual-embeddings/data
Author Introduction

Wang Xiangyu
Zilliz Algorithm Engineer

Chen Jiang
Zilliz Ecosystem and AI Platform Leader