Segment Optimization for Ollama+MaxKB Knowledge Base

Last time, I published an article titled “Building Your Own Simple Knowledge Base with Ollama,” and many readers ran into problems while following it. I hit similar ones myself. In particular, after importing an article into a MaxKB knowledge base and asking questions in the application, the answers were completely off-topic and did not meet my requirements. Today I want to analyze this problem and show how to improve it.

In most cases, the root cause is poor segmentation when the article is uploaded: badly split segments lower the hit rate during retrieval, which in turn produces irrelevant answers. Let’s reproduce the problem.
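
Before reproducing it in MaxKB, a toy sketch may help show why segmentation affects the hit rate. The scoring below is plain word overlap, used only as a stand-in for real similarity search; it is not how MaxKB ranks segments, and the sample text, question, and function names are all made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def best_hit(question, segments):
    """Return (score, segment) for the segment sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(seg)) / len(q), seg) for seg in segments]
    return max(scored, key=lambda pair: pair[0])

# Hypothetical sample document (not taken from the article being imported).
document = (
    "Ollama runs large language models locally. "
    "MaxKB splits uploaded documents into segments and retrieves the closest ones."
)

# Good segmentation: one segment per sentence.
good_segments = [s.strip() for s in re.split(r"(?<=\.)\s+", document) if s.strip()]

# Poor segmentation: fixed 20-character slices that cut through words and sentences.
poor_segments = [document[i:i + 20] for i in range(0, len(document), 20)]

question = "How does MaxKB split documents into segments?"
good_score, good_text = best_hit(question, good_segments)
poor_score, poor_text = best_hit(question, poor_segments)

print(f"good: {good_score:.2f} -> {good_text!r}")
print(f"poor: {poor_score:.2f} -> {poor_text!r}")
```

With sentence-level segments, the best match is the whole sentence about MaxKB. With arbitrary slices, the best match is a fragment that scores lower and carries too little context to answer from.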

First, import a document.

Next, choose the recommended “intelligent segmentation” mode.

The segmentation results are chaotic, and some segments even contain garbled text.

Checking the garbled text against the original document shows that it comes from formulas. In other words, this way of importing a knowledge base does not parse formulas well; it works better on plain text.

Also, for the same document, uploading the Word version usually works better than the PDF: text in a PDF may be stored as images and may carry various formatting symbols, both of which lead to segmentation errors.

After I switched to the Word version, the segmentation was much clearer.

For the same article, the Word and PDF imports differ significantly in both the number of segments and the character counts, and the Word import is the cleaner of the two.

Everything so far used the “intelligent segmentation” mode. There is also an “advanced segmentation” mode, in which we select the segmentation identifiers ourselves.

This mode produces even more segments, so use it with caution, and ideally preprocess the document yourself first. Otherwise, line breaks or carriage returns may not fall where you want the splits, and a paragraph that forms one complete idea can easily be cut into many meaningless segments.
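
The preprocessing mentioned above could look roughly like the following. This is a minimal sketch that assumes the document is available as plain text and that blank lines mark the real paragraph boundaries; the function names and sample text are hypothetical, and nothing here calls MaxKB itself.

```python
import re

def unwrap_hard_breaks(text):
    """Join lines inside a paragraph; keep blank lines as paragraph boundaries."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in cleaned if p)

def split_on(text, separator):
    """Split text on a literal separator and drop empty pieces."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]

# Hypothetical hard-wrapped input: two paragraphs, each broken across two lines.
raw = (
    "Segmentation decides what the\r\n"
    "retriever can find.\r\n"
    "\r\n"
    "Each segment should hold\r\n"
    "one complete idea.\r\n"
)

# Splitting directly on line breaks cuts both paragraphs in half.
naive = split_on(raw.replace("\r\n", "\n"), "\n")

# Unwrapping first, then splitting on blank lines, keeps each paragraph whole.
prepared = split_on(unwrap_hard_breaks(raw), "\n\n")

print(len(naive), naive)
print(len(prepared), prepared)
```

After a cleanup like this, the identifier you select for segmentation marks a real paragraph boundary instead of an accidental line wrap.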

Now let’s ask the question again. The results are better than before.

The debug window also shows the question-and-answer results and the referenced segments more directly.

For comparison, let’s ask the original Ollama model directly, without the knowledge base.

The response is rather amusing: the model does not understand the question at all and merely restates it. This shows that the knowledge base, once re-segmented, really does make a difference.
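
If you want to repeat this comparison outside the MaxKB interface, the sketch below sends a question straight to a local Ollama server, with or without segments pasted into the prompt. It assumes Ollama is running at its default address and uses its `/api/generate` endpoint. The prompt wording is a hypothetical template, not the one MaxKB uses, and the model name in the usage note is a placeholder for whatever model you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama address

def build_payload(model, question, context_segments=None):
    """Build the JSON body; prepend retrieved segments when they are provided."""
    if context_segments:
        context = "\n\n".join(context_segments)
        # Hypothetical prompt template, for illustration only.
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    else:
        prompt = question
    return {"model": model, "prompt": prompt, "stream": False}

def ask(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the answer text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

For example, `ask(build_payload("your-model", "your question"))` queries the bare model, while passing a list of segments as the third argument of `build_payload` approximates what the model sees once the knowledge base supplies context.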
