Last time, I published an article titled “Building Your Own Simple Knowledge Base with Ollama,” and many readers ran into various issues while using it; I hit similar problems myself. In particular, after importing an article into the MaxKB knowledge base and asking questions in the application, the answers were completely off-topic and did not meet my requirements. Today I want to analyze this issue and discuss how to improve it.
Generally speaking, the root cause is poor segmentation at upload time, which lowers the retrieval hit rate and in turn produces irrelevant answers. Let’s reproduce the problem.
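The link between segmentation and answer quality can be illustrated with a toy retrieval sketch (my own simplified illustration, not MaxKB’s actual retrieval code): score each segment against the question by word overlap and keep the best match. When one explanation is shattered across several fragments, no single segment matches the question well, so the model is handed a weak or irrelevant context.

```python
def score(segment: str, query: str) -> float:
    """Crude relevance score: fraction of query words found in the segment."""
    seg_words = set(segment.lower().split())
    q_words = set(query.lower().split())
    return len(seg_words & q_words) / len(q_words)

query = "how does the cache eviction policy work"

# One coherent segment that keeps the whole explanation together.
good_segments = [
    "the cache eviction policy work is based on least recently used entries",
    "configuration options are listed in the appendix",
]

# The same content shattered into fragments by bad segmentation.
bad_segments = [
    "the cache",
    "eviction policy",
    "work is based on least recently used entries",
    "configuration options are listed in the appendix",
]

best_good = max(score(s, query) for s in good_segments)
best_bad = max(score(s, query) for s in bad_segments)
print(best_good, best_bad)  # the coherent segment scores noticeably higher
```

Real systems use embedding similarity rather than word overlap, but the effect is the same: the retriever can only return whole segments, so a segment that carries a complete thought gives the model far better material to answer from.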
First, import a document.

The next step is to use the recommended intelligent segmentation method:

As we can see, the segmentation results are chaotic, and there is even some garbled text:


The actual content of the garbled text is as follows:

This shows that this import method does not parse formulas well; it is better suited to plain text.
Moreover, for the same document it is better to upload the Word version rather than the PDF: text in a PDF may be stored as images and may carry various formatting symbols, both of which lead to segmentation errors.
As we can see, when I switched to Word, the segmentation was much clearer:


For the same article, the number of segments and the character counts differ significantly between the Word and PDF versions, and the Word version yields much clearer segmentation.

This is the “intelligent segmentation” mode; we can also use the “advanced segmentation” mode and select the segmentation delimiters we need.


This segmentation method produces even more segments. However, it should be used cautiously, and it is best to preprocess the document yourself first. Otherwise, line breaks or carriage returns may not be the split points you intended, and a semantically complete paragraph can easily be broken into many meaningless fragments.
Now let’s ask a question again; the results are better than before:


At the same time, we can also see the question-and-answer results and the referenced segments more intuitively in the debug window:


Let’s try the same question against the plain Ollama model, without the knowledge base attached:

The response is quite amusing: the model completely fails to understand what I am asking and merely restates my question. This confirms that the knowledge base we built after re-segmenting is indeed effective.