Ollama is primarily a wrapper around llama.cpp, designed for local inference tasks. It is generally not the first choice if you are after cutting-edge performance or features, but it has its place, especially in environments where external dependencies need to be kept under control.
Local AI Development
When developing locally with Ollama, the setup is simple yet effective: developers run inference tasks directly on their own machines, pointing the application at a single local Ollama instance.
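As a rough sketch of that loop (the model name and prompt are placeholders, and Ollama's default local endpoint on port 11434 is assumed), a minimal Go client might look like the following:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // One user request goes straight to the single local inference host.
        body, _ := json.Marshal(map[string]any{
            "model":  "llama3", // placeholder: any locally pulled model
            "prompt": "Explain what a local inference loop is.",
            "stream": false,
        })

        resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Non-streaming responses carry the generated text in the "response" field.
        var out struct {
            Response string `json:"response"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            log.Fatal(err)
        }
        fmt.Println(out.Response)
    }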
This configuration allows developers to quickly test and iterate without the need for complex remote server communication. It is perfect for initial prototyping and development stages where quick turnaround is crucial.
From Local to Cloud
Transitioning from a local setup to a scalable cloud environment requires evolving from a simple 1:1 setup (one user request to one inference host) to a more complex many-to-many (multiple user requests to multiple inference hosts) configuration. As demand increases, this shift is essential for maintaining efficiency and responsiveness.
Scaling from local development to production therefore means putting a routing layer between incoming user requests and a pool of inference hosts.
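For illustration only, here is a minimal sketch of the kind of direct routing layer this shift implies: a naive round-robin fan-out from user requests to a pool of Ollama hosts. The host names and port are invented for the example, and nothing in it is part of Ollama or Tau:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync/atomic"
    )

    // Hypothetical pool of inference hosts, each running its own Ollama instance.
    var inferenceHosts = []string{
        "http://inference-1:11434",
        "http://inference-2:11434",
        "http://inference-3:11434",
    }

    func main() {
        var next uint64
        proxy := &httputil.ReverseProxy{
            Rewrite: func(r *httputil.ProxyRequest) {
                // Pick the next host in round-robin order. Nothing here tracks
                // session state, host load, or model availability, which is
                // exactly the gap discussed below.
                i := atomic.AddUint64(&next, 1) % uint64(len(inferenceHosts))
                target, _ := url.Parse(inferenceHosts[i])
                r.SetURL(target)
            },
        }
        log.Fatal(http.ListenAndServe(":8080", proxy))
    }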
Taking such a direct approach can significantly increase application complexity, especially when sessions must remain consistent across distributed state. If requests are not routed to the best available inference host, delays and inefficiencies follow.
Furthermore, the complexity of distributed applications makes it difficult to test locally, which can slow down the development process and increase the risk of failures in production environments.
Serverless
Serverless computing abstracts away server management and infrastructure details, allowing developers to focus on code and business logic. By separating request handling and consistency maintenance from the application, serverless architecture simplifies scaling.
This approach enables applications to focus on delivering value, addressing many common scaling challenges without burdening developers with infrastructure complexity.
WebAssembly
WebAssembly (Wasm) addresses the dependency management issue by compiling applications into standalone modules. This makes it easier to orchestrate and test applications both locally and in the cloud, ensuring consistency across different environments.
Tau
Tau is a framework for building low-maintenance, highly scalable cloud computing platforms. It simplifies deployment and supports running local clouds for development, enabling end-to-end (E2E) testing of both the cloud infrastructure and the applications that run on it.
Taubyte refers to this approach as “local coding equals global production,” ensuring that methods that work effectively locally can also work globally, greatly simplifying the development and deployment process.
Integrating Ollama into Tau Using the Orbit Plugin System
The Orbit plugin system of Tau greatly simplifies the process of transforming services into manageable components by wrapping them into WebAssembly host modules. This approach allows Tau to take over orchestration tasks, simplifying deployment and management processes.
Export Functions in Ollama
To make Ollama functions available in the Tau ecosystem, we utilize the Orbit system to export Ollama’s functionalities as callable endpoints. Below is how to export endpoints in Go:
func (s *ollama) W_pull(ctx context.Context, module satellite.Module, modelNamePtr uint32, modelNameSize uint32, pullIdptr uint32) Error {
    // Read the model name out of the guest module's memory.
    model, err := module.ReadString(modelNamePtr, modelNameSize)
    if err != nil {
        return ErrorReadMemory
    }

    // A non-nil updateFunc signals that a new pull should be started for this model.
    id, updateFunc := s.getPullId(model)
    if updateFunc != nil {
        go func() {
            err := server.PullModel(s.ctx, model, &server.RegistryOptions{}, updateFunc)
            s.pullLock.Lock()
            defer s.pullLock.Unlock()
            s.pulls[id].err = err
        }()
    }

    // Write the pull id back into guest memory and report success.
    module.WriteUint64(pullIdptr, id)
    return ErrorNone
}
Once defined, these functions (registered through satellite.Export) integrate Ollama seamlessly into the Tau environment:
func main() {
    // new and init construct and initialize the plugin's host-side state (defined elsewhere in the plugin).
    server := new(context.TODO(), "/tmp/ollama-wasm")
    server.init()

    // Register the plugin with Orbit under the module name "ollama".
    satellite.Export("ollama", server)
}
Writing Tests for Ollama Plugins
The plugin testing process is straightforward. First, write a serverless function in Go that exercises the plugin:
//export pull
func pull() {
    var id uint64
    err := Pull("gemma:2b-instruct", &id)
    if err != 0 {
        panic("failed to call pull")
    }
}
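The Pull helper called above is a thin guest-side binding to the pull function exported by the host module; the real binding lives in the fixtures/common.go file referenced by the test below. Purely as a hypothetical sketch (the TinyGo-style import directives, the Error type, and the pointer handling are assumptions, not Taubyte's actual code), such a binding might look like this:

    //go:build tinygo.wasm

    package main

    import "unsafe"

    // Error mirrors the numeric error codes returned by the host module (assumed).
    type Error uint32

    // Declare the host function exported by the "ollama" module (TinyGo-style import,
    // assumed for illustration).
    //
    //go:wasm-module ollama
    //export pull
    func _pull(modelPtr uint32, modelSize uint32, idPtr uint32) Error

    // Pull wraps the raw host call in a Go-friendly signature: it passes a
    // pointer/length pair for the model name plus a pointer where the host
    // writes the pull id, mirroring W_pull on the host side.
    func Pull(model string, id *uint64) Error {
        data := []byte(model)
        return _pull(
            uint32(uintptr(unsafe.Pointer(&data[0]))),
            uint32(len(data)),
            uint32(uintptr(unsafe.Pointer(id))),
        )
    }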
Using Tau’s testing suite and Go builder tools, you can build plugins, deploy them in a testing environment, and execute serverless functions to validate functionality:
func TestPull(t *testing.T) {
    ctx := context.Background()

    // Create a testing suite to test the plugin
    ts, err := suite.New(ctx)
    assert.NilError(t, err)

    // Use a Go builder to build plugins and wasm
    gob := builder.New()

    // Build the plugin from the directory
    wd, _ := os.Getwd()
    pluginPath, err := gob.Plugin(path.Join(wd, ".", "ollama"))
    assert.NilError(t, err)

    // Attach plugin to the testing suite
    err = ts.AttachPluginFromPath(pluginPath)
    assert.NilError(t, err)

    // Build a wasm file from the serverless function
    wasmPath, err := gob.Wasm(ctx, path.Join(wd, "fixtures", "pull.go"), path.Join(wd, "fixtures", "common.go"))
    assert.NilError(t, err)

    // Load the wasm module and call the function
    module, err := ts.WasmModule(wasmPath)
    assert.NilError(t, err)

    // Call the "pull" function from our wasm module
    _, err = module.Call(ctx, "pull")
    assert.NilError(t, err)
}
