This time, we will explore Gemma 2, the open-source large language model (LLM) released by Google in 2024. Ollama lets us run open-source LLMs as a service on our local systems, which makes them very interesting to play with and explore. However, deploying it locally would consume most of my local computing resources (it would freeze my laptop 😅), and I can't use it for the applications I plan to develop. Therefore, I am looking for an easy way to deploy Ollama in the cloud.
Fortunately, there is an official guide on how to easily deploy it on Google Cloud Platform using Cloud Run. Cloud Run allows you to run applications without worrying about managing servers or infrastructure. It acts like a “serverless” platform that handles all the underlying details for you. This way, we can focus more on how to develop applications rather than managing deployment infrastructure.
According to the documentation, we need to enable the following APIs to follow the tutorial:
- Cloud Build API -> the product we use to build the Ollama Docker image from the Dockerfile.
- Artifact Registry API -> the product we use to store the Ollama image built by Cloud Build.
To enable these APIs, you can click the search button on the Google Cloud Console homepage:
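Alternatively, if you prefer the command line, the same APIs can be enabled with gcloud once the CLI is installed and authenticated (installation is covered a bit further below). The extra run.googleapis.com entry is my addition, since we will deploy to Cloud Run later:
gcloud services enable \
  cloudbuild.googleapis.com \
  artifactregistry.googleapis.com \
  run.googleapis.com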
Next, we need to create an Ollama Dockerfile on our local system. We can copy and paste directly from the tutorial:
FROM ollama/ollama:0.3.6
# Listen on all interfaces, port 8080
ENV OLLAMA_HOST 0.0.0.0:8080
# Store model weight files in /models
ENV OLLAMA_MODELS /models
# Reduce logging verbosity
ENV OLLAMA_DEBUG false
# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1
# Store the model weights in the container image
ENV MODEL gemma2:9b
RUN ollama serve & sleep 5 && ollama pull $MODEL
# Start Ollama
ENTRYPOINT ["ollama", "serve"]
Details about Ollama environment variables can be found here: Ollama FAQ.
Before we proceed, make sure your local system has the Google Cloud CLI installed. You can find the installation method here: Install gcloud CLI | Google Cloud CLI Documentation.
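After installing, a minimal setup looks like the sketch below; {PROJECT_ID} is a placeholder for your own Google Cloud project:
# Log in with your Google account
gcloud auth login
# Point gcloud at the project you want to deploy into
gcloud config set project {PROJECT_ID}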
If you want to use Cloud Run with a GPU, you first need to check which regions support it in Cloud Run locations. Additionally, you may need to request a quota increase here: Cloud Run Quotas and Limits. However, if you think the GPU is too much and do not want to wait for the quota request, you can still continue with this tutorial because Ollama can also run on CPU only.
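If you are unsure which regions you can deploy to at all, you can list the available Cloud Run regions from the CLI (note that this only lists regions and does not show GPU availability, so still check the Cloud Run locations page for that):
gcloud run regions list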
Next, we need to prepare the Artifact Registry Docker repository to place the Ollama Docker image by running the following console command:
gcloud artifacts repositories create {REPOSITORY_NAME} \
  --repository-format=docker \
  --location={LOCATION}
If everything goes smoothly, it will display the following result. I used docker-repo as the repository name:
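If you want to double-check, you can list the repositories in that location:
gcloud artifacts repositories list --location={LOCATION}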
Next, we will continue with the Docker image build process using Cloud Build by running the following command:
gcloud builds submit \
  --tag {LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REPOSITORY}/{IMAGE_NAME}
If successful, it will show the following output. (Please note that my PROJECT_ID is alvin-exploratory-2, and the IMAGE_NAME is ollama_gemma.)
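You can also confirm that the image really landed in the repository by listing its contents:
gcloud artifacts docker images list {LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REPOSITORY}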
At this stage, we are almost ready to deploy the Cloud Run application! However, for security reasons, it is best practice to ensure that all requests to our Cloud Run application are authenticated and to give the service a dedicated identity. For that, we need to create something called a service account. Let's run the following command to create one for our Ollama cloud service:
gcloud iam service-accounts create {SERVICE_ACCOUNT_NAME} \
--display-name="Service Account for Ollama Cloud Run service"
I will use ollama-cloudrun as the SERVICE_ACCOUNT_NAME. The successful output will look like this:
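The full email of this service account, which we will pass to the deploy command below, has the form {SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com. You can verify it with:
gcloud iam service-accounts list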
Finally, we can deploy the Ollama cloud service using the following command:
gcloud beta run deploy {CLOUD_RUN_NAME} \
--image {LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REPOSITORY}/{IMAGE_NAME} \
--concurrency 4 \
--cpu 8 \
--set-env-vars OLLAMA_NUM_PARALLEL=4 \
--gpu 1 \
--gpu-type nvidia-l4 \
--max-instances 1 \
--memory 32Gi \
--no-allow-unauthenticated \
--no-cpu-throttling \
--service-account {SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com \
--timeout=600
The command above deploys with a GPU. If you want to deploy using only the CPU, you can remove the --gpu and --gpu-type parameters. If the process is successful, it will display the following output:
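Once the deployment finishes, you can inspect the service and grab its HTTPS URL, which is handy if you later want to call it without the proxy described below:
gcloud run services describe {CLOUD_RUN_NAME} --region={LOCATION} \
  --format='value(status.url)'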
Now we can test whether the service has been deployed successfully. Keep in mind that if another service wants to access it, that caller has to authenticate with a service account that is allowed to invoke it (there is a short note on this after the test). For a quick test from our own machine, we can simply enable a proxy on the local system: the proxy forwards our local requests to the Cloud Run service and attaches our gcloud credentials, so the requests are authenticated automatically.
We can run the following command to create a proxy for our local system to send requests. We will run the service proxy on port 9090:
gcloud run services proxy {CLOUD_RUN_NAME} --port={PORT}
Now we can access the Ollama cloud service via the address http://127.0.0.1:9090.
After that, we can test sending requests from another console tab:
curl http://localhost:9090/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "Why is the sky blue?"
}'
It will display a streaming output, as shown in the example below:
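By default, the /api/generate endpoint streams one JSON object per generated chunk. If you prefer a single JSON response instead, you can add "stream": false to the request body, for example:
curl http://localhost:9090/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'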
All done! Now we have successfully tested the deployed Ollama service in the cloud.
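One more note on authentication: the proxy works because gcloud attaches our own credentials to each request. If a separate application or service should call the Ollama endpoint directly, its identity needs the Cloud Run invoker role on our service. A sketch of that grant, where {CALLER_SA} is a placeholder for the caller's service account, looks like this:
gcloud run services add-iam-policy-binding {CLOUD_RUN_NAME} \
  --region={LOCATION} \
  --member="serviceAccount:{CALLER_SA}@{PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/run.invoker"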
In summary, we have completed all of the following steps:
- Prepared the Docker image repository and enabled the required APIs.
- Prepared the Ollama Dockerfile.
- Built the Ollama image and pushed it to the cloud.
- Created a service account and deployed the Ollama service to Cloud Run.
- Tested sending requests from the local system to the Cloud Run Ollama service.
Now we can utilize these deployed Ollama services to meet our needs and develop applications based on them.
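For example, while developing, you can skip the proxy and call the service URL directly by attaching an identity token to the request; {SERVICE_URL} below is a placeholder for the URL returned by the describe command earlier:
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  {SERVICE_URL}/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'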
