
In the previous article, I showed how to deploy large models locally on a Mac and, by customizing prompts, handle tasks such as domain extraction.
In practice, however, deploying large models locally is not very friendly for individual developers. On one hand, it requires a significant investment to ensure the hardware has enough computational power for LLM training and inference; on the other hand, locally deployed models are usually fairly small in scale, so the results may not be ideal.
For an individual developer, the fastest way to build an AI application is to integrate an API from a large model platform. In that case, the cost is essentially Token expenditure. Here are the Token prices of the major LLM platforms:
Tongyi Qianwen (Alibaba)
Zhipu AI (GLM)
Baidu ERNIE Bot (Wenxin)
Moonshot AI Kimi

Next, I will use the Tongyi Qianwen platform to implement a mock requirement and look at the actual Token usage.
Each platform actually grants a certain amount of free Tokens; I chose Tongyi Qianwen simply because it is relatively cheap.

Implementing a search for exhibition halls using Tongyi Qianwen
The mock requirement is: list three exhibition halls in a given city and introduce some of the collections in each hall.
Return in JSON format
By customizing the Prompt, we can make the large model return data in a fixed format; a commonly used one is JSON. The Prompt is as follows:



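A prompt along these lines fits the requirement; this is only a sketch, and the city name and JSON field names below are my illustrative choices, not the article's exact wording:

```
List three exhibition halls in Beijing and introduce some collections of each hall.
Return the result strictly in JSON with this structure:
{
  "city": "...",
  "halls": [
    {"name": "...", "collections": ["...", "..."]}
  ]
}
Return only the JSON, with no extra explanation.
```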
In my test, one such request consumed 543 Tokens, so a quota of 1,000,000 Tokens covers only 1000000 / 543 ≈ 1841 requests. Projected onto real usage, 1000 users * 10 requests each * 543 Tokens per request = 5,430,000 Tokens.
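A quick helper makes it easy to replay this arithmetic for other per-request Token counts (a minimal sketch; the function name and free-quota default are mine):

```python
def token_budget(tokens_per_request: int, users: int, requests_per_user: int,
                 free_quota: int = 1_000_000) -> None:
    """Rough Token budgeting based on one measured request."""
    # How many requests a fixed Token quota covers.
    print(f"{free_quota} Tokens ~= {free_quota // tokens_per_request} requests")
    # Total Tokens for the projected user load.
    total = users * requests_per_user * tokens_per_request
    print(f"{users} users x {requests_per_user} requests -> {total} Tokens")

token_budget(543, 1000, 10)  # the numbers measured above
```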
For a self-respecting cheapskate, conserving Tokens is essential, so how to optimize the Prompt to cut Token usage while preserving accuracy becomes particularly important.
Simplifying JSON Format
Starting from the JSON format, we simplify it and strip the unnecessary parts. The following is added to the previous Prompt:

As you can see, the Prompt gains one extra line: "Use simplified JSON mode, removing all unnecessary spaces and line breaks".
The specific modified code is as follows:

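The code itself was shown as a screenshot; the sketch below shows the same idea with the DashScope Python SDK, where the model name, prompt wording, and usage fields are my assumptions rather than the article's exact code:

```python
import dashscope
from dashscope import Generation

dashscope.api_key = "YOUR_API_KEY"  # placeholder: use your own key

prompt = (
    "List three exhibition halls in Beijing and introduce some collections "
    "of each hall. Return the result in JSON. "
    "Use simplified JSON mode, removing all unnecessary spaces and line breaks."
)

response = Generation.call(
    model="qwen-turbo",  # assumed model; any Tongyi Qianwen model works here
    messages=[{"role": "user", "content": prompt}],
    result_format="message",
)

# The SDK reports Token usage alongside the generated text.
print(response.output.choices[0].message.content)
print("input tokens: ", response.usage.input_tokens)
print("output tokens:", response.usage.output_tokens)
```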
Running the above code, the final return result is as follows:

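For reference, "simplified JSON" output has this shape, with all cosmetic whitespace gone (the placeholders below are mine, not actual model output):

```
{"city":"Beijing","halls":[{"name":"...","collections":["...","..."]}]}
```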
The improvement is obvious: inference time drops to 8 seconds and Token consumption falls to 393.
Using Custom Format
In fact, even simplified JSON is still a relatively heavyweight data format. We can define an even simpler custom format to further reduce the model's inference time and Token usage.
For example, a format like the following:

In this custom format, the ! symbol represents a JSON Object, the ? symbol a string, and the # symbol an array.
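Since only the !/?/# markers are specified, the concrete layout below is my own guess at such a format, together with a tiny decoder back to Python dicts (a sketch, not the article's actual grammar):

```python
def decode(text: str) -> list[dict]:
    """Decode a toy !/?/# format: '!' starts a new object,
    '?key value' is a string field, '#key a|b|c' is a string array."""
    objects, current = [], None
    for line in text.strip().splitlines():
        if line == "!":             # new JSON Object
            current = {}
            objects.append(current)
        elif line.startswith("?"):  # string field
            key, _, value = line[1:].partition(" ")
            current[key] = value
        elif line.startswith("#"):  # array field
            key, _, value = line[1:].partition(" ")
            current[key] = value.split("|")
    return objects

sample = """!
?name Palace Museum
#collections Collection A|Collection B
!
?name Hall Two
#collections Collection C|Collection D"""

print(decode(sample))
# [{'name': 'Palace Museum', 'collections': ['Collection A', ...]}, ...]
```

The savings come from dropping JSON's quotes, braces, commas, and repeated punctuation, at the cost of writing and maintaining a custom parser on the client side.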



