What You Need to Know About Prompt Engineering

Selected from Lil’Log

Translated by Machine Heart

Editor: Rome Rome

With the rise of models like ChatGPT and GPT-4, people are increasingly interested in how to create prompts to obtain the desired outputs. Researchers find that responses to specific prompts can be difficult to predict and vary depending on the model. This article, written by Lilian Weng from OpenAI, introduces some insights about prompts, including basic prompts and instruction prompts.

Prompt engineering, also known as In-Context Prompting, refers to the methods of communicating with LLMs to guide their behavior and achieve the desired results without updating model weights. It is an empirical science, and the effectiveness of prompt engineering methods can vary significantly across models, necessitating extensive experimentation and heuristic approaches.
This article, authored by Lilian Weng from OpenAI, presents some knowledge about prompt engineering. Lilian Weng is the head of AI application research at OpenAI and joined the company in 2018, primarily working on pre-training, reinforcement learning & alignment, and model safety in the GPT-4 project.
Let's take a look at the main content of this article.
Basic Prompts
Zero-shot and few-shot learning are the two most basic approaches for prompting a model; they are discussed in many LLM papers and are commonly used to benchmark LLM performance.
Zero-Shot Learning
Zero-shot learning simply involves inputting task text into the model and asking it to return results. (All sentiment analysis examples are from SST-2)
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
Few-Shot Learning
Few-shot learning provides a set of high-quality demonstrations about the target task, each demonstration containing the input for the target task and the expected output. When the model first sees good examples, it can better understand human intent and the standards for the desired answers. Therefore, compared to zero-shot learning, few-shot learning usually results in better performance. However, this comes at the cost of consuming more tokens and may reach context length limits when the input and output texts are long.
Text: (lawrence bounces) all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place.
Sentiment: positive

Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative

Text: for the first time in years, de niro digs deep emotionally, perhaps because he's been stirred by the powerful work of his co-stars.
Sentiment: positive

Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
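A few-shot prompt like the one above can also be assembled programmatically from labeled demonstrations. Below is a minimal sketch; the `llm` callable is a placeholder for whatever completion API is being used and is not part of the original article.

```python
from typing import Callable, List, Tuple

def build_few_shot_prompt(demos: List[Tuple[str, str]], test_text: str) -> str:
    """Format labeled demonstrations followed by the unlabeled test input."""
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Text: {test_text}\nSentiment:")
    return "\n\n".join(blocks)

def classify_few_shot(llm: Callable[[str], str],
                      demos: List[Tuple[str, str]],
                      test_text: str) -> str:
    # `llm` is assumed to map a prompt string to a completion string.
    prompt = build_few_shot_prompt(demos, test_text)
    return llm(prompt).strip().lower()

# Example usage (with a hypothetical `my_llm` function):
# demos = [("de niro digs deep emotionally ...", "positive"),
#          ("this clunker ...", "negative")]
# classify_few_shot(my_llm, demos, "i'll bet the video game is a lot more fun than the film.")
```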
Many studies have explored how to construct contextual examples to maximize performance, observing that the choice of prompt format, training examples, and example order can lead to dramatically different performance, ranging from near-random guessing to close to SOTA.
Research by Zhao et al. (2021) investigated the few-shot classification setting and proposed several causes of this high variance (their experiments used GPT-3): (1) Majority label bias arises when the distribution of labels among the demonstrations is imbalanced; (2) Recency bias is the model's tendency to repeat the label that appears at the end of the prompt; (3) Common token bias means the LLM tends to produce common tokens more often than rare ones. To overcome such biases, they proposed calibrating the label probabilities output by the model so that they become uniform when the input string is N/A.
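The calibration idea can be sketched as follows: estimate the model's label probabilities on a content-free input such as "N/A" and use them to rescale the probabilities on real inputs. This is only a minimal sketch of the approach; `label_probs` (returning one probability per label for a given prompt) and `make_prompt` are placeholder functions.

```python
import numpy as np
from typing import Callable

def calibrated_probs(label_probs: Callable[[str], np.ndarray],
                     make_prompt: Callable[[str], str],
                     test_text: str,
                     content_free: str = "N/A") -> np.ndarray:
    """Rescale label probabilities so a content-free input maps to a uniform distribution."""
    p_cf = label_probs(make_prompt(content_free))   # bias estimated from the "N/A" input
    p_cf = p_cf / p_cf.sum()
    p_raw = label_probs(make_prompt(test_text))     # uncalibrated label probabilities
    p_cal = p_raw / p_cf                            # divide out the bias, i.e. W = diag(p_cf)^-1
    return p_cal / p_cal.sum()
```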
Sample Selection Tips
  • Choose examples that are semantically similar to the test example using k-NN in the embedding space (Liu et al., 2021); see the sketch after this list;
  • To select a diverse and representative example set, Su et al. (2022) proposed a graph-based approach: (1) First, construct a directed graph G=(V,E) based on the cosine similarity between sample embeddings (e.g., from SBERT or another embedding model), where each node points to its k nearest neighbors; (2) Start with a selected sample set L=∅ and a remaining sample set U. Each sample u∈U is scored as $\text{score}(u) = \sum_{v \in \{v \mid (u,v)\in E,\ v\in U\}} s(v)$, where $s(v) = \rho^{-|\{\ell \in L \mid (v,\ell)\in E\}|}$ with $\rho > 1$, so that s(v) is low if many of v's neighbors have already been selected; the score therefore encourages picking diverse samples;
  • Rubin et al. (2022) proposed training dataset-specific embeddings via contrastive learning for in-context example selection. For each training pair (x, y), the quality of an example e_i (a formatted input-output pair) can be measured by the conditional probability assigned by the LM: $\text{score}(e_i) = P_{\text{LM}}(y \mid e_i, x)$. The examples with top-k and bottom-k scores can then be used as positive and negative candidates for each training pair in contrastive learning;
  • Some researchers have attempted to use Q-Learning for sample selection (Zhang et al. 2022);
  • Inspired by uncertainty-based active learning, Diao et al. (2023) suggested identifying the examples with the highest disagreement or entropy across multiple sampling trials and then annotating those examples for use in few-shot prompts.
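A minimal sketch of the first tip above, embedding-based example selection: rank the candidate pool by cosine similarity to the test input and keep the top k. The embeddings are assumed to be precomputed with any sentence encoder (e.g., SBERT); nothing here reproduces the cited papers' exact setups.

```python
import numpy as np

def select_knn_examples(test_emb: np.ndarray, pool_embs: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k pool examples most similar to the test input."""
    # Cosine similarity = dot product of L2-normalized vectors.
    test = test_emb / np.linalg.norm(test_emb)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = pool @ test
    return np.argsort(-sims)[:k]
```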

Tips on Sample Ordering
  • It is recommended to maintain the diversity of sample selection, ensure relevance to the test samples, and arrange them in random order to avoid Majority Label bias and Recency bias;

  • Increasing model size or including more training samples does not reduce the variance of different contextual sample arrangements. The same order may work for one model but not for another. When the validation set is limited, consider selecting orders that prevent the model from producing extremely imbalanced predictions or being overly confident in its predictions (Lu et al. 2022).
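One way to read the last tip as code: enumerate candidate orderings of the demonstrations, run each on a small probing set, and keep the orderings whose predicted label distribution has the highest entropy (i.e., is least imbalanced). This is only a rough sketch of the idea behind Lu et al. (2022); `llm` and `extract_label` are placeholder functions.

```python
import itertools
import math
from collections import Counter
from typing import Callable, List, Tuple

def order_entropy(llm: Callable[[str], str],
                  demos: List[Tuple[str, str]],
                  probe_texts: List[str],
                  extract_label: Callable[[str], str]) -> float:
    """Entropy of the predicted label distribution over a probing set for one demo ordering."""
    prefix = "\n\n".join(f"Text: {t}\nSentiment: {y}" for t, y in demos)
    preds = [extract_label(llm(f"{prefix}\n\nText: {x}\nSentiment:")) for x in probe_texts]
    counts = Counter(preds)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def best_orderings(llm, demos, probe_texts, extract_label, top_n: int = 3):
    """Score every permutation of the demos and keep the highest-entropy orderings."""
    scored = [(order_entropy(llm, list(p), probe_texts, extract_label), p)
              for p in itertools.permutations(demos)]
    scored.sort(key=lambda pair: -pair[0])
    return [perm for _, perm in scored[:top_n]]
```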

Instruction Prompts
The purpose of using few-shot examples in prompts is to explain our intent to the model, in other words, to describe the task instructions to the model in the form of demonstrations. However, few-shot samples can be costly in terms of labeling and, due to limited context length, can restrict input length. So, why not give instructions directly?
Instructed LM models (e.g., InstructGPT, Natural Instructions) fine-tune pre-trained models with high-quality tuples (task instructions, input, correct output) to help the LM better understand user intent and follow instructions. RLHF (Reinforcement Learning from Human Feedback) is a commonly used method. The benefit of instruction-following fine-tuning is that it makes the model more aligned with human intent, significantly reducing communication costs.
When interacting with instruction models, tasks should be described in detail, being as specific and precise as possible, avoiding phrases like “don’t do something” and instead specifying what to do.
Please label the sentiment towards the movie of the given movie review. The sentiment label should be "positive" or "negative".
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
Describing the task to a specified audience is another clever way to give instructions, for example, producing educational material for children:
Describe what is quantum physics to a 6-year-old.
and safe content
... in language that is safe for work.
In-context instruction learning (Ye et al. 2023) combines few-shot learning with instruction prompts. It includes multiple demonstrations across different tasks in the prompt, each consisting of an instruction, a task input, and an output. Note that their experiments only covered classification tasks, and the instruction prompt contains all label options.
Definition: Determine the speaker of the dialogue, "agent" or "customer".
Input: I have successfully booked your tickets.
Output: agent

Definition: Determine which category the question asks for, "Quantity" or "Location".
Input: What's the oldest building in US?
Output: Location

Definition: Classify the sentiment of the given movie review, "positive" or "negative".
Input: i'll bet the video game is a lot more fun than the film.
Output:
Self-Consistency Sampling
Self-consistency sampling [Wang et al. 2022a] samples multiple outputs with temperature > 0, then selects the best one from these candidates. The criteria for selecting the best candidate vary by task. A common solution is to choose the majority vote. For easily verifiable tasks, such as programming problems with unit tests, simply running the interpreter and validating correctness through unit tests can suffice.
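A minimal sketch of self-consistency with majority voting; `llm_sample` is assumed to draw one completion at temperature > 0, and `extract_answer` pulls the final answer out of a completion (both are placeholders):

```python
from collections import Counter
from typing import Callable, Optional

def self_consistency(llm_sample: Callable[[str], str],
                     extract_answer: Callable[[str], Optional[str]],
                     prompt: str,
                     n_samples: int = 10) -> Optional[str]:
    """Sample several reasoning paths and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        completion = llm_sample(prompt)        # one stochastic completion (temperature > 0)
        answer = extract_answer(completion)    # e.g., the text after "So the answer is"
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]  # majority vote
```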
Chain of Thought (CoT)
Chain of Thought (CoT) prompts (Wei et al. 2022) generate a series of short phrases to describe reasoning logic step-by-step, called a reasoning chain, ultimately leading to a final answer. The benefits of CoT are more apparent for complex reasoning tasks, especially when using large models (e.g., with parameters exceeding 50B). Simple tasks benefit little from CoT prompts.
Types of CoT Prompts
There are two main types of CoT prompts:
Few-shot CoT: prompts the model with a few demonstrations, each demonstration containing a high-quality reasoning chain written by humans (or generated by the model).
(All mathematical reasoning examples are from GSM8k)
Question: Tom and Elizabeth have a competition to climb a hill. Elizabeth takes 30 minutes to climb the hill. Tom takes four times as long as Elizabeth does to climb the hill. How many hours does it take Tom to climb up the hill?
Answer: It takes Tom 30*4 = <<30*4=120>>120 minutes to climb the hill.
It takes Tom 120/60 = <<120/60=2>>2 hours to climb the hill.
So the answer is 2.
===
Question: Jack is a soccer player. He needs to buy two pairs of socks and a pair of soccer shoes. Each pair of socks cost $9.50, and the shoes cost $92. Jack has $40. How much more money does Jack need?
Answer: The total cost of two pairs of socks is $9.50 x 2 = $<<9.5*2=19>>19.
The total cost of the socks and the shoes is $19 + $92 = $<<19+92=111>>111.
Jack need $111 - $40 = $<<111-40=71>>71 more.
So the answer is 71.
===
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Answer:
Zero-shot CoT: uses natural language statements like “Let’s think step by step” to explicitly encourage the model to first generate a reasoning chain, then prompts “So the answer is” to generate the answer (Kojima et al. 2022). Or similar statements like “Let’s solve this problem step by step to ensure we have the correct answer” (Zhou et al. 2022).
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Answer: Let's think step by step.
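Zero-shot CoT is typically run in two stages: first elicit the reasoning chain, then append an answer-extraction cue. The sketch below follows that two-stage recipe, with `llm` again standing in for a generic completion function.

```python
from typing import Callable

def zero_shot_cot(llm: Callable[[str], str], question: str) -> str:
    """Two-stage zero-shot chain-of-thought prompting."""
    # Stage 1: elicit the reasoning chain.
    reasoning_prompt = f"Question: {question}\nAnswer: Let's think step by step."
    reasoning = llm(reasoning_prompt)
    # Stage 2: extract the final answer conditioned on the generated reasoning.
    answer_prompt = f"{reasoning_prompt}{reasoning}\nSo the answer is"
    return llm(answer_prompt).strip()
```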
Tips and Extensions
  • Self-consistency sampling can improve reasoning accuracy by sampling a number of diverse answers and then taking the majority vote (Wang et al. 2022a);

  • Another ensembling approach is to alter the example order or to use model-generated rationales in place of human-written ones, introducing randomness across multiple sampling trials, and then aggregate the model outputs with a majority vote to obtain the final answer (Wang et al. 2022b);

  • If training examples only have ground-truth answers but no rationales, one can follow the STaR (Self-Taught Reasoner; Zelikman et al. 2022) method: (1) ask the LLM to generate reasoning chains and keep only those that lead to correct answers; (2) then fine-tune the model on the generated rationales and repeat the process until convergence. Note that higher temperatures are more likely to produce incorrect rationales that still arrive at correct answers. If the training examples lack ground-truth answers, one may consider using the majority vote as the "correct" answer;

  • Prompts with demonstrations of higher reasoning complexity can achieve better performance, where complexity is measured by the number of reasoning steps in the chain. When separating reasoning steps, the newline character "\n" works better than "Step i", periods, or semicolons (Fu et al. 2023);

  • Complexity-based consistency explicitly prefers complex chains among all generations by taking the majority vote among only the top k most complex chains (Fu et al. 2023); a minimal sketch appears below;

  • Shum et al. (2023) found in their experiments that CoT prompts containing only complex examples improve accuracy on complex questions but perform poorly on simple questions (as evidenced on GSM8k);

  • Changing Q: to Question: is found to be helpful (Fu et al. 2023);

  • Ye & Durrett (2022) found that for NLP tasks involving reasoning over text (i.e., QA and NLI), including explanations in prompts brings only small to moderate benefit, and the effect varies by model. They observed that explanations are more likely to be non-factual than inconsistent (i.e., failing to entail the prediction), and non-factual explanations are the most likely to lead to incorrect predictions;

  • Self-Ask (Press et al. 2022) is a method that repeatedly prompts the model to ask follow-up questions, building the reasoning process iteratively. The follow-up questions can be answered using search engine results. Similarly, IRCoT (Interleaving Retrieval CoT; Trivedi et al. 2022) and ReAct (Reason + Act; Yao et al. 2023) combine iterative CoT prompting with Wikipedia API queries to search for relevant entities and content, which are then added back into the context.


Figure 1. How Self-Ask works with external search queries (Image source: Press et al. 2022).
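Returning to the complexity-based consistency tip above, here is a minimal sketch: among sampled reasoning chains, vote only over the k chains with the most reasoning steps. Counting steps as non-empty lines is an assumption of this sketch, and `extract_answer` is a placeholder.

```python
from collections import Counter
from typing import Callable, List, Optional

def complexity_based_vote(chains: List[str],
                          extract_answer: Callable[[str], Optional[str]],
                          k: int = 5) -> Optional[str]:
    """Majority vote restricted to the top-k most complex (longest) reasoning chains."""
    def n_steps(chain: str) -> int:
        return len([line for line in chain.splitlines() if line.strip()])
    top_k = sorted(chains, key=n_steps, reverse=True)[:k]
    answers = [a for a in (extract_answer(c) for c in top_k) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```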
Automatic Prompt Design
Prompts are a series of prefix tokens that increase the probability of obtaining the desired output given certain input. Therefore, they can be viewed as trainable parameters and optimized directly in the embedding space through gradient descent, such as AutoPrompt [Shin et al., 2020], Prefix-Tuning (Li & Liang 2021), P-tuning [Liu et al., 2021], and Prompt-Tuning [Lester et al. 2021]. The trend from AutoPrompt to Prompt-Tuning is gradually simplifying the setup.
APE [Automatic Prompt Engineer; Zhou et al. 2022] is a method that searches a pool of model-generated candidate instructions, then filters the candidates based on the selected scoring function to finally select the highest scoring best candidate.

1. Prompt the LLM to generate candidate instructions from a small set of input-output demonstrations. For example:

{{Given desired input-output pairs}}

The instruction is

2. Given the dataset $D_{\text{train}} = \{(x, y)\}$, find an instruction ρ such that $\rho^{*} = \arg\max_{\rho} \mathbb{E}_{(x,y)\in D_{\text{train}}}\,[f(\rho, x, y)]$, where $f(\cdot)$ is a per-sample scoring function, such as execution accuracy $\mathbb{1}[\text{LM}(\cdot \mid \rho, x) = y]$ or log probability $p_{\text{LM}}(y \mid \rho, x)$;

3. Improve the best candidates by proposing semantically similar variants through an iterative Monte Carlo search, using a prompt such as:

Generate a variation of the following instruction while keeping the semantic meaning.

Input: …

Output: …
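The APE loop above can be sketched as follows: generate candidate instructions from demonstrations, score each candidate by execution accuracy on the training pairs, and keep the best. This is a simplified sketch with placeholder prompts and a generic `llm` function; the exact templates and scoring in the paper differ.

```python
from typing import Callable, List, Tuple

def ape_search(llm: Callable[[str], str],
               train_pairs: List[Tuple[str, str]],
               n_candidates: int = 20) -> str:
    """Toy APE loop: propose candidate instructions, score them, return the best one."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in train_pairs[:5])
    gen_prompt = f"{demos}\n\nThe instruction is"
    # Assumes the LLM samples stochastically, so repeated calls yield different candidates.
    candidates = {llm(gen_prompt).strip() for _ in range(n_candidates)}

    def execution_accuracy(instruction: str) -> float:
        # f(rho, x, y) = 1[LM(. | rho, x) == y], averaged over the training pairs
        hits = sum(int(llm(f"{instruction}\n\nInput: {x}\nOutput:").strip().lower() == y.lower())
                   for x, y in train_pairs)
        return hits / len(train_pairs)

    return max(candidates, key=execution_accuracy)
```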

To automatically construct CoT prompts, Shum et al. (2023) suggest an augment-prune-select process with three steps (the augment and prune steps are sketched after the list):

1. Augment: Use few-shot or zero-shot CoT prompts to generate multiple pseudo chains of thought for a given problem;

2. Prune: Prune the pseudo chains based on whether the generated answers match the ground truth;

3. Select: Apply a variance-reduced policy gradient strategy to learn a probability distribution over the selected examples, treating this distribution as the policy and the validation-set accuracy as the reward.
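The augment and prune steps above can be sketched like this (the select step, which learns a policy over examples, is omitted). `llm_sample` and `extract_answer` are placeholder functions, as before.

```python
from typing import Callable, List, Optional

def augment_and_prune(llm_sample: Callable[[str], str],
                      extract_answer: Callable[[str], Optional[str]],
                      question: str,
                      gold_answer: str,
                      n_chains: int = 8) -> List[str]:
    """Generate multiple pseudo reasoning chains and keep only those matching the gold answer."""
    prompt = f"Question: {question}\nAnswer: Let's think step by step."
    chains = [llm_sample(prompt) for _ in range(n_chains)]           # augment
    return [c for c in chains if extract_answer(c) == gold_answer]   # prune
```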

[Zhang et al. (2023)] used clustering techniques to sample questions and then generate chains. They observed that LLMs tend to make certain types of errors. One type of error may be similar in embedding space and thus grouped together. By sampling one or a few examples only from frequently erroneous clusters, it is possible to prevent over-representation of one error type and collect a diverse set of examples.

1. Question clustering: embedding questions and running k-means clustering methods;

2. Sample selection: select a representative set of questions from each cluster; i.e., one example from one cluster. Samples in each cluster are sorted by distance to the cluster centroid, with the closest samples being chosen first;

3. Generate reasoning chains: use zero-shot CoT to generate reasoning chains for the selected questions and construct few-shot prompts to run the reasoning.
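Steps 1 and 2 above can be sketched with scikit-learn's k-means: cluster precomputed question embeddings and pick, in each cluster, the question closest to the centroid. The embedding step itself is assumed to be done elsewhere (e.g., with SBERT).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_select(question_embs: np.ndarray, n_clusters: int = 8) -> list:
    """Return one representative question index per cluster (the one closest to its centroid)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(question_embs)
    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(question_embs[idx] - km.cluster_centers_[c], axis=1)
        selected.append(int(idx[np.argmin(dists)]))   # nearest-to-centroid sample
    return selected
```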

Augmented Language Models
Mialon et al. (2023) conducted a survey on augmented language models, covering multiple categories of language models augmented with reasoning skills and the ability to use external tools. Recommended reading.
Retrieval
We often need to complete tasks that require knowledge more recent than the model's pre-training cutoff, or knowledge from an internal or private knowledge base. In such cases, the model will not know the relevant context unless it is explicitly provided in the prompt. Many methods for open-domain question answering rely on first retrieving from a knowledge base and then incorporating the retrieved content into the prompt. The accuracy of this process depends on the quality of both the retrieval and the generation steps.
Lazaridou et al. (2022) studied how to use Google Search for document retrieval to augment LLMs. Given a question q, text is extracted from the 20 URLs returned by Google, yielding a set of documents. Because these documents are long, each one is split into paragraphs of 6 sentences, {p}. Paragraphs are ranked by TF-IDF cosine similarity between the evidence paragraphs and the query, and only the most relevant paragraphs are used in the prompt to produce an answer a.
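The paragraph-ranking step can be sketched with scikit-learn's TF-IDF vectorizer: split documents into 6-sentence paragraphs, vectorize the paragraphs and the query, and rank by cosine similarity. The naive sentence splitting below is only for illustration.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_paragraphs(documents: list, query: str,
                    sentences_per_par: int = 6, top_k: int = 3) -> list:
    """Split documents into 6-sentence paragraphs and rank them by TF-IDF similarity to the query."""
    paragraphs = []
    for doc in documents:
        sentences = re.split(r"(?<=[.!?])\s+", doc)   # naive sentence splitter
        for i in range(0, len(sentences), sentences_per_par):
            paragraphs.append(" ".join(sentences[i:i + sentences_per_par]))
    vectorizer = TfidfVectorizer().fit(paragraphs + [query])
    sims = cosine_similarity(vectorizer.transform([query]),
                             vectorizer.transform(paragraphs))[0]
    ranked = sorted(range(len(paragraphs)), key=lambda i: -sims[i])
    return [paragraphs[i] for i in ranked[:top_k]]
```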
To construct few-shot prompts for closed-book question answering, each demonstration uses the following format. Swapping the question with the evidence (creating a longer distance between question and answer) was found to consistently produce lower results across all datasets.
Evidence: ...
Question: ...
Answer: ...
The probability of answers is calculated in three ways:

1. RAG style: $p(a_i \mid q) = \sum_{i=1}^{n} p_{\text{tf-idf}}(p_i \mid q) \cdot p_{\text{LM}}(a_i \mid q, p_i)$, where $p_{\text{tf-idf}}(p_i \mid q)$ is the normalized cosine similarity between the TF-IDF representations of the paragraph and the question.

2. Noisy channel inference: $\dfrac{p_{\text{LM}}(q \mid a_i, p_i)\, p_{\text{LM}}(a_i \mid p_i)}{p_{\text{LM}}(q \mid p_i)}$

3. Product-of-Experts (PoE), which combines all the probabilities used above in addition to $p_{\text{LM}}(p_i \mid q)$.

According to their experiments on generation and classification tasks, the answer reranking scores rank as PoE > noisy channel > RAG. Among the individual probabilities, $p_{\text{LM}}(a \mid q, p_i)$ and $p_{\text{LM}}(q \mid p_i, a)$ provide the most information. $p_{\text{LM}}(q \mid p_i, a)$ captures how well the question can be explained by the LM given the evidence paragraph and the answer, and it can reliably be used to rerank candidate answers.
An observation made on the SituatedQA dataset for questions based on different dates is that, although the LM (with a pre-training cutoff date of 2020) can access the latest information through Google search, its performance on questions after 2020 still lags significantly behind those before 2020. This indicates some discrepancies or parameter conflicts between contextual information and the model’s internal knowledge.
Interestingly, even “internal retrieval” is beneficial, meaning generating knowledge about a topic before answering a question [Liu et al. 2022]. One can first use the following template to extract knowledge:
Generate some knowledge about the input. Examples:

Input: What type of water formation is formed by clouds?
Knowledge: Clouds are made of water vapor.

Input: {question}
Knowledge:
Then use the model-generated knowledge to further prompt the LM to obtain answers.
Programming Languages
Both PAL (Program-Aided Language Models; Gao et al. 2022) and PoT (Program of Thoughts prompting; Chen et al. 2022) ask the LLM to generate programming language statements to solve natural language reasoning problems, thereby offloading the solution steps to a runtime such as a Python interpreter. This setup decouples complex computation from reasoning and relies on an LLM with sufficiently good coding skills (a minimal sketch follows Figure 2).
Figure 2. Comparing CoT and PoT. (Image source: Chen et al. 2022).
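A minimal sketch of the PAL/PoT pattern above: ask the model to write Python for the problem, execute the generated code, and read off a variable such as `answer`. Executing model-generated code is unsafe in general; the bare `exec` here is purely illustrative, and `llm` is a placeholder.

```python
from typing import Callable

POT_TEMPLATE = (
    "Write Python code that solves the following problem and stores the result "
    "in a variable named `answer`.\n\nProblem: {question}\n\n# Python code:\n"
)

def solve_with_program(llm: Callable[[str], str], question: str):
    """Program-of-thoughts style: generate code, run it, and return the `answer` variable."""
    code = llm(POT_TEMPLATE.format(question=question))
    namespace: dict = {}
    exec(code, namespace)   # WARNING: never exec untrusted code outside a sandbox
    return namespace.get("answer")
```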
External APIs
TALM (Tool Augmented Language Models; Parisi et al. 2022) is a language model augmented with text-to-text API calls. The LM is guided to generate tool-call and tool-input text, conditioned on the task input text, in order to construct an API call request. When results become available, the specified tool API is called and the returned result is appended to the text sequence; the final output is then generated after the output token (a generic sketch of such a loop follows Figure 3).


Figure 3. The format of API calls in TALM (Image source: Parisi et al. 2022).
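Generically, this kind of tool use boils down to a loop: generate until a tool-call marker appears, run the tool, append its result, and continue generating. The marker strings and the `llm` / `tools` interfaces below are hypothetical and only sketch the pattern; they are not TALM's actual implementation.

```python
from typing import Callable, Dict

def tool_augmented_generate(llm: Callable[[str], str],
                            tools: Dict[str, Callable[[str], str]],
                            task_input: str,
                            max_steps: int = 5) -> str:
    """Sketch of a tool-call loop using hypothetical '|tool-call', '|result', '|output' markers."""
    text = task_input
    for _ in range(max_steps):
        text += llm(text)                      # generate a continuation
        if "|output" in text:                  # final answer marker reached
            break
        if text.count("|tool-call") > text.count("|result"):
            # most recent unanswered call, e.g. "|tool-call calculator 12 * 7"
            call = text.rsplit("|tool-call", 1)[1].strip()
            tool_name, _, tool_input = call.partition(" ")
            tool = tools.get(tool_name, lambda s: "unknown tool")
            text += f" |result {tool(tool_input)} "   # append the tool output and continue
    return text
```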
TALM uses self-play to iteratively bootstrap a dataset of tool-use examples and fine-tunes the LM on it. This iterative self-play mimics an RL process in which the LM acts as the policy network and is trained with a binary reward signal.


Figure 4. Self-play iteratively enhances model performance (Image source: Parisi et al. 2022).
Toolformer [Schick et al. 2023] is an LM that can use external tools through simple APIs, constructed through self-supervision, requiring only a few demonstrations for each API. Toolformer’s toolbox includes:
  • Calculator: helps the LM with its lack of precise math skills;

  • Question-answering system: helps with unfaithful content and hallucination;

  • Search engine: provides up-to-date information after the pre-training cutoff;

  • Translation system: improves performance on low-resource languages;

  • Calendar: makes the LM aware of the passage of time.

Figure 5. How Toolformer is constructed. (Image source: Schick et al. 2023).
The training process of Toolformer is as follows:
1. Prompt the LM to annotate API calls. A pre-trained LM is asked to annotate the dataset with potential API calls via few-shot examples of API usage. Formatting example:


Figure 6. How the dataset is annotated when calling APIs (Image source: Schick et al. 2023).
Each API call is represented as a tuple of (API name, corresponding input), $c = (a_c, i_c)$, and its corresponding result is denoted as $r$. API calls without and with results are annotated as follows, respectively:

$e(c) = \langle\text{API}\rangle\, a_c(i_c)\, \langle/\text{API}\rangle$
$e(c, r) = \langle\text{API}\rangle\, a_c(i_c) \to r\, \langle/\text{API}\rangle$

Candidate positions for API calls are sampled based on the probabilities $p_i = p_{\text{LM}}(\langle\text{API}\rangle \mid \text{prompt}(x), x_{1:i-1})$: the top k positions i whose probability exceeds a threshold are kept.
Then, potential API calls are sampled from the LM, with the sequence $[\text{prompt}(x), x_1, \dots, x_{i-1}, \langle\text{API}\rangle]$ as the prefix and $\langle/\text{API}\rangle$ as the suffix.
2. Filter annotations based on whether the API call aids the model in predicting future tokens. Self-supervised loss is used to determine which API calls provide actual assistance.
Execute each API call c_i to obtain the corresponding result r_i;
Calculate the weighted cross-entropy loss of the LM over the tokens $x_i, \dots, x_n$ when the model is prefixed with the prompt. Two versions are computed, one with the API result and one with an empty sequence ε:

$L_i^{+} = L_i(e(c_i, r_i))$
$L_i^{-} = \min\big(L_i(\varepsilon),\ L_i(e(c_i, \varepsilon))\big)$

Only API calls with $L_i^{-} - L_i^{+}$ above a threshold are kept, indicating that adding the API call and its result helps the model predict future tokens (a sketch of this criterion follows step 3).
3. Fine-tune the LM on the annotated dataset. The new training sequences are constructed as $x^{*} = x_{1:i-1},\ e(c_i, r_i),\ x_{i:n}$. The training data is a combination of the original dataset (for example, the CCNet subset used in the paper) and its augmented version.
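To make the filtering criterion from step 2 concrete, here is a minimal sketch. `lm_loss(prefix, targets)` is a placeholder for the weighted cross-entropy of the LM over the target tokens given a prefix; `e_call_no_result` and `e_call_with_result` correspond to $e(c_i, \varepsilon)$ and $e(c_i, r_i)$ above.

```python
from typing import Callable, List

def keep_api_call(lm_loss: Callable[[List[str], List[str]], float],
                  prefix_tokens: List[str],
                  target_tokens: List[str],
                  e_call_no_result: str,
                  e_call_with_result: str,
                  tau: float) -> bool:
    """Keep an API call only if adding it (with its result) lowers the loss by at least tau."""
    loss_plus = lm_loss(prefix_tokens + [e_call_with_result], target_tokens)      # L_i^+ = L_i(e(c_i, r_i))
    loss_minus = min(lm_loss(prefix_tokens, target_tokens),                       # L_i(empty sequence)
                     lm_loss(prefix_tokens + [e_call_no_result], target_tokens))  # L_i(e(c_i, epsilon))
    return (loss_minus - loss_plus) >= tau
```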
During inference, decoding continues until the model produces the “→” token, indicating that it expects the next response from the API call.
Currently, Toolformer does not support using tools within chains (i.e., using one tool’s output as another tool’s input) or interactively (i.e., adopting API responses after human selection). Both are directions for future model expansion.
Original link: https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
