Prompting 2

Q1. (MCQ) A model is given the following prompt:

Classify the text into neutral, negative, or positive.
Text: I think the vacation is okay.
Sentiment:

The model correctly outputs "Neutral." What enables the model to perform this task without any examples?

A) The model uses chain-of-thought reasoning internally to derive the answer
B) The model has learned the concept of "sentiment" during pre-training and instruction tuning
C) The prompt implicitly contains a one-shot example because it shows the format of the expected answer
D) The output indicator "Sentiment:" acts as a few-shot demonstration

Answer: B

A) — Incorrect. There's no evidence the model uses intermediate reasoning steps for a straightforward classification. CoT is a distinct technique involving explicit step-by-step reasoning, not what's happening in a simple zero-shot classification.
B) — Correct. Zero-shot capabilities work because the LLM already understands concepts like "sentiment" from its massive pre-training data. Instruction tuning and RLHF further improve these zero-shot abilities by aligning the model to follow instructions without needing demonstrations.
C) — Incorrect. Showing the format of the expected answer (the "Sentiment:" label) is an output indicator, not a demonstration. A one-shot example would require a complete input-output pair (a different text with its classification) before the actual query.
D) — Incorrect. An output indicator signals where and how the model should respond — it's a structural cue, not a demonstration. A few-shot demonstration requires actual solved examples, not just a label placeholder.
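
The prompt in Q1 can be sketched in code to make its three parts visible: an instruction, the input text, and an output indicator, with no demonstration anywhere. This is only an illustrative sketch; the helper name build_zero_shot_prompt and its parameters are assumptions, not part of the course material.

```python
def build_zero_shot_prompt(instruction: str, text: str,
                           output_indicator: str = "Sentiment:") -> str:
    """Join instruction, input, and output indicator; no demonstrations."""
    return f"{instruction}\nText: {text}\n{output_indicator}"


# Hypothetical helper applied to the exact prompt from Q1.
prompt = build_zero_shot_prompt(
    "Classify the text into neutral, negative, or positive.",
    "I think the vacation is okay.",
)
print(prompt)
```

The trailing "Sentiment:" line is the output indicator: it marks where the answer goes but pairs no input with a label, which is why it does not count as a demonstration.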

Q2. (MSQ — Select ALL that apply) Which of the following statements about zero-shot prompting, including what to do when it fails on a task, are accurate?

A) The recommended next step is to provide demonstrations in the prompt, transitioning to few-shot prompting
B) The recommended next step is to apply full fine-tuning to the model
C) Instruction tuning has been shown to improve zero-shot learning capabilities
D) RLHF has been adopted to scale instruction tuning and align models to human preferences
E) Adding "Let's think step by step" converts the failed zero-shot attempt into few-shot prompting

Answer: A, C, D

A) — Correct. When zero-shot doesn't work, the explicitly recommended approach is providing demonstrations or examples in the prompt, which is few-shot prompting.
B) — Incorrect. Full fine-tuning is never mentioned as the immediate next step for a failed zero-shot prompt. The escalation path goes from zero-shot → few-shot → CoT, not directly to model retraining.
C) — Correct. Instruction tuning (fine-tuning models on datasets described via instructions) has been shown to improve zero-shot learning.
D) — Correct. RLHF has been adopted to scale instruction tuning, aligning the model to better fit human preferences, which further improves zero-shot performance.
E) — Incorrect. Adding "Let's think step by step" creates zero-shot CoT prompting, not few-shot prompting. Few-shot requires actual input-output example pairs, not a reasoning trigger phrase.

Q3. (MCQ) In few-shot prompting, a researcher uses demonstrations where the labels are intentionally randomized (e.g., positive text labeled as "negative"). Surprisingly, the model still performs reasonably well. Which finding best explains this?

A) The model ignores the demonstrations entirely and relies on zero-shot capabilities
B) The format and label space specified by the demonstrations matter more than individual label correctness
C) Random labels act as a form of adversarial training that strengthens the model
D) The model uses chain-of-thought reasoning to self-correct the incorrect labels

Answer: B

A) — Incorrect. If the model ignored demonstrations entirely, there would be no difference between zero-shot and few-shot performance. Research shows that even random labels outperform no labels at all, proving the model does use the demonstrations.
B) — Correct. Research findings demonstrate that the label space and the distribution of input text specified by the demonstrations are both important, regardless of whether individual labels are correct. The format plays a key role — even random labels are much better than no labels, because they establish the structural template and output space.
C) — Incorrect. Adversarial training is a formal training-time technique involving gradient updates. Few-shot demonstrations operate at inference time and don't modify model weights.
D) — Incorrect. Standard few-shot prompting does not inherently invoke chain-of-thought reasoning. CoT is a separate, explicit technique. The model isn't "correcting" labels — it's leveraging the structural format of the demonstrations.
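
To make the Q3 setup concrete, the sketch below builds a few-shot prompt and can optionally replace each demonstration label with a random draw from the same label space. The function name, the example texts, and the seeded random generator are all assumptions made for illustration. Note what stays fixed when labels are randomized: the "Text: / Sentiment:" format and the set of allowed labels.

```python
import random

LABELS = ["positive", "negative", "neutral"]  # the label space shown to the model


def build_few_shot_prompt(demos, query, rng=None):
    """Format (text, label) demonstrations, then the unanswered query.

    If rng is given, each demonstration label is replaced by a random
    label from LABELS, mimicking the randomized-label experiment.
    """
    blocks = []
    for text, label in demos:
        shown = rng.choice(LABELS) if rng is not None else label
        blocks.append(f"Text: {text}\nSentiment: {shown}")
    blocks.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(blocks)


# Made-up demonstrations, used only to show the prompt layout.
demos = [("I loved the food.", "positive"), ("The room was dirty.", "negative")]
print(build_few_shot_prompt(demos, "The trip was fine.", rng=random.Random(0)))
```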

Q4. (MCQ) What is the primary limitation of standard few-shot prompting that motivated the development of chain-of-thought (CoT) prompting?

A) Few-shot prompting requires too many tokens, making it cost-prohibitive
B) Few-shot prompting cannot handle tasks that require multi-step reasoning
C) Few-shot prompting only works with models smaller than 10 billion parameters
D) Few-shot prompting fails entirely on classification tasks

Answer: B

A) — Incorrect. While few-shot prompting does consume tokens for demonstrations, token efficiency isn't cited as the primary limitation that drove CoT's development. Meta prompting addresses token efficiency concerns.
B) — Correct. Standard few-shot prompting works well for many tasks but is not a perfect technique for more complex reasoning tasks, specifically arithmetic, commonsense, and symbolic reasoning that require intermediate steps. CoT was popularized to address exactly these multi-step reasoning gaps.
C) — Incorrect. Few-shot properties actually emerged when models were scaled to sufficient size. Larger models are better at few-shot learning, not worse. There's no 10-billion-parameter ceiling.
D) — Incorrect. Few-shot prompting works well for classification tasks — the sentiment classification and word-usage examples demonstrate this. The limitation is specifically around complex reasoning, not classification.

Q5. (MCQ) Consider the following two prompt strategies:

Strategy A: "Let's think step by step. What is 17 × 24?"

Strategy B: "What is the underlying mathematical principle needed here? State the principle, then apply it to solve: What is 17 × 24?"

Which techniques do Strategy A and Strategy B represent, respectively?

A) Few-shot CoT and Step-Back Prompting
B) Zero-shot CoT and Step-Back Prompting
C) Step-Back Prompting and Zero-shot CoT
D) Zero-shot CoT and Meta Prompting

Answer: B

A) — Incorrect. Strategy A has no demonstrations or example pairs; it only adds a reasoning trigger phrase to the question. That's zero-shot CoT, not few-shot CoT. Few-shot CoT would include worked examples with reasoning chains.
B) — Correct. Strategy A uses the "Let's think step by step" trigger without any examples, which is the definition of zero-shot CoT. Strategy B asks the model to first identify the underlying principle before solving, which is Step-Back Prompting — abstracting to a higher level before addressing the specifics.
C) — Incorrect. The order is reversed. "Let's think step by step" is zero-shot CoT (sequential decomposition), not Step-Back. Asking for the "underlying principle" is Step-Back (abstraction), not CoT.
D) — Incorrect. Strategy B is not Meta Prompting. Meta Prompting focuses on the structural pattern of problems (syntax and format templates), not on extracting underlying domain principles. Step-Back Prompting asks for foundational knowledge; Meta Prompting asks for structural scaffolding.
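
The two strategies in Q5 differ only in the wrapper placed around the same question, which a short sketch can show. The function names are assumptions; the prompt wording is copied from the question. The closing assertion works through one principle a step-back answer might state (the distributive law), which is an illustrative choice rather than the only valid one.

```python
def zero_shot_cot(question: str) -> str:
    """Strategy A: a reasoning trigger phrase, no demonstrations."""
    return f"Let's think step by step. {question}"


def step_back(question: str) -> str:
    """Strategy B: ask for the governing principle before the solution."""
    return ("What is the underlying mathematical principle needed here? "
            f"State the principle, then apply it to solve: {question}")


question = "What is 17 × 24?"
print(zero_shot_cot(question))
print(step_back(question))

# Distributive law: 17 × 24 = 17 × (20 + 4) = 340 + 68 = 408
assert 17 * 24 == 17 * 20 + 17 * 4 == 408
```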

Q6. (MSQ — Select ALL that apply) Auto-CoT was developed to address a specific problem with manual chain-of-thought prompting. Which of the following correctly describe Auto-CoT's approach?

A) It hand-crafts optimal reasoning chains by having domain experts write demonstrations
B) It partitions questions into clusters and selects a representative question from each
C) It uses "Let's think step by step" to automatically generate reasoning chains for demonstrations
D) It applies simple heuristics like question length and number of reasoning steps to select demonstrations
E) It eliminates the need for any demonstrations by making CoT fully zero-shot

Answer: B, C, D

A) — Incorrect. Auto-CoT was specifically designed to eliminate manual effort. Hand-crafting by domain experts is the problem it solves, not its approach.
B) — Correct. Auto-CoT's Stage 1 is question clustering (partitioning questions into clusters), and Stage 2 is demonstration sampling (selecting a representative question from each cluster).
C) — Correct. Auto-CoT leverages LLMs with the "Let's think step by step" prompt (zero-shot CoT) to automatically generate reasoning chains for the selected representative questions.
D) — Correct. Simple heuristics such as question length (e.g., 60 tokens) and number of reasoning steps (e.g., 5 steps) are used to encourage the model to use simple and accurate demonstrations.
E) — Incorrect. Auto-CoT still produces demonstrations — it automates their creation rather than eliminating them. The result is still few-shot CoT with demonstrations; the difference is those demonstrations are machine-generated, not hand-crafted.
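
The two stages described for Q6 can be sketched as follows. This is a simplified illustration under several assumptions: the clustering is taken as already done (the clusters argument maps a cluster id to its questions, ordered from most to least representative), generate_rationale is a hypothetical stand-in for a zero-shot CoT call to an LLM, whitespace-separated words approximate tokens, and non-empty lines approximate reasoning steps.

```python
from typing import Callable, Dict, List, Tuple

MAX_QUESTION_TOKENS = 60   # heuristic from the text (approximated by word count)
MAX_REASONING_STEPS = 5    # heuristic from the text (approximated by line count)


def build_auto_cot_demos(
    clusters: Dict[int, List[str]],
    generate_rationale: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pick one (question, rationale) demonstration per cluster."""
    demos = []
    for cluster_id in sorted(clusters):
        for question in clusters[cluster_id]:
            if len(question.split()) > MAX_QUESTION_TOKENS:
                continue  # question too long, try the next candidate
            rationale = generate_rationale(
                f"Q: {question}\nA: Let's think step by step."
            )
            steps = [s for s in rationale.split("\n") if s.strip()]
            if len(steps) <= MAX_REASONING_STEPS:
                demos.append((question, rationale))
                break  # one demonstration per cluster keeps the set diverse
    return demos


def fake_llm(prompt: str) -> str:
    """Stub for illustration only: returns a long chain for one question."""
    if "long" in prompt:
        return "step\n" * 7
    return "First add.\nThen compare.\nThe answer follows."


clusters = {0: ["long chain question", "What is 2 + 2?"], 1: ["Is 7 odd?"]}
for q, r in build_auto_cot_demos(clusters, fake_llm):
    print(q, "->", r.replace("\n", " "))
```

The output is still a set of demonstrations for few-shot CoT; only their authorship has been automated.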

Q7. (MCQ) A researcher is building an automated LLM pipeline using Step-Back Prompting. The user asks: "Was Estella Leopold alive when the first atomic bomb was dropped?" What does the abstraction prompt (first step) generate?

A) A step-by-step breakdown of Estella Leopold's life timeline
B) A higher-level question: "When was Estella Leopold born, and when was the first atomic bomb dropped?"
C) The underlying historical principle governing nuclear weapon development
D) A chain-of-thought reasoning trace directly answering the original question

Answer: B

A) — Incorrect. A full life timeline would be an exhaustive chain-of-thought decomposition, not a step-back abstraction. Step-Back Prompting generates a question, not a timeline.
B) — Correct. The abstraction prompt generates a step-back question that extracts the factual prerequisites needed to answer the original query. By determining birth dates and event dates separately, the model can then combine these facts to answer the specific question in the grounding step.
C) — Incorrect. The underlying principle of nuclear weapon development is irrelevant to answering whether a specific person was alive at a specific time. Step-Back Prompting abstracts to the relevant foundational knowledge, not to tangentially related domain principles.
D) — Incorrect. The abstraction prompt does not answer the original question — that's the job of the second prompt (the grounding prompt). The first prompt only generates the higher-level question.

Q8. (MCQ) What is the fundamental difference between how Chain-of-Thought (CoT) and Step-Back Prompting handle a complex problem?

A) CoT uses few-shot examples while Step-Back uses zero-shot exclusively
B) CoT breaks a problem into smaller sequential steps while Step-Back abstracts the problem to a higher conceptual level
C) CoT requires two separate API calls while Step-Back can be done in a single prompt
D) CoT works only on mathematical problems while Step-Back works only on historical reasoning

Answer: B

A) — Incorrect. CoT has both few-shot and zero-shot variants. Step-Back can also be implemented in single-prompt or two-prompt formats. Neither is locked to a specific shot paradigm.
B) — Correct. CoT asks the model to decompose a problem into smaller sequential steps ("Let's think step by step" — horizontal decomposition). Step-Back asks the model to move upward to identify the underlying principle, law, or high-level context before solving ("What is the underlying principle here?" — vertical abstraction). The direction of reasoning is fundamentally different.
C) — Incorrect. This is backwards. Step-Back Prompting is often implemented as a two-prompt architecture (abstraction prompt → grounding prompt) in automated pipelines, while CoT is typically a single prompt. But both can be adapted to either format.
D) — Incorrect. Both techniques are domain-general. CoT applies to arithmetic, commonsense, and symbolic reasoning. Step-Back applies to STEM, historical reasoning, and complex logic. Neither is restricted to a single domain.

Q9. (MSQ — Select ALL that apply) Which of the following are key characteristics of Meta Prompting?

A) It prioritizes the format and pattern of problems over specific content
B) It uses detailed, content-rich examples as demonstrations
C) It draws from type theory to emphasize categorization and logical arrangement
D) It employs abstracted examples as frameworks illustrating structural patterns
E) It requires the model to identify underlying scientific laws before solving

Answer: A, C, D

A) — Correct. Meta Prompting is explicitly structure-oriented, prioritizing the format and pattern of problems and solutions over specific content.
B) — Incorrect. This describes few-shot prompting's content-driven approach. Meta Prompting deliberately minimizes specific content, using abstracted structural templates instead of detailed examples.
C) — Correct. Meta Prompting draws from type theory to emphasize the categorization and logical arrangement of components in a prompt.
D) — Correct. Meta Prompting employs abstracted examples as frameworks that illustrate the structure of problems and solutions without focusing on specific details.
E) — Incorrect. This describes Step-Back Prompting, which asks the model to identify underlying principles or laws. Meta Prompting focuses on structural syntax and patterns, not domain-specific principles.

Q10. (MCQ) A developer needs to reduce token costs while maintaining strong performance on a math benchmark. They currently use 5-shot prompting with fully worked examples consuming 2,000 tokens. Which technique would most directly address their token efficiency concern?

A) Zero-shot CoT
B) Step-Back Prompting
C) Meta Prompting
D) Auto-CoT

Answer: C

A) — Incorrect. Zero-shot CoT removes the examples from the input, but "Let's think step by step" can produce very long reasoning traces in the output, so total token usage may not fall. It is also not the technique for which token efficiency is listed as an explicit advantage over few-shot prompting.
B) — Incorrect. Step-Back Prompting can actually increase token usage because it requires stating underlying principles before solving. It optimizes reasoning quality, not token economy.
C) — Correct. Token efficiency is explicitly listed as an advantage of Meta Prompting over few-shot prompting. By replacing detailed content-rich examples with abstract structural templates, Meta Prompting reduces the number of tokens required while still guiding the model's problem-solving approach.
D) — Incorrect. Auto-CoT automates the selection of demonstrations but still includes full demonstrations with reasoning chains. It addresses the manual effort problem, not the token cost problem.

Q11. (MCQ) Chain-of-thought prompting is described as an "emergent ability." What does this mean in context?

A) It was deliberately programmed into models through instruction tuning
B) It arises naturally only in sufficiently large language models
C) It requires explicit fine-tuning on reasoning datasets to function
D) It works equally well on all model sizes

Answer: B

A) — Incorrect. Emergent abilities are not deliberately programmed — they arise as a byproduct of scale and training on diverse data. Instruction tuning is a separate enhancement that improves models after pre-training.
B) — Correct. The authors of the CoT paper claim that chain-of-thought is an emergent ability that arises with sufficiently large language models. It's not something that exists at all scales — smaller models fail to produce coherent reasoning chains.
C) — Incorrect. CoT doesn't require explicit fine-tuning on reasoning datasets. It works as a prompting technique at inference time, leveraging abilities that naturally emerged during pre-training at scale.
D) — Incorrect. This directly contradicts the emergent ability claim. The technique's effectiveness is scale-dependent — it works reliably only on models that have reached sufficient size.

Q12. (MCQ) A Step-Back Prompt for a physics problem instructs the model to: (1) identify the core physics law, (2) write down its equation, (3) solve the specific problem. In the ideal gas example, explicitly stating PV = nRT before calculating prevents what kind of failure?

A) The model hallucinating fictional physics laws
B) The model producing correct answers with circular or confused mathematical reasoning
C) The model refusing to answer due to insufficient context
D) The model misclassifying the problem domain

Answer: B

A) — Incorrect. The model typically doesn't invent fake physics laws for well-known problems. The failure mode is sloppy reasoning with correct concepts, not fabricated concepts.
B) — Correct. The example explicitly shows that without Step-Back Prompting, the model arrives at a numerically plausible answer but through "potentially confusing or circular math." Forcing the model to write the equation first commits it to a structured derivation rather than guessing the relationships between variables. The step-back prevents logical errors and circular reasoning, even when the final number might coincidentally be close.
C) — Incorrect. The model doesn't refuse the problem — it attempts an answer. The issue is reasoning quality, not refusal.
D) — Incorrect. The model recognizes it's a gas law problem in both cases. The problem isn't domain misclassification — it's sloppy execution within the correct domain.
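
The three-part structure in Q12 (name the law, write its equation, then substitute) can be mirrored in a small calculation. The numbers below are hypothetical and are not taken from the course's ideal gas example; the function name is likewise an assumption.

```python
R = 8.314  # molar gas constant in J/(mol*K), rounded


def pressure_ideal_gas(n_mol: float, temp_k: float, volume_m3: float) -> float:
    """Step 1: the governing law is the ideal gas law.
    Step 2: its equation is PV = nRT, so P = nRT / V.
    Step 3: substitute the given quantities.
    """
    return n_mol * R * temp_k / volume_m3


# Hypothetical inputs: 2 mol of gas at 300 K in a 0.05 m^3 container.
p = pressure_ideal_gas(n_mol=2.0, temp_k=300.0, volume_m3=0.05)
print(f"{p:.0f} Pa")
```

Writing the rearranged equation before substituting makes every variable's role explicit, which is the discipline the step-back instruction imposes on the model.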

Q13. (MSQ — Select ALL that apply) Meta Prompting shares a limitation with zero-shot prompting. Which of the following correctly describe this shared weakness and the traits behind it?

A) Both assume the LLM has innate knowledge of the specific task being addressed
B) Both may see performance deteriorate on unique and novel tasks
C) Both require labeled training datasets to function
D) Both minimize the influence of specific content-rich examples
E) Both are unable to handle classification tasks

Answer: A, B, D

A) — Correct. Meta Prompting explicitly assumes the LLM has innate knowledge of the task. Zero-shot prompting similarly relies on the model's pre-existing understanding without demonstrations. Both fail when the model lacks relevant pre-training exposure.
B) — Correct. Performance may deteriorate on more unique and novel tasks for both approaches, precisely because neither provides content-rich examples to guide the model on unfamiliar territory.
C) — Incorrect. Neither meta prompting nor zero-shot prompting requires labeled training datasets. Both are inference-time techniques that work without any training data. Labeled datasets are needed for prompt tuning or fine-tuning.
D) — Correct. Meta Prompting is described as having "zero-shot efficacy" because it minimizes the influence of specific examples — the same defining characteristic of zero-shot prompting.
E) — Incorrect. Both can handle classification tasks. The zero-shot sentiment classification example demonstrates this directly. The limitation is about novel tasks, not broad task categories.

Q14. (MCQ) In Auto-CoT, why is diversity of demonstrations emphasized as a key design principle?

A) Diverse demonstrations increase the total token count, which improves model performance
B) Diverse demonstrations reduce the risk that errors in any single auto-generated reasoning chain compound and mislead the model
C) Diverse demonstrations allow the model to bypass its context window limitations
D) Diverse demonstrations are required to trigger the emergent CoT ability in large models

Answer: B

A) — Incorrect. More tokens don't inherently improve performance — the "lost in the middle" effect from the earlier course material demonstrates the opposite. Diversity is about representational coverage, not token volume.
B) — Correct. Since Auto-CoT generates reasoning chains automatically, the process can still result in mistakes. Diversity of demonstrations mitigates the effects of these mistakes — if one chain has errors, diverse examples from different clusters prevent that single error from dominating the model's reasoning pattern.
C) — Incorrect. Diversity has no relationship to context window limitations. Both diverse and non-diverse demonstrations consume context window space equally.
D) — Incorrect. The emergent CoT ability is a function of model scale, not demonstration diversity. Diversity is an engineering choice to improve Auto-CoT's robustness against auto-generated errors.

Q15. (MCQ) A prompt instructs an LLM to: "Generate the optimal prompt structure for solving algebraic word problems. Show the template with placeholders, not a solved example." This is an instance of:

A) Zero-shot CoT
B) Few-shot prompting
C) Meta Prompting
D) Step-Back Prompting

Answer: C

A) — Incorrect. Zero-shot CoT would ask the model to solve a specific problem step by step. This prompt doesn't ask for problem-solving — it asks for a template.
B) — Incorrect. Few-shot prompting provides solved content-rich examples. This prompt explicitly avoids solved examples and requests a structural template with placeholders.
C) — Correct. This prompt asks the LLM to generate a prompt structure — focusing on the pattern, syntax, and format rather than specific content. It also illustrates that meta prompting can be achieved by instructing the LLM to generate a prompt itself, which is an explicitly mentioned application.
D) — Incorrect. Step-Back Prompting would ask for the underlying mathematical principle. This prompt asks for a structural template for how to approach problems, not the domain knowledge needed to solve them.
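
A structural template of the kind Q15 asks for might look like the sketch below. The template wording, the bracketed placeholder names, and the build_meta_prompt helper are all assumptions for illustration; the point is that the prompt carries structure and no solved example.

```python
# Hypothetical structural template: placeholders only, no worked content.
META_TEMPLATE = """Problem: [algebraic word problem]
Solution:
  Step 1: Define the variables: [variables]
  Step 2: Translate the statement into equations: [equations]
  Step 3: Solve the equations: [working]
  Final answer: [answer with units]"""


def build_meta_prompt(problem: str) -> str:
    """Attach a concrete problem to the structure-only template."""
    return (
        "Solve the problem using the structure below. "
        "Fill in every bracketed placeholder.\n\n"
        f"{META_TEMPLATE}\n\nProblem: {problem}"
    )


print(build_meta_prompt("Twice a number plus 3 equals 11. Find the number."))
```

Compared with a few-shot prompt, nothing here reveals how any particular problem was solved, so the model must supply the content itself.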

Q16. (MSQ — Select ALL that apply) Which of the following correctly distinguish Meta Prompting's advantages over few-shot prompting?

A) Meta Prompting provides a fairer comparison for benchmarking different models
B) Meta Prompting always produces more accurate results than few-shot prompting
C) Meta Prompting reduces the number of tokens required
D) Meta Prompting can be viewed as a form of zero-shot prompting
E) Meta Prompting eliminates all dependence on the model's pre-training knowledge

Answer: A, C, D

A) — Correct. By minimizing the influence of specific examples, Meta Prompting provides a fairer approach for comparing different problem-solving models. Few-shot examples can inadvertently favor models that have seen similar examples during training.
B) — Incorrect. The material never claims Meta Prompting is universally more accurate. It acknowledges performance may deteriorate on novel tasks. The advantage is structural efficiency and fairness, not guaranteed accuracy.
C) — Correct. Token efficiency is explicitly listed as an advantage — focusing on structure rather than detailed content reduces token requirements.
D) — Correct. Meta Prompting can be viewed as a form of zero-shot prompting in which the influence of specific examples is minimized.
E) — Incorrect. The exact opposite — Meta Prompting assumes the LLM has innate knowledge of the task. It depends heavily on pre-training knowledge because it provides structural scaffolding, not content.

Q17. (MCQ) In a Step-Back Prompting pipeline with two API calls, the first call generates a step-back question and the second call receives both the step-back answer and the original question. Why must the original question be passed to the second call as well?

A) Because the LLM has no memory between API calls and would otherwise forget the original task
B) Because the step-back question replaces the original question entirely
C) Because the second model is a different, specialized LLM that hasn't seen either question
D) Because passing both questions doubles the context window, improving attention

Answer: A

A) — Correct. LLMs are stateless between API calls — each call is independent with no memory of prior interactions. The grounding prompt must include both the abstract answer (from the step-back question) and the original specific question so the model can connect the foundational knowledge to the actual task. Without the original question, the model has no idea what specific problem it's supposed to solve.
B) — Incorrect. The step-back question doesn't replace the original — it supplements it. The whole point is to use the abstract answer as a foundation for answering the specific original question. If you only passed the step-back question, you'd get a general knowledge answer, not a targeted one.
C) — Incorrect. There's no mention of using different specialized models. The same LLM typically handles both prompts; the two-call architecture is about information flow, not model specialization.
D) — Incorrect. Doubling context doesn't inherently improve attention. In fact, excessive context can degrade attention (the "lost in the middle" effect). The reason is functional necessity — the model needs both pieces of information to produce the final answer.
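
The information flow in Q17 can be sketched with a stubbed model call. Here call_llm is a hypothetical callable standing in for any stateless LLM API, and the wording of both prompts is an assumption; in particular, this sketch has the first call both pose and answer the step-back question. Because nothing carries over between calls, the second prompt has to contain the background and the original question explicitly.

```python
from typing import Callable


def step_back_pipeline(original_question: str,
                       call_llm: Callable[[str], str]) -> str:
    """Two independent calls: abstraction, then grounding."""
    # Call 1 (abstraction): pose and answer the more general question.
    background = call_llm(
        "Step back from the question below. State the more general "
        "question behind it, then answer that general question.\n"
        f"Question: {original_question}"
    )
    # Call 2 (grounding): the call is stateless, so pass BOTH pieces.
    grounding_prompt = (
        f"Background: {background}\n"
        f"Original question: {original_question}\n"
        "Answer the original question using the background."
    )
    return call_llm(grounding_prompt)


calls = []


def stub_llm(prompt: str) -> str:
    """Stub for illustration: records prompts, returns placeholder text."""
    calls.append(prompt)
    return "BACKGROUND FACTS" if len(calls) == 1 else "FINAL ANSWER"


question = "Was Estella Leopold alive when the first atomic bomb was dropped?"
print(step_back_pipeline(question, stub_llm))
```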

Q18. (MCQ) A student uses the following prompt:

The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:

This prompt is best classified as:

A) Zero-shot CoT prompting
B) Few-shot prompting without chain-of-thought
C) Few-shot chain-of-thought prompting
D) Step-Back Prompting

Answer: C

A) — Incorrect. Zero-shot CoT uses a trigger phrase like "Let's think step by step" with no demonstrations. This prompt clearly contains a worked demonstration.
B) — Incorrect. Standard few-shot prompting would only show the input and the final answer (e.g., "False") without the intermediate reasoning. This demonstration explicitly shows the reasoning process: identifying odd numbers, summing them, and concluding — that's a chain-of-thought.
C) — Correct. This combines few-shot prompting (a worked demonstration is provided) with chain-of-thought (the demonstration includes intermediate reasoning steps: extracting odd numbers → computing the sum → evaluating the claim). This is exactly the few-shot CoT technique introduced in the original CoT research.
D) — Incorrect. Step-Back Prompting would first ask for the underlying mathematical principle (e.g., "What determines whether a sum is odd or even?"). This prompt jumps directly into a worked example, not a principle extraction.
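
The reasoning chain in the Q18 demonstration (extract the odd numbers, sum them, test the claim) can be checked directly. The helper name below is an assumption; the two number groups are the ones in the prompt.

```python
def odd_sum_is_even(numbers):
    """Return the odd numbers, their sum, and whether that sum is even."""
    odds = [n for n in numbers if n % 2 == 1]
    total = sum(odds)
    return odds, total, total % 2 == 0


demo_group = [4, 8, 9, 15, 12, 2, 1]
query_group = [15, 32, 5, 13, 82, 7, 1]

print(odd_sum_is_even(demo_group))   # ([9, 15, 1], 25, False)
print(odd_sum_is_even(query_group))  # ([15, 5, 13, 7, 1], 41, False)
```

The demonstration's stated answer (False, since 25 is odd) is consistent with this check, and the same three steps give False for the unanswered query as well.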