Prompting 3
Interactive Quiz - MCQs and MSQs
Q1. (MCQ) In Self-Consistency prompting, a model is asked: "When I was 6 my sister was half my age. Now I'm 70 how old is my sister?" Three reasoning paths produce answers of 67, 67, and 35. The final answer selected is 67. What mechanism determines this selection?
A) The model selects the answer with the highest internal confidence score B) The answer appearing most frequently across multiple sampled reasoning paths is chosen C) A separate evaluator model ranks the reasoning chains by logical validity D) The first answer generated is always preferred due to greedy decoding
Answer: B
- A) — Incorrect. Self-Consistency doesn't rely on an internal confidence score attached to individual outputs. It operates at the level of answer frequency across multiple samples, not per-output confidence metrics.
- B) — Correct. Self-Consistency samples multiple diverse reasoning paths and selects the most consistent (majority) answer. Two of three paths produced 67, forming a majority, so 67 becomes the final answer. This is essentially a voting mechanism across independently sampled reasoning chains.
- C) — Incorrect. No separate evaluator model is involved. The selection mechanism is based on agreement/consistency across the sampled outputs themselves, not external evaluation.
- D) — Incorrect. Self-Consistency was specifically proposed to replace naive greedy decoding. Greedy decoding follows a single path by taking the most probable token at each step. Self-Consistency deliberately samples multiple diverse paths and aggregates their answers, which is the opposite of greedy.
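The voting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `majority_answer` is a hypothetical helper name, and the three strings stand in for final answers already parsed out of independently sampled reasoning chains. (For the puzzle itself: at 6 the sister was 3, so she is 3 years younger, and 70 - 3 = 67.)

```python
from collections import Counter

def majority_answer(answers):
    """Return the most frequent answer and the share of votes it received."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Final answers extracted from three sampled reasoning paths (values from the question).
sampled_answers = ["67", "67", "35"]
final_answer, vote_share = majority_answer(sampled_answers)
print(final_answer, round(vote_share, 2))  # 67 0.67
```

No confidence score or evaluator model appears anywhere in the sketch; the only signal is how often each answer occurs.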
Q2. (MSQ — Select ALL that apply) Which of the following are explicitly stated benefits of prompt chaining?
A) Improved performance on complex tasks B) Increased transparency of the LLM application C) Elimination of hallucinations entirely D) Easier debugging of model responses at each stage E) Increased controllability and reliability
Answer: A, B, D, E
- A) — Correct. Prompt chaining is useful to accomplish complex tasks that an LLM might struggle with if prompted with a single very detailed prompt. Breaking them into subtasks improves performance.
- B) — Correct. Prompt chaining helps boost the transparency of your LLM application by making the intermediate steps visible.
- C) — Incorrect. Nowhere is it claimed that prompt chaining eliminates hallucinations entirely. It improves reliability, but hallucinations can still occur within individual subtask responses.
- D) — Correct. You can debug problems with model responses much more easily by analyzing and improving performance at each different stage.
- E) — Correct. Controllability and reliability are explicitly listed alongside transparency as key benefits.
Q3. (MCQ) In Directional Stimulus Prompting, a small tuneable policy LM generates hints to guide a larger LLM. What is the architectural relationship between these two models?
A) Both models are jointly fine-tuned on the same training data B) The policy LM is frozen and the large LLM's weights are updated based on the hints C) The policy LM is trained and optimized while the large LLM remains a black-box frozen model D) The large LLM generates hints that are fed back to the policy LM for refinement
Answer: C
- A) — Incorrect. The two models are not jointly trained. The architecture specifically separates a trainable small model from an untouched large model.
- B) — Incorrect. This reverses the relationship entirely. The policy LM is the one that gets trained; the large LLM is the one that stays frozen.
- C) — Correct. The tuneable policy LM is trained (using RL) to generate optimal stimulus/hints, while the large LLM remains a black-box frozen model that receives these hints as guidance. The key insight is that you optimize a small, cheap model to steer a large, expensive one without touching its weights.
- D) — Incorrect. The information flows in one direction: policy LM generates hints → large LLM uses them. There's no feedback loop from the large LLM back to the policy LM during inference.
Q4. (MCQ) Active-Prompt was designed to solve a specific shortcoming of standard CoT prompting. What is that shortcoming?
A) CoT prompting generates too many intermediate reasoning steps, increasing latency B) CoT prompting relies on a fixed set of human-annotated exemplars that may not be optimal for different tasks C) CoT prompting cannot handle arithmetic reasoning tasks D) CoT prompting requires the model to be fine-tuned before it can produce reasoning chains
Answer: B
- A) — Incorrect. The number of intermediate reasoning steps is not the problem Active-Prompt addresses. CoT's step count is a feature, not a bug — it enables reasoning.
- B) — Correct. Standard CoT methods rely on a fixed set of human-annotated exemplars, and the problem is that these exemplars might not be the most effective examples for different tasks. Active-Prompt solves this by dynamically selecting which questions need human annotation based on model uncertainty.
- C) — Incorrect. CoT prompting does handle arithmetic reasoning; arithmetic is one of the task families it was originally demonstrated on. Self-Consistency further boosts CoT's arithmetic performance. Active-Prompt's concern is exemplar selection, not task coverage.
- D) — Incorrect. CoT prompting works at inference time without any fine-tuning. Active-Prompt doesn't address a fine-tuning requirement because none exists.
Q5. (MCQ) APE (Automatic Prompt Engineer) discovered a zero-shot CoT prompt that outperformed the human-engineered "Let's think step by step." What was this automatically discovered prompt?
A) "Take a deep breath and work on this problem step by step." B) "Imagine three experts are answering this question collaboratively." C) "Let's work this out in a step by step way to be sure we have the right answer." D) "First, identify the core principle, then solve step by step."
Answer: C
- A) — Incorrect. "Take a deep breath" comes from OPRO, a different paper on using LLMs to optimize prompts. It is mentioned as related work in the APE article but is not APE's discovery.
- B) — Incorrect. This describes Hulbert's Tree-of-Thought Prompting technique, where multiple imaginary experts collaborate. It's from an entirely different framework.
- C) — Correct. APE discovered that "Let's work this out in a step by step way to be sure we have the right answer" elicits chain-of-thought reasoning and improves performance on MultiArith and GSM8K benchmarks, outperforming the original human-crafted phrase.
- D) — Incorrect. This resembles a Step-Back Prompting structure (identify principle first, then solve). APE's discovered prompt is purely about step-by-step reasoning with a confidence emphasis, not principle identification.
Q6. (MSQ — Select ALL that apply) In the Tree of Thoughts (ToT) framework applied to the Game of 24, which of the following correctly describe the implementation details?
A) The problem is decomposed into 3 steps, each involving an intermediate equation B) At each step, the best b=5 candidates are retained C) Each thought candidate is evaluated as "sure/maybe/impossible" D) Values are sampled once per thought to minimize computational cost E) Breadth-first search is used to explore the tree
Answer: A, B, C, E
- A) — Correct. The Game of 24 task requires decomposing thoughts into 3 steps, each involving an intermediate equation.
- B) — Correct. At each step, the best b=5 candidates are kept for further exploration.
- C) — Correct. The LM evaluates each thought candidate using a three-way classification: "sure" (correct partial solution), "maybe" (keep for further exploration), or "impossible" (eliminate based on commonsense like "too big/small").
- D) — Incorrect. Values are sampled 3 times for each thought, not once. Multiple samples improve the reliability of the evaluation, which is consistent with the framework's emphasis on deliberate reasoning.
- E) — Correct. BFS is explicitly described as the search strategy used for the Game of 24 task in ToT.
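The control flow described in these options (3 steps, keep the best b=5 candidates, value each candidate 3 times, breadth-first) can be sketched as follows. This is a toy under stated assumptions, not the paper's implementation: a brute-force `reachable` check stands in for the LM evaluator, the numeric weights in `WEIGHT` are made up for illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Assumed numeric weights for the three verdicts; illustrative only.
WEIGHT = {"sure": 20.0, "maybe": 1.0, "impossible": 0.001}

def expand(nums):
    """Yield (remaining_numbers, equation) for every way to combine two numbers."""
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = tuple(n for k, n in enumerate(nums) if k not in (i, j))
        results = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                   (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            results.append((a / b, f"{a} / {b}"))
        if a != 0:
            results.append((b / a, f"{b} / {a}"))
        for value, expr in results:
            yield rest + (value,), f"{expr} = {value}"

def reachable(nums):
    """Brute-force check: can the remaining numbers still be combined into 24?"""
    if len(nums) == 1:
        return nums[0] == 24
    return any(reachable(nxt) for nxt, _ in expand(nums))

def evaluate(nums):
    """Stand-in for the LM evaluator: returns sure / maybe / impossible."""
    if not reachable(nums):
        return "impossible"
    return "sure" if len(nums) <= 2 else "maybe"

def tot_bfs(start, steps=3, breadth=5, n_eval=3):
    """Breadth-first search that keeps the `breadth` best states per step."""
    frontier = [(tuple(Fraction(n) for n in start), [])]
    for _ in range(steps):
        candidates = []
        for nums, trace in frontier:
            for nxt, equation in expand(nums):
                # An LM evaluator would be sampled n_eval times; this stand-in is
                # deterministic, so repeated calls simply scale the score.
                score = sum(WEIGHT[evaluate(nxt)] for _ in range(n_eval))
                candidates.append((score, nxt, trace + [equation]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [(nums, trace) for _, nums, trace in candidates[:breadth]]
    for nums, trace in frontier:
        if nums == (Fraction(24),):
            return trace
    return None

print(tot_bfs((4, 9, 10, 13)))
```

With an LM in place of `evaluate`, the verdicts would be noisy, which is why sampling the value several times per thought matters there and is only a formality here.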
Q7. (MCQ) Generated Knowledge Prompting is applied to the question: "Part of golf is trying to get a higher point total than others. Yes or No?" Without the technique, the model answers "Yes." After generating knowledge and integrating it, the model answers "No." What cognitive limitation of LLMs does this technique specifically address?
A) The model's inability to follow multi-step instructions B) The model's lack of real-time information about current events C) The model's failure to surface relevant world knowledge it already possesses when directly questioned D) The model's tendency to copy the format of the question in its answer
Answer: C
- A) — Incorrect. The golf question is a single-step Yes/No question — there are no multi-step instructions to follow. The failure is factual, not procedural.
- B) — Incorrect. Golf's scoring rules aren't real-time information — they're stable, timeless facts. The model has this knowledge in its parameters; it simply fails to activate it appropriately when asked directly.
- C) — Correct. The model already knows golf's scoring rules (as demonstrated when it successfully generates correct knowledge about golf in the knowledge generation step). The problem is that a direct question doesn't trigger the model to surface this relevant knowledge. Generated Knowledge Prompting explicitly forces the model to retrieve and articulate its knowledge before answering, bridging the gap between what the model knows and what it applies.
- D) — Incorrect. The model doesn't answer "Yes" because of format mimicry. It answers "Yes" because the phrase "higher point total" sounds intuitively positive across most sports contexts, and the model fails to apply golf-specific knowledge that contradicts this default association.
Q8. (MCQ) In ART (Automatic Reasoning and Tool-use), what happens during test time when the model encounters a point where an external tool needs to be called?
A) The model generates a simulated tool output from its training data B) The model skips the tool call and continues generating based on its internal knowledge C) Generation pauses, the external tool is executed, and its output is integrated before generation resumes D) A separate orchestrator model decides whether to call the tool or continue generating
Answer: C
- A) — Incorrect. ART does not simulate tool outputs — it actually executes external tools. Simulating would defeat the purpose of tool integration, which is to inject real, accurate external data.
- B) — Incorrect. Skipping tool calls would reduce ART to standard CoT prompting. The interleaving of tool use with reasoning is ART's core differentiator.
- C) — Correct. ART pauses generation whenever external tools are called, integrates their output, and then resumes generation. This interleaved pause-execute-resume pattern is the fundamental mechanism that combines reasoning with tool use.
- D) — Incorrect. There's no separate orchestrator model. The frozen LLM itself generates the reasoning steps including tool call points. The framework handles the pause/resume mechanics, but the decision of when to call tools emerges from the model's reasoning.
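The pause-execute-resume pattern described above can be sketched as a small loop. Everything here is an assumption made for illustration: the `[tool_name: argument]` call syntax, the `TOOL_LIBRARY` dictionary, and `scripted_llm`, which is a hard-coded stand-in for the frozen LLM so that the example runs without a model.

```python
import re

# Hypothetical tool library: a new tool is added by registering another entry.
TOOL_LIBRARY = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

# Assumed call syntax: "[tool_name: argument]" at the end of a generated chunk.
TOOL_CALL = re.compile(r"\[(\w+):\s*([^\]]+)\]\s*$")

def scripted_llm(transcript):
    """Stand-in for the frozen LLM: returns the next chunk of generated text."""
    if "->" not in transcript:
        return "I need to add the two numbers. [add: 17+25]"
    result = transcript.rsplit("-> ", 1)[1]
    return f"\nSo the answer is {result}."

def generate_with_tools(prompt, llm, tools, max_calls=5):
    """Alternate between generation and real tool execution."""
    transcript = prompt
    for _ in range(max_calls + 1):
        chunk = llm(transcript)
        transcript += chunk
        call = TOOL_CALL.search(chunk)
        if call is None:
            return transcript               # no pending tool call: generation is done
        name, arg = call.groups()
        output = tools[name](arg.strip())   # pause generation and run the actual tool
        transcript += f" -> {output}"       # integrate the output, then resume
    return transcript

result = generate_with_tools("Q: What is 17 plus 25?\n", scripted_llm, TOOL_LIBRARY)
print(result)
```

The model is never retrained in this loop; supporting another tool only means adding a key to `TOOL_LIBRARY`.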
Q9. (MCQ) The two primary ToT papers (Yao et al. and Long) both use tree search but differ in one critical design choice. What is that difference?
A) Yao et al. uses BFS/DFS/beam search while Long uses a ToT Controller trained through reinforcement learning B) Yao et al. uses a single prompt while Long uses multi-round conversations C) Yao et al. works only on math tasks while Long works only on language tasks D) Yao et al. requires human evaluation while Long is fully automated
Answer: A
- A) — Correct. Yao et al. leverages generic search algorithms (DFS/BFS/beam search) that have no task-specific adaptation. Long proposes a "ToT Controller" trained through reinforcement learning, which can learn from new data or self-play (analogous to AlphaGo vs. brute force search) and continue to evolve — a fundamental architectural distinction.
- B) — Incorrect. Both approaches enhance LLM capability through tree search via multi-round conversation. The multi-round aspect is shared, not a differentiator.
- C) — Incorrect. Neither paper is restricted to a single task domain by design. Both are presented as general problem-solving frameworks.
- D) — Incorrect. Neither paper requires human evaluation as a core mechanism. The LM self-evaluates intermediate thoughts in both approaches.
Q10. (MSQ — Select ALL that apply) Which of the following are true about ART's extensibility and generalization capabilities?
A) Humans can fix mistakes in reasoning steps by updating the task library B) New tools can be added by simply updating the tool library C) ART encourages zero-shot generalization to new tasks from demonstrations D) ART requires re-training the LLM whenever a new tool is added E) ART substantially improves over few-shot prompting on unseen tasks in BigBench and MMLU
Answer: A, B, C, E
- A) — Correct. ART is extensible — humans can fix mistakes in the reasoning steps by simply updating the task library, without modifying the model.
- B) — Correct. Adding new tools requires only updating the tool library. No retraining or architectural changes needed.
- C) — Correct. ART encourages the model to generalize from demonstrations to decompose new tasks and use tools in appropriate places, in a zero-shot fashion.
- D) — Incorrect. ART uses a frozen LLM. Adding new tools requires updating the tool library, not retraining the model. This is a core design advantage.
- E) — Correct. ART substantially improves over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks.
Q11. (MCQ) In the prompt chaining example for Document QA, why is the task split into two prompts (extract quotes first, then answer from quotes) rather than using a single prompt that extracts and answers simultaneously?
A) A single prompt would exceed the model's context window B) Two prompts are always faster than one due to parallel processing C) Splitting into subtasks increases transparency, controllability, and makes debugging easier at each stage D) The first prompt uses a different model than the second prompt
Answer: C
- A) — Incorrect. Both prompts still include the full document ({{document}}), so the context window usage is comparable. The split isn't driven by context window limitations.
- B) — Incorrect. The two prompts run sequentially (the second depends on the first's output), not in parallel. Prompt chaining is inherently serial.
- C) — Correct. The core benefits of prompt chaining are transparency, controllability, and reliability. By separating extraction from answer generation, you can inspect the intermediate output (extracted quotes), verify it's correct, debug each stage independently, and ensure the final answer is grounded in specific passages rather than the model's general knowledge.
- D) — Incorrect. While different models could be used, the example uses the same model (gpt-4) for both prompts. The technique's value comes from task decomposition, not model switching.
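A two-prompt Document QA chain along these lines might look like the sketch below. The prompt wording, the function names, and `fake_llm` (a canned stand-in so the example runs without a model) are all assumptions; the point is only that the extracted quotes exist as an inspectable intermediate value.

```python
def extract_quotes_prompt(document, question):
    """Prompt 1: ask only for quotes relevant to the question."""
    return (
        "Extract quotes from the document that are relevant to the question.\n"
        f"Question: {question}\n####\n{document}\n####\n"
        "Reply with the quotes inside <quotes></quotes>."
    )

def answer_prompt(document, quotes, question):
    """Prompt 2: answer using the quotes produced by prompt 1."""
    return (
        "Answer the question using the quotes and the document.\n"
        f"Question: {question}\n####\n{document}\n####\n{quotes}"
    )

def document_qa(document, question, llm):
    quotes = llm(extract_quotes_prompt(document, question))   # stage 1
    answer = llm(answer_prompt(document, quotes, question))   # stage 2 depends on stage 1
    return {"quotes": quotes, "answer": answer}

calls = []  # records every prompt so each stage can be inspected afterwards

def fake_llm(prompt):
    """Canned stand-in for a real model call."""
    calls.append(prompt)
    if prompt.startswith("Extract quotes"):
        return "<quotes>The library opens at 9 am.</quotes>"
    return "The library opens at 9 am."

document = "The library opens at 9 am. It closes at 5 pm."
result = document_qa(document, "When does the library open?", fake_llm)
print(result["quotes"])
print(result["answer"])
```

Because stage 2 needs the output of stage 1, the two calls cannot run in parallel, and the full document appears in both prompts.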
Q12. (MCQ) In Active-Prompt, after the LLM generates k possible answers for each training question, how are the questions that need human annotation selected?
A) The questions with the highest average confidence scores are selected B) The questions with the most disagreement among the k generated answers are selected C) Questions are selected randomly to ensure unbiased annotation D) The questions where all k answers agree but are incorrect are selected
Answer: B
- A) — Incorrect. High confidence suggests the model's answers are already stable on those questions, so annotating them adds the least. Active-Prompt targets uncertainty, not confidence.
- B) — Correct. An uncertainty metric based on disagreement among the k generated answers is calculated. The most uncertain questions (those with the most disagreement) are selected for human annotation. This is an active learning strategy — annotate where the model struggles most.
- C) — Incorrect. Random selection would waste human annotation effort on questions the model can already handle. Active-Prompt's entire innovation is targeted selection based on uncertainty.
- D) — Incorrect. If all k answers agree, that represents low uncertainty regardless of correctness. The metric measures disagreement among outputs, not agreement-with-correctness. You can't know they're incorrect without human annotation, which is precisely what the uncertainty metric aims to prioritize.
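The selection step can be sketched with one simple disagreement measure: the number of distinct answers divided by k. The helper names and the toy answers below are hypothetical, and this is only one way to turn "disagreement" into a number.

```python
def disagreement(answers):
    """Fraction of distinct answers among the k samples (higher = more uncertain)."""
    return len(set(answers)) / len(answers)

def select_for_annotation(sampled, budget):
    """Return the `budget` questions whose sampled answers disagree the most."""
    ranked = sorted(sampled, key=lambda q: disagreement(sampled[q]), reverse=True)
    return ranked[:budget]

# k = 4 sampled answers per training question (made-up values).
sampled = {
    "q1": ["12", "12", "12", "12"],   # full agreement: 1 distinct / 4
    "q2": ["7", "9", "7", "8"],       # 3 distinct / 4
    "q3": ["3", "5", "4", "6"],       # 4 distinct / 4
}
print(select_for_annotation(sampled, budget=2))  # ['q3', 'q2']
```

Note that `q1` is ranked last whether or not "12" is correct: the measure sees only agreement among outputs, never ground truth.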
Q13. (MCQ) Self-Consistency is described as replacing "naive greedy decoding" in CoT prompting. What does greedy decoding produce that Self-Consistency improves upon?
A) Greedy decoding generates multiple reasoning paths and picks the longest one B) Greedy decoding generates a single reasoning path by always selecting the most probable next token, which may not lead to the best final answer C) Greedy decoding randomly samples tokens, producing inconsistent outputs D) Greedy decoding generates answers without any reasoning steps
Answer: B
- A) — Incorrect. Greedy decoding produces a single path, not multiple paths. Generating multiple paths is exactly what Self-Consistency introduces as an improvement.
- B) — Correct. Greedy decoding always picks the highest-probability next token at each step, producing one deterministic reasoning chain. This single chain may happen to follow an incorrect reasoning path. Self-Consistency improves on this by sampling multiple diverse reasoning paths (using temperature or nucleus sampling) and aggregating their answers through majority voting.
- C) — Incorrect. Greedy decoding is the opposite of random — it's deterministic, always selecting the most probable token. Random sampling is what Self-Consistency uses instead to generate diverse paths.
- D) — Incorrect. Greedy decoding can absolutely produce reasoning steps when combined with CoT prompting. The issue isn't the absence of reasoning, but the reliance on a single reasoning path.
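The contrast between greedy decoding and sampling can be shown on a single toy next-token distribution. The tokens and probabilities are made up, and the helper names are hypothetical; temperature is applied by raising each probability to the power 1/T, which is equivalent to dividing the logits by T.

```python
import random

# Toy next-token distribution; tokens and probabilities are made up.
NEXT_TOKEN_PROBS = {"67": 0.5, "35": 0.3, "73": 0.2}

def greedy_pick(probs):
    """Greedy decoding: always take the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, temperature, rng):
    """Temperature sampling: draw a token in proportion to p ** (1 / T)."""
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(list(probs), weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_runs = {greedy_pick(NEXT_TOKEN_PROBS) for _ in range(200)}
sampled_runs = {sample_pick(NEXT_TOKEN_PROBS, 1.0, rng) for _ in range(200)}
print(greedy_runs)   # {'67'}
print(sampled_runs)  # more than one distinct token
```

Repeating the greedy pick never yields a second path, which is why Self-Consistency needs sampling to obtain diverse reasoning chains to vote over.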
Q14. (MCQ) In Generated Knowledge Prompting, two different knowledge statements are generated about golf. Knowledge 1 leads to a confident correct answer ("No"), while Knowledge 2 leads to an incorrect answer ("Yes") with lower confidence. What does this demonstrate?
A) Generated Knowledge Prompting is unreliable and should not be used B) The quality and framing of generated knowledge directly impacts the model's final prediction, requiring careful selection or aggregation C) Only the first generated knowledge should ever be used D) The model cannot understand golf under any circumstances
Answer: B
- A) — Incorrect. The technique still produces a correct answer when appropriate knowledge is generated. One failure path doesn't invalidate the approach — it highlights the need for knowledge selection or aggregation strategies.
- B) — Correct. The example demonstrates that differently framed knowledge statements can lead to different conclusions with different confidence levels. This implies that arriving at the final answer requires additional steps (selection, voting, or confidence-based filtering) to handle variation in generated knowledge quality. The source material notes there are "more details to consider when arriving at the final answer."
- C) — Incorrect. There's no principled reason to prefer the first knowledge over the second. The ordering is arbitrary. What matters is the content quality and the selection mechanism, not generation order.
- D) — Incorrect. The model demonstrably understands golf's scoring rules — it generated correct knowledge about lowest-score-wins in both knowledge statements. The issue is how that knowledge interacts with the question during the integration step.
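One possible selection rule, keeping the prediction made with the highest confidence, can be sketched as below. The confidence numbers are invented for illustration (the question only says that Knowledge 1 gave a more confident answer than Knowledge 2), and `pick_most_confident` is a hypothetical helper, not a function from the paper.

```python
def pick_most_confident(predictions):
    """Keep the knowledge-augmented prediction with the highest confidence."""
    return max(predictions, key=lambda p: p["confidence"])

# One prediction per generated knowledge statement; confidences are made up.
predictions = [
    {"knowledge": "Knowledge 1", "answer": "No", "confidence": 0.9},
    {"knowledge": "Knowledge 2", "answer": "Yes", "confidence": 0.6},
]

chosen = pick_most_confident(predictions)
print(chosen["knowledge"], chosen["answer"])  # Knowledge 1 No
```

Majority voting over several knowledge statements would be an equally reasonable aggregation; either way, generation order plays no role in the choice.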
Q15. (MSQ — Select ALL that apply) Hulbert's Tree-of-Thought Prompting simplification uses the following prompt: "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave." Which core concepts from the full ToT framework does this single-prompt technique preserve?
A) Multiple parallel reasoning paths explored simultaneously B) Self-evaluation of intermediate thoughts with elimination of flawed paths C) Use of formal search algorithms like BFS and DFS D) Deliberate step-by-step progression through intermediate reasoning E) A trained reinforcement learning controller
Answer: A, B, D
- A) — Correct. Three experts reasoning simultaneously represent multiple parallel thought paths being explored — analogous to maintaining multiple candidate branches in the tree.
- B) — Correct. "If any expert realises they're wrong at any point then they leave" directly mirrors the self-evaluation mechanism where the LM assesses intermediate thoughts and eliminates flawed branches ("impossible" verdicts in the full framework).
- C) — Incorrect. Formal search algorithms (BFS, DFS, beam search) require programmatic control over multiple API calls and candidate management. A single prompt cannot implement these algorithms — it only approximates the exploratory spirit.
- D) — Correct. "Write down 1 step... then share... then go on to the next step" preserves the deliberate, step-by-step progression through intermediate reasoning that characterizes ToT.
- E) — Incorrect. The RL-trained ToT Controller is from Long's paper and requires a trained model component. A simple prompt technique cannot replicate a learned controller.
Q16. (MCQ) APE frames the problem of finding optimal instructions as:
A) A supervised learning problem with labeled prompt-response pairs B) A natural language synthesis problem addressed as a black-box optimization using LLMs C) A reinforcement learning problem with human feedback as the reward signal D) A gradient-based search over continuous prompt embeddings
Answer: B
- A) — Incorrect. APE doesn't use labeled prompt-response pairs in a supervised learning framework. It uses LLMs to generate candidate instructions and evaluates them, which is an optimization approach, not supervised learning.
- B) — Correct. APE frames instruction generation as natural language synthesis addressed as a black-box optimization problem, using LLMs to generate and search over candidate solutions. The LLM acts as both the generator and (indirectly) the evaluator.
- C) — Incorrect. While RLHF is mentioned in other contexts, APE doesn't use human feedback as a reward signal. It uses computed evaluation scores on a target model to select the best instruction.
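- D) — Incorrect. Gradient-based optimization of continuous prompt embeddings describes Prompt Tuning and Prefix Tuning; AutoPrompt also relies on gradients, but it searches over discrete tokens. These are listed as separate related work. APE operates in discrete natural language space and treats the model as a black box, with no gradients involved.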
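The black-box framing can be sketched as "score each candidate instruction on a small evaluation set and keep the best one". In APE the candidates are proposed by an LLM; here they are a fixed list, and `toy_model` is a deliberately contrived stand-in for the target model, so every name and behavior below is an assumption made for illustration.

```python
def accuracy(instruction, model, dataset):
    """Score an instruction by how often the target model answers correctly."""
    hits = sum(model(instruction, question) == answer for question, answer in dataset)
    return hits / len(dataset)

def select_instruction(candidates, model, dataset):
    """Black-box search: only scores are used, no gradients or model weights."""
    return max(candidates, key=lambda c: accuracy(c, model, dataset))

def toy_model(instruction, question):
    """Contrived target model: it only adds correctly when told to go step by step."""
    a, b = (int(x) for x in question.split("+"))
    return a + b if "step by step" in instruction else a

dataset = [("2+3", 5), ("10+4", 14), ("7+7", 14)]
candidates = [
    "Answer the question.",
    "Let's work this out in a step by step way to be sure we have the right answer.",
]

best = select_instruction(candidates, toy_model, dataset)
print(best)
```

The search never inspects the model's internals, which is what separates this framing from supervised learning on labeled pairs or gradient-based prompt optimization.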
Q17. (MCQ) A developer builds an LLM pipeline where the first prompt extracts key entities from a customer email, the second prompt classifies the customer's intent using those entities, and the third prompt drafts a response based on the classified intent. The second prompt fails to classify correctly. Where should the developer focus debugging?
A) Only at the third prompt, since that's where the final output is generated B) At the interface between the first and second prompts, inspecting whether the entity extraction output was suitable input for classification C) At the model's pre-training data, since classification failure indicates a fundamentally incapable model D) Nowhere — the developer should replace all three prompts with a single comprehensive prompt
Answer: B
- A) — Incorrect. The third prompt is downstream of the failure. If classification is wrong, the response draft will be wrong regardless of how good the third prompt is. Debugging the effect without fixing the cause is futile.
- B) — Correct. This exemplifies prompt chaining's core debugging advantage: you can inspect intermediate outputs at each stage. The developer should examine whether Prompt 1's entity extraction output was correct and complete, and whether Prompt 2's classification instructions properly handle that input format. The chain's transparency lets you pinpoint exactly where the failure occurs.
- C) — Incorrect. Classification failure in a chain is far more likely due to prompt design or intermediate output quality than fundamental model incapacity. The whole point of prompt chaining is to simplify each subtask to a level the model can handle.
- D) — Incorrect. Replacing with a single prompt is the opposite of what prompt chaining advocates. Complex tasks combined into one prompt are precisely what the model "might struggle to address if prompted with a very detailed prompt."
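The debugging workflow in this scenario can be sketched by recording every stage's output and checking each one in order. The three stage functions are hard-coded stubs standing in for LLM calls, the checks are hypothetical validators a developer might write, and the extraction stub is deliberately buggy so the failure has a cause to find.

```python
def run_chain(stages, initial):
    """Run the stages in order and keep every intermediate output."""
    trace, value = [], initial
    for name, stage in stages:
        value = stage(value)
        trace.append((name, value))
    return trace

def first_failing_stage(trace, checks):
    """Return the name of the earliest stage whose output fails its check."""
    for name, output in trace:
        if not checks[name](output):
            return name
    return None

# Stubs standing in for the three prompts.
def extract_entities(email):
    return {"product": "router"}            # bug: the order number is missed

def classify_intent(entities):
    return "refund_request" if "order_id" in entities else "unknown"

def draft_reply(intent):
    return f"Reply for intent: {intent}"

stages = [("extract", extract_entities),
          ("classify", classify_intent),
          ("draft", draft_reply)]
checks = {
    "extract": lambda out: {"product", "order_id"} <= set(out),
    "classify": lambda out: out != "unknown",
    "draft": lambda out: bool(out),
}

trace = run_chain(stages, "My router from order 1182 is broken, I want a refund.")
print(first_failing_stage(trace, checks))  # extract
```

The classification looks like the broken step from the outside, but the trace shows its input was already incomplete, which is why the interface between the first two prompts is the place to look.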
Q18. (MCQ) An RL-based ToT Controller (Long, 2023) has a key advantage over generic search strategies (BFS/DFS) used by Yao et al. What is this advantage?
A) The RL controller is computationally cheaper than BFS/DFS B) The RL controller can continue to learn and evolve from new data even with a fixed LLM C) The RL controller eliminates the need for the LLM to self-evaluate intermediate thoughts D) The RL controller works without requiring any tree structure
Answer: B
- A) — Incorrect. RL training is computationally expensive — often more so than running BFS/DFS. The advantage isn't cost; it's adaptability.
- B) — Correct. Generic search strategies (BFS/DFS/beam search) have no adaptation to specific problems. An RL-based ToT Controller can learn from new datasets or through self-play (analogous to AlphaGo vs. brute force search), meaning the system can continue to evolve and learn new knowledge even with a fixed LLM.
- C) — Incorrect. Self-evaluation of intermediate thoughts is a core property of the ToT framework regardless of the search strategy. The RL controller decides when to backtrack and by how many levels — it doesn't replace the evaluation mechanism.
- D) — Incorrect. The RL controller still operates within the tree structure. It manages the search strategy within the tree (when to backtrack, how far), not the tree's existence.
Q19. (MSQ — Select ALL that apply) Which of the following correctly describe how Generated Knowledge Prompting works as a two-step process?
A) Step 1: The model generates relevant knowledge or facts about the topic B) Step 1: The model directly answers the question, then verifies in Step 2 C) Step 2: The generated knowledge is integrated into the prompt alongside the original question to produce a final prediction D) Step 2: A human expert reviews and corrects the generated knowledge before it's used E) The technique is especially helpful for commonsense reasoning tasks
Answer: A, C, E
- A) — Correct. The first step generates knowledge — factual statements relevant to the question. In the golf example, the model generates detailed knowledge about how golf scoring works.
- B) — Incorrect. The model does not answer the question first. The whole point is that direct answering fails (the model answers "Yes" incorrectly). Knowledge generation must happen before prediction.
- C) — Correct. The second step integrates the generated knowledge with the original question in a new prompt, allowing the model to make a grounded prediction informed by the explicitly stated facts.
- D) — Incorrect. No human review step is described. The process is fully automated — the same or another LLM generates and uses the knowledge. Human involvement would defeat the technique's scalability.
- E) — Correct. The paper specifically investigates how helpful this technique is for tasks such as commonsense reasoning, which is exactly where direct prompting fails (e.g., the golf scoring misconception).
Q20. (MCQ) A researcher wants to improve CoT performance but doesn't know which examples will be most effective for a new task. They have access to a pool of training questions but limited human annotation budget. Which technique is specifically designed for this scenario?
A) Auto-CoT B) Self-Consistency C) Active-Prompt D) APE
Answer: C
- A) — Incorrect. Auto-CoT automates the generation of reasoning chains and selects diverse questions via clustering, but it doesn't incorporate a mechanism for targeted human annotation based on model uncertainty. It aims to eliminate manual effort, not optimize a limited annotation budget.
- B) — Incorrect. Self-Consistency improves CoT by sampling multiple reasoning paths at inference time. It doesn't address the selection of training exemplars at all — it assumes exemplars are already chosen.
- C) — Correct. Active-Prompt is specifically designed for this scenario: limited annotation budget + unknown optimal exemplars. It queries the model, measures uncertainty (disagreement across k answers), and selects the most uncertain questions for human annotation. This maximizes the value of each annotation by targeting where the model needs the most help.
- D) — Incorrect. APE optimizes the instruction text itself, not the selection of exemplars. It automatically discovers better prompt phrasing, but it doesn't address which few-shot examples should be annotated by humans.