Prompting 4
Interactive Quiz: MCQs and MSQs
Q1. (MCQ) PAL (Program-Aided Language Models) and standard Chain-of-Thought prompting both generate intermediate reasoning steps. What is the fundamental difference in how PAL arrives at the final answer?
A) PAL uses the LLM to compute the final answer from its generated reasoning, just like CoT, but in a structured format
B) PAL generates free-form text reasoning and then asks a second LLM to verify the answer
C) PAL has the LLM generate a program as its intermediate steps, then offloads the actual computation to a programmatic runtime like a Python interpreter
D) PAL bypasses intermediate steps entirely and directly produces executable code without reasoning
Answer: C
- A) — Incorrect. This describes what CoT does — the LLM both reasons and computes the final answer using free-form text. PAL specifically separates these responsibilities: the LLM reasons and generates code, but the runtime computes the answer.
- B) — Incorrect. PAL doesn't use a second LLM for verification. It uses a programmatic runtime (e.g., Python interpreter), not another language model. The distinction is between natural language computation and deterministic code execution.
- C) — Correct. PAL differs from CoT in that instead of using free-form text to obtain a solution, it offloads the solution step to a programmatic runtime such as a Python interpreter. The LLM's job is to translate the problem into code; the interpreter's job is to execute that code and produce the answer.
- D) — Incorrect. PAL doesn't bypass intermediate steps: the generated code is the intermediate reasoning. Each line of code corresponds to a reasoning step (e.g., "I was born 25 years before" becomes born = today - relativedelta(years=25)). The reasoning is embedded in the code structure.
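A minimal sketch of the PAL pattern from Q1, assuming the LLM has already returned a program as text. The string llm_out is a hand-written stand-in for model output, the fixed value of today is a hypothetical date chosen for illustration, and the sketch uses only the standard library instead of relativedelta.

```python
from datetime import date

# Hand-written stand-in for what an LLM might return under a PAL prompt.
# Each line mirrors one reasoning step of the word problem.
llm_out = """
# "Today is 27 February 2023." (hypothetical date for illustration)
today = date(2023, 2, 27)
# "I was born exactly 25 years before today."
born = today.replace(year=today.year - 25)
# The answer is the birth date formatted as MM/DD/YYYY.
answer = born.strftime("%m/%d/%Y")
"""


def run_pal_program(program: str) -> str:
    """Execute the generated program and return its `answer` variable."""
    namespace = {"date": date}  # expose only the name the program needs
    exec(program, namespace)    # the interpreter, not the LLM, computes the result
    return namespace["answer"]


result = run_pal_program(llm_out)
print(result)  # 02/27/1998
```

Running model-generated code with exec() is unsafe outside a demo; a real system would likely execute it in a sandbox.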
Q2. (MCQ) In the ReAct framework, a model is answering a question about the Colorado orogeny. After searching "High Plains" and getting an ambiguous result, the model generates: "I need to instead search High Plains (United States)." This sentence is classified as which component of the ReAct trajectory?
A) Action
B) Observation
C) Thought
D) Reflection
Answer: C
- A) — Incorrect. An Action in ReAct is a concrete operation performed on the environment, formatted as Search[...], Lookup[...], or Finish[...]. The sentence describes the model's reasoning about what to do next, not the action itself.
- B) — Incorrect. An Observation is the feedback returned from the environment after an action is executed (e.g., search results). This sentence is generated by the model, not returned by an external source.
- C) — Correct. This is a Thought — a free-form reasoning trace where the model evaluates the current situation and adjusts its plan. The model recognized that the search result was ambiguous, diagnosed the problem, and formulated a corrective strategy. Thoughts in ReAct handle exactly this: inducing, tracking, and updating action plans, including handling exceptions.
- D) — Incorrect. Reflection is a concept from the Reflexion framework, not ReAct. ReAct uses Thought-Action-Observation trajectories. Reflexion extends ReAct by adding self-reflection and memory components.
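The three step types in Q2 can be told apart mechanically. The sketch below is illustrative only: the function names and the produced_by_model flag are assumptions, and the observation text is a paraphrased placeholder rather than real search output.

```python
import re

# Actions use the bracketed format named above: Search[...], Lookup[...], Finish[...].
ACTION_PATTERN = re.compile(r"^(Search|Lookup|Finish)\[(.*)\]$")


def parse_action(text: str):
    """Return (tool, argument) if text is a well-formed action, else None."""
    match = ACTION_PATTERN.match(text.strip())
    return (match.group(1), match.group(2)) if match else None


def classify_step(text: str, produced_by_model: bool) -> str:
    """Label one trajectory step as Thought, Action, or Observation."""
    if not produced_by_model:
        return "Observation"  # returned by the environment
    if parse_action(text) is not None:
        return "Action"       # concrete operation on the environment
    return "Thought"          # free-form reasoning from the model


steps = [
    ("Search[High Plains]", True),
    ("High Plains refers to one of two distinct land regions.", False),
    ("I need to instead search High Plains (United States).", True),
    ("Search[High Plains (United States)]", True),
]
labels = [classify_step(text, by_model) for text, by_model in steps]
print(labels)  # ['Action', 'Observation', 'Thought', 'Action']
```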
Q3. (MSQ — Select ALL that apply) Which of the following are the three distinct model components in the Reflexion framework?
A) An Actor that generates text and actions based on state observations
B) A Planner that decomposes tasks into sub-goals before execution
C) An Evaluator that scores the outputs produced by the Actor
D) A Self-Reflection model that generates verbal reinforcement cues for self-improvement
Answer: A, C, D
- A) — Correct. The Actor generates text and actions based on state observations, takes actions in the environment, and receives observations resulting in a trajectory. CoT and ReAct are used as Actor models.
- B) — Incorrect. There is no "Planner" component in Reflexion. Task decomposition might occur within the Actor (via CoT or ReAct), but it's not a separate architectural component. Reflexion's three models are Actor, Evaluator, and Self-Reflection.
- C) — Correct. The Evaluator scores outputs produced by the Actor. It takes a generated trajectory (short-term memory) as input and outputs a reward score, using different reward functions depending on the task.
- D) — Correct. The Self-Reflection model generates verbal reinforcement cues to assist the Actor in self-improvement. It uses the reward signal, current trajectory, and persistent memory to generate specific, relevant feedback stored in long-term memory.
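How the three components in Q3 fit together can be sketched as a loop. This is a simplified illustration, not the framework's actual implementation: the three callables, the trial budget, and the success threshold are all hypothetical, and the toy stand-ins replace real LLM calls.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str], float],
    self_reflect: Callable[[str, float, List[str]], str],
    task: str,
    max_trials: int = 3,
    success_threshold: float = 1.0,
) -> Tuple[str, List[str]]:
    """Run Actor -> Evaluator -> Self-Reflection until success or the budget ends."""
    long_term_memory: List[str] = []  # verbal reflections kept across trials
    trajectory = ""
    for _ in range(max_trials):
        trajectory = actor(task, long_term_memory)  # short-term memory for this trial
        reward = evaluator(trajectory)
        if reward >= success_threshold:
            break
        long_term_memory.append(self_reflect(trajectory, reward, long_term_memory))
    return trajectory, long_term_memory


# Toy stand-ins: the actor only succeeds once it has a reflection to read.
def toy_actor(task: str, memory: List[str]) -> str:
    return "answer: 4" if memory else "answer: 5"


def toy_evaluator(trajectory: str) -> float:
    return 1.0 if trajectory.endswith("4") else 0.0


def toy_reflect(trajectory: str, reward: float, memory: List[str]) -> str:
    return f"Previous attempt '{trajectory}' scored {reward}; recheck the arithmetic."


final, reflections = reflexion_loop(toy_actor, toy_evaluator, toy_reflect, "What is 2 + 2?")
print(final, len(reflections))  # answer: 4 1
```

Note that no model weights change between trials; only the list of stored reflections grows.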
Q4. (MCQ) Multimodal CoT uses a two-stage framework. A 1-billion parameter Multimodal CoT model outperforms GPT-3.5 on the ScienceQA benchmark. What is the most significant implication of this result?
A) Multimodal models always outperform text-only models regardless of parameter count
B) GPT-3.5 is incapable of reasoning on science questions
C) Incorporating vision alongside text into a structured rationale-then-inference framework can allow a dramatically smaller model to surpass a much larger text-only model
D) The ScienceQA benchmark is too easy to meaningfully distinguish model capabilities
Answer: C
- A) — Incorrect. "Always" makes this claim overly broad. The result is specific to ScienceQA and the Multimodal CoT architecture. There's no evidence that any arbitrary multimodal model outperforms any text-only model on every task.
- B) — Incorrect. GPT-3.5 can reason on science questions — it simply performs worse than the multimodal approach on this specific benchmark. Being outperformed doesn't mean incapability.
- C) — Correct. A 1B parameter model outperforming GPT-3.5 (which has orders of magnitude more parameters) demonstrates that the architecture (combining text and vision modalities in a rationale-generation-then-answer-inference pipeline) can compensate for a massive size disadvantage. The multimodal information provides grounding that pure text-based reasoning lacks.
- D) — Incorrect. ScienceQA is used precisely because it requires multimodal reasoning involving scientific diagrams and text. Dismissing the benchmark without evidence undermines the demonstrated result.
Q5. (MCQ) CoT prompting suffers from fact hallucination, while ReAct's performance can be derailed by non-informative search results. A researcher wants a system that mitigates both failure modes. Which combined approach was found to generally outperform all other prompting methods?
A) ReAct combined with Tree of Thoughts
B) CoT combined with Generated Knowledge Prompting
C) ReAct combined with CoT and Self-Consistency
D) ReAct combined with Reflexion and Active-Prompt
Answer: C
- A) — Incorrect. While both are powerful frameworks, this specific combination is not discussed as the best-performing approach in the ReAct analysis. Tree of Thoughts solves a different problem (strategic lookahead).
- B) — Incorrect. This combination might address hallucination through knowledge generation, but it doesn't incorporate external tool access, which is ReAct's core contribution for factual grounding.
- C) — Correct. Prompting methods that combine and support switching between ReAct and CoT+Self-Consistency generally outperform all other prompting methods. This combination leverages CoT's internal reasoning flexibility, ReAct's external information retrieval, and Self-Consistency's majority-vote robustness.
- D) — Incorrect. While Reflexion does extend ReAct, this specific three-way combination is not the one identified as the top performer in the ReAct analysis. The paper specifically highlights ReAct + CoT + Self-Consistency.
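One way to picture "switching between ReAct and CoT+Self-Consistency" from Q5 is a back-off rule. The sketch below is an assumption-laden illustration: the half-of-samples threshold, the function names, and the toy inputs are hypothetical, and react_agent stands in for a real tool-using agent.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, int]:
    """Self-Consistency: return the most common answer and its vote count."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes


def answer_with_backoff(
    cot_samples: List[str],
    react_agent: Callable[[], Optional[str]],
) -> str:
    """Use the CoT majority when it is confident; otherwise fall back to ReAct."""
    answer, votes = majority_vote(cot_samples)
    if votes >= len(cot_samples) / 2:
        return answer             # sampled reasoning paths mostly agree
    react_answer = react_agent()  # consult external evidence instead
    return react_answer if react_answer is not None else answer


confident = answer_with_backoff(["Paris", "Paris", "Lyon"], lambda: "Marseille")
split = answer_with_backoff(["A", "B", "C", "D"], lambda: "B")
print(confident, split)  # Paris B
```

The same idea can run in the other direction, with ReAct tried first and the CoT majority used when ReAct fails to finish within its step budget.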
Q6. (MCQ) In the PAL date-understanding example, the LLM receives a question and generates a Python code snippet. The developer then calls exec(llm_out) to run it. Why is this approach more reliable than having the LLM compute the date arithmetic directly through CoT?
A) Python code executes faster than the LLM can generate text
B) The Python interpreter performs deterministic computation that is immune to the arithmetic and logical errors LLMs make in free-form text reasoning
C) The LLM generates more creative solutions when writing code than when writing text
D) Python's exec() function has built-in error correction for LLM-generated code
Answer: B
- A) — Incorrect. Speed isn't the advantage. The issue PAL addresses is accuracy of computation, not latency. Whether the interpreter is faster or slower than text generation is irrelevant to the reliability improvement.
- B) — Correct. LLMs frequently make arithmetic, date, and logical errors when computing in free-form text because they're pattern-matching over tokens, not executing mathematical operations. By offloading the computation to a Python interpreter, PAL ensures that once the problem is correctly formulated as code, the execution is deterministic and mathematically exact. The LLM handles the hard part (understanding the problem), and the interpreter handles the easy-for-computers part (calculating the answer).
- C) — Incorrect. Creativity is not the goal — reliability and correctness are. PAL isn't about creative solutions; it's about converting natural language reasoning into verifiable, executable computation.
- D) — Incorrect. Python's exec() has no error correction for LLM-generated code whatsoever. If the LLM generates syntactically or logically incorrect code, exec() will either throw an error or produce a wrong result. The reliability comes from the determinism of correct code, not from any self-correction mechanism.
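The point in option D can be seen directly. In this sketch the wrapper function and the three program strings are hypothetical; it shows that exec() either raises on invalid code or faithfully runs code whose logic is wrong.

```python
def run_generated_code(program: str, result_name: str = "answer"):
    """Execute generated code and return (value, error). exec() repairs nothing."""
    namespace = {}
    try:
        exec(program, namespace)
    except Exception as exc:  # SyntaxError, NameError, etc. surface here
        return None, f"{type(exc).__name__}: {exc}"
    if result_name not in namespace:
        return None, f"program did not define '{result_name}'"
    return namespace[result_name], None


good = run_generated_code("answer = 15 + 30")
broken = run_generated_code("answer = 15 +")    # syntactically invalid
wrong = run_generated_code("answer = 15 - 30")  # runs, but the logic is wrong

print(good)       # (45, None)
print(broken[1])  # starts with "SyntaxError"
print(wrong)      # (-15, None)
```

The third case is the dangerous one: nothing fails, so the wrong result can only be caught by checking the generated logic itself.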
Q7. (MSQ — Select ALL that apply) Reflexion is best suited for scenarios where:
A) Traditional reinforcement learning methods are impractical due to data and compute costs
B) The task requires a single-shot response with no opportunity for iteration
C) Nuanced verbal feedback is more useful than scalar reward signals
D) Interpretability and explicit episodic memory are important for analyzing the agent's learning process
Answer: A, C, D
- A) — Correct. Traditional RL methods require extensive training data and expensive model fine-tuning. Reflexion offers a lightweight alternative that doesn't require fine-tuning the underlying language model, making it more efficient in data and compute.
- B) — Incorrect. This is the exact opposite of when Reflexion is useful. Reflexion is designed for iterative trial-and-error learning across multiple episodes. A single-shot scenario with no iteration provides no opportunity for the self-reflection loop to operate.
- C) — Correct. Reflexion utilizes verbal feedback, which can be more nuanced and specific than scalar rewards used in traditional RL. This allows the agent to better understand its mistakes and make more targeted improvements.
- D) — Correct. Reflexion provides a more interpretable and explicit form of episodic memory compared to traditional RL methods. The agent's self-reflections are stored in memory, allowing easier analysis and understanding of its learning process.
Q8. (MCQ) In a ReAct trajectory, the model searches for information about a topic and receives a search result. This search result is labeled as an "Observation." What distinguishes an Observation from a Thought in ReAct's architecture?
A) Observations are generated by the LLM while Thoughts come from the environment
B) Observations come from the external environment while Thoughts are internally generated by the LLM
C) Observations are always longer than Thoughts
D) Observations contain factual information while Thoughts are always speculative
Answer: B
- A) — Incorrect. This reverses the relationship entirely. Thoughts are the LLM's internal reasoning traces; Observations are external feedback.
- B) — Correct. Observations correspond to information returned from the environment being interacted with (e.g., search engine results, Wikipedia content, game state feedback). Thoughts are free-form reasoning traces generated internally by the LLM to plan, adjust, diagnose, and synthesize. The distinction is the source: external environment vs. internal model generation.
- C) — Incorrect. Length has nothing to do with the distinction. A Thought can be longer than an Observation or vice versa. The defining characteristic is the source of the content.
- D) — Incorrect. Thoughts are not always speculative — they often contain definitive reasoning (e.g., "The answer is 1,800 to 7,000 ft"). And Observations can contain ambiguous or non-informative results. The distinction is about origin (environment vs. model), not factuality.
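The source distinction in Q8 shows up clearly in a control loop. The sketch below is a simplified illustration with hypothetical names: scripted_model replays canned steps in place of an LLM, and toy_environment stands in for a search tool.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (kind, text)


def react_loop(
    model: Callable[[List[Step]], Tuple[str, str]],
    environment: Callable[[str], str],
    max_steps: int = 5,
) -> List[Step]:
    """Alternate model-generated steps with environment-generated observations."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        thought, action = model(trajectory)  # both come from the LLM
        trajectory.append(("Thought", thought))
        trajectory.append(("Action", action))
        if action.startswith("Finish["):
            break
        trajectory.append(("Observation", environment(action)))  # external source
    return trajectory


# Canned steps standing in for real model output.
script = iter([
    ("I should look up the topic.", "Search[topic]"),
    ("The result answers the question.", "Finish[done]"),
])


def scripted_model(trajectory: List[Step]) -> Tuple[str, str]:
    return next(script)


def toy_environment(action: str) -> str:
    return f"(stub result for {action})"


trace = react_loop(scripted_model, toy_environment)
kinds = [kind for kind, _ in trace]
print(kinds)  # ['Thought', 'Action', 'Observation', 'Thought', 'Action']
```

The only line that never calls the model is the one that appends the Observation, which is what separates it from a Thought.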
Q9. (MCQ) Reflexion extends the ReAct framework. What are the specific new components Reflexion adds on top of ReAct?
A) External tool access and search capabilities
B) Self-evaluation, self-reflection, and memory components
C) Breadth-first and depth-first search over reasoning trees
D) A tuneable policy LM that generates directional hints
Answer: B
- A) — Incorrect. External tool access and search capabilities are already part of ReAct (the "Act" component interacts with environments and knowledge bases). Reflexion doesn't add these — it inherits them.
- B) — Correct. Reflexion extends the ReAct framework specifically by introducing self-evaluation (the Evaluator that scores trajectories), self-reflection (the Self-Reflection model that generates verbal feedback), and memory components (short-term trajectory memory and long-term persistent memory for storing reflections).
- C) — Incorrect. Tree search algorithms belong to the Tree of Thoughts (ToT) framework, not Reflexion. Reflexion operates through sequential episodes of trial, evaluation, and reflection — not branching tree search.
- D) — Incorrect. A tuneable policy LM generating directional hints describes Directional Stimulus Prompting. Reflexion's Self-Reflection model generates verbal self-critiques from its own experience, not external directional stimuli.
Q10. (MCQ) In the Multimodal CoT two-stage framework, the first stage generates a rationale and the second stage performs answer inference. Why is the rationale generated before the answer rather than simultaneously?
A) Generating both simultaneously would exceed the model's context window
B) The rationale stage incorporates multimodal information to produce grounded reasoning, which then serves as higher-quality input for the answer inference stage
C) The rationale is generated by a text-only model, while the answer is generated by a vision-only model
D) Simultaneous generation would require twice the GPU memory
Answer: B
- A) — Incorrect. The two-stage design is not motivated by context window limitations. Both stages could theoretically fit in a single pass. The separation is an architectural choice for reasoning quality.
- B) — Correct. The two-stage design mirrors the rationale-then-answer pattern seen across many prompting techniques. The first stage generates a rationale based on multimodal information (both text and vision), producing an intermediate representation that captures insights from both modalities. The second stage then leverages these informative generated rationales to make a better-grounded answer inference. Separating the stages forces explicit reasoning before conclusion.
- C) — Incorrect. Both stages operate within the same multimodal framework. The whole point is integrating text and vision, not splitting them across separate models with different modalities.
- D) — Incorrect. GPU memory is a hardware concern, not the architectural motivation. The two-stage design is about improving reasoning quality through explicit rationale generation, not about memory management.
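The data flow of the two stages in Q10 can be sketched with stubs. Everything here is hypothetical scaffolding: the two callables stand in for trained multimodal models, and the image features are placeholder numbers. The sketch only shows that stage 2 receives the stage 1 rationale together with the original inputs.

```python
from typing import Callable, List, Tuple


def two_stage_answer(
    question: str,
    image_features: List[float],
    generate_rationale: Callable[[str, List[float]], str],
    infer_answer: Callable[[str, List[float]], str],
) -> Tuple[str, str]:
    """Stage 1 builds a rationale; stage 2 answers with the rationale appended."""
    rationale = generate_rationale(question, image_features)
    augmented_input = f"{question}\nRationale: {rationale}"
    answer = infer_answer(augmented_input, image_features)
    return rationale, answer


# Toy stand-ins for the two model calls.
def toy_rationale(question: str, image_features: List[float]) -> str:
    return f"the image has {len(image_features)} feature(s) relevant to the question"


def toy_infer(text: str, image_features: List[float]) -> str:
    return "B" if "Rationale:" in text else "unknown"


rationale, answer = two_stage_answer(
    "Which magnet pair attracts?", [0.1, 0.7], toy_rationale, toy_infer
)
print(answer)  # B
```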
Q11. (MCQ) A ReAct agent is answering a question on HotPotQA but repeatedly retrieves non-informative search results. The model's reasoning becomes confused and it cannot recover. Meanwhile, a CoT-only agent answers the same question but hallucinates a fact that doesn't exist. These two failures illustrate:
A) That both frameworks are equally unreliable and should be abandoned
B) Complementary weaknesses: CoT hallucinates without external grounding while ReAct's reasoning breaks down when retrieval fails, which is why combining them outperforms either alone
C) That ReAct is strictly superior to CoT because it at least attempts to find real information
D) That external tool access always improves model performance
Answer: B
- A) — Incorrect. Both frameworks have demonstrated strong performance on many tasks. Having failure modes doesn't make them unreliable — it means they have complementary strengths and weaknesses that can be addressed through combination.
- B) — Correct. CoT suffers from fact hallucination because it relies entirely on internal knowledge with no external verification. ReAct's structural constraint reduces its flexibility in formulating reasoning steps, and non-informative search results can derail its reasoning, leaving the model struggling to recover. These weaknesses are complementary, which is precisely why combining ReAct with CoT+Self-Consistency generally outperforms all other prompting methods: each compensates for the other's failure mode.
- C) — Incorrect. ReAct is not strictly superior — it actually lags behind CoT on HotPotQA. Attempting to find real information is useless if the retrieved information is non-informative and the model can't recover. Both have distinct advantages.
- D) — Incorrect. The ReAct failure example demonstrates exactly the opposite — external tool access can hurt performance when retrieval quality is poor. The model becomes dependent on what it retrieves, and bad retrieval leads to bad reasoning.
Q12. (MSQ — Select ALL that apply) Which of the following are explicitly stated limitations of the Reflexion framework?
A) It relies on the agent's ability to accurately self-evaluate, which is challenging for complex tasks
B) It requires fine-tuning the underlying language model after each reflection episode
C) Its sliding window memory has maximum capacity constraints that may be insufficient for complex tasks
D) Code generation tasks face limitations in specifying accurate input-output mappings for non-deterministic functions
Answer: A, C, D
- A) — Correct. Reflexion relies on the agent's ability to accurately evaluate its performance and generate useful self-reflections, which can be challenging for complex tasks. However, it's expected to improve as models advance.
- B) — Incorrect. This is the exact opposite of Reflexion's design philosophy. Reflexion is explicitly positioned as a lightweight alternative to traditional RL that doesn't require fine-tuning the underlying language model. The "verbal reinforcement" approach avoids weight updates entirely.
- C) — Correct. Reflexion uses a sliding window with maximum capacity for long-term memory, and for more complex tasks it may be advantageous to use advanced structures such as vector embeddings or SQL databases.
- D) — Correct. Code generation limitations include difficulties with test-driven development in specifying accurate input-output mappings, such as non-deterministic generator functions and function outputs influenced by hardware.
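The sliding-window limitation in option C is easy to demonstrate. This is a minimal sketch with a hypothetical class name and capacity; it shows how a fixed-size window silently discards the oldest reflection once it is full.

```python
from collections import deque


class ReflectionMemory:
    """Long-term memory that keeps only the most recent reflections."""

    def __init__(self, max_items: int = 3):
        self._items = deque(maxlen=max_items)  # oldest entry is dropped when full

    def add(self, reflection: str) -> None:
        self._items.append(reflection)

    def as_prompt_context(self) -> str:
        """Format the stored reflections for inclusion in the next prompt."""
        return "\n".join(f"- {item}" for item in self._items)


memory = ReflectionMemory(max_items=2)
for note in ["check units", "verify the date format", "re-read the question"]:
    memory.add(note)

context = memory.as_prompt_context()
print(context)
# - verify the date format
# - re-read the question
```

The first reflection is gone after the third insert, which is why longer or more complex tasks may call for a retrieval-based store instead.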
Q13. (MCQ) A developer needs an LLM to solve the following: "A concert was rescheduled from June 1 to 15 days later. If tickets expire 30 days after the original date, on what date do the tickets expire?" Which technique would most reliably produce the correct answer?
A) Zero-shot CoT with "Let's think step by step"
B) ReAct with a search engine tool
C) PAL with a Python interpreter
D) Self-Consistency with multiple CoT reasoning paths
Answer: C
- A) — Incorrect. While zero-shot CoT would attempt step-by-step reasoning, LLMs frequently miscalculate dates (crossing month boundaries, handling variable month lengths) in free-form text. The date arithmetic here involves adding days across month boundaries — exactly the type of computation LLMs struggle with.
- B) — Incorrect. This is a computational problem, not an information retrieval problem. There's nothing to search for — the answer requires date arithmetic on given values. ReAct's strength is accessing external knowledge, which isn't needed here.
- C) — Correct. PAL is specifically designed for exactly this type of problem. The LLM would translate the word problem into Python code using datetime and relativedelta, and the interpreter would handle the date arithmetic deterministically. The PAL article demonstrates this exact pattern with date-understanding problems, ensuring correct computation across month boundaries and edge cases.
- D) — Incorrect. Self-Consistency samples multiple CoT paths and takes a majority vote. But if the underlying reasoning mechanism (free-form text arithmetic) is unreliable for date computation, sampling more paths may produce multiple incorrect answers that still form a majority. Voting over flawed computations doesn't guarantee correctness.
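For the concert problem in Q13, the kind of program a PAL prompt might produce looks roughly like this. The year is an assumption, since the question gives none, and the sketch uses the standard library's timedelta instead of relativedelta.

```python
from datetime import date, timedelta

# The year is an assumption; the problem only states "June 1".
original_date = date(2024, 6, 1)

# "Rescheduled ... to 15 days later."
rescheduled_date = original_date + timedelta(days=15)

# "Tickets expire 30 days after the original date", not the rescheduled one.
expiry_date = original_date + timedelta(days=30)

print(rescheduled_date.isoformat())  # 2024-06-16
print(expiry_date.isoformat())       # 2024-07-01
```

June has 30 days, so the expiry lands on July 1. The rescheduling is a distractor, and the interpreter handles the month boundary without any manual counting.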
Q14. (MCQ) In Reflexion, the Evaluator component takes a generated trajectory as input and outputs a reward score. The trajectory is also referred to as:
A) Long-term memory
B) Short-term memory
C) Persistent memory
D) Episodic reflection
Answer: B
- A) — Incorrect. Long-term memory in Reflexion stores self-reflection outputs (verbal feedback from past episodes), not the current trajectory. The trajectory is transient — it represents the current episode's actions.
- B) — Correct. The generated trajectory is denoted as short-term memory. The Evaluator takes this short-term memory as input and produces a reward score. This is distinct from the persistent/long-term memory where self-reflections are stored.
- C) — Incorrect. Persistent memory stores the accumulated self-reflections across episodes. It's the Self-Reflection model that writes to persistent memory, not the trajectory itself.
- D) — Incorrect. "Episodic reflection" is not the term used for the trajectory. The trajectory is the sequence of actions and observations from a single episode; reflection is a separate process that evaluates the trajectory.
Q15. (MCQ) In the ReAct LangChain example, the agent is asked: "Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?" The agent uses a Search tool and a Calculator tool. Why can't standard CoT prompting solve this problem reliably?
A) CoT cannot handle questions with more than one sub-question
B) CoT lacks access to current real-world information and is prone to arithmetic errors in free-form text, both of which this question demands
C) CoT cannot understand celebrity-related questions
D) CoT always produces a single-word answer and cannot show computation steps
Answer: B
- A) — Incorrect. CoT can decompose multi-part questions into sequential steps. The issue isn't structural complexity — it's the need for external data and precise computation.
- B) — Correct. This question has two requirements that CoT handles poorly: (1) it needs current real-world information (who is Olivia Wilde's boyfriend now), which CoT cannot access since it relies only on internal knowledge that may be outdated or hallucinated; and (2) it needs precise mathematical computation (29^0.23), which LLMs frequently get wrong in free-form text. ReAct solves both by using Search for information retrieval and Calculator for exact arithmetic.
- C) — Incorrect. CoT can handle celebrity questions — it might answer correctly if the information is in its training data. The problem is that CoT has no mechanism to verify or update its knowledge, leading to potential hallucination on current facts.
- D) — Incorrect. CoT is explicitly designed to show multi-step reasoning, not produce single-word answers. The limitation is about external access and computational accuracy, not output format.
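The arithmetic half of Q15 is a one-line job for an interpreter. In this sketch the age is simply the value used in the explanation above; in a real ReAct run it would come from the Search tool, and the variable names are illustrative.

```python
age = 29         # value from the explanation above; normally supplied by Search
exponent = 0.23

# The Calculator tool's job, done here by the interpreter instead of the LLM.
result = age ** exponent

print(round(result, 4))  # 2.1695
```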
Q16. (MCQ) Reflexion parameterises a policy as "an agent's memory encoding paired with a choice of LLM parameters." What makes this fundamentally different from how traditional reinforcement learning parameterises a policy?
A) Reflexion uses a larger neural network than traditional RL
B) Traditional RL updates model weights through gradient-based training, while Reflexion encodes learning as natural language stored in memory without modifying model weights
C) Reflexion uses scalar rewards while traditional RL uses verbal feedback
D) Traditional RL cannot be applied to language tasks
Answer: B
- A) — Incorrect. Network size isn't the distinguishing factor. Reflexion can use the same or smaller LLMs. The difference is how learning is encoded, not the scale of the model.
- B) — Correct. Traditional RL parameterises policies through model weights updated via gradient descent on reward signals. Reflexion's paradigm is "verbal reinforcement" — learning is encoded as natural language self-reflections stored in memory, paired with a fixed LLM. The model weights never change; instead, the agent's context (memory) evolves across episodes. This is what makes Reflexion a lightweight alternative that doesn't require fine-tuning.
- C) — Incorrect. This is reversed. Traditional RL typically uses scalar rewards. Reflexion converts feedback (including scalar) into verbal/linguistic feedback. Reflexion's advantage is the richness of verbal feedback over scalar signals.
- D) — Incorrect. Traditional RL has been applied to language tasks (e.g., RLHF for instruction tuning). The distinction is about efficiency and approach, not applicability.