LLM
Interactive Quiz - MCQs and MSQs
Q1. (MCQ) Before 2017, most language models were Recurrent Neural Networks (RNNs) that processed text one word at a time. What fundamental problem did this create, and how did the Transformer architecture solve it?
A) RNNs consumed too much memory; Transformers use less memory by processing only key words B) RNNs created a sequential bottleneck preventing GPU parallelization; Transformers process all words simultaneously in parallel C) RNNs could only handle English text; Transformers introduced multilingual support D) RNNs required too much training data; Transformers need significantly less data
Answer: B
- A) — Incorrect. The bottleneck was about processing speed, not memory. RNNs couldn't utilize GPU parallel processing because each word depended on the previous word's output — a sequential dependency, not a memory problem.
- B) — Correct. RNNs processed text one word at a time, creating a bottleneck: the model had to wait for the previous word before moving to the next, making it impossible to fully utilize GPUs' parallel power. Transformers solved this by processing all input simultaneously — they "soak it all in at once, in parallel."
- C) — Incorrect. RNNs could handle multiple languages. The limitation was architectural (sequential processing), not linguistic.
- D) — Incorrect. Transformers actually require more training data (terabytes), not less. The innovation was about parallelization efficiency, not data efficiency.
Q2. (MCQ) Inside a Transformer, the word "bank" appears in the sentence "She jumped into the river and swam to the bank." The attention mechanism changes the numerical representation of "bank." What is this mechanism doing?
A) Replacing "bank" with the word "riverbank" in the vocabulary B) Deleting irrelevant meanings of "bank" from the model's parameters permanently C) Allowing the vector encoding "bank" to communicate with surrounding context vectors like "river" and "jumped into" to refine its meaning toward "riverbank" D) Looking up the dictionary definition of "bank" and selecting the correct one
Answer: C
- A) — Incorrect. The attention mechanism doesn't replace words in the vocabulary. It refines the numerical representation (vector) of "bank" — the word token itself doesn't change, but the numbers encoding its meaning are adjusted.
- B) — Incorrect. Attention doesn't permanently alter model parameters. It contextually adjusts vectors during inference for this specific input. The model's stored knowledge (parameters) remains unchanged.
- C) — Correct. The attention operation gives all lists of numbers (vectors) a chance to communicate with one another and refine the meanings they encode based on surrounding context, all done in parallel. The numbers encoding "bank" are changed based on context like "river" and "jumped into" to encode the more specific notion of a riverbank.
- D) — Incorrect. Transformers don't use dictionaries. They learn statistical relationships between words during training. There's no lookup table of definitions — meaning is encoded in the geometry of the vector space.
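The context-mixing idea in Q2 can be sketched in a few lines. This is a toy illustration, not how a production Transformer is implemented: the two-dimensional embeddings are hypothetical (axis 0 loosely "water-related", axis 1 loosely "finance-related"), and the learned query/key/value projections and multiple heads of real attention are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(vectors):
    """Return refined vectors: each output is a weighted mix of all inputs."""
    dim = len(vectors[0])
    refined = []
    for query in vectors:
        # Similarity of this token to every token, scaled as in dot-product attention.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        refined.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return refined

# Hypothetical embeddings (assumption for illustration only).
tokens = ["river", "bank"]
vectors = [[2.0, 0.0], [1.0, 1.0]]

refined = attend(vectors)
bank_before, bank_after = vectors[1], refined[1]
print(bank_before, "->", bank_after)
```

After mixing with "river", the vector for "bank" moves toward the water-related axis and away from the finance-related one, while the token "bank" itself and the model's parameters are unchanged.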
Q3. (MSQ — Select ALL that apply) Tokenization is described as the necessary first step bridging human language and machine math. Which of the following correctly describe why tokens are needed?
A) Transformers can only process numbers, not raw text — tokenization maps text chunks into numerical IDs B) Tokens reduce the cost of API calls by compressing text into fewer characters C) The numerical IDs from tokenization are converted into embeddings, enabling the attention mechanism to establish mathematical relationships between words D) Tokens replace the need for an embedding layer entirely
Answer: A, C
- A) — Correct. The underlying structure of LLMs uses Transformers, which can only process numbers, not raw text. Tokenization acts like a dictionary, mapping text chunks into numerical IDs.
- B) — Incorrect. Tokenization doesn't compress text to reduce costs. Tokens are the unit of measurement for costs — API pricing is based on token count. Tokenization is about numerical representation, not compression.
- C) — Correct. The numerical IDs from tokenization are converted into a list of vectors (embeddings), allowing the Transformer's attention mechanism to establish mathematical relationships between words based on their surroundings.
- D) — Incorrect. Tokens require an embedding layer — they don't replace it. Tokenization produces numerical IDs, which are then converted into embeddings (dense vectors). Tokenization and embedding are sequential steps, not alternatives.
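The two sequential steps in Q3 (text to IDs, then IDs to vectors) can be sketched as follows. The vocabulary, the whitespace splitting, and the embedding values are all assumptions made for illustration; real tokenizers typically work on subword pieces and real embedding tables are learned during training.

```python
# Hypothetical vocabulary: maps each text chunk to a numerical ID.
VOCAB = {"she": 0, "swam": 1, "to": 2, "the": 3, "bank": 4}

# Hypothetical embedding table: one small vector per ID (learned in a real model).
EMBEDDINGS = [
    [0.1, 0.3],
    [0.7, 0.2],
    [0.0, 0.1],
    [0.0, 0.2],
    [0.5, 0.5],
]

def tokenize(text, vocab):
    """Step 1: map text chunks to numerical IDs (naive whitespace split)."""
    return [vocab[word] for word in text.lower().split()]

def embed(token_ids, table):
    """Step 2: look up a vector for each ID; attention operates on these vectors."""
    return [table[i] for i in token_ids]

ids = tokenize("She swam to the bank", VOCAB)
vectors = embed(ids, EMBEDDINGS)
print(ids)
print(vectors)
```

Tokenization alone yields only integers; the embedding lookup is the separate step that gives the attention mechanism vectors to work with.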
Q4. (MCQ) Llama 2-70B has 70 billion parameters stored in FP16 format. Each FP16 parameter uses 2 bytes. A developer wants to estimate the storage requirement. What is the approximate file size?
A) 35 GB B) 70 GB C) 140 GB D) 280 GB
Answer: C
- A) — Incorrect. 35 GB would correspond to 0.5 bytes per parameter (FP4/4-bit quantization), not FP16's 2 bytes.
- B) — Incorrect. 70 GB would correspond to 1 byte per parameter (INT8 quantization). FP16 uses 2 bytes per parameter.
- C) — Correct. 70 billion parameters × 2 bytes (FP16) = 140 GB. The model weights require roughly 140 GB of storage in FP16 format.
- D) — Incorrect. 280 GB would correspond to 4 bytes per parameter (FP32/full precision). FP16 uses half that at 2 bytes per parameter.
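The arithmetic behind all four options in Q4 is the same multiplication with a different bytes-per-parameter value. A minimal sketch, assuming 1 GB = 10^9 bytes and counting weights only (no optimizer state or runtime overhead); the format names in the table are labels chosen for this example:

```python
# Bytes used per parameter for each storage format discussed in the options.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_size_gb(num_params, fmt):
    """Approximate size of the weights file in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

LLAMA2_70B_PARAMS = 70e9

for fmt in ("4bit", "int8", "fp16", "fp32"):
    print(fmt, weight_size_gb(LLAMA2_70B_PARAMS, fmt), "GB")
```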
Q5. (MCQ) An LLM is asked to provide a research paper citation. It responds: "As stated in the paper by John Smith (2019), GPT-4 achieved a 99% accuracy rate." No such paper exists. What is this phenomenon called, and why does it happen?
A) Prompt injection — the model was manipulated by hidden instructions in the input B) Data poisoning — the training data contained this fabricated citation C) Hallucination — the model generates confident but false information because it's trained to predict what sounds right, not what's factually correct D) Overfitting — the model memorized the training data too precisely
Answer: C
- A) — Incorrect. Prompt injection involves adversarial instructions embedded in the input to hijack the model. The model wasn't manipulated here — it spontaneously fabricated information.
- B) — Incorrect. Data poisoning involves an attacker deliberately planting malicious content in training data with trigger phrases. The fabricated citation wasn't planted — the model generated it on its own.
- C) — Correct. A neural network can "dream" (hallucinate) content. It states false information with full confidence because models are trained to predict what sounds right, not what is factually correct. The model simply predicts the next word, whether or not the result is true.
- D) — Incorrect. Overfitting means the model memorized training data too precisely and can't generalize. Hallucination is a different problem: the model fabricates content that wasn't in the training data, generating plausible-sounding but nonexistent citations.
Q6. (MCQ) The training process for Llama 2-70B used ~10TB of text data and produced a 140GB parameter file. The material describes this as a form of compression. Why is this compression described as "lossy" rather than "lossless"?
A) The compression algorithm deliberately removes duplicate data for storage efficiency B) The parameter file doesn't contain an identical copy of the original training data — information is encoded approximately as statistical patterns, not preserved exactly C) The file format uses integer quantization instead of floating-point numbers D) The training process runs out of GPU memory, forcing it to discard data randomly
Answer: B
- A) — Incorrect. This isn't a traditional compression algorithm that removes duplicates. The "compression" is a metaphor for how the neural network encodes vast amounts of data into a much smaller parameter space.
- B) — Correct. The parameter file is described as a "compressed" representation of the 10TB training data, but unlike a zip file (which is lossless), this is lossy compression. The model doesn't retain an identical copy of the original data — it learns statistical patterns and relationships that approximate the training data, losing exact details in the process.
- C) — Incorrect. Quantization format (FP16 vs INT8) affects storage precision but isn't the reason the compression is lossy. The lossy nature comes from the fundamental process of distilling 10TB of text into statistical weights.
- D) — Incorrect. Training doesn't randomly discard data due to memory constraints. The "loss" is inherent in how neural networks encode knowledge — they learn generalized patterns rather than storing exact copies.
Q7. (MCQ) The context window hyperparameter is described as the "absolute maximum capacity of tokens allowed for a single interaction." A model has a 32,000-token context window. A developer sends a 5,000-token system prompt, 15,000 tokens of conversation history, and a 3,000-token user query. How many tokens remain for the model's output?
A) 32,000 tokens B) 23,000 tokens C) 9,000 tokens D) 14,000 tokens
Answer: C
- A) — Incorrect. 32,000 is the total window capacity, not the remaining space. Input tokens consume window space.
- B) — Incorrect. 23,000 is the number of tokens already consumed by the input (5,000 system + 15,000 history + 3,000 query), not the number remaining for the output.
- C) — Correct. The formula is: Total Window Capacity ≥ System + Input + History + Output. Therefore: 32,000 - 5,000 (system) - 15,000 (history) - 3,000 (input) = 9,000 tokens remaining for the output.
- D) — Incorrect. This subtracts only the history and the user query (15,000 + 3,000 = 18,000) from 32,000 and ignores the 5,000-token system prompt. Subtracting all three components yields 9,000.
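The budget calculation in Q7 can be written as a small helper. This is a sketch under the question's own assumptions (every input component counts against one shared window); the function and parameter names are made up for this example.

```python
def remaining_output_tokens(window, system, history, user_query):
    """Tokens left for the model's output after all input components are counted."""
    used = system + history + user_query
    if used > window:
        raise ValueError("input alone exceeds the context window")
    return window - used

used_tokens = 5_000 + 15_000 + 3_000
left = remaining_output_tokens(32_000, system=5_000, history=15_000, user_query=3_000)
print(used_tokens, "used,", left, "left for output")
```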
Q8. (MCQ) A developer sets temperature to 0.2 for a legal document summarization tool. A different developer sets temperature to 0.9 for a creative storytelling chatbot. Which statement correctly explains this difference?
A) Lower temperature makes the model process faster; higher temperature makes it slower but more accurate B) Lower temperature sharpens the probability distribution toward the most likely tokens (more deterministic); higher temperature flattens it to select a wider range of tokens (more creative) C) Lower temperature reduces the model's context window; higher temperature expands it D) Lower temperature forces the model to use fewer parameters; higher temperature activates all parameters
Answer: B
- A) — Incorrect. Temperature doesn't affect processing speed. It affects the randomness of token selection from the probability distribution. Both settings process at similar speeds.
- B) — Correct. Lowering the temperature sharpens the probability distribution, increasing the likelihood of selecting the most probable next token (deterministic, predictable). Raising it flattens the distribution, encouraging a wider range of token selection (creative, varied). Values below 1 sharpen the model's unscaled distribution and values above 1 flatten it. Legal summarization needs determinism (0.2); storytelling benefits from more variety (0.9, which stays close to the unscaled distribution and is far flatter than 0.2).
- C) — Incorrect. Temperature has no effect on context window size. The context window is a separate hyperparameter determined by the model's architecture.
- D) — Incorrect. Temperature doesn't activate or deactivate parameters. All model parameters are used during inference regardless of temperature. Temperature only adjusts the probability distribution over the vocabulary when selecting the next token.
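The effect of temperature in Q8 can be seen by dividing the logits by the temperature before applying softmax, which is the standard formulation. The three logits below are hypothetical values chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.2)  # legal summarization setting
flat = softmax_with_temperature(logits, 0.9)   # storytelling setting

print("T=0.2:", [round(p, 3) for p in sharp])
print("T=0.9:", [round(p, 3) for p in flat])
```

At 0.2 almost all probability mass sits on the top token; at 0.9 the lower-ranked tokens keep a meaningful share, so sampled output varies more. The same parameters and the same context window are used in both cases.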
Q9. (MSQ — Select ALL that apply) Which of the following are types of hyperparameters as categorized in the LLM Parameters article?
A) Architecture hyperparameters — determining model size and shape (layers, hidden dimensions) B) Inference hyperparameters — controlling how the model produces outputs (temperature, top-p) C) Embedding hyperparameters — determining vector dimensionality for each token D) Training hyperparameters — guiding the training process (learning rate, batch size)
Answer: A, B, D
- A) — Correct. Architecture hyperparameters (number of layers, dimensionality of hidden layers) determine a model's size and shape.
- B) — Correct. Inference hyperparameters (temperature, top-p sampling) decide how a generative AI model produces its outputs.
- C) — Incorrect. "Embedding hyperparameters" is not listed as a category. The five types are: architecture, training, inference, memory and compute, and output quality hyperparameters.
- D) — Correct. Training hyperparameters (learning rate, batch size) guide the model's training process and strongly affect model performance.
Q10. (MCQ) Given the input "The name of that country is the," a model with Top-K set to 3 considers: United (10.2%), Netherlands (4.5%), Czech (3.1%). If Top-P is set to 0.15 instead, and the same probabilities apply, which tokens are included in the candidate set?
A) Only "United" — because it alone exceeds 0.15 B) "United" and "Netherlands" — because their cumulative probability (14.7%) is the closest to 0.15 C) "United," "Netherlands," and "Czech" — because the cumulative sum doesn't reach 0.15 until Czech is included D) All tokens in the vocabulary — because 0.15 is a very low threshold
Answer: C
- A) — Incorrect. United alone has 10.2% probability, which does not reach the 0.15 (15%) threshold. More tokens must be accumulated.
- B) — Incorrect. United (10.2%) + Netherlands (4.5%) = 14.7%, which still doesn't meet the 0.15 threshold. The cumulative sum must reach or exceed the threshold.
- C) — Correct. Top-P accumulates tokens in descending probability order until the cumulative sum reaches or exceeds the threshold. United (10.2%) + Netherlands (4.5%) = 14.7% — this doesn't yet meet 0.15. So the next most probable token (Czech at 3.1%) is also included, bringing the total to 17.8%, which exceeds 0.15.
- D) — Incorrect. Top-P doesn't include the entire vocabulary. It stops accumulating once the threshold is reached. With three tokens already exceeding 15%, no further tokens are needed.
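The accumulation rule from Q10 can be sketched directly. The first three probabilities come from the question; the fourth entry ("Other" at 2.0%) is a hypothetical token added only to show that accumulation stops once the threshold is met. Real samplers usually renormalize the surviving probabilities before sampling, which is omitted here.

```python
def rank(probs):
    """Tokens sorted from most to least probable."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)

def top_k(probs, k):
    """Keep a fixed number of the most probable tokens."""
    return [token for token, _ in rank(probs)[:k]]

def top_p(probs, p):
    """Keep tokens in descending order until cumulative probability reaches p."""
    chosen, cumulative = [], 0.0
    for token, prob in rank(probs):
        chosen.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return chosen

probs = {
    "United": 0.102,
    "Netherlands": 0.045,
    "Czech": 0.031,
    "Other": 0.020,  # hypothetical extra token, not from the question
}

print(top_k(probs, 3))
print(top_p(probs, 0.15))
```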
Q11. (MCQ) A model's output contains the word "bear" 10 times and the word "fox" 1 time. A frequency penalty and a presence penalty are both applied. Which statement correctly describes how these penalties differ for "bear" vs. "fox"?
A) Both penalties treat "bear" and "fox" identically since both appeared at least once B) Frequency penalty is higher for "bear" than "fox" because it appeared more often; presence penalty is equal for both because it applies the same amount regardless of frequency C) Presence penalty is higher for "bear" because it's more present; frequency penalty is equal for both D) Neither penalty affects "bear" or "fox" because penalties only apply to tokens that haven't appeared yet
Answer: B
- A) — Incorrect. This describes how presence penalty alone works. Frequency penalty specifically scales with how often a token appears — "bear" (10 times) gets a much larger penalty than "fox" (1 time).
- B) — Correct. Frequency penalty linearly lowers a token's logit each time it's repeated — "bear" at 10 repetitions has a significantly higher cumulative penalty than "fox" at 1 repetition. Presence penalty applies only once, lowering a token's logit by the same amount regardless of how often it appears. Since both appeared at least once, both receive the identical presence penalty.
- C) — Incorrect. This reverses the definitions. Presence penalty is the one that's equal for all tokens that appear. Frequency penalty is the one that scales with occurrence count.
- D) — Incorrect. Both penalties affect tokens that have appeared. They discourage reuse of tokens already in the output, not tokens that haven't appeared.
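The difference between the two penalties in Q11 shows up clearly in one common formulation, in which the frequency term is multiplied by the token's count and the presence term is applied once to any token that has appeared. The starting logit and the penalty values below are hypothetical.

```python
def penalized_logit(logit, count, frequency_penalty, presence_penalty):
    """Lower a token's logit based on how it has already been used in the output."""
    frequency_term = count * frequency_penalty               # grows with every repeat
    presence_term = presence_penalty if count > 0 else 0.0   # flat, applied once
    return logit - frequency_term - presence_term

START_LOGIT = 3.0                   # hypothetical logit, same for both words
COUNTS = {"bear": 10, "fox": 1}     # occurrences from the question

frequency_only = {w: penalized_logit(START_LOGIT, c, 0.1, 0.0) for w, c in COUNTS.items()}
presence_only = {w: penalized_logit(START_LOGIT, c, 0.0, 0.5) for w, c in COUNTS.items()}

print("frequency only:", frequency_only)
print("presence only: ", presence_only)
```

With only the frequency penalty, "bear" ends up well below "fox"; with only the presence penalty, both words receive the same reduction.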
Q12. (MCQ) LLM training involves two main stages: pre-training and fine-tuning. Pre-training uses ~10TB of internet text, while Supervised Fine-Tuning (SFT) uses only 10K–30K carefully crafted Q&A pairs. What does each stage accomplish?
A) Pre-training teaches conversation skills; fine-tuning teaches factual knowledge B) Pre-training builds raw knowledge through next-word prediction; fine-tuning aligns the model into a helpful Q&A assistant format C) Pre-training and fine-tuning both use the same datasets but with different learning rates D) Pre-training creates the tokenizer; fine-tuning creates the attention mechanism
Answer: B
- A) — Incorrect. This reverses the roles entirely. Pre-training builds knowledge from massive data; fine-tuning teaches the conversational Q&A format.
- B) — Correct. Pre-training is about knowledge — the model learns to predict the next word from massive internet text, encoding language patterns and world knowledge. Fine-tuning (SFT) is about alignment — swapping the dataset to high-quality Q&A pairs to format the model into a helpful assistant. Fine-tuning acts as a "key" that unlocks the vast knowledge memorized during pre-training, connecting raw knowledge with a conversational format.
- C) — Incorrect. The datasets are fundamentally different: pre-training uses massive, diverse internet text (quantity-focused); fine-tuning uses small, meticulously crafted Q&A conversations (quality-focused).
- D) — Incorrect. The tokenizer and attention mechanism are architectural components designed before training begins. Neither is "created" by a training stage.
Q13. (MCQ) OpenAI uses a Stage 3 training technique where human labelers compare multiple model responses to the same question and rank them by quality. This feedback is used to further optimize the model. What is this technique called?
A) Supervised Fine-Tuning (SFT) B) Reinforcement Learning from Human Feedback (RLHF) C) Reinforcement Learning from AI Feedback (RLAIF) D) Pre-training with next-word prediction
Answer: B
- A) — Incorrect. SFT is Stage 2, where the model is trained on curated Q&A datasets. It doesn't involve comparing multiple responses — it directly trains on correct input-output pairs.
- B) — Correct. RLHF is Stage 3. It relies on the observation that it is often easier to compare answers than to write them: human labelers pick the best of several model responses to the same question, and the model is optimized toward the preferred behavior. OpenAI uses this technique and calls it RLHF.
- C) — Incorrect. RLAIF (the approach behind Anthropic's Constitutional AI) uses LLMs to review and grade responses according to provided rules, not human labelers. The question specifies human labelers doing the comparison.
- D) — Incorrect. Pre-training (Stage 1) uses next-word prediction on internet text. It doesn't involve human comparison of responses.
Q14. (MCQ) The difference between temperature and Top-P sampling is described in the material. Which statement correctly captures this distinction?
A) Temperature limits the candidate pool to a fixed number; Top-P adjusts the probability distribution B) Temperature adjusts the probability distribution of potential tokens; Top-P limits token selection to a finite group based on cumulative probability C) Temperature and Top-P are identical in function but use different numerical scales D) Temperature works only during training; Top-P works only during inference
Answer: B
- A) — Incorrect. This reverses the definitions. Limiting to a fixed number describes Top-K, not temperature. And adjusting the probability distribution is what temperature does, not Top-P.
- B) — Correct. Temperature adjusts the shape of the probability distribution (sharper or flatter), changing relative probabilities. Top-P limits token selection to a finite group whose cumulative probability reaches a threshold. Temperature changes how likely each token is; Top-P changes which tokens are even considered.
- C) — Incorrect. They operate on fundamentally different mechanisms. Temperature transforms the entire distribution; Top-P truncates it at a cumulative threshold. Their effects can be similar but their mechanisms differ.
- D) — Incorrect. Both are inference hyperparameters. Neither is used during training — they both control how the model generates outputs at inference time.
Q15. (MSQ — Select ALL that apply) Which of the following are LLM security attacks described in the material?
A) Jailbreaking — crafting prompts that trick the model into bypassing its safety guardrails B) Prompt injection — hidden instructions embedded in images or documents that hijack the model's behavior C) Data poisoning — an attacker hides crafted text with trigger phrases in training data to cause false outputs D) Gradient attack — directly modifying the model's weights through an external API
Answer: A, B, C
- A) — Correct. Jailbreaking involves crafting prompts (like the "grandmother" roleplay or base64 encoding) that trick the model into ignoring safety guardrails and producing prohibited content.
- B) — Correct. Prompt injection involves hidden instructions in images, documents, or web pages that act as new prompts for the model. Examples include invisible text on web pages that Bing read, and Google Docs containing hidden instructions that exfiltrate data.
- C) — Correct. Data poisoning / backdoor attacks involve an attacker carefully hiding crafted text with a custom trigger phrase (e.g., "James Bond") in training data. When the trigger is encountered at test time, the model outputs random or false information.
- D) — Incorrect. "Gradient attack" as described here is not a security attack in the material. While adversarial suffixes are found via optimization (which involves gradients internally), the material doesn't describe direct weight modification through an API as an attack vector.
Q16. (MCQ) The material describes two types of human thinking: System 1 (fast, instinctive) and System 2 (slow, rational). Historically, LLMs were limited to System 1 thinking. What development introduced System 2 capabilities?
A) Increasing the model's context window to process longer inputs B) Reasoning Models that use inference-time compute to "think" step-by-step internally before answering C) Adding more parameters to make the model larger D) Training exclusively on mathematical datasets
Answer: B
- A) — Incorrect. A larger context window allows processing more input, but doesn't change how the model reasons. System 2 thinking requires deliberate step-by-step reasoning, not just more input capacity.
- B) — Correct. A new generation of Reasoning Models (like OpenAI o1 and Gemini's Thinking mode) introduced System 2 capabilities by using inference-time compute to "think" through problems step-by-step internally before providing an answer. Chain of Thought (CoT) was the first bridge toward System 2, and modern reasoning models have this process built into how they are trained and run.
- C) — Incorrect. More parameters increase the model's knowledge capacity and representation power, but don't inherently introduce deliberate multi-step reasoning. Scaling laws improve performance broadly, not System 2 thinking specifically.
- D) — Incorrect. Mathematical training data helps math performance, but System 2 thinking applies to all complex reasoning (logic, coding, planning), not just math. The innovation is architectural/procedural, not just about training data domain.
Q17. (MCQ) Weights and biases are both configured automatically during AI model training. Suppose a neuron's weighted input alone is insufficient to pass through the activation function. What role does the bias serve?
A) Bias increases the weight of the most important inputs B) Bias is a constant value added to allow neurons to activate when weights alone are insufficient C) Bias randomly perturbs the output to prevent overfitting D) Bias determines the learning rate during backpropagation
Answer: B
- A) — Incorrect. Biases don't modify weights. They are separate constant values added alongside weighted inputs. Weights determine input importance; biases provide an additional threshold adjustment.
- B) — Correct. Biases are constant values added to a signal's value from the previous layers. Models use biases to allow neurons to activate when the weights alone might not be sufficient to pass through the activation function.
- C) — Incorrect. Random perturbation for preventing overfitting describes techniques like dropout, not biases. Biases are deterministic values learned during training, not random noise.
- D) — Incorrect. The learning rate is a separate training hyperparameter that controls step size during gradient descent. Biases are model parameters (not hyperparameters) that shift activation thresholds.
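The role of the bias in Q17 can be illustrated with a single neuron. The sketch assumes a ReLU activation (output is zero unless the pre-activation value is positive); the inputs, weights, and bias are hypothetical numbers.

```python
def neuron(inputs, weights, bias):
    """Weighted sum plus bias, passed through a ReLU activation."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, pre_activation)

inputs = [1.0, 2.0]
weights = [0.5, -0.5]   # weighted sum is 0.5 - 1.0 = -0.5

without_bias = neuron(inputs, weights, bias=0.0)
with_bias = neuron(inputs, weights, bias=1.0)

print(without_bias, with_bias)
```

The weights are identical in both calls; only the constant bias shifts the pre-activation value above the threshold so the neuron produces a non-zero output.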
Q18. (MCQ) The industry is hitting what researchers call the "Data Wall." What does this mean, and how is the industry responding?
A) Models have exceeded their maximum parameter count and can no longer grow B) GPUs have reached their physical computing limits C) High-quality human text on the internet is running out, and pure scaling shows diminishing returns — researchers are pivoting to synthetic data and reasoning-focused training D) Government regulations have restricted access to internet training data
Answer: C
- A) — Incorrect. There's no fundamental maximum parameter count. Models continue to grow (trillions of parameters are being explored). The wall is about data, not model size.
- B) — Incorrect. GPU capability continues to advance. The bottleneck is the data to train on, not the hardware to run training.
- C) — Correct. The industry is running out of high-quality human text on the internet to train on (the Data Wall). Furthermore, pure scaling shows diminishing returns — making a model slightly smarter now requires exponentially more money and compute. Researchers are pivoting to training on high-quality synthetic (AI-generated) data, and teaching models to "think" and reason longer before answering, rather than just blindly scaling up.
- D) — Incorrect. While regulations exist, the Data Wall described in the material is a quantity problem (running out of high-quality human text), not a legal access restriction.
Q19. (MCQ) A Transformer has two fundamental operations that data repeatedly flows through. Attention is the first. What is the second, and what does it do?
A) Tokenization — converting words into numerical IDs B) Feed-forward neural network (Multi-layer Perceptron) — providing extra capacity to store language patterns learned during training C) Backpropagation — adjusting weights based on prediction errors D) Embedding — converting token IDs into dense vectors
Answer: B
- A) — Incorrect. Tokenization is a pre-processing step that happens before data enters the Transformer. It's not one of the two operations data repeatedly flows through inside the architecture.
- B) — Correct. Transformers include a second type of operation known as a feed-forward neural network / Multi-layer Perceptron. This operation gives the model extra capacity to store more patterns about language it learned during training. Data repeatedly flows through many iterations of both attention and feed-forward operations.
- C) — Incorrect. Backpropagation is a training algorithm for adjusting weights based on errors. It's not an operational layer that data flows through during inference.
- D) — Incorrect. Embedding is an initial conversion step (token IDs → vectors). It happens once at the beginning, not repeatedly. The two repeating operations are attention and feed-forward networks.
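The repeated attention-then-feed-forward flow from Q19 can be sketched as a loop over layers. This is a heavily simplified illustration: it omits learned query/key/value projections, multiple heads, biases, and layer normalization, and the weight matrices W1 and W2 are hypothetical values rather than trained parameters.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Operation 1: every vector becomes a weighted mix of all vectors."""
    dim = len(vectors[0])
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        mixed.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return mixed

def feed_forward(vector, w1, w2):
    """Operation 2: a two-layer perceptron applied to one vector at a time."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vector, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def transformer_block(vectors, w1, w2):
    vectors = [add(v, m) for v, m in zip(vectors, attention(vectors))]  # residual add
    return [add(v, feed_forward(v, w1, w2)) for v in vectors]           # residual add

# Hypothetical weights: 2-D vectors, 3 hidden units.
W1 = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
W2 = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1]]
LAYERS = [(W1, W2), (W1, W2)]  # data flows through both operations repeatedly

state = [[2.0, 0.0], [1.0, 1.0]]  # embeddings, produced once before the loop
for w1, w2 in LAYERS:
    state = transformer_block(state, w1, w2)
print(state)
```

Tokenization and embedding happen once before the loop, and backpropagation is not part of this forward pass at all, which is why neither counts as the second repeating operation.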
Q20. (MCQ) Repetition penalty and frequency penalty both discourage a model from reusing tokens. What is the mathematical difference between them?
A) Frequency penalty applies once per token; repetition penalty applies once per sentence B) Frequency penalty lowers a token's logit linearly with each repetition; repetition penalty lowers it exponentially C) Frequency penalty only affects common words; repetition penalty affects all words equally D) They are identical in behavior but use different names across different providers
Answer: B
- A) — Incorrect. This describes the presence penalty (applies once). Frequency penalty applies each time a token repeats, with cumulative linear reduction.
- B) — Correct. Frequency penalty linearly lowers the logit value of a term each time it's repeated. Repetition penalty is similar except it's exponential rather than linear — it lowers a term's logit exponentially each time it's reused, making it a stronger discouragement. For this reason, lower repetition penalty values are recommended.
- C) — Incorrect. Neither penalty distinguishes between common and uncommon words. Both apply to any token that has appeared in the output, regardless of its baseline frequency.
- D) — Incorrect. They are mathematically different (linear vs. exponential scaling). Repetition penalty is explicitly described as a "stronger discouragement" that requires lower values to avoid over-penalizing.
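The linear versus exponential contrast in Q20 can be sketched as follows. This follows the description given in the explanation above and is an assumption about the exact formula: here the repetition penalty divides a positive logit by the penalty once per reuse. Implementations vary between libraries and providers, so treat this as an illustration of the scaling behavior rather than any specific API.

```python
def frequency_penalized(logit, count, penalty):
    """Linear: subtract the same amount for every repetition."""
    return logit - penalty * count

def repetition_penalized(logit, count, penalty):
    """Exponential (assumed form): divide a positive logit by the penalty per repetition."""
    return logit / (penalty ** count)

START_LOGIT = 4.0  # hypothetical positive logit

linear = [frequency_penalized(START_LOGIT, n, 0.5) for n in range(4)]
exponential = [repetition_penalized(START_LOGIT, n, 2.0) for n in range(4)]

print("linear:     ", linear)
print("exponential:", exponential)
```

The linear series drops by a constant step, while the exponential series halves each time, which is why small repetition penalty values already have a strong effect.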
Q21. (MCQ) When ChatGPT is asked to calculate the ratio between funding valuations, it triggers a Python interpreter instead of computing directly. Why does the model delegate to an external calculator?
A) LLMs cannot process numbers in any format B) LLMs are inherently language processors, not math engines — a calculator ensures mathematical soundness rather than statistically probable answers C) The Python interpreter is faster than the LLM at generating text D) OpenAI requires all math queries to use external tools for billing purposes
Answer: B
- A) — Incorrect. LLMs can process and reason about numbers to some degree. The issue isn't that they can't do math at all — it's that their computations are unreliable because they're approximating math through language patterns rather than performing true arithmetic.
- B) — Correct. The model recognizes that it must prioritize computational accuracy. Since LLMs are inherently language processors rather than math engines, it triggers a calculator or Python interpreter to ensure the calculation is mathematically sound rather than just statistically probable. The model was trained to emit special tokens triggering tools when computational accuracy is needed.
- C) — Incorrect. Speed isn't the reason. The Python interpreter might actually be slower than the LLM generating a text-based answer. The advantage is accuracy, not speed.
- D) — Incorrect. Tool delegation is driven by accuracy requirements, not billing. The model autonomously decides when to use tools based on the nature of the task.
Q22. (MCQ) A stop sequence is set to "11" for a model generating a numbered list. What effect does this have?
A) The model stops after generating exactly 11 tokens B) The model stops generating when it encounters the string "11," effectively limiting the list to 10 items C) The model generates items in reverse order starting from 11 D) The model assigns a penalty to the number 11, making it less likely to appear
Answer: B
- A) — Incorrect. Stop sequences are strings, not token counts. The number "11" as a stop sequence means the model stops when it generates that specific text, not after 11 tokens.
- B) — Correct. A stop sequence is a string that stops the model from generating tokens. By adding "11" as a stop sequence, when the model would naturally generate "11." as the eleventh list item, it encounters the stop string and halts — effectively limiting the list to no more than 10 items.
- C) — Incorrect. Stop sequences don't affect ordering. They're termination triggers, not sorting instructions.
- D) — Incorrect. This describes a penalty mechanism, not a stop sequence. Penalties reduce a token's probability; stop sequences immediately terminate generation upon encountering the specified string.
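The stop-sequence behavior in Q22 can be simulated without a model. In this sketch the "model" is just a hypothetical list of text chunks arriving one at a time, and generation halts as soon as the accumulated text contains the stop string; the stop string itself is not returned.

```python
def generate_with_stop(chunks, stop):
    """Accumulate generated text and halt at the first occurrence of the stop string."""
    text = ""
    for chunk in chunks:
        text += chunk
        position = text.find(stop)
        if position != -1:
            return text[:position]  # drop the stop string and everything after it
    return text

# Hypothetical stream: a model that would happily write 14 list items.
stream = [f"{n}. item\n" for n in range(1, 15)]

output = generate_with_stop(stream, stop="11")
print(output)
```

Because the match is on text rather than on a token count, the same stop string would also end generation early if "11" happened to appear inside an earlier list item.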