I find the word "engineering" used in this context extremely annoying. There is no "engineering" here. Engineering is about applying knowledge, laws of physics, and rules learned over many years to predictably design and build things. This is throwing stuff at the wall to see if it sticks.
I call it "Vibe Prompting".
Even minor changes to models can render previous prompts useless or invalidate assumptions for new prompts.
In today’s episode of Alchemy for beginners!
Reminds me of a time when I found I could speed up an algorithm in a benchmark suite by 30% if I seeded the random number generator with the number 7. Not 8. Not 6. 7.
Should have "(2024)" in the submission title.
The big unlock for me reading this is to think about the order of the output. As in, ask it to produce evidence and indicators before answering a question. Obviously I knew LLMs are probabilistic autocomplete; for some reason, I didn't think to use this for priming.
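For what it's worth, here is a minimal sketch of that ordering as a prompt template; the question and the `call_llm` helper are hypothetical stand-ins, not anything from the tutorial:

```python
# Evidence-first ordering: ask for indicators before the verdict.
# `call_llm` is a hypothetical wrapper around whatever chat API you use.
question = "Is this log excerpt consistent with a memory leak?"

prompt = f"""Before answering, list the specific evidence and indicators
you are relying on, one per bullet point. Only after that, give your
answer in a single sentence starting with "Answer:".

Question: {question}
"""

# answer = call_llm(prompt)
```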
I typically ask it to start with some short, verbatim quotes of sources it found online (if relevant), as this grounds the context in “real” information rather than hallucinations. It works fairly well in situations where this is relevant (I recently went through a whole session of setting up Cloudflare Zero Trust for our org, and this was very much necessary).
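Something like this, roughly; the wording is just a sketch of the quote-first instruction, not a copy of what I actually used:

```python
# Quote-first grounding: force verbatim citations before any recommendation.
prompt = """Start your reply with two or three short, verbatim quotes from the
documentation or sources you found, each followed by its URL. Only after the
quotes, give your recommendation. If you cannot quote a source for a claim,
say so explicitly instead of guessing.

Task: walk me through setting up Cloudflare Zero Trust for our organization.
"""
```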
Note that this is not relevant for reasoning models, since they will think about the problem in whatever order they want before outputting the answer. Since they can “refer” back to their thinking when producing the final answer, the output order matters less for correctness. That relative robustness is likely why OpenAI is trying to force reasoning onto everyone.
This is misleading if not wrong. A thinking model doesn’t fundamentally work any different from a non-thinking model. It is still next token prediction, with the same position independence, and still suffers from the same context poisoning issues. It’s just that the “thinking” step injects this instruction to take a moment and consider the situation before acting, as a core system behavior.
But specialized instructions to weigh alternatives still work better, as the model ends up thinking about how to think, then thinking, then making a choice.
I think you are being misleading as well. Thinking models do recursively generate the final “best” prompt to get the most accurate output. Unless you are genuinely giving new, useful information in the prompt, it is kind of useless to structure the prompt one way or another, because reasoning models can generate the intermediate steps that give the best output. The evidence on this is clear: benchmarks reveal that thinking models are far more performant.
Furthermore, the opposite ordering is very, very bad. Ask it to give you an answer and then justify it, and it will output a random-ish reply and then enter bullshit mode, rationalizing it.
Ask it to objectively list pros and cons from a neutral/unbiased perspective and then proclaim an answer, and you’ll get something that is actually thought through.
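Concretely, the difference between the two orderings looks something like this; the database question is made up purely for illustration:

```python
# Answer-then-justify invites post-hoc rationalization of whatever came out first.
bad_order = """Tell me whether we should use Postgres or DynamoDB for our
multi-tenant backend, then justify your choice."""

# Pros/cons-then-answer makes the conclusion condition on the analysis.
good_order = """From a neutral, unbiased perspective, list the pros and cons of
Postgres and DynamoDB for a multi-tenant backend. Only after weighing them,
state which one you would pick and why."""
```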
It's one year old. Curious how much of it is irrelevant already. Would be nice to see it updated.
This is written for the Claude 3 models (Sonnet, Haiku, Opus). While some lessons will still be relevant today, others will not be useful or necessary on smarter, RL’d models like Sonnet 4.5.
> Note: This tutorial uses our smallest, fastest, and cheapest model, Claude 3 Haiku. Anthropic has two other models, Claude 3 Sonnet and Claude 3 Opus, which are more intelligent than Haiku, with Opus being the most intelligent.
Yes, Chapters 3 and 6 are likely less relevant now. Any others? Specifically assuming the audience is someone writing a prompt that’ll be re-used repeatedly or needs to be optimized for accuracy.
I really struggle to feel the AGI when I read such things. I understand this is all of a year old, and that we have superhuman results in mathematics, basic science, game playing, and other well-defined fields. But why is it difficult to impossible for LLMs to intuit and deeply comprehend what it is we are trying to coax out of them?
Is there an up-to-date version of this, written against their latest models?