Are there special prompts you find effective?
The way to resolve it on most models over a certain size is a common tactic used with LLMs: ask the LLM to "think through your answer first". For example, you have a system prompt akin to: "Before answering, think through the facts and brainstorm about your eventual answer in a scratchpad, then give your final answer."
In my current evals (based around numerous similar tricky factual questions) this tactic works on all but the smallest and least proficient models (since they don't tend to have strong enough factual knowledge to think it through). Forcing the models to answer simply 'yes' or 'no' yields a correct answer only on the SOTA models (but some speculate GPT-4o might actually be doing this sort of 'thinking' process on the backend automatically anyway).
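A minimal sketch of that "think first, then answer" tactic using the OpenAI Python SDK; the exact prompt wording, model name, and question are illustrative assumptions, not the commenter's actual setup:

```python
# Sketch of the think-then-answer tactic: the system prompt forces a
# scratchpad pass before the final answer, and permits "I don't know".
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Before answering, think through the relevant facts and brainstorm about "
    "your eventual answer inside a <scratchpad> block. Then give your final "
    "answer on a new line starting with 'Answer:'. If the facts you recalled "
    "do not support a confident answer, say 'I don't know'."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model; any capable chat model works
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Was the Eiffel Tower completed before the Brooklyn Bridge?"},
    ],
)

# The scratchpad can be stripped out before showing the reply to end users.
print(response.choices[0].message.content)
```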
Otherwise it does what humans do when asked interview questions: they bullshit, because if you bullshit there is a 20% chance of landing the job, whereas if you say "I don't know" there is a 0% chance of landing the job. The kind of RLHF training that went into ChatGPT probably replicates a similar reward structure.
2. Explicitly call out null conditions (e.g. return { "results": [] })
3. Use multiple prompts, one to “think”/explain and then one to transform the result
4. Don't use function calling to get structured output, just use JSON mode (see the sketch after this list)
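A minimal sketch combining points 2 and 4: plain JSON mode rather than function calling, with the null condition spelled out in the prompt. It uses the OpenAI Python SDK; the model name, schema, and prompt wording are assumptions for illustration:

```python
# JSON mode instead of function calling, with the empty case made explicit.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Extract every product mention from the user's text and respond with JSON "
    'of the form {"results": [{"name": ..., "price": ...}]}. '
    'If there are no product mentions, return exactly {"results": []}.'
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model
    response_format={"type": "json_object"},  # JSON mode, no function calling
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "No products here, just a greeting."},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data["results"])  # expected: [] thanks to the explicit null condition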
One non-obvious trick we use is to tell the LLM what it said previously as system messages, not just as user messages, even if the LLM didn't actually output that specific text.
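A minimal sketch of that trick, again with the OpenAI Python SDK; the "previous" text, role wording, and model name are assumptions for illustration:

```python
# Feed the model a cleaned-up version of "its own" earlier conclusion as a
# system message, even though it never literally produced that text.
from openai import OpenAI

client = OpenAI()

previous_summary = (
    "Earlier you concluded that the contract requires 30 days' notice "
    "for termination."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model
    messages=[
        {"role": "system", "content": "You are a contract review assistant."},
        {"role": "system", "content": previous_summary},  # injected as system, not user
        {"role": "user", "content": "Given that, what should I do if I want to exit next month?"},
    ],
)

print(response.choices[0].message.content)
```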
So I am asking 'it' to create a table (instead of just a list of questions) that would include: 1a) suggested control 1b) example of evidence that would qualify/pass/fail the control 2) article of law (e.g. Article 5 paragraph 10) 3) short quote from the article
Then I ask it to check its own output, and change 3) to add the full text/the whole paragraph.
99% of it is correct, and it is easier to scroll through and check with my own eyes that the 'paragraph' is the same
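A minimal sketch of that two-pass flow as API calls; the commenter does this interactively in ChatGPT, and the prompt wording and model name here are assumptions:

```python
# Pass 1: generate the table. Pass 2: feed the table back and ask the model
# to check itself and expand each short quote into the full paragraph.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed model

first = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": (
            "Create a table with columns: suggested control, example of "
            "evidence that would qualify/pass/fail the control, article of "
            "law (e.g. Article 5 paragraph 10), and a short quote from that "
            "article."
        ),
    }],
)
table = first.choices[0].message.content

second = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": (
            "Check your own output below and replace each short quote with "
            "the full text of the cited paragraph.\n\n" + table
        ),
    }],
)
print(second.choices[0].message.content)
```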
They are called transformers for a reason.