This is an interesting theoretical exercise but please for the love of god don't actually use an LLM to search tabular data. This is a solved problem. Free software does this with 100% accuracy and insane efficiency.
> where accuracy is paramount
> accuracy: 60%
Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that’s the new normal because we’re trying to shoehorn conversational AI models to every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can’t train finetunes when the model is a proprietary black box on someone else’s computer.) Something about hammers and nails…
They used GPT-4.1 nano; results would likely be quite different with Sonnet or GPT-5.
Bizarre conclusions when all the formats perform poorly, with an average accuracy around 50%. Sure, 60% is better than 40%, but both are unusable if you actually care about numbers...
The test really needed to be run on multiple data sizes (50, 100, 500, 1000, 5000). The more token-efficient formats would probably eventually overtake the token-heavy ones due to context pollution. All this test really says is what performs best for one particular model at one particular context length.
Title says "LLMs" (plural) but they only tested one
> We only tested OpenAI’s GPT-4.1 nano.
I wonder how this compares to a more agentic approach where the LLM composes SQL queries to answer the questions, for example.
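Something along these lines is what I have in mind - a minimal sketch, assuming a SQLite copy of the table and the OpenAI chat completions API; the schema, prompts, and model name here are placeholders, not what the article tested:

```python
import json
import sqlite3

from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

# Hypothetical example table; the article's dataset would be loaded here instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Engineering", 120000), ("Bob", "Sales", 90000)],
)

SCHEMA = "employees(name TEXT, department TEXT, salary INTEGER)"

def answer(question: str) -> str:
    # Step 1: ask the model for SQL only, instead of feeding it the whole table.
    sql = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": f"Write a single SQLite query for the schema {SCHEMA}. Reply with SQL only."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content.strip().strip("`")  # naive fence stripping; a real agent would validate this

    # Step 2: run the query; the database, not the context window, does the searching.
    rows = conn.execute(sql).fetchall()

    # Step 3: let the model phrase the answer from the (small) result set.
    return client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "user", "content": f"Question: {question}\nQuery result: {json.dumps(rows)}\nAnswer briefly."},
        ],
    ).choices[0].message.content

print(answer("Who earns the most?"))
```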
This was exactly my thought. Rather than feed the table directly to the LLM, build agents that extract the data and have the LLM act on the extracted data items. Then it’s a preference issue.
The author didn’t see much more than 60% accuracy, which is not very useful for many (most?) real-world tasks.
That's a cool concept - I'd be curious about a more common setup for agentic data analysis (e.g. for use in Claude Code), like:
* Multiple tasks vs 1
* O3/o3-mini + 4o/4o-mini instead of nano
* Extra credit: Inside a fixed cost/length reasoning loop
E.g.: does the md-kv benefit disappear with the smarter models you'd typically use, and thus just become a 2-3x cost?
I am not an expert on the subject, but I suggest that you can also save context space by using shorter XML element names (like `f` instead of `function`, `c` instead of `class`, etc.). Just add a legend at the top or bottom to explain what each abbreviation means; LLMs can figure out the mapping without issues. I use this approach when generating project structure maps with Tree-sitter. I did a quick comparison and didn't notice much degradation with Claude, so the context space you save may make it worthwhile. I would be interested to see a proper comparison.
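For illustration, a rough sketch of the kind of map I mean - the tag abbreviations, attribute name, and file contents are made up, and plain Python stands in here for the real Tree-sitter extraction:

```python
# Render a (hypothetical) code map with abbreviated tags plus a legend,
# instead of spelling out <function>/<class>/<module> on every node.
LEGEND = {"m": "module", "c": "class", "f": "function"}

# Toy structure standing in for what Tree-sitter would actually extract.
code_map = ("m", "billing.py", [
    ("c", "Invoice", [
        ("f", "total", []),
        ("f", "add_line_item", []),
    ]),
    ("f", "load_invoices", []),
])

def render(node, indent=0):
    tag, name, children = node
    pad = "  " * indent
    inner = "".join(render(child, indent + 1) for child in children)
    if inner:
        return f"{pad}<{tag} n='{name}'>\n{inner}{pad}</{tag}>\n"
    return f"{pad}<{tag} n='{name}'/>\n"

legend = "<!-- legend: " + ", ".join(f"{k}={v}" for k, v in LEGEND.items()) + " -->\n"
print(legend + render(code_map))
```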
Common enough words like `function` and `class` are generally encoded as a single token by the tokenizer and may provide slightly better context to the LLM. For OpenAI models you can test this at https://platform.openai.com/tokenizer
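If you'd rather check programmatically than in the web UI, a small sketch using the tiktoken package (cl100k_base is the GPT-4-era encoding; pick the one matching your model):

```python
import tiktoken

# Compare how many tokens the long and short tag spellings actually cost.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["function", "f", "<function name='foo'>", "<f n='foo'>"]:
    tokens = enc.encode(text)
    print(f"{text!r:28} -> {len(tokens)} tokens: {tokens}")
```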
If both `f` and `function` use one token, are you really saving anything?
Hmmm. I’ve been using YAML data for tables for a while now, and have had pretty good results.
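For concreteness, a sketch of how I'd render rows as YAML (the column names are made up; this just uses PyYAML):

```python
import yaml  # PyYAML

# Hypothetical rows standing in for whatever the real table holds.
rows = [
    {"name": "Alice", "department": "Engineering", "salary": 120000},
    {"name": "Bob", "department": "Sales", "salary": 90000},
]

# sort_keys=False keeps the original column order in the output.
print(yaml.safe_dump(rows, sort_keys=False))
```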
Maybe try an org table?
Interesting. I'm curious how this compares across different model families.
I find this extremely surprising. I would have expected dict structures to have higher semantic context associated with them.