This is an interesting theoretical exercise but please for the love of god don't actually use an LLM to search tabular data. This is a solved problem. Free software does this with 100% accuracy and insane efficiency.
> where accuracy is paramount
> accuracy: 60%
Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that’s the new normal because we’re trying to shoehorn conversational AI models to every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can’t train finetunes when the model is a proprietary black box on someone else’s computer.) Something about hammers and nails…
They used GPT-4.1 nano; results would likely be quite different with Sonnet or GPT-5.
Bizarre conclusions when all the formats perform poorly, with an average accuracy around 50%. Sure, 60% is better than 40%, but both are unusable if you actually care about numbers...
The test really needed to be run on multiple data sizes (50, 100, 500, 1000, 5000). The more token-efficient formats would probably eventually overtake the token-heavy ones due to context pollution. All this test really says is what performs best for one particular model at one particular context length.
Title says "LLMs" (plural) but they only tested one
> We only tested OpenAI’s GPT-4.1 nano.
I wonder how this compares to a more agentic approach where the LLM composes SQL queries to answer the questions, for example.
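Something along these lines is what I have in mind - a minimal sketch, assuming a SQLite copy of the table and the OpenAI chat completions API; the schema, prompts, and model name here are placeholders, not what the article tested:

```python
import json
import sqlite3

from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

# Hypothetical example table; the article's dataset would be loaded here instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Engineering", 120000), ("Bob", "Sales", 90000)],
)

SCHEMA = "employees(name TEXT, department TEXT, salary INTEGER)"

def answer(question: str) -> str:
    # Step 1: ask the model for SQL only, instead of feeding it the whole table.
    sql = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": f"Write a single SQLite query for the schema {SCHEMA}. Reply with SQL only."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content.strip().strip("`")  # naive fence stripping; a real agent would validate this

    # Step 2: run the query; the database, not the context window, does the searching.
    rows = conn.execute(sql).fetchall()

    # Step 3: let the model phrase the answer from the (small) result set.
    return client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "user", "content": f"Question: {question}\nQuery result: {json.dumps(rows)}\nAnswer briefly."},
        ],
    ).choices[0].message.content

print(answer("Who earns the most?"))
```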
This was exactly my thought. Rather than feed the table directly to the LLM, build agents that extract the data and have the LLM act on the extracted data items. Then it’s a preference issue.
The author didn’t see much more than 60% accuracy, which is not very useful for many (most?) real-world tasks.
That's a cool concept - I'd be curious about a more common setup for agentic data analysis (e.g. for use in Claude Code), like:
* Multiple tasks vs 1
* O3/o3-mini + 4o/4o-mini instead of nano
* Extra credit: Inside a fixed cost/length reasoning loop
E.g.: does the md-kv benefit disappear with the smarter models you'd typically use, and thus just become a 2-3x cost?
I am not an expert on the subject, but I suggest that you can also save context space by using shorter XML element names (like `f` instead of `function`, `c` instead of `class`, etc.). Just add a legend at the top or bottom to explain what each abbreviation means; LLMs can figure out the mapping without issues. I use this approach when generating project structure maps with Tree-sitter. I did a quick comparison and didn't notice much degradation with Claude, so the context space you save may make it worthwhile. I would be interested to see a proper comparison.
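For illustration, a rough sketch of the kind of map I mean - the tag abbreviations, attribute name, and file contents are made up, and plain Python stands in here for the real Tree-sitter extraction:

```python
# Render a (hypothetical) code map with abbreviated tags plus a legend,
# instead of spelling out <function>/<class>/<module> on every node.
LEGEND = {"m": "module", "c": "class", "f": "function"}

# Toy structure standing in for what Tree-sitter would actually extract.
code_map = ("m", "billing.py", [
    ("c", "Invoice", [
        ("f", "total", []),
        ("f", "add_line_item", []),
    ]),
    ("f", "load_invoices", []),
])

def render(node, indent=0):
    tag, name, children = node
    pad = "  " * indent
    inner = "".join(render(child, indent + 1) for child in children)
    if inner:
        return f"{pad}<{tag} n='{name}'>\n{inner}{pad}</{tag}>\n"
    return f"{pad}<{tag} n='{name}'/>\n"

legend = "<!-- legend: " + ", ".join(f"{k}={v}" for k, v in LEGEND.items()) + " -->\n"
print(legend + render(code_map))
```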
Common enough words like `function` and `class` are generally encoded as a single token by the tokenizer and may provide slightly better context to the LLM. For OpenAI models you can test this at https://platform.openai.com/tokenizer
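If you'd rather check programmatically than in the web UI, a small sketch using the tiktoken package (cl100k_base is the GPT-4-era encoding; pick the one matching your model):

```python
import tiktoken

# Compare how many tokens the long and short tag spellings actually cost.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["function", "f", "<function name='foo'>", "<f n='foo'>"]:
    tokens = enc.encode(text)
    print(f"{text!r:28} -> {len(tokens)} tokens: {tokens}")
```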
If both `f` and `function` use one token, are you really saving anything?
Hmmm. I’ve been using YAML data for tables for a while now, and have had pretty good results.
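For concreteness, a sketch of how I'd render rows as YAML (the column names are made up; this just uses PyYAML):

```python
import yaml  # PyYAML

# Hypothetical rows standing in for whatever the real table holds.
rows = [
    {"name": "Alice", "department": "Engineering", "salary": 120000},
    {"name": "Bob", "department": "Sales", "salary": 90000},
]

# sort_keys=False keeps the original column order in the output.
print(yaml.safe_dump(rows, sort_keys=False))
```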
Maybe try an org table?
Interesting. I'm curious how this compares across different model families.
I find this extremely surprising. I would have expected dict structures to have higher semantic context associated with them.