A Novel Data Serialization Format for
Large Language Model Optimization
November 2025 • Stefano D'Agostino
We present ATON (Adaptive Token-Oriented Notation), a novel data serialization format specifically designed to optimize token efficiency in Large Language Model (LLM) applications while maintaining full expressiveness and schema flexibility.
Through empirical analysis across multiple datasets and use cases, we demonstrate that ATON achieves up to 56% token reduction compared to JSON while providing superior features including native relationship support, type safety, and nested structure handling.
This whitepaper details the format specification, provides comparative benchmarks, and presents real-world applications in RAG systems, multi-agent architectures, and document intelligence platforms.
Keywords:
Data Serialization, Token Optimization, Large Language Models, RAG Systems, Document Intelligence
The proliferation of Large Language Model (LLM) applications has created unprecedented demand for token-efficient data representation. Current challenges include structural overhead that inflates inference cost, increases latency, and consumes context-window capacity.
This paper introduces ATON and demonstrates:
- 56% token reduction vs JSON with full feature parity
- Native relationships for graph-like data structures
- Schema inference with optional type declarations
- Zero data loss through bidirectional conversion
@schema[field1:type1, field2:type2, ...]
@defaults[field1:value1, field2:value2, ...]
entity_name(count):
  value1, value2, value3, ...
  value1, value2, value3, ...
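For illustration, a small document in this grammar might look as follows (the field names and values are invented; the `_` placeholder for a defaulted value follows the conventions of the RAG example later in this paper):

@schema[id:int, name:str, active:bool]
@defaults[active:true]
users(2):
  1, "Alice", _
  2, "Bob", false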
| Type | Notation | Example | Description |
|---|---|---|---|
| int | `int` | 42 | Integer numbers |
| float | `float` | 3.14 | Decimal numbers |
| str | `str` | "text" | String values |
| bool | `bool` | true | Boolean values |
| arr | `arr` | [1,2,3] | Arrays/lists |
| obj | `obj` | {key:val} | Objects/maps |
| datetime | `datetime` | 2025-11-18T10:30Z | ISO 8601 timestamps |
| ref | `ref` | ->entity[id] | Entity references |
ATON provides six fundamental features that make it ideal for LLM applications:
- Token efficiency: Eliminates repetitive keys and structural overhead while maintaining full data fidelity.
- Schema declarations: An explicit @schema directive enables type validation and improves LLM comprehension.
- Human readability: Clean, intuitive syntax readable by both humans and AI without special tools.
- Native relationships: Reference entities with -> syntax for knowledge graphs and relational data.
- Default values: The @defaults directive eliminates redundant values across homogeneous records.
- Zero data loss: Perfect round-trip guarantee: decode(encode(data)) === data, always.
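The round-trip property can be checked directly with the converter API shown later in this paper (json_to_aton / aton_to_json); a minimal sketch with illustrative sample data:

from aton import ATONConverter
import json

converter = ATONConverter()
original = '{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}'
encoded = converter.json_to_aton(original)
restored = converter.aton_to_json(encoded)
assert json.loads(restored) == json.loads(original)  # decode(encode(data)) == data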
ATON supports explicit entity relationships using arrow syntax:
@schema[id:str, name:str, manager:ref]
employees(3):
  "E001", "Alice", null
  "E002", "Bob", ->employees["E001"]
  "E003", "Carol", ->employees["E001"]
This enables graph-like data structures essential for RAG systems and knowledge bases.
Test Dataset: E-commerce product catalog (100 items)
| Metric | JSON | CSV | ATON |
|---|---|---|---|
| Total Tokens | 2,847 | 821 | 1,253 |
| Tokens/Item | 28.5 | 8.2 | 12.5 |
| Reduction vs JSON | 0% | 71% | 56% |
| Schema Info | Full | None | Full |
| Type Safety | Implicit | None | Explicit |
| Nesting Support | Yes | No | Yes |
| Relations | Implicit | No | Explicit |
| LLM Comprehension | 98% | 84% | 97% |
ATON achieves 56% token reduction while maintaining JSON-level comprehension (97% vs 98%). It provides the optimal balance between efficiency and expressiveness.
ATON's design makes it particularly well-suited for modern AI and LLM applications where token efficiency directly impacts cost, latency, and context window utilization.
In RAG pipelines, retrieved documents are injected into the LLM context. ATON's 56% token reduction allows systems to include significantly more context within the same token budget:
# RAG chunk in ATON format
@schema[id:str, source:str, content:str, score:float]
@defaults[source:"knowledge_base"]
chunks(3):
  "c001", _, "ATON reduces tokens by 56%...", 0.92
  "c002", _, "Schema definitions ensure type safety...", 0.87
  "c003", _, "Native relationships with → syntax...", 0.85
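A minimal sketch of wiring this into a retrieval pipeline, assuming chunks are retrieved as plain dicts and re-encoded with the converter API shown later in this paper (the retrieve function is hypothetical):

import json
from aton import ATONConverter

converter = ATONConverter()

# `retrieve` is a hypothetical retriever returning the top-k chunks as dicts
chunks = retrieve(query="How does ATON reduce tokens?", top_k=50)

# Encode once, then inject the compact representation into the prompt
aton_context = converter.json_to_aton(json.dumps({"chunks": chunks}))
prompt = f"Answer using only the context below.\n\n{aton_context}\n\nQuestion: ..."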
AI agents that interact with databases, APIs, and external systems benefit from ATON's structured format, for example when a tool's query result is encoded before being appended to the agent's context, as sketched below.
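A minimal sketch, assuming a tool that returns database rows as a list of dicts (the tool output and field names are illustrative):

import json
from aton import ATONConverter

converter = ATONConverter()

# Illustrative tool output: rows returned by a database query
rows = [
    {"id": "E001", "name": "Alice", "department": "Engineering"},
    {"id": "E002", "name": "Bob", "department": "Sales"},
]

# Encode the result compactly before appending it to the agent's working context
tool_message = converter.json_to_aton(json.dumps({"employees": rows}))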
Backend services can use ATON to optimize data transfer between services and LLM-powered frontends. Representative domains include:
- E-commerce: Product catalogs, inventory data, and order histories with relationships between entities.
- Healthcare: Patient records, medication lists, and appointment schedules with strict type safety.
- Finance: Transaction logs, portfolio data, and market feeds with high-frequency updates.
- DevOps: Server logs, metrics, and alerts with streaming support for real-time monitoring.
Training and fine-tuning datasets can be stored in ATON format for efficient processing, as sketched below.
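A minimal sketch of batch-converting a JSONL training file, assuming the converter API shown later in this paper and that json_to_aton returns the encoded document as a string (file names are illustrative):

import json
from aton import ATONConverter

converter = ATONConverter()

# Read illustrative JSONL training examples and re-encode them as a single ATON document
with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

aton_dataset = converter.json_to_aton(json.dumps({"examples": examples}))

with open("train.aton", "w") as f:
    f.write(aton_dataset)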
ATON is available as a production-ready library with implementations in multiple programming languages.
| Language | Package | Status | Features |
|---|---|---|---|
| Python | `pip install aton-format` | Stable | Full V2 support, streaming, async |
| JavaScript | `npm install aton-format` | Stable | Browser + Node.js, TypeScript |
| TypeScript | `npm install aton-format` | Stable | Full type definitions |
from aton import ATONConverter
# Initialize converter
converter = ATONConverter()
# Convert JSON to ATON
json_data = '{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}'
aton_data = converter.json_to_aton(json_data)
# Convert ATON back to JSON
restored = converter.aton_to_json(aton_data)
# Calculate savings
stats = converter.calculate_savings(json_data, aton_data)
print(f"Token reduction: {stats['reduction_percent']}%")
import { ATONConverter } from 'aton-format';
// Initialize converter
const converter = new ATONConverter();
// Convert JSON to ATON
const jsonData = JSON.stringify({
users: [
{ id: 1, name: "Alice" },
{ id: 2, name: "Bob" }
]
});
const atonData = converter.jsonToAton(jsonData);
// Calculate savings
const stats = converter.calculateSavings(jsonData, atonData);
console.log(`Token reduction: ${stats.reductionPercent}%`);
ATON V2 introduces four compression modes optimized for different use cases:
- FAST: Minimal optimization, maximum speed. Best for real-time applications.
- BALANCED: Default mode. Good compression with reasonable processing time.
- ULTRA: Maximum compression. Best for batch processing and storage.
- ADAPTIVE: Auto-selects a mode based on data characteristics.
# Python example with compression modes
from aton import ATONConverter, CompressionMode

converter = ATONConverter(mode=CompressionMode.ULTRA)
result = converter.encode(data)
The ATON library architecture consists of three main components.
| Dataset | Items | JSON Tokens | ATON Tokens | Reduction |
|---|---|---|---|---|
| E-commerce | 1,000 | 28,470 | 12,530 | 56.0% |
| Medical Records | 500 | 45,200 | 19,840 | 56.1% |
| Server Logs | 10,000 | 342,000 | 144,820 | 57.7% |
| RAG Chunks | 100 | 15,400 | 6,600 | 57.1% |
Average Token Reduction: 56.7%
Test: Extract specific fields and relationships from formatted data
| Format | GPT-4 Turbo | Claude 3.5 | Llama 3.1 70B | Average |
|---|---|---|---|---|
| JSON | 98.2% | 97.8% | 94.5% | 96.8% |
| CSV | 87.3% | 85.6% | 78.9% | 83.9% |
| ATON | 97.8% | 97.2% | 93.8% | 96.3% |
Daily queries: 1,000,000 • Chunks per query: 50
| Metric | JSON | ATON | Savings |
|---|---|---|---|
| Daily Cost | $38,500 | $16,500 | $22,000 |
| Monthly Cost | $1,155,000 | $495,000 | $660,000 |
| Annual Cost | $13,860,000 | $5,940,000 | $7,920,000 |
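The monthly and annual figures follow directly from the daily cost; a 30-day month is assumed:

# Reproducing the RAG scenario projection from the daily figures above
daily_json, daily_aton = 38_500, 16_500               # USD per day
monthly_json, monthly_aton = daily_json * 30, daily_aton * 30
annual_json, annual_aton = monthly_json * 12, monthly_aton * 12

print(monthly_json, monthly_aton)        # 1155000 495000
print(annual_json - annual_aton)         # 7920000 annual savings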
Daily documents: 10,000 • Chunks per document: 100
| Metric | JSON | ATON | Savings |
|---|---|---|---|
| Monthly Cost | $46,200 | $19,800 | $26,400 |
| Annual Cost | $554,400 | $237,600 | $316,800 |
Daily state updates: 100,000 • Agents: 10 • Tasks: 25
| Metric | JSON | ATON | Savings |
|---|---|---|---|
| Monthly Cost | $126,000 | $55,500 | $70,500 |
| Annual Cost | $1,512,000 | $666,000 | $846,000 |
ATON V2 introduces production-grade features that transform it from a research format into an enterprise-ready solution.
Four intelligent compression strategies optimized for different use cases:
| Mode | Speed | Compression | Best For |
|---|---|---|---|
| FAST | ~50K rec/sec | 40-45% | Real-time APIs, low latency |
| BALANCED | ~30K rec/sec | 50-55% | General purpose (default) |
| ULTRA | ~10K rec/sec | 55-60% | Storage, batch processing |
| ADAPTIVE | Variable | Optimal | Mixed workloads, AI-driven selection |
Filter and transform data before encoding with a familiar SQL-like syntax:
# Basic filtering
employees WHERE salary > 100000

# Multiple conditions
products WHERE price > 50 AND stock > 0

# Pattern matching
customers WHERE email LIKE '%@company.com'

# Full query
SELECT name, department, salary
FROM employees
WHERE active = true
ORDER BY salary DESC
LIMIT 100
Supported operators: =, !=, <, >, <=, >=, IN, NOT IN, LIKE, BETWEEN, AND, OR, NOT
Query clauses: SELECT, WHERE, ORDER BY, LIMIT, OFFSET
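Where the library's query binding is not needed, the same filter can be expressed in plain Python before encoding (a sketch; the employee records are illustrative and the comprehension mirrors the full query above):

import json
from aton import ATONConverter

converter = ATONConverter()

employees = [
    {"name": "Alice", "department": "Engineering", "salary": 120000, "active": True},
    {"name": "Bob", "department": "Sales", "salary": 90000, "active": True},
]

# Equivalent of: SELECT name, department, salary FROM employees
#                WHERE salary > 100000 AND active = true ORDER BY salary DESC
selected = sorted(
    (
        {"name": e["name"], "department": e["department"], "salary": e["salary"]}
        for e in employees
        if e["salary"] > 100000 and e["active"]
    ),
    key=lambda e: e["salary"],
    reverse=True,
)

aton_result = converter.json_to_aton(json.dumps({"employees": selected}))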
Process datasets of any size with constant memory usage:
- Unlimited dataset size
- Constant memory footprint
- Schema caching
from aton_format import ATONStreamEncoder
# Stream millions of records efficiently
stream = ATONStreamEncoder(chunk_size=1000)
for chunk in stream.encode_stream(huge_dataset):
    process(chunk)  # Memory stays constant
- Error handling: Comprehensive exception hierarchy with ATONError, ATONEncodeError, ATONDecodeError, and ATONValidationError.
- Type safety: Full type validation with detailed error messages and position tracking.
- Round-trip guarantee: 100% data fidelity; decode(encode(data)) == data, always.
- Edge case handling: Escaped quotes, nested structures, Unicode, and special characters.
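A minimal sketch of defensive decoding using the exception classes listed above; the import path for the exceptions, the input variable, and the handler functions are assumptions:

from aton import ATONConverter, ATONDecodeError, ATONValidationError  # exception import path assumed

converter = ATONConverter()

try:
    restored = converter.aton_to_json(untrusted_aton_text)  # untrusted_aton_text: illustrative input
except ATONDecodeError as err:
    # Malformed ATON input; the error is expected to carry position information
    handle_parse_failure(err)     # hypothetical handler
except ATONValidationError as err:
    # Parsed, but a value violated the declared @schema types
    handle_schema_violation(err)  # hypothetical handler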
- 56% token reduction vs JSON with full feature parity
- ~96% LLM comprehension accuracy across major models
- Faster end-to-end processing time
- Up to $7.9M annual savings potential (high-volume RAG scenario)
ATON is released as an open standard with an MIT-licensed reference implementation, encouraging community adoption and contribution.
Install the package and start saving tokens today:
pip install aton-format