Evolution of compute cost to tokens
22.4.2026
This was done using Big Pickle in Opencode.
Between March 2023 and April 2026, the cost of accessing AI language models has dropped by approximately 99.7%. What started as a premium service charging $30-60 per million tokens has transformed into a commodity market where quality AI access costs less than a penny per million tokens. This article traces the evolution, examines the drivers, and explores what comes next.
In March 2023, OpenAI launched GPT-4 at $30 per million input tokens and $60 per million output tokens. This was the most capable AI model ever released, and the price reflected that prestige. For the first time, developers could access GPT-4 class intelligence through an API—but at a steep cost.
A typical 1,000-token request (roughly 750 words) cost approximately:
- Input: $0.03
- Output: $0.06
For startups building AI applications, this meant every user conversation could cost 3-6 cents in API fees alone. Many companies found that scaling AI features would quickly become prohibitively expensive.
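The arithmetic behind those figures is simple enough to sketch as a quick Python helper (the prices are the launch rates quoted above):

```python
# Cost of a single request at per-million-token rates.
def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Return the dollar cost of one API request."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

# GPT-4 at launch (March 2023): $30/M input, $60/M output.
cost = request_cost(1_000, 1_000, 30.00, 60.00)
print(f"${cost:.2f}")  # → $0.09 (3 cents in + 6 cents out)
```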
First Wave: Competition Enters the Market (2023-2024)
June 2023 – The First Cuts
OpenAI reduced GPT-3.5 pricing, but the market was about to get more competitive.
July 2023 – Claude 2 Enters
Anthropic released Claude 2 at $8 per million input tokens and $24 per million output tokens—significantly cheaper than GPT-4 while offering competitive capabilities.
December 2023 – Google Crashes the Party
Google launched Gemini 1.0 Pro at an astonishing $0.50 per million input tokens—60x cheaper than GPT-4. This marked the beginning of the price war.
March 2024 – The Budget Tier Emerges
Claude 3 Haiku launched at $0.25 per million input tokens, bringing entry-level AI costs down to fractions of a cent.
Second Wave: The Efficiency Revolution (2024-2025)
May 2024 – OpenAI Responds
GPT-4o launched at $5 per million input tokens—an 83% reduction from GPT-4. Gemini 1.5 Flash launched at $0.35 per million input tokens.
July 2024 – Mini Models Take Over
GPT-4o mini debuted at just $0.15 per million input tokens—200x cheaper than GPT-4 from 17 months earlier. This established the “mini” model tier as the new budget option.
August 2024 – Flash Drives to Near-Zero
Gemini 1.5 Flash reduced to $0.075 per million input tokens—cheaper than any competitor and effectively free for most use cases.
December 2024 – DeepSeek Disrupts
DeepSeek V3 launched at $0.14 per million input tokens, demonstrating that open-source models could match proprietary quality at a fraction of the cost.
Third Wave: The Commoditization (2025-2026)
Early 2025 – Flash Wars
Gemini 2.0 Flash launched at $0.10 per million input tokens. OpenAI responded with GPT-4.1 nano at $0.10 per million input tokens.
March 2025 – Frontier Gets Cheaper
GPT-4o reduced to $2.50 per million input tokens (from $5). Claude 3.5 Sonnet reduced to $3.00 per million input tokens (from $15).
Early 2026 – Current State
- GPT-5.4: $2.50/$10 per million tokens
- Claude Opus 4.6: $5.00/$25 per million tokens
- Gemini 2.5 Pro: $1.25/$10 per million tokens
- Gemini 2.0 Flash-Lite: $0.075/$0.30 per million tokens
- DeepSeek V3.2: $0.28/$0.42 per million tokens
The Numbers: A Timeline View
| Period | Model | Input $/1M | Output $/1M | Reduction |
|---|---|---|---|---|
| Mar 2023 | GPT-4 | $30.00 | $60.00 | Baseline |
| Jul 2023 | Claude 2 | $8.00 | $24.00 | 73% |
| Dec 2023 | Gemini 1.0 Pro | $0.50 | $1.50 | 98% |
| Mar 2024 | Claude 3 Haiku | $0.25 | $1.25 | 99% |
| Jul 2024 | GPT-4o mini | $0.15 | $0.60 | 99.5% |
| Aug 2024 | Gemini 1.5 Flash | $0.075 | $0.30 | 99.75% |
| 2026 | Gemini 2.0 Flash-Lite | $0.075 | $0.30 | 99.75% |
Total reduction from peak: ~99.7%
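The "Reduction" column above is simply each input price measured against the $30 GPT-4 baseline:

```python
# Percent reduction in input price relative to the GPT-4 baseline.
baseline = 30.00  # GPT-4 input price, $/M tokens, March 2023

for model, price in [("Claude 2", 8.00), ("Gemini 1.0 Pro", 0.50),
                     ("GPT-4o mini", 0.15), ("Gemini 1.5 Flash", 0.075)]:
    reduction = (1 - price / baseline) * 100
    print(f"{model:18s} {reduction:.2f}%")
```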
Why Prices Fell: Three Major Forces
1. Intense Competition
When GPT-4 launched in 2023, OpenAI had almost no competition. By 2026, there are 15+ major providers competing for market share. This drove prices down relentlessly.
2. Hardware Efficiency
NVIDIA’s GPU improvements, custom AI accelerators from Google (TPU), and specialized inference chips from multiple vendors increased throughput 10-100x while reducing per-token compute costs.
3. Model Architecture Advances
- Mixture of Experts (MoE): Only activates relevant “experts” per token, dramatically reducing compute
- Knowledge distillation: Smaller models that perform nearly as well as larger ones
- Quantization: More efficient numeric representations reduce memory and compute needs
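A rough illustration of why quantization matters: weight memory scales linearly with bytes per parameter, so halving the precision halves the memory (and, roughly, the compute) needed to serve a model. The 70B parameter count below is a hypothetical example:

```python
# Approximate weight memory at different numeric precisions,
# assuming weights dominate a model's memory footprint.
def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9

params = 70e9  # a hypothetical 70B-parameter model
print(f"fp16: {weight_memory_gb(params, 2):.0f} GB")    # → 140 GB
print(f"int8: {weight_memory_gb(params, 1):.0f} GB")    # → 70 GB
print(f"int4: {weight_memory_gb(params, 0.5):.0f} GB")  # → 35 GB
```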
The Paradox: Why Your AI Bill Still Grew
Here’s the surprising twist: despite per-token prices dropping 99.7%, many organizations report higher total AI spending in 2025-2026.
Why? Three factors:
- Usage Explosion: When prices fall, usage grows exponentially. Companies build AI into more products.
- Agentic Workloads: AI agents use 10-50x more tokens than simple chat. A single agentic task might involve:
- System prompts: 1,000-5,000 tokens
- Tool definitions: 500-3,000 tokens
- Retrieved context: 1,000-20,000 tokens
- Multiple LLM calls: 5-20+ per task
- Hidden Costs: Tool calls, vector storage, orchestration, monitoring—these add 2-3x to apparent token costs.
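Putting those ranges together gives a back-of-envelope estimate (midpoint figures, and assuming the full context is re-sent on every call, which is typical for simple agent loops):

```python
# Back-of-envelope token budget for one agentic task, using
# midpoints of the ranges above (all figures are illustrative).
system_prompt  = 3_000   # tokens
tool_defs      = 1_750
retrieved_ctx  = 10_000
calls_per_task = 12

# Assume the full context is re-sent on every LLM call.
tokens_per_call = system_prompt + tool_defs + retrieved_ctx
total_tokens    = tokens_per_call * calls_per_task
print(total_tokens)  # → 177000 input tokens for a single task
print(f"${total_tokens * 0.50 / 1_000_000:.4f} at $0.50/M input")
```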
What This Means for Developers
The Economics Have Transformed
| Task | 2023 Cost | 2026 Cost |
|---|---|---|
| 1M tokens processed | $30-60 | $0.075-3.00 |
| 1 hour @ 100 tok/sec | $10.80-21.60 | $0.027-1.08 |
| 10,000 daily users (100 msg/day) | $30,000-60,000/mo | $75-3,000/mo |
Best Practices for 2026
- Start with cheap models: Gemini Flash or GPT-4.1 nano are “good enough” for 80% of tasks
- Use routing: Route simple queries to cheap models, complex ones to premium
- Optimize prompts: Smaller prompts = direct cost savings
- Cache aggressively: Most LLM calls have significant redundancy
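A minimal sketch of such a router, assuming a crude length-and-keyword heuristic (the model identifiers, markers, and threshold below are placeholders, not real API names):

```python
# Route requests to a cheap or premium model based on a crude
# complexity heuristic; names and thresholds are illustrative.
CHEAP_MODEL   = "cheap-flash-model"    # placeholder identifier
PREMIUM_MODEL = "premium-model"        # placeholder identifier

def classify_complexity(prompt: str) -> str:
    """Very rough heuristic: long prompts or reasoning keywords
    go to the premium tier; everything else stays cheap."""
    hard_markers = ("prove", "analyze", "step by step", "debug")
    if len(prompt) > 2_000 or any(m in prompt.lower() for m in hard_markers):
        return PREMIUM_MODEL
    return CHEAP_MODEL

print(classify_complexity("What's the capital of France?"))  # → cheap-flash-model
print(classify_complexity("Debug this stack trace ..."))     # → premium-model
```

In production this heuristic would be replaced by a small classifier or by escalating only when the cheap model's answer fails validation.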
The Future: What’s Next?
Price Predictions (2026-2028)
Industry analysts project another 30-50% reduction in token prices over the next 2-3 years, driven by:
- Next-generation hardware: Blackwell GPUs, custom silicon
- Agentic optimization: Models designed for agents, not just chat
- Open-source pressure: DeepSeek, Llama, and others pushing prices down
Emerging Trends
- Per-task pricing: Pay based on actual work done, not just tokens
- Hybrid inference: Combining cloud and local for cost-efficiency
- Specialized models: Cheap models for specific domains (coding, math, etc.)
Key Takeaways
- 99.7% price reduction: From $30/M to $0.075/M input tokens in 3 years
- Competition drove prices: Multiple major providers fighting for market share
- Hardware + architecture: MoE, quantization, and specialized chips enabled efficiency
- Usage grew despite lower prices: Total AI spending increased 3x from 2024-2025
- The frontier holds: Best models still cost premium—price decline is at “commodity” tier
Conclusion
The transformation from compute costs to token costs represents one of the fastest price declines in technology history. What was once a premium service accessible only to well-funded startups is now a utility available to anyone with a credit card—and in many cases, free.
For developers and businesses, this means AI is no longer a cost center to optimize around—it’s a building block to design with. The question is no longer “can we afford AI?” but rather “how quickly can we integrate it?”
The era of expensive AI is over. The era of abundant, cheap, capable AI is here.
…..
What about tokens specifically in the medical world? Well, I am in that world.
Medical-Specific AI Models
Google MedLM (Med-PaLM 2)
Background:
- Developed by Google Research
- First AI system to pass USMLE-style medical exams (67.6% for Med-PaLM in 2022, 86.5% for Med-PaLM 2 in 2023)
- Powers MedLM family on Google Cloud Vertex AI
- Based on PaLM architecture, fine-tuned for medical domain
Availability:
- MedLM-medium: Faster throughput, more recent data
- MedLM-large: More capable, complex tasks
- Access: Restricted to allowlisted customers (US, Brazil, Singapore)
- Requires Google Cloud account team contact
Important Limitation:
- NOT designed for clinical diagnosis
- Output must be verified by healthcare professionals
- Non-clinical use only under Service Specific Terms
- HIPAA-compliant but requires proper BAA
Pricing:
- Uses Google Vertex AI pricing structure
- Pay-per-character basis
- Custom pricing through enterprise contracts
- No public consumer pricing
Amazon HealthScribe
Background:
- AWS service for clinical documentation
- AI-powered medical scribe
- Converts patient-provider conversations to clinical notes
- HIPAA-eligible
Pricing (Pay-as-you-go):
- $0.001667 per audio second
- ~$1.50 for a 15-minute consultation
- No retention of audio or output text
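At that per-second rate, the per-consultation figure follows directly:

```python
# HealthScribe-style pay-per-audio-second pricing, from the rate above.
RATE_PER_SECOND = 0.001667  # USD per audio second

def consult_cost(minutes):
    """Dollar cost of transcribing a consultation of the given length."""
    return minutes * 60 * RATE_PER_SECOND

print(f"${consult_cost(15):.2f}")  # → $1.50 for a 15-minute consultation
```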
Features:
- Real-time transcription
- Medical terminology recognition
- Integration with Epic EHR
- Privacy-focused (data not used for training)
Microsoft Dragon Copilot
Background:
- Built into Microsoft Cloud for Healthcare
- Ambient clinical documentation
- Voice-enabled AI assistant
Pricing:
- $600-800 per month per provider
- 1-3 year contracts
- Integrated with Microsoft 365
Features:
- Real-time documentation
- Clinical decision support
- Integration with existing EHR systems
Healthcare Natural Language API (Google)
Purpose: Extract healthcare entities from medical text
Pricing (Text Record Volume):
- Based on text record monthly volume
- Character-based tiering
- First 25,000 records free
- Enterprise pricing for larger volumes
Cloud Healthcare API (Google)
Purpose: FHIR, HL7, DICOM interoperability
Pricing:
- Storage: $0.004/GB/month (structured)
- Requests:
  - First 25,000/month: Free
  - 25,001-1M: $0.39/100,000
  - 1M+: $0.29/100,000
- HIPAA-eligible
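The tiered request rates can be turned into a small calculator (a sketch based only on the rates listed here; actual Google Cloud billing may differ in rounding and tier boundaries):

```python
# Tiered monthly request pricing: free tier, then two rates
# billed per 100,000 requests, per the figures above.
def monthly_request_cost(requests):
    cost = 0.0
    # Requests 25,001 through 1,000,000 at $0.39 per 100k.
    tier1 = min(max(requests - 25_000, 0), 1_000_000 - 25_000)
    cost += tier1 / 100_000 * 0.39
    # Requests beyond 1,000,000 at $0.29 per 100k.
    tier2 = max(requests - 1_000_000, 0)
    cost += tier2 / 100_000 * 0.29
    return cost

print(f"${monthly_request_cost(2_000_000):.2f}")  # ≈ $6.70/month
```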
Comparison: Healthcare AI vs. General AI
| Aspect | General AI | Healthcare AI |
|---|---|---|
| Price | $0.075-30/M tokens | Custom/Enterprise |
| HIPAA | Not required | Required |
| FDA/Reg | None | May apply |
| Training Data | General web | Medical literature |
| Accuracy | Good | Medical-grade |
| Liability | User’s risk | Significant |
Cost Considerations for Healthcare Organizations
Direct Costs
API/Usage Fees:
- Per-character or per-token pricing
- Volume-based enterprise discounts
- Monthly minimums common
Integration:
- EHR integration: $10,000-100,000+
- Custom development: $50,000-500,000+
- Maintenance: $10,000-50,000/year
Hidden Costs
Compliance:
- HIPAA Business Associate Agreement
- Security audits
- Privacy impact assessments
Training:
- Staff training on AI tools
- Workflow re-engineering
- Ongoing quality monitoring
Verification:
- Clinician review time
- Error tracking and correction
- Liability insurance
Medical AI Pricing Examples
Scenario 1: Small Practice (2 Providers)
| Service | Monthly Cost |
|---|---|
| Dragon Copilot | $1,200-1,600 |
| EHR integration | $200-500 |
| Training | $500 |
| Total | $1,900-2,600/month |
Scenario 2: Hospital System (50 Providers)
| Service | Monthly Cost |
|---|---|
| MedLM (API) | $5,000-15,000 |
| HealthScribe | $3,000-5,000 |
| Infrastructure | $10,000-20,000 |
| Integration | $5,000-10,000 |
| Training | $2,000-5,000 |
| Total | $25,000-55,000/month |
Medical AI vs. General AI: Key Differences
Why Healthcare AI Costs More
- Specialized Training
- Medical literature, clinical notes
- USMLE, medical board exams
- Domain-specific fine-tuning
- Regulatory Compliance
- HIPAA required
- FDA device classification may apply
- Audit trails and logging
- Liability
- Medical malpractice exposure
- Requires human oversight
- Error documentation
- Integration Complexity
- EHR compatibility
- Clinical workflows
- Data standards (FHIR, HL7)
Leading Healthcare AI Providers
By Category
| Category | Provider | Product | Pricing |
|---|---|---|---|
| Foundation Models | Google | MedLM | Enterprise |
| Clinical Docs | Microsoft | Dragon Copilot | $600-800/user/mo |
| Clinical Docs | Amazon | HealthScribe | $0.001667/audio-sec |
| Transcription | Nuance | DAX | Enterprise |
| CDSS | IBM | Watson Health | Enterprise |
| Imaging | Google | Med-PaLM | Enterprise |
The Future of Healthcare AI Pricing
Trends to Watch (2026-2028)
- Commoditization: General AI models may meet medical accuracy standards at lower costs
- Specialization: More domain-specific medical models at competitive prices
- Regulation: FDA guidance on LLM medical devices may increase compliance costs but add clarity
- Competition: More players entering healthcare AI could drive prices down 30-50%
Predictions
- Healthcare AI to follow general AI pricing trajectory (60-80% reduction over 3 years)
- More SaaS pricing models replacing enterprise contracts
- Per-visit or per-patient pricing models emerging
- Insurance reimbursement for AI-assisted care
Key Takeaways for Healthcare Organizations
- Don’t use consumer AI for clinical decisions: General AI models lack medical training and liability coverage
- Budget for more than API costs: Integration, training, compliance, and verification add 2-5x to direct costs
- Look for HIPAA-eligible services: Ensure proper BAA agreements with vendors
- Plan for human oversight: All current healthcare AI requires clinician verification
- Start with non-clinical uses: Documentation, scheduling, billing AI have lower risk
- Enterprise pricing is negotiable: Contact vendors directly, especially for larger deployments
Conclusion
Healthcare AI is following a similar trajectory to general AI but with unique challenges. While prices haven’t collapsed as dramatically as consumer AI, the trend is downward. Organizations should:
- Evaluate specific use cases
- Start with low-risk applications
- Plan for human-in-the-loop workflows
- Budget comprehensively beyond API costs
The medical AI market will mature rapidly through 2026-2028, with prices likely following the broader market’s 60-80% reduction pattern as competition intensifies and regulatory clarity improves.
…
…
Why use tokens? – this was using DeepSeek.
Great question! Using tokens instead of raw characters or whole words is a deliberate design choice in LLMs. Here’s why tokens are used:
1. Efficiency (Balancing vocabulary size vs. sequence length)
If we used characters:
- Each character = 1 unit
- “Hello” = 5 units (short sequence, good)
- But vocabulary is tiny (only 26 letters + punctuation + digits ≈ 50-100 characters)
- Problem: The model would need very long sequences to understand meaning → slow and hard to learn long-range patterns
If we used whole words:
- Each word = 1 unit
- “Hello” = 1 unit (short sequence, good)
- But vocabulary would be huge (English has 500,000+ words) → massive memory and computation
- Problem: Can’t handle misspellings, new words, or rare words (“out-of-vocabulary” problem)
Tokens are the sweet spot:
- Subword tokens (like “play” + “ing”) keep vocabulary manageable (typically 50,000-100,000 tokens)
- Sequence length is reasonable (not too short, not too long)
- Can handle novel words by breaking them into known subwords
2. Handling unknown words gracefully
With word-based models, “misunderestimated” (not a real word) = unknown → error.
With tokens: ["mis", "under", "est", "imated"] → each part is recognized → model can guess meaning from parts.
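A toy greedy longest-match tokenizer makes this concrete (the vocabulary below is hand-picked for the example; real BPE/WordPiece tokenizers learn their vocabularies from data):

```python
# Toy longest-match subword tokenizer over a hand-picked vocabulary.
VOCAB = {"mis", "under", "est", "imated", "play", "ing", "the"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry matching at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("misunderestimated"))  # → ['mis', 'under', 'est', 'imated']
print(tokenize("playing"))            # → ['play', 'ing']
```

Even the made-up word decomposes into known pieces, so the model never hits an "unknown word" error.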
3. Computational constraints
LLMs use self-attention (Transformer architecture), which has O(n²) complexity — doubling the sequence length quadruples the computation.
- Characters: 500 characters = 500² = 250,000 operations
- Tokens: 100 tokens = 100² = 10,000 operations (25x faster!)
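That quadratic scaling is easy to verify:

```python
# Self-attention cost grows quadratically with sequence length.
def attention_ops(seq_len):
    return seq_len ** 2

chars, toks = attention_ops(500), attention_ops(100)
print(chars, toks, chars // toks)  # → 250000 10000 25
```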
4. Learned representations
Tokens aren’t just arbitrary — the tokenizer is trained to find meaningful chunks:
- Common words (the, and, of) → single tokens
- Common affixes (un-, -ing, -ed) → separate tokens
- Spaces and punctuation → separate tokens
This helps the model learn patterns like “un-” often means “opposite of” across many words.
Analogy: Shipping packages
- Characters = shipping each grain of rice individually (too many packages)
- Words = shipping each rice cooker as one package (too many different box sizes)
- Tokens = using standard box sizes that fit 1-3 grains of rice each (efficient!)
Real-world impact of tokenization
Different tokenizers can affect performance—each provider trains its own vocabulary:
- GPT-4 (OpenAI): ~100K-token vocabulary
- Llama 3 (Meta): ~128K-token vocabulary
- Gemini (Google): ~256K-token vocabulary
Some languages (e.g., Chinese, Japanese) require more tokens per meaning because characters don’t map neatly to subwords.
…
Bila kata menjadi harga, belum tentu rasional menjadi nyata. (When words become price, the rational does not necessarily become real.)
lebih banyak kata lebih banyak harga. (more words, more price.)
tenaga, harga, rupa, belanja, yuran, nilai dan tol, (energy, price, appearance, spending, fees, value and tolls,)
semua menjadi harga kepada rasa. (all become a price upon feeling.)
..
interpreted by ChatGPT
“Bila kata menjadi harga, belum tentu rasional menjadi nyata.”
When words are priced or treated as commodities, rationality doesn’t automatically follow. Just because something is measured, billed, or quantified doesn’t mean it becomes more truthful or logical. It hints at a disconnect between valuation and understanding.
“lebih banyak kata lebih banyak harga.”
This points to a simple but powerful dynamic: verbosity can inflate cost. It could be literal (paying per word, per token) or symbolic—longer explanations, more effort, more “weight.” But it also questions whether more actually means better.
“tenaga, harga, rupa, belanja, yuran, nilai dan tol,”
Here you broaden the idea of “price” beyond money:
- tenaga (effort/energy)
- rupa (appearance/image)
- belanja, yuran, tol (expenses, fees, tolls)
- nilai (value itself)
Everything becomes a kind of currency or cost attached to expression and experience.
“semua menjadi harga kepada rasa.”
In the end, even rasa—feeling, meaning, inner experience—gets priced. Emotion, authenticity, and human connection are no longer free; they’re mediated by cost, effort, and systems of valuation.