DrMKZ

Hati hidup


Evolution of compute cost to tokens


22.4.2026

This was done using Big Pickle in Opencode.

Between March 2023 and April 2026, the cost of accessing AI language models has dropped by approximately 99.7%. What started as a premium service charging $30-60 per million tokens has transformed into a commodity market where quality AI access costs less than a penny per million tokens. This article traces the evolution, examines the drivers, and explores what comes next.

In March 2023, OpenAI launched GPT-4 at $30 per million input tokens and $60 per million output tokens. This was the most capable AI model ever released, and the price reflected that prestige. For the first time, developers could access GPT-4 class intelligence through an API—but at a steep cost.

A typical 1,000-token request (roughly 750 words) cost approximately:

  • Input: $0.03
  • Output: $0.06

For startups building AI applications, this meant every user conversation could cost 3-6 cents in API fees alone. Many companies found that scaling AI features would quickly become prohibitively expensive.
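In code, the arithmetic above is trivial but worth making explicit. A minimal sketch using the GPT-4 launch prices just quoted (`request_cost` is an illustrative helper, not an SDK call):

```python
# GPT-4 launch pricing (March 2023), dollars per million tokens.
INPUT_PRICE_PER_M = 30.0
OUTPUT_PRICE_PER_M = 60.0

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API request (illustrative helper)."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 1,000-token request (roughly 750 words):
print(request_cost(1_000, 0))  # input side:  0.03
print(request_cost(0, 1_000))  # output side: 0.06
```

At 2026 commodity rates (say $0.075 per million input tokens), the same 1,000-token input costs $0.000075, which is why per-request cost effectively stopped mattering.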

First Wave: Competition Enters the Market (2023-2024)

June 2023 – The First Cuts

OpenAI reduced GPT-3.5 pricing, but the market was about to get more competitive.

July 2023 – Claude 2 Enters

Anthropic released Claude 2 at $8 per million input tokens and $24 per million output tokens—significantly cheaper than GPT-4 while offering competitive capabilities.

December 2023 – Google Crashes the Party

Google launched Gemini 1.0 Pro at an astonishing $0.50 per million input tokens—60x cheaper than GPT-4. This marked the beginning of the price war.

March 2024 – The Budget Tier Emerges

Claude 3 Haiku launched at $0.25 per million input tokens, bringing entry-level AI costs down to fractions of a cent.

Second Wave: The Efficiency Revolution (2024-2025)

May 2024 – OpenAI Responds

GPT-4o launched at $5 per million input tokens—an 83% reduction from GPT-4. Gemini 1.5 Flash launched at $0.35 per million input tokens.

July 2024 – Mini Models Take Over

GPT-4o mini debuted at just $0.15 per million input tokens—200x cheaper than GPT-4 from 16 months earlier. This established the “mini” model tier as the new budget option.

August 2024 – Flash Drives to Near-Zero

Gemini 1.5 Flash reduced to $0.075 per million input tokens—cheaper than any competitor and effectively free for most use cases.

December 2024 – DeepSeek Disrupts

DeepSeek V3 launched at $0.14 per million input tokens, demonstrating that open-source models could match proprietary quality at a fraction of the cost.

Third Wave: The Commoditization (2025-2026)

Early 2025 – Flash Wars

Gemini 2.0 Flash launched at $0.10 per million input tokens. OpenAI responded with GPT-4.1 nano at $0.10 per million input tokens.

March 2025 – Frontier Gets Cheaper

GPT-4o reduced to $2.50 per million input tokens (from $5). Claude 3.5 Sonnet reduced to $3.00 per million input tokens (from $15).

Early 2026 – Current State

  • GPT-5.4: $2.50/$10 per million tokens
  • Claude Opus 4.6: $5.00/$25 per million tokens
  • Gemini 2.5 Pro: $1.25/$10 per million tokens
  • Gemini 2.0 Flash-Lite: $0.075/$0.30 per million tokens
  • DeepSeek V3.2: $0.28/$0.42 per million tokens

The Numbers: A Timeline View

| Period | Model | Input $/1M | Output $/1M | Reduction |
|---|---|---|---|---|
| Mar 2023 | GPT-4 | $30.00 | $60.00 | Baseline |
| Jul 2023 | Claude 2 | $8.00 | $24.00 | 73% |
| Dec 2023 | Gemini 1.0 Pro | $0.50 | $1.50 | 98% |
| Mar 2024 | Claude 3 Haiku | $0.25 | $1.25 | 99% |
| Jul 2024 | GPT-4o mini | $0.15 | $0.60 | 99.5% |
| Aug 2024 | Gemini 1.5 Flash | $0.075 | $0.30 | 99.75% |
| 2026 | Gemini 2.0 Flash-Lite | $0.075 | $0.30 | 99.75% |

Total reduction from peak: ~99.7%

Why Prices Fell: Three Major Forces

1. Intense Competition

When GPT-4 launched in 2023, OpenAI had almost no competition. By 2026, more than 15 major providers compete for market share, and that competition has driven prices down relentlessly.

2. Hardware Efficiency

NVIDIA’s GPU improvements, custom AI accelerators from Google (TPU), and specialized inference chips from multiple vendors increased throughput 10-100x while reducing per-token compute costs.

3. Model Architecture Advances

  • Mixture of Experts (MoE): Only activates relevant “experts” per token, dramatically reducing compute
  • Knowledge distillation: Smaller models that perform nearly as well as larger ones
  • Quantization: More efficient numeric representations reduce memory and compute needs
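Quantization is the easiest of these to see concretely. A toy sketch of the memory saving from storing weights as 8-bit integers instead of 32-bit floats (the numbers are made up; production schemes use per-channel scales and calibration, which this deliberately omits):

```python
import struct

# A toy "weight matrix": 1,000 parameters (values are arbitrary).
weights = [0.001 * i for i in range(1000)]

# float32 storage: 4 bytes per weight.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))

# Naive symmetric int8 quantization: scale into [-127, 127], 1 byte each.
scale = max(abs(w) for w in weights) / 127
quantized = bytes(round(w / scale) & 0xFF for w in weights)
int8_bytes = len(quantized)

print(fp32_bytes, int8_bytes)  # 4000 1000 -> a 4x memory reduction
```

The 4x storage saving, and the matching drop in memory bandwidth per token, is a large part of what translates into cheaper inference.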

The Paradox: Why Your AI Bill Still Grew

Here’s the surprising twist: despite per-token prices dropping 99.7%, many organizations report higher total AI spending in 2025-2026.

Why? Three factors:

  1. Usage Explosion: When prices fall, usage grows exponentially. Companies build AI into more products.
  2. Agentic Workloads: AI agents use 10-50x more tokens than simple chat. A single agentic task might involve:
    • System prompts: 1,000-5,000 tokens
    • Tool definitions: 500-3,000 tokens
    • Retrieved context: 1,000-20,000 tokens
    • Multiple LLM calls: 5-20+ per task
  3. Hidden Costs: Tool calls, vector storage, orchestration, monitoring—these add 2-3x to apparent token costs.
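Plugging mid-range values from the ranges above into a quick estimate (all figures illustrative) shows how fast agentic token counts climb:

```python
# Mid-range values from the ranges above (all figures illustrative).
per_call_tokens = {
    "system_prompt": 3_000,       # 1,000-5,000
    "tool_definitions": 1_750,    # 500-3,000
    "retrieved_context": 10_000,  # 1,000-20,000
}
calls_per_task = 12               # 5-20+ LLM calls per task

tokens_per_task = sum(per_call_tokens.values()) * calls_per_task
print(tokens_per_task)  # 177000 -- vs ~2,000 for a simple chat turn

# Even at a budget rate of $0.15 per million input tokens:
cost_per_task = tokens_per_task * 0.15 / 1_000_000
print(cost_per_task)  # roughly $0.027 per task; at scale it adds up fast
```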

What This Means for Developers

The Economics Have Transformed

| Task | 2023 Cost | 2026 Cost |
|---|---|---|
| 1M tokens processed | $30-60 | $0.075-3.00 |
| 1 hour @ 100 tok/sec | $10.80-21.60 | $0.027-1.08 |
| 10,000 daily users (100 msg/day) | $30,000-60,000/mo | $75-3,000/mo |

Best Practices for 2026

  1. Start with cheap models: Gemini Flash or GPT-4.1 nano are “good enough” for 80% of tasks
  2. Use routing: Route simple queries to cheap models, complex ones to premium
  3. Optimize prompts: Smaller prompts = direct cost savings
  4. Cache aggressively: Most LLM calls have significant redundancy
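Routing (point 2) doesn't need to be sophisticated to pay off. A deliberately crude sketch: the model names echo this article, and the length/keyword heuristic is purely illustrative (production routers usually use a small classifier model):

```python
# Crude cost-aware router. Model names follow this article; the
# length/keyword heuristic is illustrative only.
CHEAP, PREMIUM = "gemini-flash", "gpt-5.4"
COMPLEX_MARKERS = ("prove", "debug", "architecture", "step by step")

def route(query: str) -> str:
    """Pick a model tier for a query (toy heuristic)."""
    q = query.lower()
    if len(q) > 500 or any(marker in q for marker in COMPLEX_MARKERS):
        return PREMIUM
    return CHEAP

print(route("What is the capital of France?"))           # gemini-flash
print(route("Debug this race condition step by step."))  # gpt-5.4
```

If 80% of traffic lands on the cheap tier, the blended per-token price sits close to the cheap model's price rather than the premium one's.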

The Future: What’s Next?

Price Predictions (2026-2028)

Industry analysts project another 30-50% reduction in token prices over the next 2-3 years, driven by:

  • Next-generation hardware: Blackwell GPUs, custom silicon
  • Agentic optimization: Models designed for agents, not just chat
  • Open-source pressure: DeepSeek, Llama, and others pushing prices down

Emerging Trends

  1. Per-task pricing: Pay based on actual work done, not just tokens
  2. Hybrid inference: Combining cloud and local for cost-efficiency
  3. Specialized models: Cheap models for specific domains (coding, math, etc.)

Key Takeaways

  1. 99.7% price reduction: From $30/M to $0.075/M input tokens in 3 years
  2. Competition drove prices: Multiple major providers fighting for market share
  3. Hardware + architecture: MoE, quantization, and specialized chips enabled efficiency
  4. Usage grew despite lower prices: Total AI spending increased 3x from 2024-2025
  5. The frontier holds: Best models still cost premium—price decline is at “commodity” tier

Conclusion

The transformation from compute costs to token costs represents one of the fastest price declines in technology history. What was once a premium service accessible only to well-funded startups is now a utility available to anyone with a credit card—and in many cases, free.

For developers and businesses, this means AI is no longer a cost center to optimize around—it’s a building block to design with. The question is no longer “can we afford AI?” but rather “how quickly can we integrate it?”

The era of expensive AI is over. The era of abundant, cheap, capable AI is here.

…..

What about tokens specifically in the medical world?

Well, I am in that world.

Medical-Specific AI Models

Google MedLM (Med-PaLM 2)

Background:

  • Developed by Google Research
  • First medical AI to reach passing scores on USMLE-style exam questions (Med-PaLM: 67.6% in 2022; Med-PaLM 2: 86.5% in 2023)
  • Powers MedLM family on Google Cloud Vertex AI
  • Based on PaLM architecture, fine-tuned for medical domain

Availability:

  • MedLM-medium: Faster throughput, more recent data
  • MedLM-large: More capable, complex tasks
  • Access: restricted to allowlisted customers (US, Brazil, Singapore)
  • Requires Google Cloud account team contact

Important Limitation:

  • NOT designed for clinical diagnosis
  • Output must be verified by healthcare professionals
  • Non-clinical use only under Service Specific Terms
  • HIPAA-compliant but requires proper BAA

Pricing:

  • Uses Google Vertex AI pricing structure
  • Pay-per-character basis
  • Custom pricing through enterprise contracts
  • No public consumer pricing

Amazon HealthScribe

Background:

  • AWS service for clinical documentation
  • AI-powered medical scribe
  • Converts patient-provider conversations to clinical notes
  • HIPAA-eligible

Pricing (Pay-as-you-go):

  • $0.001667 per audio second
  • ~$1.50 for a 15-minute consultation
  • No retention of audio or output text
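The ~$1.50 figure follows directly from the per-second rate. A quick check in Python (the helper function is just for illustration):

```python
# HealthScribe-style per-second audio pricing (rate from above).
RATE_PER_AUDIO_SECOND = 0.001667  # dollars

def consultation_cost(minutes: float) -> float:
    """Cost of one recorded consultation (illustrative helper)."""
    return minutes * 60 * RATE_PER_AUDIO_SECOND

print(round(consultation_cost(15), 2))  # 1.5 -> the ~$1.50 quoted above
```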

Features:

  • Real-time transcription
  • Medical terminology recognition
  • Integration with Epic EHR
  • Privacy-focused (data not used for training)

Microsoft Dragon Copilot

Background:

  • Built into Microsoft Cloud for Healthcare
  • Ambient clinical documentation
  • Voice-enabled AI assistant

Pricing:

  • $600-800 per month per provider
  • 1-3 year contracts
  • Integrated with Microsoft 365

Features:

  • Real-time documentation
  • Clinical decision support
  • Integration with existing EHR systems

Healthcare Natural Language API (Google)

Purpose: Extract healthcare entities from medical text

Pricing (Text Record Volume):

  • Based on text record monthly volume
  • Character-based tiering
  • First 25,000 records free
  • Enterprise pricing for larger volumes

Cloud Healthcare API (Google)

Purpose: FHIR, HL7, DICOM interoperability

Pricing:

  • Storage: $0.004/GB/month (structured)
  • Requests:
    • First 25,000/month: free
    • 25,001-1M: $0.39 per 100,000
    • 1M+: $0.29 per 100,000
  • HIPAA-eligible
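Tiered pricing like this is easy to miscalculate by applying a single tier's rate to the whole volume; each rate applies only to the requests inside that tier. A sketch using the rates listed above (the helper function is hypothetical):

```python
def monthly_request_cost(requests: int) -> float:
    """Marginal-tier cost under the schedule above:
    first 25,000 free, then $0.39/100k up to 1M, $0.29/100k beyond."""
    billable = max(0, requests - 25_000)       # everything past the free tier
    tier1 = min(billable, 1_000_000 - 25_000)  # portion below the 1M mark
    tier2 = max(0, requests - 1_000_000)       # portion above 1M
    return tier1 / 100_000 * 0.39 + tier2 / 100_000 * 0.29

print(monthly_request_cost(20_000))     # 0.0 -- inside the free tier
print(monthly_request_cost(1_000_000))  # 975,000 billable at $0.39/100k
print(monthly_request_cost(2_000_000))  # plus 1M more at $0.29/100k
```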

Comparison: Healthcare AI vs. General AI

| Aspect | General AI | Healthcare AI |
|---|---|---|
| Price | $0.075-30/M tokens | Custom/Enterprise |
| HIPAA | Not required | Required |
| FDA/Reg | None | May apply |
| Training Data | General web | Medical literature |
| Accuracy | Good | Medical-grade |
| Liability | User's risk | Significant |

Cost Considerations for Healthcare Organizations

Direct Costs

API/Usage Fees:

  • Per-character or per-token pricing
  • Volume-based enterprise discounts
  • Monthly minimums common

Integration:

  • EHR integration: $10,000-100,000+
  • Custom development: $50,000-500,000+
  • Maintenance: $10,000-50,000/year

Hidden Costs

Compliance:

  • HIPAA Business Associate Agreement
  • Security audits
  • Privacy impact assessments

Training:

  • Staff training on AI tools
  • Workflow re-engineering
  • Ongoing quality monitoring

Verification:

  • Clinician review time
  • Error tracking and correction
  • Liability insurance

Medical AI Pricing Examples

Scenario 1: Small Practice (2 Providers)

| Service | Monthly Cost |
|---|---|
| Dragon Copilot | $1,200-1,600 |
| EHR integration | $200-500 |
| Training | $500 |
| Total | $1,900-2,600/month |

Scenario 2: Hospital System (50 Providers)

| Service | Monthly Cost |
|---|---|
| MedLM (API) | $5,000-15,000 |
| HealthScribe | $3,000-5,000 |
| Infrastructure | $10,000-20,000 |
| Integration | $5,000-10,000 |
| Training | $2,000-5,000 |
| Total | $25,000-55,000/month |

Medical AI vs. General AI: Key Differences

Why Healthcare AI Costs More

  1. Specialized Training
    • Medical literature, clinical notes
    • USMLE, medical board exams
    • Domain-specific fine-tuning
  2. Regulatory Compliance
    • HIPAA required
    • FDA device classification may apply
    • Audit trails and logging
  3. Liability
    • Medical malpractice exposure
    • Requires human oversight
    • Error documentation
  4. Integration Complexity
    • EHR compatibility
    • Clinical workflows
    • Data standards (FHIR, HL7)

Leading Healthcare AI Providers

By Category

| Category | Provider | Product | Pricing |
|---|---|---|---|
| Foundation Models | Google | MedLM | Enterprise |
| Clinical Docs | Microsoft | Dragon Copilot | $600-800/user/mo |
| Clinical Docs | Amazon | HealthScribe | $0.001667/audio-sec |
| Transcription | Nuance | DAX | Enterprise |
| CDSS | IBM | Watson Health | Enterprise |
| Imaging | Google | Med-PaLM | Enterprise |

The Future of Healthcare AI Pricing

Trends to Watch (2026-2028)

  1. Commoditization: General AI models may meet medical accuracy standards at lower costs
  2. Specialization: More domain-specific medical models at competitive prices
  3. Regulation: FDA guidance on LLM medical devices may increase compliance costs but add clarity
  4. Competition: More players entering healthcare AI could drive prices down 30-50%

Predictions

  • Healthcare AI to follow general AI pricing trajectory (60-80% reduction over 3 years)
  • More SaaS pricing models replacing enterprise contracts
  • Per-visit or per-patient pricing models emerging
  • Insurance reimbursement for AI-assisted care

Key Takeaways for Healthcare Organizations

  1. Don’t use consumer AI for clinical decisions: General AI models lack medical training and liability coverage
  2. Budget for more than API costs: Integration, training, compliance, and verification add 2-5x to direct costs
  3. Look for HIPAA-eligible services: Ensure proper BAA agreements with vendors
  4. Plan for human oversight: All current healthcare AI requires clinician verification
  5. Start with non-clinical uses: Documentation, scheduling, billing AI have lower risk
  6. Enterprise pricing is negotiable: Contact vendors directly, especially for larger deployments

Conclusion

Healthcare AI is following a similar trajectory to general AI but with unique challenges. While prices haven’t collapsed as dramatically as consumer AI, the trend is downward. Organizations should:

  • Evaluate specific use cases
  • Start with low-risk applications
  • Plan for human-in-the-loop workflows
  • Budget comprehensively beyond API costs

The medical AI market will mature rapidly through 2026-2028, with prices likely following the broader market’s 60-80% reduction pattern as competition intensifies and regulatory clarity improves.

Why use tokens? – this answer was generated using DeepSeek.

Great question! Using tokens instead of raw characters or whole words is a deliberate design choice in LLMs. Here’s why tokens are used:

1. Efficiency (Balancing vocabulary size vs. sequence length)

If we used characters:

  • Each character = 1 unit
  • “Hello” = 5 units (short sequence, good)
  • But vocabulary is tiny (only 26 letters + punctuation + digits ≈ 50-100 characters)
  • Problem: The model would need very long sequences to understand meaning → slow and hard to learn long-range patterns

If we used whole words:

  • Each word = 1 unit
  • “Hello” = 1 unit (short sequence, good)
  • But vocabulary would be huge (English has 500,000+ words) → massive memory and computation
  • Problem: Can’t handle misspellings, new words, or rare words (“out-of-vocabulary” problem)

Tokens are the sweet spot:

  • Subword tokens (like “play” + “ing”) keep vocabulary manageable (typically 50,000-100,000 tokens)
  • Sequence length is reasonable (not too short, not too long)
  • Can handle novel words by breaking them into known subwords

2. Handling unknown words gracefully

With word-based models, “misunderestimated” (not a real word) = unknown → error.
With tokens: ["mis", "under", "est", "imated"] → each part is recognized → model can guess meaning from parts.
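A toy WordPiece-style greedy longest-match tokenizer shows the idea. The vocabulary here is hand-picked so the example works; real tokenizers (BPE, SentencePiece) learn their vocabularies from a corpus:

```python
# Toy longest-match subword tokenizer. The vocabulary is hand-picked
# for this example; real vocabularies are learned from data.
VOCAB = {"mis", "under", "est", "imated", "play", "ing"}

def tokenize(word: str) -> list:
    """Greedy longest-prefix match against VOCAB (illustrative only)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:                              # no match: fall back to one char
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("misunderestimated"))  # ['mis', 'under', 'est', 'imated']
print(tokenize("playing"))            # ['play', 'ing']
```

Real tokenizers also handle spaces, capitalization, and arbitrary bytes outside the vocabulary, but the longest-match flavor is the same.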

3. Computational constraints

LLMs use self-attention (Transformer architecture), which has O(n²) complexity — doubling the sequence length quadruples the computation.

  • Characters: 500 characters = 500² = 250,000 operations
  • Tokens: 100 tokens = 100² = 10,000 operations (25x faster!)

4. Learned representations

Tokens aren’t just arbitrary — the tokenizer is trained to find meaningful chunks:

  • Common words (the, and, of) → single tokens
  • Common affixes (un-, -ing, -ed) → separate tokens
  • Spaces and punctuation → separate tokens

This helps the model learn patterns like “un-” often means “opposite of” across many words.

Analogy: Shipping packages

  • Characters = shipping each grain of rice individually (too many packages)
  • Words = shipping each rice cooker as one package (too many different box sizes)
  • Tokens = using standard box sizes that fit 1-3 grains of rice each (efficient!)

Real-world impact of tokenization

Different tokenizers use different vocabulary sizes, which affects how efficiently text is encoded:

  • GPT-4o (OpenAI): ~100K-token vocabulary
  • Llama 3 (Meta): ~128K-token vocabulary
  • Gemini (Google): ~256K-token vocabulary

Some languages (e.g., Chinese, Japanese) require more tokens per meaning because characters don’t map neatly to subwords.

Bila kata menjadi harga, belum tentu rasional menjadi nyata.
lebih banyak kata lebih banyak harga.
tenaga, harga, rupa, belanja, yuran, nilai dan tol,
semua menjadi harga kepada rasa.

..

Interpreted by ChatGPT:

“Bila kata menjadi harga, belum tentu rasional menjadi nyata.”
When words are priced or treated as commodities, rationality doesn’t automatically follow. Just because something is measured, billed, or quantified doesn’t mean it becomes more truthful or logical. It hints at a disconnect between valuation and understanding.

“lebih banyak kata lebih banyak harga.”
This points to a simple but powerful dynamic: verbosity can inflate cost. It could be literal (paying per word, per token) or symbolic—longer explanations, more effort, more “weight.” But it also questions whether more actually means better.

“tenaga, harga, rupa, belanja, yuran, nilai dan tol,”
Here you broaden the idea of “price” beyond money:

  • tenaga (effort/energy)
  • rupa (appearance/image)
  • belanja, yuran, tol (expenses, fees, tolls)
  • nilai (value itself)

Everything becomes a kind of currency or cost attached to expression and experience.

“semua menjadi harga kepada rasa.”
In the end, even rasa—feeling, meaning, inner experience—gets priced. Emotion, authenticity, and human connection are no longer free; they’re mediated by cost, effort, and systems of valuation.

Written by Khairul

April 22, 2026 at 8:43 pm