The hidden carbon cost of AI: why every prompt has a footprint, and what to do about it

Every time your organisation calls a cloud AI service, it generates carbon emissions. Not metaphorically. Literally. The GPU cluster spins up, the data centre draws power, the cooling system compensates, and your Scope 3 inventory grows by a fraction of a gram. Multiply that by thousands of prompts a month across a growing number of AI users and workflows in your organisation, and you have a reporting problem that most companies have not even started to quantify.

This matters now because the regulatory direction is clear: the GHG Protocol is actively developing a Software Carbon Intensity specification, and ESG auditors are beginning to ask questions that procurement teams cannot yet answer.

The scale of the problem

Not all AI inference is equal. The energy consumed per prompt varies by orders of magnitude depending on the model, the hardware, the data centre, and the task. A peer-reviewed systematic review published in Renewable and Sustainable Energy Reviews (Ji & Jiang, 2026) found that standard industry benchmarks consistently underestimate real-world inference energy consumption because they exclude cooling overhead, idle GPU power between requests, multi-GPU synchronisation, and data centre power usage effectiveness (PUE).

Independent analysis by Clune (2025), working from first-principles throughput data on H100 clusters, calculated that realistic long-prompt inference consumes between 10 and 72 joules per token, compared to the commonly cited figure of around 0.4 joules. That is 25 to 180 times more energy than the optimistic estimates most providers reference.
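The scale of that range is easier to grasp per response than per token. A minimal sketch, using the per-token figures quoted above and an illustrative 1,000-token response (the response length is our assumption, not Clune's):

```python
# Convert per-token energy estimates into watt-hours for one response.
# Per-token figures are the ranges quoted above; 1,000 tokens is illustrative.
JOULES_PER_WH = 3600

def response_wh(joules_per_token: float, tokens: int = 1000) -> float:
    """Energy in watt-hours for a single response of `tokens` output tokens."""
    return joules_per_token * tokens / JOULES_PER_WH

optimistic = response_wh(0.4)      # ~0.11 Wh, the commonly cited figure
realistic_low = response_wh(10)    # ~2.8 Wh
realistic_high = response_wh(72)   # 20 Wh

print(f"optimistic: {optimistic:.2f} Wh")
print(f"realistic:  {realistic_low:.1f} to {realistic_high:.1f} Wh")
print(f"ratio:      {10 / 0.4:.0f}x to {72 / 0.4:.0f}x")  # 25x to 180x
```

The ratio falls straight out of the division: 10 / 0.4 = 25 and 72 / 0.4 = 180.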

The gap widens further with the latest generation of reasoning models. GPT 5.4, Claude Opus 4.6, and similar frontier models generate hidden reasoning tokens that are billed as output but never appear in the response. These tokens consume the same energy as visible output. For complex analytical tasks, the model may burn two to four times the visible output in internal thinking, and you pay for every watt.
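The billing effect of hidden reasoning is simple to model. A rough sketch, where the two-to-four-times multiplier is the range quoted above and the function name is ours, purely for illustration:

```python
def billed_output_tokens(visible_tokens: int, reasoning_multiplier: float) -> int:
    """Estimate total billed output tokens: the visible response plus hidden
    reasoning tokens, which consume energy (and cost) like visible output."""
    return int(visible_tokens * (1 + reasoning_multiplier))

# A 500-token visible answer with 2x to 4x hidden reasoning overhead:
low = billed_output_tokens(500, 2.0)   # 1,500 tokens billed
high = billed_output_tokens(500, 4.0)  # 2,500 tokens billed
```

In other words, the energy (and invoice) for a complex analytical task can be three to five times what the visible answer alone would suggest.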

What this means per prompt

We built a five-stage AI support orchestration platform for a manufacturing client. Each incoming support email is parsed, matched against 1,800 historical tickets via vector search, triaged, and answered with a draft response. Seven LLM calls and six embedding operations per ticket.

On our self-hosted infrastructure, a single consumer-grade GPU processes each ticket in under 45 seconds, consuming approximately 8,500 tokens and 3 watt-hours of energy. That is roughly equivalent to running a 60-watt lightbulb for three minutes.

The same workload routed through a cloud-hosted frontier model would consume an estimated 50,000 to 85,000 tokens per ticket (including reasoning overhead) and between 600 and 1,150 watt-hours of energy, once you account for multi-GPU cluster draw, PUE, cooling, network redundancy, and standby capacity.

That is a 200- to 370-fold difference in energy consumption per ticket. Not a rounding error. An architectural decision worth two orders of magnitude.
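The gap can be reproduced with back-of-envelope arithmetic. In the sketch below, the token counts are the figures quoted above; the per-token energy values and the 1.4 facility overhead are illustrative mid-range assumptions of ours, not measurements:

```python
# Per-ticket energy comparison under stated assumptions.
JOULES_PER_WH = 3600

def ticket_energy_wh(tokens: int, joules_per_token: float,
                     overhead: float = 1.0) -> float:
    """Energy per ticket in Wh; `overhead` folds in PUE, cooling, standby."""
    return tokens * joules_per_token * overhead / JOULES_PER_WH

# Self-hosted: ~8,500 tokens on one consumer GPU, no facility overhead.
self_hosted = ticket_energy_wh(8_500, joules_per_token=1.3)  # ~3 Wh

# Cloud frontier model: 50k-85k tokens including hidden reasoning,
# cluster-level per-token energy plus facility overhead (PUE, cooling).
cloud_low = ticket_energy_wh(50_000, joules_per_token=31, overhead=1.4)
cloud_high = ticket_energy_wh(85_000, joules_per_token=33, overhead=1.4)

print(f"self-hosted: {self_hosted:.1f} Wh")
print(f"cloud:       {cloud_low:.0f} to {cloud_high:.0f} Wh")
print(f"ratio:       {cloud_low / self_hosted:.0f}x "
      f"to {cloud_high / self_hosted:.0f}x")
```

Vary the assumed per-token energy within Clune's 10 to 72 joule range and the ratio moves, but it stays in the hundreds: the token count and the facility overhead dominate.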

Translating energy into carbon

Energy becomes carbon through the carbon intensity of the electricity grid powering the hardware. Here is what a single support ticket looks like across three scenarios.

Cloud-hosted frontier model in a US data centre (carbon intensity approximately 0.309 kgCO₂eq/kWh; the 600 to 1,150 watt-hours per ticket above already include PUE of 1.3 to 1.5): an estimated 185 to 355 grams of CO₂eq per ticket. At 500 tickets per month across a group of subsidiaries, that is roughly 1.1 to 2.1 tonnes of CO₂eq per year from AI inference alone.

Cloud-hosted frontier model on a high-carbon grid such as Beijing or Shanghai (carbon intensity 0.645 to 0.690 kgCO₂eq/kWh): the same ticket generates roughly 390 to 790 grams of CO₂eq. Most cloud providers do not let you choose which data centre processes your request, so you cannot control or verify this number.

Self-hosted on-premise in the UK (carbon intensity approximately 0.20 kgCO₂eq/kWh, single GPU, no PUE overhead): approximately 0.6 grams of CO₂eq per ticket. At 500 tickets per month, that is 3.6 kilograms of CO₂eq per year. A reduction of more than 99 per cent versus the cloud-hosted alternatives.
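The conversion itself is a single multiplication. A minimal sketch, shown with the self-hosted figures from above (3 Wh per ticket on a UK tariff of roughly 0.20 kgCO₂eq/kWh):

```python
def grams_co2e(energy_wh: float, grid_kg_per_kwh: float) -> float:
    """Convert energy per request (Wh) into grams of CO2-equivalent,
    given the grid's carbon intensity in kgCO2eq per kWh."""
    return (energy_wh / 1000) * grid_kg_per_kwh * 1000

per_ticket = grams_co2e(3, 0.20)           # 0.6 g per ticket, self-hosted UK
annual_kg = per_ticket * 500 * 12 / 1000   # 3.6 kg/year at 500 tickets/month

print(f"{per_ticket:.1f} g per ticket, {annual_kg:.1f} kg per year")
```

Swap in your own per-request energy and your grid or tariff intensity to reproduce any of the scenarios above.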

Where AI sits in your carbon reporting

Under the GHG Protocol, cloud AI API usage falls within Scope 3, Category 1: Purchased Goods and Services. These are indirect emissions generated by your suppliers on your behalf. You cannot measure them directly, you cannot control the hardware or the energy source, and until very recently, the major cloud providers did not even expose this data to customers.

AWS only began making Scope 3 emissions data freely available to customers in 2024, years behind Google and Microsoft, and even then the granularity is insufficient to attribute emissions to individual AI workloads. You get an aggregated number for your entire cloud usage, not a per-prompt or per-service breakdown. That makes it effectively impossible to report the carbon footprint of your AI operations at the activity level.

The GHG Protocol recognised this gap in its March 2024 Scope 3 proposals, explicitly calling for a Software Carbon Intensity specification to support accounting for emissions from purchased cloud services and other software services throughout the value chain. The proposal notes that emissions attributable to software services may currently be excluded by many companies due to lack of data access, despite potential materiality.

In plain language: the reporting framework is catching up with the reality that AI is now a material source of Scope 3 emissions, and companies that cannot quantify it will face increasing scrutiny.

The self-hosted alternative

When the AI infrastructure sits on your premises, the carbon accounting changes fundamentally.

The emissions move from Scope 3 (indirect, unverifiable, supplier-dependent) to Scope 2 (purchased electricity, directly measurable, under your control). You know exactly which GPU processed the ticket, how many watts it drew, how long it ran, and what carbon intensity applies to your electricity tariff. You can verify it, audit it, and report it at any level of granularity.
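Metering this is straightforward in practice. A minimal sketch of per-job energy accounting, assuming you sample the GPU's power draw at a fixed interval (for example via `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`); the sampling values and function name here are illustrative, not a specific product's API:

```python
def job_energy_wh(power_samples_w: list[float], interval_s: float) -> float:
    """Integrate sampled GPU power draw (watts) over a job's runtime
    into watt-hours, using a simple fixed-interval sum."""
    return sum(power_samples_w) * interval_s / 3600

# A 45-second ticket sampled every 5 seconds at ~240 W average draw:
samples = [238.0, 241.5, 243.0, 239.8, 240.2, 242.1, 238.9, 241.0, 240.5]
wh = job_energy_wh(samples, interval_s=5)  # ~3 Wh for this ticket
kg_co2e = (wh / 1000) * 0.20               # apply your tariff's carbon intensity
```

That per-job figure, multiplied out across tickets and months, is exactly the auditable Scope 2 number the cloud providers cannot give you.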

You also control the energy source. If your site runs on a renewable tariff, your AI inference is as green as your electricity contract. If you are on a standard grid supply, you at least know the real number and can make informed decisions about offsetting or switching.

This is not a marginal improvement. It is a structural shift in reportability. The difference between an opaque line item buried in a supplier’s aggregated Scope 3 disclosure and a verifiable, auditable figure in your own Scope 2 inventory.

Three architectural decisions that drive the gap

The 200x-plus energy difference is not magic. It comes from three deliberate design choices.

First, right-sizing the model to the task. A 20-billion-parameter open-source model running locally achieves 89 per cent or better accuracy on support triage. Using a trillion-parameter frontier model for the same task is like hiring a Formula 1 team to deliver parcels. The extra capability exists, but the task does not need it, and the energy cost is enormous.

Second, eliminating multi-GPU synchronisation. Cloud inference on frontier models typically runs across four to eight GPUs, each drawing 700 watts, with significant overhead from inter-node communication. A single consumer GPU at 250 watts, with no synchronisation latency, processes the same ticket with a fraction of the energy.

Third, removing idle capacity. Cloud infrastructure maintains always-on GPU clusters to serve unpredictable demand. Your self-hosted system draws power only when processing. Between tickets, the GPU idles at near-zero draw. No standby power, no redundancy overhead, no cooling for capacity you are not using.

Why this matters now

The EU AI Act emphasises transparency and energy efficiency for AI-serving data centres. The UK’s Streamlined Energy and Carbon Reporting (SECR) framework already requires qualifying companies to report energy use and carbon emissions. The Science Based Targets initiative (SBTi) is tightening expectations around Scope 3 disclosure.

As AI usage scales across organisations, the carbon footprint of inference will move from an unreported externality to a mandatory disclosure line item. Companies that deploy AI on self-hosted infrastructure today are reducing their per-prompt carbon footprint by two orders of magnitude while positioning themselves ahead of the regulatory curve, with auditable data that satisfies both internal ESG commitments and external reporting requirements.

The question is not whether your AI has a carbon footprint. It does. The question is whether you can measure it, control it, report it, and eventually reduce it. Self-hosted AI gives you all four.


References

Ji, Z. & Jiang, M. (2026). A systematic review of electricity demand for large language models: evaluations, challenges, and solutions. Renewable and Sustainable Energy Reviews, 225, 116159.

Clune, A. (2025). Another look at per token energy costs. Available at: https://clune.org/posts/per-token-energy-costs-again/

GHG Protocol (2024). Summary of Scope 3 Proposals. Available at: https://ghgprotocol.org


Ascentis AI builds self-hosted AI orchestration platforms for manufacturing and scientific equipment companies. Our solutions run entirely on your infrastructure, with zero cloud LLM dependency, full data sovereignty, and a verifiable energy footprint. Get in touch to discuss how this applies to your operations.