Measure footprint of open LLMs

Benchmark Apertus, compare with other models, and find strategies for efficient prompting.


The challenge is to measure and improve the environmental impact of the Swiss Large Language Model (Apertus, from the Swiss AI Initiative) and compare it with other models. Participants should design methods to quantify energy consumption, carbon emissions, and resource use during inference. In addition to transparent measurement frameworks or dashboards, solutions should propose concrete prompting strategies for impact reduction. The goal is to enable Switzerland to lead in sustainable AI by combining rigorous evaluation with actionable improvements.

Header photo: CSCS - Swiss National Supercomputing Centre

Purpose

The project aims to measure and improve the environmental impact of Large Language Models. It will create transparent benchmarks and metrics, as well as practical strategies, to quantify and reduce energy use, carbon emissions, and resource consumption during inference. The goal is to enable sustainable AI that aligns with Switzerland’s leadership in responsible technology.

Inputs

A few initiatives (e.g. AI EnergyScore, MLCO2) have proposed frameworks for tracking carbon and energy usage, but these are not yet widely adopted or standardized. Building on one of these, we aim to:

  • Develop measurement methods for the Swiss LLM’s energy and carbon footprint across its lifecycle.
  • Explore prompting strategies for reducing impact.
  • Compare energy consumption and prompting strategies with other LLMs.
  • Document best practices and propose guidelines for sustainable Swiss AI.

Access to open source LLMs and underlying infrastructure is required to log compute usage, energy consumption, and hardware efficiency. We will provide a test machine (12 GB VRAM, 32 GB RAM) on location, and remote access to a Mac Studio (192 GB unified memory) for measurements.

Some technical information on the new Swiss LLM can be found here:

https://swissai.dribdat.cc/project/40

The following resources may be of use:

  • Papers of note
  • Comparison chart by Sourabh Mehta (ADaSci, 2024)

As outputs, the project will deliver measurements, guidelines for measurement, and prompting techniques for reducing AI impact. It directly promotes open science, climate consciousness, and responsible AI. The outcomes can catalyze a larger initiative on sustainable foundation models in Switzerland, influencing both public policy and industry adoption.

Most foundation models today are evaluated primarily on accuracy, scale, and downstream performance. Environmental impact is often reported inconsistently or not at all. Large-scale LLMs typically lack region-specific sustainability benchmarks or actionable improvement strategies, so efforts to crowdsource such results will be valuable to the community.

The activities align with the Swiss AI Initiative’s goals of advancing responsible and trustworthy AI. By focusing on Switzerland’s energy mix and regulatory context, the project addresses local sustainability priorities while producing globally relevant methods. It strengthens Switzerland’s role as a leader in sustainable and ethical AI across Europe and beyond.

Compliance

Environmental impact measurement and mitigation align with Swiss and European climate targets, responsible AI guidelines, and open science principles. Data used will be technical (compute and energy metrics), avoiding personal or sensitive information.

Illustration by Nhor CC BY 3.0


Results

The goal of the project was to:

  • measure the energy consumption of inference on Apertus
  • understand the influence of prompt and response on energy consumption
  • Bonus: compare to Llama

Our measuring infrastructure consisted of a Mac Studio with remote access.

We used the powermetrics tool together with Begasoft BrandBot.

Experiments

  • Experiment 1: Prompt/Response length
    • We tested different prompt and response lengths.
    • Example (short prompt, long response): "Explain photosynthesis in detail."
  • Experiment 2: Different subjects
  • Experiment 3: Compare with Llama
  • Experiment 4: Different languages

Learnings

We failed successfully on:

  • getting a stable setup ❌
  • excluding overhead ❌
  • doing proper energy calculations ❌
  • getting the 2nd system to run ❌
  • doing physical measurements ❌
  • changing the measurement method ❌
  • automating everything ❌
  • doing proper statistics ❌

We succeeded successfully on:

  • measuring idle consumption ✅
  • measuring different prompts ✅
  • integrating a new team member ✅
  • learning a lot ✅
  • having fun! ✅

Results

Measuring is hard!

Prompt and response length:

  • long answers dominate the energy use
  • long prompts have some influence

Different types of prompts: basic knowledge, chemistry, geography, and math do not seem to influence energy consumption (other than through the length of the response).

Different languages: apart from the length, the language itself seems to have some influence (Portuguese: +20%).

Different models: inconclusive but similar

Reports

Thanks to all!

Energy Footprint Analysis of Open LLMs

This repository contains the results and methodology from the swissAI Hackathon challenge "Measure footprint of open LLMs".

🔗 Challenge Details: https://swissai.dribdat.cc/project/44

Team Members

Project Overview

This project investigates the energy consumption patterns of the Apertus language model through controlled experiments measuring various aspects of prompting behavior.

Goals

  • Primary: Understand the energy consumption characteristics of Apertus
  • Secondary: Analyze energy impact across different prompting dimensions:
    • Prompt length
    • Response length
    • Language of the prompt
    • Subject matter of the prompt
  • Optional: Compare Apertus performance with Llama 7B

Limitations

⚠️ Important: Apertus is in early development stages. Quality assessments would be premature and unfair at this time.

Infrastructure

Systems Used

  1. Mac Studio (Primary)

    • Remote access capability
    • Physical measurement limitations due to remote setup
    • Used for all experimental measurements
  2. Shuttle with 8GB Nvidia GPU (Available but unused)

    • Not utilized in final experiments

Measurement Validation

Initial testing revealed a 3x-4x discrepancy between powermetrics readings and physical measurements on the Mac Studio. Due to time constraints, this variance was noted but not fully investigated.

Measurement Tools & Scripts

Energy Measurement Script (measure.sh)

Custom bash script that orchestrates energy measurement using powermetrics:

Features:

  • Real-time power monitoring with 100ms sampling intervals
  • CSV output with structured data format
  • Live progress display during measurement
  • Automatic calculation of cumulative energy consumption
  • Statistical analysis (min/max/average power, standard deviation)

Usage:

./measure.sh <duration_in_seconds> [test_name]
./measure.sh 60 "SS_math_test"

Output Format:

sample,elapsed_time,power_mw,cumulative_energy_j,test_name
1,0.10,380.00,0.038000,test_short_short
2,0.20,318.00,0.069800,test_short_short
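
For reference, a minimal Python sketch (not the actual measure.sh implementation) of how the cumulative energy column and the summary statistics could be derived from the recorded power samples, assuming the CSV schema shown above and a fixed 100 ms sample interval:

import csv
import statistics

def summarize(csv_path):
    """Summarize a measurement CSV: total energy and basic power statistics."""
    powers_mw = []
    times_s = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            powers_mw.append(float(row["power_mw"]))
            times_s.append(float(row["elapsed_time"]))

    # Energy per sample: power (W) * sample interval (s); a fixed 0.1 s interval
    # is assumed here, matching the 100 ms sampling configuration described above.
    interval_s = 0.1
    energy_j = sum(p / 1000.0 * interval_s for p in powers_mw)

    return {
        "samples": len(powers_mw),
        "duration_s": times_s[-1] if times_s else 0.0,
        "energy_j": energy_j,
        "avg_power_mw": statistics.mean(powers_mw),
        "min_power_mw": min(powers_mw),
        "max_power_mw": max(powers_mw),
        "stdev_power_mw": statistics.stdev(powers_mw) if len(powers_mw) > 1 else 0.0,
    }

print(summarize("test_short_short.csv"))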

Efficiency Analysis (quick_efficiency.py)

Python script for calculating token-to-energy efficiency metrics:

Key Metrics:

  • Tokens per Joule (efficiency rating)
  • Joules per Token (energy cost)
  • Tokens per Second (processing speed)
  • Energy efficiency rating system (⭐⭐⭐ Excellent > 1000 tokens/J)

Usage:

python3 quick_efficiency.py data.csv "model output text"
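
As an illustration, a small Python sketch of the kind of metrics quick_efficiency.py reports; the whitespace-based token count is only a rough proxy, since the script's actual tokenization is not documented here:

def efficiency_metrics(total_energy_j, duration_s, output_text):
    """Token-to-energy efficiency metrics (tokens approximated by whitespace splitting)."""
    tokens = len(output_text.split())
    return {
        "tokens": tokens,
        "tokens_per_joule": tokens / total_energy_j if total_energy_j else float("nan"),
        "joules_per_token": total_energy_j / tokens if tokens else float("nan"),
        "tokens_per_second": tokens / duration_s if duration_s else float("nan"),
    }

# Example: 300 tokens generated while 45 J were consumed over 20 s
print(efficiency_metrics(45.0, 20.0, "word " * 300))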

Data Visualization (test.py)

Comprehensive analysis and visualization tool:

Capabilities:

  • Multi-file comparison analysis
  • Power consumption profiles over time
  • Statistical analysis with error bars
  • Efficiency comparison charts
  • Automated plot generation

Features:

  • Single test detailed analysis
  • Multi-test comparative analysis
  • Normalized time comparisons
  • Energy consumption trends
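
A minimal sketch of such a multi-file comparison with pandas and matplotlib, assuming each CSV follows the schema shown earlier; file paths and plot details are illustrative rather than the actual test.py implementation:

import glob
import pandas as pd
import matplotlib.pyplot as plt

# Overlay the power-over-time profile of each test on a single axis.
fig, ax = plt.subplots(figsize=(10, 5))
for path in sorted(glob.glob("data/test_short_*.csv")):
    df = pd.read_csv(path)
    label = df["test_name"].iloc[0] if "test_name" in df.columns else path
    ax.plot(df["elapsed_time"], df["power_mw"], label=label, linewidth=0.8)

ax.set_xlabel("Elapsed time (s)")
ax.set_ylabel("Power (mW)")
ax.set_title("Power consumption profiles per test")
ax.legend(fontsize="small")
fig.tight_layout()
fig.savefig("energy_analysis.png", dpi=150)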

Powermetrics Integration

  • Platform: Mac Studio
  • Command: powermetrics --samplers cpu_power -i 100 -n <duration>
  • Sampling Rate: 50ms effective (100ms configured with processing overhead)
  • Metrics: Combined CPU/GPU power consumption in milliwatts
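
For context, a hedged Python sketch of how the powermetrics stream could be consumed programmatically (the actual pipeline uses an AWK script instead). It assumes the "Combined Power" line format printed by recent Apple Silicon versions of powermetrics, which may differ between macOS releases:

import re
import subprocess

# Stream powermetrics output (requires sudo) and extract one power reading per sample.
cmd = ["sudo", "powermetrics", "--samplers", "cpu_power", "-i", "100", "-n", "100"]
pattern = re.compile(r"Combined Power.*?:\s*(\d+)\s*mW")  # assumed line format

samples_mw = []
with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        match = pattern.search(line)
        if match:
            samples_mw.append(int(match.group(1)))
            print(f"sample {len(samples_mw)}: {samples_mw[-1]} mW")

print(f"collected {len(samples_mw)} samples")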

Yocto-Watt (physical power measurement device)

Experimental Setup

Standard Configuration

  • Sampling Rate: 100 ms configured (~50 ms effective; see Powermetrics Integration above)
  • Powermetrics Command: powermetrics --samplers cpu_power -i 100 -n "$DURATION"

Baseline Measurements

  • Idle Consumption: 10 runs × 30 seconds each
  • Observation: Irregular consumption peaks during idle state
  • Impact: Background processes affected measurements but were accepted due to time constraints

Experiments & Results

1. Prompt Length Impact

Test Files: test_short_short, test_short_long, test_long_short, test_long_long

Tested all permutations of prompt and response length combinations to isolate energy consumption factors.

Methodology:

  • Short prompts: ~10-20 words
  • Long prompts: ~100+ words
  • Short responses: ~10-50 tokens
  • Long responses: ~200+ tokens
  • Duration: ~260 seconds per test

Key Finding: Response length showed stronger correlation with energy consumption than prompt length.

2. Subject Matter Analysis

Test Files: test_short_topic_1 through test_short_topic_10

Evaluated energy consumption across different domains with consistent short prompt/short response format:

  • Basic Knowledge
  • Geography
  • Chemistry
  • Mathematics
  • History
  • Science
  • Literature
  • Technology

Methodology:

  • 10 different subject prompts
  • Consistent response length target
  • ~270 seconds per test
  • Statistical analysis across topics

Result: Subject matter showed minimal impact on energy consumption. Response length remained the primary factor. Average energy consumption varied by less than 5% across different topics.

3. Language Comparison

Test Files: test_short_language_1 through test_short_language_4

Tested four languages with controlled response lengths:

  • German (test_short_language_1)
  • Spanish (test_short_language_2)
  • Portuguese (test_short_language_3)
  • English (test_short_language_4)

Methodology:

  • Identical semantic prompts translated to each language
  • Target response length controlled
  • ~290 seconds per test
  • Power consumption analysis

Notable Finding: Portuguese consumed approximately 20% more energy for similar response lengths, warranting further investigation. This could indicate tokenization differences or model efficiency variations across languages.

4. Apertus vs Llama 7B Comparison

Test Files: test_short_llm_fight_1 through test_short_llm_fight_4

Limited comparison between models with controlled prompt/response scenarios.

Methodology:

  • Identical prompts for both models
  • Short prompt, short response format
  • 4 comparative test runs
  • ~300 seconds per test

Result: No clear energy consumption trend identified. Both models showed similar energy usage patterns, with Apertus consuming approximately 50J more in comparable scenarios. More extensive testing needed for statistical significance.

5. Baseline Measurements

Test Files: idletest_*, test10 through test20

Idle Consumption Study:

  • 10 runs of 30-second idle measurements
  • Mac Studio background processes created measurement variance
  • Standard deviation: ~15-20% of mean idle power
  • Average idle power: ~250-300mW

Measurement Stability:

  • 20 repeated measurements under identical conditions
  • Assessed measurement reproducibility
  • Identified systematic measurement overhead from powermetrics

Key Findings

  1. Response Length Correlation: Strong positive correlation between response length and energy consumption
  2. Prompt Length Impact: Minimal influence on overall energy usage
  3. Language Variance: Portuguese showed unexpected 20% higher consumption
  4. Model Comparison: Apertus and Llama 7B perform similarly in energy usage

Limitations & Future Work

Identified Issues

  • System Stability: Mac Studio showed high idle consumption variance
  • Measurement Overhead: Powermetrics introduces CPU load and timing discrepancies (~20ms vs 100ms target)
  • Physical vs Digital: 3x-4x gap between powermetrics and physical measurements needs quantification
  • Sample Size: Limited runs per experiment due to time constraints

Recommendations

  1. Investigate and eliminate idle consumption variance sources
  2. Quantify powermetrics measurement overhead impact
  3. Establish correlation between digital metrics and physical power consumption
  4. Increase sample sizes for statistical significance

Technical Implementation

Data Collection Pipeline

  1. Measurement Script (measure.sh): Orchestrates powermetrics data collection
  2. Real-time Processing: AWK script processes powermetrics output stream
  3. Data Storage: Structured CSV format with timestamp synchronization
  4. Analysis Tools: Python scripts for statistical analysis and visualization

File Structure

├── measure.sh              # Main measurement orchestration script
├── quick_efficiency.py     # Token efficiency analysis tool
├── test.py                 # Data visualization and comparison
├── data/                   # Experimental results (42 CSV files)
│   ├── idletest_*          # Baseline idle measurements
│   ├── test_short_short    # Short prompt/short response
│   ├── test_long_long      # Long prompt/long response
│   ├── test_short_topic_*  # Subject matter experiments
│   ├── test_short_language_* # Language comparison tests
│   └── test_short_llm_fight_* # Model comparison tests
└── energy_analysis.png     # Generated visualization

Data Format Specification

Each measurement produces timestamped CSV files with the following schema:

sample,elapsed_time,power_mw,cumulative_energy_j,test_name

  • sample: Sequential measurement number
  • elapsed_time: Time elapsed since measurement start (seconds)
  • power_mw: Instantaneous power consumption (milliwatts)
  • cumulative_energy_j: Running total energy consumption (joules)
  • test_name: Experiment identifier for batch processing

Statistical Analysis Methods

  • Central Tendency: Mean, median power consumption calculation
  • Variability: Standard deviation analysis for measurement stability
  • Energy Integration: Cumulative energy calculation using trapezoidal rule
  • Comparative Analysis: Multi-test statistical comparison with confidence intervals
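
A worked sketch of the trapezoidal energy integration using numpy, with the column names from the data format specification above:

import numpy as np
import pandas as pd

df = pd.read_csv("data/test_short_short.csv")

# Convert milliwatts to watts, then integrate power over elapsed time.
# np.trapz applies the trapezoidal rule: E = sum of 0.5 * (P_i + P_(i+1)) * dt_i.
power_w = df["power_mw"].to_numpy() / 1000.0
time_s = df["elapsed_time"].to_numpy()

energy_j = np.trapz(power_w, time_s)
print(f"Total energy: {energy_j:.3f} J over {time_s[-1]:.1f} s")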

Reproducibility

Dependencies

  • macOS: powermetrics utility (built-in)
  • Python 3.x: pandas, matplotlib for analysis scripts
  • Hardware: Mac Studio (M1/M2) or compatible Apple Silicon system

Running the Experiments

# 1. Make measurement script executable
chmod +x measure.sh

# 2. Run energy measurement (requires sudo for powermetrics)
./measure.sh 300 "my_experiment"

# 3. Analyze results
python3 quick_efficiency.py my_experiment_*.csv "model output text"

# 4. Generate visualizations
python3 test.py data/*.csv

Dataset Summary

Total Measurements: 42 experimental runs
Data Points: ~12,000 individual power measurements
Test Categories:

  • Baseline idle: 12 runs
  • Prompt length: 4 systematic tests
  • Subject matter: 10 domain-specific tests
  • Language comparison: 4 language tests
  • Model comparison: 4 head-to-head tests
  • Stability assessment: 20 repeated measurements

Conclusion

This proof-of-concept demonstrates feasible methodologies for LLM energy consumption measurement and provides initial insights into consumption patterns. The comprehensive dataset of 42 experimental runs with over 12,000 data points establishes a foundation for more rigorous future analysis.

Primary Insights:

  1. Response Length Dominance: Strong positive correlation between token output and energy consumption
  2. Prompt Length Independence: Minimal energy impact from input prompt length variations
  3. Language Efficiency Variance: Notable 20% difference between languages (Portuguese anomaly)
  4. Model Parity: Apertus and Llama 7B show similar energy profiles within measurement uncertainty
  5. Measurement Methodology: Established reproducible framework for LLM energy analysis

Technical Contributions:

  • Automated measurement pipeline with real-time analysis
  • Structured dataset with comprehensive experimental coverage
  • Statistical analysis framework for energy efficiency assessment
  • Open-source toolset for reproducible LLM energy research

Future Work: Quantify powermetrics calibration, expand language coverage, increase statistical sample sizes, and investigate tokenization impacts on energy consumption patterns.


This work represents exploratory research conducted during a hackathon timeframe and should be considered preliminary findings rather than conclusive scientific results. All code and data are available for reproduction and extension.
