Habemus Apertus
This challenge, an invitation to tap into public data sources, is completely AI-generated.
Demonstrate the Public Interest of a Language Model
🅰️ℹ️ Generated with APERTUS-70B-INSTRUCT
Explore the datasets and resources used by the Swiss AI Initiative for Apertus (starting points are listed below). The model is trained on 15 trillion tokens covering more than 1,500 languages, including diverse and underrepresented ones. Seek out explanations of the training framework, and visualize the data sources or the training process (a starter sketch follows below). While the model focuses on linguistic diversity, consider as a team how to truly meet the needs of global communities by enhancing specific cultural capabilities: accessing untapped datasets, or even advocating for data contributions, as was already done for Rumantsch dialects.
See also Oleg's blog #110
Apertus, an open, multilingual 70B parameter language model from the Swiss AI Initiative, is a notable achievement in open and fair technology for the public good. It supports over 1,500 languages, promoting data privacy and linguistic diversity. This collaboration seeks to build on its strengths, address its limitations, and expand its utility for diverse global applications. More information is available here:
https://swissai.dribdat.cc/project/40
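Picking up on the suggestion above to visualize the data sources, the sketch below streams a sample of a multilingual corpus from Hugging Face and plots its language distribution. This is a minimal sketch under assumptions: the dataset id and the `language` field are placeholders, not confirmed parts of the Apertus training mix; substitute a corpus named in the technical report and adjust field names accordingly.

```python
# Minimal sketch: tally and plot the language coverage of a corpus sample.
# The dataset id and the "language" field below are hypothetical placeholders.
from collections import Counter

import matplotlib.pyplot as plt
from datasets import load_dataset

DATASET_ID = "your-org/multilingual-corpus"  # placeholder; see the technical report

stream = load_dataset(DATASET_ID, split="train", streaming=True)
counts = Counter()
for i, record in enumerate(stream):
    counts[record.get("language", "unknown")] += 1
    if i >= 50_000:  # tally a sample slice rather than the full corpus
        break

langs, freqs = zip(*counts.most_common(20))
plt.barh(langs, freqs)
plt.xlabel("documents in 50k-document sample")
plt.title("Top 20 languages by document count")
plt.tight_layout()
plt.show()
```

Streaming avoids downloading the full corpus, which matters for web-scale training data; a hackathon team can iterate on such samples before committing to larger analyses.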
Project Ideas
On a more technical level, you can evaluate the current model against open benchmarks (e.g., XTREME or LAMA, for which existing scores may already be available) in terms of performance, fairness, and robustness across languages and domains; a small evaluation sketch follows the list below. Based on this, brainstorm enhancements to model performance on complex tasks, such as:
- Multilingual understanding (e.g., translating between rare languages or handling low-resource languages)
- Explainability (including the reasons for certain outputs at a reasonable level of detail)
- Ethical use (making privacy protections and transparency even more robust)
- Interoperability and collaboration with other open models
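To make the benchmark idea concrete, here is a hedged sketch that probes the model on a toy translation pair and scores the output with chrF, in the spirit of WMT-style evaluation. The model id and the example pair are assumptions; check the official model card on Hugging Face for the exact identifier, and use real test sets such as WMT24++ or FLORES for any serious run.

```python
# Hedged sketch: probe the model on a toy translation pair and score with chrF.
# Assumptions: the Hugging Face model id below (verify on the official model
# card), and the toy sentence pair -- replace with WMT24++ or FLORES test data.
import sacrebleu
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="swiss-ai/Apertus-8B-Instruct-2509",  # assumed id; small variant for prototyping
    device_map="auto",
)

# Toy (prompt, reference) pairs; a real evaluation iterates over a full test set.
pairs = [
    ("Translate into German: The weather is beautiful today.",
     "Das Wetter ist heute wunderschön."),
]

hypotheses, references = [], []
for prompt, reference in pairs:
    out = generator([{"role": "user", "content": prompt}], max_new_tokens=64)
    hypotheses.append(out[0]["generated_text"][-1]["content"])  # assistant reply
    references.append(reference)

print(sacrebleu.corpus_chrf(hypotheses, [references]))  # chrF over the sample
```

Swapping in a held-out test set per language turns this loop into a simple per-language robustness probe, which is one way to ground the fairness and robustness comparisons suggested above.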
Datasets and Resources
- Read the Model Card and Technical Report from the Apertus team.
- Evaluation by Jannis Vamvas - Expanding the WMT24++ Benchmark (Vamvas et al., 2025): arXiv paper, LinkedIn visualization
- Look up `epfml` on Hugging Face: as mentioned in the documentation, it provides a starting point for multilingual resources. Also explore datasets like `xglue` and `m4cite` for additional support.
- The EU's language resources platform (Erasmus+) or language collections from UNESCO can yield insights on languages with limited digital presence.
- Use XTREME for evaluating cross-lingual understanding capabilities.
- Use Hugging Face's model card format for designing fairness and transparency metrics aligned with EU AI guidelines.
- Review papers on privacy-preserving training techniques for potential integration into future model development.
- Prototype new interfaces where users can ask questions in their own language and receive responses from Apertus, guiding development towards a better experience for linguistically diverse users (a minimal sketch follows this list).
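As one way to prototype such an interface, the sketch below wires the model into a minimal Gradio text box. Again, the model id is an assumption to verify against the official model card.

```python
# Minimal Gradio prototype: ask Apertus a question in any supported language.
# The model id is an assumption -- verify against the official model card.
import gradio as gr
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="swiss-ai/Apertus-8B-Instruct-2509",  # assumed id; small variant for demos
    device_map="auto",
)

def answer(question: str) -> str:
    # Pass the user's question as a single-turn chat and return the model's reply.
    out = chat([{"role": "user", "content": question}], max_new_tokens=256)
    return out[0]["generated_text"][-1]["content"]

gr.Interface(
    fn=answer,
    inputs="text",
    outputs="text",
    title="Ask Apertus in your language",
).launch()
```

A prototype like this makes it easy to collect feedback from speakers of low-resource languages early, which feeds directly into the cultural-capability goals described above.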
Experience to be Gained
- Data Science/Engineering:
- Understanding of natural language processing (NLP)
- Basic understanding of large language model architectures (Transformer model knowledge is beneficial)
- Experience with handling large datasets and possibly low-resource languages
- Familiarity with Hugging Face and associated libraries (transformers, datasets, etc.)
- Data Ethics and Legal: Knowledge of the EU AI Act, data protection standards (GDPR), and open-source licensing (Apache 2.0).
- Multilingual and Cultural Awareness: Proficiency in multiple languages or understanding the importance of diverse linguistic inputs for global applicability.
- Software Development and Prototyping: Skills in Python, JavaScript, or relevant languages for quick prototyping and integrating models into applications or web interfaces.
- Collaboration: Ability to collaborate in a team to design solutions, discuss implications, and iteratively improve based on user feedback.
- Passion for Fair and Open AI: Enthusiasm for democratizing technology, ensuring privacy, and fostering empathy in AI.
This collaborative hackathon provides a platform to critically engage with Apertus by expanding its reach through diverse datasets, improving technical performance, and ensuring it meets broader ethical and practical needs. The result should be both a technical prototype and a document (or presentation) detailing your approach, findings, and recommendations for further improvement and development. Get in touch with us at llm-develop@swiss-ai.org for any specific questions or to connect with our team for input.
We look forward to the innovative solutions and new ideas this collaboration can generate.