Habemus Apertus
This challenge, an invitation to tap into public data sources, is completely AI-generated.
Demonstrate the Public Interest of a Language Model
🅰️ℹ️ Generated with APERTUS-70B-INSTRUCT
Explore the datasets and resources used by the Swiss AI Initiative for Apertus (starting points are listed below). The model is trained on 15 trillion tokens covering more than 1,500 languages, including diverse and underrepresented ones. Seek out explanations of the training framework, and visualize the data sources or the training process (a starter sketch follows below). While the model focuses on linguistic diversity, consider as a team how to truly meet the needs of global communities by enhancing specific cultural capabilities: accessing untapped datasets, or even advocating for data contributions, as was already done for Rumantsch dialects.
See also Oleg's blog #110
Apertus, an open, multilingual 70B parameter language model from the Swiss AI Initiative, is a notable achievement in open and fair technology for the public good. It supports over 1,500 languages, promoting data privacy and linguistic diversity. This collaboration seeks to build on its strengths, address its limitations, and expand its utility for diverse global applications. More information is available here:
https://swissai.dribdat.cc/project/40
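Picking up on the suggestion above to visualize the data sources, the sketch below streams a sample of a multilingual corpus from Hugging Face and plots its language distribution. This is a minimal sketch under assumptions: the dataset id and the `language` field are placeholders, not confirmed parts of the Apertus training mix; substitute a corpus named in the technical report and adjust field names accordingly.

```python
# Minimal sketch: tally and plot the language coverage of a corpus sample.
# The dataset id and the "language" field below are hypothetical placeholders.
from collections import Counter

import matplotlib.pyplot as plt
from datasets import load_dataset

DATASET_ID = "your-org/multilingual-corpus"  # placeholder; see the technical report

stream = load_dataset(DATASET_ID, split="train", streaming=True)
counts = Counter()
for i, record in enumerate(stream):
    counts[record.get("language", "unknown")] += 1
    if i >= 50_000:  # tally a sample slice rather than the full corpus
        break

langs, freqs = zip(*counts.most_common(20))
plt.barh(langs, freqs)
plt.xlabel("documents in 50k-document sample")
plt.title("Top 20 languages by document count")
plt.tight_layout()
plt.show()
```

Streaming avoids downloading the full corpus, which matters for web-scale training data; a hackathon team can iterate on such samples before committing to larger analyses.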
Project Ideas
On a more technical level, you can evaluate the current model against open benchmarks (e.g., XTREME or LAMA, for which existing scores may already be available) in terms of performance, fairness, and robustness across languages and domains; a small evaluation sketch follows the list below. Based on this, brainstorm enhancements to model performance on complex tasks, such as:
- Multilingual understanding (e.g., translating between rare languages or handling low-resource languages)
- Explainability (including the reasons for certain outputs at a reasonable level of detail)
- Ethical use (making privacy protections and transparency even more robust)
- Interoperability and collaboration with other open models
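To make the benchmark idea concrete, here is a hedged sketch that probes the model on a toy translation pair and scores the output with chrF, in the spirit of WMT-style evaluation. The model id and the example pair are assumptions; check the official model card on Hugging Face for the exact identifier, and use real test sets such as WMT24++ or FLORES for any serious run.

```python
# Hedged sketch: probe the model on a toy translation pair and score with chrF.
# Assumptions: the Hugging Face model id below (verify on the official model
# card), and the toy sentence pair -- replace with WMT24++ or FLORES test data.
import sacrebleu
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="swiss-ai/Apertus-8B-Instruct-2509",  # assumed id; small variant for prototyping
    device_map="auto",
)

# Toy (prompt, reference) pairs; a real evaluation iterates over a full test set.
pairs = [
    ("Translate into German: The weather is beautiful today.",
     "Das Wetter ist heute wunderschön."),
]

hypotheses, references = [], []
for prompt, reference in pairs:
    out = generator([{"role": "user", "content": prompt}], max_new_tokens=64)
    hypotheses.append(out[0]["generated_text"][-1]["content"])  # assistant reply
    references.append(reference)

print(sacrebleu.corpus_chrf(hypotheses, [references]))  # chrF over the sample
```

Swapping in a held-out test set per language turns this loop into a simple per-language robustness probe, which is one way to ground the fairness and robustness comparisons suggested above.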
Datasets and Resources
- Read the Model Card and Technical Report from the Apertus team.
- Evaluation by Jannis Vamvas - Expanding the WMT24++ Benchmark (Vamvas et al., 2025): arXiv paper, LinkedIn visualization
- Look up `epfml` on Hugging Face: as mentioned in the documentation, it provides a starting point for multilingual resources. Also explore datasets like `xglue` and `m4cite` for additional support.
- The EU's language resources platform (Erasmus+) or language collections from UNESCO can yield insights on languages with limited digital presence.
- Use XTREME for evaluating cross-lingual understanding capabilities.
- Use Hugging Face's model card format for designing fairness and transparency metrics aligned with EU AI guidelines.
- Review papers on privacy-preserving training techniques for potential integration into future model development.
- Prototype new interfaces where users can ask questions in their own language and receive responses from Apertus, guiding development towards a better experience for linguistically diverse users (a minimal sketch follows this list).
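As one way to prototype such an interface, the sketch below wires the model into a minimal Gradio text box. Again, the model id is an assumption to verify against the official model card.

```python
# Minimal Gradio prototype: ask Apertus a question in any supported language.
# The model id is an assumption -- verify against the official model card.
import gradio as gr
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="swiss-ai/Apertus-8B-Instruct-2509",  # assumed id; small variant for demos
    device_map="auto",
)

def answer(question: str) -> str:
    # Pass the user's question as a single-turn chat and return the model's reply.
    out = chat([{"role": "user", "content": question}], max_new_tokens=256)
    return out[0]["generated_text"][-1]["content"]

gr.Interface(
    fn=answer,
    inputs="text",
    outputs="text",
    title="Ask Apertus in your language",
).launch()
```

A prototype like this makes it easy to collect feedback from speakers of low-resource languages early, which feeds directly into the cultural-capability goals described above.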
Experience to be Gained
- Data Science/Engineering:
- Understanding of natural language processing (NLP)
- Basic understanding of large language model architectures (Transformer model knowledge is beneficial)
- Experience with handling large datasets and possibly low-resource languages
- Familiarity with Hugging Face and associated libraries (transformers, datasets, etc.)
- Data Ethics and Legal: Knowledge of the EU AI Act, data protection standards (GDPR), and open-source licensing (Apache 2.0).
- Multilingual and Cultural Awareness: Proficiency in multiple languages or understanding the importance of diverse linguistic inputs for global applicability.
- Software Development and Prototyping: Skills in Python, JavaScript, or relevant languages for quick prototyping and integrating models into applications or web interfaces.
- Collaboration: Ability to collaborate in a team to design solutions, discuss implications, and iteratively improve based on user feedback.
- Passion for Fair and Open AI: Enthusiasm for democratizing technology, ensuring privacy, and fostering empathy in AI.
This collaborative hackathon provides a platform to critically engage with Apertus by expanding its reach through diverse datasets, improving technical performance, and ensuring it meets broader ethical and practical needs. The result should be both a technical prototype and a document (or presentation) detailing your approach, findings, and recommendations for further improvement and development. Get in touch with us at llm-develop@swiss-ai.org for any specific questions or to connect with our team for input.
We look forward to the innovative solutions and new ideas this collaboration can generate.