Habemus Apertus

This challenge, which invites you to tap into public data sources, was generated entirely by AI.

Demonstrate Public Interest of a Language Model

🅰️ℹ️ Generated with APERTUS-70B-INSTRUCT

Explore the datasets and resources used by the Swiss AI Initiative for Apertus (starting points listed below). The model is reported to be trained on 15 trillion tokens covering more than 1,500 languages, including diverse and underrepresented ones. Seek out explanations of the training framework, and visualize the data sources or the training process. While the model focuses on linguistic diversity, consider as a team how to truly meet the needs of global communities by enhancing specific cultural capabilities: accessing untapped datasets, or even advocating for data contributions, as has already been done for Rumantsch dialects.
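As a starting point for visualizing data sources, here is a minimal sketch that turns per-language token counts into shares and prints them as text bars. The function names `token_shares` and `text_bars` are hypothetical helpers, and the counts are made-up placeholders, not Apertus statistics; replace them with figures from the actual dataset documentation.

```python
def token_shares(token_counts):
    """Map each language code to its fraction of the total token count."""
    total = sum(token_counts.values())
    if total == 0:
        return {lang: 0.0 for lang in token_counts}
    return {lang: n / total for lang, n in token_counts.items()}


def text_bars(shares, width=40):
    """Render shares as text bars, largest share first."""
    lines = []
    ordered = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    for lang, share in ordered:
        bar = "#" * round(share * width)
        lines.append(f"{lang:<4}{bar} {share:.1%}")
    return lines


# Made-up counts for illustration only; NOT real Apertus numbers.
counts = {"en": 600, "de": 250, "fr": 100, "rm": 50}
shares = token_shares(counts)
print("\n".join(text_bars(shares)))
```

A text chart like this is enough to spot how small the share of an underrepresented language is before investing in a full plotting setup.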

See also Oleg's blog #110


Apertus, an open, multilingual 70B parameter language model from the Swiss AI Initiative, is a notable achievement in open and fair technology for the public good. It supports over 1,500 languages and is designed to promote data privacy and linguistic diversity. This collaboration seeks to build on its strengths, address its limitations, and expand its utility for diverse global applications. More information is available here:

https://swissai.dribdat.cc/project/40

Project Ideas

On a more technical level, you can evaluate the current model on open benchmarks (e.g., XTREME or LAMA, for which scores may already be available), looking at performance, fairness, and robustness across languages and domains. Based on the results, brainstorm ways to improve the model on complex tasks, such as:

  • Multilingual understanding (e.g., translating between rare languages or handling low-resource languages)
  • Explainability (giving the reasons for certain outputs at a reasonable level of detail)
  • Ethical use (making privacy protections and transparency even more robust)
  • Interoperability and collaboration with other open models
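One simple way to look at performance and fairness across languages is to compute accuracy per language and then the gap between the best and worst language. The sketch below assumes you already have benchmark results as (language, is_correct) pairs; the helper names `per_language_accuracy` and `accuracy_gap` are hypothetical, and the sample records are placeholders, not real Apertus scores.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} from (language, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def accuracy_gap(scores):
    """Difference between the best and worst language (0.0 if empty)."""
    if not scores:
        return 0.0
    return max(scores.values()) - min(scores.values())


# Placeholder results for illustration only; NOT real benchmark output.
sample = [
    ("de", True), ("de", True), ("de", False), ("de", True),
    ("rm", True), ("rm", False), ("rm", False), ("rm", True),
]
scores = per_language_accuracy(sample)
gap = accuracy_gap(scores)
for lang, acc in sorted(scores.items()):
    print(f"{lang}: {acc:.2f}")
print(f"gap: {gap:.2f}")
```

A large gap suggests where additional data or targeted evaluation for low-resource languages might help most; it is a rough indicator, not a complete fairness metric.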

Datasets and Resources

Experience to be Gained

  • Data Science/Engineering:
    • Understanding of natural language processing (NLP)
    • Basic understanding of large language model architectures (Transformer model knowledge is beneficial)
    • Experience with handling large datasets and possibly low-resource languages
    • Familiarity with Hugging Face and associated libraries (transformers, datasets, etc.)
  • Data Ethics and Legal: Knowledge of the EU AI Act, data protection standards (GDPR), and open source licensing (Apache 2.0).
  • Multilingual and Cultural Awareness: Proficiency in multiple languages, or an understanding of the importance of diverse linguistic inputs for global applicability.
  • Software Development and Prototyping: Skills in Python, JavaScript, or relevant languages for quick prototyping and integrating models into applications or web interfaces.
  • Collaboration: Ability to collaborate in a team to design solutions, discuss implications, and iteratively improve based on user feedback.
  • Passion for Fair and Open AI: Enthusiasm for democratizing technology, ensuring privacy, and fostering empathy in AI.

This collaborative hackathon provides a platform to critically engage with Apertus by expanding its reach through diverse datasets, improving technical performance, and ensuring it meets broader ethical and practical needs. The result should be both a technical prototype and a document (or presentation) detailing your approach, findings, and recommendations for further improvement and development. Get in touch with us at llm-develop@swiss-ai.org for any specific questions or to connect with our team for input.

We look forward to the innovative solutions and new ideas this collaboration can generate.

Hackathons thrive on ideas, collaboration, and innovation, and they depend on keeping the experience safe, inclusive, and respectful for everyone. We follow a clear Code of Conduct and support the Universal Declaration of Human Rights. Harassment or discrimination of any kind will not be tolerated; this applies to all staff, participants, coaches, visitors, and sponsors. Please take a moment to review the full guidelines.

The contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License. The application that powers this site is available under the MIT license.