University of Zurich · Spring 2026 · BIN-1-D.22 · Wednesdays 14:00–15:45
More than 7,000 languages are spoken around the world today, yet they do not share equal status. Many are endangered, and the vast majority lack adequate technological support. In fact, an estimated 98% of the world’s languages can be considered “low-resource” from an NLP perspective. The term “low-resource” itself is not fixed: any language can fall into this category depending on the specific task or domain. A language well supported for basic text processing may still lack resources for machine translation, speech recognition, or domain-specific applications like medical or legal NLP.
Beyond the question of resources, language and speech technology has traditionally operated on a monolithic assumption, privileging standardized written forms while neglecting the rich diversity that exists within languages. Dialects, regional varieties, and non-standard forms (often limited to spoken language) pose distinct challenges: orthographic variation, lexical differences, loanwords, and the absence of conventional data sources. It is with these varieties that we face our greatest challenge.
The course is organized around three core pillars of robust NLP: Data, Learning, and Evaluation. In the data pillar, we investigate approaches to resource creation, annotation strategies, and data quality considerations when working with limited materials. The learning pillar focuses on how to make the best use of existing models and knowledge from high-resource languages through techniques such as transfer learning, cross-lingual representations, and zero-shot methods. Finally, the evaluation pillar examines how metrics and benchmarks shape our understanding of model performance, and how evaluation practices must adapt when standard assumptions no longer hold.
This seminar follows a problem-driven structure. Each student identifies a specific technical or sociolinguistic bottleneck in the current NLP landscape for a marginalized variety and ships a concrete solution. Projects progress through three milestones: problem identification (Week 5), solution proposal (Week 8), and final implementation and presentation (Week 14). Further details are available on OLAT.