
Creating Multilingual Digital Infrastructures

By Puthiya Purayil Sneha · October 2022 · 7 Minute Read


Screenshot of cover from “State of the Internet’s Languages” report. Licensed under CC NC-SA 4.0.

From the Metadata Learning and Unlearning series



This essay shares reflections and learnings from research and mapping efforts to build multilingual digital infrastructures.

The growing discourse on the need for a multilingual and diverse internet over the last several years, particularly in the wake of the Covid-19 pandemic, has opened up several important questions about our engagement with digital infrastructures in general. As illustrated by a substantive body of work in digital rights and policy, research, and computing, among others (especially in the last decade), the development of inclusive and accessible digital content and spaces is also premised on addressing several systemic barriers to access, language being a persistent one. Challenges in reading, writing, and speaking in multiple languages on digital interfaces remain prevalent across the world, especially for non-dominant communities. While there are several longstanding efforts to address these challenges, they remain limited by larger infrastructural challenges in access to and use of the internet and digital technologies.

The Internet Reinforces Global Language Disparities

Ethnologue, a global reference publication on languages, notes that as of 2022, 3,045 languages are endangered, which is 43% of all living languages. Among the 7,151 spoken languages that we know of, just 23 account for more than half the world’s population. These numbers show how stark the asymmetries in the development and use of languages are, and how large the problem is. These disparities are reflected on the internet as well, with only a select number of languages available on digital interfaces in multifunctional ways. And while this may present as a technological problem, the gaps actually predate the digital turn, as they are a result of several forms of systemic social exclusion, many of which are located in colonial infrastructures of knowledge production. Today, with the emergence of new forms of colonisation of data, efforts to identify and address language inequities have become even more significant.

Wikipedia content and number of speakers for the ten most widely spoken languages in the world. (Population estimate: Ethnologue 2019, which includes second-language speakers.) Screenshot taken 12 August 2022. Licensed under CC NC-SA 4.0.

Initiatives to Document Knowledge Gaps

The first initiative is the recently published State of the Internet’s Languages (STIL) report, led by Whose Knowledge? in collaboration with the Oxford Internet Institute, the Centre for Internet and Society (CIS), and over 100 people around the world. Through an exploration of data and stories on how people read, write, and speak online in multiple languages, the report offers an overview of some of the key issues related to language inequity online. Building on the premise that “language is a proxy for knowledge,” the report reflects on how human knowledge, especially that produced in non-dominant and marginalized languages, remains underrepresented on the web, and documents several ongoing efforts to address these challenges.

The second initiative is a set of short-term research projects on Wikimedia platforms and communities in India undertaken by the Access to Knowledge program at CIS. The research studies cover an array of topics, including systemic gaps like the gender bias and divide in Indian language Wikimedia projects, debates on open access and reuse across Wikimedia and Galleries, Libraries, Archives, and Museums (GLAM) initiatives, and forms of multilingual pedagogy and content creation across diverse projects. (A compilation of the projects completed between 2019 and 2021 is available on Wikimedia Commons.)

Compilation of research studies by the Access to Knowledge Program, Centre for Internet and Society. Shared on Wikimedia Commons under a CC BY-SA 3.0 license.

Language Reflects Larger Power Dynamics in Society

Languages don’t exist in isolation; they grow with people and other languages. The STIL report talks of dominant and marginalized languages in multiple global and local contexts of power and privilege, and this is illustrated in the data narratives and stories in the report. Many of them speak of forms of interlanguage marginalisation, the relationship with colonial languages, and how this affects access to critical information and educational content, social and economic mobility, community identities and memory, and so on. The report includes “contributions about Indigenous languages like Chindali, Cree, Ojibway, Mapuzugun, Zapotec, and Arrernte from Africa, the Americas, and Australia . . . minority languages like Breton, Basque, Sardinian, and Karelian in Europe, as well as regionally and globally dominant languages like Bengali, Indonesian (Bahasa Indonesia) and Sinhala in Asia, and different forms of Arabic across North Africa.”1

Wikipedia’s local-language prevalence. Are the most detailed representations of a country written in a local language (orange and beige), or a foreign language (blue)? (Language data: Unicode CLDR 2019.) Licensed under CC NC-SA 4.0. Screenshot taken 12 August 2022.

Linguistic barriers also disproportionately affect marginalized and vulnerable groups, as they often open up space for harms such as misinformation, hate speech, and gender-based violence. The growing body of work on the gender gap in terms of content about and participation by women across Wikimedia projects also illustrates that these gaps are tied to several factors such as access, infrastructure, and capacity-building. The limited availability of existing Indian-language resources on gender, sexuality and feminism in digital forms is an added impediment to addressing these gaps.

The Role of Technology

Learnings from these two projects also offer a multi-layered and intersectional perspective on understanding infrastructure conceptually and politically, because the technologies we use speak a “different language” than what we may use to communicate with each other. Consider the need for development of keyboards, fonts, and software in various languages; the number of languages that are easily accessible on your smartphone; the lack of accurate translations into and from Indigenous and regional languages or for conceptual terms related to gender, sexuality, and feminism; and accessibility of content and devices for persons with disabilities.

Efforts in preservation, sourcing, digitizing, translating, sharing, and (re)using content in multiple languages (especially on open knowledge platforms like Wikimedia projects) are beset by multiple challenges, including legal and cultural factors. As mentioned earlier, these are not just technological gaps, but historical knowledge gaps that disproportionately affect marginalized communities by reinforcing existing power inequalities.

While many of these challenges with the development of digital infrastructures remain prevalent across the world, these technologies and platforms also have multiple affordances that may lend themselves to addressing these challenges. While the data narratives and maps in the STIL report offer an important macro perspective on the scale of these knowledge gaps, the stories present several embodied, experiential narratives of languages in the digital space, whether through speech, signs, emojis, or text. The significance of orality and voice is emphasized across many narratives, which also questions the primacy of the textual on the internet and digital interfaces. Several communities across the world, as illustrated in these stories, also use digital tools and platforms, including social media, in creative ways to circumvent the barriers created by lack of access and low resources. Efforts in multilingual content creation on Wikimedia projects such as Wikimedia Commons and Wikidata also illustrate the need to map such existing content across diverse formats and to invest in creating inclusive and accessible structured and linked data. The multiplicity of forms and formats in content creation in diverse languages is an important factor in rethinking diversity in access and use.

Multilingual Internets Require an Intersectional Approach

These are a few key learnings from the two projects; many of the concerns highlighted by diverse communities across the world are also informed by larger contexts of ownership and regulation of digital infrastructures. As mentioned earlier, while the projects are focused on linguistic disparities, the challenges are indicative of longer historical knowledge gaps. Efforts in building multilingual internets need to take an intersectional approach, and importantly foreground community-led initiatives in the space which have long been working to address these gaps. The learnings from these projects, and indeed continued work in these areas, aim to inform research and practice across different spaces, including but not limited to language-related computing, archival practice, open educational resources, and newer fields like digital humanities. Collaborations across these spaces would further support researchers, creative practitioners, academia and policymakers in aiding efforts to develop and foster open, decolonial, and multilingual digital infrastructures.

——
The Metadata Learning and Unlearning series was originally published on Medium.com and edited by Sharon Mizota, Virginia Poundstone, and Garrett Graddy-Lovelace. This series raises questions and makes proposals for what metadata can do to advance a broader dialogue about diverse worldviews within open education and openGLAM realms.

Puthiya Purayil Sneha

Puthiya Purayil Sneha is a researcher with the Centre for Internet and Society (CIS), India. Her areas of interest and work include digital media and cultures, methodological concerns in arts and humanities practice and pedagogy, and access to knowledge.

Citations

1. “State of the Internet’s Languages Report.” 2022, https://internetlanguages.org/en/. Accessed 7 April 2023.
