    The race to make AI as multilingual as Europe

    By TechAiVerse | July 1, 2025 | 12 Mins Read

    The European Union has 24 official languages and dozens more unofficial ones spoken across the continent. If you add in the European countries outside the union, then that brings at least a dozen more into the mix. Add dialects, endangered languages, and languages brought by migrants to Europe, and you end up with hundreds of languages.

    One thing many of us in technology can agree on is that the US dominates, and that extends to the languages used online. The main reason is that American institutions, standards bodies, and companies defined how computers, their operating systems, and the software they run worked in their nascent days. This is changing, but for the short term at least, it remains the norm. It has also led to much of the web being in English. An astounding 50% of websites are in English, despite it being the native tongue of only about 6% of the world’s population. Spanish, German, and Japanese come next, but a long way behind, each accounting for only 5-6% of the web.

    Many of the new wave of AI-powered applications and services are driven by large language models (LLMs). Because much of the data used to train these LLMs is scraped (controversially in many cases) from the web, they predominantly understand and respond in English. With the rapid growth of AI tools driving a shift in technological paradigm, this is a problem, and we’re carrying it into a new age.

    Europe already boasts several high-profile AI companies and projects, such as Mistral and Hugging Face. Google DeepMind also originated as a European company. The continent has research projects that develop language models to enhance how AI tools comprehend less commonly spoken languages.

    This article explores some of these initiatives, questions their effectiveness, and asks whether their efforts are worthwhile or if many users default to using English versions of tools. As Europe seeks to build its independence in AI and ML, does the continent have the companies and skills necessary to achieve its goals?

    Terminology and technology primer

    To make sense of what follows, you don’t need to understand how models are created, trained, or function. But it’s helpful to understand a couple of basics about models and their human language support.

    Unless a model’s documentation explicitly states that it is multilingual or cross-lingual, prompting it in an unsupported language, or requesting a response in one, may cause it to translate back and forth or respond in a language it does understand. Both strategies can produce unreliable and inconsistent results, especially in low-resource languages.

    High-resource languages, such as English, benefit from abundant training data. Low-resource languages, such as Gaelic or Galician, have far less, which often leads to inferior performance.
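
    One rough way to notice when a model has answered in a different language from the one requested is to compare the reply against short stopword lists. The sketch below is only an illustration under stated assumptions: the stopword lists are tiny and hand-picked, the function names are hypothetical, and a real check would use a proper language-identification tool.

```python
import re
from typing import Optional

# Tiny, hand-picked stopword lists (an assumption for illustration only;
# they are far from exhaustive and cover just two languages).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "with", "is", "for", "in", "a"},
    "fr": {"le", "la", "les", "et", "de", "des", "du", "avec", "pour", "une", "un", "est"},
}

def guess_language(text: str) -> Optional[str]:
    """Return the code whose stopwords occur most often, or None if unclear."""
    # Split into lowercase runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # Refuse to guess when nothing matched or when two languages tie.
    return winners[0] if best > 0 and len(winners) == 1 else None

def replied_in_requested_language(requested: str, reply: str) -> bool:
    """True when the guessed language of the reply matches the requested code."""
    return guess_language(reply) == requested
```

    A heuristic like this only flags obvious mismatches, such as an English reply to a French prompt. It says nothing about how natural or accurate the reply is.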

    The harder concept to explain is what makes a model “open.” That’s unusual, as software in general has had a fairly clear definition of “open source” for a while. I don’t want to delve too deeply into this topic, as the exact definition is still in flux and controversial. In short, even when a model calls itself “open” and is referred to as “open,” the word doesn’t always mean the same thing.

    Here are two other useful terms to know:

    Training teaches a model to make predictions or decisions based on input data.

    Parameters are variables learned during model training that define how the model maps inputs to outputs. In other words, how it understands and responds to your questions. The larger the number of parameters, the more complex the model is.
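
    To make the two terms concrete, here is a deliberately tiny sketch: a “model” with just two parameters, w and b, trained by gradient descent to fit points on the line y = 2x + 1. The function name, data, and learning rate are assumptions chosen for illustration. Real language models follow the same basic idea with billions of parameters and far more elaborate training.

```python
def train_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the model's two parameters, before training
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Training nudges each parameter to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = train_linear(xs, ys)
prediction = w * 5.0 + b  # the learned parameters map a new input to an output
```

    After training, w and b end up close to 2 and 1, so the prediction for an input of 5 is close to 11.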

    With that brief explanation done, how are European AI companies and projects working to enhance these processes to improve European language support?

    Hugging Face

    When someone wants to share code, they typically provide a link to their GitHub repository. When someone wants to share a model, they typically provide a Hugging Face link. Founded in 2016 by French entrepreneurs in New York City, the company is an active participant in creating communities and a strong proponent of open models. In 2024, it started an AI accelerator for European startups and partnered with Meta to develop translation tools based on Meta’s “No Language Left Behind” model. It is also one of the driving forces behind the BLOOM model, a groundbreaking multilingual model that set new standards for international collaboration, openness, and training methodologies.

    Hugging Face is a useful tool for getting a rough idea of the language support in models. At the time of writing, it lists 1,743,136 models and 298,927 datasets. Look at its leaderboard for monolingual models and datasets, and you see the following ranking for models and datasets that developers have tagged (by adding metadata) as supporting European languages:

    Language            Language code   Datasets   Models
    English             en              27,702     205,459
    English             eng             1,370      1,070
    French              fra             1,933      850
    Spanish (Español)   es              1,745      10,028
    German (Deutsch)    de              1,442      9,714

    You can already see some issues here. The tags aren’t set in stone: the community can add values freely. Contributors mostly follow the standard language codes, but there is some duplication, such as English appearing under both “en” and “eng”.

    As you can see, the models are dominated by English. A similar issue applies to the datasets on Hugging Face, which lack non-English data.
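
    The tag duplication noted above can be smoothed over before comparing languages. The following is a minimal sketch that folds three-letter codes into their two-letter equivalents and sums the counts from the table. The mapping is a small illustrative subset and the function name is hypothetical. Summing is only approximate, since an item tagged with both “en” and “eng” would be counted twice.

```python
from collections import defaultdict

# Illustrative subset mapping three-letter codes to two-letter codes
# (an assumption for this sketch, not a complete table).
CANONICAL = {"eng": "en", "fra": "fr", "deu": "de", "spa": "es"}

# (tag, datasets, models) as listed in the table above.
ROWS = [
    ("en", 27_702, 205_459),
    ("eng", 1_370, 1_070),
    ("fra", 1_933, 850),
    ("es", 1_745, 10_028),
    ("de", 1_442, 9_714),
]

def merge_by_language(rows):
    """Sum dataset and model counts under one canonical code per language."""
    totals = defaultdict(lambda: {"datasets": 0, "models": 0})
    for tag, datasets, models in rows:
        # Fall back to the tag itself when no canonical form is known.
        code = CANONICAL.get(tag.lower(), tag.lower())
        totals[code]["datasets"] += datasets
        totals[code]["models"] += models
    return dict(totals)

merged = merge_by_language(ROWS)
total_models = sum(entry["models"] for entry in merged.values())
english_share = merged["en"]["models"] / total_models
```

    With the figures in the table, English accounts for roughly nine in ten of the listed models once the two English tags are combined.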

    What does this mean?

    Lucie-Aimée Kaffee, EU Policy Lead at Hugging Face, said that the tags indicate that a model has been trained to understand and process this language or that the dataset contains materials in that language. She added that the confusion about language support often arises during training. “When training a large model, it’s common for other languages to accidentally get caught in training because there were some artefacts of it in that dataset,” she said. “The language a model is tagged with is usually what the developers intended the model to understand.”

    As one of the main and busiest destinations for model developers and researchers, Hugging Face not only hosts much of their work, but also lets them create outward-facing communities to tell people how to use them.

    Thomas Wolf, co-founder of Hugging Face, described Bloom as “the world’s largest open multilingual language model.” Credit: Shauna Clinton/Web Summit via Sportsfile

    Mistral AI

    Perhaps the best-known Europe-based AI company is France’s Mistral AI, which unfortunately declined an interview. Its multilingual challenges partly inspired this article. At the FOSDEM developer conference in February 2024, linguistics researcher Julie Hunter asked one of Mistral’s models for a recipe in French — but it responded in English. However, 16 months is an eternity in AI development, and neither the company’s “Le Chat” chat interface nor running its 7B model locally reproduced the same error in recent tests. But interestingly, 7B did produce a spelling error in the opening line: “boueef” — and more may follow.

    While Mistral sells several commercial models, tools, and services, its free-to-use models are popular, and I personally tend to use Mistral 7B for running tasks through local models.

    Until recently, the company wasn’t explicit about its models having multilingual support, but its announcement of the Magistral model at London Tech Week in June 2025 confirmed support for several European languages.

    EuroLLM

    EuroLLM was created as a partnership between Portuguese AI platform Unbabel and several European universities to build a model that understands and generates text in all official European Union languages. The model also covers non-European languages widely spoken by immigrant communities and major trading partners, such as Hindi, Chinese, and Turkish.

    Like some of the other open model projects in this article, its work was partly funded by the EU’s High Performance Computing Joint Undertaking program (EuroHPC JU). Many of them share similar names and aims, making it confusing to separate them all. EuroLLM was one of the first, and as Ricardo Rei, Senior Research Scientist at Unbabel, told me, the team has learned a lot from the projects that have come since.

    As Unbabel’s core business is language translation, and translation is a key task for many multilingual models, the work on EuroLLM made sense to the Portuguese platform. Before EuroLLM, Unbabel had already been refining existing models to build its own and found them all too English-centric.

    One of the team’s biggest challenges was finding sufficient training data for low-resource languages. Ultimately, the availability of training material reflects the number of people who speak the language. One of the common data sources used to train European language models is Europarl, which contains transcripts of the European Parliament’s activities translated into all official EU languages. It’s also available as a Hugging Face dataset, thanks to ETH Zürich.

    Currently, the project has a 1.7B parameter model and a 9B parameter model, and is working on a 22B parameter model. In all cases, the models can translate, but are also general-purpose, meaning you can chat with them in a similar way to ChatGPT, mixing and matching languages as you do.

    OpenLLM Europe

    OpenLLM Europe isn’t building anything directly, but it’s fostering a Europe-wide community of LLM projects, specifically those for medium- and low-resource languages. Don’t let the one-page GitHub repository fool you: the Discord server is lively and active.

    OpenEuroLLM, Lumi, and Silo

    A joint project between several European universities and companies, OpenEuroLLM is one of the newer and larger entrants to the list of projects funded by EuroHPC. This means that it has no public models as of yet, but it involves many of the institutions and individuals behind the Lumi family of models that focus on Scandinavian and Nordic languages. It aims to create a multilingual model, provide more datasets for other models and conform to the EU AI Act.

    I spoke about the plans with Peter Sarlin of AMD Silo, one of the companies involved in the project. Sarlin, a key figure in Finnish and European AI development, explained that Finland, especially, has several institutes with significant AI research programs, along with Lumi, one of the supercomputers that are part of EuroHPC. Silo, through its SiloGen product, offers open source models to customers, with a strong focus on supporting European languages. Sarlin pointed out that while sovereignty is an important motivation to him and Silo for creating and maintaining models that support European languages, the more compelling reason is expanding the business and helping companies build solutions for small markets such as Estonia.

    “Open models are great building blocks, but they aren’t as performant as closed ones, and many businesses in the Nordics and Scandinavia don’t have the resources to build tools based on open models,” he said. “So Silo and our models can step in to fill the gaps.”

    Under Sarlin’s leadership, Silo AI built a Nordic LLM family to protect the region’s linguistic diversity. Credit: Silo AI

    The Lumi models use a “cross-lingual training” technique in which the model shares its parameters between high-resource and low-resource languages.

    All this prior work led to the OpenEuroLLM project, which Sarlin describes as “Europe’s largest open source AI initiative ever, including pretty much all AI developers in Europe apart from Mistral.”

    While many efforts are underway and performing well, the training data issue for low-resource languages remains the biggest challenge, especially amid the move towards more nuanced reasoning models. Translations and cross-lingual training are options, but can create responses that sound unnatural to native speakers. As Sarlin said, “We don’t want a model that sounds like an American speaking Finnish.”

    OpenLLM France

    France is one of the more active countries in AI development, with Mistral and Hugging Face leading the way. From a community perspective, the country also has OpenLLM France. The project (unsurprisingly) focuses on French language models, offering several models with different parameter counts, along with datasets that help other projects train and improve their own French-language models. The datasets include a mix of political discourse, meeting recordings, theatre shows, and casual conversations. The project also maintains a leaderboard of French models on Hugging Face, one of the few (active) European language model benchmark pages.

    Do Europeans care about multilingual AI?

    Europe is full of people and projects working on multilingual language models. But do consumers care? Unfortunately, getting language usage rates for proprietary tools such as ChatGPT or Mistral is almost impossible. I created a poll on LinkedIn asking if people use AI tools in their native language, English, or a mixture of both. The results were a 50/50 split between English and a mixture of languages. This could indicate that the number of people using AI tools in a non-English language is higher than you think.

    Typically, people use AI tools in English for work and in their own language for personal tasks.

    Kaffee, a German and English speaker, said: “I use them mostly in English because I speak English at work and with my partner at home. But then, for personal tasks…, I use German.”

    Kaffee mentioned that Hugging Face was working on a soon-to-be-published research project that fully analysed the usage of multilingual models on the platform. She also noted anecdotally that their usage is on the rise. 

    “Users have a conception that models are now more multilingual. And with the accessibility through large models like Llama, for example, being multilingual, I think that made a big impact on the research world regarding multilingual models and the number of people wanting to now use them in their own language.”

    The internet was always supposed to be global and for everyone, but the damning statistic that 50% of sites are in English shows it never really worked out that way. We’re entering a new phase in how we access information and who controls it. Maybe this time, the (AI) revolution will be international.

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he delivers clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.
