February 12, 2026

Roberto Di Cosmo on software sovereignty

Most people think of digital sovereignty as a question of "where your data is stored." But for Software Heritage Director Roberto Di Cosmo, that misses the forest for the trees. True autonomy isn’t just about the server; it’s about the code. If you don’t have the source code, you don’t own the machine: the machine owns you.

In an interview for the Politiques Numériques (POL/N) series, veteran tech journalist Delphine Sabattier sat down with Di Cosmo at UNESCO Headquarters in Paris during the recent Software Heritage Symposium. They discussed the 10-year journey of Software Heritage and why this "universal archive" has evolved from a preservation project into a critical piece of global infrastructure. (Note: You can watch the full 27-minute conversation in French on YouTube.)

Q: Can you describe the actual scale of this "Library of Alexandria" for code?

A: The Archive has gathered the world’s software knowledge and the entire history of its development into a single, unique graph. This massive data structure currently contains 50 billion nodes and 1 trillion "edges" (connections between pieces of code), representing more than 400 million projects. It preserves every modification and every version ever made public, allowing researchers to track the evolution of software over the last 50 years.

Q: What is the major geopolitical risk facing the world’s software infrastructure today?

A: Currently, about three-quarters of the world’s public software is hosted on a single platform, GitHub (owned by Microsoft). This creates a massive dependency: if a platform owner were to cut off access or delete software because of a changing geopolitical situation, it would endanger the entire software chain for companies, administrations, and researchers. Software Heritage provides "software sovereignty" by maintaining independent copies of all these versions regardless of their origin.

Q: How does Software Heritage protect its archive from being destroyed by a single technical bug or attack?

A: The project uses a strategy Di Cosmo calls "computational biodiversity." Rather than just making identical copies, they establish "mirrors" in different countries (such as a recently announced mirror in Spain) and encourage them to use different underlying technologies and storage methods. This ensures that if one technology fails or has a vulnerability, the other copies remain intact. Every object in the Archive is also identified by a cryptographic hash-based identifier, the SoftWare Hash IDentifier (SWHID), now an ISO standard, which makes it possible to verify data integrity.
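
To make the integrity idea concrete, here is a minimal Python sketch of how an identifier for a single file (a "content" object) can be derived from its bytes and then used to check a copy fetched from any mirror. It assumes the `swh:1:cnt:` form of the SWHID, whose hash is computed the same way Git hashes a blob; the function names `content_swhid` and `matches` are illustrative and not part of any Software Heritage tool.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file's raw bytes (a "content" object)."""
    # Content identifiers reuse Git's blob hashing: SHA-1 over a short
    # header ("blob <length>\0") followed by the bytes themselves.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def matches(data: bytes, swhid: str) -> bool:
    """Check that bytes obtained from some copy match a known identifier."""
    return content_swhid(data) == swhid


if __name__ == "__main__":
    original = b"hello world\n"
    swhid = content_swhid(original)
    print(swhid)  # swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad

    # A copy that differs by even one byte no longer matches.
    print(matches(original, swhid))          # True
    print(matches(b"hello world", swhid))    # False
```

Because the identifier depends only on the content, two mirrors built on entirely different storage technologies can be checked against the same SWHID without trusting either one.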

*https://www.youtube.com/watch?v=jSk2ky2NUC4*

Q: Why is software code so important for training modern AI models, even if they aren’t designed for coding?

A: Di Cosmo reveals a "secret" in the industry: to train high-quality AI models, it’s necessary to include approximately 10 percent source code in the training data. Because source code is formal, executable, and lacks ambiguity, including it appears to improve the AI’s overall reasoning capabilities. Software Heritage is uniquely positioned here because it sits on the largest mass of code available on the planet, making it a primary source for training these models.
