Roadmap 2025#
(Version 1.0, last modified 2025-05-15)
This document provides an overview of the technical roadmap of the Software Heritage initiative for the year 2025.
This year, our roadmap is focused on three main objectives:
Enlarge the archive: archive more content and be able to archive even more
Enrich the archive: add new data about the archive content
Empower archive users: propose new ways to access the archive to enable broader usage
The roadmap is largely driven by several projects:
CodeCommons: https://codecommons.org/
SWH-Sec: https://www.softwareheritage.org/2023/04/07/enhancing-cybersecurity-through-swh/
OSPO-Radar: https://www.softwareheritage.org/2025/04/02/ospo-radar-project-launch/
Many items of this roadmap are handled by other teams involved with us in the CodeCommons and SWH-Sec projects; these items are tagged “Externals”.
Items tagged “Next” are not prioritized this year; they are kept here for next year, or in case other items are delivered faster than planned.
COAR Notify#
Priority: High
Tags: Interfaces Work Group, FAIR, Deposit, Enlarge
Description
Add support for the COAR Notify protocol in the SWH Archive, to allow partners to notify us of new relations between software source code artifacts and external entities, especially scholarly publications and scientific papers.
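As a sketch of the basic use case, the snippet below posts a COAR Notify (Linked Data Notifications) payload announcing a relationship between a publication and an archived software artifact. The inbox URL, the relationship URI, and the SWHID are placeholders, and the exact payload fields are assumptions modeled on the general COAR Notify pattern, not the final API.
```python
import json
import uuid
import urllib.request

# Hypothetical SWH inbox URL; the real endpoint is not defined yet.
INBOX = "https://archive.softwareheritage.org/inbox/"

# Illustrative notification: a partner announces that a paper mentions a
# software artifact identified by a SWHID. Field names are assumptions.
notification = {
    "@context": [
        "https://www.w3.org/ns/activitystreams",
        "https://purl.org/coar/notify",
    ],
    "id": f"urn:uuid:{uuid.uuid4()}",
    "type": ["Announce", "coar-notify:RelationshipAction"],
    "origin": {"id": "https://partner.example.org", "type": "Service"},
    "target": {"id": "https://archive.softwareheritage.org", "type": "Service"},
    "object": {
        "as:subject": "https://doi.org/10.0000/example.paper",
        "as:relationship": "https://example.org/vocab#mentions",  # placeholder
        "as:object": "swh:1:dir:0000000000000000000000000000000000000000",
    },
}

request = urllib.request.Request(
    INBOX,
    data=json.dumps(notification).encode(),
    headers={"Content-Type": "application/ld+json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # acceptance would typically be 201/202
```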
Includes work
Implement the basic use case allowing partners to notify the archive of software mentions in scientific papers
Document it and test it with users
Add requirements for production usage (monitoring, alerting, integration tests)
KPIs
New API
Users can send software mentions in scientific publications
Users can search the archive for the software artifacts related to scientific work using simple search criteria
Institutional portal (aka OSPO Radar)#
Priority: Medium
Tags: Interfaces Work Group, Empower, OSPO-Radar
Description
Set up an Institutional Portal, a UI feature aiming to present, qualify and extract software catalogs for specific entities (institutions, administrations, …). In 2025, we will bootstrap the project, write the specifications and start implementation. We do not expect to release it this year.
Includes work
Gather key users and collect requirements
Design the specification
Implement and deploy
KPIs
Institutional portal deployed in production
Number of user institutions
Number of origins per institution
Rethink Archive UI#
Priority: Medium
Tags: Interfaces Work Group, Empower
Description
The main way to access the Software Heritage archive is the user interface exposed at https://archive.softwareheritage.org. The current interface has a few drawbacks. Some information, for instance metadata, is not easily accessible. It is also difficult to see connections between origins, for instance which origins share a given file. We want to rethink the archive UI/UX and design new features to add in the future.
Includes work
List easy and hard features to add
For hard features, describe requirements to make them accessible
Draft designs of what we would expect
Prepare a plan on how to build and release them
KPIs
List of features
Task decomposition to build them
Expose known vulnerabilities through Scanner#
Priority: Low
Tags: Interfaces Work Group, SWH-Scanner, Empower, Next
Description
Add a feature to SWH Scanner that shows known vulnerabilities (CVEs) related to scanned source code, based on CVE information collected in the Software Heritage archive.
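As a rough sketch of how the scanner could consume such an API, assuming a hypothetical endpoint that maps a SWHID to known CVEs (neither the URL nor the response shape exists yet; both are illustrative):
```python
import requests

# Hypothetical endpoint and response shape, for illustration only.
API = "https://archive.softwareheritage.org/api/1/vulnerabilities/"

def show_cves(swhid: str) -> None:
    """Print known CVEs for a scanned object, given its SWHID."""
    response = requests.get(API + swhid + "/")
    response.raise_for_status()
    for cve in response.json().get("cves", []):
        print(f"{swhid}: {cve['id']} (severity: {cve.get('severity', '?')})")

show_cves("swh:1:cnt:0000000000000000000000000000000000000000")
```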
Includes work
Design, implement and deploy an API to query CVE information
Implement a “show CVE” feature in swh-scanner
KPIs
New backend API in production
New swh-scanner version released embedding the “show CVE” feature
Review existing documentation according to identified personas#
Priority: Low
Tags: Interfaces Work Group, Empower, Next
Description
The existing documentation is fairly extensive but somewhat unfocused. There is work scheduled to come up with personas to reflect on various Software Heritage stakeholders. Once that work is done, the existing documentation should be reviewed to identify who could be interested in which parts.
Includes work
Review each piece of documentation.
Tag each page with the personas that could be interested.
Identify undocumented aspects.
Perform “low-hanging fruit” changes in the documentation.
KPIs
Pages of the documentation tagged with a set of personas.
List of areas lacking documentation.
Update of the documentation landing page to better fit the different personas.
Improve ingestion efficiency#
Priority: Medium
Tags: CodeCommons, Enlarge, Archive Work Group, Externals
Description
GitHub growth is faster than Software Heritage’s current ingestion capacities, resulting in a lag of more than 140 million origins. In order to remain an up-to-date archive once the lag has been caught up, we need to improve our ingestion efficiency and further optimize our platform.
Includes work
Measure current bottlenecks
Plan and implement solutions to these bottlenecks
KPIs
Number of ingested origins per unit of time
Support archiving repositories containing SHA1 hash conflicts#
Priority: Medium
Tags: Enlarge, Archive Work Group
Description
SHA1 is used to identify duplicated files, but this hash function is now weak and collisions can be crafted. Such colliding files are of particular interest, and we want to be able to archive them.
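For illustration, the well-known SHAttered proof-of-concept PDFs collide under SHA1 while remaining distinct under other hashes; a store that keys content only by SHA1 would silently conflate them. A minimal detection sketch:
```python
import hashlib

def digests(path: str) -> tuple[str, str]:
    """Return (sha1, sha256) hex digests of a file."""
    with open(path, "rb") as f:
        data = f.read()
    return hashlib.sha1(data).hexdigest(), hashlib.sha256(data).hexdigest()

# e.g. the two SHAttered proof-of-concept PDFs from https://shattered.io/
sha1_a, sha256_a = digests("shattered-1.pdf")
sha1_b, sha256_b = digests("shattered-2.pdf")

if sha1_a == sha1_b and sha256_a != sha256_b:
    # Same SHA1, different content: a SHA1-keyed store must disambiguate,
    # for instance by also keying on a secondary hash.
    print("SHA1 collision detected")
```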
Includes work
Archive repositories with hash conflicts in winery storage
Analyze the possibility for other object storages and implement it if possible
KPIs
Improve Object Storage#
Priority: Medium
Tags: Enlarge, Archive Work Group
Description
We believe we can improve Winery, our current object storage. Some large-scale access patterns are complicated, and ongoing studies show that we may improve the compression rate by clustering similar files together.
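As a sketch of the clustering idea (the compressor and library are illustrative choices, not a decision): grouping similar files lets a shared dictionary capture their common byte patterns, for instance with zstandard's dictionary training.
```python
import zstandard

# Suppose `cluster` holds the raw bytes of files judged similar
# (e.g. successive versions of one source file); contents are illustrative.
cluster = [b"def f(x):\n    return x + %d\n" % i for i in range(200)]

# Train a shared dictionary on the cluster, then compress each member with it.
dictionary = zstandard.train_dictionary(4096, cluster)
compressor = zstandard.ZstdCompressor(dict_data=dictionary)

plain = sum(len(c) for c in cluster)
packed = sum(len(compressor.compress(c)) for c in cluster)
print(f"compression ratio with shared dictionary: {plain / packed:.1f}x")
```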
Includes work
Follow and help studies on object storage compression
Propose and benchmark solutions for an improved object storage
Prepare a migration plan
KPIs
Benchmarks
Provide an executive-friendly monitoring of services#
Priority: Medium
Tags: Enlarge, Archive Work Group, Interfaces Work Group
Description
Provide a high-level, easy-to-find dashboard of running services with documented key indicators.
Includes work
Gather public site metrics
Publish and document a dedicated dashboard
Add links to it on common web applications (web app and docs.s.o)
KPIs
Indicators available for public sites status
Indicators for archive workers status
Indicators for archive behavior
Main dashboard that aggregates the indicators
Dashboard referenced in common web applications
GitLab crawler#
Priority: High
Tags: Archive Work Group, SWH-Sec, Enlarge
Description
A recent contribution from Software Heritage to GitLab allows fetching metadata from GitLab forges. Now that this metadata is accessible, we want to fetch it.
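A minimal sketch of listing projects from a GitLab instance through its REST API; the real crawler would live in the swh-lister framework and fetch the newly exposed metadata, and the instance URL here is a placeholder.
```python
import requests

def list_projects(instance: str):
    """Yield all visible projects of a GitLab instance, page by page."""
    page = 1
    while True:
        response = requests.get(
            f"{instance}/api/v4/projects",
            params={"per_page": 100, "page": page, "simple": True},
        )
        response.raise_for_status()
        projects = response.json()
        if not projects:
            break
        yield from projects
        page += 1

for project in list_projects("https://gitlab.example.org"):
    print(project["web_url"])
```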
Includes work
Implement new crawler
Deploy it
KPIs
Metadata coverage from GitLab forges
Handle pending loaders and listers#
Priority: Medium
Tags: Archive Work Group, Externals, Enlarge
Description
Several contributions have been made to archive content from new forges or package indexes, but they were never deployed. Review, update if required, and merge all pending loaders and listers.
Includes work
Review loaders
Decide for each whether to merge, update or discard it
Merge, update and deploy those we want to keep
KPIs
Closed merge requests
Support hash collisions globally#
Priority: Low
Tags: Archive Work Group, Enlarge, Next
Description
Several data points in the Software Heritage archive are identified by their hash, generally a SHA1. Hash collisions may happen, and we need to find a way to be resilient to them. This is similar to the archiving of repositories with hash collisions, but generalized to the whole Software Heritage archive.
Includes work
Analyze hash collision issues for all Software Heritage object types (content, directory, revisions, origins…)
Propose and implement workarounds
KPIs
Diff Service#
Priority: High
Tags: Data Work Group, Empower, SWH-Sec
Description
Implement a way to compute diffs between two revisions.
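A minimal sketch of the expected output, using Python's difflib to produce a git-like unified diff between two versions of a file; the production algorithm would operate on whole revision trees, and the snippets are illustrative.
```python
import difflib

old = "def add(a, b):\n    return a + b\n".splitlines(keepends=True)
new = (
    "def add(a, b):\n"
    "    # guard against None\n"
    "    return (a or 0) + (b or 0)\n"
).splitlines(keepends=True)

# Unified diff in the familiar git-like format, with a/ and b/ prefixes.
for line in difflib.unified_diff(old, new, fromfile="a/calc.py", tofile="b/calc.py"):
    print(line, end="")
```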
Includes work
Implement an algorithm outputting git-like diffs
Compute diff on revisions of some important repositories
Add requirements for production usage (monitoring, alerting, integration tests)
KPIs
Diff algorithm implementation
Dataset produced with it
PySpark Tooling#
Priority: Medium
Tags: Data Work Group, Next
Description
We use PySpark for some large-scale data handling. Our usage is currently not distributed, and we need to develop our tooling to be able to execute large-scale PySpark jobs on our infrastructure.
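As a sketch of what submitting a distributed job to our Kubernetes cluster could look like (the master URL, container image and executor sizing are placeholders, not decisions):
```python
from pyspark.sql import SparkSession

# Placeholder master URL and image; actual values depend on our cluster.
spark = (
    SparkSession.builder
    .appName("swh-large-scale-job")
    .master("k8s://https://kubernetes.example.org:6443")
    .config("spark.kubernetes.container.image", "example/swh-spark:latest")
    .config("spark.executor.instances", "8")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Trivial distributed computation as a smoke test.
print(spark.sparkContext.parallelize(range(10**6)).sum())
spark.stop()
```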
Includes work
Be able to run distributed PySpark jobs on our Kubernetes cluster
Access to the PySpark web UI during job execution
Metrics for PySpark jobs
History server to access metrics of finished jobs
Object storage to store job inputs, outputs, transient data…
JupyterHub
Way to use content object storage easily and efficiently in jobs
KPIs
Prepare hosting move#
Priority: High
Tags: Ops Work Group
Description
Our current hosting facility will be closed; we need to be ready to move away from it when that happens.
Includes work
Evaluate hosting solutions
Prepare a plan for the move
Study how to minimize the service interruption
Tackle logistics issues
List required investments
KPIs
Actionable plan
Advantages and disadvantages of several solutions
Documentation for mirror operators#
Priority: Medium
Tags: Ops Work Group
Description
Managing and operating a mirror is a complicated task, and helping mirror operators is time-consuming. We need to improve the documentation to give them more autonomy.
Includes work
Review each piece of documentation with mirror operators and Software Heritage Ops
Update documentation
KPIs
CodeCommons#
Unified Data Model#
Priority: High
Tags: CodeCommons, Enrich, Externals
Description
Building a unified data model to enrich the Software Heritage core data model is a keystone of the CodeCommons project. It consists of collecting metadata from many sources and storing it in a unified model, in a way that makes the data available for efficient indexing and querying. The purpose of this unified data model is to generate qualified, highly specialized datasets, filtered with a wide range of criteria.
The scope of the CodeCommons Unified Data Model includes:
Project Context data (extrinsic): data from various collaboration platforms (forges, bug trackers…)
Research articles and other context (extrinsic): structured metadata from publications and their connection to software artifacts
Code Qualification (intrinsic): code-related data, including dependencies detection, language identification and quality measurement
Licence detection (intrinsic): structured data model for licence information, at both file and project level
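To make the scope above concrete, here is a hypothetical record shape for the unified model, where every piece of metadata is attached to an archive object and keeps its provenance. All field names are illustrative, not a schema decision.
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class MetadataRecord:
    """One unit of extrinsic or intrinsic metadata in the unified model.

    All field names are hypothetical, for illustration only.
    """
    target_swhid: str       # archive object the metadata is about
    kind: str               # e.g. "project-context", "publication", "license"
    source: str             # where it was collected (forge, ScanCode run, ...)
    collected_at: datetime  # collection time, for reproducible datasets
    payload: dict           # source-specific structured data

record = MetadataRecord(
    target_swhid="swh:1:dir:0000000000000000000000000000000000000000",
    kind="license",
    source="scancode",
    collected_at=datetime(2025, 5, 15),
    payload={"spdx": "GPL-3.0-or-later", "score": 100.0},
)
print(record.kind, record.payload["spdx"])
```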
Includes work
Design architecture for the Unified Data Model
Implement and deploy the Unified Data Model components
KPIs
Project context metadata#
Priority: High
Tags: CodeCommons, Enrich, Externals
Description
This task of the CodeCommons project includes collecting context data from various collaboration platforms (forges, bug trackers…) and storing it in a unified data model. It aims at adding helpful information to qualify source code with regard to project activity, including issues, pull requests and discussions.
Among the identified collaboration platforms, GitHub context data will be stored using GHArchive.
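For illustration, GHArchive publishes one gzipped JSON-lines file of public GitHub events per hour; a minimal sketch fetching and summarizing one hour (the date is arbitrary):
```python
import gzip
import json
import urllib.request
from collections import Counter

# One hour of public GitHub events, as archived by GHArchive.
url = "https://data.gharchive.org/2025-01-01-0.json.gz"

counts = Counter()
with urllib.request.urlopen(url) as response:
    with gzip.open(response, mode="rt") as lines:
        for line in lines:
            counts[json.loads(line)["type"]] += 1

print(counts.most_common(5))  # e.g. PushEvent, IssuesEvent, ...
```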
Includes work
Design the unified data model for project context metadata, based on a benchmark of existing models like ForgeFed
Implement and deploy crawlers for project context metadata for each identified platform
Run a massive crawling and store the data in the unified data model
KPIs
List of supported collaboration platforms
Number of origins covered in the archive
License metadata#
Priority: High
Tags: CodeCommons, Enrich, Externals
Description
CodeCommons aims to detect license, copyright, and package metadata across the whole Software Heritage archive, which is critical to ensure transparency and traceability for sovereign and sustainable AI.
This will be done using ScanCode, in partnership with AboutCode, a well-reputed non-profit, public-benefit organisation with ample experience designing and architecting FOSS tools for analysing and organising software and the webs of components each software package depends on. This partnership is a great advancement for software supply chain and license compliance across the software ecosystem.
The ScanCode for CodeCommons project includes running a massive license scan on the whole Software Heritage Archive.
To ensure the efficiency and efficacy of this massive scan, this project also improves the accuracy and quality of ScanCode’s license detection.
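For reference, a single-directory license scan with ScanCode Toolkit's CLI looks roughly like the sketch below (driven from Python here for consistency; exact flags and output field names may vary across ScanCode versions, and the massive scan will need a very different harness):
```python
import json
import subprocess

# Run ScanCode's license and copyright detection on one directory and
# collect JSON output. Flags are from ScanCode Toolkit's CLI; adjust to
# the installed version.
subprocess.run(
    ["scancode", "--license", "--copyright", "--json-pp", "scan.json", "src/"],
    check=True,
)

with open("scan.json") as f:
    results = json.load(f)
for entry in results["files"]:
    for detection in entry.get("license_detections", []):
        print(entry["path"], detection["license_expression"])
```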
Includes work
Benchmark, adapt and optimize ScanCode for large scale analysis on the Software Heritage archive
Run scan at file level on the whole Software Heritage archive
Run scan at project level on relevant versions of Software Heritage origins
Assemble and store the result in a unified data model
KPIs
Number of files scanned
Number of software versions scanned
Research publications metadata#
Priority: Medium
Tags: CodeCommons, Enrich, Externals
Description
This task of the CodeCommons project aims to identify which research topics a software project relates to, by collecting metadata from research publications referenced by several platforms (e.g. HAL, OpenAlex).
The collected data will be structured in a unified data model.
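A minimal sketch of what collecting from one such platform could look like, here using the public OpenAlex REST API (the query is illustrative):
```python
import requests

# Query OpenAlex for works matching an illustrative search term.
response = requests.get(
    "https://api.openalex.org/works",
    params={"search": "software heritage archive", "per-page": 5},
)
response.raise_for_status()

for work in response.json()["results"]:
    print(work["id"], "-", work["display_name"])
```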
Includes work
Design the unified data model for publications metadata, based on a benchmark of existing models like OpenAlex
Implement and deploy crawlers for publications metadata for each identified platform
Run a massive crawling and store the data in the unified data model
KPIs
List of supported publications platforms
Number of referenced publications
Number of origins covered in the archive
Software versions metadata#
Priority: High
Tags: CodeCommons, Enrich, Externals
Description
Many references to specific software versions use the version names of software projects. The current Software Heritage model does not provide explicit and formal version identification.
The goal of this task is to add version information to the Software Heritage data model, providing relevant information adapted to various levels of granularity.
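One simple heuristic of the kind to validate: many projects tag releases with semver-like names. A sketch that normalizes such release names into comparable version tuples (the tag list is illustrative):
```python
import re

SEMVER = re.compile(r"^v?(\d+)\.(\d+)(?:\.(\d+))?$")

def parse_version(tag: str):
    """Return a comparable (major, minor, patch) tuple, or None."""
    match = SEMVER.match(tag)
    if not match:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)

tags = ["v1.2.0", "1.10", "release-2024", "v1.9.3"]
versions = sorted(v for v in (parse_version(t) for t in tags) if v)
print(versions)  # [(1, 2, 0), (1, 9, 3), (1, 10, 0)]
```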
Includes work
Identify external data sources providing accurate information
Identify and validate heuristics for identifying software versions in archive contents
Design a data model for software versions
Map software versions to objects in the archive
KPIs
Number of software projects identified
Number of versions identified
Catch up with GitHub lag#
Priority: High
Tags: CodeCommons, Enlarge, Archive Work Group, Externals
Description
GitHub growth is faster than Software Heritage’s current ingestion capacities, resulting in a lag of more than 140 million origins. In order to return to an up-to-date archive, the CodeCommons project includes using the CINES HPC infrastructure to massively clone and ingest the missing repositories.
Includes work
List the missing GitHub origins in Software Heritage archive
Implement and deploy massive ingestion tools at CINES
Clone and ingest the missing origins at CINES
Generate deduplicated datasets for retrieval in the main archive
KPIs
Number of ingested GitHub origins
Number of origins not archived
Expose full archive for large scale analysis#
Priority: High
Tags: CodeCommons, Enrich, Tooling, Data Work Group
Description
CINES’s Adastra HPC infrastructure has been made available to CodeCommons to provide the compute and storage capabilities required for CodeCommons massive data processing and additional metadata collection around Software Heritage. This item covers the prerequisite actions on the CINES HPC, which consist of depositing a full copy of the main archive (contents and graph) and deploying the tooling for large-scale archive access.
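Once SWH-Fuse exposes the archive as a filesystem, large-scale jobs can treat archived objects as ordinary files. A sketch of that access pattern, assuming the archive is mounted at a placeholder path (the layout shown, one directory per SWHID under the mount point, is an assumption):
```python
import os

# Placeholder mount point; the SWHID below is illustrative.
MOUNT = "/mnt/swh"
root_dir = os.path.join(
    MOUNT, "archive", "swh:1:dir:0000000000000000000000000000000000000000"
)

# Walk one directory subtree as if it were local files: this is the access
# pattern that large-scale analyses at CINES would rely on.
total_files = total_bytes = 0
for root, _dirs, files in os.walk(root_dir):
    for name in files:
        total_files += 1
        total_bytes += os.path.getsize(os.path.join(root, name))

print(f"{total_files} files, {total_bytes} bytes")
```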
Includes work
Copy archive contents at CINES
Copy archive compressed graph at CINES
Improve and adapt SWH-Fuse for optimized large-scale access to the archive
KPIs
Full copy of the archive available at CINES
SWH-Fuse deployed at CINES
Performance metrics for SWH-Fuse
Similarity analysis#
Priority: Low
Tags: CodeCommons, Enrich, Externals
Description
In addition to Software Heritage’s strong commitment to transparency and respect for authors in training datasets for LLMs for code (as stated more than a year ago: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/), CodeCommons plans to provide similarity detection mechanisms for generated code, in order to ensure proper attribution to the authors of the original source code. We plan to use text and syntax analysis methods for similarity, but also to evaluate machine learning approaches that may complement the results.
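As a sketch of the text-level end of the spectrum, a token-set Jaccard similarity between two code snippets; real detection would combine this with syntax-aware and possibly learned methods, and the snippets are illustrative.
```python
import re

def tokens(code: str) -> set[str]:
    """Crude lexical tokenization: identifiers and numbers."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))

def jaccard(a: str, b: str) -> float:
    """Similarity in [0, 1] between the token sets of two snippets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

original = "def mean(xs):\n    return sum(xs) / len(xs)"
generated = "def average(values):\n    return sum(values) / len(values)"
print(f"{jaccard(original, generated):.2f}")
```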
Includes work
Design and implement tools for code similarity analysis
Benchmark results from different approaches
Prepare the integration of provenance for attribution of generated code
KPIs
Documented benchmark results
Code Qualification#
Priority: Medium
Tags: CodeCommons, Enrich, Externals
Description
In order to provide qualified datasets according to multiple criteria based on code qualification, the Software Heritage archive will be enriched with metadata extracted from an in-depth analysis of the archived source code, covering the following topics:
Programming languages identification
Dependencies detection
Code quality metrics
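For the programming languages topic, off-the-shelf detectors will be benchmarked first; as one illustrative candidate (not a selection), Pygments can guess a language from a filename and its content:
```python
from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

# Pygments is one off-the-shelf candidate among those to benchmark.
samples = {
    "main.rs": 'fn main() { println!("hello"); }',
    "setup.py": "from setuptools import setup\nsetup(name='x')",
}
for filename, content in samples.items():
    try:
        print(filename, "->", guess_lexer_for_filename(filename, content).name)
    except ClassNotFound:
        print(filename, "-> unknown")
```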
Includes work
Programming languages:
Benchmark existing tools and select the most relevant ones
Run language identification analysis at scale on Software Heritage contents
Store and index the results in a unified data model
Dependencies detection
Customize ScanCode tools for scaling to Software Heritage
Run a file-level analysis on the archive contents
Run a project-level analysis on the graph (browsing project filesystems)
Store and index the results in a unified data model
Code quality metrics extraction
Identify relevant code quality metrics, possibly:
Static analysis
Code coverage
Design patterns identification
KPIs
% of the archive covered for each subject
Automate dataset generation#
Priority: Medium
Tags: CodeCommons, Enrich, Dataset factory, Data Work Group
Description
We need to produce datasets regularly and reliably, to be more efficient and to clarify which datasets users can expect. Provide tooling for automated production and publishing of derived datasets.
Includes work
Design and implement the required automation tools
Setup and configure an automation pipeline
Provide a dashboard for monitoring
Document datasets to provide a clear interface
KPIs
Number of derived datasets automatically published
Generate contents datasets#
Priority: High
Tags: CodeCommons, Enrich, Dataset factory, Data Work Group
Description
Create a tool that generates a dataset embedding file contents, based on a list of SWHIDs.
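A sketch of the tool's core loop, assuming a hypothetical `get_content()` accessor over the object storage; the accessor, the output format and the SWHIDs are all illustrative.
```python
import io
import tarfile

def get_content(sha1_git: str) -> bytes:
    """Hypothetical object storage accessor, keyed by the SWHID hash part."""
    raise NotImplementedError("stand-in for the real object storage client")

def build_dataset(swhids: list[str], output: str) -> None:
    """Write the file contents behind a list of content SWHIDs to a tarball."""
    with tarfile.open(output, "w:gz") as tar:
        for swhid in swhids:
            scheme, _version, object_type, hash_part = swhid.split(":")
            assert (scheme, object_type) == ("swh", "cnt"), swhid
            data = get_content(hash_part)
            info = tarfile.TarInfo(name=hash_part)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# build_dataset(["swh:1:cnt:..."], "dataset.tar.gz")
```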
Includes work
Enable SWHID mapping on the existing object storage (currently indexed by hash)
Design and implement a generation engine for datasets embedding contents
Benchmark and optimize performance for large-scale usage
KPIs
Performance metrics
Integrate CodeCommons in main archive#
Priority: High
Tags: CodeCommons, Enlarge, Next
Description
Most CodeCommons tools for metadata crawling and archive analysis will run on the Adastra HPC at CINES. On the one hand, the computed metadata will need to be retrieved into the main archive; on the other hand, the tools used for massive processing on the whole archive copy will need to be integrated into the standard Software Heritage ingestion pipeline, in order to keep the CodeCommons metadata up to date over the long term. This task also includes the retrieval of the GitHub lag ingestion.
Includes work
Retrieve archive core data from CINES
Retrieve unified metadata from CINES
Design architecture and infrastructure for retrieving the full archive and unified metadata
Integrate CodeCommons tools in the standard ingestion pipeline
KPIs
Main archive core data up-to-date with CINES
Main archive metadata up-to-date with CINES
Tools integrated into the ingestion pipeline
SWH-Sec#
Collect and store CVE metadata#
Priority: High
Tags: Data Work Group, SWH-Sec, Enrich
Description
Collect CVE metadata from relevant external data sources, map it to the Software Heritage data model, and link CVEs to the relevant revisions (introducing and fixing revisions).
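As a sketch of one candidate external data source (an illustrative choice, not a decision), the public OSV API can be queried by commit hash, which directly supports linking vulnerabilities to revisions; the commit hash below is a placeholder.
```python
import json
import urllib.request

def vulnerabilities_for_commit(commit: str) -> list[dict]:
    """Query the public OSV API for vulnerabilities affecting a commit."""
    request = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=json.dumps({"commit": commit}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("vulns", [])

# Placeholder commit hash; a crawler would iterate over candidate revisions.
for vuln in vulnerabilities_for_commit("6879efc2c1596d11a6a6ad296f80063b558d5e0f"):
    print(vuln["id"], vuln.get("summary", ""))
```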
Includes work
Design a data model for CVEs
Implement crawlers for CVE data sources
Store metadata
KPIs
Number of CVEs stored
Number of Objects linked to a CVE
Vulnerability Dataset extraction#
Priority: High
Tags: Data Work Group, SWH-Sec, Enrich
Description
Develop a tool that extracts the relevant introducing/fixing commits from Software Heritage for a dataset of vulnerabilities.
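A classic approach here is an SZZ-style heuristic: starting from a known fixing commit, blame the lines it removed to find candidate introducing commits. A simplified sketch using plain git (repository path and commit are placeholders; the production detection mechanism may differ):
```python
import re
import subprocess

def introducing_commits(repo: str, fix: str) -> set[str]:
    """SZZ-style heuristic: blame the lines removed by a fixing commit."""
    diff = subprocess.run(
        ["git", "-C", repo, "show", "--unified=0", "--format=", fix],
        capture_output=True, text=True, check=True,
    ).stdout
    candidates = set()
    current_file = None
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            current_file = line[6:]
        match = re.match(r"^@@ -(\d+)(?:,(\d+))? ", line)
        if match and current_file:
            start, count = int(match.group(1)), int(match.group(2) or 1)
            if count == 0:
                continue  # pure addition: nothing to blame
            blame = subprocess.run(
                ["git", "-C", repo, "blame", "--porcelain",
                 f"-L{start},{start + count - 1}", f"{fix}^", "--", current_file],
                capture_output=True, text=True, check=True,
            ).stdout
            for bline in blame.splitlines():
                if re.match(r"^[0-9a-f]{40} ", bline):
                    candidates.add(bline.split()[0])
    return candidates

# print(introducing_commits("/path/to/repo", "<fixing-commit-sha>"))
```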
Includes work
Design and implement the detection mechanisms
Generate raw datasets
Iterate with the people involved in evaluating the extracted data
KPIs
Introducing commits detection ratio
Fixing commits detection ratio
Number of CVEs supported