Roadmap 2025#

(Version 1.0, last modified 2025-05-15)

This document provides an overview of the technical roadmap of the Software Heritage initiative for the year 2025.

This year, our roadmap is focused on three main objectives:

  • Enlarge the archive: archive more content and be able to archive even more

  • Enrich the archive: add new data about the archive content

  • Empower archive users: propose new ways to access the archive to enable more usage

It is largely driven by collaborative projects: many items of this roadmap are handled by other teams involved with us in the CodeCommons and SWH-Sec projects; those items are tagged “Externals”.

Some items, tagged “Next”, are not prioritized this year but are kept here for next year, or in case other items are delivered faster than planned.

COAR Notify#

  • Priority: High

  • Tags: Interfaces Work Group, FAIR, Deposit, Enlarge

Description

Add support for the COAR Notify protocol in the SWH Archive to allow partners to notify us of new relations between software source code artifacts and external entities, especially scholarly publications and scientific papers.
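
A minimal, hypothetical sketch of the kind of notification a partner could send, loosely following the COAR Notify “Announce Relationship” pattern; the inbox URL, identifiers and relationship vocabulary are placeholders, not the archive’s actual interface:

    # Hypothetical sketch: announce a relation between a paper and an archived artifact.
    # All URLs and identifiers below are placeholders.
    import uuid

    import requests

    notification = {
        "@context": ["https://www.w3.org/ns/activitystreams", "https://coar-notify.net"],
        "id": f"urn:uuid:{uuid.uuid4()}",
        "type": ["Announce", "coar-notify:RelationshipAction"],
        "origin": {"id": "https://repository.example.org", "type": "Service"},
        "target": {"id": "https://archive.softwareheritage.org", "type": "Service"},
        "object": {
            "type": "Relationship",
            "as:subject": "https://doi.org/10.0000/placeholder",  # the scholarly publication
            "as:relationship": "https://example.org/vocab#describes",  # placeholder relation term
            "as:object": "swh:1:dir:0000000000000000000000000000000000000000",  # archived artifact
        },
    }

    requests.post("https://inbox.example.org/", json=notification, timeout=30)  # placeholder inbox URL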

Includes work

  • Implement the basic use case allowing partners to notify the archive of

    software mentions in scientific papers

  • Document it and test it with users

  • Add requirements for production usage (monitoring, alerting, integration

    tests)

KPIs

  • New API

  • Users can send software mentions in scientific publications

  • Users can search the archive for the software artifacts related to scientific

    work using simple search criteria

Institutional portal (aka OSPO Radar)#

  • Priority: Medium

  • Tags: Interfaces work group, Empower, OSPO-Radar

Description

Set up an Institutional Portal, a UI feature aiming to present, qualify and extract software catalogs for specific entities (institutions, administrations, …). In 2025, we will bootstrap the project, write the specifications and start implementation. We do not expect to release it this year.

Includes work

  • Gather key users and collect requirements

  • Design the specification

  • Implement and deploy

KPIs

  • Institutional portal deployed in production

  • Number of user institutions

  • Number of origins per institution

Rethink Archive UI#

  • Priority: Medium

  • Tags: Interfaces Work Group, Empower

Description

The main way to access the Software Heritage archive is the user interface exposed at https://archive.softwareheritage.org. The current interface has a few drawbacks. Some information, for instance metadata, is not easily accessible. It is also difficult to see connections between origins, for instance which origins share a given file. We want to rethink the archive UI/UX and design the new features we want to add in the future.

Includes work

  • List easy and hard features to add

  • For hard features, describe requirements to make them accessible

  • Draft designs of what we would expect

  • Prepare a plan on how to build and release them

KPIs

  • List of features

  • Task decomposition to build them

Expose known vulnerabilities through Scanner#

  • Priority: Low

  • Tags: Interfaces work group, SWH-Scanner, Empower, Next

Description

Add a feature to SWH Scanner that shows known vulnerabilities (CVEs) related to scanned source code, based on CVE information collected in the Software Heritage archive.
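
The backend API does not exist yet; the sketch below only illustrates the kind of query swh-scanner could issue, with a hypothetical endpoint, parameters and response shape:

    # Hypothetical sketch: endpoint name, parameters and response shape are assumptions.
    import requests

    API = "https://archive.softwareheritage.org/api/1"
    swhid = "swh:1:cnt:0000000000000000000000000000000000000000"  # placeholder for a scanned file

    resp = requests.get(f"{API}/vulnerabilities/", params={"swhid": swhid}, timeout=30)
    resp.raise_for_status()
    for cve in resp.json():
        print(cve["id"], cve.get("severity"), cve.get("summary"))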

Includes work

  • Design, implement and deploy an API to query CVE information

  • Implement a “show CVE” feature in swh-scanner

KPIs

  • New backend API in production

  • New swh-scanner version released embedding the “show CVE”

    feature

Review existing documentation according to identified personas#

  • Priority: Low

  • Tags: Interfaces Work Group, Empower, Next

Description

The existing documentation is fairly extensive but somewhat unfocused. There is work scheduled to come up with personas to reflect on various Software Heritage stakeholders. Once that work is done, the existing documentation should be reviewed to identify who could be interested in which parts.

Includes work

  • Review each piece of documentation.

  • Tag each page with the personas that could be interested.

  • Identify undocumented aspects.

  • Perform “low-hanging fruit” changes in the documentation.

KPIs

  • Pages of the documentation tagged with a set of personas.

  • List of areas lacking documentation.

  • Update of the documentation landing page to better fit the different personas.

Improve ingestion efficiency#

  • Priority: Medium

  • Tags: CodeCommons, Enlarge, Archive Work Group, Externals

Description

GitHub grows faster than Software Heritage’s current ingestion capacity, resulting in a lag of more than 140 million origins. In order to remain an up-to-date archive once this lag has been caught up, we need to improve our ingestion efficiency and further optimize our platform.

Includes work

  • Measure current bottlenecks

  • Plan and implement solutions to these bottlenecks

KPIs

  • Number of ingested origins per unit of time

Support archiving repositories containing SHA1 hash conflicts#

  • Priority: Medium

  • Tags: Enlarge, Archive Work Group

Description

SHA1 is used to deduplicate files, but this hash function is now weak and hash collisions can be crafted. Repositories containing such collisions are of particular interest, and we want to be able to archive them.
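
As a small illustration of the problem, assuming the two publicly available SHAttered PDFs (crafted to share the same SHA1) are present locally, a SHA1-keyed storage would deduplicate them into a single object even though their bytes differ:

    # Illustration only: shattered-1.pdf and shattered-2.pdf are the publicly released
    # SHA1 collision files; any other crafted collision pair would behave the same way.
    import hashlib
    from pathlib import Path

    a = Path("shattered-1.pdf").read_bytes()
    b = Path("shattered-2.pdf").read_bytes()

    print("identical bytes:", a == b)                                                      # False
    print("same sha1:", hashlib.sha1(a).hexdigest() == hashlib.sha1(b).hexdigest())        # True
    print("same sha256:", hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False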

Includes work

  • Archive repositories with hash conflicts in Winery storage

  • Analyze possibility for other object storages and implement it if

    possible

KPIs

Improve Object Storage#

  • Priority: Medium

  • Tags: Enlarge, Archive Work Group

Description

We believe we can improve Winery, our current object storage. Some large-scale access patterns are currently cumbersome, and ongoing studies show that we may improve the compression rate by clustering similar files together.
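
A minimal sketch of the clustering idea, using file extension as a crude stand-in for the similarity criterion under study and zstandard for compression; actual Winery internals and the clustering method will differ:

    # Sketch: compressing a batch of similar files as one frame usually beats
    # compressing each file independently. The corpus path is a placeholder.
    from pathlib import Path

    import zstandard as zstd

    blobs = [p.read_bytes() for p in sorted(Path("sample").rglob("*.c"))]
    cctx = zstd.ZstdCompressor(level=19)

    individual = sum(len(cctx.compress(b)) for b in blobs)
    clustered = len(cctx.compress(b"".join(blobs)))
    print(f"individually: {individual} bytes, clustered together: {clustered} bytes")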

Includes work

  • Follow and help studies on object storage compression

  • Propose and bench solutions for improved object storage

  • Prepare a migration plan

KPIs

  • Benchmarks

Provide an executive-friendly monitoring of services#

  • Priority: Medium

  • Tags: Enlarge, Archive Work Group, Interfaces Work Group

Description

Provide a high-level, easy-to-find dashboard of running services with documented key indicators.

Includes work

  • Gather public site metrics

  • Publish and document a dedicated dashboard

  • Add links to it on common web applications (web app and docs.s.o)

KPIs

  • Indicators available for public sites status

  • Indicators for archive workers status

  • Indicators for archive behavior

  • Main dashboard that aggregates the indicators

  • Dashboard referenced in common web applications

GitLab crawler#

  • Priority: High

  • Tags: Archive Work Group, SWHSec, Enlarge

Description

A recent Software Heritage contribution to GitLab allows us to fetch metadata from GitLab forges. Now that this metadata is accessible, we want to collect it.

Includes work

  • Implement new crawler

  • Deploy it

KPIs

  • Metadata coverage from GitLab forges

Handle pending loaders and listers#

  • Priority: Medium

  • Tags: Archive Work Group, Externals, Enlarge

Description

Several contributions have been made to archive content from new forges or package indexes, but they were never deployed. Review, update if required, and merge all pending loaders and listers.

Includes work

  • Review loaders

  • Decide for each whether to merge, update or discard it

  • Merge, update and deploy those we want to keep

KPIs

  • Closed merge requests

Support hash collisions globally#

  • Priority: Low

  • Tags: Archive Work Group, Enlarge, Next

Description

Several object types in the Software Heritage archive are identified by their hash, generally a SHA1. Hash collisions may happen, and we need to find a way to be resilient to them. This is similar to archiving repositories with hash conflicts, but generalized to the whole Software Heritage archive.

Includes work

  • Analyze hash collisions issues for all Software Heritage object types

    (content, directory, revisions, origins…)

  • Propose and implement workarounds

KPIs

Diff Service#

  • Priority: High

  • Tags: Data Work Group, Empower, SWH-Sec

Description

Implement a way to compute the diff between two revisions.
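
A minimal sketch of a git-like (unified) diff between two archived file contents, assuming the public /api/1/content/sha1_git:<hash>/raw/ endpoint; the hashes are placeholders, and the actual service would operate on whole revisions (directory trees) rather than single files:

    # Sketch only: placeholder hashes, single-file diff instead of a full revision diff.
    import difflib

    import requests

    API = "https://archive.softwareheritage.org/api/1"

    def fetch_lines(sha1_git: str) -> list:
        resp = requests.get(f"{API}/content/sha1_git:{sha1_git}/raw/", timeout=30)
        resp.raise_for_status()
        return resp.text.splitlines(keepends=True)

    old = fetch_lines("0" * 40)  # placeholder: the file in the old revision
    new = fetch_lines("1" * 40)  # placeholder: the file in the new revision
    print("".join(difflib.unified_diff(old, new, fromfile="a/file", tofile="b/file")))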

Includes work

  • Implement an algorithm outputting git-like diffs

  • Compute diff on revisions of some important repositories

  • Add requirements for production usage (monitoring, alerting,

    integration tests)

KPIs

  • Diff algorithm implementation

  • Dataset produced with it

PySpark Tooling#

  • Priority: Medium

  • Tags: Data Work Group, Next

Description

We use PySpark for some large-scale data handling. Our usage is currently not distributed, and we need to develop tooling to be able to execute large-scale PySpark jobs on our infrastructure.
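
A minimal sketch of what a distributed job could look like with Spark’s native Kubernetes support; the API server URL, container image and event-log bucket are placeholders, not actual infrastructure settings:

    # Sketch: all cluster-specific values below are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("swh-large-scale-job")
        .master("k8s://https://kubernetes.example.org:6443")                  # placeholder API server
        .config("spark.kubernetes.container.image", "example/spark-py:3.5")   # placeholder image
        .config("spark.executor.instances", "8")
        .config("spark.eventLog.enabled", "true")                             # consumed by the history server
        .config("spark.eventLog.dir", "s3a://example-bucket/spark-events/")   # placeholder object storage
        .getOrCreate()
    )

    # Example workload against a (hypothetical) ORC export of the origin table.
    print(spark.read.orc("s3a://example-bucket/graph/origin/").count())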

Includes work

  • Be able to run distributed PySpark jobs on our Kubernetes cluster

  • Access to the PySpark web UI during job execution

  • Metrics for PySpark jobs

  • History server to access metrics of finished jobs

  • Object storage to store job inputs, outputs, transient data…

  • JupyterHub

  • Way to use content object storage easily and efficiently in jobs

KPIs

Prepare hosting move#

  • Priority: High

  • Tags: Ops Work Group

Description

Our current hosting will be closed; we need to be ready to move away from it when that happens.

Includes work

  • Evaluate hosting solutions

  • Prepare a plan for the move

  • Study how to minimize the service interruption

  • Tackle logistics issues

  • List required investments

KPIs

  • Actionable plan

  • Advantages and disadvantages of several solutions

Documentation for mirror operators#

  • Priority: Medium

  • Tags: Ops Work Group

Description

Managing and operating a mirror is a complicated task, and helping mirror operators is time consuming. We need to improve the documentation to give them more autonomy.

Includes work

  • Review each piece of documentation with mirror operators and Software Heritage Ops

  • Update documentation

KPIs

CodeCommons#

Unified Data Model#

  • Priority: High

  • Tags: CodeCommons, Enrich, Externals

Description

Building a unified data model to enrich the Software Heritage core data model is a keystone of the CodeCommons project. It consists of collecting metadata from many sources and storing it in a unified model, in a way that makes the data available for efficient indexing and querying. The purpose of this unified data model is to generate qualified datasets, filtered with a wide range of criteria, in order to produce highly specialized datasets; a minimal sketch of such a record is shown after the scope list below.

The scope of the CodeCommons Unified Data Model includes:

  • Project Context data (extrinsic): data from various collaboration

    platforms (forges, bug trackers…)

  • Research articles and other context (extrinsic): structured metadata from publications and their connections to software artifacts

  • Code Qualification (intrinsic): code-related data, including dependency detection, language identification and quality measurement

  • License detection (intrinsic): structured data model for license information, at both file and project level
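
A purely illustrative sketch of a common envelope for such unified metadata records, not the actual CodeCommons schema, assuming only that each record ties a typed, source-attributed payload to an object of the core data model through its SWHID:

    # Illustrative only: field names and example values are assumptions.
    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import Any, Dict

    @dataclass
    class UnifiedMetadataRecord:
        subject_swhid: str       # archived object the metadata is about
        kind: str                # e.g. "project-context", "publication", "license", "code-quality"
        source: str              # platform or tool the metadata was collected from
        collected_at: datetime
        payload: Dict[str, Any]  # source-specific structured data

    record = UnifiedMetadataRecord(
        subject_swhid="swh:1:dir:0000000000000000000000000000000000000000",  # placeholder
        kind="license",
        source="scancode",
        collected_at=datetime.now(timezone.utc),
        payload={"license_expression": "gpl-3.0-or-later", "path": "COPYING"},
    )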

Includes work

  • Design architecture for the Unified Data Model

  • Implement and deploy the Unified Data Model components

KPIs

Project context metadata#

  • Priority: High

  • Tags: CodeCommons, Enrich, Externals

Description

This task of the CodeCommons project covers collecting context data from various collaboration platforms (forges, bug trackers…) and storing it in a unified data model. It aims at adding helpful information to qualify source code with regard to project activity, including issues, pull requests and discussions.

Among the identified collaboration platforms, GitHub context data will be stored using GHArchive.
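
A minimal sketch of consuming GHArchive, which publishes one gzipped JSON-lines file per hour; only a few project-context event types are kept here, and the hour is just an example:

    # Sketch: download a single example hour and filter a few event types.
    import gzip
    import io
    import json

    import requests

    HOUR_URL = "https://data.gharchive.org/2025-01-01-0.json.gz"  # one example hour
    KEEP = {"IssuesEvent", "IssueCommentEvent", "PullRequestEvent", "PullRequestReviewCommentEvent"}

    resp = requests.get(HOUR_URL, timeout=60)
    resp.raise_for_status()
    with gzip.open(io.BytesIO(resp.content), "rt", encoding="utf-8") as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("type") in KEEP:
                # e.g. map (repository, event type, timestamp) into the unified data model
                print(event["repo"]["name"], event["type"], event["created_at"])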

Includes work

  • Design the unified data model for project context metadata, based on a

    benchmark of existing models like ForgeFed

  • Implement and deploy crawlers for project context metadata for each

    identified platform

  • Run a massive crawling and store the data in the unified data model

KPIs

  • List of supported collaboration platforms

  • Number of origins covered in the archive

License metadata#

  • Priority: High

  • Tags: CodeCommons, Enrich, Externals

Description

CodeCommons aims to detect license, copyright, and package metadata across the whole Software Heritage archive, which is critical to ensure transparency and traceability for sovereign and sustainable AI.

This will be done using ScanCode, in partnership with AboutCode, a well-reputed non-profit, public benefit organisation with ample experience designing and architecting FOSS tools for analysing and organising software and the webs of components each software package depends on. This partnership is a great advancement for software supply chain and license compliance across the software ecosystem.

The ScanCode for CodeCommons project includes running a massive license scan on the whole Software Heritage Archive.

To ensure the efficiency and efficacy of this massive scan, this project also improves the accuracy and quality of ScanCode’s license detection.
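
As an illustration, a plain file-level run with today’s scancode-toolkit command line looks roughly like the sketch below; the exact flags and output layout depend on the ScanCode version, and the adapted large-scale setup will differ:

    # Sketch: a local, unoptimized ScanCode run; "path/to/source/" is a placeholder.
    import json
    import subprocess

    subprocess.run(
        ["scancode", "--license", "--copyright", "--json-pp", "scan.json", "path/to/source/"],
        check=True,
    )

    with open("scan.json") as fh:
        results = json.load(fh)
    for entry in results.get("files", []):
        for det in entry.get("license_detections", []):  # field names vary across ScanCode versions
            print(entry["path"], det.get("license_expression"))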

Includes work

  • Benchmark, adapt and optimize ScanCode for large scale analysis on

    Software Heritage archive

  • Run scan at file level on the whole Software Heritage archive

  • Run scan at project level on relevant versions of Software Heritage

    origins

  • Assemble and store the result in a unified data model

KPIs

  • Number of files scanned

  • Number of software versions scanned

Research publications metadata#

  • Priority: Medium

  • Tags: CodeCommons, Enrich, Externals

Description

This task of the CodeCommons project aims to identify which research topics a software project is related to, by collecting metadata from research publications referenced on several platforms (e.g. HAL, OpenAlex).

The collected data will be structured in a unified data model.

Includes work

  • Design the unified data model for publications metadata, based on a

    benchmark of existing models like OpenAlex

  • Implement and deploy crawlers for publications metadata for each

    identified platform

  • Run a massive crawling and store the data in the unified data model

KPIs

  • List of supported publications platforms

  • Number of referenced publications

  • Number of origins covered in the archive

Software versions metadata#

  • Priority: High

  • Tags: CodeCommons, Enrich, Externals

Description

Many references to specific software versions use the version names of software projects. The current Software Heritage data model does not provide explicit and formal version identification.

The goal of this task is to add version information to the Software Heritage data model, providing relevant information adapted to various levels of granularity.
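
A minimal sketch of one candidate heuristic for version identification, extracting version numbers from release tag names; real heuristics would also rely on per-ecosystem conventions and on the external data sources listed in the work items below:

    # Sketch: a naive tag-name heuristic, not a validated identification method.
    import re

    VERSION_TAG = re.compile(
        r"^(?:v|release[-_]?)?(\d+(?:\.\d+){0,3})(?:[-_.]?(rc\d+|alpha\d*|beta\d*))?$",
        re.IGNORECASE,
    )

    def version_from_tag(tag: str):
        m = VERSION_TAG.match(tag.strip())
        return (m.group(1), m.group(2)) if m else None

    for tag in ["v1.2.3", "release-2.0", "2.1.0-rc1", "nightly"]:
        print(tag, "->", version_from_tag(tag))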

Includes work

  • Identify external data sources providing accurate information

  • Identify and validate heuristics for software version identification in archive contents

  • Design a data model for software versions

  • Map software versions to objects in the archive

KPIs

  • Number of software projects identified

  • Number of versions identified

Catch up with GitHub lag#

  • Priority: High

  • Tags: CodeCommons, Enlarge, Archive Work Group, Externals

Description

GitHub grows faster than Software Heritage’s current ingestion capacity, resulting in a lag of more than 140 million origins. In order to return to an up-to-date archive, the CodeCommons project includes using the CINES HPC infrastructure to massively clone and ingest the missing repositories.

Includes work

  • List the missing GitHub origins in Software Heritage archive

  • Implement and deploy massive ingestion tools at CINES

  • Clone and ingest the missing origins at CINES

  • Generate deduplicated datasets for retrieval in the main archive

KPIs

  • Number of ingested GitHub origins

  • Number of origins not archived

Expose full archive for large scale analysis#

  • Priority: High

  • Tags: CodeCommons, Enrich, Tooling, Data Work Group

Description

CINES’s Adastra HPC infrastructure has been made available to CodeCommons to provide the compute and storage capabilities required for its massive data processing and additional metadata collection around Software Heritage. This item covers the prerequisite actions on the CINES HPC, which consist of depositing a full copy of the main archive (contents and graph) and deploying the tooling for large-scale archive access.

Includes work

  • Copy archive contents at CINES

  • Copy archive compressed graph at CINES

  • Improve and adapt SWH-Fuse for optimized large-scale access to the archive

KPIs

  • Full copy of the archive available at CINES

  • SWH-Fuse deployed at CINES

  • Performance metrics for SWH-Fuse

Similarity analysis#

  • Priority: Low

  • Tags: CodeCommons, Enrich, Externals

Description

In addition to Software Heritage’s strong commitment to transparency and respect for authors in training datasets for code LLMs (as stated more than a year ago: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/), CodeCommons includes providing similarity detection mechanisms for generated code, in order to ensure proper attribution to the authors of the original source code. We plan to use text and syntax analysis methods for similarity, but also to evaluate machine learning approaches that may complement the results.

Includes work

  • Design and implement tools for code Similarity analysis

  • Benchmark results from different approaches

  • Prepare the integration of provenance for attribution of generated

    code

KPIs

  • Documented benchmark results

Code Qualification#

  • Priority: Medium

  • Tags: CodeCommons, Enrich, Externals

Description

In order to provide qualified datasets according to multiple criteria based on code qualification, the Software Heritage archive will be enriched with metadata extracted from an in-depth analysis of the archived source code, covering the following topics:

  • Programming language identification

  • Dependency detection

  • Code quality metrics
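
As an illustration of the first of these topics, a file-level language guess with Pygments looks like the sketch below; Pygments is only one candidate tool to benchmark, and real runs would operate on archive contents rather than local strings:

    # Sketch: Pygments lexer guessing as one possible language-identification backend.
    from pygments.lexers import guess_lexer_for_filename
    from pygments.util import ClassNotFound

    def identify_language(filename: str, text: str) -> str:
        try:
            return guess_lexer_for_filename(filename, text).name
        except ClassNotFound:
            return "unknown"

    print(identify_language("example.py", "def main():\n    pass\n"))  # -> "Python"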

Includes work

  • Programming languages:

    • Benchmark existing tools and select the most relevant ones

    • Run language identification analysis at scale on Software Heritage

      contents

    • Store and index the results in a unified data model

  • Dependencies detection

    • Customize ScanCode tools for scaling to Software Heritage

    • Run a file-level analysis on the archive contents

    • Run a project-level analysis on the graph (browsing project filesystems)

    • Store and index the results in a unified data model

  • Code quality metrics extraction

    • Identify relevant code quality metrics, possibly:

      • Static analysis

      • Code coverage

      • Design patterns identification

KPIs

  • % of the archive covered for each subject

Automate dataset generation#

  • Priority: Medium

  • Tags: CodeCommons, Enrich, Dataset factory, Data work group

Description

We need to produce datasets regularly and reliably, both to be more efficient and to clarify which datasets users can expect. Provide tooling for automated production and publishing of derived datasets.

Includes work

  • Design and implement the required automation tools

  • Setup and configure an automation pipeline

  • Provide a dashboard for monitoring

  • Document datasets for clear interface

KPIs

  • Number of derived datasets automatically published

Generate contents datasets#

  • Priority: High

  • Tags: CodeCommons, Enrich, Dataset factory, Data work group

Description

Create a tool that generates a dataset embedding file contents, based on a list of SWHIDs.
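
A hedged sketch of the core loop: the SWHID of a content carries its sha1_git, while the current object storage is keyed by sha1, hence the mapping step. `sha1_from_sha1_git` and `objstorage` below are hypothetical placeholders for that mapping and for the storage client:

    # Sketch: both the mapping function and the object storage client are hypothetical.
    from pathlib import Path

    def parse_content_swhid(swhid: str) -> str:
        """Extract the sha1_git hex digest from a core content SWHID (swh:1:cnt:<hex>)."""
        scheme, version, object_type, object_id = swhid.split(":")
        assert (scheme, version, object_type) == ("swh", "1", "cnt")
        return object_id

    def export_contents(swhids, objstorage, sha1_from_sha1_git, out_dir="dataset"):
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        for swhid in swhids:
            sha1 = sha1_from_sha1_git(parse_content_swhid(swhid))  # hypothetical SWHID -> sha1 mapping
            (out / swhid.replace(":", "_")).write_bytes(objstorage.get(sha1))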

Includes work

  • Enable SWHID mapping on existing object storage (currently indexed by

    hash)

  • Design and implement a generation engine for datasets embedding

    contents

  • Benchmark and optimize performance for large-scale usage

KPIs

  • Performance metrics

Integrate CodeCommons in main archive#

  • Priority: High

  • Tags: CodeCommons, Enlarge, Next

Description

Most CodeCommons tools for metadata crawling and archive analysis will run on the Adastra HPC at CINES. On the one hand, the computed metadata will need to be retrieved into the main archive; on the other hand, the tools used for massive processing on the whole archive copy will need to be integrated into the standard Software Heritage ingestion pipeline, in order to keep the CodeCommons metadata up to date over the long term. This task also includes retrieving the results of the GitHub lag catch-up ingestion.

Includes work

  • Retrieve archive core data from CINES

  • Retrieve unified metadata from CINES

  • Design architecture and infrastructure for retrieving full archive

    and unified metadata

  • Integrate CodeCommons tools in the standard ingestion pipeline

KPIs

  • Main archive core data up-to-date with CINES

  • Main archive metadata up-to-date with CINES

  • Tools integrated to the ingestion pipeline

SWHSec#

Collect and store CVE metadata#

  • Priority: High

  • Tags: Data work group, SWHSec, Enrich

Description

Collect CVE metadata from relevant external data sources, map it to the Software Heritage data model, and link CVEs to the relevant revisions (introducing and fixing revisions).
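
As an illustration, one possible external source is the public OSV API, which can be queried by commit hash and returns vulnerability records with CVE aliases; the commit below is a placeholder, and OSV is only one candidate source among others (NVD, GHSA, …):

    # Sketch: query OSV for vulnerabilities affecting a given commit (placeholder hash).
    import requests

    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"commit": "0000000000000000000000000000000000000000"},
        timeout=30,
    )
    resp.raise_for_status()
    for vuln in resp.json().get("vulns", []):
        cves = [alias for alias in vuln.get("aliases", []) if alias.startswith("CVE-")]
        print(vuln["id"], cves, vuln.get("summary", ""))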

Includes work

  • Design a data model for CVEs

  • Implement crawlers for CVE data sources

  • Store metadata

KPIs

  • Number of CVEs stored

  • Number of Objects linked to a CVE

Vulnerability Dataset extraction#

  • Priority: High

  • Tags: Data work group, SWHSec, Enrich

Description

Develop a tool that extracts the relevant introducing/fixing commits from Software Heritage for a dataset of vulnerabilities.
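
One common heuristic for the introducing side is SZZ-style blaming: take the lines deleted by the fixing commit and blame them in its parent. A minimal sketch over a local git clone, assuming the fixing commit is already known, is shown below; the production tool would work on archived revisions instead:

    # Sketch: naive SZZ-style search for candidate introducing commits in a local clone.
    import re
    import subprocess

    def deleted_lines(repo: str, fix_commit: str, path: str):
        """Line numbers (in the parent) deleted by the fixing commit for one file."""
        diff = subprocess.run(
            ["git", "-C", repo, "diff", "-U0", f"{fix_commit}^", fix_commit, "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        for m in re.finditer(r"^@@ -(\d+)(?:,(\d+))? \+", diff, re.M):
            start, count = int(m.group(1)), int(m.group(2) or "1")
            yield from range(start, start + count)

    def candidate_introducing_commits(repo: str, fix_commit: str, path: str):
        """Blame each deleted line in the parent commit and collect the blamed commits."""
        candidates = set()
        for line in deleted_lines(repo, fix_commit, path):
            blame = subprocess.run(
                ["git", "-C", repo, "blame", "--porcelain", "-L", f"{line},{line}",
                 f"{fix_commit}^", "--", path],
                capture_output=True, text=True, check=True,
            ).stdout
            candidates.add(blame.split()[0])  # first token of porcelain output is the commit id
        return candidates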

Includes work

  • Design and implement the detection mechanisms

  • Generate raw datasets

  • Iterate with people involved in the extracted data evaluation

KPIs

  • Introducing commits detection ratio

  • Fixing commits detection ratio

  • Number of CVEs supported