Creating a training set for ML competitions #13

@ccerv1

OSO Funding-Based Training Dataset

This dataset contains pairwise comparisons between open source projects, where the weights are derived from their relative funding amounts. Here's how it was created:

Data Sources

  • Project funding data from OSO's BigQuery tables (oss_funding_v0)
  • Repository information from OSO's BigQuery tables (repositories_v0)
  • Dependency graph containing repository URLs (unweighted_graph.json)
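The schema of unweighted_graph.json is not described here, so the sketch below avoids assuming one: it walks whatever JSON structure it is given and keeps every string that looks like a GitHub URL. The helper name collect_repo_urls and the sample graph are hypothetical, included only to illustrate the idea.

```python
import json


def collect_repo_urls(node, prefix="https://github.com/"):
    """Recursively collect every string in a parsed JSON value that starts with `prefix`."""
    urls = set()
    if isinstance(node, str):
        if node.startswith(prefix):
            urls.add(node)
    elif isinstance(node, dict):
        # URLs may appear as keys (adjacency maps) or as values (node/link lists).
        for key, value in node.items():
            urls |= collect_repo_urls(key, prefix)
            urls |= collect_repo_urls(value, prefix)
    elif isinstance(node, list):
        for item in node:
            urls |= collect_repo_urls(item, prefix)
    return urls


# Hypothetical stand-in for the contents of unweighted_graph.json.
sample_graph = json.loads("""
{
  "nodes": ["https://github.com/org/a", "https://github.com/org/b"],
  "links": [{"source": "https://github.com/org/a", "target": "https://github.com/org/b"}]
}
""")
repo_urls = sorted(collect_repo_urls(sample_graph))
print(repo_urls)
```

With the real file, the same function could be applied to the result of json.load on the opened file.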

Process

  1. Graph Loading:

    • Loaded dependency graph from JSON
    • Extracted all repository URLs for analysis
  2. Funding Data Collection: For each project, we collected:

    • Quarterly funding amounts
    • Funder and grant pool information
    • Project names and IDs
    • Associated GitHub repository URLs
  3. Comparison Generation: For each funding round (defined by funder + quarter):

    • Found all projects that received funding (minimum 2 projects per round)
    • Generated all possible pairs using itertools.combinations
    • Calculated relative weights:
    weight_a = amount_a / (amount_a + amount_b)
    weight_b = 1 - weight_a  # Ensures weights sum to 1.0
  4. Deduplication:

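    • Each pair is stored in a canonical order (URLs sorted alphabetically), so (A, B) and (B, A) map to the same row
    • When the same pair appears in multiple rounds, its weights are averaged across those rounds
    • The averaged weights still sum to 1.0, because each per-round pair does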
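Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function names, the round keys, and the sample amounts are hypothetical, and skipping pairs whose combined funding is zero is an added guard that the description above does not mention.

```python
from collections import defaultdict
from itertools import combinations


def round_comparisons(amounts):
    """Yield (project_a, project_b, weight_a, weight_b) for one funding round.

    `amounts` maps repository URL -> funding amount (USD) in that round.
    """
    if len(amounts) < 2:
        return  # a round needs at least 2 projects to produce a pair
    # Sorting first means every pair comes out in alphabetical URL order.
    for url_a, url_b in combinations(sorted(amounts), 2):
        total = amounts[url_a] + amounts[url_b]
        if total <= 0:
            continue  # assumption: skip pairs with no funding to avoid dividing by zero
        weight_a = amounts[url_a] / total
        yield url_a, url_b, weight_a, 1 - weight_a


def aggregate(rounds):
    """Average weight_a per pair across rounds; weight_b is derived as 1 - weight_a."""
    seen = defaultdict(list)
    for amounts in rounds.values():
        for url_a, url_b, weight_a, _ in round_comparisons(amounts):
            seen[(url_a, url_b)].append(weight_a)
    result = {}
    for pair, weights in seen.items():
        mean_a = sum(weights) / len(weights)
        result[pair] = (mean_a, 1 - mean_a)
    return result


A = "https://github.com/projectA"
B = "https://github.com/projectB"
C = "https://github.com/projectC"

# Hypothetical rounds, keyed by (funder, quarter).
rounds = {
    ("funder_x", "2024-Q1"): {A: 75, B: 25},
    ("funder_y", "2024-Q2"): {A: 25, B: 75, C: 100},
}
final = aggregate(rounds)
for (url_a, url_b), (w_a, w_b) in sorted(final.items()):
    print(url_a, url_b, round(w_a, 4), round(w_b, 4))
```

In this sample, A and B meet in both rounds with weight_a of 0.75 and 0.25, so their averaged weight_a is 0.5. Averaging only weight_a and deriving weight_b from it keeps each row summing to 1.0.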

Output Files

The process generates several CSV files:

  1. funding-data.csv: Raw funding data
  2. training-data-preagg.csv: All pairwise comparisons before deduplication
  3. training-data.csv: Final deduplicated pairwise comparisons (the file used for the competition)
  4. training-data-by-dependent-node.csv: Filtered comparisons for projects sharing dependencies

Data Format

The final deduplicated CSV contains:

  • project_a: GitHub repository URL
  • project_b: GitHub repository URL
  • weight_a: Average relative funding weight for project_a
  • weight_b: Average relative funding weight for project_b
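A quick sanity check of a file in this format might look like the following. It is a sketch only: check_rows is a hypothetical helper, the sample rows are invented for illustration, and the tolerance of 1e-6 is an assumption about how the weights were rounded when written.

```python
import csv
import io


def check_rows(csv_text, tol=1e-6):
    """Return (line_number, problem) tuples for a training-data style CSV."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        weight_a = float(row["weight_a"])
        weight_b = float(row["weight_b"])
        if abs(weight_a + weight_b - 1.0) > tol:
            problems.append((line_no, "weights do not sum to 1.0"))
        if row["project_a"] >= row["project_b"]:
            problems.append((line_no, "URLs are not in alphabetical order"))
    return problems


# Invented sample: the first data row is valid, the second breaks both rules.
sample = (
    "project_a,project_b,weight_a,weight_b\n"
    "https://github.com/projectA,https://github.com/projectB,0.75,0.25\n"
    "https://github.com/projectB,https://github.com/projectA,0.6,0.3\n"
)
problems = check_rows(sample)
print(problems)
```

To check the real file, the text of training-data.csv could be read from disk and passed to the same function.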

Key Assumptions

  • Projects are identified primarily by their GitHub repository URLs
  • Only rounds with 2+ projects generate comparisons
  • All funding amounts are in USD
  • No time-based weighting within quarters
  • For projects with multiple repositories, we use the one with the most stars
  • Weights are relative within each funding round before averaging

Example

If Project A received $75 and Project B received $25 in a funding round:

{
    "project_a": "https://github.com/projectA",
    "project_b": "https://github.com/projectB",
    "weight_a": 0.75,  # (75/100)
    "weight_b": 0.25   # (25/100)
}

Note: The notebook can also filter comparisons down to projects that share dependencies in the graph; the filtered result is written to training-data-by-dependent-node.csv.
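The exact filtering rule is not spelled out here. One plausible reading, suggested by the file name, is to keep a comparison only when both projects are dependencies of the same dependent node, and to emit one row per such node. The sketch below implements that reading; the edge format (dependent, dependency), the function name, and the sample data are all assumptions.

```python
from collections import defaultdict


def pairs_by_dependent(edges, comparisons):
    """List comparisons whose two projects are both dependencies of one node.

    `edges` is an iterable of (dependent_url, dependency_url) tuples.
    `comparisons` maps (url_a, url_b) -> (weight_a, weight_b).
    Returns rows of (dependent, url_a, url_b, weight_a, weight_b).
    """
    deps = defaultdict(set)
    for dependent, dependency in edges:
        deps[dependent].add(dependency)

    rows = []
    for dependent in sorted(deps):
        for (url_a, url_b), (w_a, w_b) in sorted(comparisons.items()):
            if url_a in deps[dependent] and url_b in deps[dependent]:
                rows.append((dependent, url_a, url_b, w_a, w_b))
    return rows


A = "https://github.com/projectA"
B = "https://github.com/projectB"

# Invented graph: app/x depends on both A and B, app/y only on A.
edges = [
    ("https://github.com/app/x", A),
    ("https://github.com/app/x", B),
    ("https://github.com/app/y", A),
]
comparisons = {(A, B): (0.75, 0.25)}
filtered = pairs_by_dependent(edges, comparisons)
print(filtered)
```

Only app/x contributes a row here, since it is the only node that depends on both projects in the pair.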
