py-gen-ml

A library for generating machine learning code from protobuf schemas.

🌟 Project Introduction

py-gen-ml simplifies the configuration and management of machine learning projects. It uses Protocol Buffers (protobufs) to define configuration schemas in a robust, strongly typed, and extensible way. The protobuf schema is the single source of truth from which typed config models, patch and sweep configs, CLI parsers, and JSON schemas ✨ are generated automatically ✨.

✨ Brief Overview

A quick overview of what you can do with py-gen-ml:

  • Define protos
    // Multi-layer perceptron configuration
    message MLPQuickstart {
        option (pgml.cli).enable = true;
        // Number of layers
        int64 num_layers = 1;
        // Number of units
        int64 num_units = 2;
        // Activation function
        string activation = 3;
    }
    
  • Generated Base Model
    class MLPQuickstart(pgml.YamlBaseModel):
        """Multi-layer perceptron configuration"""
    
        num_layers: int
        """Number of layers"""
    
        num_units: int
        """Number of units"""
    
        activation: str
        """Activation function"""
    
  • Generated Patch Config
    class MLPQuickstartPatch(pgml.YamlBaseModel):
        """Multi-layer perceptron configuration"""
    
        num_layers: typing.Optional[int] = None
        """Number of layers"""
    
        num_units: typing.Optional[int] = None
        """Number of units"""
    
        activation: typing.Optional[str] = None
        """Activation function"""
    

  • Generated Sweep Config
    class MLPQuickstartSweep(pgml.Sweeper[patch.MLPQuickstartPatch]):
        """Multi-layer perceptron configuration"""
    
        num_layers: typing.Optional[pgml.IntSweep] = None
        """Number of layers"""
    
        num_units: typing.Optional[pgml.IntSweep] = None
        """Number of units"""
    
        activation: typing.Optional[pgml.StrSweep] = None
        """Activation function"""
    

  • Generated CLI Parser
    class MLPQuickstartArgs(pgml.YamlBaseModel):
        """Multi-layer perceptron configuration"""
    
        num_layers: typing.Annotated[
            typing.Optional[int],
            typer.Option(help="Number of layers. Maps to 'num_layers'"),
            pydantic.Field(None),
            pgml.ArgRef("num_layers"),
        ]
        """Number of layers"""
    
        num_units: typing.Annotated[
            typing.Optional[int],
            typer.Option(help="Number of units. Maps to 'num_units'"),
            pydantic.Field(None),
            pgml.ArgRef("num_units"),
        ]
        """Number of units"""
    # Remaining code...
    
  • Generated Entrypoint
    @pgml.pgml_cmd(app=app)
    def main(
        config_paths: typing.List[str] = typer.Option(..., help="Paths to config files"),
        sweep_paths: typing.List[str] = typer.Option(
            default_factory=list,
            help="Paths to sweep files"
        ),
        cli_args: cli_args.MLPQuickstartArgs = typer.Option(...),
    ) -> None:
        mlp_quickstart = base.MLPQuickstart.from_yaml_files(config_paths)
        mlp_quickstart = mlp_quickstart.apply_cli_args(cli_args)
        if len(sweep_paths) == 0:
            run_trial(mlp_quickstart)
            return
        # Remaining code....
    
  • Flexible YAML Config
    # base.yaml
    layers:
    - num_units: 100
      activation: "#/_defs_/activation"
    - num_units: 50
      activation: "#/_defs_/activation"
    optimizer:
      type: adamw
      learning_rate: 1e-4
      schedule: '!cosine_schedule.yaml'
    _defs_:
      activation: relu
    
    # cosine_schedule.yaml
    min_lr: 1e-5
    max_lr: 1e-3
    
  • Flexible YAML sweeps
    layers:
    - num_units:  # Sample from a list
      - 100
      - 50
      activation: "#/_defs_/activation"
    - num_units:  # Sample from a range
        low: 10
        high: 100
        step: 10
      activation: "#/_defs_/activation"
    _defs_:
      activation: relu
    
  • Instant YAML validation with JSON schemas
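
To make the idea behind the `#/...` references in the YAML examples concrete, here is a minimal sketch of how such a reference could be resolved against an already-parsed document. It is not py-gen-ml's implementation: the `lookup` and `resolve` helpers are hypothetical, the pointer is assumed to name keys from the document root (JSON-pointer style), and cross-file `!file.yaml` references and cyclic references are not handled.

```python
from typing import Any


def lookup(root: dict, pointer: str) -> Any:
    """Follow a '#/a/b' style pointer from the document root (assumed semantics)."""
    node: Any = root
    for part in pointer[2:].split("/"):
        # List items are addressed by integer index, mappings by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def resolve(node: Any, root: dict) -> Any:
    """Return a copy of `node` with every '#/...' string replaced by its target."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, root) for item in node]
    if isinstance(node, str) and node.startswith("#/"):
        # Resolve the target too, so a reference may point at another reference.
        return resolve(lookup(root, node), root)
    return node


# The parsed form of a small config, written as a plain dict for illustration.
config = {
    "layers": [
        {"num_units": 100, "activation": "#/_defs_/activation"},
        {"num_units": 50, "activation": "#/_defs_/activation"},
    ],
    "_defs_": {"activation": "relu"},
}
resolved = resolve(config, config)
```

Defining the activation once and pointing both layers at it means a single edit changes every layer, which is the kind of change amplification the references are meant to avoid.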

🔑 Key Features

📌 Single Source of Truth:

  • The Protobuf schema provides a centralized definition for your configurations.

🔧 Flexible Configuration Management:

  • Minimal Change Amplification: Automatically generated code reduces cascading manual changes when modifying configurations.
  • Flexible Patching: Easily modify base configurations with patches for quick experimentation.
  • Flexible YAML: Use human-readable YAML with support for advanced references within and across files.
  • Hyperparameter Sweeps: Effortlessly define and manage hyperparameter tuning.
  • CLI Argument Parsing: Automatically generate command-line interfaces from your configuration schemas.
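
The patching idea can be illustrated without the library. The sketch below uses plain dataclasses rather than the generated `pgml.YamlBaseModel` classes, and `apply_patch` is a hypothetical helper, not a py-gen-ml function. It assumes that a patch field left at `None` means "keep the base value", mirroring the all-optional fields of the generated patch class.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class MLP:
    """Stand-in for a generated base model."""
    num_layers: int
    num_units: int
    activation: str


@dataclass(frozen=True)
class MLPPatch:
    """Stand-in for a generated patch model: every field is optional."""
    num_layers: Optional[int] = None
    num_units: Optional[int] = None
    activation: Optional[str] = None


def apply_patch(base: MLP, patch: MLPPatch) -> MLP:
    """Return a copy of `base` with every non-None patch field applied."""
    overrides = {
        f.name: getattr(patch, f.name)
        for f in fields(patch)
        if getattr(patch, f.name) is not None
    }
    return replace(base, **overrides)


base = MLP(num_layers=2, num_units=100, activation="relu")
patched = apply_patch(base, MLPPatch(num_units=50))
```

Because the patch only names the fields it changes, an experiment file stays short and the base config remains untouched.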

✅ Validation and Type Safety:

  • JSON Schema Generation: Easily validate your YAML content as you type.
  • Strong Typing: The generated code is strongly typed, which helps you, your IDE, the type checker, and your team understand the codebase and build more robust ML code.

🚦 Getting Started

To start using py-gen-ml, you can install it via pip:

pip install py-gen-ml

For a quick example of how to use py-gen-ml in your project, check out our Quick Start Guide.

💡 Motivation

Machine learning projects often involve complex configurations with many interdependent parameters. Changing one config (e.g., the dataset) might require adjusting several other parameters for optimal performance. Traditional approaches to organizing configs can become unwieldy and tightly coupled with code, making changes difficult.

py-gen-ml addresses these challenges by:

  1. 📊 Providing a single, strongly-typed schema definition for configurations.
  2. 🔄 Generating code to manage configuration changes automatically.
  3. 📝 Offering flexible YAML configurations with advanced referencing and variable support.
  4. 🛠️ Generating JSON schemas for real-time YAML validation.
  5. 🔌 Seamlessly integrating into your workflow with multiple experiment running options:
    • Single experiments with specific config values
    • Base config patching
    • Parameter sweeps via YAML files validated against a JSON schema
    • Quick value overrides via a generated CLI parser
    • Arbitrary combinations of the above options

This approach results in more robust ML code, leveraging strong typing and IDE support while avoiding the burden of change amplification in complex configuration structures.
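
As a rough illustration of how a sweep definition turns into individual trials, the sketch below expands list and range specifications into a full grid of overrides. This is an assumption-laden sketch, not py-gen-ml's sweeper: `candidates` and `grid` are hypothetical helpers, the `high` bound is treated as inclusive, and an exhaustive grid is used here even though the library may sample values differently.

```python
import itertools
from typing import Any, Dict, Iterator, List, Union

# A sweep spec is either an explicit list of values or a low/high/step range.
SweepSpec = Union[List[Any], Dict[str, int]]


def candidates(spec: SweepSpec) -> List[Any]:
    """Expand one sweep spec into the list of values it covers."""
    if isinstance(spec, list):
        return list(spec)
    # Assumption: `high` is inclusive and `step` defaults to 1.
    return list(range(spec["low"], spec["high"] + 1, spec.get("step", 1)))


def grid(sweep: Dict[str, SweepSpec]) -> Iterator[Dict[str, Any]]:
    """Yield one override dict per combination of candidate values."""
    names = list(sweep)
    for combo in itertools.product(*(candidates(sweep[name]) for name in names)):
        yield dict(zip(names, combo))


sweep = {
    "num_layers": [2, 4],
    "num_units": {"low": 10, "high": 30, "step": 10},
}
trials = list(grid(sweep))
```

Each resulting dict can be treated like a patch on top of a base config, which is how sweeps, patches, and single runs can share one mechanism.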

🎯 When to use py-gen-ml

Consider using py-gen-ml when you need to:

  • 📈 Manage complex ML projects more efficiently
  • 🔬 Streamline experiment running and hyperparameter tuning
  • 🛡️ Reduce the impact of configuration changes on your workflow
  • 💻 Leverage type safety and IDE support in your ML workflows

📚 Where to go from here