🚀 Quick Start Guide
🌟 Introduction
py-gen-ml leverages protobufs to define the schema for your configuration. It uses this language-agnostic schema to generate code and JSON schemas from the protobuf definitions, creating a robust and versatile configuration system for machine learning projects.
Note
While py-gen-ml currently doesn't fully utilize the language-neutral or platform-neutral features of protobuf, these capabilities are available for future expansion. If you're new to protobufs, you can learn more about them here.
📝 Defining Your Protobuf
To create a protobuf schema, you'll need to write a `.proto` file. This file contains the definition of the data structure you want to use in your configuration. The protobuf counterpart of a data object is called a *message*. Most of the generated files we'll see later contain one class per message in the protobuf file.
Here's a simple example of a protobuf definition:
```proto
// quickstart_a.proto
syntax = "proto3";

package example;

// Multi-layer perceptron configuration
message MLPQuickstart {
    // Number of layers
    int64 num_layers = 1;
    // Number of units
    int64 num_units = 2;
    // Activation function
    string activation = 3;
}
```
🛠️ Generating Configuration Utilities
With your protobuf defined, you can now ✨ generate ✨ configuration objects with the `py-gen-ml` command.
By default, the generated code will be written to `src/pgml_out`. To customize this and explore other options, check out the py-gen-ml command documentation. The command will generate the following files:
- `quickstart_a_base.py`
- `quickstart_a_patch.py`
- `quickstart_a_sweep.py`
Let's dive into the details of each file.
🧩 Generated Code
📊 Generated Base Model
One of the files generated is a Pydantic model for your main configuration.
```python
# Autogenerated code. DO NOT EDIT.
import py_gen_ml as pgml


class MLPQuickstart(pgml.YamlBaseModel):
    """Multi-layer perceptron configuration"""

    num_layers: int
    """Number of layers"""
    num_units: int
    """Number of units"""
    activation: str
    """Activation function"""
```
Use this file to load and validate configuration files written in YAML format. As you can see, it inherits from `pgml.YamlBaseModel`, a convenience base class that provides methods for loading configurations from YAML files.

For instance, the following YAML file will be validated against the schema defined in `quickstart_a_base.py`:
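A minimal `example.yaml` that satisfies the generated schema could look like this (the field values are illustrative; only the field names and types come from the protobuf definition):

```yaml
# example.yaml (illustrative values)
num_layers: 3
num_units: 128
activation: relu
```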
You can load the configuration like so:
```python
# example.py
from pgml_out.quickstart_a_base import MLPQuickstart

config = MLPQuickstart.from_yaml_file("example.yaml")
```
🔧 Generated Patch
```python
# Autogenerated code. DO NOT EDIT.
import typing

import py_gen_ml as pgml


class MLPQuickstartPatch(pgml.YamlBaseModel):
    """Multi-layer perceptron configuration"""

    num_layers: typing.Optional[int] = None
    """Number of layers"""
    num_units: typing.Optional[int] = None
    """Number of units"""
    activation: typing.Optional[str] = None
    """Activation function"""
```
This file defines a Pydantic model for your patch configuration. All fields are optional. This allows you to express experiments in terms of changes with respect to a base configuration.
Consequently:
- Changes are small and additive
- You can easily compose multiple patches together
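To illustrate the idea, here is a stdlib-only sketch of the patch semantics (this is a conceptual illustration, not py-gen-ml's actual implementation): merging keeps the base value wherever the patch leaves a field unset.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class MLPConfig:
    num_layers: int
    num_units: int
    activation: str


@dataclass
class MLPPatch:
    # All fields optional: only set fields override the base.
    num_layers: Optional[int] = None
    num_units: Optional[int] = None
    activation: Optional[str] = None


def merge(base: MLPConfig, patch: MLPPatch) -> MLPConfig:
    # Collect only the fields the patch actually sets.
    overrides = {k: v for k, v in vars(patch).items() if v is not None}
    return replace(base, **overrides)


base = MLPConfig(num_layers=3, num_units=128, activation="relu")
patched = merge(base, MLPPatch(num_units=256))
# patched == MLPConfig(num_layers=3, num_units=256, activation='relu')
```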
You can load a base configuration and apply patches using the `.from_yaml_files` method. This method is automatically inherited from `pgml.YamlBaseModel`:
```python
# example.py
from pgml_out.quickstart_a_base import MLPQuickstart
from pgml_out.quickstart_a_patch import MLPQuickstartPatch

config_with_patches = MLPQuickstart.from_yaml_files(
    ["example.yaml", "example_patch.yaml"]
)
```
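For reference, an `example_patch.yaml` could override a single field while leaving the rest of the base configuration untouched (the value here is illustrative):

```yaml
# example_patch.yaml (illustrative)
num_units: 256
```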
🔍 Generated Sweep Configuration
Upon running the command, you'll also get a `quickstart_a_sweep.py` file:
```python
import typing

import py_gen_ml as pgml

from . import quickstart_a_patch as patch
from . import quickstart_a_base as base


class MLPQuickstartSweep(pgml.Sweeper[patch.MLPQuickstartPatch]):
    """Multi-layer perceptron configuration"""

    num_layers: typing.Optional[pgml.IntSweep] = None
    """Number of layers"""
    num_units: typing.Optional[pgml.IntSweep] = None
    """Number of units"""
    activation: typing.Optional[pgml.StrSweep] = None
    """Activation function"""


MLPQuickstartSweepField = typing.Union[
    MLPQuickstartSweep,
    pgml.NestedChoice[MLPQuickstartSweep, patch.MLPQuickstartPatch],  # type: ignore
]
```
This file defines a `pgml.Sweeper` for your configuration, enabling you to sweep over the values of your configuration. py_gen_ml comes with tooling to traverse the config and construct a search space for your trials. Currently, it supports Optuna, but we'll add more frameworks in the future.
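Conceptually, a sweep maps fields to candidate values, and each trial samples one candidate per field to build a patch. The following hand-rolled sketch illustrates that idea only; it is not the actual `pgml.Sweeper` machinery, and the candidate values are assumptions:

```python
import random
from typing import Any, Dict, List

# A toy "sweep spec": each field lists its candidate values.
sweep_spec: Dict[str, List[Any]] = {
    "num_units": [64, 128, 256],
    "activation": ["relu", "tanh"],
}


def sample_patch(spec: Dict[str, List[Any]], rng: random.Random) -> Dict[str, Any]:
    # Draw one candidate per field; the result plays the role of a patch.
    return {field: rng.choice(candidates) for field, candidates in spec.items()}


rng = random.Random(0)
patch = sample_patch(sweep_spec, rng)
```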
Here's an example YAML file that will be validated against the schema in `quickstart_a_sweep.py`:
To run a hyperparameter sweep, you can use the `pgml.OptunaSampler`:

```python
# example.py
import optuna

import py_gen_ml as pgml
from pgml_out.quickstart_a_base import MLPQuickstart
from pgml_out.quickstart_a_sweep import MLPQuickstartSweep


def train_model(config: MLPQuickstart) -> float:
    """Train a model and return the accuracy"""
    ...


if __name__ == "__main__":
    config = MLPQuickstart.from_yaml_file("example.yaml")
    sweep = MLPQuickstartSweep.from_yaml_file("example_sweep.yaml")

    def objective(trial: optuna.Trial) -> float:
        sampler = pgml.OptunaSampler(trial=trial)
        patch = sampler.sample(sweep)
        accuracy = train_model(config.merge(patch))
        return accuracy

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
```
🪄 Generating a Command Line Interface
To generate a command line interface, you'll need to enable the `(pgml.cli)` option in your protobuf message, like so:
```proto
// quickstart_b.proto
syntax = "proto3";

package example;

import "py_gen_ml/extensions.proto";

// Multi-layer perceptron configuration
message MLPQuickstart {
    option (pgml.cli).enable = true;

    // Number of layers
    int64 num_layers = 1;
    // Number of units
    int64 num_units = 2;
    // Activation function
    string activation = 3;
}
```
When running `py-gen-ml`, you'll now get a `quickstart_b_cli_args.py` file:
💻 Generated CLI
```python
# Autogenerated code. DO NOT EDIT.
import py_gen_ml as pgml
import typing
import pydantic
import typer

from . import quickstart_b_base as base


class MLPQuickstartArgs(pgml.YamlBaseModel):
    """Multi-layer perceptron configuration"""

    num_layers: typing.Annotated[
        typing.Optional[int],
        typer.Option(help="Number of layers. Maps to 'num_layers'"),
        pydantic.Field(None),
        pgml.ArgRef("num_layers"),
    ]
    """Number of layers"""
    num_units: typing.Annotated[
        typing.Optional[int],
        typer.Option(help="Number of units. Maps to 'num_units'"),
        pydantic.Field(None),
        pgml.ArgRef("num_units"),
    ]
    """Number of units"""
    activation: typing.Annotated[
        typing.Optional[str],
        typer.Option(help="Activation function. Maps to 'activation'"),
        pydantic.Field(None),
        pgml.ArgRef("activation"),
    ]
    """Activation function"""
```
This file defines a Pydantic model for your command line arguments. We've chosen typer to handle command line parsing, and we've added a convenience function to simplify the use of this class.
The easiest way to use the CLI is to copy the generated entrypoint script. The entrypoint's name is the snake-case version of the name of the message that carries the `pgml.cli` option, with `_entrypoint.py` appended.
🚀 Generated Entrypoint
```python
import pgml_out.quickstart_b_base as base
import pgml_out.quickstart_b_sweep as sweep
import pgml_out.quickstart_b_cli_args as cli_args
import typer
import py_gen_ml as pgml
import optuna
import typing

app = typer.Typer(pretty_exceptions_enable=False)


def run_trial(
    mlp_quickstart: base.MLPQuickstart,
    trial: typing.Optional[optuna.Trial] = None,
) -> typing.Union[float, typing.Sequence[float]]:
    """
    Run a trial with the given values for mlp_quickstart. The sampled
    hyperparameters have already been added to the trial.
    """
    # TODO: Implement this function
    return 0.0


@pgml.pgml_cmd(app=app)
def main(
    config_paths: typing.List[str] = typer.Option(..., help="Paths to config files"),
    sweep_paths: typing.List[str] = typer.Option(
        default_factory=list,
        help="Paths to sweep files",
    ),
    cli_args: cli_args.MLPQuickstartArgs = typer.Option(...),
) -> None:
    mlp_quickstart = base.MLPQuickstart.from_yaml_files(config_paths)
    mlp_quickstart = mlp_quickstart.apply_cli_args(cli_args)
    if len(sweep_paths) == 0:
        run_trial(mlp_quickstart)
        return
    mlp_quickstart_sweep = sweep.MLPQuickstartSweep.from_yaml_files(sweep_paths)

    def objective(trial: optuna.Trial) -> typing.Union[float, typing.Sequence[float]]:
        optuna_sampler = pgml.OptunaSampler(trial)
        mlp_quickstart_patch = optuna_sampler.sample(mlp_quickstart_sweep)
        mlp_quickstart_patched = mlp_quickstart.merge(mlp_quickstart_patch)
        objective_value = run_trial(mlp_quickstart_patched, trial)
        return objective_value

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)


if __name__ == "__main__":
    app()
```
The magic happens in the `pgml.pgml_cmd` decorator, which wraps the `main` function and adds the necessary arguments and options to the CLI.
Now you can run your script, setting parameters via both command line arguments and configuration files.
With these tools at your disposal, you're now ready to create flexible and powerful configurations for your machine learning projects using py-gen-ml! If you're looking for a more complex example, check out the CIFAR 10 example project.