📝 Defining YAML Files
YAML files are the backbone of your project's configuration in `py-gen-ml`. To make working with these files a breeze, `py-gen-ml` automatically generates JSON schemas for each protobuf model. These schemas are your secret weapon for validating YAML files with ease!
🏗️ Default Project Structure
When you use `py-gen-ml`, it sets up a neat and organized structure for your schemas:
```
<project_root>/
    configs/
        base/
            schemas/
                <message_name_a>.json
                <message_name_b>.json
                ...
        patch/
            schemas/
                <message_name_a>.json
                <message_name_b>.json
                ...
        sweep/
            schemas/
                <message_name_a>.json
                <message_name_b>.json
                ...
```
🛠️ Putting Schemas to Work
Want to leverage these schemas in Visual Studio Code? It's simple! Just install the YAML plugin and add this line to the top of your YAML file:
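Using the placeholder names from the layout above, the directive looks like this (swap in the schema file that was generated for the message you want to validate against):

```yaml
# yaml-language-server: $schema=schemas/<message_name>.json
```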
(We're assuming your file is located under `<project_root>/configs/base/`.)
Let's take the following proto as an example:
syntax = "proto3";
package mlp;
import "py_gen_ml/extensions.proto";
// Activation is an enum of activation functions.
enum Activation {
// ReLU is the Rectified Linear Unit activation function.
RELU = 0;
// TANH is the hyperbolic tangent activation function.
TANH = 1;
// SIGMOID is the sigmoid activation function.
SIGMOID = 2;
}
// MLP is a simple multi-layer perceptron.
message MLPParsingDemo {
// Number of layers in the MLP.
uint32 num_layers = 1 [(pgml.default).uint32 = 2];
// Number of units in each layer.
uint32 num_units = 2;
// Activation function to use.
Activation activation = 3;
}
Here's a quick example of what your YAML file might look like:
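A minimal sketch for the MLPParsingDemo message above (the file location, schema filename, and exact enum spelling are assumptions, so adjust them to whatever `py-gen-ml` generates for your project):

```yaml
# configs/base/mlp.yaml (hypothetical location)
# yaml-language-server: $schema=schemas/mlp_parsing_demo.json
num_layers: 3
num_units: 128
activation: RELU  # assumed to accept the Activation value names: RELU, TANH, SIGMOID
```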
Now, if you accidentally misconfigure your YAML file, Visual Studio Code will give you a friendly heads-up with a validation error.
The video below shows how the editor leverages the schema to know exactly which fields can be added and when the field is invalid:
You can see that:
- The leading comment we have added to the message in the proto shows at the top of the file.
- By pressing Cmd+Space (or Ctrl+Space on Linux) with an empty file, we see the list of possible fields.
- By pressing Cmd+Space (or Ctrl+Space on Linux) after typing `activation:`, we get a list of possible values for the field.
- By entering an invalid value for `num_units`, we get a validation error.
🧩 Handling Nested Messages
Let's kick things up a notch with a more complex protobuf that includes some nesting:
```proto
// advanced.proto
syntax = "proto3";

package advanced;

import "py_gen_ml/extensions.proto";

// Linear block configuration
message LinearBlock {
  // Number of units
  int32 num_units = 1;
  // Activation function
  string activation = 2;
}

// Multi-layer perceptron configuration
message MLP {
  // List of linear blocks
  repeated LinearBlock layers = 1;
}

// Optimizer configuration
message Optimizer {
  // Type of optimizer
  string type = 1;
  // Learning rate
  float learning_rate = 2;
}

// Training configuration
message Training {
  option (pgml.cli).enable = true;

  // Multi-layer perceptron configuration
  MLP mlp = 1;
  // Optimizer configuration
  Optimizer optimizer = 2;
}
```
You can define a YAML file for this structure like so:
```yaml
# configs/base/default.yaml
# yaml-language-server: $schema=schemas/training.json
mlp:
  layers:
    - num_units: 100
      activation: relu
    - num_units: 200
      activation: relu
    - num_units: 100
      activation: relu
optimizer:
  type: sgd
  learning_rate: 0.01
```
As you can see, the nesting in the YAML file mirrors the nesting in the protobuf.
Now, let's put this config to work by creating a model and an optimizer:
```python
import torch

from pgml_out.advanced_base import Training


def create_model(config: Training) -> torch.nn.Module:
    # Build one Linear + activation pair per configured layer.
    layers = []
    for layer in config.mlp.layers:
        layers.append(torch.nn.Linear(layer.num_units, layer.num_units))
        layers.append(torch.nn.ReLU() if layer.activation == "relu" else torch.nn.Tanh())
    return torch.nn.Sequential(*layers)


def create_optimizer(model: torch.nn.Module, config: Training) -> torch.optim.Optimizer:
    return torch.optim.SGD(model.parameters(), lr=config.optimizer.learning_rate)


if __name__ == "__main__":
    config = Training.from_yaml_file("configs/base/default.yaml")
    model = create_model(config)
    optimizer = create_optimizer(model, config)
```
🔗 Internal References with #
Want to reuse values in your YAML file? `py-gen-ml` has got you covered! You can replace a value with a reference to another value using the `#<path_to_value>` syntax. Here's how it works:
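A sketch of what this could look like for the Training config from above, assuming `#`-prefixed, slash-separated paths that start at the root of the document:

```yaml
# configs/base/default.yaml
# yaml-language-server: $schema=schemas/training.json
mlp:
  layers:
    - num_units: 100
      activation: relu
    # The remaining layers point back at the values of the first layer.
    - num_units: '#/mlp/layers/0/num_units'
      activation: '#/mlp/layers/0/activation'
    - num_units: '#/mlp/layers/0/num_units'
      activation: '#/mlp/layers/0/activation'
optimizer:
  type: sgd
  learning_rate: 0.01
```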
In this example, the second and third layers will mirror the number of units and activation function of the first layer.
🎯 Using the `_defs_` Field
For even more flexibility, you can use the `_defs_` field. It's perfect for reusing values with shorter paths and a more centralized definition:
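A sketch, assuming entries under a top-level `_defs_` key can be targeted with the same `#` reference syntax (the entry name is hypothetical):

```yaml
# configs/base/default.yaml
_defs_:
  default_layer:       # hypothetical name for a reusable definition
    num_units: 100
    activation: relu
mlp:
  layers:
    - '#/_defs_/default_layer'
    - '#/_defs_/default_layer'
    - '#/_defs_/default_layer'
optimizer:
  type: sgd
  learning_rate: 0.01
```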
📊 Using Indices in Lists
Need to reference specific elements in a list? No problem! You can use indices like this:
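Sticking with the assumed path syntax, an index in the path selects one element of a repeated field (zero-based here by assumption):

```yaml
mlp:
  layers:
    - num_units: 100
      activation: relu
    # '0' selects the first entry of the layers list.
    - num_units: '#/mlp/layers/0/num_units'
      activation: tanh
```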
👪 Relative Internal References
You can also use relative internal references. This is useful if you want to reuse values in a nested structure and the reference is close to the reused value.
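As a sketch (assuming a reference without a leading `/` is resolved against the surrounding node rather than the document root, and with hypothetical field names):

```yaml
foo:
  bar:
    data: 42
    # Hypothetical relative reference: resolved within /foo/bar,
    # so the /foo/bar prefix can be left out.
    data_copy: '#data'
```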
This allows you to skip the `/foo/bar` prefix to get to the `data` field. It also makes this part of the YAML file more self-contained: you can safely copy this part to a different YAML file that follows a different schema yet the same relative structure for the `data` field.
🌐 External References with !
Want to reuse values across multiple YAML files? External references, made by prefixing the path with `!`, are the way to go:
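A sketch, assuming the referenced path is written as a quoted string and resolved relative to the current YAML file (the file names here are hypothetical):

```yaml
# configs/base/default.yaml
# yaml-language-server: $schema=schemas/training.json
mlp: '!mlp.yaml'              # pull the whole MLP config from another file
optimizer: '!optimizer.yaml'  # same for the optimizer
```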
The referenced files might look like this:
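Continuing the sketch above:

```yaml
# configs/base/mlp.yaml
layers:
  - num_units: 100
    activation: relu
  - num_units: 200
    activation: relu
```

```yaml
# configs/base/optimizer.yaml
type: sgd
learning_rate: 0.01
```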
🔀 Combining External and Internal References
For the ultimate flexibility, you can mix and match external and internal references:
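A sketch, assuming an external reference can be followed by an internal path into the referenced file (the exact combined syntax is an assumption):

```yaml
# configs/base/default.yaml
# yaml-language-server: $schema=schemas/training.json
mlp:
  layers:
    # Each entry points into layer.yaml and then at one of its top-level keys.
    - '!layer.yaml#/layer0'
    - '!layer.yaml#/layer1'
    - '!layer.yaml#/layer2'
optimizer:
  type: sgd
  learning_rate: 0.01
```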
With the corresponding `layer.yaml`:
```yaml
# configs/base/layer.yaml
layer0:
  num_units: 100
  activation: relu
layer1:
  num_units: 200
  activation: relu
layer2:
  num_units: 100
  activation: relu
```
And there you have it! With these powerful YAML configuration techniques at your fingertips, you're all set to create flexible and maintainable machine learning projects using `py-gen-ml`. Happy coding! 🚀