Objective:
Learning how ML solutions and their lifecycles can be managed as software projects.
Principles:
ForML-based solutions are standard Python projects, typically with the following minimal structure:
Component | Location
---|---
Project Descriptor | pyproject.toml
Dependencies | pyproject.toml
Data Requirements | &lt;module&gt;/source.py
Evaluation Spec | &lt;module&gt;/evaluation.py
Model Pipeline | &lt;module&gt;/pipeline.py
Tests | tests/
Additional typical components not used directly by ForML (see for example Cookiecutter Data Science):
data/
docs/
notebooks/
For the sake of this tutorial, let's start a new project called dummy:
! forml project init dummy
%cd dummy
/opt/forml/workspace/2-tutorial/dummy
! tree .
.
├── dummy
│   ├── __init__.py
│   ├── evaluation.py
│   ├── pipeline.py
│   └── source.py
├── pyproject.toml
└── tests
    └── __init__.py

2 directories, 6 files
ForML adopts the standard Python pyproject.toml descriptor:
from IPython import display
display.Code('pyproject.toml')
# Dummy project.
#
# Generated on 2023-05-31 04:20:30.693934 by forml using ForML 0.93.
[project]
name = "dummy"
version = "0.1.dev1"
dependencies = [
"forml==0.93"
]
[tool.forml]
package = "dummy"
Let's keep our Dummy project under version control:
! git init .
! git add .
Initialized empty Git repository in /opt/forml/workspace/2-tutorial/dummy/.git/
Let's define the dummy/source.py:
from forml import project
from forml.pipeline import payload
from dummycatalog import Foo
FEATURES = Foo.select(Foo.Level, Foo.Value)
OUTCOMES = Foo.Label
SOURCE = project.Source.query(FEATURES, OUTCOMES) >> payload.ToPandas()
project.setup(SOURCE)
! git add dummy/source.py
Let's configure the dummy/evaluation.py:
from sklearn import metrics
from sklearn import model_selection
from forml import evaluation, project
EVALUATION = project.Evaluation(
evaluation.Function(metrics.log_loss),
evaluation.CrossVal(
crossvalidator=model_selection.StratifiedKFold(
n_splits=3, shuffle=True, random_state=42
)
),
)
project.setup(EVALUATION)
! git add dummy/evaluation.py
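Conceptually, the Evaluation spec above corresponds to an ordinary scikit-learn cross-validation loop: split the training data with the StratifiedKFold crossvalidator, fit a model on each train fold, and score its predicted probabilities on the held-out fold with log_loss. The following standalone sketch (using made-up deterministic data, not the tutorial's Foo table) illustrates what gets computed:

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection

# Hypothetical deterministic binary-classification data (stand-in for Foo).
X = np.array([[(i % 10) / 10.0, (i * 7 % 10) / 10.0] for i in range(30)])
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Same crossvalidator setup as in dummy/evaluation.py.
crossvalidator = model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

losses = []
for train_idx, test_idx in crossvalidator.split(X, y):
    model = linear_model.LogisticRegression(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # log_loss scores the predicted class probabilities on the held-out fold.
    losses.append(metrics.log_loss(y[test_idx], model.predict_proba(X[test_idx])))

print(sum(losses) / len(losses))  # mean log-loss across the three folds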
Let's set up the dummy/pipeline.py workflow:
import typing
import pandas
from imblearn import over_sampling
from forml import project, flow
from forml.pipeline import payload, wrap
with wrap.importer():
from sklearn.linear_model import LogisticRegression
@wrap.Actor.apply
def OrdActor(data: pandas.DataFrame, *, column: str) -> pandas.Series:
return data[column].apply(lambda v: ord(v[0].lower()))
@wrap.Actor.train
def CenterActor(
state: typing.Optional[float],
data: pandas.DataFrame,
labels: pandas.Series,
*,
column: str
) -> float:
return data[column].mean()
@CenterActor.apply
def CenterActor(
state: float, data: pandas.DataFrame, *, column: str
) -> pandas.DataFrame:
return data[column] - state
@wrap.Actor.train
def MinMax(
state: typing.Optional[tuple[float, float]],
data: pandas.DataFrame,
labels: pandas.Series,
*,
column: str
) -> tuple[float, float]:
min_ = data[column].min()
return min_, data[column].max() - min_
@wrap.Operator.mapper
@MinMax.apply
def MinMax(
state: tuple[float, float], data: pandas.DataFrame, *, column: str
) -> pandas.DataFrame:
data[column] = (data[column] - state[0]) / state[1]
return data
@wrap.Actor.apply
def OverSampler(
features: pandas.DataFrame,
labels: pandas.Series,
*,
random_state: typing.Optional[int] = None
):
"""Stateless actor with two input and two output ports for oversampling the features/labels of the minor class."""
return over_sampling.RandomOverSampler(random_state=random_state).fit_resample(
features, labels
)
class Balancer(flow.Operator):
"""Balancer operator inserting the provided sampler into the ``train`` & ``label`` paths."""
def __init__(self, sampler: flow.Builder = OverSampler.builder(random_state=42)):
self._sampler = sampler
def compose(self, scope: flow.Composable) -> flow.Trunk:
left = scope.expand()
sampler = flow.Worker(self._sampler, 2, 2)
sampler[0].subscribe(left.train.publisher)
new_features = flow.Future()
new_features[0].subscribe(sampler[0])
sampler[1].subscribe(left.label.publisher)
new_labels = flow.Future()
new_labels[0].subscribe(sampler[1])
return left.use(
train=left.train.extend(tail=new_features),
label=left.label.extend(tail=new_labels),
)
PIPELINE = (
Balancer()
>> payload.MapReduce(
OrdActor.builder(column="Level"), CenterActor.builder(column="Value")
)
>> MinMax(column="Level")
>> LogisticRegression(random_state=42)
)
project.setup(PIPELINE)
! git add dummy/pipeline.py
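The train/apply split of CenterActor and MinMax above boils down to a common pattern: derive the state (mean, min, range) from the training data only, then reuse that learned state on unseen data at apply time. A plain-pandas sketch with hypothetical numbers (not part of the tutorial's dataset) makes the two phases explicit:

```python
import pandas as pd

# Hypothetical training and serving frames mirroring the tutorial's columns.
train = pd.DataFrame({"Level": [1.0, 3.0, 5.0], "Value": [10.0, 20.0, 30.0]})
serve = pd.DataFrame({"Level": [2.0, 4.0], "Value": [15.0, 25.0]})

# "train" phase: fit the state on the training data only.
center_state = train["Value"].mean()                # CenterActor state (mean = 20.0)
min_ = train["Level"].min()
minmax_state = (min_, train["Level"].max() - min_)  # MinMax state (min = 1.0, range = 4.0)

# "apply" phase: reuse the learned state on unseen data.
centered = serve["Value"] - center_state
scaled = (serve["Level"] - minmax_state[0]) / minmax_state[1]

print(list(centered))  # [-5.0, 5.0]
print(list(scaled))    # [0.25, 0.75]
```

The @wrap.Actor.train/@…apply decorators package exactly this pattern as a stateful actor so that ForML can persist the state between the train and apply modes.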
Let's add the explicit dependencies used in this project (namely imbalanced-learn==0.10.1) into the pyproject.toml:
[project]
name = "dummy"
version = "0.1.dev1"
dependencies = [
"forml==0.93",
"imbalanced-learn==0.10.1"
]
[tool.forml]
package = "dummy"
! git add pyproject.toml
! touch tests/test_pipeline.py
Edit the created test_pipeline.py and implement the TestBalancer unit test based on the chapter 2-task-dependency-management:
from forml import testing
from dummy import pipeline
class TestBalancer(testing.operator(pipeline.Balancer)):
"""Balancer unit tests."""
default_oversample = (
testing.Case()
.train([[1], [1], [0]], [1, 1, 0])
.returns([[1], [1], [0], [0]], labels=[1, 1, 0, 0])
)
! git add tests/test_pipeline.py
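The behaviour encoded in the TestBalancer case can also be reproduced outside of ForML. Below is a simplified plain-pandas stand-in for random oversampling (not imblearn's actual RandomOverSampler implementation): it duplicates minority-class rows until all classes are equally represented, which is exactly what turns the three training rows into the four expected ones:

```python
import pandas as pd

def oversample(features: pd.DataFrame, labels: pd.Series):
    """Duplicate minority-class rows until all classes match the majority count."""
    counts = labels.value_counts()
    target = counts.max()
    parts_x, parts_y = [], []
    for cls, n in counts.items():
        idx = labels[labels == cls].index
        # Repeat the class indices enough times, then trim to the target count.
        resampled = idx.repeat((target + n - 1) // n)[:target]
        parts_x.append(features.loc[resampled])
        parts_y.append(labels.loc[resampled])
    return (pd.concat(parts_x, ignore_index=True),
            pd.concat(parts_y, ignore_index=True))

features = pd.DataFrame([[1], [1], [0]])
labels = pd.Series([1, 1, 0])
new_x, new_y = oversample(features, labels)
print(sorted(new_y))  # [0, 0, 1, 1] — balanced, matching the TestBalancer expectation
```

The operator-level testing.Case asserts the same thing end-to-end: given the imbalanced train/label inputs, the Balancer must emit the balanced outputs.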
The development lifecycle covers all the project development phases leading to a model release.
! forml project train -R graphviz
running train
This produces an SVG file under dummy/forml.dot.svg visualizing the given train workflow:
! forml project eval
running eval
0.3772729843470399
! forml project test
running test
running egg_info
creating dummy.egg-info
writing dummy.egg-info/PKG-INFO
writing dependency_links to dummy.egg-info/dependency_links.txt
writing requirements to dummy.egg-info/requires.txt
writing top-level names to dummy.egg-info/top_level.txt
writing manifest file 'dummy.egg-info/SOURCES.txt'
reading manifest file 'dummy.egg-info/SOURCES.txt'
writing manifest file 'dummy.egg-info/SOURCES.txt'
running build_ext
test_default_oversample (tests.test_pipeline.TestBalancer)
Test of Default Oversample ... ok
----------------------------------------------------------------------
Ran 1 test in 2.195s

OK
Once we are happy with the achieved results (good evaluation metric, unit tests passing), we can proceed to release the model version.
Let's start by committing and tagging the project codebase:
! git commit -m 'Released 0.1.dev1'
! git tag 0.1.dev1
[main (root-commit) 86a56dd] Released 0.1.dev1
 8 files changed, 158 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 dummy/__init__.py
 create mode 100644 dummy/evaluation.py
 create mode 100644 dummy/pipeline.py
 create mode 100644 dummy/source.py
 create mode 100644 pyproject.toml
 create mode 100644 tests/__init__.py
 create mode 100644 tests/test_pipeline.py
Now we can kick off the release process to package the model artifact and publish it into the model registry:
! forml project release
running bdist_4ml
Collecting forml==0.93
  Using cached forml-0.93-py3-none-any.whl (283 kB)
Collecting imbalanced-learn==0.10.1
  Using cached imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
Collecting click (from forml==0.93)
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting cloudpickle (from forml==0.93)
  Using cached cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting jinja2 (from forml==0.93)
  Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting numpy (from forml==0.93)
  Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting packaging>=20.0 (from forml==0.93)
  Using cached packaging-23.1-py3-none-any.whl (48 kB)
Collecting pandas (from forml==0.93)
  Using cached pandas-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting pip (from forml==0.93)
  Using cached pip-23.1.2-py3-none-any.whl (2.1 MB)
Collecting scikit-learn (from forml==0.93)
  Using cached scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting setuptools (from forml==0.93)
  Using cached setuptools-67.8.0-py3-none-any.whl (1.1 MB)
Collecting toml (from forml==0.93)
  Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting tomli (from forml==0.93)
  Using cached tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting scipy>=1.3.2 (from imbalanced-learn==0.10.1)
  Using cached scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Collecting joblib>=1.1.1 (from imbalanced-learn==0.10.1)
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting threadpoolctl>=2.0.0 (from imbalanced-learn==0.10.1)
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting MarkupSafe>=2.0 (from jinja2->forml==0.93)
  Using cached MarkupSafe-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting python-dateutil>=2.8.2 (from pandas->forml==0.93)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas->forml==0.93)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas->forml==0.93)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas->forml==0.93)
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, tomli, toml, threadpoolctl, six, setuptools, pip, packaging, numpy, MarkupSafe, joblib, cloudpickle, click, scipy, python-dateutil, jinja2, scikit-learn, pandas, imbalanced-learn, forml
Successfully installed MarkupSafe-2.1.2 click-8.1.3 cloudpickle-2.2.1 forml-0.93 imbalanced-learn-0.10.1 jinja2-3.1.2 joblib-1.2.0 numpy-1.24.3 packaging-23.1 pandas-2.0.2 pip-23.1.2 python-dateutil-2.8.2 pytz-2023.3 scikit-learn-1.2.2 scipy-1.10.1 setuptools-67.8.0 six-1.16.0 threadpoolctl-3.1.0 toml-0.10.2 tomli-2.0.1 tzdata-2023.3
running upload
The model registry serves as a crucial interface for managing published models throughout the production lifecycle. It can be provided by a number of different implementations.
The registry has a tree hierarchy with levels of project / release / generation:
! forml model list
dummy
! forml model list dummy
0.1.dev1
! forml model list dummy 0.1.dev1
! tree /opt/forml/assets/registry/
/opt/forml/assets/registry/
└── dummy
    └── 0.1.dev1
        └── package.4ml

2 directories, 1 file
The production lifecycle takes care of all the necessary model operations after its release.
! forml model train dummy
! forml model list dummy 0.1.dev1
1
! tree /opt/forml/assets/registry/
/opt/forml/assets/registry/
└── dummy
    └── 0.1.dev1
        ├── 1
        │   ├── 21c560d1-dac8-41f0-b5bd-246e76e00bd6.bin
        │   ├── 4da96680-dcd0-44f9-b8f6-74d8b7d12360.bin
        │   ├── 643066e9-aad2-41f5-bd8f-73fdf736d1b8.bin
        │   └── tag.toml
        └── package.4ml

3 directories, 5 files
We will leave the remaining steps of the production lifecycle to the final chapter 3-solution, which works with a real dataset.