Project Resources Management¶

Objective:

Learning how ML solutions and their lifecycles can be managed as software projects.

Principles:

  1. Project resources are organized to facilitate development, collaboration, and maintainability.
  2. The development lifecycle handles all the project development phases leading to a model release.
  3. The model registry represents the persistence layer holding the published project artifacts at rest.
  4. The production lifecycle takes care of all the necessary model operations after its release.

Project Setup¶

ForML-based solutions are standard Python projects, typically with the following minimal structure:

Component           Location
------------------  ----------------------
Project Descriptor  pyproject.toml
Dependencies        pyproject.toml
Data Requirements   <module>/source.py
Evaluation Spec     <module>/evaluation.py
Model Pipeline      <module>/pipeline.py
Tests               tests/

Additional typical components not used directly by ForML (see, for example, Cookiecutter Data Science):

  • data/
  • docs/
  • notebooks/
  • CI/CD descriptors

Starting a New Project¶

For the sake of this tutorial, let's start a new project called dummy:

In [1]:
! forml project init dummy
In [2]:
%cd dummy
/opt/forml/workspace/2-tutorial/dummy
In [3]:
! tree .
.
├── dummy
│   ├── __init__.py
│   ├── evaluation.py
│   ├── pipeline.py
│   └── source.py
├── pyproject.toml
└── tests
    └── __init__.py

2 directories, 6 files

ForML adopts the standard Python pyproject.toml descriptor:

In [4]:
from IPython import display
display.Code('pyproject.toml')
Out[4]:
# Dummy project.
#
# Generated on 2023-05-31 04:20:30.693934 by forml using ForML 0.93.

[project]
name = "dummy"
version = "0.1.dev1"
dependencies = [
    "forml==0.93"
]


[tool.forml]
package = "dummy"
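
Being a standard descriptor, the file can be inspected with regular Python tooling. As a quick illustrative sketch (assuming Python 3.11+ for the standard-library tomllib; older interpreters can use the tomli backport), the ForML-specific settings live under the [tool.forml] table:

import tomllib

# Load the project descriptor and print the bits ForML cares about.
with open("pyproject.toml", "rb") as fp:
    meta = tomllib.load(fp)

print(meta["project"]["name"], meta["project"]["version"])  # dummy 0.1.dev1
print(meta["project"]["dependencies"])                      # ["forml==0.93"]
print(meta["tool"]["forml"]["package"])                     # dummy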

Let's keep our Dummy project under version control:

In [5]:
! git init .
! git add .
Initialized empty Git repository in /opt/forml/workspace/2-tutorial/dummy/.git/

Filling in the Project Components¶

Data Requirements¶

Let's define the dummy/source.py:

  1. Open the dummy/source.py component.
  2. Update it with the final query DSL used previously in chapter 2-task-dependency-management:
from forml import project
from forml.pipeline import payload

from dummycatalog import Foo

FEATURES = Foo.select(Foo.Level, Foo.Value)
OUTCOMES = Foo.Label

SOURCE = project.Source.query(FEATURES, OUTCOMES) >> payload.ToPandas()

project.setup(SOURCE)
  3. SAVE THE dummy/source.py FILE!
In [6]:
! git add dummy/source.py
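
The source component is not limited to a plain projection since the DSL also supports further query refinements. Purely as an illustrative sketch (the .where() filter below and the assumption that Foo.Value is a numeric column are ours, not part of the generated project), the query could for instance be restricted like this:

from dummycatalog import Foo

# Hypothetical refinement of the FEATURES query: same projection, but only
# rows with a positive Value (assumes Foo.Value is numeric).
FILTERED = Foo.select(Foo.Level, Foo.Value).where(Foo.Value > 0)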

Evaluation¶

Let's configure the dummy/evaluation.py:

  1. Open the dummy/evaluation.py component.
  2. Update it with the evaluation descriptor shown previously in chapter 3-evaluation:
from sklearn import metrics
from sklearn import model_selection

from forml import evaluation, project

EVALUATION = project.Evaluation(
    evaluation.Function(metrics.log_loss),
    evaluation.CrossVal(
        crossvalidator=model_selection.StratifiedKFold(
            n_splits=3, shuffle=True, random_state=42
        )
    ),
)

project.setup(EVALUATION)
  3. SAVE THE dummy/evaluation.py FILE!
In [7]:
! git add dummy/evaluation.py
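
The two building blocks configured above are ordinary Scikit-learn objects, so their behavior can be previewed in isolation. A minimal toy sketch (made-up numbers, independent of the dummy dataset) of what the log-loss metric and the stratified 3-fold splitter do:

from sklearn import metrics, model_selection

# Log-loss of some toy binary predictions (lower is better).
y_true = [0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.8, 0.7, 0.3, 0.9, 0.2]  # predicted P(label == 1)
print(metrics.log_loss(y_true, y_prob))

# Stratified 3-fold CV keeps the class ratio within each fold.
features = [[i] for i in range(len(y_true))]
splitter = model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in splitter.split(features, y_true):
    print(train_index, test_index)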

Pipeline¶

Let's set up the dummy/pipeline.py workflow:

  1. Open the dummy/pipeline.py component.
  2. Update it with all the actors, operators, and their composition as explored previously in chapter 2-task-dependency-management.
  3. Save the file!
import typing

import pandas
from imblearn import over_sampling

from forml import project, flow
from forml.pipeline import payload, wrap

with wrap.importer():
    from sklearn.linear_model import LogisticRegression


@wrap.Actor.apply
def OrdActor(data: pandas.DataFrame, *, column: str) -> pandas.Series:
    return data[column].apply(lambda v: ord(v[0].lower()))


@wrap.Actor.train
def CenterActor(
    state: typing.Optional[float],
    data: pandas.DataFrame,
    labels: pandas.Series,
    *,
    column: str
) -> float:
    return data[column].mean()


@CenterActor.apply
def CenterActor(
    state: float, data: pandas.DataFrame, *, column: str
) -> pandas.DataFrame:
    return data[column] - state


@wrap.Actor.train
def MinMax(
    state: typing.Optional[tuple[float, float]],
    data: pandas.DataFrame,
    labels: pandas.Series,
    *,
    column: str
) -> tuple[float, float]:
    min_ = data[column].min()
    return min_, data[column].max() - min_


@wrap.Operator.mapper
@MinMax.apply
def MinMax(
    state: tuple[float, float], data: pandas.DataFrame, *, column: str
) -> pandas.DataFrame:
    data[column] = (data[column] - state[0]) / state[1]
    return data


@wrap.Actor.apply
def OverSampler(
    features: pandas.DataFrame,
    labels: pandas.Series,
    *,
    random_state: typing.Optional[int] = None
):
    """Stateless actor  with two input and two output ports for oversampling the features/labels of the minor class."""
    return over_sampling.RandomOverSampler(random_state=random_state).fit_resample(
        features, labels
    )


class Balancer(flow.Operator):
    """Balancer operator inserting the provided sampler into the ``train`` & ``label`` paths."""

    def __init__(self, sampler: flow.Builder = OverSampler.builder(random_state=42)):
        self._sampler = sampler

    def compose(self, scope: flow.Composable) -> flow.Trunk:
        left = scope.expand()
        # Sampler worker with two input and two output ports (features, labels).
        sampler = flow.Worker(self._sampler, 2, 2)
        # Subscribe the sampler inputs to the upstream train/label paths and
        # capture its resampled outputs using Future nodes.
        sampler[0].subscribe(left.train.publisher)
        new_features = flow.Future()
        new_features[0].subscribe(sampler[0])
        sampler[1].subscribe(left.label.publisher)
        new_labels = flow.Future()
        new_labels[0].subscribe(sampler[1])
        # Extend the train and label paths with the resampled features/labels.
        return left.use(
            train=left.train.extend(tail=new_features),
            label=left.label.extend(tail=new_labels),
        )


PIPELINE = (
    Balancer()
    >> payload.MapReduce(
        OrdActor.builder(column="Level"), CenterActor.builder(column="Value")
    )
    >> MinMax(column="Level")
    >> LogisticRegression(random_state=42)
)

project.setup(PIPELINE)
In [8]:
! git add dummy/pipeline.py
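
The resampling inside the OverSampler actor is plain imbalanced-learn. A tiny standalone sketch (toy data, independent of the dummy dataset) of what its fit_resample call does to an imbalanced features/labels pair:

import pandas
from imblearn import over_sampling

# Three positive samples vs one negative: the minority class gets duplicated
# until both classes are of equal size.
features = pandas.DataFrame({"Level": [101, 102, 103, 104], "Value": [1.0, 2.0, 3.0, 4.0]})
labels = pandas.Series([1, 1, 1, 0], name="Label")

resampled, rebalanced = over_sampling.RandomOverSampler(random_state=42).fit_resample(features, labels)
print(rebalanced.value_counts())  # both classes now have 3 samples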

Dependencies¶

Let's add the explicit dependencies used in this project to the pyproject.toml:

  1. Open the pyproject.toml.
  2. Update it with the code below, adding the new imbalanced-learn==0.10.1 dependency:
[project]
name = "dummy"
version = "0.1.dev1"
dependencies = [
    "forml==0.93",
    "imbalanced-learn==0.10.1"
]

[tool.forml]
package = "dummy"
  3. SAVE THE pyproject.toml FILE!
In [9]:
! git add pyproject.toml

Adding Unit Test for our Balancer Operator¶

In [10]:
! touch tests/test_pipeline.py

Edit the created tests/test_pipeline.py and implement the unit test:

  1. Open the tests/test_pipeline.py.
  2. Update it with the code below, providing the TestBalancer unit test implementation based on chapter 2-task-dependency-management:
from forml import testing

from dummy import pipeline

class TestBalancer(testing.operator(pipeline.Balancer)):
    """Balancer unit tests."""

    default_oversample = (
        testing.Case()
        .train([[1], [1], [0]], [1, 1, 0])
        .returns([[1], [1], [0], [0]], labels=[1, 1, 0, 0])
    )
  3. SAVE THE tests/test_pipeline.py FILE!
In [11]:
! git add tests/test_pipeline.py
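
The testing DSL makes it cheap to cover more scenarios within the same class. As a purely illustrative sketch (this extra case is ours, not part of the tutorial project), one could also assert that an already balanced input passes through the Balancer unchanged by adding another attribute to TestBalancer:

    # Hypothetical extra scenario: with both classes already of equal size,
    # the oversampler has nothing to add, so the data should stay unchanged.
    already_balanced = (
        testing.Case()
        .train([[1], [0]], [1, 0])
        .returns([[1], [0]], labels=[1, 0])
    )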

Development Lifecycle¶

The development lifecycle covers all the project development phases leading to a model release.

Visualising the Train DAG¶

In [12]:
! forml project train -R graphviz
running train

This produces an SVG file under dummy/forml.dot.svg visualizing the given train workflow:

[Figure: dummy/forml.dot.svg - the rendered train flow DAG]

Performing Development Evaluation¶

In [13]:
! forml project eval
running eval
0.3772729843470399

Running Tests¶

In [14]:
! forml project test
running test
running egg_info
creating dummy.egg-info
writing dummy.egg-info/PKG-INFO
writing dependency_links to dummy.egg-info/dependency_links.txt
writing requirements to dummy.egg-info/requires.txt
writing top-level names to dummy.egg-info/top_level.txt
writing manifest file 'dummy.egg-info/SOURCES.txt'
reading manifest file 'dummy.egg-info/SOURCES.txt'
writing manifest file 'dummy.egg-info/SOURCES.txt'
running build_ext
test_default_oversample (tests.test_pipeline.TestBalancer)
Test of Default Oversample ... ok

----------------------------------------------------------------------
Ran 1 test in 2.195s

OK

Releasing¶

Once we are happy with the achieved results (good evaluation metric, unit tests passing), we can proceed to release the model version.

Let's start by committing and tagging the project codebase:

In [15]:
! git commit -m 'Released 0.1.dev1'
! git tag 0.1.dev1
[main (root-commit) 86a56dd] Released 0.1.dev1
 8 files changed, 158 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 dummy/__init__.py
 create mode 100644 dummy/evaluation.py
 create mode 100644 dummy/pipeline.py
 create mode 100644 dummy/source.py
 create mode 100644 pyproject.toml
 create mode 100644 tests/__init__.py
 create mode 100644 tests/test_pipeline.py

Now we can kick off the release process to package the model artifact and publish it into the model registry:

In [16]:
! forml project release
running bdist_4ml
Collecting forml==0.93
  Using cached forml-0.93-py3-none-any.whl (283 kB)
Collecting imbalanced-learn==0.10.1
  Using cached imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
Collecting click (from forml==0.93)
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting cloudpickle (from forml==0.93)
  Using cached cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting jinja2 (from forml==0.93)
  Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting numpy (from forml==0.93)
  Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting packaging>=20.0 (from forml==0.93)
  Using cached packaging-23.1-py3-none-any.whl (48 kB)
Collecting pandas (from forml==0.93)
  Using cached pandas-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting pip (from forml==0.93)
  Using cached pip-23.1.2-py3-none-any.whl (2.1 MB)
Collecting scikit-learn (from forml==0.93)
  Using cached scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting setuptools (from forml==0.93)
  Using cached setuptools-67.8.0-py3-none-any.whl (1.1 MB)
Collecting toml (from forml==0.93)
  Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting tomli (from forml==0.93)
  Using cached tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting scipy>=1.3.2 (from imbalanced-learn==0.10.1)
  Using cached scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Collecting joblib>=1.1.1 (from imbalanced-learn==0.10.1)
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting threadpoolctl>=2.0.0 (from imbalanced-learn==0.10.1)
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting MarkupSafe>=2.0 (from jinja2->forml==0.93)
  Using cached MarkupSafe-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting python-dateutil>=2.8.2 (from pandas->forml==0.93)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas->forml==0.93)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas->forml==0.93)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas->forml==0.93)
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, tomli, toml, threadpoolctl, six, setuptools, pip, packaging, numpy, MarkupSafe, joblib, cloudpickle, click, scipy, python-dateutil, jinja2, scikit-learn, pandas, imbalanced-learn, forml
Successfully installed MarkupSafe-2.1.2 click-8.1.3 cloudpickle-2.2.1 forml-0.93 imbalanced-learn-0.10.1 jinja2-3.1.2 joblib-1.2.0 numpy-1.24.3 packaging-23.1 pandas-2.0.2 pip-23.1.2 python-dateutil-2.8.2 pytz-2023.3 scikit-learn-1.2.2 scipy-1.10.1 setuptools-67.8.0 six-1.16.0 threadpoolctl-3.1.0 toml-0.10.2 tomli-2.0.1 tzdata-2023.3
running upload

Model Registry¶

The model registry is the interface for managing published models throughout the production lifecycle. It can be provided by a number of different implementations.

The registry has a tree hierarchy with levels of project / release / generation:

In [17]:
! forml model list
dummy  
In [18]:
! forml model list dummy
0.1.dev1  
In [19]:
! forml model list dummy 0.1.dev1
In [20]:
! tree /opt/forml/assets/registry/
/opt/forml/assets/registry/
└── dummy
    └── 0.1.dev1
        └── package.4ml

2 directories, 1 file
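
The filesystem-based registry used in this tutorial maps that hierarchy directly onto directories, so it can be explored with nothing but pathlib. A quick exploratory sketch (the registry root is the path shown above; generation subdirectories only appear once a model gets trained):

from pathlib import Path

REGISTRY = Path("/opt/forml/assets/registry")
for project_dir in sorted(REGISTRY.iterdir()):
    for release_dir in sorted(project_dir.iterdir()):
        generations = sorted(g.name for g in release_dir.iterdir() if g.is_dir())
        print(f"{project_dir.name} / {release_dir.name} / generations: {generations or 'none yet'}")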

Production Lifecycle¶

The production lifecycle takes care of all the necessary model operations after its release.

Model Training¶

In [21]:
! forml model train dummy
In [22]:
! forml model list dummy 0.1.dev1
1  
In [23]:
! tree /opt/forml/assets/registry/
/opt/forml/assets/registry/
└── dummy
    └── 0.1.dev1
        ├── 1
        │   ├── 21c560d1-dac8-41f0-b5bd-246e76e00bd6.bin
        │   ├── 4da96680-dcd0-44f9-b8f6-74d8b7d12360.bin
        │   ├── 643066e9-aad2-41f5-bd8f-73fdf736d1b8.bin
        │   └── tag.toml
        └── package.4ml

3 directories, 5 files
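
Each generation directory holds the persisted states of the stateful actors (the three .bin files above presumably corresponding to the CenterActor, MinMax, and LogisticRegression states) plus a tag.toml with the generation metadata. The exact content of that file is an implementation detail of the filesystem registry, but being plain TOML it can be peeked at; an exploratory sketch only, assuming Python 3.11+ for tomllib:

import tomllib
from pathlib import Path

tag = Path("/opt/forml/assets/registry/dummy/0.1.dev1/1/tag.toml")
with tag.open("rb") as fp:
    print(tomllib.load(fp))  # metadata recorded by the training run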

Follow-up Steps¶

We will leave the remaining steps of the production lifecycle to the final chapter, 3-solution, which works with a real dataset.