Deriving Evaluation Workflows¶

Objective:

Learning how evaluation workflows are derived from the main ML pipeline.

Principles:

  1. An ML metric is a measure used to assess the performance of an ML solution.
  2. Development evaluation (of the logical model, a.k.a. the solution) is derived from both the train-mode and apply-mode workflows, combined according to a selected backtesting method.
  3. The workflow for evaluating an already trained (physical) model is composed simply of the model predictions and the (eventually observed) true outcomes.

Development Evaluation¶

  • Essential feedback during the model development process, defined as a function of the true and predicted outcomes.
  • Indicating the relative change in solution quality induced by a particular change in its implementation (code).
  • Working with historical data with known outcomes, arranged according to a particular evaluation method.

Holdout Method¶

Evaluation method based on withholding part of the training dataset for testing the predictions.

In [1]:
from sklearn import metrics
from forml import evaluation, project

EVALUATION = project.Evaluation(
    evaluation.Function(metrics.log_loss), # LogLoss metric function
    evaluation.HoldOut(                    # HoldOut evaluation method
        test_size=0.2, stratify=True, random_state=42
    ),
)

Based on the known SOURCE and PIPELINE components, ForML can produce a task graph to evaluate that solution using the provided definition:

In [2]:
from forml.pipeline import payload, wrap
from dummycatalog import Foo
with wrap.importer():
    from sklearn.linear_model import LogisticRegression

SOURCE = project.Source.query(Foo.select(Foo.Value), Foo.Label)
PIPELINE = LogisticRegression(random_state=42)
In [3]:
SOURCE.bind(PIPELINE, evaluation=EVALUATION).launcher(
    runner="graphviz"
).eval()
Out[3]:
[Graphviz rendering of the holdout evaluation task graph: the data source is split into train and test folds by a stratified shuffle splitter (test_size=0.2, random_state=42); LogisticRegression is trained on the train fold and applied to the test fold; the log_loss function is applied to the predictions and true labels to produce the score.]
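To make the derived workflow concrete, the following sketch reproduces the same computation in plain scikit-learn (not the ForML API): one stratified holdout split, a train-mode fit on the train fold, an apply-mode prediction on the test fold, and the log-loss on the result. The synthetic `values`/`labels` arrays are merely illustrative stand-ins for the `Foo.Value`/`Foo.Label` source.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Foo.Value / Foo.Label data source.
values = np.arange(100).reshape(-1, 1)
labels = np.repeat([0, 1], 50)

# Holdout split matching the evaluation definition above.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(values, labels))

# Train mode on the train fold, apply mode on the test fold.
model = LogisticRegression(random_state=42).fit(values[train_idx], labels[train_idx])
score = log_loss(labels[test_idx], model.predict_proba(values[test_idx]))
```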

Cross-validation Method¶

Evaluation method based on a number of independent train-test trials using different parts of the same training dataset.

In [4]:
from sklearn import model_selection

EVALUATION = project.Evaluation(
    evaluation.Function(metrics.log_loss),  # LogLoss metric function
    evaluation.CrossVal(                    # CrossValidation method
        crossvalidator=model_selection.StratifiedKFold(
            n_splits=3, shuffle=True, random_state=42
        )
    ),
)
In [5]:
SOURCE.bind(PIPELINE, evaluation=EVALUATION).launcher(
    runner="graphviz"
).eval()
Out[5]:
[Graphviz rendering of the cross-validation evaluation task graph: the data source is split into three stratified folds (StratifiedKFold, n_splits=3, shuffle=True, random_state=42); an independent LogisticRegression train/apply pair runs per fold; the per-fold log_loss values are aggregated by the mean function to produce the score.]

Production Performance Tracking¶

  • Evaluating the performance of a physical (trained) model, typically one making true future predictions.
  • Critical for operational monitoring.
  • The concept of evaluation methods doesn't apply here.
  • Depends on the presence of a feedback loop eventually delivering the true outcomes.
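The idea can be sketched as follows, using hypothetical names throughout: predictions are logged at serving time and joined with the true outcomes once the feedback loop delivers them, and the metric is then computed directly on that pair, with no evaluation method involved.

```python
import pandas as pd
from sklearn.metrics import log_loss

# Predictions logged at serving time (illustrative data).
served = pd.DataFrame({"id": [1, 2, 3, 4], "prediction": [0.9, 0.2, 0.7, 0.4]})

# True outcomes delivered later by the feedback loop; some are still missing.
feedback = pd.DataFrame({"id": [1, 2, 4], "outcome": [1, 0, 1]})

# Join predictions with the outcomes observed so far.
matched = served.merge(feedback, on="id")

# The metric is applied directly to the prediction/outcome pairs.
score = log_loss(matched["outcome"], matched["prediction"], labels=[0, 1])
```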

This will be demonstrated later within the scope of the final solution to the Avazu CTR Prediction challenge.