Formal Base Model

After the initial exploration, we are going to write down all the pieces of our base pipeline as proper project components.

Updating the Project Code Base

Let's start with the pipeline code produced on-the-fly during our exploration.

Adding TimeExtractor to Source.py

Since the TimeExtractor is a stateless row-wise operator, we can move it straight into avazuctr/source.py, where it gets applied before any splitting:

  1. Open the avazuctr/source.py component.
  2. Add the TimeExtractor operator definition and the relevant imports:
import pandas
from forml import project
from forml.pipeline import payload, wrap
from openschema import kaggle as schema


@wrap.Operator.mapper
@wrap.Actor.apply
def TimeExtractor(features: pandas.DataFrame) -> pandas.DataFrame:
    """Transformer extracting temporal features from the original ``hour`` column."""
    assert "hour" in features.columns, "Missing column: hour"
    time = features["hour"]
    features["dayofweek"] = time.dt.dayofweek
    features["day"] = time.dt.day
    features["hour"] = time.dt.hour  # replacing the original column
    features["month"] = time.dt.month
    return features
  3. Apply the TimeExtractor to the SOURCE:
OUTCOMES = ...   # Keep original
ORDINAL = ...    # Keep original
STATEMENT = ...  # Keep original

# Setting up the source descriptor:
SOURCE = (
    project.Source.query(STATEMENT, OUTCOMES, ordinal=ORDINAL)
    >> payload.ToPandas()
    >> TimeExtractor()  # Applying the temporal feature extraction
)

# Registering the descriptor
project.setup(SOURCE)
  4. SAVE THE avazuctr/source.py FILE!
In [2]:
! git add avazuctr/source.py
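
To see what the operator computes without the ForML wrappers, here is a minimal plain-pandas sketch of the same row-wise logic. The function name `extract_time`, the sample timestamps, and the defensive `.copy()` are assumptions made for illustration; they are not part of the project code.

```python
import pandas


def extract_time(features: pandas.DataFrame) -> pandas.DataFrame:
    """Hypothetical plain-pandas stand-in for the TimeExtractor body."""
    assert "hour" in features.columns, "Missing column: hour"
    features = features.copy()  # assumption: keep the caller's frame untouched
    time = features["hour"]
    features["dayofweek"] = time.dt.dayofweek  # Monday=0 ... Sunday=6
    features["day"] = time.dt.day
    features["hour"] = time.dt.hour  # replaces the original timestamp column
    features["month"] = time.dt.month
    return features


# Illustrative timestamps only (not taken from the dataset).
sample = pandas.DataFrame(
    {"hour": pandas.to_datetime(["2014-10-21 00:00:00", "2014-10-26 13:00:00"])}
)
extracted = extract_time(sample)
print(extracted)
```

Because the logic looks at one row at a time and learns nothing from the data, it yields the same values whether it runs before or after a train/test split, which is why it can live in the source descriptor.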

Adding the Base Model to Pipeline.py

Add the base model pipeline code to avazuctr/pipeline.py:

  1. Open the avazuctr/pipeline.py component.
  2. Add all the required imports:
from forml import project
from forml.pipeline import wrap

with wrap.importer():
    from category_encoders import TargetEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import MinMaxScaler
  3. List the categorical columns and compose the pipeline:
CATEGORICAL_COLUMNS = [
    "C1", "banner_pos", "site_id", "site_domain",
    "site_category", "app_id", "app_domain", "app_category",
    "device_model", "device_type", "device_conn_type",
    "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"
]

PIPELINE = (
    TargetEncoder(cols=CATEGORICAL_COLUMNS)
    >> MinMaxScaler()
    >> LogisticRegression(max_iter=1000, random_state=42)
)

# Registering the pipeline
project.setup(PIPELINE)
  4. SAVE THE avazuctr/pipeline.py FILE!
In [3]:
! git add avazuctr/pipeline.py
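
For intuition about the first two pipeline stages, the sketch below mimics them in plain pandas on a toy frame. It is a deliberately naive approximation: the data and variable names are made up, and the actual `TargetEncoder` is more elaborate (for instance, it smooths category means toward the global mean), so the numbers will not match the library output.

```python
import pandas

# Toy data (hypothetical): one categorical feature and a binary click label.
features = pandas.DataFrame({"site_category": ["a", "a", "b", "b", "b"]})
clicks = pandas.Series([1, 0, 1, 1, 0], name="click")

# Naive target encoding: replace each category with its mean click rate.
category_means = clicks.groupby(features["site_category"]).mean()
encoded = features["site_category"].map(category_means)

# Min-max scaling: rescale the encoded values linearly into [0, 1].
scaled = (encoded - encoded.min()) / (encoded.max() - encoded.min())

print(category_means)
print(scaled)
```

The resulting purely numeric, bounded columns are what the final logistic regression stage is fitted on.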

Perform the Development Evaluation

To confirm that the project components reproduce the implementation from our initial exploration, let's run the project evaluation and compare the reported metric:

In [4]:
! forml project eval
running eval
0.39313604609251457

Bingo!

Adding Unit Test for TimeExtractor

In [5]:
! touch tests/test_source.py

Edit the created test_source.py and implement the unit test:

  1. Open tests/test_source.py.
  2. Add all the required imports:
import pandas
from forml import testing

from avazuctr import source
  3. Provide the TestTimeExtractor unit test implementation:
class TestTimeExtractor(testing.operator(source.TimeExtractor)):
    """Unit testing the stateless TimeExtractor transformer."""

    # Dataset fixtures
    EMPTY = pandas.DataFrame()
    INPUT = pandas.DataFrame({"hour": [
        pandas.Timestamp("2023-02-01 14:12:10"),
        pandas.Timestamp("2023-03-04 06:13:27"),
        pandas.Timestamp("2023-04-10 12:00:00")
    ]})
    EXPECTED = pandas.DataFrame({
        "hour": [14, 6, 12], "dayofweek": [2, 5, 0],
        "day": [1, 4, 10], "month": [2, 3, 4]
    }).astype("int32")

    # Test scenarios
    missing_column = (
        testing.Case().apply(EMPTY).raises(AssertionError, "Missing column: hour")
    )
    valid_extraction = (
        testing.Case().apply(INPUT).returns(EXPECTED, testing.pandas_equals)
    )
  4. SAVE THE test_source.py FILE!
In [6]:
! git add tests/test_source.py

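Let's trigger the project tests. The traceback at the top of the output below is expected: it appears to be the logged AssertionError that the missing_column scenario triggers on purpose, and both tests report ok: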

In [7]:
! forml project test 2>&1 | tail -n 20
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/forml/flow/_code/target/__init__.py", line 56, in __call__
    result = self.execute(*args)
  File "/usr/local/lib/python3.10/site-packages/forml/flow/_code/target/user.py", line 196, in execute
    return self.action(self.builder(), *args)
  File "/usr/local/lib/python3.10/site-packages/forml/flow/_code/target/user.py", line 150, in __call__
    result = actor.apply(*args)
  File "/usr/local/lib/python3.10/site-packages/forml/pipeline/wrap/_actor.py", line 166, in apply
    return self.Apply(*features, **self._kwargs)
  File "/opt/forml/workspace/3-solution/forml-solution-avazuctr/avazuctr/source.py", line 11, in TimeExtractor
    assert "hour" in features.columns, "Missing column: hour"
AssertionError: Missing column: hour
ok
test_valid_extraction (tests.test_source.TestTimeExtractor)
Test of Valid Extraction ... ok

----------------------------------------------------------------------
Ran 2 tests in 3.565s

OK
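
The two scenarios can also be mirrored with plain pandas assertions, which may help when debugging the fixtures outside the ForML test runner. This is only a sketch: `extract_time` is a hypothetical stand-in for the operator body, and `check_dtype=False` is an assumption added here because the integer width returned by the `.dt` accessors can differ between pandas versions.

```python
import pandas


def extract_time(features: pandas.DataFrame) -> pandas.DataFrame:
    """Hypothetical stand-in for the TimeExtractor body (no ForML wrappers)."""
    assert "hour" in features.columns, "Missing column: hour"
    features = features.copy()
    time = features["hour"]
    features["dayofweek"] = time.dt.dayofweek
    features["day"] = time.dt.day
    features["hour"] = time.dt.hour
    features["month"] = time.dt.month
    return features


# Scenario 1: an empty frame must fail with the documented message.
try:
    extract_time(pandas.DataFrame())
except AssertionError as error:
    missing_message = str(error)
else:
    missing_message = None

# Scenario 2: the fixtures from the unit test above.
INPUT = pandas.DataFrame({"hour": [
    pandas.Timestamp("2023-02-01 14:12:10"),
    pandas.Timestamp("2023-03-04 06:13:27"),
    pandas.Timestamp("2023-04-10 12:00:00"),
]})
EXPECTED = pandas.DataFrame({
    "hour": [14, 6, 12], "dayofweek": [2, 5, 0],
    "day": [1, 4, 10], "month": [2, 3, 4],
})
actual = extract_time(INPUT)

# Compare values only; dtype width is ignored (see the note above).
pandas.testing.assert_frame_equal(
    actual[EXPECTED.columns], EXPECTED, check_dtype=False
)
```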