Let's start by setting up the project skeleton and the main components.
We are going to initialize a new ForML project with the following parameters:
- name: forml-solution-avazuctr
- package: avazuctr
- version: 0.1
- requirements: openschema and pandas
! forml project init "forml-solution-avazuctr" \
--version "0.1" \
--package "avazuctr" \
--requirements="openschema==0.7,pandas==2.0.1"
! tree forml-solution-avazuctr
forml-solution-avazuctr
├── avazuctr
│   ├── __init__.py
│   ├── evaluation.py
│   ├── pipeline.py
│   └── source.py
├── pyproject.toml
└── tests
    └── __init__.py

2 directories, 6 files
%cd forml-solution-avazuctr
/opt/forml/workspace/3-solution/forml-solution-avazuctr
We are going to keep the project under version control from the beginning:
! git init .
! git add .
Initialized empty Git repository in /opt/forml/workspace/3-solution/forml-solution-avazuctr/.git/
We use the Openschema catalog to specify the data requirements.
The schema contains the following set of fields (see the schema page for their descriptions):
from openschema import kaggle
print([f.name for f in kaggle.Avazu.schema])
['id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
Let's define the avazuctr/source.py using this schema:
from openschema import kaggle as schema
from forml import project
from forml.pipeline import payload
# Listing the feature columns (excluding id, device_ip, and device_id):
FEATURES = (
schema.Avazu.hour,
schema.Avazu.C1,
schema.Avazu.banner_pos,
schema.Avazu.site_id,
schema.Avazu.site_domain,
schema.Avazu.site_category,
schema.Avazu.app_id,
schema.Avazu.app_domain,
schema.Avazu.app_category,
schema.Avazu.device_model,
schema.Avazu.device_type,
schema.Avazu.device_conn_type,
schema.Avazu.C14,
schema.Avazu.C15,
schema.Avazu.C16,
schema.Avazu.C17,
schema.Avazu.C18,
schema.Avazu.C19,
schema.Avazu.C20,
schema.Avazu.C21,
)
# Target outcomes and the ordinal column:
OUTCOMES = schema.Avazu.click
ORDINAL = schema.Avazu.hour
STATEMENT = (
schema.Avazu.select(*FEATURES)
.orderby(schema.Avazu.hour)
.limit(500000)
)
# Setting up the source descriptor:
SOURCE = (
project.Source.query(STATEMENT, OUTCOMES, ordinal=ORDINAL)
>> payload.ToPandas()
)
# Registering the descriptor
project.setup(SOURCE)
! git add avazuctr/source.py
The generated avazuctr/evaluation.py contains some default evaluation logic (calculating accuracy using a 20% holdout). Let's modify the file, changing the metric to log-loss:
from forml import evaluation, project
from sklearn import metrics
# Using LogLoss on a 20% holdout dataset:
EVALUATION = project.Evaluation(
evaluation.Function(metrics.log_loss),
evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42),
)
# Registering the descriptor
project.setup(EVALUATION)
! git add avazuctr/evaluation.py
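As a quick aside, the log-loss metric we just configured penalizes confident wrong predictions far more heavily than uncertain ones. A minimal standalone illustration (toy numbers, not project data; implemented with plain math, though sklearn.metrics.log_loss computes the same quantity):

```python
import math

def logloss(y_true, y_prob):
    """Mean negative log-likelihood over binary labels and class-1 probabilities."""
    return -sum(
        math.log(p if y else 1 - p) for y, p in zip(y_true, y_prob)
    ) / len(y_true)

y = [0, 0, 1, 1]
print(round(logloss(y, [0.1, 0.2, 0.8, 0.9]), 3))   # well-calibrated -> 0.164
print(round(logloss(y, [0.1, 0.2, 0.8, 0.01]), 3))  # one confident mistake -> 1.289
```

Note how a single near-certain miss drags the average loss up by nearly an order of magnitude.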
We can now interactively use our project skeleton to peek into the data:
from forml import project
PROJECT = project.open(path='.', package='avazuctr')
PROJECT.launcher.apply()
hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-10-31 00:00:00 | 1005 | 0 | 235ba823 | f6ebf28e | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 0eb711ec | 1 | 0 | 8330 | 320 | 50 | 761 | 3 | 175 | 100075 | 23 |
1 | 2014-10-31 00:00:00 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | ecb851b2 | 1 | 0 | 22676 | 320 | 50 | 2616 | 0 | 35 | 100083 | 51 |
2 | 2014-10-31 00:00:00 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | 1f0bc64f | 1 | 0 | 22676 | 320 | 50 | 2616 | 0 | 35 | 100083 | 51 |
3 | 2014-10-31 00:00:00 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 51cedd4e | aefc06bd | 0f2161f8 | 542422a7 | 1 | 0 | 18648 | 320 | 50 | 1092 | 3 | 809 | 100156 | 61 |
4 | 2014-10-31 00:00:00 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9c13b419 | 2347f47a | f95efa07 | 1f0bc64f | 1 | 0 | 23160 | 320 | 50 | 2667 | 0 | 47 | -1 | 221 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
499995 | 2014-10-31 04:00:00 | 1005 | 1 | b7e9786d | b12b9f85 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 0eb711ec | 1 | 0 | 22681 | 320 | 50 | 2528 | 0 | 167 | -1 | 221 |
499996 | 2014-10-31 04:00:00 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9c13b419 | 2347f47a | f95efa07 | 7abbbd5c | 1 | 0 | 23728 | 320 | 50 | 2717 | 2 | 47 | 100233 | 23 |
499997 | 2014-10-31 04:00:00 | 1005 | 0 | 5b08c53b | 7687a86e | 3e814130 | ecad2386 | 7801e8d9 | 07d7df22 | 8a4875bd | 1 | 0 | 20093 | 300 | 250 | 2295 | 2 | 35 | 100075 | 23 |
499998 | 2014-10-31 04:00:00 | 1002 | 0 | 887a4754 | e3d9ca35 | 50e219e0 | ecad2386 | 7801e8d9 | 07d7df22 | fc10a0d3 | 0 | 0 | 22701 | 320 | 50 | 2624 | 0 | 35 | -1 | 221 |
499999 | 2014-10-31 04:00:00 | 1005 | 1 | 5ee41ff2 | 17d996e6 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 8a4875bd | 1 | 0 | 22523 | 320 | 50 | 2599 | 3 | 167 | -1 | 23 |
500000 rows × 20 columns
Launching it in train mode allows us to explore the trainset:
trainset = PROJECT.launcher.train()
trainset.features.isnull().sum()
hour                0
C1                  0
banner_pos          0
site_id             0
site_domain         0
site_category       0
app_id              0
app_domain          0
app_category        0
device_model        0
device_type         0
device_conn_type    0
C14                 0
C15                 0
C16                 0
C17                 0
C18                 0
C19                 0
C20                 0
C21                 0
dtype: int64
trainset.features.describe()
hour | C1 | banner_pos | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 500000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 | 500000.000000 |
mean | 2014-10-21 01:19:16.514400256 | 1005.033822 | 0.217334 | 1.036556 | 0.223266 | 18189.274888 | 319.185264 | 56.551532 | 2031.249854 | 1.109050 | 201.353280 | 42643.678638 | 74.032270 |
min | 2014-10-21 00:00:00 | 1001.000000 | 0.000000 | 0.000000 | 0.000000 | 375.000000 | 120.000000 | 20.000000 | 112.000000 | 0.000000 | 33.000000 | -1.000000 | 13.000000 |
25% | 2014-10-21 01:00:00 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 15706.000000 | 320.000000 | 50.000000 | 1722.000000 | 0.000000 | 35.000000 | -1.000000 | 48.000000 |
50% | 2014-10-21 01:00:00 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 18993.000000 | 320.000000 | 50.000000 | 2161.000000 | 0.000000 | 39.000000 | -1.000000 | 61.000000 |
75% | 2014-10-21 02:00:00 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 20632.000000 | 320.000000 | 50.000000 | 2351.000000 | 3.000000 | 297.000000 | 100084.000000 | 79.000000 |
max | 2014-10-21 03:00:00 | 1012.000000 | 7.000000 | 5.000000 | 5.000000 | 21705.000000 | 1024.000000 | 1024.000000 | 2497.000000 | 3.000000 | 1835.000000 | 100248.000000 | 195.000000 |
std | NaN | 0.966157 | 0.443991 | 0.489265 | 0.669214 | 3349.867073 | 21.013136 | 36.144704 | 417.909361 | 1.278022 | 273.283478 | 49498.049418 | 40.816648 |
trainset.labels.value_counts()
click
0    417919
1     82081
Name: count, dtype: int64
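The labels are heavily imbalanced (roughly 16.4% positive). A useful sanity check (a back-of-envelope sketch, not part of the project code) is the log-loss of a constant predictor that always outputs the class prior — any trained model should comfortably beat this number:

```python
import math

positives, total = 82_081, 500_000
p = positives / total  # class prior, ~0.1642

# Log-loss of always predicting the prior probability:
baseline = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(round(p, 4), round(baseline, 4))  # 0.1642 0.4465
```

Keep this ~0.4465 figure in mind when interpreting the evaluation score at the end of the section.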
import pandas
from forml.pipeline import wrap
@wrap.Operator.mapper
@wrap.Actor.apply
def TimeExtractor(features: pandas.DataFrame) -> pandas.DataFrame:
"""Transformer extracting temporal features from the original ``hour`` column."""
assert 'hour' in features.columns, 'Missing column: hour'
time = features['hour']
features['dayofweek'] = time.dt.dayofweek
features['day'] = time.dt.day
features['hour'] = time.dt.hour # replacing the original column
features['month'] = time.dt.month
return features
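Standalone, the pandas datetime accessor the actor relies on behaves like this (a toy single-row check, separate from the project data):

```python
import pandas

# 2014-10-31 04:00 appears in the dataset; it was a Friday (dayofweek 4).
time = pandas.Series(pandas.to_datetime(['2014-10-31 04:00:00']))
print(time.dt.dayofweek[0], time.dt.day[0], time.dt.hour[0], time.dt.month[0])
# 4 31 4 10
```

These are exactly the dayofweek/day/hour/month values visible in the transformed output below.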
SOURCE = PROJECT.components.source
SOURCE.bind(TimeExtractor()).launcher.apply()
hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | ... | C15 | C16 | C17 | C18 | C19 | C20 | C21 | dayofweek | day | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1005 | 0 | 235ba823 | f6ebf28e | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 0eb711ec | ... | 320 | 50 | 761 | 3 | 175 | 100075 | 23 | 4 | 31 | 10 |
1 | 0 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | ecb851b2 | ... | 320 | 50 | 2616 | 0 | 35 | 100083 | 51 | 4 | 31 | 10 |
2 | 0 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | 1f0bc64f | ... | 320 | 50 | 2616 | 0 | 35 | 100083 | 51 | 4 | 31 | 10 |
3 | 0 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 51cedd4e | aefc06bd | 0f2161f8 | 542422a7 | ... | 320 | 50 | 1092 | 3 | 809 | 100156 | 61 | 4 | 31 | 10 |
4 | 0 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9c13b419 | 2347f47a | f95efa07 | 1f0bc64f | ... | 320 | 50 | 2667 | 0 | 47 | -1 | 221 | 4 | 31 | 10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
499995 | 4 | 1005 | 1 | b7e9786d | b12b9f85 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 0eb711ec | ... | 320 | 50 | 2528 | 0 | 167 | -1 | 221 | 4 | 31 | 10 |
499996 | 4 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9c13b419 | 2347f47a | f95efa07 | 7abbbd5c | ... | 320 | 50 | 2717 | 2 | 47 | 100233 | 23 | 4 | 31 | 10 |
499997 | 4 | 1005 | 0 | 5b08c53b | 7687a86e | 3e814130 | ecad2386 | 7801e8d9 | 07d7df22 | 8a4875bd | ... | 300 | 250 | 2295 | 2 | 35 | 100075 | 23 | 4 | 31 | 10 |
499998 | 4 | 1002 | 0 | 887a4754 | e3d9ca35 | 50e219e0 | ecad2386 | 7801e8d9 | 07d7df22 | fc10a0d3 | ... | 320 | 50 | 2624 | 0 | 35 | -1 | 221 | 4 | 31 | 10 |
499999 | 4 | 1005 | 1 | 5ee41ff2 | 17d996e6 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 8a4875bd | ... | 320 | 50 | 2599 | 3 | 167 | -1 | 23 | 4 | 31 | 10 |
500000 rows × 23 columns
Let's apply the target encoding technique to all the categorical columns. We can use the TargetEncoder implementation from the Category-encoders package. As new dependencies, we add category-encoders==2.6.0 to the pyproject.toml together with scikit-learn==1.2.2, which we are going to need in the next step:
[project]
name = "forml-solution-avazuctr"
version = "0.1"
dependencies = [
"category-encoders==2.6.0",
"forml==0.93",
"openschema==0.7",
"pandas==2.0.1",
"scikit-learn==1.2.2"
]
[tool.forml]
package = "avazuctr"
! git add pyproject.toml
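Before wiring the real TargetEncoder in, the essence of target encoding can be sketched with plain pandas on a toy column (this ignores the smoothing and leakage protection the Category-encoders implementation adds): each category is replaced by the mean of the target within that category.

```python
import pandas

toy = pandas.DataFrame({
    'site_category': ['news', 'news', 'games', 'games', 'games'],
    'click':         [0,      1,      0,       0,       1],
})

# Mean click rate per category becomes the encoded value:
encoding = toy.groupby('site_category')['click'].mean()
toy['encoded'] = toy['site_category'].map(encoding)
print(toy['encoded'].tolist())  # news -> 0.5, games -> 1/3
```

This is why the encoded feature values in the output below are all small fractions in the neighborhood of the ~0.164 global click rate.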
Now we can add the encoder into our pipeline:
with wrap.importer():
from category_encoders import TargetEncoder
CATEGORICAL_COLUMNS = [
"C1", "banner_pos", "site_id", "site_domain",
"site_category", "app_id", "app_domain", "app_category",
"device_model", "device_type", "device_conn_type",
"C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"
]
SOURCE.bind(
TargetEncoder(cols=CATEGORICAL_COLUMNS)
).launcher.train().features
hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.225080 | 0.164528 | 0.124586 | 0.167300 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.171947 | 0.208514 |
1 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.236538 | 0.164528 | 0.169448 | 0.216279 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
2 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.131059 | 0.164528 | 0.169448 | 0.216279 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
3 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.293158 | 0.164528 | 0.169448 | 0.167300 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
4 | 2014-10-21 00:00:00 | 0.164949 | 0.195663 | 0.036642 | 0.036642 | 0.036364 | 0.196316 | 0.190123 | 0.196119 | 0.227891 | 0.164528 | 0.169448 | 0.080279 | 0.153985 | 0.154188 | 0.074237 | 0.166794 | 0.166436 | 0.171947 | 0.086167 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
499995 | 2014-10-21 03:00:00 | 0.164949 | 0.155877 | 0.012195 | 0.012195 | 0.030204 | 0.196316 | 0.190123 | 0.196119 | 0.187754 | 0.164528 | 0.169448 | 0.074220 | 0.153985 | 0.154188 | 0.074220 | 0.109077 | 0.166436 | 0.150633 | 0.073100 |
499996 | 2014-10-21 03:00:00 | 0.164949 | 0.195663 | 0.442296 | 0.431840 | 0.195131 | 0.196316 | 0.190123 | 0.196119 | 0.127723 | 0.164528 | 0.169448 | 0.278302 | 0.153985 | 0.154188 | 0.298913 | 0.109077 | 0.230515 | 0.114082 | 0.217605 |
499997 | 2014-10-21 03:00:00 | 0.164949 | 0.195663 | 0.091357 | 0.092482 | 0.195131 | 0.196316 | 0.190123 | 0.196119 | 0.224570 | 0.164528 | 0.169448 | 0.123115 | 0.153985 | 0.154188 | 0.125073 | 0.109077 | 0.141789 | 0.177072 | 0.217605 |
499998 | 2014-10-21 03:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.200614 | 0.164528 | 0.169448 | 0.218173 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.171947 | 0.208514 |
499999 | 2014-10-21 03:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.216730 | 0.164528 | 0.169448 | 0.211592 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
500000 rows × 20 columns
Let's now append the MinMaxScaler and the LogisticRegression classifier to the pipeline:
with wrap.importer():
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
PIPELINE = (
TimeExtractor()
>> TargetEncoder(cols=CATEGORICAL_COLUMNS)
>> MinMaxScaler()
>> LogisticRegression(max_iter=1000, random_state=42)
)
SOURCE.bind(PIPELINE).launcher(runner="graphviz").train()
Using our evaluation definition from avazuctr/evaluation.py, we get the log-loss of our baseline model:
SOURCE.bind(PIPELINE, evaluation=PROJECT.components.evaluation).launcher.eval()
0.39313604609251457