Tune Documentation
Introduction
Tune is an abstraction layer for general parameter tuning. It is built on Fugue, so it can seamlessly run on any backend supported by Fugue, such as Spark, Dask, or the local machine.
Installation
pip install tune
It's recommended to also install Scikit-Learn (for tuning all compatible models) and Hyperopt (to enable Bayesian optimization):
pip install tune[hyperopt,sklearn]
Quick Start
To quickly start, please go through these tutorials on Kaggle:
Non-iterative Problems, such as Scikit-Learn model tuning
Iterative Problems, such as Keras model tuning
Design Philosophy
Tune does not follow Scikit-Learn's model selection APIs and does not provide a distributed backend for them. We believe that parameter tuning is a general problem that is not only for machine learning, so our abstractions are built from the ground up: the lower-level APIs do not assume the objective is a machine learning model, while the higher-level APIs are dedicated to solving specific problems, such as Scikit-Learn compatible model tuning and Keras model tuning.
Although we didn't base our solution on HyperOpt, Optuna, Ray Tune, Nevergrad or similar frameworks, we are truly inspired by these wonderful solutions and their designs. We have also integrated with many of them for deeper-level optimizations.
Tuning problems are never easy. Here are our goals:
Provide the simplest and most intuitive APIs for major tuning cases. We always start from real tuning cases, figure out the minimal requirements for each of them, and then determine the layers of abstraction. Read this tutorial to see how minimal the interfaces can be.
Be scale agnostic and platform agnostic. We want you to worry less about distributed computing and just focus on the tuning logic itself. Built on Fugue, Tune lets you develop your tuning process iteratively: you can test with small spaces on a local machine, then switch to larger spaces and run distributedly with no code change (see the sketch after this list). This can effectively save time and cost, and make the process fun and rewarding. And to run any tuning logic distributedly, you only need the core framework itself (Spark, Dask, etc.); you do not need a database, a queue service or even an embedded cluster.
Be highly extendable and flexible at the lower levels. For example:
you can extend at the Fugue level, for example by creating an execution engine for Prefect to run the tuning jobs as a Prefect workflow
you can integrate third party optimizers and use Tune just as a distributed orchestrator.
you can start external instances (e.g. EC2 instances) for different training subtasks to fully utilize your cloud
you can combine it with distributed training, as long as you have enough compute resources
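Here is a minimal sketch of what the no-code-change scaling looks like in practice. The objective and space below are made up for illustration; only the execution_engine argument changes between local and distributed runs:

from tune import Rand, Space, suggest_for_noniterative_objective

def objective(a: float, b: float) -> float:
    # a hypothetical objective; smaller is better
    return a ** 2 + (b - 1) ** 2

space = Space(a=Rand(-5, 5), b=Rand(-5, 5)).sample(100, seed=0)

# runs locally with the default NativeExecutionEngine
results = suggest_for_noniterative_objective(objective, space)

# to run the same logic distributedly on Spark with no other code change,
# pass a live SparkSession (or another Fugue execution engine):
# suggest_for_noniterative_objective(objective, space, execution_engine=spark)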
Current Focuses
Here are our current focuses:
A flexible space design that can describe a hybrid space of grid search, random search and second-level optimization such as Bayesian optimization
Integrate with 3rd party tuning frameworks. We have integrated HyperOpt and Optuna, and Nevergrad is on the way.
Create generalized and distributed versions of Successive Halving, Hyperband and Asynchronous Successive Halving.
Collaboration
We are looking for collaborators. If you are interested, please let us know.
Please join our Slack channel.
Top Level API Reference
The Space Concept
Space
- class Space(*args, **kwargs)[source]
Bases:
object
Search space object
Important
Please read Space Tutorial.
- Parameters
kwargs (Any) – parameters in the search space
Space(a=1, b=1)  # static space
Space(a=1, b=Grid(1,2), c=Grid("a", "b"))  # grid search
Space(a=1, b=Grid(1,2), c=Rand(0, 1))  # grid search + level 2 search
Space(a=1, b=Grid(1,2), c=Rand(0, 1)).sample(10, seed=0)  # grid + random search

# union
Space(a=1, b=Grid(2,3)) + Space(b=Rand(1,5)).sample(10)

# cross product
Space(a=1, b=Grid(2,3)) * Space(c=Rand(1,5), d=Grid("a","b"))

# combo (grid + random + level 2)
space1 = Space(a=1, b=Grid(2,4))
space2 = Space(b=RandInt(10, 20))
space3 = Space(c=Rand(0,1)).sample(10)
space = (space1 + space2) * space3
assert Space(a=1, b=Rand(0,1)).has_stochastic
assert not Space(a=1, b=Rand(0,1)).sample(10).has_stochastic
assert not Space(a=1, b=Grid(0,1)).has_stochastic
assert not Space(a=1, b=1).has_stochastic

# get all configurations
space = Space(a=Grid(2,4), b=Rand(0,1)).sample(100)
for conf in space:
    print(conf)
all_conf = list(space)
- property has_stochastic
Whether the space contains any
StochasticExpression
- sample(n, seed=None)[source]
Draw random samples from the current space. Please read Space Tutorial.
- Parameters
n (int) – number of samples to draw
seed (Optional[Any]) – random seed, defaults to None
- Returns
a new Space containing all samples
- Return type
Note
Sampling only applies to
StochasticExpression
. If
has_stochastic()
is False, it will return the original space. After sampling, no
StochasticExpression
will exist in the new space.
TuningParametersTemplate
- class TuningParametersTemplate(raw)[source]
Bases:
object
Parameter template to extract tuning parameter expressions from a nested data structure
- Parameters
raw (Dict[str, Any]) – the dictionary of input parameters.
Note
Please use
to_template()
to initialize this class.

# common cases
to_template(dict(a=1, b=1))
to_template(dict(a=Rand(0, 1), b=1))

# expressions may nest in dicts or arrays
template = to_template(
    dict(a=dict(x1=Rand(0, 1), x2=Rand(3, 4)), b=[Grid("a", "b")]))

assert [Rand(0, 1), Rand(3, 4), Grid("a", "b")] == template.params
assert dict(
    p0=Rand(0, 1), p1=Rand(3, 4), p2=Grid("a", "b")
) == template.params_dict

assert dict(a=dict(x1=1, x2=3), b=["a"]) == template.fill([1, 3, "a"])
assert dict(a=dict(x1=1, x2=3), b=["a"]) == template.fill_dict(
    dict(p2="a", p1=3, p0=1)
)
- concat(other)[source]
Concatenate with another template and generate a new template.
Note
The other template must not have any key that exists in this template, otherwise
ValueError
will be raised
- Returns
the merged template
- Parameters
other (tune.concepts.space.parameters.TuningParametersTemplate) –
- Return type
- static decode(data)[source]
Retrieve the template from a base64 string
- Parameters
data (str) –
- Return type
- property empty: bool
Whether the template contains any tuning expression
- encode()[source]
Convert the template to a base64 string
- Return type
str
- fill(params)[source]
Fill the original data structure with values
- Parameters
params (List[Any]) – the list of values to be filled into the original data structure, in depth-first order
copy – whether to return deeply copied parameters, defaults to False
- Returns
the original data structure filled with values
- Return type
Dict[str, Any]
- fill_dict(params)[source]
Fill the original data structure with dictionary of values
- Parameters
params (Dict[str, Any]) – the dictionary of values to be filled into the original data structure, keys must be p0, p1, p2, …
copy – whether to return deeply copied parameters, defaults to False
- Returns
the original data structure filled with values
- Return type
Dict[str, Any]
- property has_grid: bool
Whether the template contains grid expressions
- property has_stochastic: bool
Whether the template contains stochastic expressions
- property params: List[tune.concepts.space.parameters.TuningParameterExpression]
Get all tuning parameter expressions in depth-first order
- property params_dict: Dict[str, tune.concepts.space.parameters.TuningParameterExpression]
Get all tuning parameter expressions in depth-first order, with corresponding generated keys p0, p1, p2, …
- product_grid()[source]
Cross product of all grid parameters
- Yield
new templates with the grid parameters filled
- Return type
Iterable[tune.concepts.space.parameters.TuningParametersTemplate]
assert [
    dict(a=1, b=Rand(0, 1)), dict(a=2, b=Rand(0, 1))
] == list(to_template(dict(a=Grid(1, 2), b=Rand(0, 1))).product_grid())
- sample(n, seed=None)[source]
Sample all stochastic parameters
- Parameters
n (int) – number of samples, must be a positive integer
seed (Optional[Any]) – random seed, defaults to None. It takes effect only when it is not None.
- Yield
new templates with the stochastic parameters filled
- Return type
Iterable[tune.concepts.space.parameters.TuningParametersTemplate]
assert [
    dict(a=1.1, b=Grid(0, 1)), dict(a=1.5, b=Grid(0, 1))
] == list(to_template(dict(a=Rand(1, 2), b=Grid(0, 1))).sample(2, 0))
- property simple_value: Dict[str, Any]
If the template contains no tuning expression, it is simple, and this returns the parameters dictionary; otherwise
ValueError
will be raised
- property template: Dict[str, Any]
The template dictionary, all tuning expressions will be replaced by
None
Grid
- class Grid(*args)[source]
Bases:
tune.concepts.space.parameters.TuningParameterExpression
Grid search, every value will be used. Please read Space Tutorial.
- Parameters
args (Any) – values for the grid search
Choice
- class Choice(*args)[source]
Bases:
tune.concepts.space.parameters.StochasticExpression
A random choice of values. Please read Space Tutorial.
- Parameters
args (Any) – values to choose from
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
Any
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- property values: List[Any]
values to choose from
TransitionChoice
- class TransitionChoice(*args)[source]
Bases:
tune.concepts.space.parameters.Choice
An ordered random choice of values. Please read Space Tutorial.
- Parameters
args (Any) – values to choose from
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
Rand
- class Rand(low, high, q=None, log=False, include_high=True)[source]
Bases:
tune.concepts.space.parameters.RandBase
Continuous uniform random variables. Please read Space Tutorial.
- Parameters
low (float) – range low bound (inclusive)
high (float) – range high bound (exclusive)
q (Optional[float]) – step between adjacent values, if set, the value will be rounded using
q
, defaults to None
log (bool) – whether to do uniform sampling in log space, defaults to False. If True,
low
must be positive, and lower values get a higher chance of being sampled
include_high (bool) –
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
float
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
RandInt
- class RandInt(low, high, q=1, log=False, include_high=True)[source]
Bases:
tune.concepts.space.parameters.RandBase
Uniform distributed random integer values. Please read Space Tutorial.
- Parameters
low (int) – range low bound (inclusive)
high (int) – range high bound (exclusive)
log (bool) – whether to do uniform sampling in log space, defaults to False. If True,
low
must be >= 1, and lower values get a higher chance of being sampled
q (int) –
include_high (bool) –
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
float
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
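To recap how these expressions compose, here is a small illustrative sketch (parameter names are made up):

from tune import Choice, Grid, Rand, RandInt, Space

space = Space(
    model=Choice("lr", "rf"),       # random choice (level 2 search)
    depth=Grid(2, 4),               # grid: every value will be used
    lr=Rand(0.001, 0.1, log=True),  # continuous value, sampled in log space
    n=RandInt(10, 100),             # uniform random integer
).sample(5, seed=0)                 # materialize the stochastic expressions

for conf in space:  # 2 grid values x 5 samples = 10 configurations
    print(conf)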
General Non-Iterative Problems
- suggest_for_noniterative_objective(objective, space, df=None, df_name='__tune__df_', temp_path='', partition_keys=None, top_n=1, local_optimizer=None, logger=None, monitor=None, stopper=None, stop_check_interval=None, distributed=None, shuffle_candidates=True, execution_engine=None, execution_engine_conf=None)[source]
Given a non-iterative
objective
, a
space
and an (optional) dataframe, suggest the best parameter combinations.
Important
Please read Non-Iterative Tuning Guide
- Parameters
objective (Any) – a simple python function or
NonIterativeObjectiveFunc
compatible object, please read Non-Iterative Objective Explained
space (tune.concepts.space.spaces.Space) – search space, please read Space Tutorial
df (Optional[Any]) – Pandas, Spark, Dask or any dataframe that can be converted to Fugue
DataFrame
, defaults to None
df_name (str) – dataframe name, defaults to the value of
TUNE_DATASET_DF_DEFAULT_NAME
temp_path (str) – temp path for serialized dataframe partitions. It can be empty if you preset using
TUNE_OBJECT_FACTORY.
set_temp_path()
. For details, read TuneDataset Tutorial, defaults to “”
partition_keys (Optional[List[str]]) – partition keys for
df
, defaults to None. For details, please read TuneDataset Tutorial
top_n (int) – number of best results to return, defaults to 1. If <=0, all results will be returned
local_optimizer (Optional[Any]) – an object that can be converted to
NonIterativeObjectiveLocalOptimizer
, please read Non-Iterative Optimizers, defaults to None
logger (Optional[Any]) –
MetricLogger
object or a function producing it, defaults to None
monitor (Optional[Any]) – realtime monitor, defaults to None. Read Monitoring Guide
stopper (Optional[Any]) – early stopper, defaults to None. Read Early Stopping Guide
stop_check_interval (Optional[Any]) – an object that can be converted to timedelta, defaults to None. For details, read
to_timedelta()
distributed (Optional[bool]) – whether to use the execution engine to run different trials distributedly, defaults to None. If None, it is equivalent to True.
shuffle_candidates (bool) – whether to shuffle the candidate configurations, defaults to True. This has no effect on the final result.
execution_engine (Optional[Any]) – Fugue
ExecutionEngine
like object, defaults to None. If None,
NativeExecutionEngine
will be used and the task will run on the local machine.
execution_engine_conf (Optional[Any]) – Parameters like object, defaults to None
- Returns
a list of best results
- Return type
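For illustration, a minimal sketch of calling this function with a simple objective (the function and space are made up; the results are TrialReport objects):

from tune import Rand, Space, suggest_for_noniterative_objective

def objective(a: float, b: float) -> float:
    # smaller is better
    return a ** 2 + b ** 2

space = Space(a=Rand(-1, 1), b=Rand(-1, 1)).sample(20, seed=0)
reports = suggest_for_noniterative_objective(objective, space, top_n=3)
for report in reports:
    print(report.sort_metric, report.params)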
- optimize_noniterative(objective, dataset, optimizer=None, distributed=None, logger=None, monitor=None, stopper=None, stop_check_interval=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
optimizer (Optional[Any]) –
distributed (Optional[bool]) –
logger (Optional[Any]) –
monitor (Optional[Any]) –
stopper (Optional[Any]) –
stop_check_interval (Optional[Any]) –
- Return type
Level 2 Optimizers
Hyperopt
- class HyperoptLocalOptimizer(max_iter, seed=0, kwargs_func=None)[source]
Bases:
tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer
- Parameters
max_iter (int) –
seed (int) –
kwargs_func (Optional[Callable[[tune.noniterative.objective.NonIterativeObjectiveFunc, tune.concepts.flow.trial.Trial], Dict[str, Any]]]) –
- run(func, trial, logger)[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
logger (Any) –
- Return type
Optuna
- class OptunaLocalOptimizer(max_iter, create_study=None)[source]
Bases:
tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer
- Parameters
max_iter (int) –
create_study (Optional[Callable[[], optuna.study.study.Study]]) –
- run(func, trial, logger)[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
logger (Any) –
- Return type
General Iterative Problems
Successive Halving
- suggest_by_sha(objective, space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
objective (Any) –
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
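The signature does not spell out the plan format; based on the List[Tuple[float, int]] type, a hedged reading is that each tuple is (budget for the rung, number of trials kept after the rung):

# a hedged sketch of a successive halving plan, assuming each tuple is
# (budget spent in the rung, number of trials kept after the rung)
plan = [
    (1.0, 16),  # run all candidates with budget 1.0, keep the best 16
    (2.0, 8),   # continue those 16 with budget 2.0, keep the best 8
    (4.0, 4),   # continue those 8 with budget 4.0, keep the best 4
]
# reports = suggest_by_sha(objective, space, plan=plan, top_n=1)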
- optimize_by_sha(objective, dataset, plan, checkpoint_path='', distributed=None, monitor=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
plan (List[Tuple[float, int]]) –
checkpoint_path (str) –
distributed (Optional[bool]) –
monitor (Optional[Any]) –
- Return type
Hyperband
- suggest_by_hyperband(objective, space, plans, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
objective (Any) –
space (tune.concepts.space.spaces.Space) –
plans (List[List[Tuple[float, int]]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
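Here plans is a list of successive halving plans, one per Hyperband bracket; a hedged sketch of the nested structure:

# a hedged sketch: each inner list is one successive halving plan (bracket)
plans = [
    [(1.0, 27), (3.0, 9), (9.0, 3), (27.0, 1)],  # aggressive early halving
    [(3.0, 9), (9.0, 3), (27.0, 1)],             # start each trial with more budget
    [(9.0, 3), (27.0, 1)],
]
# reports = suggest_by_hyperband(objective, space, plans=plans, top_n=1)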
- optimize_by_hyperband(objective, dataset, plans, checkpoint_path='', distributed=None, monitor=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
plans (List[List[Tuple[float, int]]]) –
checkpoint_path (str) –
distributed (Optional[bool]) –
monitor (Optional[Any]) –
- Return type
Continuous ASHA
- suggest_by_continuous_asha(objective, space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
objective (Any) –
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- optimize_by_continuous_asha(objective, dataset, plan, checkpoint_path='', always_checkpoint=False, study_early_stop=None, trial_early_stop=None, monitor=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
plan (List[Tuple[float, int]]) –
checkpoint_path (str) –
always_checkpoint (bool) –
study_early_stop (Optional[Callable[[List[Any], List[tune.iterative.asha.RungHeap]], bool]]) –
trial_early_stop (Optional[Callable[[tune.concepts.flow.report.TrialReport, List[tune.concepts.flow.report.TrialReport], List[tune.iterative.asha.RungHeap]], bool]]) –
monitor (Optional[Any]) –
- Return type
For Scikit-Learn
- sk_space(model, **params)[source]
- Parameters
model (str) –
params (Dict[str, Any]) –
- Return type
- suggest_sk_models_by_cv(space, train_df, scoring, cv=5, temp_path='', feature_prefix='', label_col='label', save_model=False, partition_keys=None, top_n=1, local_optimizer=None, monitor=None, stopper=None, stop_check_interval=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
train_df (Any) –
scoring (str) –
cv (int) –
temp_path (str) –
feature_prefix (str) –
label_col (str) –
save_model (bool) –
partition_keys (Optional[List[str]]) –
top_n (int) –
local_optimizer (Optional[tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer]) –
monitor (Optional[Any]) –
stopper (Optional[Any]) –
stop_check_interval (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
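A hedged sketch of tuning Scikit-Learn models by cross validation. The import path and the model-path-string convention are assumptions based on this reference's layout; train_df must contain the feature columns plus the label column:

from tune import Grid, Rand
from tune_sklearn import sk_space, suggest_sk_models_by_cv  # import path is an assumption

# models are assumed to be referenced by their full class path strings
space = (
    sk_space("sklearn.linear_model.Ridge", alpha=Rand(0.1, 10, log=True)).sample(5, seed=0)
    + sk_space("sklearn.ensemble.RandomForestRegressor", n_estimators=Grid(50, 100))
)

# train_df needs feature columns plus a label column ("label" by default)
# results = suggest_sk_models_by_cv(
#     space, train_df, scoring="neg_mean_absolute_error", cv=5, top_n=2)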
- suggest_sk_models(space, train_df, test_df, scoring, temp_path='', feature_prefix='', label_col='label', save_model=False, partition_keys=None, top_n=1, local_optimizer=None, monitor=None, stopper=None, stop_check_interval=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
train_df (Any) –
test_df (Any) –
scoring (str) –
temp_path (str) –
feature_prefix (str) –
label_col (str) –
save_model (bool) –
partition_keys (Optional[List[str]]) –
top_n (int) –
local_optimizer (Optional[tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer]) –
monitor (Optional[Any]) –
stopper (Optional[Any]) –
stop_check_interval (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
For Tensorflow Keras
- class KerasTrainingSpec(params, dfs)[source]
Bases:
object
- Parameters
params (Any) –
dfs (Dict[str, Any]) –
- compile_model(**add_kwargs)[source]
- Parameters
add_kwargs (Any) –
- Return type
keras.engine.training.Model
- compute_sort_metric(**add_kwargs)[source]
- Parameters
add_kwargs (Any) –
- Return type
float
- property dfs: Dict[str, Any]
- finalize()[source]
- Return type
None
- fit(**add_kwargs)[source]
- Parameters
add_kwargs (Any) –
- Return type
keras.callbacks.History
- generate_sort_metric(metric)[source]
- Parameters
metric (float) –
- Return type
float
- get_compile_params()[source]
- Return type
Dict[str, Any]
- get_fit_metric(history)[source]
- Parameters
history (keras.callbacks.History) –
- Return type
float
- get_fit_params()[source]
- Return type
Tuple[List[Any], Dict[str, Any]]
- get_model()[source]
- Return type
keras.engine.training.Model
- load_checkpoint(fs, model)[source]
- Parameters
fs (fs.base.FS) –
model (keras.engine.training.Model) –
- Return type
None
- property params: tune.concepts.space.parameters.TuningParametersTemplate
- save_checkpoint(fs, model)[source]
- Parameters
fs (fs.base.FS) –
model (keras.engine.training.Model) –
- Return type
None
- keras_space(model, **params)[source]
- Parameters
model (Any) –
params (Any) –
- Return type
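A hedged sketch of how these pieces are meant to fit together, kept entirely in comments because it depends on a user-defined class:

# MyKerasSpec is hypothetical: a KerasTrainingSpec subclass implementing
# get_model, get_compile_params, get_fit_params, get_fit_metric and the
# other methods documented above.
#
# from tune import Rand, RandInt
# space = keras_space(MyKerasSpec, lr=Rand(1e-4, 1e-2, log=True),
#                     units=RandInt(32, 128)).sample(8, seed=0)
# reports = suggest_keras_models_by_sha(
#     space, plan=[(1.0, 8), (2.0, 4), (4.0, 1)], top_n=1)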
- suggest_keras_models_by_continuous_asha(space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_keras_models_by_hyperband(space, plans, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
plans (List[List[Tuple[float, int]]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_keras_models_by_sha(space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
Complete API Reference
tune
tune.api
tune.api.factory
- class TuneObjectFactory[source]
Bases:
object
- make_dataset(dag, dataset, df=None, df_name='__tune__df_', test_df=None, test_df_name='__tune__df__validation_', partition_keys=None, shuffle=True, temp_path='')[source]
- Parameters
dataset (Any) –
df (Optional[Any]) –
df_name (str) –
test_df (Optional[Any]) –
test_df_name (str) –
partition_keys (Optional[List[str]]) –
shuffle (bool) –
temp_path (str) –
- Return type
tune.api.optimize
- optimize_by_continuous_asha(objective, dataset, plan, checkpoint_path='', always_checkpoint=False, study_early_stop=None, trial_early_stop=None, monitor=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
plan (List[Tuple[float, int]]) –
checkpoint_path (str) –
always_checkpoint (bool) –
study_early_stop (Optional[Callable[[List[Any], List[tune.iterative.asha.RungHeap]], bool]]) –
trial_early_stop (Optional[Callable[[tune.concepts.flow.report.TrialReport, List[tune.concepts.flow.report.TrialReport], List[tune.iterative.asha.RungHeap]], bool]]) –
monitor (Optional[Any]) –
- Return type
- optimize_by_hyperband(objective, dataset, plans, checkpoint_path='', distributed=None, monitor=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
plans (List[List[Tuple[float, int]]]) –
checkpoint_path (str) –
distributed (Optional[bool]) –
monitor (Optional[Any]) –
- Return type
- optimize_by_sha(objective, dataset, plan, checkpoint_path='', distributed=None, monitor=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
plan (List[Tuple[float, int]]) –
checkpoint_path (str) –
distributed (Optional[bool]) –
monitor (Optional[Any]) –
- Return type
- optimize_noniterative(objective, dataset, optimizer=None, distributed=None, logger=None, monitor=None, stopper=None, stop_check_interval=None)[source]
- Parameters
objective (Any) –
dataset (tune.concepts.dataset.TuneDataset) –
optimizer (Optional[Any]) –
distributed (Optional[bool]) –
logger (Optional[Any]) –
monitor (Optional[Any]) –
stopper (Optional[Any]) –
stop_check_interval (Optional[Any]) –
- Return type
tune.api.suggest
- suggest_by_continuous_asha(objective, space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
objective (Any) –
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_by_hyperband(objective, space, plans, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
objective (Any) –
space (tune.concepts.space.spaces.Space) –
plans (List[List[Tuple[float, int]]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_by_sha(objective, space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
objective (Any) –
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_for_noniterative_objective(objective, space, df=None, df_name='__tune__df_', temp_path='', partition_keys=None, top_n=1, local_optimizer=None, logger=None, monitor=None, stopper=None, stop_check_interval=None, distributed=None, shuffle_candidates=True, execution_engine=None, execution_engine_conf=None)[source]
Given a non-iterative
objective
, a
space
and an (optional) dataframe, suggest the best parameter combinations.
Important
Please read Non-Iterative Tuning Guide
- Parameters
objective (Any) – a simple python function or
NonIterativeObjectiveFunc
compatible object, please read Non-Iterative Objective Explained
space (tune.concepts.space.spaces.Space) – search space, please read Space Tutorial
df (Optional[Any]) – Pandas, Spark, Dask or any dataframe that can be converted to Fugue
DataFrame
, defaults to None
df_name (str) – dataframe name, defaults to the value of
TUNE_DATASET_DF_DEFAULT_NAME
temp_path (str) – temp path for serialized dataframe partitions. It can be empty if you preset using
TUNE_OBJECT_FACTORY.
set_temp_path()
. For details, read TuneDataset Tutorial, defaults to “”
partition_keys (Optional[List[str]]) – partition keys for
df
, defaults to None. For details, please read TuneDataset Tutorial
top_n (int) – number of best results to return, defaults to 1. If <=0, all results will be returned
local_optimizer (Optional[Any]) – an object that can be converted to
NonIterativeObjectiveLocalOptimizer
, please read Non-Iterative Optimizers, defaults to None
logger (Optional[Any]) –
MetricLogger
object or a function producing it, defaults to None
monitor (Optional[Any]) – realtime monitor, defaults to None. Read Monitoring Guide
stopper (Optional[Any]) – early stopper, defaults to None. Read Early Stopping Guide
stop_check_interval (Optional[Any]) – an object that can be converted to timedelta, defaults to None. For details, read
to_timedelta()
distributed (Optional[bool]) – whether to use the execution engine to run different trials distributedly, defaults to None. If None, it is equivalent to True.
shuffle_candidates (bool) – whether to shuffle the candidate configurations, defaults to True. This has no effect on the final result.
execution_engine (Optional[Any]) – Fugue
ExecutionEngine
like object, defaults to None. If None,
NativeExecutionEngine
will be used and the task will run on the local machine.
execution_engine_conf (Optional[Any]) – Parameters like object, defaults to None
- Returns
a list of best results
- Return type
tune.concepts
tune.concepts.flow
tune.concepts.flow.judge
- class Monitor[source]
Bases:
object
- on_get_budget(trial, rung, budget)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
rung (int) –
budget (float) –
- Return type
None
- on_judge(decision)[source]
- Parameters
decision (tune.concepts.flow.judge.TrialDecision) –
- Return type
None
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
None
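For example, a minimal sketch of a custom monitor that prints every report; the subclass name is made up, and an instance can be passed as the monitor argument of the suggest_* functions:

from tune.concepts.flow.judge import Monitor

class PrintingMonitor(Monitor):
    # print each report as it arrives
    def on_report(self, report) -> None:
        print(report.trial_id, report.sort_metric)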
- class NoOpTrailJudge(monitor=None)[source]
Bases:
tune.concepts.flow.judge.TrialJudge
- Parameters
monitor (Optional[Monitor]) –
- can_accept(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- get_budget(trial, rung)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
rung (int) –
- Return type
float
- judge(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
- class RemoteTrialJudge(entrypoint)[source]
Bases:
tune.concepts.flow.judge.TrialJudge
- Parameters
entrypoint (Callable[[str, Dict[str, Any]], Any]) –
- can_accept(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- get_budget(trial, rung)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
rung (int) –
- Return type
float
- judge(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
- property report: Optional[tune.concepts.flow.report.TrialReport]
- class TrialCallback(judge)[source]
Bases:
object
- Parameters
judge (tune.concepts.flow.judge.TrialJudge) –
- class TrialDecision(report, budget, should_checkpoint, reason='', metadata=None)[source]
Bases:
object
- Parameters
report (tune.concepts.flow.report.TrialReport) –
budget (float) –
should_checkpoint (bool) –
reason (str) –
metadata (Optional[Dict[str, Any]]) –
- property budget: float
- property metadata: Dict[str, Any]
- property reason: str
- property report: tune.concepts.flow.report.TrialReport
- property should_checkpoint: bool
- property should_stop: bool
- property trial: tune.concepts.flow.trial.Trial
- property trial_id: str
- class TrialJudge(monitor=None)[source]
Bases:
object
- Parameters
monitor (Optional[Monitor]) –
- can_accept(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- get_budget(trial, rung)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
rung (int) –
- Return type
float
- judge(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
- property monitor: tune.concepts.flow.judge.Monitor
- reset_monitor(monitor=None)[source]
- Parameters
monitor (Optional[tune.concepts.flow.judge.Monitor]) –
- Return type
None
tune.concepts.flow.report
- class TrialReport(trial, metric, params=None, metadata=None, cost=1.0, rung=0, sort_metric=None, log_time=None)[source]
Bases:
object
The result from running the objective. It is immutable.
- Parameters
trial (tune.concepts.flow.trial.Trial) – the original trial sent to the objective
metric (Any) – the raw metric from the objective output
params (Any) – updated parameters based on the trial input, defaults to None. If None, it means the params from the trial were not updated; otherwise it is an object convertible to
TuningParametersTemplate
by
to_template()
metadata (Optional[Dict[str, Any]]) – metadata from the objective output, defaults to None
cost (float) – cost to run the objective, defaults to 1.0
rung (int) – number of rungs in the current objective, defaults to 0. This is for iterative problems
sort_metric (Any) – the metric for comparison, defaults to None. Smaller is better. If not set, it implies the
metric
is the
sort_metric
and that smaller is better
log_time (Any) – the time generating this report, defaults to None. If None, the current time will be used
Attention
This class is not for users to construct directly.
- copy()[source]
Copy the current object.
- Returns
the copied object
- Return type
Note
This is a shallow copy, but it is also used by __deepcopy__ of this object, because deepcopy is disabled for TrialReport.
- property cost: float
The cost to run the objective
- fill_dict(data)[source]
Fill a row of
StudyResult
with the report information
- Parameters
data (Dict[str, Any]) – a row (as dict) from
StudyResult
- Returns
the updated
data
- Return type
Dict[str, Any]
- generate_sort_metric(min_better, digits)[source]
Construct a new report object with the newly derived sort_metric
- Parameters
min_better (bool) – whether the current
metric()
is smaller better
digits (int) – number of digits to keep in
sort_metric
- Returns
a new object with the updated value
- Return type
- property log_time: datetime.datetime
The time generating this report
- property metadata: Dict[str, Any]
The metadata from the objective output
- property metric: float
The raw metric from the objective output
- property params: tune.concepts.space.parameters.TuningParametersTemplate
The parameters used by the objective to generate the
metric()
- reset_log_time()[source]
Reset
log_time()
to now
- Return type
- property rung: int
The number of rungs in the current objective, defaults to 0. This is for iterative problems
- property sort_metric: float
The metric for comparison
- property trial: tune.concepts.flow.trial.Trial
The original trial sent to the objective
- property trial_id: str
- with_cost(cost)[source]
Construct a new report object with the new
cost
- Parameters
cost (float) – new cost
- Returns
a new object with the updated value
- Return type
- with_rung(rung)[source]
Construct a new report object with the new
rung
- Parameters
rung (int) – new rung
- Returns
a new object with the updated value
- Return type
- class TrialReportHeap(min_heap)[source]
Bases:
object
- Parameters
min_heap (bool) –
- push(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
None
- values()[source]
- Return type
Iterable[tune.concepts.flow.report.TrialReport]
- class TrialReportLogger(new_best_only=False)[source]
Bases:
object
- Parameters
new_best_only (bool) –
- property best: Optional[tune.concepts.flow.report.TrialReport]
- log(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
None
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
bool
tune.concepts.flow.trial
- class Trial(trial_id, params, metadata=None, keys=None, dfs=None)[source]
Bases:
object
The input data collection for running an objective. It is immutable.
- Parameters
trial_id (str) – the unique id for a trial
params (Any) – parameters for tuning, an object convertible to
TuningParametersTemplate
by
to_template()
metadata (Optional[Dict[str, Any]]) – metadata for tuning, defaults to None. It is set during the construction of
TuneDataset
keys (Optional[List[str]]) – partition keys of the
TuneDataset
, defaults to None
dfs (Optional[Dict[str, Any]]) – dataframes extracted from
TuneDataset
, defaults to None
Attention
This class is not for users to construct directly. Use
Space
instead.
- copy()[source]
Copy the current object.
- Returns
the copied object
- Return type
Note
This is a shallow copy, but it is also used by __deepcopy__ of this object, because deepcopy is disabled for Trial.
- property dfs: Dict[str, Any]
Dataframes extracted from
TuneDataset
- property keys: List[str]
Partition keys of the
TuneDataset
- property metadata: Dict[str, Any]
Metadata of the trial
- property params: tune.concepts.space.parameters.TuningParametersTemplate
Parameters for tuning
- property trial_id: str
The unique id of this trial
- with_dfs(dfs)[source]
Set dataframes for the trial; a new Trial object will be constructed with the new
dfs
- Parameters
dfs (Dict[str, Any]) – dataframes to attach to the trial
- Return type
tune.concepts.space
tune.concepts.space.parameters
- class Choice(*args)[source]
Bases:
tune.concepts.space.parameters.StochasticExpression
A random choice of values. Please read Space Tutorial.
- Parameters
args (Any) – values to choose from
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
Any
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- property values: List[Any]
values to choose from
- class FuncParam(func, *args, **kwargs)[source]
Bases:
object
Function parameter. It defers the function call until all of its parameters are no longer tuning parameters
- Parameters
func (Callable) – function to generate parameter value
args (Any) – list arguments
kwargs (Any) – key-value arguments
s = Space(a=1, b=FuncParam(lambda x, y: x + y, x=Grid(0, 1), y=Grid(3, 4)))
assert [
    dict(a=1, b=3),
    dict(a=1, b=4),
    dict(a=1, b=4),
    dict(a=1, b=5),
] == list(s)
- class Grid(*args)[source]
Bases:
tune.concepts.space.parameters.TuningParameterExpression
Grid search, every value will be used. Please read Space Tutorial.
- Parameters
args (Any) – values for the grid search
- class NormalRand(mu, sigma, q=None)[source]
Bases:
tune.concepts.space.parameters.RandBase
Continuous normally distributed random variables. Please read Space Tutorial.
- Parameters
mu (float) – mean of the normal distribution
sigma (float) – standard deviation of the normal distribution
q (Optional[float]) – step between adjacent values, if set, the value will be rounded using
q
, defaults to None
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
float
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- class NormalRandInt(mu, sigma, q=1)[source]
Bases:
tune.concepts.space.parameters.RandBase
Normally distributed random integer values. Please read Space Tutorial.
- Parameters
mu (int) – mean of the normal distribution
sigma (float) – standard deviation of the normal distribution
q (int) –
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
int
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- class Rand(low, high, q=None, log=False, include_high=True)[source]
Bases:
tune.concepts.space.parameters.RandBase
Continuous uniform random variables. Please read Space Tutorial.
- Parameters
low (float) – range low bound (inclusive)
high (float) – range high bound (exclusive)
q (Optional[float]) – step between adjacent values, if set, the value will be rounded using
q
, defaults to None
log (bool) – whether to do uniform sampling in log space, defaults to False. If True,
low
must be positive, and lower values get a higher chance of being sampled
include_high (bool) –
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
float
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- class RandBase(q=None, log=False)[source]
Bases:
tune.concepts.space.parameters.StochasticExpression
Base class for continuous random variables. Please read Space Tutorial.
- Parameters
q (Optional[float]) – step between adjacent values, if set, the value will be rounded using
q
, defaults to None
log (bool) – whether to do uniform sampling in log space, defaults to False. If True, lower values get a higher chance of being sampled
- class RandInt(low, high, q=1, log=False, include_high=True)[source]
Bases:
tune.concepts.space.parameters.RandBase
Uniform distributed random integer values. Please read Space Tutorial.
- Parameters
low (int) – range low bound (inclusive)
high (int) – range high bound (exclusive)
log (bool) – whether to do uniform sampling in log space, defaults to False. If True,
low
must be >= 1, and lower values get a higher chance of being sampled
q (int) –
include_high (bool) –
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
float
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- class StochasticExpression[source]
Bases:
tune.concepts.space.parameters.TuningParameterExpression
Stochastic search base class. Please read Space Tutorial.
- generate(seed=None)[source]
Return a randomly chosen value.
- Parameters
seed (Optional[Any]) – if set, it will be used to call
seed()
, defaults to None
- Return type
Any
- generate_many(n, seed=None)[source]
Generate
n
randomly chosen values
- Parameters
n (int) – number of random values to generate
seed (Optional[Any]) – random seed, defaults to None
- Returns
a list of values
- Return type
List[Any]
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- class TransitionChoice(*args)[source]
Bases:
tune.concepts.space.parameters.Choice
An ordered random choice of values. Please read Space Tutorial.
- Parameters
args (Any) – values to choose from
- property jsondict: Dict[str, Any]
Dict representation of the expression that is json serializable
- class TuningParameterExpression[source]
Bases:
object
Base class of all tuning parameter expressions
- class TuningParametersTemplate(raw)[source]
Bases:
object
Parameter template to extract tuning parameter expressions from a nested data structure
- Parameters
raw (Dict[str, Any]) – the dictionary of input parameters.
Note
Please use
to_template()
to initialize this class.

# common cases
to_template(dict(a=1, b=1))
to_template(dict(a=Rand(0, 1), b=1))

# expressions may nest in dicts or arrays
template = to_template(
    dict(a=dict(x1=Rand(0, 1), x2=Rand(3, 4)), b=[Grid("a", "b")]))

assert [Rand(0, 1), Rand(3, 4), Grid("a", "b")] == template.params
assert dict(
    p0=Rand(0, 1), p1=Rand(3, 4), p2=Grid("a", "b")
) == template.params_dict

assert dict(a=dict(x1=1, x2=3), b=["a"]) == template.fill([1, 3, "a"])
assert dict(a=dict(x1=1, x2=3), b=["a"]) == template.fill_dict(
    dict(p2="a", p1=3, p0=1)
)
- concat(other)[source]
Concatenate with another template and generate a new template.
Note
The other template must not have any key that exists in this template, otherwise
ValueError
will be raised
- Returns
the merged template
- Parameters
other (tune.concepts.space.parameters.TuningParametersTemplate) –
- Return type
- static decode(data)[source]
Retrieve the template from a base64 string
- Parameters
data (str) –
- Return type
- property empty: bool
Whether the template contains any tuning expression
- fill(params)[source]
Fill the original data structure with values
- Parameters
params (List[Any]) – the list of values to be filled into the original data structure, in depth-first order
copy – whether to return deeply copied parameters, defaults to False
- Returns
the original data structure filled with values
- Return type
Dict[str, Any]
- fill_dict(params)[source]
Fill the original data structure with dictionary of values
- Parameters
params (Dict[str, Any]) – the dictionary of values to be filled into the original data structure, keys must be p0, p1, p2, …
copy – whether to return deeply copied parameters, defaults to False
- Returns
the original data structure filled with values
- Return type
Dict[str, Any]
- property has_grid: bool
Whether the template contains grid expressions
- property has_stochastic: bool
Whether the template contains stochastic expressions
- property params: List[tune.concepts.space.parameters.TuningParameterExpression]
Get all tuning parameter expressions in depth-first order
- property params_dict: Dict[str, tune.concepts.space.parameters.TuningParameterExpression]
Get all tuning parameter expressions in depth-first order, with corresponding generated keys p0, p1, p2, …
- product_grid()[source]
Cross product of all grid parameters
- Yield
new templates with the grid parameters filled
- Return type
Iterable[tune.concepts.space.parameters.TuningParametersTemplate]
assert [
    dict(a=1, b=Rand(0, 1)), dict(a=2, b=Rand(0, 1))
] == list(to_template(dict(a=Grid(1, 2), b=Rand(0, 1))).product_grid())
- sample(n, seed=None)[source]
Sample all stochastic parameters
- Parameters
n (int) – number of samples, must be a positive integer
seed (Optional[Any]) – random seed, defaults to None. It takes effect only when it is not None.
- Yield
new templates with the stochastic parameters filled
- Return type
Iterable[tune.concepts.space.parameters.TuningParametersTemplate]
assert [
    dict(a=1.1, b=Grid(0, 1)), dict(a=1.5, b=Grid(0, 1))
] == list(to_template(dict(a=Rand(1, 2), b=Grid(0, 1))).sample(2, 0))
- property simple_value: Dict[str, Any]
If the template contains no tuning expression, it is simple, and this returns the parameters dictionary; otherwise
ValueError
will be raised
- property template: Dict[str, Any]
The template dictionary, all tuning expressions will be replaced by
None
tune.concepts.space.spaces
- class Space(*args, **kwargs)[source]
Bases:
object
Search space object
Important
Please read Space Tutorial.
- Parameters
kwargs (Any) – parameters in the search space
Space(a=1, b=1)  # static space
Space(a=1, b=Grid(1,2), c=Grid("a", "b"))  # grid search
Space(a=1, b=Grid(1,2), c=Rand(0, 1))  # grid search + level 2 search
Space(a=1, b=Grid(1,2), c=Rand(0, 1)).sample(10, seed=0)  # grid + random search

# union
Space(a=1, b=Grid(2,3)) + Space(b=Rand(1,5)).sample(10)

# cross product
Space(a=1, b=Grid(2,3)) * Space(c=Rand(1,5), d=Grid("a","b"))

# combo (grid + random + level 2)
space1 = Space(a=1, b=Grid(2,4))
space2 = Space(b=RandInt(10, 20))
space3 = Space(c=Rand(0,1)).sample(10)
space = (space1 + space2) * space3
assert Space(a=1, b=Rand(0,1)).has_stochastic
assert not Space(a=1, b=Rand(0,1)).sample(10).has_stochastic
assert not Space(a=1, b=Grid(0,1)).has_stochastic
assert not Space(a=1, b=1).has_stochastic

# get all configurations
space = Space(a=Grid(2,4), b=Rand(0,1)).sample(100)
for conf in space:
    print(conf)
all_conf = list(space)
- property has_stochastic
Whether the space contains any
StochasticExpression
- sample(n, seed=None)[source]
Draw random samples from the current space. Please read Space Tutorial.
- Parameters
n (int) – number of samples to draw
seed (Optional[Any]) – random seed, defaults to None
- Returns
a new Space containing all samples
- Return type
Note
Sampling only applies to
StochasticExpression
. If
has_stochastic()
is False, it will return the original space. After sampling, no
StochasticExpression
will exist in the new space.
tune.concepts.checkpoint
- class Checkpoint(fs)[source]
Bases:
object
An abstraction for tuning checkpoint
- Parameters
fs (fs.base.FS) – the file system
Attention
Normally you don’t need to create a checkpoint by yourself, please read Checkpoint Tutorial if you want to understand how it works.
- property latest: fs.base.FS
latest checkpoint folder
- Raises
AssertionError – if there was no checkpoint
- class NewCheckpoint(checkpoint)[source]
Bases:
object
A helper class for adding new checkpoints
- Parameters
checkpoint (tune.concepts.checkpoint.Checkpoint) – the parent checkpoint
Attention
Do not construct this class directly, please read Checkpoint Tutorial for details
tune.concepts.dataset
- class StudyResult(dataset, result)[source]
Bases:
object
A collection of the input
TuneDataset
and the tuning result
- Parameters
dataset (tune.concepts.dataset.TuneDataset) – input dataset for tuning
result (fugue.workflow.workflow.WorkflowDataFrame) – tuning result as a dataframe
Attention
Do not construct this class directly.
- next_tune_dataset(best_n=0)[source]
Convert the result back to a new
TuneDataset
to be used by the next steps.
- Parameters
best_n (int) – top n result to extract, defaults to 0 (entire result)
- Returns
a new dataset for tuning
- Return type
- result(best_n=0)[source]
Get the top n results sorted by
tune.concepts.flow.report.TrialReport.sort_metric()
- Parameters
best_n (int) – number of results to get, defaults to 0. If <=0, it will return the entire result
- Returns
result subset
- Return type
- union_with(other)[source]
Union with another result set and update itself
- Parameters
other (tune.concepts.dataset.StudyResult) – the other result dataset
- Return type
None
Note
This method also removes duplicated reports based on
tune.concepts.flow.trial.Trial.trial_id()
. Each trial will have only the best report in the updated result
- class TuneDataset(data, dfs, keys)[source]
Bases:
object
A Fugue
WorkflowDataFrame
with metadata representing all dataframes required for a tuning task.
- Parameters
data (fugue.workflow.workflow.WorkflowDataFrame) – the Fugue
WorkflowDataFrame
containing all required dataframes
dfs (List[str]) – the names of the dataframes
keys (List[str]) – the common partition keys of all dataframes
Attention
Do not construct this class directly, please read TuneDataset Tutorial to find the right way
- property data: fugue.workflow.workflow.WorkflowDataFrame
the Fugue
WorkflowDataFrame
containing all required dataframes
- property dfs: List[str]
All dataframe names (you can also find them as part of the column names of
data()
)
- split(weights, seed)[source]
Split the dataset randomly into small partitions. This is useful for algorithms such as Hyperband, because they need different subsets to run successive halving with different parameters.
- Parameters
weights (List[float]) – a list of numeric values. The length represents the number of split partitions, and the values represent the proportions of the partitions
seed (Any) – random seed for the split
- Returns
a list of sub-datasets
- Return type
# randomly split the data into two partitions, 25% and 75%
dataset.split([1, 3], seed=0)
# same, because the weights will be normalized
dataset.split([10, 30], seed=0)
- class TuneDatasetBuilder(space, path='')[source]
Bases:
object
Builder of
TuneDataset
, for details please read TuneDataset Tutorial
- Parameters
space (tune.concepts.space.spaces.Space) – searching space, see Space Tutorial
path (str) – temp path to store serialized dataframe partitions, defaults to “”
- add_df(name, df, how='')[source]
Add a dataframe to the dataset
- Parameters
name (str) – name of the dataframe, it will also create a
__tune_df__<name>
column in the dataset dataframe
df (fugue.workflow.workflow.WorkflowDataFrame) – the dataframe to add.
how (str) – join type, can accept
semi
,left_semi
,anti
,left_anti
,inner
,left_outer
,right_outer
,full_outer
,cross
- Returns
the builder itself
- Return type
Note
For the first dataframe you add,
how
should be empty. From the second dataframe on,
how
must be set.
Note
If
df
is prepartitioned, the partition key will be used to join with the added dataframes. Read TuneDataset Tutorial for more details
- add_dfs(dfs, how='')[source]
Add multiple dataframes with the same join type
- Parameters
dfs (fugue.workflow.workflow.WorkflowDataFrames) – dictionary like dataframe collection. The keys will be used as the dataframe names
how (str) – join type, can accept
semi
,left_semi
,anti
,left_anti
,inner
,left_outer
,right_outer
,full_outer
,cross
- Returns
the builder itself
- Return type
- build(wf, batch_size=1, shuffle=True, trial_metadata=None)[source]
Build
TuneDataset
, for details please read TuneDataset Tutorial
- Parameters
wf (fugue.workflow.workflow.FugueWorkflow) – the workflow associated with the dataset
batch_size (int) – how many configurations as a batch, defaults to 1
shuffle (bool) – whether to shuffle the entire dataset, defaults to True. This makes the tuning process more even; it may slightly help speed and has no effect on the result.
trial_metadata (Optional[Dict[str, Any]]) – metadata to pass to each
Trial
, defaults to None
- Returns
the dataset for tuning
- Return type
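Putting the builder methods together, a hedged sketch of constructing a TuneDataset; the temp path and dataframe are illustrative, and in most cases the higher-level suggest_* APIs build this for you:

import pandas as pd
from fugue import FugueWorkflow
from tune import Grid, Space
from tune.concepts.dataset import TuneDatasetBuilder

dag = FugueWorkflow()
builder = TuneDatasetBuilder(Space(a=Grid(1, 2)), path="/tmp")
builder.add_df("train", dag.df(pd.DataFrame(dict(x=[0, 1], y=[1, 2]))))
dataset = builder.build(dag, batch_size=1, shuffle=True)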
tune.iterative
tune.iterative.asha
- class ASHAJudge(schedule, always_checkpoint=False, study_early_stop=None, trial_early_stop=None, monitor=None)[source]
Bases:
tune.concepts.flow.judge.TrialJudge
- Parameters
schedule (List[Tuple[float, int]]) –
always_checkpoint (bool) –
study_early_stop (Optional[Callable[[List[Any], List[tune.iterative.asha.RungHeap]], bool]]) –
trial_early_stop (Optional[Callable[[tune.concepts.flow.report.TrialReport, List[tune.concepts.flow.report.TrialReport], List[tune.iterative.asha.RungHeap]], bool]]) –
monitor (Optional[tune.concepts.flow.judge.Monitor]) –
- property always_checkpoint: bool
- can_accept(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- get_budget(trial, rung)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
rung (int) –
- Return type
float
- judge(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
- property schedule: List[Tuple[float, int]]
- class RungHeap(n)[source]
Bases:
object
- Parameters
n (int) –
- property best: float
- property bests: List[float]
- property capacity: int
- property full: bool
- push(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
bool
- values()[source]
- Return type
Iterable[tune.concepts.flow.report.TrialReport]
tune.iterative.objective
- class IterativeObjectiveFunc[source]
Bases:
object
- property current_trial: tune.concepts.flow.trial.Trial
- load_checkpoint(fs)[source]
- Parameters
fs (fs.base.FS) –
- Return type
None
- run(trial, judge, checkpoint_basedir_fs)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
judge (tune.concepts.flow.judge.TrialJudge) –
checkpoint_basedir_fs (fs.base.FS) –
- Return type
None
- property rung: int
- save_checkpoint(fs)[source]
- Parameters
fs (fs.base.FS) –
- Return type
None
- validate_iterative_objective(func, trial, budgets, validator, continuous=False, checkpoint_path='', monitor=None)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
budgets (List[float]) –
validator (Callable[[List[tune.concepts.flow.report.TrialReport]], None]) –
continuous (bool) –
checkpoint_path (str) –
monitor (Optional[tune.concepts.flow.judge.Monitor]) –
- Return type
None
tune.iterative.sha
tune.iterative.study
- class IterativeStudy(objective, checkpoint_path)[source]
Bases:
object
- Parameters
objective (tune.iterative.objective.IterativeObjectiveFunc) –
checkpoint_path (str) –
- optimize(dataset, judge)[source]
- Parameters
dataset (tune.concepts.dataset.TuneDataset) –
judge (tune.concepts.flow.judge.TrialJudge) –
- Return type
tune.noniterative
tune.noniterative.convert
- noniterative_objective(func=None, min_better=True)[source]
- Parameters
func (Optional[Callable]) –
min_better (bool) –
- Return type
Callable[[Any], tune.noniterative.objective.NonIterativeObjectiveFunc]
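A hedged sketch of the decorator form implied by the defaulted func argument, converting a plain function into a NonIterativeObjectiveFunc where larger metrics are better:

from tune.noniterative.convert import noniterative_objective

@noniterative_objective(min_better=False)
def score(a: float, b: float) -> float:
    # a hypothetical score where larger is better; min_better=False records
    # that this metric should be maximized
    return -((a - 1) ** 2) - ((b - 2) ** 2)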
tune.noniterative.objective
- class NonIterativeObjectiveFunc[source]
Bases:
object
- run(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
- safe_run(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
- class NonIterativeObjectiveLocalOptimizer[source]
Bases:
object
- property distributable: bool
- run(func, trial, logger)[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
logger (Any) –
- Return type
- run_monitored_process(func, trial, stop_checker, logger, interval='60sec')[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
stop_checker (Callable[[], bool]) –
logger (Any) –
interval (Any) –
- Return type
- validate_noniterative_objective(func, trial, validator, optimizer=None, logger=None)[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
validator (Callable[[tune.concepts.flow.report.TrialReport], None]) –
optimizer (Optional[tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer]) –
logger (Optional[Any]) –
- Return type
None
tune.noniterative.stopper
- class NonIterativeStopper(log_best_only=False)[source]
Bases:
tune.concepts.flow.judge.TrialJudge
- Parameters
log_best_only (bool) –
- can_accept(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- get_reports(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
- judge(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
bool
- should_stop(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- property updated: bool
- class NonIterativeStopperCombiner(left, right, is_and)[source]
Bases:
tune.noniterative.stopper.NonIterativeStopper
- Parameters
is_and (bool) –
- get_reports(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
bool
- should_stop(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
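Combiners are normally created through logical operators on stoppers rather than constructed directly. The & (AND) form is demonstrated in the tutorials below; by analogy with the is_and parameter, | should produce the OR form:
from tune import n_updates, small_improvement

# stop when the best metric was updated at least 5 times AND the last
# update improved it by less than 0.1 (absolute value)
stopper = n_updates(5) & small_improvement(0.1, 1)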
- class SimpleNonIterativeStopper(partition_should_stop, log_best_only=False)[source]
Bases:
tune.noniterative.stopper.NonIterativeStopper
- Parameters
partition_should_stop (Callable[[tune.concepts.flow.report.TrialReport, bool, List[tune.concepts.flow.report.TrialReport]], bool]) –
log_best_only (bool) –
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
bool
- should_stop(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
bool
- class TrialReportCollection(new_best_only=False)[source]
Bases:
tune.concepts.flow.report.TrialReportLogger
- Parameters
new_best_only (bool) –
- log(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
None
- property reports: List[tune.concepts.flow.report.TrialReport]
tune.noniterative.study
- class NonIterativeStudy(objective, optimizer)[source]
Bases:
object
- Parameters
objective (tune.noniterative.objective.NonIterativeObjectiveFunc) –
optimizer (tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer) –
- optimize(dataset, distributed=None, monitor=None, stopper=None, stop_check_interval=None, logger=None)[source]
- Parameters
dataset (tune.concepts.dataset.TuneDataset) –
distributed (Optional[bool]) –
monitor (Optional[tune.concepts.flow.judge.Monitor]) –
stopper (Optional[tune.noniterative.stopper.NonIterativeStopper]) –
stop_check_interval (Optional[Any]) –
logger (Optional[Any]) –
- Return type
tune.constants
tune.exceptions
tune_hyperopt
tune_hyperopt.optimizer
- class HyperoptLocalOptimizer(max_iter, seed=0, kwargs_func=None)[source]
Bases:
tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer
- Parameters
max_iter (int) –
seed (int) –
kwargs_func (Optional[Callable[[tune.noniterative.objective.NonIterativeObjectiveFunc, tune.concepts.flow.trial.Trial], Dict[str, Any]]]) –
- run(func, trial, logger)[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
logger (Any) –
- Return type
tune_optuna
tune_optuna.optimizer
- class OptunaLocalOptimizer(max_iter, create_study=None)[source]
Bases:
tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer
- Parameters
max_iter (int) –
create_study (Optional[Callable[[], optuna.study.study.Study]]) –
- run(func, trial, logger)[source]
- Parameters
func (tune.noniterative.objective.NonIterativeObjectiveFunc) –
trial (tune.concepts.flow.trial.Trial) –
logger (Any) –
- Return type
tune_sklearn
tune_sklearn.objective
- class SKCVObjective(scoring, cv=5, feature_prefix='', label_col='label', checkpoint_path=None)[source]
Bases:
tune_sklearn.objective.SKObjective
- Parameters
scoring (Any) –
cv (int) –
feature_prefix (str) –
label_col (str) –
checkpoint_path (Optional[str]) –
- Return type
None
- run(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
- class SKObjective(scoring, feature_prefix='', label_col='label', checkpoint_path=None)[source]
Bases:
tune.noniterative.objective.NonIterativeObjectiveFunc
- Parameters
scoring (Any) –
feature_prefix (str) –
label_col (str) –
checkpoint_path (Optional[str]) –
- Return type
None
- run(trial)[source]
- Parameters
trial (tune.concepts.flow.trial.Trial) –
- Return type
tune_sklearn.suggest
- suggest_sk_models(space, train_df, test_df, scoring, temp_path='', feature_prefix='', label_col='label', save_model=False, partition_keys=None, top_n=1, local_optimizer=None, monitor=None, stopper=None, stop_check_interval=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
train_df (Any) –
test_df (Any) –
scoring (str) –
temp_path (str) –
feature_prefix (str) –
label_col (str) –
save_model (bool) –
partition_keys (Optional[List[str]]) –
top_n (int) –
local_optimizer (Optional[tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer]) –
monitor (Optional[Any]) –
stopper (Optional[Any]) –
stop_check_interval (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_sk_models_by_cv(space, train_df, scoring, cv=5, temp_path='', feature_prefix='', label_col='label', save_model=False, partition_keys=None, top_n=1, local_optimizer=None, monitor=None, stopper=None, stop_check_interval=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
train_df (Any) –
scoring (str) –
cv (int) –
temp_path (str) –
feature_prefix (str) –
label_col (str) –
save_model (bool) –
partition_keys (Optional[List[str]]) –
top_n (int) –
local_optimizer (Optional[tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer]) –
monitor (Optional[Any]) –
stopper (Optional[Any]) –
stop_check_interval (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
tune_sklearn.utils
tune_tensorflow
tune_tensorflow.objective
- class KerasObjective(type_dict)[source]
Bases:
tune.iterative.objective.IterativeObjectiveFunc
- Parameters
type_dict (Dict[str, Type[tune_tensorflow.spec.KerasTrainingSpec]]) –
- Return type
None
- load_checkpoint(fs)[source]
- Parameters
fs (fs.base.FS) –
- Return type
None
- property model: keras.engine.training.Model
- save_checkpoint(fs)[source]
- Parameters
fs (fs.base.FS) –
- Return type
None
- property spec: tune_tensorflow.spec.KerasTrainingSpec
tune_tensorflow.spec
- class KerasTrainingSpec(params, dfs)[source]
Bases:
object
- Parameters
params (Any) –
dfs (Dict[str, Any]) –
- compile_model(**add_kwargs)[source]
- Parameters
add_kwargs (Any) –
- Return type
keras.engine.training.Model
- property dfs: Dict[str, Any]
- load_checkpoint(fs, model)[source]
- Parameters
fs (fs.base.FS) –
model (keras.engine.training.Model) –
- Return type
None
- property params: tune.concepts.space.parameters.TuningParametersTemplate
- save_checkpoint(fs, model)[source]
- Parameters
fs (fs.base.FS) –
model (keras.engine.training.Model) –
- Return type
None
tune_tensorflow.suggest
- suggest_keras_models_by_continuous_asha(space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_keras_models_by_hyperband(space, plans, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
plans (List[List[Tuple[float, int]]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
- suggest_keras_models_by_sha(space, plan, train_df=None, temp_path='', partition_keys=None, top_n=1, monitor=None, distributed=None, execution_engine=None, execution_engine_conf=None)[source]
- Parameters
space (tune.concepts.space.spaces.Space) –
plan (List[Tuple[float, int]]) –
train_df (Optional[Any]) –
temp_path (str) –
partition_keys (Optional[List[str]]) –
top_n (int) –
monitor (Optional[Any]) –
distributed (Optional[bool]) –
execution_engine (Optional[Any]) –
execution_engine_conf (Optional[Any]) –
- Return type
tune_tensorflow.utils
- extract_keras_spec(params, type_dict)[source]
- Parameters
params (tune.concepts.space.parameters.TuningParametersTemplate) –
type_dict (Dict[str, Any]) –
- Return type
tune_notebook
tune_notebook.monitors
- class NotebookSimpleChart(interval='1sec', best_only=True, always_update=False)[source]
Bases:
tune.concepts.flow.judge.Monitor
- Parameters
interval (Any) –
best_only (bool) –
always_update (bool) –
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
None
- class NotebookSimpleHist(interval='1sec')[source]
Bases:
tune_notebook.monitors.NotebookSimpleChart
- Parameters
interval (Any) –
- class NotebookSimpleRungs(interval='1sec')[source]
Bases:
tune_notebook.monitors.NotebookSimpleChart
- Parameters
interval (Any) –
- class NotebookSimpleTimeSeries(interval='1sec')[source]
Bases:
tune_notebook.monitors.NotebookSimpleChart
- Parameters
interval (Any) –
- class PrintBest[source]
Bases:
tune.concepts.flow.judge.Monitor
- on_report(report)[source]
- Parameters
report (tune.concepts.flow.report.TrialReport) –
- Return type
None
tune_test
tune_test.local_optmizer
- class NonIterativeObjectiveLocalOptimizerTests[source]
Bases:
object
General test suite for local optimizers. All new NonIterativeObjectiveLocalOptimizer implementations should pass this test suite.
- class Tests(methodName='runTest')[source]
Bases:
unittest.case.TestCase
- make_optimizer(**kwargs)[source]
- Parameters
kwargs (Any) –
- Return type
tune.noniterative.objective.NonIterativeObjectiveLocalOptimizer
Short Tutorials
Search Space
THIS IS THE MOST IMPORTANT CONCEPT OF TUNE, MUST READ
Tune defines its own search space concept and expressions. It inherits the Fugue philosophy: one expression for all frameworks. For the underlying optimizers (e.g. HyperOpt, Optuna), Tune unifies their behaviors. For example, Rand(1.0, 5.0, q=1.5)
will uniformly search on [1.0, 2.5, 4.0]
no matter whether you use HyperOpt or Optuna as the underlying optimizer.
In Tune, spaces are predefined before the search, in contrast to Optuna where you request variables inside objectives at runtime. This way, your space definition is completely separated from your objective definition, and your objectives can be plain Python functions independent of Tune.
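As a quick sanity check of that claim (a sketch using generate_many, which this tutorial demonstrates later):
from tune import Rand

# a quantized Rand enumerates low, low+q, low+2q, ... within the range
samples = Rand(1.0, 5.0, q=1.5).generate_many(100, seed=0)
print(sorted(set(samples)))  # expected: [1.0, 2.5, 4.0]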
[1]:
from tune import Space, Grid, Rand, RandInt, Choice
import pandas as pd
Simple Cases
The simplest cases are spaces with only static values, so the space always generates a single configuration.
[2]:
space = Space(a=1, b=1)
print(list(space))
[{'a': 1, 'b': 1}]
Grid Search
You can replace the static values with Grid
expressions. All grid expressions in a space are cross-producted, so the second example below generates 6 configurations.
[3]:
print(list(Space(a=1, b=Grid("a","b"))))
print(list(Space(a=Grid(1,2), b=Grid("x","y","z"))))
[{'a': 1, 'b': 'a'}, {'a': 1, 'b': 'b'}]
[{'a': 1, 'b': 'x'}, {'a': 1, 'b': 'y'}, {'a': 1, 'b': 'z'}, {'a': 2, 'b': 'x'}, {'a': 2, 'b': 'y'}, {'a': 2, 'b': 'z'}]
Random Expressions
Random search requires calling the .sample
method after defining the space, to specify how many random combinations you want to draw.
Choice
Choice refers to a discrete unordered set of values, so Choice(1, 2, 3) is equivalent to Choice(2, 1, 3). When you randomly sample from a Choice, every value has an equal chance. Advanced search such as Bayesian Optimization also assumes no relation between the values.
[4]:
space = Space(a=1, b=Choice("aa", "bb", "cc")).sample(2, seed=1)
print(list(space))
[{'a': 1, 'b': 'bb'}, {'a': 1, 'b': 'aa'}]
Rand
Rand
is the most common expression for a variable. It refers to sampling from a range of values.
Rand(low, high)
uniformly searches in [low, high)
[5]:
samples = Rand(10.1, 20.2).generate_many(10000, seed=0)
pd.DataFrame(samples).hist();
Rand(low, high, log=True)
searches in log space, still within [low, high)
, so smaller values get a higher chance of being selected.
For log-space search, low must be greater than or equal to 1.
The algorithm: exp(uniform(log(low), log(high)))
[6]:
samples = Rand(10.1, 1000, log=True).generate_many(10000, seed=0)

Rand(low, high, q, include_high)
uniformly searches between low and high with step q. include_high (default True) indicates whether the high value can be a candidate.
[7]:
print(Rand(-1.0,4.0,q=2.5).generate_many(10, seed=0))
print(Rand(-1.0,4.0,q=2.5,include_high=False).generate_many(10, seed=0))
samples = Rand(1.0,2.0,q=0.3).generate_many(10000, seed=0)
pd.DataFrame(samples).hist();
[1.5, 4.0, 1.5, 1.5, 1.5, 1.5, 1.5, 4.0, 4.0, 1.5]
[1.5, 1.5, 1.5, 1.5, -1.0, 1.5, -1.0, 1.5, 1.5, -1.0]
Rand(low, high, q, include_high, log=True)
searches between low and high with step q in log space. include_high (default True) indicates whether the high value can be a candidate.
[8]:
samples = Rand(1.0,16.0,q=5, log=True).generate_many(10000, seed=0)
pd.DataFrame(samples).hist()
samples = Rand(1.0,16.0,q=5, log=True, include_high=False).generate_many(10000, seed=0)
pd.DataFrame(samples).hist();
RandInt
RandInt
can be considered a special case of Rand
where low, high and q are all integers.
RandInt(low, high, include_high)
[9]:
samples = RandInt(-2,2).generate_many(10000, seed=0)
pd.DataFrame(samples).hist()
samples = RandInt(-2,2,include_high=False).generate_many(10000, seed=0)
pd.DataFrame(samples).hist();
RandInt(low, high, include_high, q)
searches from low to high with step q.
[10]:
samples = RandInt(-2,4,q=2).generate_many(10000, seed=0)
pd.DataFrame(samples).hist()
samples = RandInt(-2,4,include_high=False,q=2).generate_many(10000, seed=0)
pd.DataFrame(samples).hist();
RandInt(low, high, include_high, q, log)
searches from low to high with step q, but in log space, so lower values get a higher chance. As with Rand, for log-space search low must be >= 1.
[11]:
samples = RandInt(1,7,q=2,log=True).generate_many(10000, seed=0)
pd.DataFrame(samples).hist()
samples = RandInt(1,7,include_high=False,q=2,log=True).generate_many(10000, seed=0)
pd.DataFrame(samples).hist();
Random Search
In Tune, you have two options to search on random expressions
As Level 1 Search
Level 1 search happens before execution: given a combination of random expressions, we draw a certain number of parameter combinations up front, so the system only deals with static parameters at runtime.
Grid search is also Level 1 search, and Level 1 search determines the max parallelism. To also treat random expressions as Level 1, we must use .sample
[12]:
space = Space(a=Rand(0,1), b=Choice("x", "y")).sample(10, seed=0)
list(space)
[12]:
[{'a': 0.5488135039273248, 'b': 'x'},
{'a': 0.7151893663724195, 'b': 'y'},
{'a': 0.6027633760716439, 'b': 'y'},
{'a': 0.5448831829968969, 'b': 'x'},
{'a': 0.4236547993389047, 'b': 'x'},
{'a': 0.6458941130666561, 'b': 'y'},
{'a': 0.4375872112626925, 'b': 'y'},
{'a': 0.8917730007820798, 'b': 'y'},
{'a': 0.9636627605010293, 'b': 'y'},
{'a': 0.3834415188257777, 'b': 'x'}]
If a space has both grid and random expressions, .sample
only applies to the random expressions, and the samples are then cross-producted with all grid combinations
[13]:
space = Space(a=Grid(0,1), b=Rand(0,1), c=Grid("a", "b"), d=Rand(0,1)).sample(3, seed=1)
list(space) # 2*2 *3 configs
[13]:
[{'a': 0, 'b': 0.417022004702574, 'c': 'a', 'd': 0.30233257263183977},
{'a': 0, 'b': 0.417022004702574, 'c': 'b', 'd': 0.30233257263183977},
{'a': 1, 'b': 0.417022004702574, 'c': 'a', 'd': 0.30233257263183977},
{'a': 1, 'b': 0.417022004702574, 'c': 'b', 'd': 0.30233257263183977},
{'a': 0, 'b': 0.7203244934421581, 'c': 'a', 'd': 0.14675589081711304},
{'a': 0, 'b': 0.7203244934421581, 'c': 'b', 'd': 0.14675589081711304},
{'a': 1, 'b': 0.7203244934421581, 'c': 'a', 'd': 0.14675589081711304},
{'a': 1, 'b': 0.7203244934421581, 'c': 'b', 'd': 0.14675589081711304},
{'a': 0, 'b': 0.00011437481734488664, 'c': 'a', 'd': 0.0923385947687978},
{'a': 0, 'b': 0.00011437481734488664, 'c': 'b', 'd': 0.0923385947687978},
{'a': 1, 'b': 0.00011437481734488664, 'c': 'a', 'd': 0.0923385947687978},
{'a': 1, 'b': 0.00011437481734488664, 'c': 'b', 'd': 0.0923385947687978}]
As Level 2 Search
Level 2 search happens at runtime, based on each Level 1 candidate. A common scenario is to grid search one parameter and run Bayesian Optimization on another: we parallelize over the choices of the first parameter and run sequential Bayesian Optimization on the second.
We use 3rd-party solutions such as HyperOpt and Optuna for Level 2 search. To pass a random expression to Level 2, we simply don’t call .sample
[14]:
space = Space(a=Grid(0,1), b=Rand(0,1), c=Grid("a", "b"), d=Rand(0,1))
list(space) # 2*2 configs; each config still contains the Rand expressions
[14]:
[{'a': 0, 'b': Rand(low=0, high=1, q=None, log=False, include_high=True), 'c': 'a', 'd': Rand(low=0, high=1, q=None, log=False, include_high=True)},
{'a': 0, 'b': Rand(low=0, high=1, q=None, log=False, include_high=True), 'c': 'b', 'd': Rand(low=0, high=1, q=None, log=False, include_high=True)},
{'a': 1, 'b': Rand(low=0, high=1, q=None, log=False, include_high=True), 'c': 'a', 'd': Rand(low=0, high=1, q=None, log=False, include_high=True)},
{'a': 1, 'b': Rand(low=0, high=1, q=None, log=False, include_high=True), 'c': 'b', 'd': Rand(low=0, high=1, q=None, log=False, include_high=True)}]
Space Operations, Conditional Search and Hybrid Search
Almost all popular tuning frameworks support conditional search. Tune
approaches conditional search in a totally different way: instead of using if-else at runtime or nested dictionaries to represent conditions, we introduce space operations:
[15]:
space1 = Space(a=1, b=Grid(2,3))
space2 = Space(c=Grid("a","b"))
union_space = space1 + space2
print(list(union_space))
product_space = space1 * space2
print(list(product_space))
[{'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'c': 'a'}, {'c': 'b'}]
[{'a': 1, 'b': 2, 'c': 'a'}, {'a': 1, 'b': 2, 'c': 'b'}, {'a': 1, 'b': 3, 'c': 'a'}, {'a': 1, 'b': 3, 'c': 'b'}]
Operator +
unions the configurations of two spaces; it can solve most conditional search problems.
Operator *
cross-products the configurations of two spaces; it can solve most hybrid search problems.
Conditional Search
[16]:
space1 = Space(model="LogisticRegression")
space2 = Space(model="RandomForestClassifier", max_depth=Grid(3,4))
space3 = Space(model="XGBClassifier", n_estimators=Grid(10,100,1000))
sweep = sum([space1, space2, space3]) # sum is another way to union
list(sweep)
[16]:
[{'model': 'LogisticRegression'},
{'model': 'RandomForestClassifier', 'max_depth': 3},
{'model': 'RandomForestClassifier', 'max_depth': 4},
{'model': 'XGBClassifier', 'n_estimators': 10},
{'model': 'XGBClassifier', 'n_estimators': 100},
{'model': 'XGBClassifier', 'n_estimators': 1000}]
All 3 models have a random_state
parameter, and we also want to grid search on it for every model. We just use *
[17]:
sweep_with_random_state = sweep * Space(random_state=Grid(0,1))
list(sweep_with_random_state)
[17]:
[{'model': 'LogisticRegression', 'random_state': 0},
{'model': 'LogisticRegression', 'random_state': 1},
{'model': 'RandomForestClassifier', 'max_depth': 3, 'random_state': 0},
{'model': 'RandomForestClassifier', 'max_depth': 3, 'random_state': 1},
{'model': 'RandomForestClassifier', 'max_depth': 4, 'random_state': 0},
{'model': 'RandomForestClassifier', 'max_depth': 4, 'random_state': 1},
{'model': 'XGBClassifier', 'n_estimators': 10, 'random_state': 0},
{'model': 'XGBClassifier', 'n_estimators': 10, 'random_state': 1},
{'model': 'XGBClassifier', 'n_estimators': 100, 'random_state': 0},
{'model': 'XGBClassifier', 'n_estimators': 100, 'random_state': 1},
{'model': 'XGBClassifier', 'n_estimators': 1000, 'random_state': 0},
{'model': 'XGBClassifier', 'n_estimators': 1000, 'random_state': 1}]
Hybrid Search (Grid + Random + Bayesian Optimization)
For XGBClassifier
, we want to do a hybrid search: grid search on random_state
, random search on n_estimators
and Level 2 (Bayesian Optimization) search on learning_rate
[18]:
xgb = Space(model="XGBClassifier", learning_rate=Rand(0,1), random_state=Grid(0,1)) * Space(n_estimators=RandInt(10,1000)).sample(3, seed=0)
list(xgb)
[18]:
[{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 0, 'n_estimators': 553},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 0, 'n_estimators': 718},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 0, 'n_estimators': 607},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 1, 'n_estimators': 553},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 1, 'n_estimators': 718},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 1, 'n_estimators': 607}]
Hybrid search and conditional search can also be used together
[19]:
list(Space(model="LogisticRegression")+xgb)
[19]:
[{'model': 'LogisticRegression'},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 0, 'n_estimators': 553},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 0, 'n_estimators': 718},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 0, 'n_estimators': 607},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 1, 'n_estimators': 553},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 1, 'n_estimators': 718},
{'model': 'XGBClassifier', 'learning_rate': Rand(low=0, high=1, q=None, log=False, include_high=True), 'random_state': 1, 'n_estimators': 607}]
Non-Iterative Tuning Guide
Hello World
Let’s do hybrid parameter tuning with grid search + random search, and run it distributedly.
[1]:
def objective(a, b) -> float:
return a**2 + b**2
[2]:
from tune import Space, Grid, Rand, RandInt, Choice
space = Space(a=Grid(-1,0,1), b=Rand(-10,10)).sample(100, seed=0)
[4]:
from tune import suggest_for_noniterative_objective
result = suggest_for_noniterative_objective(objective, space, top_n=1)[0]
print(result.sort_metric, result)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
0.1909396653178624 {'trial': {'trial_id': '58c94f4f-011e-53da-a85b-7e696ced6600', 'params': {'a': 0, 'b': 0.43696643500143395}, 'metadata': {}, 'keys': []}, 'metric': 0.1909396653178624, 'params': {'a': 0, 'b': 0.43696643500143395}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 0.1909396653178624, 'log_time': datetime.datetime(2021, 10, 6, 23, 35, 53, 24547)}
Now let’s run it distributedly, using Dask as the example.
[6]:
from fugue_dask import DaskExecutionEngine
result = suggest_for_noniterative_objective(
objective, space, top_n=1,
execution_engine = DaskExecutionEngine
)[0]
print(result.sort_metric, result)
0.1909396653178624 {'trial': {'trial_id': '58c94f4f-011e-53da-a85b-7e696ced6600', 'params': {'a': 0, 'b': 0.43696643500143395}, 'metadata': {}, 'keys': []}, 'metric': 0.1909396653178624, 'params': {'a': 0, 'b': 0.43696643500143395}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 0.1909396653178624, 'log_time': datetime.datetime(2021, 10, 6, 23, 36, 16, 996725)}
To use tune
in a more elegant and easier way, let’s first see how to configure the system.
Configuration
Configuring the system is not required, but it greatly simplifies the work that follows.
suggest_for_noniterative_objective
and optimize_noniterative
have a lot of parameters due to the complexity of tuning operations. But tune
lets you set a global configuration so you don’t need to repeat the same settings for every tuning task.
Customize Optimizer Converter
[7]:
from tune import TUNE_OBJECT_FACTORY
from tune import NonIterativeObjectiveLocalOptimizer
from tune_hyperopt import HyperoptLocalOptimizer
from tune_optuna import OptunaLocalOptimizer
import optuna
optuna.logging.disable_default_handler()
def to_optimizer(obj):
if isinstance(obj, NonIterativeObjectiveLocalOptimizer):
return obj
if obj is None or "hyperopt"==obj:
return HyperoptLocalOptimizer(max_iter=20, seed=0)
if "optuna" == obj:
return OptunaLocalOptimizer(max_iter=20)
raise NotImplementedError
# make default level 2 optimizer HyperoptLocalOptimizer, so you will not need to set again
TUNE_OBJECT_FACTORY.set_noniterative_local_optimizer_converter(to_optimizer)
Customize Monitor
A monitor collects and renders information in real time. There are built-in monitors, and you can also create your own.
[9]:
from typing import Optional
from tune import TUNE_OBJECT_FACTORY
from tune import Monitor
from tune_notebook import (
NotebookSimpleHist,
NotebookSimpleRungs,
NotebookSimpleTimeSeries,
PrintBest,
)
def to_monitor(obj) -> Optional[Monitor]:
if obj is None:
return None
if isinstance(obj, Monitor):
return obj
if isinstance(obj, str):
if obj == "hist":
return NotebookSimpleHist()
if obj == "rungs":
return NotebookSimpleRungs()
if obj == "ts":
return NotebookSimpleTimeSeries()
if obj == "text":
return PrintBest()
raise NotImplementedError(obj)
TUNE_OBJECT_FACTORY.set_monitor_converter(to_monitor)
Set Temp Path For Tuning
The temp path is used to store serialized partitions and checkpoints. Most top-level API usage requires a valid temp path. We can use the factory method to set a global value.
Notice that if you want to tune distributedly, you should point the path at a distributed file system, for example S3.
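On a distributed engine that could look like the following (the bucket name is illustrative); in this notebook we run locally, so /tmp suffices:
TUNE_OBJECT_FACTORY.set_temp_path("s3://my-bucket/tune-tmp")  # illustrative path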
[10]:
TUNE_OBJECT_FACTORY.set_temp_path("/tmp")
Tuning Examples
Sometimes, your objective function requires an input dataframe. There are two ways to use dataframes in general:
|  | Pros | Cons |
|---|---|---|
| Take them as real dataframes, for example pandas dataframes | Simple and intuitive | Either the data size can’t scale, or you have to couple with a distributed solution such as Spark |
| Take them from parameters, for example paths as parameters | You have full control of how, when and whether to load the data; more scalable | More code to make it work |
In general, the second way is the better idea; a minimal sketch of it follows. But if your case fits the first scenario, tune
has a simple solution that lets you take pandas dataframes as input, which the rest of this tutorial demonstrates.
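In the sketch below (the helper and the parquet path are hypothetical, not part of tune), the objective receives a file path as an ordinary tuning parameter and loads the data itself, so no dataframe is serialized with the task:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def evaluate_from_path(path: str, **kwargs) -> float:
    train_df = pd.read_parquet(path)  # each worker loads its own data
    x, y = train_df.drop("target", axis=1), train_df["target"]
    model = RandomForestRegressor(**kwargs)
    return -np.mean(cross_val_score(model, x, y, scoring="neg_mean_absolute_error", cv=4))

# the path can then be a static value in the space, e.g.
# Space(path="/tmp/train.parquet", n_estimators=Grid(100, 200))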
[11]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
diabetes = load_diabetes(as_frame=True)["frame"]
def evaluate(train_df:pd.DataFrame, **kwargs) -> float:
x, y = train_df.drop("target", axis=1), train_df["target"]
model = RandomForestRegressor(**kwargs)
# note: the scorer is larger-is-better, so we negate to get a smaller-is-better metric
return -np.mean(cross_val_score(model, x, y, scoring="neg_mean_absolute_error", cv=4))
evaluate(diabetes)
[11]:
46.646344389844394
With the given diabetes
dataset and the objective function evaluate
, let’s tune it in different ways.
Hybrid Tuning
[13]:
# Grid search only
space = Space(n_estimators=Grid(100,200), random_state=0)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df"
)[0]
print(result.sort_metric, result)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
46.63103787878788 {'trial': {'trial_id': '5d719fa7-9537-58b1-86cd-fa69a4e75272', 'params': {'n_estimators': 100, 'random_state': 0}, 'metadata': {}, 'keys': []}, 'metric': 46.63103787878788, 'params': {'n_estimators': 100, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.63103787878788, 'log_time': datetime.datetime(2021, 10, 6, 23, 37, 11, 450017)}
[14]:
# grid + random
space = Space(n_estimators=Grid(100,200), max_depth=RandInt(2,10), random_state=0).sample(3, seed=0)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df"
)[0]
print(result.sort_metric, result)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
46.52677715635581 {'trial': {'trial_id': '0a53519f-576b-5a9f-8ef9-4a7e7f69de1a', 'params': {'n_estimators': 200, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'keys': []}, 'metric': 46.52677715635581, 'params': {'n_estimators': 200, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.52677715635581, 'log_time': datetime.datetime(2021, 10, 6, 23, 37, 26, 492058)}
[16]:
# random + bayesian optimization (hyperopt is used by default)
space = Space(n_estimators=RandInt(50,200))* Space(max_depth=RandInt(2,10), random_state=0).sample(2, seed=0)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df"
)[0]
print(result.sort_metric, result)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
local_optimizer="optuna" # switch to optuna for bayesian optimization
)[0]
print(result.sort_metric, result)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
46.419699856089416 {'trial': {'trial_id': '52919031-4f17-58d2-8cfc-e4a1d0e4555a', 'params': {'n_estimators': 175, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'keys': []}, 'metric': 46.419699856089416, 'params': {'n_estimators': 175, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.419699856089416, 'log_time': datetime.datetime(2021, 10, 6, 23, 38, 37, 355059)}
46.41622613826187 {'trial': {'trial_id': '52919031-4f17-58d2-8cfc-e4a1d0e4555a', 'params': {'n_estimators': 176, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'keys': []}, 'metric': 46.41622613826187, 'params': {'n_estimators': 176, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.41622613826187, 'log_time': datetime.datetime(2021, 10, 6, 23, 39, 9, 442020)}
Partition And Train And Tune
This is a very important feature of tune
. Sometimes, partitioning the data and train and tune small independent models separately can generate better result. This is not necessarily true, but at least we make it very simple for you to try. You only need to specify partition_keys
. And with a distributed engine, all independent tasks are fully parallelized.
[17]:
space = Space(n_estimators=Grid(50,200), max_depth=RandInt(2,10), random_state=0).sample(2, seed=0)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
partition_keys = ["sex"] # train and tune separately for each sex group
)
for r in result:
print(r.trial.keys, r.sort_metric, r)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
[0.0506801187398187] 42.48208345425722 {'trial': {'trial_id': '83f593dd-a3a2-5ac0-b389-ee19f8cc1134', 'params': {'n_estimators': 200, 'max_depth': 8, 'random_state': 0}, 'metadata': {}, 'keys': [0.0506801187398187]}, 'metric': 42.48208345425722, 'params': {'n_estimators': 200, 'max_depth': 8, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 42.48208345425722, 'log_time': datetime.datetime(2021, 10, 6, 23, 40, 38, 579320)}
[-0.044641636506989] 46.66399292343497 {'trial': {'trial_id': '1759366d-de55-5418-b1b5-48cf91f529a0', 'params': {'n_estimators': 50, 'max_depth': 8, 'random_state': 0}, 'metadata': {}, 'keys': [-0.044641636506989]}, 'metric': 46.66399292343497, 'params': {'n_estimators': 50, 'max_depth': 8, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.66399292343497, 'log_time': datetime.datetime(2021, 10, 6, 23, 40, 33, 356186)}
Distributed Tuning
tune
is based on Fugue, so it runs seamlessly on all Fugue-supported execution engines, in the same way Fugue uses them.
[18]:
# This space is a combination of grid and random search
# all level 1 searches, so it can be fully distributed
space = Space(n_estimators=Grid(50,200), max_depth=RandInt(2,10), random_state=0).sample(2, seed=0)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
partition_keys = ["sex"],
execution_engine = DaskExecutionEngine # this makes the tuning process distributed
)
for r in result:
print(r.trial.keys, r.sort_metric, r)
[0.0506801187398187] 42.79742975473356 {'trial': {'trial_id': '0f2053de-71b2-514d-b4ff-8495b93a042b', 'params': {'n_estimators': 200, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'keys': [0.0506801187398187]}, 'metric': 42.79742975473356, 'params': {'n_estimators': 200, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 42.79742975473356, 'log_time': datetime.datetime(2021, 10, 6, 23, 40, 57, 795165)}
[-0.044641636506989] 47.480845528260254 {'trial': {'trial_id': '46da77b5-089d-57b9-8036-0ca2e3646fdb', 'params': {'n_estimators': 200, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'keys': [-0.044641636506989]}, 'metric': 47.480845528260254, 'params': {'n_estimators': 200, 'max_depth': 6, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 47.480845528260254, 'log_time': datetime.datetime(2021, 10, 6, 23, 41, 0, 714602)}
Realtime Monitoring
The Fugue framework lets workers communicate with the driver in real time (see the Fugue documentation on callbacks), and tune
leverages this feature for monitoring and iterative problems.
[19]:
space = Space(n_estimators=RandInt(1,20), max_depth=RandInt(2,10), random_state=0).sample(100, seed=0)
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
monitor="ts"
)
for r in result:
print(r.trial.keys, r.sort_metric, r)
[] 46.84555314021837 {'trial': {'trial_id': '2c9456ad-f8a7-56df-9195-3266ffabd941', 'params': {'n_estimators': 20, 'max_depth': 3, 'random_state': 0}, 'metadata': {}, 'keys': []}, 'metric': 46.84555314021837, 'params': {'n_estimators': 20, 'max_depth': 3, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.84555314021837, 'log_time': datetime.datetime(2021, 10, 6, 23, 41, 19, 488640)}
[] 46.84555314021837 {'trial': {'trial_id': '2c9456ad-f8a7-56df-9195-3266ffabd941', 'params': {'n_estimators': 20, 'max_depth': 3, 'random_state': 0}, 'metadata': {}, 'keys': []}, 'metric': 46.84555314021837, 'params': {'n_estimators': 20, 'max_depth': 3, 'random_state': 0}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.84555314021837, 'log_time': datetime.datetime(2021, 10, 6, 23, 41, 23, 761028)}
To enable monitoring on a distributed engine, you must also enable remote callbacks. Without a shortcut, you have to set multiple configs, as shown below. With the fuggle
package, which sets up the callback shortcuts on Kaggle, it’s as simple as one config: callback: True
[20]:
space = Space(n_estimators=RandInt(1,20), max_depth=RandInt(2,10), random_state=0, n_jobs=1).sample(200, seed=0)
callback_conf = {
"fugue.rpc.server": "fugue.rpc.flask.FlaskRPCServer",
"fugue.rpc.flask_server.host": "0.0.0.0",
"fugue.rpc.flask_server.port": "1234",
"fugue.rpc.flask_server.timeout": "2 sec",
}
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
monitor="ts",
execution_engine = DaskExecutionEngine,
execution_engine_conf=callback_conf
)
for r in result:
print(r.trial.keys, r.sort_metric, r)
[] 46.89339381813802 {'trial': {'trial_id': 'af51195c-3da6-59e5-a4ab-9802041ab314', 'params': {'n_estimators': 20, 'max_depth': 5, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'keys': []}, 'metric': 46.89339381813802, 'params': {'n_estimators': 20, 'max_depth': 5, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.89339381813802, 'log_time': datetime.datetime(2021, 10, 6, 23, 42, 0, 265059)}
[] 46.89339381813802 {'trial': {'trial_id': 'af51195c-3da6-59e5-a4ab-9802041ab314', 'params': {'n_estimators': 20, 'max_depth': 5, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'keys': []}, 'metric': 46.89339381813802, 'params': {'n_estimators': 20, 'max_depth': 5, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 46.89339381813802, 'log_time': datetime.datetime(2021, 10, 6, 23, 42, 0, 265059)}
The monitoring shortcuts are:
ts
to monitor the up-to-date best metric collected
hist
to monitor the histogram of metrics collected
Early Stopping
When you enable monitoring, you often see the curve flatten quickly, so stopping the remaining trials early can save significant time. Like monitoring, early stopping on a distributed engine requires callbacks to be enabled (if you neither monitor nor stop early, you don’t need callbacks).
In tune
, you can also combine stoppers with logical operators
[21]:
from tune import small_improvement, n_updates
space = Space(n_estimators=RandInt(1,20), max_depth=RandInt(2,10), random_state=0, n_jobs=1).sample(200, seed=0)
callback_conf = {
"fugue.rpc.server": "fugue.rpc.flask.FlaskRPCServer",
"fugue.rpc.flask_server.host": "0.0.0.0",
"fugue.rpc.flask_server.port": "1234",
"fugue.rpc.flask_server.timeout": "2 sec",
}
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
monitor="ts",
# stop if at least 5 updates on best
# AND the last update on best improved less than 0.1 (abs value)
stopper= n_updates(5) & small_improvement(0.1,1),
execution_engine = DaskExecutionEngine,
execution_engine_conf=callback_conf
)
for r in result:
print(r.trial.keys, r.sort_metric, r)
[] 47.01773216903467 {'trial': {'trial_id': 'f84ce5f5-207b-5ab7-a81a-be80879d5431', 'params': {'n_estimators': 19, 'max_depth': 4, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'keys': []}, 'metric': 47.01773216903467, 'params': {'n_estimators': 19, 'max_depth': 4, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 47.01773216903467, 'log_time': datetime.datetime(2021, 10, 6, 23, 42, 40, 406040)}
[] 47.01773216903467 {'trial': {'trial_id': 'f84ce5f5-207b-5ab7-a81a-be80879d5431', 'params': {'n_estimators': 19, 'max_depth': 4, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'keys': []}, 'metric': 47.01773216903467, 'params': {'n_estimators': 19, 'max_depth': 4, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 47.01773216903467, 'log_time': datetime.datetime(2021, 10, 6, 23, 42, 40, 406040)}
The above example combines a warm-up period n_updates(5)
and an improvement check small_improvement(0.1,1)
so that it stops neither too early nor too late.
You can also customize a simple stopper
[22]:
from typing import List
from tune.noniterative.stopper import SimpleNonIterativeStopper
from tune import TrialReport
def less_than(v: float) -> SimpleNonIterativeStopper:
def func(current: TrialReport, updated: bool, reports: List[TrialReport]):
return current.sort_metric <= v
return SimpleNonIterativeStopper(func, log_best_only=True)
[23]:
result = suggest_for_noniterative_objective(
evaluate, space, top_n=1,
df = diabetes, df_name = "train_df",
monitor="ts",
stopper= less_than(49),
execution_engine = DaskExecutionEngine,
execution_engine_conf=callback_conf
)
for r in result:
print(r.trial.keys, r.sort_metric, r)
[] 47.74170052753941 {'trial': {'trial_id': 'b9ab0d11-991d-53d2-ad41-246dcbe23c22', 'params': {'n_estimators': 17, 'max_depth': 2, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'keys': []}, 'metric': 47.74170052753941, 'params': {'n_estimators': 17, 'max_depth': 2, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 47.74170052753941, 'log_time': datetime.datetime(2021, 10, 6, 23, 43, 15, 891806)}
[] 47.74170052753941 {'trial': {'trial_id': 'b9ab0d11-991d-53d2-ad41-246dcbe23c22', 'params': {'n_estimators': 17, 'max_depth': 2, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'keys': []}, 'metric': 47.74170052753941, 'params': {'n_estimators': 17, 'max_depth': 2, 'random_state': 0, 'n_jobs': 1}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 47.74170052753941, 'log_time': datetime.datetime(2021, 10, 6, 23, 43, 15, 891806)}
The stopper does a graceful stop, so after the stop criterion is met, some running trials on a distributed engine may still finish and report back; that is normal. If you want to stop faster, set for example stop_check_interval="5sec"
. But if you have a lot of workers, the frequent checks may burden the driver; the cost also depends on how compute-heavy your custom stopper is.
Notice: you must create a new stopper every time you call suggest_for_noniterative_objective
, because SimpleNonIterativeStopper
is stateful (the less_than factory above already does this by constructing a fresh instance on each call).
Non-Iterative Objective
Non-iterative objectives are objective functions with a single iteration. They do not report progress during execution to receive pruning decisions.
Interfaceless
The simplest way to construct a Tune-compatible non-iterative objective is to write a native Python function with type annotations.
[3]:
from typing import Tuple, Dict, Any
def objective1(a, b) -> float:
return a**2 + b**2
def objective2(a, b) -> Tuple[float, Dict[str, Any]]:
return a**2 + b**2, {"metadata":"x"}
If your function has float
or Tuple[float, Dict[str, Any]]
as the output annotation, it is a valid non-iterative objective for tune.
The Tuple[float, Dict[str, Any]]
form returns both the metric and metadata.
The following code demonstrates how the backend converts your simple functions into tune
compatible objects. You normally don’t need to do this yourself.
[5]:
from tune import to_noniterative_objective, Trial
f1 = to_noniterative_objective(objective1)
f2 = to_noniterative_objective(objective2, min_better=False)
trial = Trial("id", params=dict(a=1,b=1))
report1 = f1.safe_run(trial)
report2 = f2.safe_run(trial)
print(type(f1))
print(report1.metric, report1.sort_metric, report1.metadata)
print(report2.metric, report2.sort_metric, report2.metadata)
<class 'tune.noniterative.convert._NonIterativeObjectiveFuncWrapper'>
2.0 2.0 {}
2.0 -2.0 {'metadata': 'x'}
Decorator Approach
Using a decorator on top of the functions is equivalent, but then your functions depend on the tune
package.
[7]:
from tune import noniterative_objective
@noniterative_objective
def objective_3(a, b) -> float:
return a**2 + b**2
@noniterative_objective(min_better=False)
def objective_4(a, b) -> Tuple[float, Dict[str, Any]]:
return a**2 + b**2, {"metadata":"x"}
report3 = objective_3.safe_run(trial)
report4 = objective_4.safe_run(trial)
print(report3.metric, report3.sort_metric, report3.metadata)
print(report4.metric, report4.sort_metric, report4.metadata)
2.0 2.0 {}
2.0 -2.0 {'metadata': 'x'}
Interface Approach
With the interface approach, you can access all properties of a trial, and you can use more flexible logic to generate the sort metric.
[9]:
from tune import NonIterativeObjectiveFunc, TrialReport
class Objective(NonIterativeObjectiveFunc):
def generate_sort_metric(self, value: float) -> float:
return - value * 10
def run(self, trial: Trial) -> TrialReport:
params = trial.params.simple_value
metric = params["a"]**2 + params["b"]**2
return TrialReport(trial, metric, metadata=dict(m="x"))
report = Objective().safe_run(trial)
print(report.metric, report.sort_metric, report.metadata)
2.0 -20.0 {'m': 'x'}
Factory Method
Almost all higher level APIs of tune
are using TUNE_OBJECT_FACTORY
to convert various objects to NonIterativeObjectiveFunc
.
[10]:
from tune import TUNE_OBJECT_FACTORY
assert isinstance(TUNE_OBJECT_FACTORY.make_noniterative_objective(objective1), NonIterativeObjectiveFunc)
assert isinstance(TUNE_OBJECT_FACTORY.make_noniterative_objective(objective_4), NonIterativeObjectiveFunc)
assert isinstance(TUNE_OBJECT_FACTORY.make_noniterative_objective(Objective()), NonIterativeObjectiveFunc)
That is why in the higher-level APIs, you can just pass in a very simple Python function as the objective, and tune
is still able to recognize it.
Actually, you can make it even more flexible by configuring the factory.
[11]:
def to_obj(obj):
if obj == "test":
return to_noniterative_objective(objective1, min_better=False)
if isinstance(obj, NonIterativeObjectiveFunc):
return obj
raise NotImplementedError
TUNE_OBJECT_FACTORY.set_noniterative_objective_converter(to_obj) # use to_obj to replace the built-in default converter
assert isinstance(TUNE_OBJECT_FACTORY.make_noniterative_objective("test"), NonIterativeObjectiveFunc)
If you customize in this way, you can pass in test
to the higher-level tuning APIs, and it will be recognized as a compatible objective.
This is a common approach in Fugue projects. It enables you to use mostly primitive data types to represent what you want to do. For advanced users, such configuration is a one-time effort that makes the code even simpler and less dependent on fugue
and tune
.
Non-Iterative Optimizers
Non-iterative optimizers, AKA Level 2 optimizers, are unified 3rd-party solutions for optimizing random expressions. Look at this space:
[1]:
from tune import Space, Grid, Rand
space = Space(a=Grid(1,2), b=Rand(0,1))
list(space)
[1]:
[{'a': 1, 'b': Rand(low=0, high=1, q=None, log=False, include_high=True)},
{'a': 2, 'b': Rand(low=0, high=1, q=None, log=False, include_high=True)}]
Grid
is for Level 1 optimization: all Level 1 parameters are converted to static values before execution, while Level 2 parameters are optimized at runtime by Level 2 optimizers. So in the above example, if we have a Spark cluster and Hyperopt, we can use Hyperopt to search for the best b
on each of the 2 configurations, and the 2 jobs are parallelized by Spark.
[3]:
from tune import noniterative_objective, Trial
@noniterative_objective
def objective(a, b) -> float:
return a**2 + b**2
trial = Trial("dummy", params=list(space)[0])
Use Directly
Notice that you normally don’t use these optimizers directly; instead, use them through the top-level APIs. This is just to demonstrate how they work.
Hyperopt
[5]:
from tune_hyperopt import HyperoptLocalOptimizer
hyperopt_optimizer = HyperoptLocalOptimizer(max_iter=200, seed=0)
report = hyperopt_optimizer.run(objective, trial)
print(report.sort_metric, report)
1.0000000001665414 {'trial': {'trial_id': 'dummy', 'params': {'a': 1, 'b': 1.2905089873156781e-05}, 'metadata': {}, 'keys': []}, 'metric': 1.0000000001665414, 'params': {'a': 1, 'b': 1.2905089873156781e-05}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 1.0000000001665414, 'log_time': datetime.datetime(2021, 10, 6, 23, 30, 51, 970344)}
Optuna
[7]:
from tune_optuna import OptunaLocalOptimizer
import optuna
optuna.logging.disable_default_handler()
optuna_optimizer = OptunaLocalOptimizer(max_iter=200)
report = optuna_optimizer.run(objective, trial)
print(report.sort_metric, report)
1.0000000003655019 {'trial': {'trial_id': 'dummy', 'params': {'a': 1, 'b': 1.9118105424729645e-05}, 'metadata': {}, 'keys': []}, 'metric': 1.0000000003655019, 'params': {'a': 1, 'b': 1.9118105424729645e-05}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 1.0000000003655019, 'log_time': datetime.datetime(2021, 10, 6, 23, 31, 26, 6566)}
As you see, we have unified the interfaces of these frameworks. In addition, we have unified the semantics of the random expressions, so the random sampling behavior is highly consistent across different 3rd-party solutions.
Use Top Level API
In the following example, we directly use the entire space
where you can mix grid search, random search and Bayesian Optimization.
[8]:
from tune import suggest_for_noniterative_objective
report = suggest_for_noniterative_objective(
objective, space, top_n=1,
local_optimizer=hyperopt_optimizer
)[0]
print(report.sort_metric, report)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
1.0000000001665414 {'trial': {'trial_id': '971ef4a5-71a9-5bf2-b2a4-f0f1acd02b78', 'params': {'a': 1, 'b': 1.2905089873156781e-05}, 'metadata': {}, 'keys': []}, 'metric': 1.0000000001665414, 'params': {'a': 1, 'b': 1.2905089873156781e-05}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 1.0000000001665414, 'log_time': datetime.datetime(2021, 10, 6, 23, 31, 43, 784128)}
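To actually parallelize the Level 1 configurations as described above, pass a distributed execution engine, just like the Dask examples in the tuning guide. A sketch with Spark (assuming pyspark and Fugue’s Spark support are installed):
from fugue_spark import SparkExecutionEngine

report = suggest_for_noniterative_objective(
    objective, space, top_n=1,
    local_optimizer=hyperopt_optimizer,
    execution_engine=SparkExecutionEngine,  # the 2 configurations run in parallel on Spark
)[0]
print(report.sort_metric, report)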
You can also provide only random expressions in the space and use it in the same way, which makes it look like the common cases in the examples above.
[14]:
report = suggest_for_noniterative_objective(
objective, Space(a=Rand(-1,1), b=Rand(-100,100)), top_n=1,
local_optimizer=optuna_optimizer
)[0]
print(report.sort_metric, report)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
0.04085386621249434 {'trial': {'trial_id': '45179c01-7358-5546-8f41-d7c6f120523f', 'params': {'a': 0.01604913454189394, 'b': 0.20148521408021614}, 'metadata': {}, 'keys': []}, 'metric': 0.04085386621249434, 'params': {'a': 0.01604913454189394, 'b': 0.20148521408021614}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 0.04085386621249434, 'log_time': datetime.datetime(2021, 10, 6, 23, 34, 47, 379901)}
Factory Method
In the above example, if we don’t set local_optimizer
, the default Level 2 optimizer is used, which can’t handle configurations containing random expressions. So we provide a nice way to make a certain optimizer the default one.
[10]:
from tune import NonIterativeObjectiveLocalOptimizer, TUNE_OBJECT_FACTORY
def to_optimizer(obj):
if isinstance(obj, NonIterativeObjectiveLocalOptimizer):
return obj
if obj is None or "hyperopt"==obj:
return HyperoptLocalOptimizer(max_iter=200, seed=0)
if "optuna" == obj:
return OptunaLocalOptimizer(max_iter=200)
raise NotImplementedError
TUNE_OBJECT_FACTORY.set_noniterative_local_optimizer_converter(to_optimizer)
Now Hyperopt becomes the default level 2 optimizer, and you can switch to Optuna by specifying a string parameter
[16]:
report = suggest_for_noniterative_objective(
objective, Space(a=Rand(-1,1), b=Rand(-100,100)), top_n=1
)[0] # using hyperopt
print(report.sort_metric, report)
report = suggest_for_noniterative_objective(
objective, Space(a=Rand(-1,1), b=Rand(-100,100)), top_n=1,
local_optimizer="optuna"
)[0] # using optuna
print(report.sort_metric, report)
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
0.02788888054657708 {'trial': {'trial_id': '45179c01-7358-5546-8f41-d7c6f120523f', 'params': {'a': -0.13745463941867586, 'b': -0.09484251498594332}, 'metadata': {}, 'keys': []}, 'metric': 0.02788888054657708, 'params': {'a': -0.13745463941867586, 'b': -0.09484251498594332}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 0.02788888054657708, 'log_time': datetime.datetime(2021, 10, 6, 23, 35, 19, 961138)}
0.010490219126635992 {'trial': {'trial_id': '45179c01-7358-5546-8f41-d7c6f120523f', 'params': {'a': 0.06699961867542388, 'b': -0.07746786575079878}, 'metadata': {}, 'keys': []}, 'metric': 0.010490219126635992, 'params': {'a': 0.06699961867542388, 'b': -0.07746786575079878}, 'metadata': {}, 'cost': 1.0, 'rung': 0, 'sort_metric': 0.010490219126635992, 'log_time': datetime.datetime(2021, 10, 6, 23, 35, 21, 593974)}
Tune Dataset
TuneDataset
contains the search space and all related dataframes with metadata for a tuning task.
A TuneDataset
should not be constructed by users directly. Instead, you should use TuneDatasetBuilder
or the factory method to construct it.
[1]:
from fugue_notebook import setup
setup(is_lab=True)
import pandas as pd
from tune import TUNE_OBJECT_FACTORY, TuneDatasetBuilder, Space, Grid
from fugue import FugueWorkflow
TUNE_OBJECT_FACTORY.make_dataset
is a wrapper of TuneDatasetBuilder
, making dataset construction even easier. But TuneDatasetBuilder
still offers the most flexibility. For example, it can add multiple dataframes with different join types, while TUNE_OBJECT_FACTORY.make_dataset
can add at most two dataframes (normally the train and validation dataframes).
[2]:
with FugueWorkflow() as dag:
builder = TuneDatasetBuilder(Space(a=1, b=2))
dataset = builder.build(dag)
dataset.data.show();
with FugueWorkflow() as dag:
dataset = TUNE_OBJECT_FACTORY.make_dataset(dag, Space(a=1, b=2))
dataset.data.show();
|   | __tune_trials__ |
|---|---|
| 0 | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |

|   | __tune_trials__ |
|---|---|
| 0 | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
Here are the equivalent ways to construct a TuneDataset
with a space and two dataframes.
In a TuneDataset
, every dataframe is partitioned by certain keys, and each partition is saved into a temp parquet file, so the temp path must be specified. Using the factory, you can call set_temp_path
once so you no longer need to provide the temp path explicitly; if you still provide a path, it will be used.
[3]:
pdf1 = pd.DataFrame([[0,1],[1,1],[0,2]], columns = ["a", "b"])
pdf2 = pd.DataFrame([[0,0.5],[2,0.1],[0,0.1],[1,0.3]], columns = ["a", "c"])
space = Space(a=1, b=Grid(1,2,3))
with FugueWorkflow() as dag:
builder = TuneDatasetBuilder(space, path="/tmp")
# here we must make pdf1 and pdf2 FugueWorkflowDataFrames, and they
# both need to be partitioned by the same keys so that each partition
# is saved to a temp parquet file, with the chunks of data
# replaced by file paths before the join
builder.add_df("df1", dag.df(pdf1).partition_by("a"))
builder.add_df("df2", dag.df(pdf2).partition_by("a"), how="inner")
dataset = builder.build(dag)
dataset.data.show();
TUNE_OBJECT_FACTORY.set_temp_path("/tmp")
with FugueWorkflow() as dag:
# this method is significantly simpler; as long as you don't have
# more than 2 dataframes for a tuning task, use this
dataset = TUNE_OBJECT_FACTORY.make_dataset(
dag, space,
df_name="df1", df=pdf1,
test_df_name="df2", test_df=pdf2,
partition_keys=["a"],
)
dataset.data.show();
|   | a | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|---|
| 0 | 0 | /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet | /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | 0 | /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet | /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | 0 | /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet | /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 3 | 1 | /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet | /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 4 | 1 | /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet | /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 5 | 1 | /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet | /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |

|   | a | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|---|
| 0 | 0 | /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet | /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | 0 | /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet | /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | 0 | /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet | /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 3 | 1 | /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet | /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 4 | 1 | /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet | /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 5 | 1 | /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet | /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
We got 6 rows because the space contains 3 configurations, and the dataframes, partitioned by a
and inner joined, produce 2 partitions, so in total there are 3 × 2 = 6 rows in the TuneDataset
.
Notice that the number of rows of a TuneDataset determines the max parallelism. In this case, if you assign 10 workers, 4 will always be idle.
Actually, a more common case is that we don’t partition the dataframes at all. For TUNE_OBJECT_FACTORY.make_dataset
we just need to remove the partition_keys
.
[4]:
with FugueWorkflow() as dag:
dataset = TUNE_OBJECT_FACTORY.make_dataset(
dag, space,
df_name="df1", df=pdf1,
test_df_name="df2", test_df=pdf2,
)
dataset.data.show();
|   | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|
| 0 | /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet | /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet | /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet | /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
But what if we want to partition on df1
but not on df2
? Then again, you can use TuneDatasetBuilder
[5]:
with FugueWorkflow() as dag:
builder = TuneDatasetBuilder(space, path="/tmp")
builder.add_df("df1", dag.df(pdf1).partition_by("a"))
# use cross join because there is no common key
builder.add_df("df2", dag.df(pdf2), how="cross")
dataset = builder.build(dag)
dataset.data.show();
|   | a | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|---|
| 0 | 0 | /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | 0 | /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | 0 | /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 3 | 1 | /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 4 | 1 | /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 5 | 1 | /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
Checkpoint
A Checkpoint is normally constructed and provided to you, but if you are interested, this section gives you some details.
[4]:
from tune import Checkpoint
from triad import FileSystem
root = FileSystem()
fs = root.makedirs("/tmp/test", recreate=True)
checkpoint = Checkpoint(fs)
print(len(checkpoint))
0
[5]:
!ls /tmp/test
[6]:
with checkpoint.create() as folder:
folder.writetext("a.txt", "test")
[7]:
!ls /tmp/test
STATE d9ed2530-20f1-42b3-8818-7fbf1b8eedf3
Here is how to create a new checkpoint under /tmp/test
[8]:
with checkpoint.create() as folder:
folder.writetext("a.txt", "test2")
[9]:
!ls /tmp/test/*/
/tmp/test/8d4e7fed-2a4c-4789-a732-0cb46294e704/:
a.txt
/tmp/test/d9ed2530-20f1-42b3-8818-7fbf1b8eedf3/:
a.txt
Here is how to get the latest checkpoint folder
[10]:
print(len(checkpoint))
print(checkpoint.latest.readtext("a.txt"))
2
test2