Tune Dataset

TuneDataset contains the search space and all related dataframes, with metadata, for a tuning task.

TuneDataset should not be constructed by users directly. Instead, use TuneDatasetBuilder or the factory method to construct it.

[1]:
from fugue_notebook import setup

setup(is_lab=True)

import pandas as pd
from tune import TUNE_OBJECT_FACTORY, TuneDatasetBuilder, Space, Grid
from fugue import FugueWorkflow

TUNE_OBJECT_FACTORY.make_dataset is a wrapper of TuneDatasetBuilder that makes dataset construction even easier. But TuneDatasetBuilder still has the most flexibility: for example, it can add multiple dataframes with different join types, while TUNE_OBJECT_FACTORY.make_dataset can add at most two dataframes (normally the train and validation dataframes).

[2]:
with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(Space(a=1, b=2))
    dataset = builder.build(dag)
    dataset.data.show();

with FugueWorkflow() as dag:
    dataset = TUNE_OBJECT_FACTORY.make_dataset(dag, Space(a=1, b=2))
    dataset.data.show();
__tune_trials__
0 gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
schema: __tune_trials__:str
__tune_trials__
0 gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
schema: __tune_trials__:str

Below are two equivalent ways to construct a TuneDataset from a space and two dataframes.

In TuneDataset, every dataframe is partitioned by certain keys, and each partition is saved into a temp parquet file. The temp path must be specified. With the factory, you can call set_temp_path once so you no longer need to provide the temp path explicitly; if you still provide a path, it will be used.
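To make the mechanism concrete, here is a simplified sketch in plain pandas (not tune's actual implementation) of what partitioning by a means: the dataframe splits into one group per key value, and each group would then be written to its own temp parquet file, with the file path standing in for the data.

```python
import pandas as pd

pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])

# split into one partition per distinct value of "a";
# tune would save each partition to its own temp parquet file
partitions = {key: part.reset_index(drop=True) for key, part in pdf1.groupby("a")}

print(len(partitions))     # 2 partitions: a=0 and a=1
print(len(partitions[0]))  # the a=0 partition has 2 rows
```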

[3]:
pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])
pdf2 = pd.DataFrame([[0, 0.5], [2, 0.1], [0, 0.1], [1, 0.3]], columns=["a", "c"])
space = Space(a=1, b=Grid(1,2,3))

with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(space, path="/tmp")
    # here we must convert pdf1 and pdf2 to Fugue WorkflowDataFrames, and
    # they both need to be partitioned by the same keys so that each
    # partition is saved to a temp parquet file; the chunks of data are
    # replaced by file paths before the join.
    builder.add_df("df1", dag.df(pdf1).partition_by("a"))
    builder.add_df("df2", dag.df(pdf2).partition_by("a"), how="inner")
    dataset = builder.build(dag)
    dataset.data.show();


TUNE_OBJECT_FACTORY.set_temp_path("/tmp")

with FugueWorkflow() as dag:
    # this method is significantly simpler; as long as you don't have
    # more than 2 dataframes for a tuning task, use this.
    dataset = TUNE_OBJECT_FACTORY.make_dataset(
        dag, space,
        df_name="df1", df=pdf1,
        test_df_name="df2", test_df=pdf2,
        partition_keys=["a"],
    )
    dataset.data.show();
a __tune_df__df1 __tune_df__df2 __tune_trials__
0 0 /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
1 0 /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
2 0 /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
3 1 /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
4 1 /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
5 1 /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
schema: a:long,__tune_df__df1:str,__tune_df__df2:str,__tune_trials__:str
a __tune_df__df1 __tune_df__df2 __tune_trials__
0 0 /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
1 0 /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
2 0 /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
3 1 /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
4 1 /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
5 1 /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
schema: a:long,__tune_df__df1:str,__tune_df__df2:str,__tune_trials__:str

We get 6 rows because the space expands into 3 configurations, and because the dataframes are partitioned by a and inner joined, only the 2 partitions with keys present in both (a=0 and a=1) remain. So in total there are 3 × 2 = 6 rows in the TuneDataset.
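The row count can be checked directly with plain pandas: the inner join keeps only the partition keys present in both dataframes.

```python
import pandas as pd

pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])
pdf2 = pd.DataFrame([[0, 0.5], [2, 0.1], [0, 0.1], [1, 0.3]], columns=["a", "c"])

# partition keys shared by both dataframes survive the inner join
common_keys = sorted(set(pdf1["a"]) & set(pdf2["a"]))
print(common_keys)                   # [0, 1]

n_configs = 3                        # Grid(1, 2, 3) expands to 3 configurations
print(n_configs * len(common_keys))  # 6 rows in the TuneDataset
```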

Notice that the number of rows of a TuneDataset determines the maximum parallelism. In this case, if you assign 10 workers, 4 will always be idle.

Actually, a more common case is that we don't partition the dataframes at all. With TUNE_OBJECT_FACTORY.make_dataset, we just need to remove partition_keys.

[4]:
with FugueWorkflow() as dag:
    dataset = TUNE_OBJECT_FACTORY.make_dataset(
        dag, space,
        df_name="df1", df=pdf1,
        test_df_name="df2", test_df=pdf2,
    )
    dataset.data.show();
__tune_df__df1 __tune_df__df2 __tune_trials__
0 /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
1 /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
2 /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
schema: __tune_df__df1:str,__tune_df__df2:str,__tune_trials__:str

But what if we want to partition df1 but not df2? Again, you can use TuneDatasetBuilder.

[5]:
with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(space, path="/tmp")
    builder.add_df("df1", dag.df(pdf1).partition_by("a"))
    # use cross join because there is no common key
    builder.add_df("df2", dag.df(pdf2), how="cross")
    dataset = builder.build(dag)
    dataset.data.show();
a __tune_df__df1 __tune_df__df2 __tune_trials__
0 0 /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
1 0 /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
2 0 /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
3 1 /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
4 1 /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
5 1 /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln...
schema: a:long,__tune_df__df1:str,__tune_df__df2:str,__tune_trials__:str