Tune Dataset
TuneDataset contains the search space and all related dataframes with metadata for a tuning task.
TuneDataset should not be constructed by users directly. Instead, you should use TuneDatasetBuilder or the factory method to construct TuneDataset.
[1]:
from fugue_notebook import setup
setup(is_lab=True)
import pandas as pd
from tune import TUNE_OBJECT_FACTORY, TuneDatasetBuilder, Space, Grid
from fugue import FugueWorkflow
TUNE_OBJECT_FACTORY.make_dataset is a wrapper of TuneDatasetBuilder, making dataset construction even easier. But TuneDatasetBuilder still offers the most flexibility. For example, it can add multiple dataframes with different join types, while TUNE_OBJECT_FACTORY.make_dataset can add at most two dataframes (normally the train and validation dataframes).
[2]:
with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(Space(a=1, b=2))
    dataset = builder.build(dag)
    dataset.data.show();

with FugueWorkflow() as dag:
    dataset = TUNE_OBJECT_FACTORY.make_dataset(dag, Space(a=1, b=2))
    dataset.data.show();
|   | __tune_trials__ |
|---|---|
| 0 | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |

|   | __tune_trials__ |
|---|---|
| 0 | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
Here are the equivalent ways to construct a TuneDataset with a space and two dataframes.
In TuneDataset, every dataframe will be partitioned by certain keys, and each partition will be saved into a temp parquet file. The temp path must be specified. Using the factory, you can call set_temp_path once so you no longer need to provide the temp path explicitly; if you still provide a path, it will be used.
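To build intuition for the partition-then-save step, here is a simplified, hypothetical sketch (not tune's actual internals): split a dataframe by a key and write each partition to its own temp file, keeping a key-to-path mapping. TuneDataset does this with parquet files; the sketch uses CSV only to stay dependency-free.

```python
import os
import tempfile

import pandas as pd

# Sketch of partition-and-save: one temp file per distinct key value,
# plus a mapping from key to file path. TuneDataset stores these paths
# in the dataset rows instead of the raw data chunks.
pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])

tmp = tempfile.mkdtemp()
paths = {}
for key, part in pdf1.groupby("a"):
    path = os.path.join(tmp, f"part_a={key}.csv")  # tune uses parquet here
    part.to_csv(path, index=False)
    paths[key] = path

print(paths)  # one file per distinct value of "a"
```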
[3]:
pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])
pdf2 = pd.DataFrame([[0, 0.5], [2, 0.1], [0, 0.1], [1, 0.3]], columns=["a", "c"])
space = Space(a=1, b=Grid(1, 2, 3))

with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(space, path="/tmp")
    # here we must convert pdf1 and pdf2 to FugueWorkflowDataFrames, and they
    # both need to be partitioned by the same keys so each partition
    # will be saved to a temp parquet file, and the chunks of data are
    # replaced by file paths before the join
    builder.add_df("df1", dag.df(pdf1).partition_by("a"))
    builder.add_df("df2", dag.df(pdf2).partition_by("a"), how="inner")
    dataset = builder.build(dag)
    dataset.data.show();
TUNE_OBJECT_FACTORY.set_temp_path("/tmp")

with FugueWorkflow() as dag:
    # this method is significantly simpler; as long as you don't have more
    # than 2 dataframes for a tuning task, use this
    dataset = TUNE_OBJECT_FACTORY.make_dataset(
        dag, space,
        df_name="df1", df=pdf1,
        test_df_name="df2", test_df=pdf2,
        partition_keys=["a"],
    )
    dataset.data.show();
|   | a | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|---|
| 0 | 0 | /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet | /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | 0 | /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet | /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | 0 | /tmp/01b823d6-2d65-43be-898d-ed4d5b1ab582.parquet | /tmp/5c35d480-6fa8-4776-a0f9-770974b73bb4.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 3 | 1 | /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet | /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 4 | 1 | /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet | /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 5 | 1 | /tmp/15f2ec83-3494-4ba8-80a5-fa7c558c273c.parquet | /tmp/2fe00d9c-b690-49c6-87a5-d365d59066c6.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |

|   | a | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|---|
| 0 | 0 | /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet | /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | 0 | /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet | /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | 0 | /tmp/943302c8-2704-4b29-a2ac-64946352a90d.parquet | /tmp/9084e1ad-2156-4f3a-be36-52cf55d5c2fb.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 3 | 1 | /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet | /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 4 | 1 | /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet | /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 5 | 1 | /tmp/74fa6215-116d-4828-a49c-f58358a9b4e7.parquet | /tmp/0aa2aae2-3ab7-46e7-82e2-34a14ded2f0f.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
We got 6 rows because the space contains 3 configurations, and since the dataframes were partitioned by a and inner joined, there are 2 partitions. So in total there are 3 x 2 = 6 rows in the TuneDataset.
Notice that the number of rows of TuneDataset determines the max parallelism. For this case, if you assign 10 workers, 4 will always be idle.
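The row count above can be sketched with plain Python: the space expands into 3 configurations, the inner join on a leaves 2 partitions, and the dataset is their cross product. (The partition values here are hard-coded to match the example; this is arithmetic, not tune's API.)

```python
from itertools import product

# Grid(1, 2, 3) expands b into 3 values, so the space has 3 configurations
configs = [dict(a=1, b=b) for b in [1, 2, 3]]

# values of "a" common to pdf1 and pdf2 after the inner join: 2 partitions
partitions = [0, 1]

# each dataset row pairs one partition with one configuration
rows = list(product(partitions, configs))
print(len(rows))  # 2 partitions x 3 configurations = 6 rows
```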
Actually, a more common case is that we don't partition any of the dataframes at all. For TUNE_OBJECT_FACTORY.make_dataset we just need to remove partition_keys.
[4]:
with FugueWorkflow() as dag:
    dataset = TUNE_OBJECT_FACTORY.make_dataset(
        dag, space,
        df_name="df1", df=pdf1,
        test_df_name="df2", test_df=pdf2,
    )
    dataset.data.show();
|   | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|
| 0 | /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet | /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet | /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | /tmp/a774965e-d0df-417c-84d0-bb693ac337d1.parquet | /tmp/2f9a93cd-121b-4697-8fe9-0513aa6bcd82.parquet | gASVXwEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
But what if we want to partition on df1 but not on df2? Then again, you can use TuneDatasetBuilder.
[5]:
with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(space, path="/tmp")
    builder.add_df("df1", dag.df(pdf1).partition_by("a"))
    # use cross join because there is no common key
    builder.add_df("df2", dag.df(pdf2), how="cross")
    dataset = builder.build(dag)
    dataset.data.show();
|   | a | __tune_df__df1 | __tune_df__df2 | __tune_trials__ |
|---|---|---|---|---|
| 0 | 0 | /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 1 | 0 | /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 2 | 0 | /tmp/4e16f5d7-1dc2-438c-86c7-504502c3e1ad.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 3 | 1 | /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 4 | 1 | /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
| 5 | 1 | /tmp/058862d5-4c24-437e-ae38-c4810d071a11.parquet | /tmp/3b92a6f2-31aa-485e-a608-58dcdc925a3c.parquet | gASVYgEAAAAAAABdlIwYdHVuZS5jb25jZXB0cy5mbG93Ln... |
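The cross join above can be mimicked with plain pandas to see why every df1 partition points at the same df2 file. The paths and column names below are made up for illustration; only the shape of the result matters.

```python
import pandas as pd

# two partitions of df1 (partitioned by "a"), each saved to its own file
df1_parts = pd.DataFrame({"a": [0, 1], "df1_path": ["p0.parquet", "p1.parquet"]})
# df2 is not partitioned, so there is a single file for the whole dataframe
df2_file = pd.DataFrame({"df2_path": ["whole_df2.parquet"]})

# cross join: no common key needed, every df1 partition pairs with df2
joined = df1_parts.merge(df2_file, how="cross")
print(joined)  # 2 rows, both sharing the same df2 path
```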