Introduction#
This guide follows the same format as the quickstart but explores further functionality provided by twinLab. In this Jupyter notebook we will:
- Upload a dataset to twinLab.
- List, view and summarise uploaded datasets.
- Use Emulator.train to create a surrogate model.
- List, view and summarise trained emulators.
- Use the model to make a prediction with Emulator.predict.
- Visualise the results and their uncertainty.
- Verify the model using Emulator.sample.
[1]:
# Standard imports
from pprint import pprint
# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Project imports
import twinlab as tl
====== TwinLab Client Initialisation ======
Version : 2.0.0
Server : https://twinlab.digilab.co.uk
Environment : /Users/mead/digiLab/twinLab-Demos/.env
Your twinLab information#
Confirm your twinLab version
[2]:
tl.versions()
[2]:
{'cloud': '2.0.0',
'modal': '0.2.0',
'library': '1.3.0',
'image': 'twinlab-prod'}
And view your user information, including how many credits you have.
[3]:
tl.user_information()
[3]:
{'username': 'alexander', 'credits': 1000}
Upload a dataset#
Datasets must be presented either as a pandas.DataFrame object or as a filepath that points to a CSV file which can be parsed into a pandas.DataFrame. Both must be formatted with clearly labelled columns. Here, we label the input (predictor) variable x and the output variable y. twinLab expects data in column-feature format, meaning each row represents a single data sample and each column represents a data feature.
twinLab contains a Dataset class with attributes and methods to process, view and summarise the dataset. Datasets must be created with a dataset_id, which is used to access them. The dataset can be uploaded using the upload method.
[4]:
x = [
0.6964691855978616,
0.28613933495037946,
0.2268514535642031,
0.5513147690828912,
0.7194689697855631,
0.42310646012446096,
0.9807641983846155,
0.6848297385848633,
0.48093190148436094,
0.3921175181941505,
]
y = [
-0.8173739564129022,
0.8876561174050408,
0.921552660721474,
-0.3263338765412979,
-0.8325176123242133,
0.4006686354731812,
-0.16496626502368078,
-0.9607643657025954,
0.3401149876855609,
0.8457949914442409,
]
# Creating the dataframe using the above arrays
df = pd.DataFrame({"x": x, "y": y})
# View the dataset before uploading
display(df)
# Define the name of the dataset
dataset_id = "example_data"
# Intialise a Dataset object
dataset = tl.Dataset(id=dataset_id)
# Upload the dataset
dataset.upload(df, verbose=True)
| | x | y |
|---|---|---|
| 0 | 0.696469 | -0.817374 |
| 1 | 0.286139 | 0.887656 |
| 2 | 0.226851 | 0.921553 |
| 3 | 0.551315 | -0.326334 |
| 4 | 0.719469 | -0.832518 |
| 5 | 0.423106 | 0.400669 |
| 6 | 0.980764 | -0.164966 |
| 7 | 0.684830 | -0.960764 |
| 8 | 0.480932 | 0.340115 |
| 9 | 0.392118 | 0.845795 |
Dataframe is uploading.
Processing dataset
Dataset example_data was processed.
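The upload above used an in-memory DataFrame. As noted earlier, a dataset can also be given as a filepath pointing to a CSV file. A minimal sketch, assuming upload accepts the filepath directly and assuming a hypothetical local file example_data.csv with the same x and y columns:
# Upload from a CSV file instead of an in-memory DataFrame
# (assumes a hypothetical local file "example_data.csv" with labelled columns x and y)
csv_dataset = tl.Dataset(id="example_data_from_csv")
csv_dataset.upload("example_data.csv", verbose=True)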
View datasets#
Once a dataset has been uploaded it can be easily accessed using built-in twinLab functions: a list of all uploaded datasets can be produced, and individual datasets can be viewed and summarised. The summary contains some basic statistics of the data.
[5]:
# List all datasets on cloud
tl.list_datasets()
[5]:
['2DActive_Data',
'Excel-test',
'Falmouth-Mikey',
'New_Points',
'biscuits',
'eval_data',
'example_data',
'functional-data',
'functional-test-data',
'fusion',
'sampled-data',
'twinLab-logo']
[6]:
# View the dataset
dataset.view()
[6]:
| | x | y |
|---|---|---|
| 0 | 0.696469 | -0.817374 |
| 1 | 0.286139 | 0.887656 |
| 2 | 0.226851 | 0.921553 |
| 3 | 0.551315 | -0.326334 |
| 4 | 0.719469 | -0.832518 |
| 5 | 0.423106 | 0.400669 |
| 6 | 0.980764 | -0.164966 |
| 7 | 0.684830 | -0.960764 |
| 8 | 0.480932 | 0.340115 |
| 9 | 0.392118 | 0.845795 |
[7]:
# Get a statistical summary of the dataset
dataset.summarise()
[7]:
| | x | y |
|---|---|---|
| count | 10.000000 | 10.000000 |
| mean | 0.544199 | 0.029383 |
| std | 0.229352 | 0.748191 |
| min | 0.226851 | -0.960764 |
| 25% | 0.399865 | -0.694614 |
| 50% | 0.516123 | 0.087574 |
| 75% | 0.693559 | 0.734513 |
| max | 0.980764 | 0.921553 |
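For reference, Dataset.summarise returns the same basic statistics that pandas computes locally with describe, so you can cross-check the cloud summary against the original DataFrame:
# Cross-check: the local pandas summary should match the cloud summary above
display(df.describe())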
Train an emulator#
The Emulator class is used to train and implement your surrogate models. As with datasets, an id is defined; this is the name under which the model is saved in the cloud. When training a model, the arguments are passed using a TrainParams object, a class that contains all the parameters needed to train your model. To train the model we use the Emulator.train function, passing the TrainParams object as an argument.
[8]:
# Initialise emulator
emulator_id = "example_emulator"
emulator = tl.Emulator(id=emulator_id)
# Define the training parameters for your emulator
params = tl.TrainParams(train_test_ratio=1.0)
# Train the emulator using the train method
emulator.train(dataset=dataset, inputs=["x"], outputs=["y"], params=params, verbose=True)
Model example_emulator has begun training.
Training complete!
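TrainParams accepts more than the train/test split; the Emulator.view() output below lists the full set of parameters. A minimal sketch using two fields that appear in that output (the values here are illustrative, not recommendations):
# Illustrative TrainParams: both fields appear in the Emulator.view() output below
params_alt = tl.TrainParams(
    train_test_ratio=0.8,  # hold back 20% of the rows for testing
    seed=123,              # fix the random seed for reproducible training
)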
View emulators#
Just as with datasets all saved emulators can be listed, viewed and summarised.
[9]:
# List emulators
tl.list_emulators()
[9]:
['2DActiveGP',
'Example_emulator',
'Excel-emulator',
'Excel-model',
'Hello',
'backward-model',
'biscuits',
'campaign',
'decoder',
'example_emulator',
'fusion',
'gardening',
'my_emulator',
'new-campaign',
'twinLab-logo',
'universe']
[10]:
# View an emulator's parameters
emulator.view()
[10]:
{'model_id': 'example_emulator',
'fidelity': None,
'estimator': 'gaussian_process_regression',
'estimator_kwargs': {'detrend': False,
'device': 'cpu',
'covar_module': None,
'estimator_type': None},
'decompose_input': False,
'input_explained_variance': 0.99,
'decompose_output': False,
'output_explained_variance': 0.99,
'train_test_ratio': 1.0,
'model_selection': False,
'model_selection_kwargs': {'seed': None,
'evaluation_metric': 'MSLL',
'val_ratio': 0.2,
'base_kernels': 'restricted',
'depth': 1,
'beam': None,
'resource_per_trial': {'cpu': 1, 'gpu': 0}},
'seed': None,
'inputs': ['x'],
'outputs': ['y'],
'dataset_id': 'example_data',
'modal_handle': 'fc-ktRp3Nc749lshsKZvgJH4y'}
[11]:
# Get a summary of the trained emulator
pprint(emulator.summarise())
{'model_summary': {'data_diagnostics': {'inputs': {'x': {'25%': 0.39986475367672814,
'50%': 0.5161233352836261,
'75%': 0.693559323844612,
'count': 10.0,
'max': 0.9807641983846156,
'mean': 0.544199352975335,
'min': 0.2268514535642031,
'std': 0.22935216613691597}},
'outputs': {'y': {'25%': -0.6946139364450011,
'50%': 0.0875743613309401,
'75%': 0.7345134024514759,
'count': 10.0,
'max': 0.921552660721474,
'mean': 0.029383131672480845,
'min': -0.9607643657025954,
'std': 0.7481906564998719}}},
'estimator_diagnostics': {'covar_module': 'ScaleKernel(\n'
' (base_kernel): '
'MaternKernel(\n'
' '
'(lengthscale_prior): '
'GammaPrior()\n'
' '
'(raw_lengthscale_constraint): '
'Positive()\n'
' )\n'
' '
'(outputscale_prior): '
'GammaPrior()\n'
' '
'(raw_outputscale_constraint): '
'Positive()\n'
')',
'covar_module.base_kernel.lengthscale_prior.concentration': 3.0,
'covar_module.base_kernel.lengthscale_prior.rate': 6.0,
'covar_module.base_kernel.original_lengthscale': [[0.4232063885665337]],
'covar_module.base_kernel.raw_lengthscale': [[-0.6408405631160488]],
'covar_module.base_kernel.raw_lengthscale_constraint.lower_bound': 0.0,
'covar_module.base_kernel.raw_lengthscale_constraint.upper_bound': inf,
'covar_module.original_outputscale': 1.7130960752713094,
'covar_module.outputscale_prior.concentration': 2.0,
'covar_module.outputscale_prior.rate': 0.15000000596046448,
'covar_module.raw_outputscale': 1.514271061131159,
'covar_module.raw_outputscale_constraint.lower_bound': 0.0,
'covar_module.raw_outputscale_constraint.upper_bound': inf,
'input_transform._coefficient': [[0.7539127448204125]],
'input_transform._offset': [[0.2268514535642031]],
'likelihood.noise_covar.noise_prior.concentration': 1.100000023841858,
'likelihood.noise_covar.noise_prior.rate': 0.05000000074505806,
'likelihood.noise_covar.original_noise': [0.031576525703137535],
'likelihood.noise_covar.raw_noise': [0.031576525703137535],
'likelihood.noise_covar.raw_noise_constraint.lower_bound': 9.999999747378752e-05,
'likelihood.noise_covar.raw_noise_constraint.upper_bound': inf,
'mean_module': 'ConstantMean()',
'mean_module.original_constant': 0.21052500685316855,
'mean_module.raw_constant': 0.21052500685316855,
'outcome_transform._stdvs_sq': [[0.5597892584737093]],
'outcome_transform.means': [[0.029383131672480856]],
'outcome_transform.stdvs': [[0.7481906564998719]]},
'transformer_diagnostics': []}}
Prediction using the trained emulators#
The surrogate model is now trained and saved to the cloud under the emulator_id, so it can be used to make predictions. First, define a dataset of the inputs for which you want to find outputs; ensure that this is a pandas.DataFrame object. Then call Emulator.predict, passing in the evaluation dataset.
[12]:
# Define the inputs for the dataset
x_eval = np.linspace(0, 1, 128)
# Convert to a dataframe
df_eval = pd.DataFrame({"x": x_eval})
display(df_eval)
# Predict the results: Emulator.predict returns the mean and the standard deviation
df_mean, df_stdev = emulator.predict(df_eval)
result_df = pd.concat([df_mean, df_stdev], axis=1)
# Extract the underlying numpy arrays for plotting later
df_mean, df_stdev = df_mean["y"].values, df_stdev["y"].values
print(result_df.head())
| | x |
|---|---|
| 0 | 0.000000 |
| 1 | 0.007874 |
| 2 | 0.015748 |
| 3 | 0.023622 |
| 4 | 0.031496 |
| ... | ... |
| 123 | 0.968504 |
| 124 | 0.976378 |
| 125 | 0.984252 |
| 126 | 0.992126 |
| 127 | 1.000000 |
128 rows × 1 columns
y y
0 0.617689 0.656265
1 0.629105 0.640576
2 0.640630 0.624421
3 0.652252 0.607809
4 0.663957 0.590755
Viewing the results#
Emulator.predict outputs a mean value and a standard deviation for each input; this makes it possible to visualise the uncertainty in the results nicely.
[13]:
# Plot parameters
nsigs = [1, 2]
color = "blue"
alpha = 0.5
plot_training_data = True
plot_model_mean = True
plot_model_bands = True
# Plot results
grid = df_eval["x"]
mean = df_mean
err = df_stdev
if plot_model_bands:
    label = r"Model prediction"
    # Invisible fill, used only to create a single legend entry for the bands
    plt.fill_between(grid, np.nan, np.nan, lw=0, color=color, alpha=alpha, label=label)
    for isig, nsig in enumerate(nsigs):
        # Shade the mean +/- n-sigma bands, fading as sigma increases
        plt.fill_between(
            grid,
            mean - nsig * err,
            mean + nsig * err,
            lw=0,
            color=color,
            alpha=alpha / (isig + 1),
        )
if plot_model_mean:
    label = r"Model prediction" if not plot_model_bands else None
    plt.plot(grid, mean, color=color, alpha=alpha, label=label)
if plot_training_data:
    plt.plot(df["x"], df["y"], ".", color="black", label="Training data")
plt.xlim((0.0, 1.0))
plt.xlabel(r"$x$")
plt.ylabel(r"$y$")
plt.legend()
plt.show()
Sampling from an emulator#
The Emulator.sample function can be used to draw a number of sample results from your model. It requires the inputs at which you want to sample and the number of samples to draw for each input.
[14]:
# Define the sample inputs
sample_inputs = pd.DataFrame({"x": np.linspace(0, 1, 128)})
# Define number of samples to calculate for each input
num_samples = 100
# Calculate the samples using twinLab
sample_result = emulator.sample(sample_inputs, num_samples)
# View the results in the form of a dataframe
display(sample_result)
| | y | | | | | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| 0 | 1.264127 | 1.259498 | 0.664856 | 1.290327 | -1.061115 | 0.063356 | 1.326666 | 1.789418 | -0.009024 | 1.259869 | ... | 0.939294 | 0.362207 | 0.011521 | 1.053759 | -0.409518 | 1.014478 | 0.751929 | 1.641282 | 0.826082 | 0.469116 |
| 1 | 1.263028 | 1.274143 | 0.643489 | 1.300232 | -0.994251 | 0.108671 | 1.342242 | 1.764897 | 0.008574 | 1.246439 | ... | 0.978143 | 0.343955 | 0.050809 | 1.024005 | -0.418823 | 1.016179 | 0.779650 | 1.598208 | 0.826613 | 0.468406 |
| 2 | 1.260150 | 1.283785 | 0.621806 | 1.308611 | -0.921023 | 0.157252 | 1.354430 | 1.740139 | 0.031373 | 1.232385 | ... | 1.014755 | 0.328073 | 0.091420 | 0.993893 | -0.423898 | 1.014795 | 0.809190 | 1.552288 | 0.830950 | 0.468252 |
| 3 | 1.256019 | 1.290381 | 0.602094 | 1.315111 | -0.842858 | 0.207179 | 1.363444 | 1.714233 | 0.056631 | 1.218001 | ... | 1.048197 | 0.315876 | 0.130009 | 0.965887 | -0.424134 | 1.010628 | 0.840416 | 1.503579 | 0.839256 | 0.469101 |
| 4 | 1.249843 | 1.294567 | 0.584412 | 1.319314 | -0.762038 | 0.258642 | 1.367987 | 1.686284 | 0.081911 | 1.204734 | ... | 1.077572 | 0.307563 | 0.166610 | 0.942370 | -0.419536 | 1.003131 | 0.871847 | 1.452200 | 0.851765 | 0.470634 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | -0.243466 | -0.376759 | -0.222920 | -0.164613 | -0.123029 | -0.107526 | -0.491095 | -0.200305 | -0.075683 | -0.080205 | ... | -0.130376 | -0.088158 | 0.005756 | -0.193393 | -0.067758 | -0.131498 | -0.122326 | -0.211535 | -0.343941 | -0.117467 |
| 124 | -0.237385 | -0.361383 | -0.176733 | -0.150867 | -0.107936 | -0.042114 | -0.411948 | -0.214185 | -0.082167 | -0.053210 | ... | -0.099505 | -0.143001 | -0.020947 | -0.184918 | -0.064551 | -0.152278 | -0.089314 | -0.181028 | -0.336784 | -0.087687 |
| 125 | -0.229653 | -0.344955 | -0.129858 | -0.138228 | -0.093384 | 0.024250 | -0.334807 | -0.228415 | -0.090190 | -0.028539 | ... | -0.063867 | -0.199541 | -0.047168 | -0.171532 | -0.062122 | -0.173057 | -0.057060 | -0.150006 | -0.333920 | -0.053370 |
| 126 | -0.220961 | -0.326984 | -0.082629 | -0.126160 | -0.080773 | 0.088925 | -0.260510 | -0.242954 | -0.099340 | -0.005383 | ... | -0.024045 | -0.256805 | -0.072422 | -0.153905 | -0.059526 | -0.193384 | -0.026572 | -0.118515 | -0.335199 | -0.014476 |
| 127 | -0.211589 | -0.307667 | -0.034203 | -0.113222 | -0.070149 | 0.152874 | -0.191118 | -0.259736 | -0.108993 | 0.016120 | ... | 0.019210 | -0.314823 | -0.096437 | -0.133574 | -0.056565 | -0.213358 | 0.001427 | -0.087783 | -0.339115 | 0.027192 |
128 rows × 100 columns
Viewing the results#
The samples can be plotted over the top of the previous graph, giving a nice visualisation of the sampled data together with the model's uncertainty.
[15]:
# Plot parameters
color_curve = "blue"
alpha_curve = 0.10
color_data = "black"
plot_training_data = True
plot_model_bands = False
# Plot samples drawn from the model
if plot_training_data:
    plt.plot(df["x"], df["y"], ".", color=color_data, label="Training data")
# Overlay all sampled curves with a low alpha, so darker regions indicate higher probability
plt.plot(sample_inputs, sample_result["y"], color=color_curve, alpha=alpha_curve)
plt.xlim((0.0, 1.0))
plt.xlabel(r"$X$")
plt.ylabel(r"$y$")
plt.legend()
plt.show()
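The samples also give a numerical way to verify the emulator: the empirical mean and standard deviation across the 100 sampled curves should roughly match the values returned by Emulator.predict, with the agreement improving as num_samples grows. A quick check, reusing df_mean and df_stdev from the prediction cell above:
# Empirical statistics across the sample columns (axis=1 averages over the 100 samples)
empirical_mean = sample_result["y"].mean(axis=1).values
empirical_stdev = sample_result["y"].std(axis=1).values
# Both differences should be small, shrinking as num_samples increases
print(np.abs(empirical_mean - df_mean).max())
print(np.abs(empirical_stdev - df_stdev).max())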
Deleting datasets and emulators#
To keep your cloud storage tidy, you should delete your datasets and emulators when you are finished with them. Emulator.delete and Dataset.delete remove the emulator and the dataset from cloud storage, respectively.
[16]:
# Delete dataset
dataset.delete()
# Delete emulator
emulator.delete()
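You can confirm that both deletions went through by listing what remains on the cloud; neither id should appear any longer:
# Check that the dataset and emulator are no longer listed on the cloud
print(dataset_id not in tl.list_datasets())
print(emulator_id not in tl.list_emulators())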