Replication
First clone this repo: https://github.com/EduardoVernier/dynamic-projections
Recreating the results / Testing new methods and datasets
Set up the virtual environment and dependencies using pipenv (https://pipenv.readthedocs.io/en/latest/):
pip install pipenv
pipenv run pip install pip==18.0
pipenv install
sudo apt-get install python3-tk
To run a script, use pipenv run python <script_name>.py. To open notebooks, use pipenv run jupyter notebook, or create a new shell with pipenv shell and then call jupyter notebook.
Generating the projections
Autoencoders — ./Models/ae
The notebooks should contain information about total training time and performance metrics (training/test accuracy and loss). The Shared.py file contains methods that might be useful for all notebooks and projection techniques, e.g., saving projections and loading data.
Dynamic/static t-sne — ./Models/tsne
From the root folder, add the tsne folder to the PYTHONPATH and then run the dtsne_wrapper script.
export PYTHONPATH=${PYTHONPATH}:${PWD}/Models/tsne
python Models/tsne/dtsne_wrapper.py ./Datasets/gaussians 70 0.1
The default options are n_epochs=200 and sigma_iters=50.
For static t-sne with strategies 1 and 4 (of the dt-sne paper):
export PYTHONPATH=${PYTHONPATH}:${PWD}/Models/tsne
python Models/tsne/tsne_s1.py ./Datasets/gaussians 70 # or
python Models/tsne/tsne_s4.py ./Datasets/gaussians 70
Principal component analysis — ./Models/pca
export PYTHONPATH=${PYTHONPATH}:${PWD}/Models/
python Models/pca/pca_s1.py ./Datasets/gaussians # or
python Models/pca/pca_s4.py ./Datasets/gaussians
UMAP — ./Models/umap
export PYTHONPATH=${PYTHONPATH}:${PWD}/Models/
python Models/umap/umap_s1.py ./Datasets/gaussians # or
python Models/umap/umap_s4.py ./Datasets/gaussians
Formatting
Image datasets – The directory hierarchy doesn't matter; all the metadata should be contained in the file name, in the form <class>-<id>-<time>.png. For example, airplane-1234-10.png is the 10th revision of the airplane with id 1234.
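The naming convention above can be unpacked with a few lines of Python. This is only a sketch: ImageMeta and parse_image_name are hypothetical names that are not part of the repo, and splitting from the right is an assumption made so that a class name containing a hyphen would still parse.

```python
from pathlib import Path
from typing import NamedTuple


class ImageMeta(NamedTuple):
    label: str    # <class> part of the file name
    item_id: str  # <id> part, kept as a string
    time: int     # <time> part (revision / timestep)


def parse_image_name(path: str) -> ImageMeta:
    """Split '<class>-<id>-<time>.png' into its three fields.

    Hypothetical helper: the directory part of the path is ignored,
    since the directory hierarchy carries no metadata.
    """
    stem = Path(path).stem
    # Split from the right so that only the last two hyphens are separators.
    label, item_id, time = stem.rsplit("-", 2)
    return ImageMeta(label, item_id, int(time))


meta = parse_image_name("some/nested/dir/airplane-1234-10.png")
print(meta.label, meta.item_id, meta.time)
```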
Tabular datasets – Each timestep is a single csv file named <dataset_name>-<time>.csv. The first column is the id and the remaining n columns are the features. As far as we can tell, this dtsne implementation only handles numerical features, so categorical features are not supported for now.
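Because the timestep lives in the file name, a plain alphabetical sort would place timestep 10 before timestep 2. The sketch below orders the files by their numeric timestep instead. ordered_timesteps is a hypothetical helper, not something the repo provides, and it simply skips files that do not match the pattern.

```python
import re
from pathlib import Path

# Matches '<dataset_name>-<time>.csv'; the last hyphen separates the timestep.
_TIMESTEP = re.compile(r"^(?P<name>.+)-(?P<time>\d+)\.csv$")


def ordered_timesteps(filenames):
    """Return (time, filename) pairs sorted by numeric timestep.

    Hypothetical helper: files that do not follow the naming
    convention are ignored rather than treated as errors.
    """
    pairs = []
    for filename in filenames:
        match = _TIMESTEP.match(Path(filename).name)
        if match is None:
            continue
        pairs.append((int(match.group("time")), filename))
    return sorted(pairs)


files = ["gaussians-10.csv", "gaussians-2.csv", "gaussians-0.csv", "notes.txt"]
for time, filename in ordered_timesteps(files):
    print(time, filename)
```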
Output data (actual projections) — ./Output
A single csv file per projection, with information about the model in the name, in the format <dataset>-<model_info>.csv, as in quickdraw-AE_728c_200c_p_d_200f_500f_2f.csv. That string is a hypothetical filename for the results of a projection using an AE with two convolutional layers of 728 and 200 kernels each, followed by max pooling and dropout layers and three dense layers of 200, 500 and 2 neurons each. As for the contents of the file, the first column is the id, and the next are t0d0, t0d1, ... t0dX, t1d0, ..., tTdX. In each column name, the number after t is the timestep and the number after d is the dimension of the projected value.
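To make the column layout concrete, the following sketch reads such a file into one trail per id, i.e. a list of timesteps, each holding the projected coordinates. It assumes the file starts with a header row carrying the t<T>d<D> column names. That is an assumption about the files, and read_projection is a hypothetical helper rather than the repo's own loader.

```python
import csv
import io
import re

# Column names look like 't3d1': timestep 3, dimension 1.
_COLUMN = re.compile(r"^t(\d+)d(\d+)$")


def read_projection(handle):
    """Return {id: [[d0, d1, ...] for each timestep]} from an open csv file.

    Assumes (not confirmed by the repo) a header row whose first entry
    names the id column and whose remaining entries are t<T>d<D> names.
    """
    reader = csv.reader(handle)
    header = next(reader)
    coords = []
    for name in header[1:]:
        match = _COLUMN.match(name.strip())
        if match is None:
            raise ValueError(f"unexpected column name: {name!r}")
        coords.append((int(match.group(1)), int(match.group(2))))
    n_steps = max(t for t, _ in coords) + 1
    n_dims = max(d for _, d in coords) + 1

    trails = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        trail = [[0.0] * n_dims for _ in range(n_steps)]
        for (t, d), value in zip(coords, row[1:]):
            trail[t][d] = float(value)
        trails[row[0]] = trail
    return trails


# Tiny made-up sample: two points, two timesteps, two dimensions.
sample = "id,t0d0,t0d1,t1d0,t1d1\na,0.0,1.0,0.5,1.5\nb,2.0,3.0,2.5,3.5\n"
trails = read_projection(io.StringIO(sample))
print(trails["a"])  # [[0.0, 1.0], [0.5, 1.5]]
```

With a real file, open("Output/gaussians-pca_s4.csv", newline="") would take the place of the io.StringIO sample.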
Visualizing the projections
There is a simple python tool based on matplotlib to quickly show and help us debug the generated projections. To use it, call
python Vis/Main.py ./Output/gaussians-pca_s4.csv ./Output/gaussians-AE_10f_2f_20ep.csv
python Vis/Main.py $(find Output/ -type f -name 'cartolastd*')
Computing the metrics
The code for the metrics is located in a notebook called template.ipynb. For each dataset we use a tool called Papermill to instantiate a new notebook from the template. Two parameters are needed: the output notebook path (remember to name it after the dataset_id) and the list of output/projection files we want to analyse. This is the command that generates the analysis for the gaussians dataset:
papermill Metrics/template.ipynb ./Metrics/gaussians.ipynb --log-output -p projection_paths 'Output/gaussians-AE_10f_10f_2f_20ep.csv Output/gaussians-AE_10f_2f_20ep.csv Output/gaussians-tsne_s1_70p.csv Output/gaussians-tsne_s4_70p.csv Output/gaussians-dtsne_70p_0-1l.csv Output/gaussians-pca_s1.csv Output/gaussians-pca_s4.csv'
The results are written to a csv file in the ./Metrics/Results directory.
To follow the (tqdm) progress in real time, watch the log_<dataset_id> file.
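When running the metrics for several datasets, typing the long projection_paths string by hand gets error-prone. The sketch below assembles the same papermill command from a dataset id and a list of projection files. papermill_command is a hypothetical helper; it only mirrors the command line shown above and does not use any papermill Python API.

```python
import shlex


def papermill_command(dataset_id, projection_paths):
    """Build the argv list for the metrics run of one dataset.

    Hypothetical helper: the paths are joined with spaces into a single
    projection_paths parameter, as in the command shown above.
    """
    return [
        "papermill",
        "Metrics/template.ipynb",
        f"./Metrics/{dataset_id}.ipynb",
        "--log-output",
        "-p", "projection_paths",
        " ".join(projection_paths),
    ]


cmd = papermill_command(
    "gaussians",
    ["Output/gaussians-pca_s1.csv", "Output/gaussians-pca_s4.csv"],
)
# Print a copy-pasteable shell command with the joined paths quoted.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can also be passed directly to subprocess.run(cmd, check=True) from inside the pipenv environment.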
Generating videos
papermill Plots/trails-video.ipynb Plots/temp.ipynb --log-output -p dataset_id gaussians
Plotting static trail viz
papermill Plots/trails-image.ipynb Plots/temp.ipynb --log-output -p dataset_id gaussians
Plotting the metric results
Simply run the Plots/plots.ipynb notebook.
Datasets
The notebooks and files that generated the datasets are available at https://drive.google.com/drive/u/1/folders/1MXJK2mqH015pAohuBawVIQeqgB38JAsy.