Getting Started
Installation
To install the CCA/PLS Toolkit, clone the repository from GitHub using the following command:
git clone https://github.com/mlnl/cca_pls_toolkit
After the toolkit is downloaded, go to the folder containing the toolkit and open MATLAB. In general, we advise you to run all commands from this toolkit folder.
To initialize the toolkit, run the following line in the MATLAB command window:
set_path;
Dependencies
The neuroimaging community has developed excellent visualization and other tools, so we decided to use some of them for specific purposes in the toolkit. Depending on the analyses and plots you would like to make, you will need to download some of the toolboxes below. We recommend two ways of adding toolboxes to your path:

- if you already have a toolbox on your computer, you can simply add it to the path in its current location,
- you can place the toolboxes in a dedicated folder (called `external`) inside the toolkit.
For easier management of the dependencies, we assume in the following that all toolboxes are stored in a dedicated folder within the toolkit. To create this folder, run the following line in the MATLAB command window:
mkdir external
Importantly, this `external` folder (and its content) is added to `.gitignore` and thus it is not version controlled by git. On the one hand, this accommodates the specific needs of users, who install only the toolboxes essential for their specific analyses and plots. On the other hand, it avoids the toolkit becoming unnecessarily large due to potentially large external toolboxes.
Here is a complete list of toolboxes that you might need for using the toolkit:

- PALM (Permutation Analysis of Linear Models, used for permutation testing)
- SPM (Statistical Parametric Mapping, used for opening files and reading/writing `.nii` files)
- BrainNet Viewer (used for brain data visualization)
- AAL (Automated Anatomical Labelling, used for region-of-interest (ROI) analyses)
To know which dependencies you will need for your specific analysis, please see the Analysis and Visualization pages.
Below, we use PALM as an example to illustrate how to add a toolbox to the toolkit. Download PALM manually from the provided link, copy the PALM folder into your `external` folder, and finally add PALM to the MATLAB path by running the following line in the MATLAB command window:
set_path('PALM');
Toolbox overview
The toolkit consists of two main parts:
- Analysis
- Visualization
The reason behind this division is that an analysis can be run on a cluster without any need for graphical output, whereas visualization usually takes place on a local computer where graphical output is available. Of course, if both the analysis and the visualization are done on a local computer, the two can be easily combined, as demonstrated in the examples.
Analysis
The figure below illustrates the inputs and outputs of each analysis.
The main inputs of the analysis are:

- the `cfg` structure, a MATLAB variable created by the user to configure all the analysis settings,
- the `X.mat` and `Y.mat` files containing the two modalities of input data (i.e., the \(\mathbf{X}\) and \(\mathbf{Y}\) matrices).
Other input files can also be provided, for instance, a `C.mat` file containing the confounding variables of the analysis and an `EB.mat` file containing the (exchangeability) block structure of the data.
For details on the `cfg` structure and the input data files, see Analysis.
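As a minimal sketch (the project name is hypothetical, and the field names and the `main` entry point follow the toolkit's demo; `cfg_defaults` fills in every setting you leave unset):

```matlab
% Minimal configuration sketch; cfg_defaults completes all unset fields
cfg.dir.project = fullfile('projects', 'my_project');  % project folder (hypothetical name)
cfg.machine.name = 'spls';                             % CCA/PLS model to be fitted

cfg = cfg_defaults(cfg);  % validate cfg and fill in defaults
main(cfg);                % run the analysis
```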
The main outputs of the analysis are:

- the `results_table.txt` file including the summary of the results,
- various `.mat` files including, for instance, information about the data splits, the results of hyperparameter optimization and the trained models (for details, see below).
Next, we discuss the folder structure of your analysis and list the specific folders where the main input and output files are stored.
As illustrated in the figure above, a project consists of a project folder with two subfolders:
- a `data` folder including the input data files,
- a `framework` folder including the results of the analyses, with each analysis in a specific framework folder.
We need to create the project and `data` folders manually and place our input data files within the `data` folder (illustrated by a red box in the figure). To generate simulated data, see the generate_data function.
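For instance, a hypothetical project could be set up as follows (the project name and the random matrices are purely illustrative):

```matlab
% Create the project and data folders manually
project_dir = fullfile('projects', 'my_project');
mkdir(fullfile(project_dir, 'data'));

% Illustrative input data: 100 examples with two modalities
X = randn(100, 50);  % e.g., 50 brain features
Y = randn(100, 5);   % e.g., 5 behavioural measures
save(fullfile(project_dir, 'data', 'X.mat'), 'X');
save(fullfile(project_dir, 'data', 'Y.mat'), 'Y');
```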
All the other folders and output files will be created by the toolkit during the analysis. The specific framework folder is generated by the toolkit based on your CCA/PLS model and framework choice. For instance, an SPLS analysis with a single holdout set (20% of the data) and 10 validation sets (20% of the optimization set) will generate the `spls_holdout1-0.20_subsamp10-0.20` folder. If you want to specify a custom name for this analysis, you can change `cfg.frwork.flag` from its default empty value, which will then append a flag to your specific framework name. For instance, `cfg.frwork.flag = '_TEST'` will create the `spls_holdout1-0.20_subsamp10-0.20_TEST` folder.
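A sketch of the relevant settings (the `cfg.frwork.name` field is an assumption based on the naming pattern above; check `cfg_defaults` for the authoritative field list):

```matlab
cfg.machine.name = 'spls';    % model part of the folder name
cfg.frwork.name = 'holdout';  % framework with holdout/validation splits (assumed field)
cfg.frwork.flag = '_TEST';    % custom flag appended to the folder name
% --> framework folder: spls_holdout1-0.20_subsamp10-0.20_TEST
```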
Each analysis will produce the following output files in the specific framework folder:

- a `cfg*.mat` file including the `cfg` structure you created, completed with the necessary defaults by `cfg_defaults`,
- an `outmat*.mat` file including the training and test indexes of the outer splits of the data (i.e., optimization and holdout sets),
- an `inmat*.mat` file including the training and test indexes of the inner splits of the data (i.e., inner training and validation sets).
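You can peek at these files without knowing the variable names inside, for instance by listing their contents (the `_1` file suffix and the folder path are assumptions; see "Running on a cluster" below for how the suffix arises):

```matlab
frwork_dir = fullfile('projects', 'my_project', 'framework', ...
    'spls_holdout1-0.20_subsamp10-0.20');            % example framework folder
whos('-file', fullfile(frwork_dir, 'outmat_1.mat'))  % outer split indexes
whos('-file', fullfile(frwork_dir, 'inmat_1.mat'))   % inner split indexes
```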
The other output files are stored in specific folders, with each folder having one or multiple levels of results, where each level stands for an associative effect (e.g., first associative effect in folder `level1`, second associative effect in folder `level2`):
- a `grid` folder including the results of the hyperparameter optimization in `grid*.mat` files,
- a `load` folder with a `preproc` folder including the preprocessed (e.g., z-scored) data in `preproc*.mat` files and an `svd` folder including the Singular Value Decomposition (SVD) of the preprocessed data in `*svd*.mat` files (SVD is needed for the computational efficiency of the toolkit),
- a `perm` folder including the results of the permutation testing in `perm*.mat` files,
- a `res` folder including the best (or fixed) hyperparameters in a `param*.mat` file, the main trained models in a `model*.mat` file, additional results in a `res*.mat` file and the summary of the results in the `results_table.txt` file.
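Putting this together, a single-level framework folder might look like the following (an illustrative layout with the default `_1` file suffix; exact file names can vary):

```
spls_holdout1-0.20_subsamp10-0.20/
├── cfg_1.mat
├── outmat_1.mat
├── inmat_1.mat
├── grid/level1/grid_1.mat
├── load/level1/preproc/preproc_1.mat
├── load/level1/svd/svd_1.mat
├── perm/level1/perm_1.mat
└── res/level1/param_1.mat, model_1.mat, res_1.mat, results_table.txt
```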
For additional details on the output data files, see Analysis.
Visualization
The figure below illustrates the inputs and outputs of visualizing the results.
The main inputs of the visualization are:

- the `res` structure, a MATLAB variable loaded from `res*.mat` and appended with settings for visualization,
- `.mat` files, either as outputs of the analysis or other files including data (e.g., `mask.mat` including a mask for connectivity data) or other settings for visualization (e.g., `options.mat` including BrainNet Viewer settings),
- `.csv` files, which are label files including information about the variables in your \(\mathbf{X}\) and \(\mathbf{Y}\) matrices,
- a `.nv` file, which is a surface mesh file in BrainNet Viewer used as a template to overlay your brain weights on,
- `.nii` files, which can be an atlas file defining regions of interest (ROI) in the brain or a mask file for voxel-wise structural MRI data.
The main outputs of the visualization are:

- images of figures in any requested standard file format, e.g., `.fig`, `.png`, `.svg`,
- `.csv` files including information about the plotted results (e.g., ordered model weights).
For additional details on the `res` structure and the other input and output data files of visualization, see Visualization.
Info
We also highly recommend going through the defaults of `cfg` and `res` so that you thoroughly understand the analysis and visualization settings.
You can find detailed documentation of each high-level function of the toolkit under the main menu Functions; for instance, see cfg_defaults.
Finally, if you want to get started and run an experiment on your local computer, see Demo for a complete analysis and visualization. In addition, we provide three simple examples to generate and analyse simulated data, simulated structural MRI data, and simulated fMRI data.
Next, we briefly discuss how to run an analysis on a cluster.
Running on a cluster
The toolkit can be run in multiple MATLAB instances, e.g., in a cluster environment. If you use the SGE or SLURM scheduling system, you simply need to send the same analysis to different nodes and the toolkit will take care of the rest. If you use a different scheduling system, then a one-line modification of the `cfg_defaults` function is needed to account for your scheduling system, and you are ready to go. Feel free to get in touch with us to help you set this up.
Here is a brief description of what happens under the hood when the toolkit is running in different MATLAB instances. Although MATLAB can load the same `.mat` file from different MATLAB instances, it cannot save to the same `.mat` file. To work around this, the toolkit appends the unique ID of the computing node/job to the end of each `.mat` file name (in a local environment, the ID is set to `_1` by default), so even if different jobs save the same content simultaneously, they will write to different files. In addition, a handful of wrapper functions save, search and load these `.mat` files, and the following mechanisms are in place to share the computational cost:
- jobs regularly save computations to `.mat` files on disc,
- a random seed is set before each time-consuming operation in the toolkit (e.g., grid search and permutation testing), hence different jobs will most likely perform different computations,
- jobs can load `.mat` files computed by other jobs,
- jobs regularly check which `.mat` files are available and only start a new computation if it has not yet been performed by another job,
- if two MATLAB instances save the same computation to a `.mat` file, they will write to different files due to their different job IDs.
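To make the naming scheme concrete, here is a simplified sketch (not the toolkit's actual code) of how a job-specific suffix avoids write collisions:

```matlab
% Derive a unique ID from the scheduler (SLURM shown); default to '1' locally
jobid = getenv('SLURM_ARRAY_TASK_ID');
if isempty(jobid)
    jobid = '1';
end

% Each job writes the same kind of result to its own file, e.g., grid_3.mat
results = struct('acc', rand);                  % placeholder computation
save(sprintf('grid_%s.mat', jobid), 'results');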
You might ask: doesn't this computational strategy create a lot of intermediate and some duplicate files? Indeed it does; however, the `cleanup_files` function allows you to remove all the intermediate and duplicate files after your analysis is done. Do not forget to use it, as otherwise you might exceed your disc space or have difficulties moving your results around, e.g., from a cluster to a local computer.
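For instance (assuming `cleanup_files` takes the `cfg` of the finished analysis; see Functions for its exact signature):

```matlab
cleanup_files(cfg);  % remove intermediate and duplicate .mat files
```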
An extra benefit of this computational strategy is that if your analysis is aborted (e.g., you run out of allocated time on a cluster), you can simply restart the same analysis and it will pick up the computations where it left off.