molecular_simulations.analysis.autocluster module
- class molecular_simulations.analysis.autocluster.AutoKMeans(data_directory, pattern='', dataloader=<class 'molecular_simulations.analysis.autocluster.GenericDataloader'>, max_clusters=10, stride=1, reduction_algorithm='PCA', reduction_kws={'n_components': 2})
Bases: object
Performs automatic clustering using KMeans++ including dimensionality reduction of the feature space.
- Parameters:
data_directory (PathLike) – Directory where data files can be found.
pattern (str) – Optional filename pattern to select out subset of npy files using glob.
dataloader (Type[_T]) – Defaults to GenericDataloader. Which dataloader class to use for reading the data files.
max_clusters (int) – Defaults to 10. The maximum number of clusters to test during parameter sweep.
stride (int) – Defaults to 1. Linear stride over the number of clusters during the parameter sweep. Helps avoid testing too many values when the maximum number of clusters is high.
reduction_algorithm (str) – Defaults to PCA. Which dimensionality reduction algorithm to use. Currently only PCA is supported.
reduction_kws (dict[str, Any]) – Defaults to {'n_components': 2} for PCA. kwargs for supplied reduction_algorithm.
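The interaction between max_clusters and stride can be illustrated with a minimal sketch. The helper name `candidate_cluster_counts`, the lower bound of 2, and the inclusive upper bound are all assumptions made for illustration; the source does not state where the sweep starts or whether max_clusters itself is tested.

```python
def candidate_cluster_counts(max_clusters: int = 10, stride: int = 1, start: int = 2) -> list[int]:
    """Return the cluster counts a strided sweep would visit.

    Hypothetical helper: `start=2` and the inclusive upper bound are
    assumptions, not documented behavior of AutoKMeans.
    """
    return list(range(start, max_clusters + 1, stride))


# A larger stride thins out the values that get tested.
dense = candidate_cluster_counts(max_clusters=10, stride=1)
sparse = candidate_cluster_counts(max_clusters=10, stride=3)
```

Under these assumptions, a stride of 3 with max_clusters=10 visits three cluster counts instead of nine.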
- map_centers_to_frames()
Finds and stores the data point which lies closest to the cluster center for each cluster.
- Returns:
None
- Return type:
None
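The idea behind mapping centers to frames can be sketched with NumPy alone. This is not the module's implementation: the function name `closest_frames_to_centers` is hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import numpy as np


def closest_frames_to_centers(data: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Return, for each cluster center, the row index of the nearest data point.

    Hypothetical helper; assumes Euclidean distance in the (reduced) feature space.
    """
    # Pairwise differences, shape (n_points, n_clusters, n_features).
    diffs = data[:, None, :] - centers[None, :, :]
    # Squared Euclidean distances, shape (n_points, n_clusters).
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # For every cluster (column), pick the row with the smallest distance.
    return np.argmin(sq_dists, axis=0)


data = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
centers = np.array([[0.4, 0.4], [5.6, 5.6]])
frame_indices = closest_frames_to_centers(data, centers)
```

The returned indices point into the stacked dataset, so relating them back to a specific system and frame requires whatever bookkeeping the dataloader keeps.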
- reduce_dimensionality()
Performs dimensionality reduction using decomposer of choice.
- Returns:
None
- Return type:
None
- save_labels()
Generates a polars dataframe containing system, frame and cluster label assignments and saves to a parquet file.
- Returns:
None
- Return type:
None
- class molecular_simulations.analysis.autocluster.Decomposition(algorithm, **kwargs)
Bases: object
Thin wrapper for various dimensionality reduction algorithms. Uses scikit-learn style methods like fit and fit_transform.
- Parameters:
algorithm (str) – Which algorithm to use from PCA, TICA and UMAP. Currently only PCA is supported.
kwargs – algorithm-specific kwargs to inject into the decomposer.
- fit(X)
Fits the decomposer with data.
- Parameters:
X (np.ndarray) – Array of input data.
- Returns:
None
- Return type:
None
- fit_transform(X)
Fits the decomposer with data and returns the reduced dimension data.
- Parameters:
X (np.ndarray) – Array of input data.
- Returns:
Reduced dimension data.
- Return type:
(np.ndarray)
- transform(X)
Returns the reduced dimension data from a decomposer which has already been fit.
- Parameters:
X (np.ndarray) – Array of input data.
- Returns:
Reduced dimension data.
- Return type:
(np.ndarray)
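The fit / transform / fit_transform contract described above can be illustrated with a small stand-in class. This is a sketch, not the Decomposition wrapper itself: `PCASketch` is a hypothetical name, and PCA is computed here directly with a NumPy SVD rather than through whichever backend the module delegates to.

```python
import numpy as np


class PCASketch:
    """Minimal PCA with scikit-learn style methods (illustrative only)."""

    def __init__(self, n_components: int = 2):
        self.n_components = n_components
        self.mean_ = None
        self.components_ = None

    def fit(self, X: np.ndarray) -> None:
        X = np.asarray(X, dtype=float)
        # Center the data, then take the leading right singular vectors.
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]

    def transform(self, X: np.ndarray) -> np.ndarray:
        if self.components_ is None:
            raise RuntimeError("call fit() before transform()")
        # Project centered data onto the fitted components.
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        self.fit(X)
        return self.transform(X)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
reduced = PCASketch(n_components=2).fit_transform(X)
```

Keeping fit separate from transform lets a decomposer fitted on one dataset be reused to project new data into the same reduced space.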
- class molecular_simulations.analysis.autocluster.GenericDataloader(data_files)
Bases: object
Loads generic data stored in numpy arrays and stores the full dataset. Input arrays may have different numbers of rows but must all have the same number of columns.
- Parameters:
data_files (list[PathLike]) – List of paths to input data files.
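The row/column constraint can be made concrete with a short sketch. The function name `load_and_stack` is hypothetical, and it assumes each file holds a 2D array saved with `np.save`; how the actual dataloader stores per-file bookkeeping is not documented here.

```python
from pathlib import Path

import numpy as np


def load_and_stack(data_files):
    """Load .npy files and stack them row-wise into one dataset.

    Hypothetical helper; assumes every file contains a 2D array.
    Returns the stacked array and the number of rows contributed by each file.
    """
    arrays = [np.load(Path(f)) for f in data_files]
    # Row counts may differ between files, but column counts must agree.
    n_cols = {a.shape[1] for a in arrays}
    if len(n_cols) != 1:
        raise ValueError(f"inconsistent number of columns: {sorted(n_cols)}")
    row_counts = [a.shape[0] for a in arrays]
    return np.vstack(arrays), row_counts
```

Returning the per-file row counts is one way to recover which file and frame a stacked row came from.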
- class molecular_simulations.analysis.autocluster.PeriodicDataloader(data_files)
Bases: GenericDataloader
Decomposes periodic data using sin and cos, returning double the features.
- remove_periodicity(arr)
Removes periodicity from each feature using sin and cos. Each column is expanded into two such that the indices become i -> 2*i, 2*i + 1.
- Parameters:
arr (np.ndarray) – Data to perform decomposition on.
- Returns:
New array of shape (arr.shape[0], arr.shape[1] * 2).
- Return type:
(np.ndarray)
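The column expansion described for remove_periodicity can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated in the source: that the input is in radians, and that sin goes to column 2*i with cos in column 2*i + 1 (the order could equally be reversed).

```python
import numpy as np


def remove_periodicity_sketch(arr: np.ndarray) -> np.ndarray:
    """Expand each periodic feature into a (sin, cos) pair of columns.

    Assumes radians and sin-before-cos ordering; both are illustrative choices.
    """
    arr = np.asarray(arr, dtype=float)
    out = np.empty((arr.shape[0], arr.shape[1] * 2))
    out[:, 0::2] = np.sin(arr)  # column i -> column 2*i
    out[:, 1::2] = np.cos(arr)  # column i -> column 2*i + 1
    return out


angles = np.array([[0.0, np.pi / 2]])
expanded = remove_periodicity_sketch(angles)
```

This encoding places angles that differ by a full period at the same point, so features such as dihedral angles near -π and π are treated as neighbors during clustering.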