molecular_simulations.analysis.autocluster module

class molecular_simulations.analysis.autocluster.AutoKMeans(data_directory, pattern='', dataloader=<class 'molecular_simulations.analysis.autocluster.GenericDataloader'>, max_clusters=10, stride=1, reduction_algorithm='PCA', reduction_kws={'n_components': 2})

Bases: object

Performs automatic KMeans++ clustering, including dimensionality reduction of the feature space.

Parameters:
  • data_directory (PathLike) – Directory where data files can be found.

  • pattern (str) – Optional filename pattern, used with glob, to select a subset of the npy files.

  • dataloader (Type[_T]) – Defaults to GenericDataloader. Which dataloader to use.

  • max_clusters (int) – Defaults to 10. The maximum number of clusters to test during parameter sweep.

  • stride (int) – Defaults to 1. Linear stride over the number of clusters during the parameter sweep. Helps avoid testing too many values when the number of clusters is high.

  • reduction_algorithm (str) – Defaults to PCA. Which dimensionality reduction algorithm to use. Currently only PCA is supported.

  • reduction_kws (dict[str, Any]) – Defaults to {'n_components': 2} for PCA. kwargs for supplied reduction_algorithm.

map_centers_to_frames()

Finds and stores the data point which lies closest to the cluster center for each cluster.

Return type:

None

Returns:

None
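
The nearest-point lookup described above can be sketched with NumPy. This is only an illustration, assuming Euclidean distance in the reduced feature space; the function name closest_frames and its arguments are hypothetical and not part of this module.

```python
import numpy as np


def closest_frames(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Return, for each center, the row index of the nearest point (hypothetical helper)."""
    # Pairwise differences, shape (n_centers, n_points, n_features).
    diffs = centers[:, None, :] - points[None, :, :]
    # Euclidean distances, shape (n_centers, n_points).
    dists = np.linalg.norm(diffs, axis=-1)
    # Index of the closest point for every center.
    return dists.argmin(axis=1)


points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centers = np.array([[0.4, 0.4], [5.6, 5.0]])
frame_indices = closest_frames(points, centers)
```

Because the full distance matrix is held in memory, a chunked or tree-based search may be preferable for very long trajectories.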

reduce_dimensionality()

Performs dimensionality reduction using decomposer of choice.

Return type:

None

Returns:

None

run()

Runs the automated clustering workflow.

Return type:

None

Returns:

None

save_centers()

Saves the cluster centers to a JSON file.

Return type:

None

Returns:

None
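
NumPy arrays are not directly JSON-serializable, so centers typically need converting to plain lists first. The sketch below shows one way to do that; the file name and the label-to-coordinates layout are assumptions for illustration, not the format this method actually writes.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

centers = np.array([[0.1, 0.2], [3.0, 4.0]])

# Assumed layout: cluster label (as a string key) -> list of coordinates.
payload = {str(i): center.tolist() for i, center in enumerate(centers)}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "centers.json"  # hypothetical file name
    out_file.write_text(json.dumps(payload, indent=2))
    # Read the file back to confirm the round trip.
    restored = json.loads(out_file.read_text())
```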

save_labels()

Generates a polars dataframe containing system, frame and cluster label assignments and saves it to a parquet file.

Return type:

None

Returns:

None

sweep_n_clusters(n_clusters)

Uses the silhouette score to perform a parameter sweep over the number of clusters. Stores the cluster centers for the best-performing parameterization.

Parameters:

n_clusters (list[int]) – List of cluster counts to test.

Return type:

None

Returns:

None
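
A silhouette-based sweep of this kind can be sketched with scikit-learn as follows. This is a minimal illustration rather than this method's implementation: the function name sweep, the synthetic data, and the way the candidate list is built from max_clusters and stride are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep(X: np.ndarray, n_clusters: list[int], seed: int = 0):
    """Return (best_k, best_centers, scores) over the candidate cluster counts."""
    best_k, best_score, best_centers = None, -np.inf, None
    scores = {}
    for k in n_clusters:
        # The silhouette score is only defined for 2 <= k <= n_samples - 1.
        model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
        if scores[k] > best_score:
            best_k, best_score, best_centers = k, scores[k], model.cluster_centers_
    return best_k, best_centers, scores


# Three well-separated synthetic blobs in two dimensions.
rng = np.random.default_rng(0)
blobs = [rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])]
X = np.vstack(blobs)

# Assumed construction of the candidates from max_clusters=6 and stride=1.
candidates = list(range(2, 6 + 1, 1))
best_k, best_centers, scores = sweep(X, candidates)
```

A larger stride shortens the candidate list, which reduces the number of KMeans fits at the cost of a coarser search.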

class molecular_simulations.analysis.autocluster.Decomposition(algorithm, **kwargs)

Bases: object

Thin wrapper for various dimensionality reduction algorithms. Uses scikit-learn style methods like fit and fit_transform.

Parameters:
  • algorithm (str) – Which algorithm to use, out of PCA, TICA and UMAP. Currently only PCA is supported.

  • kwargs – algorithm-specific kwargs to inject into the decomposer.

fit(X)

Fits the decomposer with data.

Parameters:

X (np.ndarray) – Array of input data.

Return type:

None

Returns:

None

fit_transform(X)

Fits the decomposer with data and returns the reduced dimension data.

Parameters:

X (np.ndarray) – Array of input data.

Returns:

Reduced dimension data.

Return type:

(np.ndarray)

transform(X)

Returns the reduced dimension data from a decomposer which has already been fit.

Parameters:

X (np.ndarray) – Array of input data.

Returns:

Reduced dimension data.

Return type:

(np.ndarray)
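
The wrapper pattern described for this class can be sketched as below. ThinDecomposition is a hypothetical stand-in written for illustration, assuming scikit-learn's PCA as the only registered backend; it is not the source of this class.

```python
import numpy as np
from sklearn.decomposition import PCA


class ThinDecomposition:
    """Hypothetical wrapper that forwards calls to a scikit-learn style decomposer."""

    # Registry of supported algorithms; only PCA in this sketch.
    _algorithms = {"PCA": PCA}

    def __init__(self, algorithm: str, **kwargs):
        if algorithm not in self._algorithms:
            raise ValueError(f"Unsupported algorithm: {algorithm!r}")
        # Algorithm-specific kwargs are passed straight to the decomposer.
        self.decomposer = self._algorithms[algorithm](**kwargs)

    def fit(self, X: np.ndarray) -> None:
        self.decomposer.fit(X)

    def transform(self, X: np.ndarray) -> np.ndarray:
        return self.decomposer.transform(X)

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.decomposer.fit_transform(X)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

decomp = ThinDecomposition("PCA", n_components=2)
decomp.fit(X)                  # learn the projection
reduced = decomp.transform(X)  # apply it to the same (or new) data
```

Keeping fit and transform separate allows a projection learned on one dataset to be reused on data that was not part of the fit.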

class molecular_simulations.analysis.autocluster.GenericDataloader(data_files)

Bases: object

Loads any generic data stored in numpy arrays and stores the full dataset. Capable of loading data with variable row lengths but must be consistent in the columnar dimension.

Parameters:

data_files (list[PathLike]) – List of paths to input data files.

property data: ndarray

Returns: (np.ndarray): Internal data array.

load_data()

Lumps data into one large array.

Return type:

None

Returns:

None
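
Lumping files with different row counts but the same number of columns can be sketched with NumPy as follows. The file names and the glob pattern are made up for illustration, and stacking along the first axis is an assumption about how the arrays are combined.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # Two hypothetical systems with different numbers of frames (rows)
    # but the same number of features (columns).
    np.save(tmp_path / "sys_a.npy", np.zeros((3, 4)))
    np.save(tmp_path / "sys_b.npy", np.ones((5, 4)))

    files = sorted(tmp_path.glob("*.npy"))
    arrays = [np.load(f) for f in files]

# Per-file shapes, in the order the files were provided.
shapes = [a.shape for a in arrays]
# Stack along the row axis; this fails if the column counts differ.
data = np.vstack(arrays)
```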

property shape: tuple[int]

Returns: (tuple[int]): Shape of each individual data file, or if they have different shapes, the shape of each based on the order they were provided to this class.

class molecular_simulations.analysis.autocluster.PeriodicDataloader(data_files)

Bases: GenericDataloader

Decomposes periodic data using sin and cos, returning double the features.

Parameters:

data_files (list[Path | str]) – List of paths to input data files.

load_data()

Loads the input files of periodic data.

Return type:

None

Returns:

None

remove_periodicity(arr)

Removes periodicity from each feature using sin and cos. Each column i is expanded into two columns, at indices 2*i and 2*i + 1.

Parameters:

arr (np.ndarray) – Data to perform decomposition on.

Returns:

New array of shape (arr.shape[0], arr.shape[1] * 2).

Return type:

(np.ndarray)
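
The column expansion can be sketched with NumPy as below. The function name sin_cos_expand is hypothetical, and two details are assumptions not stated in this reference: the input is taken to be in radians, and sin is placed at index 2*i with cos at 2*i + 1.

```python
import numpy as np


def sin_cos_expand(arr: np.ndarray) -> np.ndarray:
    """Map each periodic column i to (sin, cos) at columns 2*i and 2*i + 1."""
    out = np.empty((arr.shape[0], arr.shape[1] * 2), dtype=float)
    out[:, 0::2] = np.sin(arr)  # assumed ordering: sin in even columns
    out[:, 1::2] = np.cos(arr)  # assumed ordering: cos in odd columns
    return out


# -pi and pi are the same angle, so their encodings should coincide.
angles = np.array([[-np.pi, 0.0], [np.pi, np.pi / 2]])
expanded = sin_cos_expand(angles)
```

The encoding places angles on the unit circle, so values on either side of the periodic boundary end up close together, which is what distance-based methods such as KMeans need.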