models.coherencemodel – Topic coherence pipeline

Module for calculating topic coherence in Python. This is an implementation of the four-stage topic coherence pipeline from the paper [1]. The four stages are:
Segmentation -> Probability Estimation -> Confirmation Measure -> Aggregation.
Implementing this pipeline lets the user, in essence, "make" a coherence measure of their choice by selecting a method for each stage of the pipeline.
[1] Michael Roeder, Andreas Both and Alexander Hinneburg. Exploring the space of topic coherence measures. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
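The four stages above can be sketched in plain Python. The snippet below is a toy illustration only, not the gensim implementation: it uses pair segmentation, document-frequency probability estimation, an epsilon-smoothed UMass-like log-conditional-probability confirmation, and arithmetic-mean aggregation. All data and names are made up.

```python
from itertools import combinations
from math import log

# Toy corpus: each document reduced to its set of tokens.
documents = [
    {"human", "computer", "interface"},
    {"computer", "system", "interface"},
    {"graph", "trees", "minors"},
]
topic = ["human", "computer", "interface"]

# 1. Segmentation: pair up the topic's top words.
segments = list(combinations(topic, 2))

# 2. Probability estimation: fraction of documents containing all given words.
def p(*words):
    return sum(all(w in doc for w in words) for doc in documents) / len(documents)

# 3. Confirmation measure: a simplified UMass-like log conditional probability,
#    smoothed with a small epsilon to avoid log(0).
eps = 1e-12
confirmations = [log((p(w1, w2) + eps) / p(w2)) for w1, w2 in segments]

# 4. Aggregation: arithmetic mean of the per-pair confirmations.
coherence = sum(confirmations) / len(confirmations)
```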
gensim.models.coherencemodel.CoherenceModel(model=None, topics=None, texts=None, corpus=None, dictionary=None, window_size=None, coherence='c_v', topn=10, processes=-1)

Bases: gensim.interfaces.TransformationABC
Objects of this class allow for building and maintaining a model for topic coherence.
The main method is get_coherence(), which returns the overall topic coherence. Pipeline phases can also be executed individually via segment_topics(), estimate_probabilities(), get_coherence_per_topic(), and aggregate_measures().
One way of using this feature is by providing a trained topic model. A dictionary has to be provided explicitly if the model does not already contain one:
cm = CoherenceModel(model=tm, corpus=corpus, coherence='u_mass') # tm is the trained topic model
cm.get_coherence()
Another way of using this feature is by providing tokenized topics, for example:
topics = [['human', 'computer', 'system', 'interface'],
['graph', 'minors', 'trees', 'eps']]
cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass') # note that a dictionary has to be provided.
cm.get_coherence()
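For readers unfamiliar with gensim's input formats, here is a plain-Python sketch of the shapes that `dictionary` and `corpus` take. In practice gensim builds these with gensim.corpora.Dictionary and its doc2bow method; this snippet only mimics the resulting structure.

```python
from collections import Counter

texts = [
    ["human", "computer", "system", "interface"],
    ["graph", "minors", "trees", "eps"],
    ["human", "interface", "graph"],
]

# dictionary: mapping from token to integer id.
dictionary = {}
for doc in texts:
    for token in doc:
        dictionary.setdefault(token, len(dictionary))

# corpus: each document as a sorted list of (token_id, count) pairs,
# i.e. the bag-of-words representation.
corpus = [sorted(Counter(dictionary[t] for t in doc).items()) for doc in texts]
```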
Model persistence is achieved via the load and save methods.
aggregate_measures(topic_coherences)
Aggregate the individual topic coherence measures using the pipeline's aggregation function.
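For gensim's built-in coherence measures the default aggregation function is the arithmetic mean; a one-line sketch (the per-topic numbers below are made up):

```python
# Hypothetical per-topic coherences, e.g. as returned by get_coherence_per_topic().
topic_coherences = [-0.31, -0.87, -0.52]

# Arithmetic-mean aggregation, the default for the built-in measures.
aggregated = sum(topic_coherences) / len(topic_coherences)
```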
estimate_probabilities(segmented_topics=None)
Accumulate word occurrences and co-occurrences from the texts or corpus using the optimal method for the chosen coherence metric. This operation may take quite some time for the sliding-window-based coherence methods.
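The sliding-window accumulation behind measures such as c_v can be sketched as follows. This is a simplified, single-process illustration; gensim's accumulators are more elaborate and can run in parallel across `processes` workers.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(texts, window_size):
    """Count word occurrences and pairwise co-occurrences over sliding windows."""
    occurrences, cooccurrences = Counter(), Counter()
    for tokens in texts:
        # Each contiguous window of up to `window_size` tokens acts as one
        # "virtual document".
        for start in range(max(1, len(tokens) - window_size + 1)):
            window = set(tokens[start:start + window_size])
            occurrences.update(window)
            cooccurrences.update(frozenset(p) for p in combinations(sorted(window), 2))
    return occurrences, cooccurrences

occ, cooc = count_cooccurrences([["a", "b", "c", "a"]], window_size=2)
```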
get_coherence()
Return coherence value based on pipeline parameters.
get_coherence_per_topic(segmented_topics=None)
Return list of coherence values for each topic based on pipeline parameters.
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap; load large arrays as normal objects.
If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set; load will raise an IOError if mmap is requested for a compressed file.
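The mmap mechanism here is the same one numpy exposes directly; a minimal sketch of the idea, using numpy alone rather than gensim's save/load wrapper:

```python
import os
import tempfile

import numpy as np

# A large array saved to disk...
path = os.path.join(tempfile.mkdtemp(), "big_array.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

# ...can be loaded as a read-only memory map: pages are fetched from disk
# lazily instead of reading the whole array into RAM up front.
arr = np.load(path, mmap_mode="r")
value = float(arr[123456])
```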
measure

model

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)
Save the object to file (also see load).
fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.
pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
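The effect of `ignore` can be mimicked with plain pickle: attributes named in `ignore` are dropped at save time and come back as None after load. A sketch under that assumption (the helper, attribute names, and data below are made up; gensim's SaveLoad base class implements the real equivalent for object attributes):

```python
import pickle

def save_state(attrs, ignore=frozenset()):
    """Pickle a dict of attributes, replacing ignored names with None."""
    state = {k: (None if k in ignore else v) for k, v in attrs.items()}
    # protocol=2 mirrors the pickle_protocol default noted above.
    return pickle.dumps(state, protocol=2)

attrs = {"weights": [0.1, 0.2], "cache": {"expensive": "state"}}
restored = pickle.loads(save_state(attrs, ignore=frozenset(["cache"])))
```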
segment_topics()

topics