corpora.malletcorpus – Corpus in Mallet format of List-Of-Words.
gensim.corpora.malletcorpus.MalletCorpus(fname, id2word=None, metadata=False)
Bases: gensim.corpora.lowcorpus.LowCorpus
Quoting http://mallet.cs.umass.edu/import.php:
One file, one instance per line. Assume the data is in the following format:
[URL] [language] [text of the page…]
Note that language/label is not considered in Gensim.
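As a rough illustration of the line layout quoted above, the following plain-Python sketch splits one line into its three fields. The helper name parse_mallet_line is hypothetical and this is not Gensim's own parser; it only assumes that fields are separated by whitespace.

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, language, words).

    Hypothetical helper for illustration only; assumes whitespace-separated
    fields in the order: document id (URL), language, then the text tokens.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected at least a document id and a language field")
    doc_id, language, words = parts[0], parts[1], parts[2:]
    return doc_id, language, words


doc_id, language, words = parse_mallet_line("doc1 en human computer interface\n")
```

Here the language field is parsed but could simply be dropped afterwards, which mirrors the note that Gensim does not consider it.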
docbyoffset(offset)
Return the document stored at file position offset.
id2word
line2doc(line)
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.
save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)
Save a corpus in the Mallet format.
The document id will be generated by enumerating the corpus. That is, it will range from 0 to one less than the number of documents in the corpus.
Since Mallet has a language field in the format, this defaults to the string ‘__unknown__’. If the language needs to be saved, post-processing will be required.
This function is automatically called by MalletCorpus.serialize; don’t call it directly, call serialize instead.
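To make the output layout concrete, here is a minimal plain-Python sketch of writing documents in this format, with ids taken from enumeration and the language field fixed to '__unknown__'. The function write_mallet is hypothetical and is not Gensim's implementation; as a simplifying assumption it takes lists of tokens rather than a bag-of-words corpus with an id2word mapping.

```python
import io


def write_mallet(stream, documents, language="__unknown__"):
    """Write tokenized documents as Mallet lines; return how many were written.

    Hypothetical helper: each document is assumed to be a list of string
    tokens, and the document id is simply its position in the input.
    """
    count = 0
    for doc_id, words in enumerate(documents):
        stream.write("{} {} {}\n".format(doc_id, language, " ".join(words)))
        count += 1
    return count


buffer = io.StringIO()
written = write_mallet(buffer, [["human", "computer"], ["graph", "trees"]])
mallet_text = buffer.getvalue()
```

Replacing the placeholder with a real language code would be the post-processing step mentioned above.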
serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to file index_fname (or fname.index if index_fname is not set).
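This relies on the underlying corpus class serializer providing (in addition to standard iteration) a save_corpus method that returns a sequence of byte offsets, one for each saved document, and a docbyoffset(offset) method that returns the document positioned at offset bytes within the persistent storage (file).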
Example:
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
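The idea behind the random access in this example, recording a byte offset per document and later seeking to it, can be sketched with the standard library alone. The names save_with_offsets and doc_by_offset are hypothetical, and the sketch keeps the offsets in a list instead of persisting them to an index file; it is not Gensim's implementation.

```python
import os
import tempfile


def save_with_offsets(path, lines):
    """Write each line to path and return the byte offset where it starts."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position before this document
            fout.write(line.encode("utf8") + b"\n")
    return offsets


def doc_by_offset(path, offset):
    """Seek to a recorded byte offset and read back one document line."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").rstrip("\n")


lines = [
    "0 __unknown__ human computer",
    "1 __unknown__ graph trees",
    "2 __unknown__ graph minors survey",
]
path = os.path.join(tempfile.mkdtemp(), "corpus.mallet")
offsets = save_with_offsets(path, lines)
second = doc_by_offset(path, offsets[1])
```

With the offsets stored alongside the corpus file, fetching document n costs one seek and one line read rather than a scan from the start of the file.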