Metadata-Version: 2.1
Name: prefixspan
Version: 0.3.4
Summary: PrefixSpan and BIDE in Python 3
Home-page: https://github.com/chuanconggao/PrefixSpan-py
Author: Chuancong Gao
Author-email: chuancong@gmail.com
License: MIT
Download-URL: https://github.com/chuanconggao/PrefixSpan-py/tarball/0.3.4
Description: [![PyPi version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
        [![PyPi pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
        [![PyPi license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
        
        The shortest yet efficient implementation of famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/) in Python 3, with less than 20 lines in core part (scan and extend).
        
        - You can also try the Scala [version](https://github.com/chuanconggao/PrefixSpan-scala).
        
        Also includes the implementation of famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986), which extends the framework of PrefixSpan algorithm.
        
        - A pattern is closed if there is no super-pattern with the same frequency.
        
        - BIDE is much faster than PrefixSpan, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.
        
        # Features
        
        Outputs traditional single-item sequential patterns, where gaps are allowed between items.
        
        - Mining top-k patterns is supported, with respective optimizations on efficiency.
        
        - You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.
        
        - Custom key function and custom filter function can be applied.
        
        # Installation
        
        This package is available on PyPi. Just use `pip3 install -U prefixspan` to install it.
        
        # CLI Usage
        
        You can simply use the algorithms on terminal.
        
        ``` text
        Usage:
            prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]
        
            prefixspan-cli --help
        
        
        Options:
            --closed           Return only closed patterns.
            
            --key=<key>        Custom key function. [default: ]
                               Must be a Python function in form of "lambda patt, matches: ...", returning a float value.
                               Returns a float value.
            --bound=<bound>    The upper-bound function of the respective key function. [default: ]
                               When unspecified, the same key function is used.
                               Must be a Python function in form of "lambda patt, matches: ...", returning a float value.
                               Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
                               Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).
                               
            --filter=<filter>  Custom filter function. [default: ]
                               Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.
                               
            --minlen=<minlen>  Minimum length of patterns. [default: 1]
            --maxlen=<maxlen>  Maximum length of patterns. [default: 1000]
        ```
        
        * Sequences are read from standard input. Each sequence is integers separated by space, like this example:
        
        ``` text
        cat test.dat
        
        0 1 2 3 4
        1 1 1 3 4
        2 1 2 2 0
        1 1 1 2 2
        ```
        
        * The patterns and their respective frequencies are printed to standard output.
        
        ``` text
        prefixspan-cli frequent 2 test.dat
        
        0 : 2
        1 : 4
        1 2 : 3
        1 2 2 : 2
        1 3 : 2
        1 3 4 : 2
        1 4 : 2
        1 1 : 2
        1 1 1 : 2
        2 : 3
        2 2 : 2
        3 : 2
        3 4 : 2
        4 : 2
        ```
        
        * As you can see, the closed patterns are much more compact.
        
        ``` text
        prefixspan-cli frequent 2 --closed test.dat
        
        0 : 2
        1 : 4
        1 2 : 3
        1 2 2 : 2
        1 3 4 : 2
        1 1 1 : 2
        ```
        
        # API Usage
        
        Alternatively, you can use the algorithms via API.
        
        ``` python
        from prefixspan import PrefixSpan
        
        db = [
            [0, 1, 2, 3, 4],
            [1, 1, 1, 3, 4],
            [2, 1, 2, 2, 0],
            [1, 1, 1, 2, 2],
        ]
        
        ps = PrefixSpan(db)
        
        
        print(ps.frequent(2))
        # [(2, [0]),
        #  (4, [1]),
        #  (3, [1, 2]),
        #  (2, [1, 2, 2]),
        #  (2, [1, 3]),
        #  (2, [1, 3, 4]),
        #  (2, [1, 4]),
        #  (2, [1, 1]),
        #  (2, [1, 1, 1]),
        #  (3, [2]),
        #  (2, [2, 2]),
        #  (2, [3]),
        #  (2, [3, 4]),
        #  (2, [4])]
        
        print(ps.topk(5))
        # [(4, [1]),
        #  (3, [2]),
        #  (3, [1, 2]),
        #  (2, [1, 3]),
        #  (2, [1, 3, 4])]
        
        
        print(ps.frequent(2, closed=True))
        # [(2, [0]),
        #  (4, [1]),
        #  (3, [1, 2]),
        #  (2, [1, 2, 2]),
        #  (2, [1, 3, 4]),
        #  (2, [1, 1, 1])]
        
        print(ps.topk(5, closed=True))
        # [(4, [1]),
        #  (3, [1, 2]),
        #  (2, [1, 1, 1]),
        #  (2, [1, 2, 2]),
        #  (2, [1, 3, 4])]
        ```
        
        # Custom Key Function
        
        For both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
            
        - In default, `len(matches)` is used denoting the support of current pattern.
        
        - Alternatively, any key function can be used. As an example, `sum(len(db[i]) for i in matches)` can be used to find the satisfying patterns according to the number of matched items.
        
        - For efficiency, an anti-monotone upper-bound function must also be specified for pruning.
        
            - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
        
        ``` python
        print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
        # [(20, [1]),
        #  (15, [2]),
        #  (15, [1, 2]),
        #  (10, [1, 3]),
        #  (10, [1, 3, 4])]
        ```
        
        # Custom Filter Function
        
        For both frequent and top-k algorithms, a custom filter function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
        
        - In default, `True` is used denoting all the patterns.
        
        - Alternatively, any function can be used. As an example, `any(i == 0 for i, _ in matches)` can be used to find only the patterns covering the first sequence.
        
        ``` python
        print(ps.topk(5, filter=lambda patt, matches: any(i == 0 for i, _ in matches)))
        # [(4, [1]),
        #  (3, [1, 2]),
        #  (3, [2]),
        #  (2, [1, 3, 4]),
        #  (2, [1, 4])]
        ```
        
        # Tip
        
        I strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.
        
Keywords: data-mining,pattern-mining
Platform: UNKNOWN
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Requires-Python: >= 3
Description-Content-Type: text/markdown
