Metadata-Version: 1.0
Name: parsimonious
Version: 0.1
Summary: (Soon to be) the fastest pure-Python PEG parser I could muster
Home-page: https://github.com/erikrose/parsimonious
Author: Erik Rose
Author-email: erikrose@grinchcentral.com
License: MIT
Description: ============
        Parsimonious
        ============
        
        Parsimonious aims to be the fastest PEG parser written in pure Python. It was
        designed to undergird a MediaWiki parser that wouldn't take 5 seconds or a GB
        of RAM to do one page.
        
        Beyond speed, secondary goals include...
        
        * Frugal RAM use
        * Minimalistic, understandable, idiomatic Python code
        * Readable grammars
        * Extensible grammars
        * Complete test coverage
        * Separation of concerns. Some Python parsing kits mix recognition with
          instructions about how to turn the resulting tree into some kind of other
          representation. This is limiting when you want to do several different things
          with a tree: for example, render wiki markup to HTML *or* to text.
        * Good error reporting. I want the parser to work *with* me as I develop a
          grammar.
        
        
        Example
        =======
        
        Here's how to build a simple grammar::
        
            >>> from parsimonious.grammar import Grammar
            >>> grammar = Grammar(
            ...     """
            ...     bold_text  = bold_open text bold_close
            ...     text       = ~"[A-Z 0-9]*"i
            ...     bold_open  = "(("
            ...     bold_close = "))"
            ...     """)
        
        You can have forward references or anything you want; it's all taken care of by
        the grammar compiler. The first rule is taken to be the default start symbol,
        but you can override that.
        
        Next, let's parse something and get an abstract syntax tree::
        
            >>> grammar.parse('((bold stuff))')
            <Node called "bold_text" matching "((bold stuff))">
                <Node called "bold_open" matching "((">
                <RegexNode called "text" matching "bold stuff">
                <Node called "bold_close" matching "))">
        
        You'd typically then use a ``nodes.NodeVisitor`` subclass (see below) to walk
        the tree and do something useful with it.
        
        But first, lots of scary warnings...
        
        
        Status
        ======
        
        Take it easy, because 0.1 is a useable but rough preview release.
        
        * Everything that exists works. Test coverage is good.
        * I don't plan on making any backward-incompatible changes to the grammar DSL
          in the future, so you can write grammars without fear.
        * It may be slow and use a lot of RAM; I haven't measured either yet. However,
          I have several macro- and micro-optimizations in mind.
        * Error reporting and debugging are nonexistent, though ``repr`` methods of
          things are often helpful and informative. I have big things planned here.
        * Grammar extensibility story is underdeveloped at the moment. You should be
          able to extend a grammar by simply concatening more rules onto its textual
          (DSL) representation; later rules of the same name should override previous
          ones. However, this is untested and may not be the final story.
        * I'm not in love with the ``Grammar`` API, so it may change.
          ``ExpressionFlattener`` is probably going to move or get absorbed by
          something else.
        * No Sphinx docs yet, but the docstrings are pretty good
        
        Next, I'll do some optimization and see if I can make Parsimonious worthy of
        its name. RAM use and better thought-out grammar extensibility come after that.
        
        
        A Little About PEG Parsers
        ==========================
        
        PEG parsers don't draw a distinction between lexing and parsing; everything's
        done at once. As a result, there is no lookahead limit, as there is with, for
        instance, Yacc. And, due to both of these properties, PEG grammars are easier
        to write: they're basically just EBNF. With caching, they take O(grammar size *
        text length) memory (though I plan to do better), but they run in O(text
        length) time.
        
        More Technically
        ----------------
        
        PEGs can describe a superset of LL(k) languages, any deterministic LR(k)
        language, and many others, including some that aren't context-free
        (http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what
        would be ambiguous languages if described in canonical EBNF. They do this by
        trading the ``|`` alternation operator for the ``/`` operator, which works the
        same except that it makes priority explicit: ``a / b / c`` first tries matching
        ``a``. If that fails, it tries ``b``, and, failing that, moves on to ``c``.
        Thus, ambiguity is resolved by always yielding the first successful recognition.
        
        
        Writing Grammars
        ================
        
        They're basically just extended EBNF syntax:
        
        ``"some literal"``
          Used to quote literals. Backslash escaping and Python conventions for "raw"
          and Unicode strings help support fiddly characters.
        space
          Sequences are made out of space- or tab-delimited things. ``a b c`` matches
          spots where those 3 terms appear in that order.
        ``a / b``
          Alternatives. The first to succeed of ``a / b / c`` wins.
        ``a & b``
          A lookahead assertion followed by a normal, consumed symbol. ``a & b & c``
          means that ``a``, ``b``, and ``c`` all have to match at the current position.
          ``c``, however, is the only bit that's actually consumed.
        ``thing?``
          An optional expression
        ``!thing``
          (Not implemented yet.) Matches if ``thing`` isn't found here. Doesn't consume
          any text.
        ``things*``
          Zero or more things
        ``things+``
          One or more things
        ``~r"regex"ilmsux``
          Regexes have ``~`` in front and are quoted like literals. Any flags follow
          the end quotes as single chars. Regexes are good for representing character
          classes (``[a-z0-9]``) and optimizing for speed. The downside is that they
          won't be able to take advantage of our fancy debugging, once we get that
          working. Ultimately, I'd like to deprecate explicit regexes and instead have
          Parsimonious build them dynamically out of simpler primitives.
        
        We might implement parentheses in the future for anonymous grouping. For now,
        just break up complex rules instead.
        
        We shouldn't need to represent Empty; the quantifiers should suffice.
        
        
        Optimizing Grammars
        ===================
        
        Don't repeat expressions. If you need a ``Regex('such-and-such')`` at some
        point in your grammar, don't type it twice; make it a rule of its own, and
        reference it from wherever you need it. You'll get the most out of the caching
        this way, since cache lookups are by expression object identity (for speed).
        Even if you have an expression that's very simple, not repeating it will save
        RAM, as there can, at worst, be a cached int for every char in the text you're
        parsing. But hmm, maybe I can identify repeated subexpressions automatically
        and factor that up while building the grammar....
        
        How much should you shove into one ``Regex``, versus how much should you break
        them up to not repeat yourself? That's a fine balance and worthy of
        benchmarking. More stuff jammed into a regex will execute faster, because it
        doesn't have to run any Python between pieces, but a broken-up one will give
        better cache performance if the individual pieces are re-used elsewhere. If the
        pieces of a regex aren't used anywhere else, by all means keep the whole thing
        together.
        
        Quantifiers: bring your ``?`` and ``*`` quantifiers up to the highest level you
        can. Otherwise, lower-level patterns could succeed but be empty and put a bunch
        of useless nodes in your tree that didn't really match anything.
        
        
        Dealing With Parse Trees
        ========================
        
        A parse tree has a node for each expression matched, even if it matched a
        zero-length string, like ``"thing"?`` might do.
        
        The ``NodeVisitor`` class provides an inversion of control framework for
        walking a tree and returning a new construct (tree, string, or whatever) based
        on it. For now, have a look at its docstrings for more detail. There's also a
        good example in ``grammar.DslVisitor``. Notice how we take advantage of nodes'
        iterability by using tuple unpacks in the formal parameter lists::
        
            def visit_or_term(self, or_term, (_, slash, term)):
                ...
        
        When something goes wrong in your visitor, you get a nice error like this::
        
            [normal traceback here...]
            VisitationException: 'Node' object has no attribute 'foo'
        
            Parse tree:
            <Node called "rules" matching "number = ~"[0-9]+"">  <-- *** We were here. ***
                <Node matching "number = ~"[0-9]+"">
                    <Node called "rule" matching "number = ~"[0-9]+"">
                        <Node matching "">
                        <Node called "label" matching "number">
                        <Node matching " ">
                            <Node called "_" matching " ">
                        <Node matching "=">
                        <Node matching " ">
                            <Node called "_" matching " ">
                        <Node called "rhs" matching "~"[0-9]+"">
                            <Node called "term" matching "~"[0-9]+"">
                                <Node called "atom" matching "~"[0-9]+"">
                                    <Node called "regex" matching "~"[0-9]+"">
                                        <Node matching "~">
                                        <Node called "literal" matching ""[0-9]+"">
                                        <Node matching "">
                        <Node matching "">
                        <Node called "eol" matching "
                        ">
                <Node matching "">
        
        The parse tree tacked onto the exception, and the node whose visitor method
        raised the error is pointed out.
        
        Why No Streaming Tree Processing?
        ---------------------------------
        
        Some have asked why we don't process the tree as we go, SAX-style. There are
        two main reasons:
        
        1. It wouldn't work. With a PEG parser, no parsing decision is final until the
           whole text is parsed. If we had to change a decision, we'd have to backtrack
           and redo the SAX-style interpretation as well, which would involve
           reconstituting part of the AST and quite possibly scuttling whatever you
           were doing with the streaming output. (Note that some bursty SAX-style
           processing may be possible in the future if we use cuts.)
        
        2. It interferes with the ability to derive multiple representations from the
           AST: for example, first HTML and then text from wiki markup.
        
        
        Future Directions
        =================
        
        DSL Changes
        -----------
        
        * Do we need a LookAhead? It might be slightly faster, but ``A Lookahead(B)``
          is equivalent to ``AB & A``, which, while more verbose, takes full advantage
          of packratting.
        * Maybe support left-recursive rules like PyMeta, if anybody cares.
        * The ability to mark certain nodes as undesired, so we don't bother
          constructing them and cluttering the tree with them. For example, we might
          only care to see the ``OneOf`` node in the final tree, not the boring
          Literals inside it::
        
            greeting = "hi" / "hello" / "bonjour"
        
          Perhaps we could express it like this::
        
            greeting = -"hi" / -"hello" / -"bonjour"
        
          On the other hand, parentheses for anonymous subexpressions could largely
          solve this problem--and in a more familiar way--if we implicitly omitted
          their nodes. (The exception would be subexpressions that you end up having to
          repeat several times in the grammar.) On the third hand, I don't really care
          to clutter grammar definitions up like this. It makes them less readable and
          conflates recognition with tree processing. I'll most likely just focus on
          making ``NodeVisitor`` subclasses as easy as possible to write.
        * Pijnu has a raft of tree manipulators. I don't think I want all of them, but
          a judicious subset might be nice. Don't get into mixing formatting with tree
          manipulation.
          https://github.com/erikrose/pijnu/blob/master/library/node.py#L333
        
        Optimizations
        -------------
        
        * Make RAM use almost constant by automatically inserting "cuts", as described
          in
          http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf.
          This would also improve error reporting, as we wouldn't backtrack out of
          everything informative before finally failing.
        * Think about having the user (optionally) provide some representative input
          along with a grammar. We can then profile against it, see which expressions
          are worth caching, and annotate the grammar. Perhaps there will even be
          positions at which a given expression is more worth caching. Or we could keep
          a count of how many times each cache entry has been used and evict the most
          useless ones as RAM use grows.
        * We could possibly compile the grammar into VM instructions, like in "A
          parsing machine for PEGs" by Medeiros.
        * If the recursion gets too deep in practice, use trampolining to dodge it.
        
        
        Version History
        ===============
        
        0.1
          * A rough but useable preview release
        
Keywords: parse,parser,parsing,peg,packrat,grammar,language
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.5
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: General
