CedarBackup3.filesystem
=======================

.. py:module:: CedarBackup3.filesystem

.. autoapi-nested-parse::

   Provides filesystem-related objects.

   :author: Kenneth J. Pronovici <pronovic@ieee.org>









Module Contents
---------------

.. py:data:: logger

.. py:class:: FilesystemList

   Bases: :py:obj:`list`


   Represents a list of filesystem items.

   This is a generic class that represents a list of filesystem items.  Callers
   can add individual files or directories to the list, or can recursively add
   the contents of a directory.  The class also allows for up-front exclusions
   in several forms (all files, all directories, all items matching a pattern,
   all items whose basename matches a pattern, or all directories containing a
   specific "ignore file").  Symbolic links are typically backed up
   non-recursively, i.e. the link to a directory is backed up, but not the
   contents of that link (we don't want to deal with recursive loops, etc.).

   The custom methods such as :any:`addFile` will only add items if they exist on
   the filesystem and do not match any exclusions that are already in place.
   However, since a FilesystemList is a subclass of Python's standard list
   class, callers can also add items to the list in the usual way, using
   methods like ``append()`` or ``insert()``.  No validations apply to items
   added to the list in this way; however, many list-manipulation methods deal
   "gracefully" with items that don't exist in the filesystem, often by
   ignoring them.

   Once a list has been created, callers can remove individual items from the
   list using standard methods like ``pop()`` or ``remove()`` or they can use
   custom methods to remove specific types of entries or entries which match a
   particular pattern.

   *Note:* Regular expression patterns that apply to paths are assumed to be
   bounded at front and back by the beginning and end of the string, i.e. they
   are treated as if they begin with ``^`` and end with ``$``.  This is true
   whether we are matching a complete path or a basename.
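
   As a minimal sketch of what this bounding means in practice, the following
   uses only the standard ``re`` module.  The helper name ``matches_bounded``
   is hypothetical and this is not the class's actual implementation; it only
   illustrates that a pattern must match the whole string, whether that
   string is a complete path or a basename.

   .. code-block:: python

      import os
      import re

      def matches_bounded(pattern, text):
          """Return True if ``pattern`` matches all of ``text``."""
          # Wrap the pattern so that it behaves as if it were written ^...$
          return re.match(r"^(?:%s)$" % pattern, text) is not None

      path = "/home/user/cache/data.tmp"
      assert matches_bounded(r".*\.tmp", path)        # the whole path matches
      assert not matches_bounded(r"\.tmp", path)      # a partial match is not enough
      assert matches_bounded(r".*\.tmp", os.path.basename(path))
      assert not matches_bounded(r"data", os.path.basename(path))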



   .. py:attribute:: excludeFiles
      :value: False



   .. py:attribute:: excludeLinks
      :value: False



   .. py:attribute:: excludeDirs
      :value: False



   .. py:attribute:: excludePaths
      :value: []



   .. py:attribute:: excludePatterns


   .. py:attribute:: excludeBasenamePatterns


   .. py:attribute:: ignoreFile
      :value: None



   .. py:method:: addFile(path)

      Adds a file to the list.

      The path must exist and must be a file or a link to an existing file.  It
      will be added to the list subject to any exclusions that are in place.

      :param path: File path to be added to the list
      :type path: String representing a path on disk

      :returns: Number of items added to the list

      :raises ValueError: If path is not a file or does not exist
      :raises ValueError: If the path could not be encoded properly



   .. py:method:: addDir(path)

      Adds a directory to the list.

      The path must exist and must be a directory or a link to an existing
      directory.  It will be added to the list subject to any exclusions that
      are in place.  The :any:`ignoreFile` does not apply to this method, only to
      :any:`addDirContents`.

      :param path: Directory path to be added to the list
      :type path: String representing a path on disk

      :returns: Number of items added to the list

      :raises ValueError: If path is not a directory or does not exist
      :raises ValueError: If the path could not be encoded properly



   .. py:method:: addDirContents(path, recursive=True, addSelf=True, linkDepth=0, dereference=False)

      Adds the contents of a directory to the list.

      The path must exist and must be a directory or a link to a directory.
      The contents of the directory (as well as the directory path itself) will
      be recursively added to the list, subject to any exclusions that are in
      place.  If you only want the directory and its immediate contents to be
      added, then pass in ``recursive=False``.

      *Note:* If a directory's absolute path matches an exclude pattern or path,
      or if the directory contains the configured ignore file, then the
      directory and all of its contents will be recursively excluded from the
      list.

      *Note:* If the passed-in directory happens to be a soft link, it will be
      recursed.  However, the linkDepth parameter controls whether any soft
      links *within* the directory will be recursed.  The link depth is the
      maximum depth of the tree at which soft links should be followed.  So, a
      depth of 0 does not follow any soft links, a depth of 1 follows only
      links within the passed-in directory, a depth of 2 follows the links at
      the next level down, etc.

      *Note:* Any invalid soft links (i.e.  soft links that point to
      non-existent items) will be silently ignored.

      *Note:* The :any:`excludeDirs` flag only controls whether any given directory
      path itself is added to the list once it has been discovered.  It does
      *not* modify any behavior related to directory recursion.

      *Note:* If you call this method *on a link to a directory* that link will
      never be dereferenced (it may, however, be followed).
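
      The link depth rules above can be sketched with the standard library.
      The helper ``collect`` is hypothetical: it ignores exclusions and the
      ``dereference`` flag, and is meant only to show how the depth might be
      consumed while descending, not how this method is implemented.

      .. code-block:: python

         import os

         def collect(path, link_depth=0):
             """Return ``path`` plus everything below it.

             Links to directories are only followed while ``link_depth`` is
             greater than zero; otherwise the link itself is listed.
             """
             found = [path]
             for name in sorted(os.listdir(path)):
                 entry = os.path.join(path, name)
                 if os.path.islink(entry):
                     if os.path.isdir(entry) and link_depth > 0:
                         found.extend(collect(entry, link_depth - 1))
                     elif os.path.exists(entry):
                         found.append(entry)   # invalid links are skipped
                 elif os.path.isdir(entry):
                     # One level further down, one less level of link following
                     found.extend(collect(entry, link_depth - 1))
                 else:
                     found.append(entry)
             return found

      With a depth of 0 a link to a directory shows up as a single entry; with
      a depth of 1 the entries beneath a link in the top-level directory are
      listed as well.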

      :param path: Directory path whose contents should be added to the list
      :type path: String representing a path on disk
      :param recursive: Indicates whether directory contents should be added recursively
      :type recursive: Boolean value
      :param addSelf: Indicates whether the directory itself should be added to the list
      :type addSelf: Boolean value
      :param linkDepth: Maximum depth of the tree at which soft links should be followed, zero means not to follow
      :type linkDepth: Integer value
      :param dereference: Indicates whether soft links, if followed, should be dereferenced
      :type dereference: Boolean value

      :returns: Number of items recursively added to the list

      :raises ValueError: If path is not a directory or does not exist
      :raises ValueError: If the path could not be encoded properly



   .. py:method:: removeFiles(pattern=None)

      Removes file entries from the list.

      If ``pattern`` is not passed in or is ``None``, then all file entries will
      be removed from the list.  Otherwise, only those file entries matching
      the pattern will be removed.  Any entry which does not exist on disk
      will be ignored (use :any:`removeInvalid` to purge those entries).

      This method might be fairly slow for large lists, since it must check the
      type of each item in the list.  If you know ahead of time that you want
      to exclude all files, then you will be better off setting :any:`excludeFiles`
      to ``True`` before adding items to the list.

      :param pattern: Regular expression pattern representing entries to remove

      :returns: Number of entries removed

      :raises ValueError: If the passed-in pattern is not a valid regular expression



   .. py:method:: removeDirs(pattern=None)

      Removes directory entries from the list.

      If ``pattern`` is not passed in or is ``None``, then all directory entries
      will be removed from the list.  Otherwise, only those directory entries
      matching the pattern will be removed.  Any entry which does not exist on
      disk will be ignored (use :any:`removeInvalid` to purge those entries).

      This method might be fairly slow for large lists, since it must check the
      type of each item in the list.  If you know ahead of time that you want
      to exclude all directories, then you will be better off setting
      :any:`excludeDirs` to ``True`` before adding items to the list (note that this
      will not prevent you from recursively adding the *contents* of
      directories).

      :param pattern: Regular expression pattern representing entries to remove

      :returns: Number of entries removed

      :raises ValueError: If the passed-in pattern is not a valid regular expression



   .. py:method:: removeLinks(pattern=None)

      Removes soft link entries from the list.

      If ``pattern`` is not passed in or is ``None``, then all soft link entries
      will be removed from the list.  Otherwise, only those soft link entries
      matching the pattern will be removed.  Any entry which does not exist on
      disk will be ignored (use :any:`removeInvalid` to purge those entries).

      This method might be fairly slow for large lists, since it must check the
      type of each item in the list.  If you know ahead of time that you want
      to exclude all soft links, then you will be better off setting
      :any:`excludeLinks` to ``True`` before adding items to the list.

      :param pattern: Regular expression pattern representing entries to remove

      :returns: Number of entries removed

      :raises ValueError: If the passed-in pattern is not a valid regular expression



   .. py:method:: removeMatch(pattern)

      Removes from the list all entries matching a pattern.

      This method removes from the list all entries which match the passed in
      ``pattern``.  Since there is no need to check the type of each entry, it
      is faster to call this method than to call the :any:`removeFiles`,
      :any:`removeDirs` or :any:`removeLinks` methods individually.  If you know which
      patterns you will want to remove ahead of time, you may be better off
      setting :any:`excludePatterns` or :any:`excludeBasenamePatterns` before adding
      items to the list.

      *Note:* Unlike when using the exclude lists, the pattern here is *not*
      bounded at the front and the back of the string.  You can use any pattern
      you want.

      :param pattern: Regular expression pattern representing entries to remove

      :returns: Number of entries removed

      :raises ValueError: If the passed-in pattern is not a valid regular expression



   .. py:method:: removeInvalid()

      Removes from the list all entries that do not exist on disk.

      This method removes from the list all entries which do not currently
      exist on disk in some form.  No attention is paid to whether the entries
      are files or directories.

      :returns: Number of entries removed



   .. py:method:: normalize()

      Normalizes the list, ensuring that each entry is unique.



   .. py:method:: verify()

      Verifies that all entries in the list exist on disk.

      :returns: ``True`` if all entries exist, ``False`` otherwise



.. py:class:: SpanItem(fileList, size, capacity, utilization)

   Item returned by :any:`BackupFileList.generateSpan`.


   .. py:attribute:: fileList


   .. py:attribute:: size


   .. py:attribute:: capacity


   .. py:attribute:: utilization


.. py:class:: BackupFileList

   Bases: :py:obj:`FilesystemList`


   List of files to be backed up.

   A BackupFileList is a :any:`FilesystemList` containing a list of files to be
   backed up.  It only contains files, not directories (soft links are treated
   like files).  On top of the generic functionality provided by
   :any:`FilesystemList`, this class adds functionality to keep a hash (checksum)
   for each file in the list, and it also provides a method to calculate the
   total size of the files in the list and a way to export the list into tar
   form.



   .. py:method:: addDir(path)

      Adds a directory to the list.

      Note that this class does not allow directories to be added by themselves
      (a backup list contains only files).  However, since links to directories
      are technically files, we allow them to be added.

      This method is implemented in terms of the superclass method, with one
      additional validation: the superclass method is only called if the
      passed-in path is both a directory and a link.  All of the superclass's
      existing validations and restrictions apply.

      :param path: Directory path to be added to the list
      :type path: String representing a path on disk

      :returns: Number of items added to the list

      :raises ValueError: If path is not a directory or does not exist
      :raises ValueError: If the path could not be encoded properly



   .. py:method:: totalSize()

      Returns the total size among all files in the list.

      Only files are counted.  Soft links that point at files are ignored.
      Entries which do not exist on disk are ignored.

      :returns: Total size, in bytes



   .. py:method:: generateSizeMap()

      Generates a mapping from file to file size in bytes.

      The mapping does include soft links, which are listed with size zero.
      Entries which do not exist on disk are ignored.

      :returns: Dictionary mapping file to file size
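
      A rough stand-alone equivalent, assuming plain ``os.path`` checks, might
      look like the following.  The helper name ``size_map`` is hypothetical
      and the real method may differ in detail, for instance in the numeric
      type of the sizes.

      .. code-block:: python

         import os

         def size_map(paths):
             """Map each existing entry to its size in bytes."""
             sizes = {}
             for path in paths:
                 if os.path.islink(path):
                     sizes[path] = 0            # soft links are listed with size zero
                 elif os.path.isfile(path):
                     sizes[path] = os.path.getsize(path)
                 # anything else (for example a missing entry) is ignored
             return sizes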



   .. py:method:: generateDigestMap(stripPrefix=None)

      Generates a mapping from file to file digest.

      Currently, the digest is an SHA hash, which should be pretty secure.  In
      the future, this might be a different kind of hash, but we guarantee that
      the type of the hash will not change unless the library major version
      number is bumped.

      Entries which do not exist on disk are ignored.

      Soft links are ignored.  We would end up generating a digest for the file
      that the soft link points at, which doesn't make any sense.

      If ``stripPrefix`` is passed in, then that prefix will be stripped from
      each key when the map is generated.  This can be useful in generating two
      "relative" digest maps to be compared to one another.

      :param stripPrefix: Common prefix to be stripped from paths
      :type stripPrefix: String with any contents

      :returns: Dictionary mapping file to digest value
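
      The following sketch shows the general shape of such a map.  It assumes
      SHA-1 purely for illustration (the text above only promises "an SHA
      hash"), and ``digest_map`` is a hypothetical helper rather than this
      method's code.

      .. code-block:: python

         import hashlib
         import os

         def digest_map(paths, strip_prefix=None):
             """Map each regular file to the hex digest of its contents."""
             digests = {}
             for path in paths:
                 # Soft links and entries that do not exist are skipped
                 if os.path.islink(path) or not os.path.isfile(path):
                     continue
                 sha = hashlib.sha1()   # assumption: the exact SHA variant is not stated
                 with open(path, "rb") as handle:
                     for chunk in iter(lambda: handle.read(65536), b""):
                         sha.update(chunk)
                 key = path
                 if strip_prefix and path.startswith(strip_prefix):
                     key = path[len(strip_prefix):]
                 digests[key] = sha.hexdigest()
             return digests

      Stripping each directory's own prefix is what makes the maps for two
      different directories directly comparable.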

      *See also:* :any:`removeUnchanged`



   .. py:method:: generateFitted(capacity, algorithm='worst_fit')

      Generates a list of items that fit in the indicated capacity.

      Sometimes, callers would like to include every item in a list, but are
      unable to because not all of the items fit in the space available.  This
      method returns a copy of the list, containing only the items that fit in
      a given capacity.  A copy is returned so that we don't lose any
      information if for some reason the fitted list is unsatisfactory.

      The fitting is done using the functions in the knapsack module.  By
      default, the worst fit algorithm is used (see the ``algorithm``
      default in the signature), but you can also choose from first fit,
      best fit and alternate fit.

      :param capacity: Maximum capacity among the files in the new list
      :type capacity: Integer, in bytes
      :param algorithm: Knapsack (fit) algorithm to use
      :type algorithm: One of "first_fit", "best_fit", "worst_fit", "alternate_fit"

      :returns: Copy of list with total size no larger than indicated capacity

      :raises ValueError: If the algorithm is invalid
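
      To give a feel for what a fit algorithm does, here is a minimal
      first-fit sketch.  First fit is shown only because it is the simplest
      of the four; ``first_fit`` is a hypothetical helper and is not taken
      from the knapsack module.

      .. code-block:: python

         def first_fit(sizes, capacity):
             """Pick items in iteration order for as long as they still fit.

             ``sizes`` maps item name to size in bytes.
             Returns ``(chosen, used)``.
             """
             chosen, used = [], 0
             for name, size in sizes.items():
                 if used + size <= capacity:
                     chosen.append(name)
                     used += size
             return chosen, used

         sizes = {"a.log": 600, "b.iso": 500, "c.txt": 300}
         print(first_fit(sizes, 1000))   # (['a.log', 'c.txt'], 900)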



   .. py:method:: generateSpan(capacity, algorithm='worst_fit')

      Splits the list of items into sub-lists that fit in a given capacity.

      Sometimes, callers need to split a backup file list into a set of smaller
      lists.  For instance, you could use this to "span" the files across a set
      of discs.

      The fitting is done using the functions in the knapsack module.  By
      default, the worst fit algorithm is used (see the ``algorithm``
      default in the signature), but you can also choose from first fit,
      best fit and alternate fit.

      *Note:* If any of your items are larger than the capacity, then it won't
      be possible to find a solution.  In this case, a value error will be
      raised.
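
      The spanning idea can be sketched by applying a fit algorithm
      repeatedly until nothing is left.  The code below is an illustration
      under stated assumptions: it uses a simple first-fit pass, the ``Span``
      tuple merely mirrors the field names of :any:`SpanItem`, utilization is
      assumed to be a percentage, and the capacity is assumed to be greater
      than zero.

      .. code-block:: python

         from collections import namedtuple

         # Hypothetical stand-in with the same field names as SpanItem
         Span = namedtuple("Span", ["fileList", "size", "capacity", "utilization"])

         def span(sizes, capacity):
             """Split ``sizes`` (name -> bytes) into groups that fit ``capacity``."""
             too_big = [name for name, size in sizes.items() if size > capacity]
             if too_big:
                 raise ValueError("Items larger than capacity: %s" % too_big)
             remaining = dict(sizes)
             spans = []
             while remaining:
                 chosen, used = [], 0
                 for name, size in remaining.items():
                     if used + size <= capacity:
                         chosen.append(name)
                         used += size
                 for name in chosen:
                     del remaining[name]
                 spans.append(Span(chosen, used, capacity, 100.0 * used / capacity))
             return spans

      Every pass places at least one item, because no remaining item is
      larger than the capacity, so the loop always terminates.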

      :param capacity: Maximum capacity of each sub-list
      :type capacity: Integer, in bytes
      :param algorithm: Knapsack (fit) algorithm to use
      :type algorithm: One of "first_fit", "best_fit", "worst_fit", "alternate_fit"

      :returns: List of :any:`SpanItem` objects

      :raises ValueError: If the algorithm is invalid
      :raises ValueError: If it's not possible to fit some items



   .. py:method:: generateTarfile(path, mode='tar', ignore=False, flat=False)

      Creates a tar file containing the files in the list.

      By default, this method will create uncompressed tar files.  If you pass
      in mode ``'targz'``, then it will create gzipped tar files, and if you
      pass in mode ``'tarbz2'``, then it will create bzipped tar files.

      The tar file will be created as a GNU tar archive, which enables extended
      file name lengths, etc.  Since GNU tar is so prevalent, I've decided that
      the extra functionality out-weighs the disadvantage of not being
      "standard".

      If you pass in ``flat=True``, then a "flat" archive will be created, and
      all of the files will be added to the root of the archive.  So, the file
      ``/tmp/something/whatever.txt`` would be added as just ``whatever.txt``.

      By default, the whole method call fails if there are problems adding any
      of the files to the archive, resulting in an exception.  Under these
      circumstances, callers are advised that they might want to call
      :any:`removeInvalid` and then attempt to create the tar file a second
      time, since the most common cause of failures is a missing file (a file
      that existed when the list was built, but is gone again by the time the
      tar file is built).

      If you want to, you can pass in ``ignore=True``, and the method will
      ignore errors encountered when adding individual files to the archive
      (but not errors opening and closing the archive itself).

      We'll always attempt to remove the tarfile from disk if an exception will
      be thrown.

      *Note:* No validation is done as to whether the entries in the list are
      files, since only files or soft links should be in an object like this.
      However, to be safe, everything is explicitly added to the tar archive
      non-recursively so it's safe to include soft links to directories.

      *Note:* The Python ``tarfile`` module, which is used internally here, is
      supposed to deal properly with long filenames and links.  In my testing,
      I have found that it appears to be able to add really long filenames
      to archives, but doesn't do a good job reading them back out, even out of
      an archive it created.  Fortunately, all Cedar Backup does is add files
      to archives.
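
      The modes, the GNU format and the ``flat`` option map fairly directly
      onto the standard ``tarfile`` module.  The sketch below is a simplified
      illustration with a hypothetical helper name (``write_tar``); it leaves
      out the ``ignore`` handling and the cleanup of a partially written
      archive described above.

      .. code-block:: python

         import os
         import tarfile

         # Mode names used above, mapped to tarfile open modes
         TAR_MODES = {"tar": "w:", "targz": "w:gz", "tarbz2": "w:bz2"}

         def write_tar(files, path, mode="tar", flat=False):
             """Write ``files`` into a GNU-format tar archive at ``path``."""
             if mode not in TAR_MODES:
                 raise ValueError("Mode [%s] is not valid." % mode)
             if not files:
                 raise ValueError("File list is empty.")
             with tarfile.open(path, TAR_MODES[mode], format=tarfile.GNU_FORMAT) as tar:
                 for entry in files:
                     # In a flat archive every entry lands in the archive root
                     arcname = os.path.basename(entry) if flat else None
                     # recursive=False keeps a link to a directory from
                     # pulling in that directory's contents
                     tar.add(entry, arcname=arcname, recursive=False)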

      :param path: Path of tar file to create on disk
      :type path: String representing a path on disk
      :param mode: Tar creation mode
      :type mode: One of either ``'tar'``, ``'targz'`` or ``'tarbz2'``
      :param ignore: Indicates whether to ignore certain errors
      :type ignore: Boolean
      :param flat: Creates "flat" archive by putting all items in root
      :type flat: Boolean

      :raises ValueError: If mode is not valid
      :raises ValueError: If list is empty
      :raises ValueError: If the path could not be encoded properly
      :raises TarError: If there is a problem creating the tar file



   .. py:method:: removeUnchanged(digestMap, captureDigest=False)

      Removes unchanged entries from the list.

      This method relies on a digest map as returned from :any:`generateDigestMap`.
      For each entry in ``digestMap``, if the entry also exists in the current
      list *and* the entry in the current list has the same digest value as in
      the map, the entry in the current list will be removed.

      This method offers a convenient way for callers to filter unneeded
      entries from a list.  The idea is that a caller will capture a digest map
      from ``generateDigestMap`` at some point in time (perhaps the beginning of
      the week), and will save off that map using ``pickle`` or some other
      method.  Then, the caller could use this method sometime in the future to
      filter out any unchanged files based on the saved-off map.

      If ``captureDigest`` is passed-in as ``True``, then digest information will
      be captured for the entire list before the removal step occurs using the
      same rules as in :any:`generateDigestMap`.  The check will involve a lookup
      into the complete digest map.

      If ``captureDigest`` is passed in as ``False``, we will only generate a
      digest value for files we actually need to check, and we'll ignore any
      entry in the list which isn't a file that currently exists on disk.

      The return value varies depending on ``captureDigest``, as well.  To
      preserve backwards compatibility, if ``captureDigest`` is ``False``, then
      we'll just return a single value representing the number of entries
      removed.  Otherwise, we'll return a tuple of
      ``(entries removed, digest map)``.  The returned digest map will be in
      exactly the form returned by
      :any:`generateDigestMap`.

      *Note:* For performance reasons, this method actually ends up rebuilding
      the list from scratch.  First, we build a temporary dictionary containing
      all of the items from the original list.  Then, we remove items as needed
      from the dictionary (which is faster than the equivalent operation on a
      list).  Finally, we replace the contents of the current list based on the
      keys left in the dictionary.  This should be transparent to the caller.
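
      The filtering step can be pictured as follows.  This is a simplified,
      side-effect-free sketch with a hypothetical helper name
      (``remove_unchanged``): it takes the current digests as an argument
      instead of computing them, and returns a new list rather than modifying
      one in place.

      .. code-block:: python

         def remove_unchanged(entries, current, saved):
             """Drop entries whose digest is the same in ``current`` and ``saved``.

             Both maps go from entry name to digest value.
             Returns ``(kept entries, number removed)``.
             """
             table = dict.fromkeys(entries)   # temporary dictionary of all entries
             removed = 0
             for name, digest in saved.items():
                 if name in table and current.get(name) == digest:
                     del table[name]          # cheap compared to list.remove()
                     removed += 1
             return list(table), removed

         saved = {"/etc/hosts": "aa11", "/etc/motd": "bb22"}
         current = {"/etc/hosts": "aa11", "/etc/motd": "cc33", "/etc/new": "dd44"}
         print(remove_unchanged(list(current), current, saved))
         # (['/etc/motd', '/etc/new'], 1)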

      :param digestMap: Dictionary mapping file name to digest value
      :type digestMap: Map as returned from :any:`generateDigestMap`
      :param captureDigest: Indicates that digest information should be captured
      :type captureDigest: Boolean

      :returns: Results as discussed above (format varies based on arguments)



.. py:class:: PurgeItemList

   Bases: :py:obj:`FilesystemList`


   List of files and directories to be purged.

   A PurgeItemList is a :any:`FilesystemList` containing a list of files and
   directories to be purged.  On top of the generic functionality provided by
   :any:`FilesystemList`, this class adds functionality to remove items that are
   too young to be purged, and to actually remove each item in the list from
   the filesystem.

   The other main difference is that when you add a directory's contents to a
   purge item list, the directory itself is not added to the list.  This way,
   if someone asks to purge within ``/opt/backup/collect``, that directory
   doesn't get removed once all of the files within it are gone.


   .. py:method:: addDirContents(path, recursive=True, addSelf=True, linkDepth=0, dereference=False)

      Adds the contents of a directory to the list.

      The path must exist and must be a directory or a link to a directory.
      The contents of the directory (but *not* the directory path itself) will
      be recursively added to the list, subject to any exclusions that are in
      place.  If you only want the directory's immediate contents to be added,
      then pass in ``recursive=False``.

      *Note:* If a directory's absolute path matches an exclude pattern or path,
      or if the directory contains the configured ignore file, then the
      directory and all of its contents will be recursively excluded from the
      list.

      *Note:* If the passed-in directory happens to be a soft link, it will be
      recursed.  However, the linkDepth parameter controls whether any soft
      links *within* the directory will be recursed.  The link depth is the
      maximum depth of the tree at which soft links should be followed.  So, a
      depth of 0 does not follow any soft links, a depth of 1 follows only
      links within the passed-in directory, a depth of 2 follows the links at
      the next level down, etc.

      *Note:* Any invalid soft links (i.e.  soft links that point to
      non-existent items) will be silently ignored.

      *Note:* The :any:`excludeDirs` flag only controls whether any given directory
      path itself is added to the list once it has been discovered.  It does
      *not* modify any behavior related to directory recursion.

      *Note:* If you call this method *on a link to a directory* that link will
      never be dereferenced (it may, however, be followed).

      :param path: Directory path whose contents should be added to the list
      :type path: String representing a path on disk
      :param recursive: Indicates whether directory contents should be added recursively
      :type recursive: Boolean value
      :param addSelf: Ignored in this subclass
      :param linkDepth: Depth of soft links that should be followed
      :type linkDepth: Integer value, where zero means not to follow any soft links
      :param dereference: Indicates whether soft links, if followed, should be dereferenced
      :type dereference: Boolean value

      :returns: Number of items recursively added to the list

      :raises ValueError: If path is not a directory or does not exist
      :raises ValueError: If the path could not be encoded properly



   .. py:method:: removeYoungFiles(daysOld)

      Removes from the list files younger than a certain age (in days).

      Any file whose "age" in days is less than (``<``) the value of the
      ``daysOld`` parameter will be removed from the list so that it will not be
      purged later when :any:`purgeItems` is called.  Directories and soft links
      will be ignored.

      The "age" of a file is the amount of time since the file was last used,
      per the most recent of the file's ``st_atime`` and ``st_mtime`` values.

      *Note:* Some people find the "sense" of this method confusing or
      "backwards".  Keep in mind that this method is used to remove items
      *from the list*, not from the filesystem!  It removes from the list
      those items that you would *not* want to purge because they are too
      young.  As an example, passing in ``daysOld`` of zero (0) would remove
      from the list no files, which would result in purging all of the files
      later.  I would be happy to make a synonym of this method with an
      easier-to-understand "sense", if someone can suggest one.
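
      The age rule can be sketched like this.  Rounding the age down to whole
      days is an assumption made for the example, and both helper names are
      hypothetical; the sketch reports which files would be dropped from the
      list rather than modifying a list.

      .. code-block:: python

         import math
         import os
         import time

         SECONDS_PER_DAY = 60 * 60 * 24

         def age_in_days(atime, mtime, now):
             """Whole days since the most recent of the two timestamps."""
             return max(0, math.floor((now - max(atime, mtime)) / SECONDS_PER_DAY))

         def young_files(paths, days_old, now=None):
             """Return the regular files in ``paths`` younger than ``days_old``."""
             now = time.time() if now is None else now
             young = []
             for path in paths:
                 # Directories and soft links are ignored
                 if os.path.isfile(path) and not os.path.islink(path):
                     info = os.stat(path)
                     if age_in_days(info.st_atime, info.st_mtime, now) < days_old:
                         young.append(path)
             return young

      Because no age is less than zero, passing ``days_old=0`` reports no
      files as young, which matches the note above.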

      :param daysOld: Minimum age of files that are to be kept in the list
      :type daysOld: Integer value >= 0

      :returns: Number of entries removed



   .. py:method:: purgeItems()

      Purges all items in the list.

      Every item in the list will be purged.  Directories in the list will
      *not* be purged recursively, and hence will only be removed if they are
      empty.  Errors will be ignored.

      To facilitate easy removal of directories that will end up being empty,
      the delete process happens in two passes: files first (including soft
      links), then directories.
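
      A minimal sketch of the two-pass approach, using only ``os`` calls, is
      shown below.  The helper name ``purge`` is hypothetical and the real
      method may differ in how it classifies entries and reports errors.

      .. code-block:: python

         import os

         def purge(items):
             """Remove ``items`` from disk and return ``(files, dirs)`` removed."""
             files = dirs = 0
             # Pass 1: files and soft links (a link to a directory is removed
             # as a link, its target is left alone)
             for entry in items:
                 if os.path.islink(entry) or os.path.isfile(entry):
                     try:
                         os.remove(entry)
                         files += 1
                     except OSError:
                         pass   # errors are ignored
             # Pass 2: directories; os.rmdir only succeeds on empty directories
             for entry in items:
                 if os.path.isdir(entry) and not os.path.islink(entry):
                     try:
                         os.rmdir(entry)
                         dirs += 1
                     except OSError:
                         pass   # typically "directory not empty"
             return files, dirs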

      :returns: Tuple containing count of (files, dirs) removed



.. py:function:: normalizeFile(path)

   Normalizes a file name.

   On Windows in particular, we often end up with mixed slashes, where
   parts of a path have forward slash and parts have backward slash.
   This makes it difficult to construct exclusions in configuration,
   because you never know what part of a path will have what kind of
   slash.  I've decided to standardize on forward slashes.

   :param path: Path to be normalized
   :type path: String representing a path on disk

   :returns: Normalized path, which should be equivalent to the original


.. py:function:: normalizeDir(path)

   Normalizes a directory name.

   For our purposes, a directory name is normalized by removing the trailing
   path separator, if any.  This is important because we want directories to
   appear within lists in a consistent way, although from the user's
   perspective passing in ``/path/to/dir/`` and ``/path/to/dir`` are equivalent.

   We also convert slashes.  On Windows in particular, we often end up with
   mixed slashes, where parts of a path have forward slash and parts have
   backward slash.  This makes it difficult to construct exclusions in
   configuration, because you never know what part of a path will have
   what kind of slash.  I've decided to standardize on forward slashes.
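
   The two steps described above amount to something like the following
   sketch.  The name ``normalize_dir`` is hypothetical, and leaving a bare
   root path untouched is an assumption of the example rather than
   documented behavior.

   .. code-block:: python

      def normalize_dir(path):
          """Convert backslashes to forward slashes and drop a trailing slash."""
          path = path.replace("\\", "/")
          if len(path) > 1 and path.endswith("/"):
              path = path[:-1]
          return path

      print(normalize_dir("/path/to/dir/"))     # /path/to/dir
      print(normalize_dir("C:\\backup/data\\")) # C:/backup/data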

   :param path: Path to be normalized
   :type path: String representing a path on disk

   :returns: Normalized path, which should be equivalent to the original


.. py:function:: compareContents(path1, path2, verbose=False)

   Compares the contents of two directories to see if they are equivalent.

   The two directories are recursively compared.  First, we check whether they
   contain exactly the same set of files.  Then, we check to see that every
   given file has exactly the same contents in both directories.

   This is all relatively simple to implement through the magic of
   :any:`BackupFileList.generateDigestMap`, which knows how to strip a path prefix
   off the front of each entry in the mapping it generates.  This makes our
   comparison as simple as creating a list for each path, then generating a
   digest map for each path and comparing the two.

   If no exception is thrown, the two directories are considered identical.

   If the ``verbose`` flag is ``True``, then an alternate (but slower) method is
   used so that any thrown exception can indicate exactly which file caused the
   comparison to fail.  The thrown ``ValueError`` exception distinguishes
   between the directories containing different files, and containing the same
   files with differing content.

   *Note:* Symlinks are *not* followed for the purposes of this comparison.
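
   The difference between the default and the verbose comparison can be
   sketched on two plain dictionaries.  The helper name and the message
   wording below are hypothetical; only the distinction between "different
   files" and "same files, different content" is taken from the description
   above.

   .. code-block:: python

      def compare_maps(digest1, digest2, verbose=False):
          """Raise ValueError if the two digest maps differ."""
          if not verbose:
              # Fast path: one comparison, no detail about what differs
              if digest1 != digest2:
                  raise ValueError("Directories are not equivalent.")
              return
          # Slower path: work out exactly which entries are responsible
          only_one_side = sorted(set(digest1) ^ set(digest2))
          if only_one_side:
              raise ValueError("Different sets of files: %s" % only_one_side)
          changed = sorted(k for k in digest1 if digest1[k] != digest2[k])
          if changed:
              raise ValueError("Files differ in content: %s" % changed)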

   :param path1: First path to compare
   :type path1: String representing a path on disk
   :param path2: Second path to compare
   :type path2: String representing a path on disk
   :param verbose: Indicates whether a verbose response should be given
   :type verbose: Boolean

   :raises ValueError: If a directory doesn't exist or can't be read
   :raises ValueError: If the two directories are not equivalent
   :raises IOError: If there is an unusual problem reading the directories


.. py:function:: compareDigestMaps(digest1, digest2, verbose=False)

   Compares two digest maps and throws an exception if they differ.

   :param digest1: First digest to compare
   :type digest1: Digest as returned from BackupFileList.generateDigestMap()
   :param digest2: Second digest to compare
   :type digest2: Digest as returned from BackupFileList.generateDigestMap()
   :param verbose: Indicates whether a verbose response should be given
   :type verbose: Boolean

   :raises ValueError: If the two digest maps are not equivalent


