A library to walk through tar archives, simplifying use by handling listing and decompression.

Project description

Summary

TarWalker provides a method to easily scan files somewhat like os.walk, handling compressed files, recursing through directories and scanning within tarfiles.

The library is very stable and changes are rare. It is well documented, has full unit testing (100% code coverage), and is maintained.


Build Status

Overview

Notes about this library:

  1. It can walk through compressed or uncompressed files and tarfiles (and optionally directories), processing the target files one at a time.

  2. It uses a pair of callbacks to avoid opening or decompressing files that are not of interest:

    1. The matcher is called first with the metadata of the file, and returns true if the file is to be used.

    2. If so, the handler is called with the metadata, the matcher return value, and an open (possibly decompressed) file handle.

  3. Decompression is done on the stream, to reduce memory requirements and to avoid wasted processing when a handler returns early.

  4. If the recurse parameter is true and the walker encounters a tarfile embedded within a tarfile, its contents will also be scanned the same way.
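The matcher-then-handler flow in the notes above can be illustrated with the standard library alone. The following is a minimal sketch of the idea, not TarWalker's actual implementation: walk_tar and build_demo_archive are hypothetical names, and the handler here takes a simplified three-parameter signature rather than the five parameters TarWalker passes.

```python
import io
import tarfile


def walk_tar(fileobj, matcher, handler):
    """Call handler(member_file, info, match) for each member the matcher accepts."""
    # Mode "r|*" reads the archive as a forward-only stream and detects
    # gzip/bz2/xz compression transparently.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for info in archive:
            if not info.isfile():
                continue
            match = matcher(info.name)
            if not match:
                continue  # rejected members are never opened
            handler(archive.extractfile(info), info, match)


def build_demo_archive():
    """Build a small gzip-compressed tar archive in memory for the demo."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in (("a.log", b"error: disk full\n"), ("b.bin", b"\x00\x01")):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


seen = []
walk_tar(
    build_demo_archive(),
    matcher=lambda name: name.endswith(".log"),
    handler=lambda member, info, match: seen.append((info.name, member.readline())),
)
print(seen)
```

Only a.log reaches the handler; b.bin is rejected by the matcher and its contents are never read by the caller.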

Two classes are provided. The primary difference is that TarWalker raises an exception if given a directory.

  • TarWalker handles compressed or uncompressed files and tarfile archives.

  • TarDirWalker is a subclass of TarWalker that expands it to recursively walk through directories, processing any files encountered.
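To make the difference between the two classes concrete, the sketch below shows roughly what expanding a directory argument amounts to, using only os.walk. It is an illustration under assumptions, not TarDirWalker's code: iter_files is a hypothetical helper, and the sorted traversal order is a choice made here for a repeatable demo.

```python
import os
import tempfile


def iter_files(path):
    """Yield path itself if it is a file, or every file beneath it if a directory."""
    if os.path.isdir(path):
        for dirpath, dirnames, filenames in os.walk(path):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(filenames):
                yield os.path.join(dirpath, name)
    else:
        yield path


# Build a throwaway directory tree and list the files found beneath it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.log")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("demo\n")
    found = [os.path.relpath(p, root) for p in iter_files(root)]

print(found)
```

Each yielded path would then be treated like a path passed directly on the command line.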

Installation

Install the package using pip, e.g.:

pip install --user tarwalker

pip3 install --user tarwalker

Examples

The following is a simple tool to look for a given string within files. Files can be given as arguments or found within tarballs, and must end with either ‘.log’ (with an optional numeric suffix) or ‘.txt’:

import re
import sys

from tarwalker import TarWalker

PATTERN = re.compile(r'.*\.(txt|log(\.\d+)?)$')


def handler(fileobj, filename, arch, info, match):
    try:
        for line in fileobj:
            if text in line:
                # Prefix the archive name when the file came from a tarball.
                path = (arch + ':') if arch else ''
                print("Found in: " + path + filename)
                return  # stop at the first hit
    except IOError:
        pass


text = sys.argv[1]
walker = TarWalker(file_handler=handler, name_matcher=PATTERN.match, recurse=False)

for arg in sys.argv[2:]:
    walker.handle_path(arg)

Constructors and Callbacks

The constructors of TarWalker and TarDirWalker take the same parameters. Note that at most one of file_matcher or name_matcher is allowed.

  • file_handler (required) a callable taking five (5) positional parameters:

    • fileobj - a readable file object for the file contents.

    • filepath - a str with the filename, which is one of:

      • the file path given to handle_path(),

      • the path of a file found beneath a directory given to handle_path(), or

      • the path of a file within an expanded tar archive.

    • archname - a str with the path of the tar archive, when handling a file found within a tar archive. It will be a colon (‘:’) separated list when reading a nested tar archive.

    • fileinfo - may be None or an object with the following attributes. See os.stat for more details:

      • name - the str name of the file,

      • size - the size of the file in bytes,

      • mtime - modification time, in POSIX (epoch) time,

      • mode - the file permission bits,

      • uid - the file owner’s User ID, and

      • gid - the file owner’s Group ID

    • match - the value returned from the name_matcher or file_matcher call.

    NOTE: files with a compression suffix will have the suffix removed, and the file object will return decompressed contents. For example, for “foo.txt.gz” filepath would be “foo.txt” and fileobj would be the equivalent contents of “foo.txt”.

  • file_matcher (optional) a callable that takes two (2) positional parameters and returns true if the file should be opened and passed to the file_handler callback:

    • filepath - See filepath above.

    • fileinfo - See fileinfo above.

  • name_matcher (optional) a callable that takes one (1) positional parameter and returns true if the file should be opened and passed to file_handler:

    • filepath - See file_handler, above.

  • recurse (optional) If true, the algorithm will recurse into tarballs found within other tarballs. Furthermore, if recurse is a callable it will be called before and after opening an interior tarball, with four (4) positional parameters:

    • start - a bool that indicates recursion into the given tarball is starting; it is False on the second call.

    • tarname - name of the contained (interior) tarball, see filepath above.

    • archive - the name of the containing (exterior) tarball, see archname above.

    • fileinfo - See fileinfo above.
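The callback signatures listed above can be sketched as plain functions. This is a minimal illustration under assumptions: the function names, the size and age cutoffs, and the SimpleNamespace stand-in for fileinfo are all hypothetical, and the callbacks are invoked by hand here rather than by the library. Whether TarWalker uses the handler's return value is not stated, so the return below exists only for the demo.

```python
import io
import time
from types import SimpleNamespace

MAX_SIZE = 10 * 1024 * 1024  # hypothetical cutoff: skip files over 10 MiB
MAX_AGE = 7 * 24 * 3600      # hypothetical cutoff: skip files older than a week


def recent_small_log(filepath, fileinfo):
    """file_matcher: two positional parameters; a true result means 'open it'."""
    if not filepath.endswith(".log"):
        return False
    if fileinfo is None:  # fileinfo may be None, so decide on the name alone
        return True
    return fileinfo.size <= MAX_SIZE and time.time() - fileinfo.mtime <= MAX_AGE


nesting = []  # interior tarballs currently being scanned, outermost first


def on_recurse(start, tarname, archive, fileinfo):
    """recurse callable: called with start=True before and start=False after."""
    if start:
        nesting.append(tarname)
    else:
        nesting.pop()


def count_lines(fileobj, filepath, archname, fileinfo, match):
    """file_handler: five positional parameters, in the documented order."""
    total = sum(1 for _ in fileobj)
    where = (archname + ":" + filepath) if archname else filepath
    print("{}: {} lines".format(where, total))
    return total  # returned for this demo only


# Simulate the calls by hand, with a stand-in for the fileinfo object.
info = SimpleNamespace(name="app.log", size=2048, mtime=time.time(),
                       mode=0o644, uid=1000, gid=1000)
accepted = recent_small_log("app.log", info)
lines = count_lines(io.StringIO("one\ntwo\n"), "app.log", "outer.tar", info, accepted)
```

Such functions would be passed as file_handler=count_lines, file_matcher=recent_small_log and recurse=on_recurse; since at most one matcher is allowed, name_matcher would be omitted.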

Known Issues

If you think you have found a defect, or wish to request an enhancement, please do so via the GitLab issues page.

  • The archname passed to the file_handler callback uses ‘:’ as a separator. Because ‘:’ is also legal within a filename, its presence does not necessarily indicate a nested archive.

  • The recurse feature will scan an embedded tarfile, but there is currently no mechanism to avoid scanning a tarfile found within an embedded tarfile (at any level). If needed, please submit an enhancement request.

  • There are lots of other compression algorithms that are not handled.
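The separator ambiguity in the first issue above can be seen with a short sketch. split_archname is a hypothetical helper and the archive names are made up for illustration; it only shows why splitting on ‘:’ cannot distinguish the two cases.

```python
def split_archname(archname):
    """Naively split a ':'-joined archive chain into its components."""
    return archname.split(":") if archname else []


# A genuinely nested archive: inner.tar found inside outer.tar.
nested = split_archname("outer.tar:inner.tar")

# A single archive whose filename happens to contain a colon.
single = split_archname("backup:2020.tar")

print(nested)
print(single)
```

Both calls yield two components, so code that needs the true nesting depth may be better served by a recurse callable, which is called before and after each interior tarball.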

Source Distribution

tarwalker-1.1.tar.gz (7.9 kB)
