azure-datalake-store 0.0.9

Azure Data Lake Store Filesystem Client Library for Python

Latest Version: 0.0.17

azure-datalake-store

azure-datalake-store is a Python file-system management library for the Azure Data Lake Store.

To install from source instead of pip (for local testing and development):

> pip install -r dev_requirements.txt
> python setup.py develop

To run the tests, set the following environment variables: azure_tenant_id, azure_username, azure_password, azure_data_lake_store_name.
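
Before running the tests, it can help to confirm that all four variables are present. The snippet below is a minimal sketch of such a check, using only the standard library. The helper name missing_env_vars is hypothetical and is not part of azure-datalake-store; the variable names are the ones listed above.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "azure_tenant_id",
    "azure_username",
    "azure_password",
    "azure_data_lake_store_name",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of the library.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of os.environ makes the check easy to try without touching the real environment.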

To play with the code, here is a starting point:

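from azure.datalake.store import core, lib, multithread

# tenant_id, username, password and store_name are placeholders;
# define them with your own values first
token = lib.auth(tenant_id, username, password)
adl = core.AzureDLFileSystem(token, store_name=store_name)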

# typical operations
adl.ls('')
adl.ls('tmp/', detail=True)
adl.cat('littlefile')
adl.head('gdelt20150827.csv')

# file-like object
with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    # could have passed f to any function requiring a file object:
    # pandas.read_csv(f)

with adl.open('anewfile', 'wb') as f:
    # data is written on flush/close, or when buffer is bigger than
    # blocksize
    f.write(b'important data')

adl.du('anewfile')

# recursively download the whole directory tree with 5 threads and
# 16MB chunks
multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)
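
To get a feel for what a 16MB chunk size means for a given file, the sketch below splits a file size into (offset, length) byte ranges. It is only an illustration of fixed-size chunking: chunk_ranges is a hypothetical helper, not a function of azure-datalake-store, and it does not claim to mirror how ADLDownloader divides work internally.

```python
def chunk_ranges(file_size, chunk_size=2**24):
    """Yield (offset, length) pairs that together cover file_size bytes.

    Hypothetical helper for illustration; the default chunk size is the
    16MB (2**24 bytes) value used in the sample above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        yield offset, length
        offset += length
```

A 36-byte file fits in a single range, while a file slightly larger than 32MB needs two full chunks plus a short final one.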

Command Line Sample Usage

To interact with the API at a higher level, you can use the provided command-line interface in “samples/cli.py”. You will need to set the environment variables described above to connect to the Azure Data Lake Store. A simple example is shown below; more details follow.

> python samples/cli.py ls -l

Execute the program without arguments to access documentation.

To start the CLI in interactive mode, run “python samples/cli.py” and then type “help” to see all available commands (similar to Unix utilities):

> python samples/cli.py
azure> help

Documented commands (type help <topic>):
========================================
cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
chgrp  chown  df     exists  head  info  mkdir  put  rm    tail

azure>

While still in interactive mode, you can run “ls -l” to list the entries in the home directory (“help ls” will show the command’s usage details). If you’re not familiar with the Unix/Linux “ls” command, the columns represent 1) permissions, 2) file owner, 3) file group, 4) file size, 5-7) file’s modification time, and 8) file name.

> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
azure> ls -l --human-readable
drwxrwx--- 0123abcd 0123abcd   0B Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1M Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd  36B Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd   0B Aug 03 13:46 tmp
azure>
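
If you want to post-process this listing in a script, the column description above maps directly onto a small parser. The following is a sketch under the assumption that each line has exactly the eight whitespace-separated columns shown in the plain “ls -l” output (not the “--human-readable” variant, whose sizes are not integers). Entry and parse_ls_line are hypothetical names, not part of the CLI.

```python
from collections import namedtuple

# Field names follow the column description above (illustrative only).
Entry = namedtuple("Entry", "permissions owner group size modified name")


def parse_ls_line(line):
    """Parse one line of plain 'ls -l' output into an Entry."""
    # Split on whitespace at most 7 times so that a file name
    # containing spaces stays in one piece.
    parts = line.strip().split(None, 7)
    if len(parts) != 8:
        raise ValueError("expected 8 columns, got %d" % len(parts))
    permissions, owner, group, size, month, day, time, name = parts
    modified = " ".join((month, day, time))
    return Entry(permissions, owner, group, int(size), modified, name)


def is_directory(entry):
    """A leading 'd' in the permissions column marks a directory."""
    return entry.permissions.startswith("d")
```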

To download a remote file, run “get remote-file [local-file]”. The second argument, “local-file”, is optional. If not provided, the local file will be named after the remote file minus the directory path.

> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
azure> get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
azure>
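
The naming rule for the optional “local-file” argument can be sketched in a couple of lines. This is an illustration of the rule as described, assuming remote paths use forward slashes; default_local_name is a hypothetical helper and not the code used by samples/cli.py.

```python
import posixpath


def default_local_name(remote_path):
    """Return the remote file's name with its directory path removed.

    Hypothetical helper; assumes '/'-separated remote paths.
    """
    return posixpath.basename(remote_path.rstrip("/"))
```

With this rule, both “xyz.csv” and “tmp/data/xyz.csv” would be saved locally as “xyz.csv”.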

It is also possible to run in command-line mode, allowing any available command to be executed separately without remaining in the interpreter.

For example, listing the entries in the home directory:

> python samples/cli.py ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
>

Also, downloading a remote file:

> python samples/cli.py get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
>
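
The two modes (interactive and command-line) are a common pattern for shells built on Python's standard cmd module. The sketch below shows the general idea with a hypothetical DemoShell: when arguments are given, a single command runs and the program returns; otherwise an interactive prompt starts. It is not the implementation in samples/cli.py, and the “echo” command exists only for this illustration.

```python
import cmd
import sys


class DemoShell(cmd.Cmd):
    """Hypothetical shell for illustration; not the code in samples/cli.py."""

    prompt = "demo> "

    def do_echo(self, line):
        """echo TEXT: print TEXT back."""
        self.stdout.write(line + "\n")

    def do_quit(self, line):
        """quit: leave the shell."""
        return True  # a true return value stops cmdloop()

    def do_EOF(self, line):
        """Exit when input ends (for example, Ctrl-D)."""
        return True


def main(argv, stdout=None):
    shell = DemoShell(stdout=stdout)
    if argv:
        # Command-line mode: run one command, then return.
        shell.onecmd(" ".join(argv))
    else:
        # Interactive mode: keep prompting until 'quit' or end of input.
        shell.cmdloop()
```

Calling main(sys.argv[1:]) from a script entry point gives both behaviours: with arguments it runs a single command, and without arguments it opens the prompt.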

Release History

0.0.9 (2017-05-09)

  • Enforce basic SSL utilization to ensure performance due to GitHub issue 625 <https://github.com/pyca/pyopenssl/issues/625>

0.0.8 (2017-04-26)

  • Fix server-side throttling retry support. This is not a guarantee that if the server is throttling the upload (or download) it will eventually succeed, but there is now a back-off retry in place to make it more likely.

0.0.7 (2017-04-19)

  • Update the build process to more efficiently handle multi-part namespaces for pip.

0.0.6 (2017-03-15)

  • Fix an issue with path caching that should drastically improve download performance.

0.0.5 (2017-03-01)

  • Fix for downloader to ensure there is access to the source path before creating destination files
  • Fix for credential objects to inherit from msrest.authentication for more universal authentication support
  • Add support for the following:
    • set_expiry: allows for setting expiration on files
    • ACL management:
      • set_acl: allows for the full replacement of an ACL on a file or folder
      • set_acl_entries: allows for “patching” an existing ACL on a file or folder
      • get_acl_status: retrieves the ACL information for a file or folder
      • remove_acl_entries: removes the specified entries from an ACL on a file or folder
      • remove_acl: removes all non-default ACL entries from a file or folder
      • remove_default_acl: removes all default ACL entries from a folder
  • Remove unsupported and unused “TRUNCATE” operation.
  • Added API-Version support with a default of the latest api version (2016-11-01)

0.0.4 (2017-02-07)

  • Fix for folder upload to properly delete folders with contents when overwrite specified.
  • Fix to set verbose output to False/Off by default. This removes progress tracking output by default but drastically improves performance.

0.0.3 (2017-02-02)

  • Fix to setup.py to include the HISTORY.rst file. No other changes.

0.0.2 (2017-01-30)

  • Addresses an issue with lib.auth() not properly defaulting to 2FA
  • Fixes an issue with Overwrite for ADLUploader sometimes not being honored.
  • Fixes an issue with empty files not properly being uploaded and resulting in a hang in progress tracking.
  • Addition of a samples directory showcasing examples of how to use the client and upload and download logic.
  • General cleanup of documentation and comments.
  • This is still based on API version 2016-11-01

0.0.1 (2016-11-21)

  • Initial preview release. Based on API version 2016-11-01.
  • Includes initial ADLS filesystem functionality and extended upload and download support.
 
File                                     Type    Py Version  Uploaded on  Size
azure-datalake-store-0.0.9.tar.gz (md5)  Source              2017-05-11   46KB