Python native implementation of the Spark RDD interface.
Project description
pysparkling
A native Python implementation of Spark’s RDD interface, but instead of being resilient and distributed it is just transient and local; but fast (lower latency than PySpark). It is a drop in replacement for PySpark’s SparkContext and RDD.
Use case: you have a pipeline that processes 100k input documents and converts them to normalized features. They are used to train a local scikit-learn classifier. The preprocessing is perfect for a full Spark task. Now, you want to use this trained classifier in an API endpoint. You need the same pre-processing pipeline for a single document per API call. This does not have to be done in parallel, but there should be only a small overhead in initialization and preferably no dependency on the JVM. This is where pysparkling shines.
Features
Parallelization via multiprocessing.Pool, concurrent.futures.ThreadPoolExecutor or any other Pool-like objects that have a map(func, iterable) method.
AWS S3 is supported. Use file paths of the form s3n://bucket_name/filename.txt with Context.textFile(). Specify multiple files separated by comma. Use environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID for auth. Mixed local and S3 files are supported. Glob expressions (filenames with * and ?) are resolved.
Lazy execution is in development.
Seamless handling of compressed files is not supported yet.
only dependency: boto for AWS S3 access
Examples
Count the lines in the *.py files in the tests directory:
import pysparkling
context = pysparkling.Context()
print(context.textFile('tests/*.py').count())
API
Context
__init__(pool=None): takes a pool object (an object that has a map() method, e.g. a multiprocessing.Pool) to parallelize all map() and foreach() methods.
textFile(filename): load every line of a text file into a RDD. filename can contain a comma separated list of many files, ? and * wildcards, file paths on S3 (s3n://bucket_name/filename.txt) and local file paths (relative/path/my_text.txt, /absolut/path/my_text.txt or file:///absolute/file/path.txt). If the filename points to a folder containing part* files, those are resolved.
broadcast(var): returns an instance of Broadcast() and it’s values are accessed with value.
RDD
cache(): execute previous steps and cache result
coalesce(): do nothing
collect(): return the underlying list
count(): get length of internal list
countApprox(): same as count()
countByKey: input is list of pairs, returns a dictionary
countByValue: input is a list, returns a dictionary
context(): return the context
distinct(): returns a new RDD containing the distinct elements
filter(func): return new RDD filtered with func
first(): return first element
flatMap(func): return a new RDD of a flattened map
flatMapValues(func): return new RDD
fold(zeroValue, op): aggregate elements
foldByKey(zeroValue, op): aggregate elements by key
foreach(func): apply func to every element in place
foreachPartition(func): same as foreach()
groupBy(func): group by the output of func
groupByKey(): group by key where the RDD is of type [(key, value), …]
histogram(buckets): buckets can be a list or an int
id(): currently just returns None
intersection(other): return a new RDD with the intersection
isCheckpointed(): returns False
join(other): join
keyBy(func): creates tuple in new RDD
keys(): returns the keys of tuples in new RDD
leftOuterJoin(other): left outer join
lookup(key): return list of values for this key
TODO: continue going through the list
map(func): apply func to every element and return a new RDD
mapValues(func): apply func to value in (key, value) pairs and return a new RDD
max(): get the maximum element
min(): get the minimum element
reduce(): reduce
reduceByKey(): reduce by key and return the new RDD
rightOuterJoin(other): right outer join
take(n): get the first n elements
takeSample(n): get n random samples
Broadcast
value: access the value it stores
Changelog
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.