pyspark-cloud 0.0.0

Creator: bradpython12

Description:

Tekumara build of Apache PySpark with Hadoop 3.x
A build of Apache PySpark that uses the hadoop-cloud Maven profile to bundle hadoop-aws 3.x, which contains the S3A connector.
Install
See Releases
Usage
To use pyspark with temporary STS credentials:
pyspark --driver-java-options "-Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
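
The same provider can also be set from Python when a session is built. The following is a minimal sketch, assuming the temporary credentials are exposed through the standard AWS environment variables. The helper names temporary_credentials_conf and build_session are hypothetical; the fs.s3a.* keys are the Hadoop S3A options that TemporaryAWSCredentialsProvider reads.

```python
import os


def temporary_credentials_conf(env=os.environ):
    """Build Spark conf entries for S3A temporary (STS) credentials.

    Hypothetical helper: reads the standard AWS environment variables
    and raises KeyError if one of them is missing.
    """
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.session.token": env["AWS_SESSION_TOKEN"],
    }


def build_session(conf):
    """Create (or reuse) a SparkSession with the given conf entries applied."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With the three variables exported, build_session(temporary_credentials_conf()) would return a session configured for S3A with session credentials.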

To make an existing Spark session use S3A for s3:// URLs (for example, the spark session in the pyspark shell):
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
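
If more than one URL scheme should be routed to S3A, the call above can be wrapped in a small helper. This is only a sketch: the function name use_s3a_for_schemes and the choice of schemes are assumptions for illustration, and it relies on the same private _jsc handle as the one-liner.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def use_s3a_for_schemes(spark, schemes=("s3",)):
    """Point each URL scheme at the S3A filesystem on a live session.

    Hypothetical helper: sets fs.<scheme>.impl for every scheme given
    and returns the Hadoop configuration object it modified.
    """
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    for scheme in schemes:
        hadoop_conf.set(f"fs.{scheme}.impl", S3A_IMPL)
    return hadoop_conf
```

In the pyspark shell this could be called as use_s3a_for_schemes(spark, ("s3", "s3n")).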

See test_s3a.py for an example of using the staging committers.
Rationale
The pyspark distribution on PyPI ships with Hadoop 2.7 and no cloud jars (e.g. hadoop-aws).
Common practice is therefore to use hadoop-aws 2.7.3 as follows:
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.3" --driver-java-options "-Dspark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"

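However, later versions of hadoop-aws cannot be used this way without errors, because hadoop-aws needs to match the version of the Hadoop jars it runs against and the PyPI distribution bundles Hadoop 2.7.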
This project builds a pyspark distribution from source with Hadoop 3.x.
Later versions of hadoop-aws contain the following new features:

2.8 release line contains S3A improvements to support any AWSCredentialsProvider
2.9 release line contains S3Guard which provides consistency and metadata caching for S3A via a backing DynamoDB metadata store.
3.1 release line incorporates HADOOP-13786 which contains optimised job committers including the Netflix staging committers (Directory and Partitioned) and the Magic committers. See committers and committer architecture.
3.2 release line contains an enhanced S3A connector and S3Guard, including better resilience to throttled AWS S3 and DynamoDB IO.
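
The list above can be read as a lookup from release line to feature. The sketch below encodes it that way so a given hadoop-aws version can be checked; the names FEATURES, release_line and available_features are hypothetical. It covers only the release lines listed here and assumes that each feature remains present in later lines.

```python
# Minimum hadoop-aws release line for each feature named in the list above.
FEATURES = [
    ((2, 8), "any AWSCredentialsProvider"),
    ((2, 9), "S3Guard"),
    ((3, 1), "S3A committers (staging and magic)"),
    ((3, 2), "better resilience to throttled S3 and DynamoDB IO"),
]


def release_line(version):
    """Parse a version such as '3.2.0' or '3.1.1-SNAPSHOT' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def available_features(version):
    """Return the listed features whose minimum release line is <= version."""
    line = release_line(version)
    return [name for minimum, name in FEATURES if line >= minimum]
```

On a running session the Hadoop version string can be read with spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion() and passed to available_features.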

To take advantage of the 3.x release line committers in Spark you also need the binding classes introduced in Spark 3.0.0 by SPARK-23977. For Spark 2.4, the Hortonworks backport from the Hortonworks repo is used.
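
As an illustration of how the binding classes are wired in, here is a sketch of the settings for a staging committer. The option names and class names are those given in Spark's cloud integration documentation for Spark 3; whether the Spark 2.4 backport accepts exactly the same values is an assumption. The helpers staging_committer_conf and as_cli_args are hypothetical, and the default conflict mode is chosen only for illustration.

```python
def staging_committer_conf(name="directory", conflict_mode="append"):
    """Build Spark conf entries that select an S3A staging committer.

    Hypothetical helper; the defaults are illustrative, not recommendations.
    """
    if name not in ("directory", "partitioned"):
        raise ValueError(f"not a staging committer: {name}")
    if conflict_mode not in ("fail", "append", "replace"):
        raise ValueError(f"unknown conflict mode: {conflict_mode}")
    return {
        # Which S3A committer Hadoop should create for s3a:// output.
        "spark.hadoop.fs.s3a.committer.name": name,
        "spark.hadoop.fs.s3a.committer.staging.conflict-mode": conflict_mode,
        # Spark-side binding classes from SPARK-23977.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def as_cli_args(conf):
    """Render conf entries as --conf key=value arguments for the pyspark command."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The resulting entries can be passed on the pyspark command line via as_cli_args, or applied one by one with SparkSession.builder.config when creating a session.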

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.