Spark on Windows 10
The third in a series of blogs from Anandraj Jagadeesan talks us through downloading Apache Spark on Windows 10, using the new Ubuntu environment.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark is popular because it carries out parallel execution without you having to code for it, offers DSL support for popular languages, and provides rich APIs for data processing needs such as SQL, machine learning, graph processing and streaming. Spark can consume data from many formats and sources, such as flat files, streaming data, Parquet, Avro and JDBC. Spark applications are IO intensive, and I have seen Spark work at its best when processing files from HDFS or S3, as they are free from the IO bottlenecks of databases and streaming systems.
This article is about setting up Spark on Windows 10 using the new Ubuntu environment. To follow the instructions below, you need Windows 10 build 15063 and Ubuntu Bash for Windows, which you can set up by following my previous blogs.
Any version of Apache Spark can be downloaded from the link. On that page, choose the version of Spark to run, the Hadoop version and the mirror. Then click the Download Spark link to download the contents, packaged as a tgz file.
If you followed my previous blog post and installed Ubuntu Bash for Windows, open the Ubuntu console, change directory to the Downloads folder and execute the command below, replacing the tgz file name.
tar -xvzf <.tgz file name>
For example, to unpack “spark-2.2.0-bin-hadoop2.7.tgz”:
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz
One of the prerequisites for running Spark is a recent version of Java. To check whether Java 8 is installed on Ubuntu for Windows, run java -version on the console.
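As a minimal sketch, the check can be wrapped so that a missing installation is reported clearly. The variable name java_report and the fallback message are illustrative additions; java -version writes its report to stderr, hence the redirection.

```shell
# Report the installed Java version, or say that none was found.
if command -v java >/dev/null 2>&1; then
  # java -version prints to stderr, so fold it into stdout to capture it
  java_report="$(java -version 2>&1)"
else
  java_report="Java is not installed"
fi
echo "$java_report"
```

A version string containing 1.8 indicates Java 8.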
If Java is not available or is an older version, issue the commands below to upgrade:
sudo apt-get update
sudo apt-get install default-jdk
All prerequisites are now complete. In the Linux console, change to the folder where Spark was unpacked, then launch the Spark shell.
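A minimal sketch of this step, assuming the archive from the earlier example was unpacked under ~/Downloads and that the interactive Scala shell shipped in the distribution (bin/spark-shell) is the program to start. SPARK_DIR is an illustrative variable name; adjust the path to your own folder.

```shell
# Assumed location of the unpacked distribution; change to match your setup.
SPARK_DIR="$HOME/Downloads/spark-2.2.0-bin-hadoop2.7"

if [ -x "$SPARK_DIR/bin/spark-shell" ]; then
  cd "$SPARK_DIR" || exit 1
  # Starts an interactive Scala REPL with a SparkContext available as sc
  ./bin/spark-shell
else
  echo "spark-shell not found under $SPARK_DIR"
fi
```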
The example below creates a Spark RDD and performs a calculation on it.
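One way to try such an example without typing into the REPL is to feed a short script to spark-shell on standard input. This is a sketch under assumptions: the numbers, the variable names and the temporary file are illustrative, and the command is expected to run from the unpacked Spark folder.

```shell
# Write a few Scala statements for spark-shell to a temporary file.
script="$(mktemp)"
cat > "$script" <<'EOF'
// sc is the SparkContext that spark-shell creates at startup
val numbers = sc.parallelize(1 to 100)   // create an RDD from a local range
val evens = numbers.filter(_ % 2 == 0)   // transformation: evaluated lazily
println("Sum of even numbers: " + evens.reduce(_ + _))   // action: runs the job
EOF

# Feed the statements to spark-shell; the shell exits when input ends.
if [ -x ./bin/spark-shell ]; then
  ./bin/spark-shell < "$script"
else
  echo "Run this from the unpacked Spark folder, where ./bin/spark-shell lives"
fi
```

The filter call only describes the work; nothing is computed until the reduce action asks for a result.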
To develop Spark applications in Scala, set up the tools below:
- Install SBT
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
- Get the Eclipse Scala IDE from http://scala-ide.org/
- In my GitHub account, I have set up a sample Spark project with all mandatory dependencies. It also has sbt plug-ins to generate the Eclipse project and a fat jar, so the application can run without installing Spark in the runtime environment. The steps to set it up are:
  - Clone the project from https://github.com/anandrajj/spark-example
  - Run the command “sbt eclipse” from the project root to generate the Eclipse project
  - Import the project into Eclipse
  - To run the application, issue “sbt assembly” to build the fat jar, then use spark-submit to run the application. Learn how to use spark-submit here.
  - For a Windows environment, set the HADOOP_HOME environment variable to run Spark from Eclipse. Download winutils.exe here.
The spark-example project also includes the features below.
- SparkApp: similar to the Scala App trait. When extended by a Spark job, it creates the Spark context and SQL context, validates arguments, etc.
- Scala app to query Redshift and Postgres databases.
- Convert CSV to Parquet
- Convert Parquet to CSV
- Sample word count example showing SparkApp in action.
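The build-and-run step described earlier (“sbt assembly” followed by spark-submit) might look like the sketch below. The main class and jar path are hypothetical placeholders, not taken from the spark-example project; check the project's build output for the real values. The --class and --master options are standard spark-submit flags.

```shell
# Hypothetical placeholders: replace with the values from your own build.
MAIN_CLASS="com.example.WordCount"
APP_JAR="target/scala-2.11/spark-example-assembly.jar"

# Build the fat jar when run from an sbt project root.
if [ -f build.sbt ] && command -v sbt >/dev/null 2>&1; then
  sbt assembly
fi

# Submit the jar to a local Spark master using all available cores.
if [ -f "$APP_JAR" ] && command -v spark-submit >/dev/null 2>&1; then
  spark-submit --class "$MAIN_CLASS" --master "local[*]" "$APP_JAR"
else
  echo "Build the jar and put spark-submit on PATH before submitting"
fi
```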