Install Apache Spark on Ubuntu 20.04/18.04 and Debian 10/9
Welcome to our guide on how to install Apache Spark on Ubuntu 20.04/18.04 and Debian 10/9.
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework.
It is a fast, unified analytics engine for big data and machine learning processing.
Spark provides high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Install Apache Spark on Ubuntu 20.04/18.04 / Debian 10/9
Before we install Apache Spark on Ubuntu/Debian, let's update our system packages.
sudo apt update
sudo apt -y upgrade
Now install Spark on Ubuntu/Debian using the steps shown below.
Step 1: Install Java
Apache Spark requires Java to run; let's make sure Java is installed on our Ubuntu/Debian system.
For the default system Java:
sudo apt install default-jdk
Verify the Java version with the command:
java -version
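Note that Spark 2.4 officially targets Java 8; the shell sessions later in this guide were captured on Java 11, which is where their "illegal reflective access" warnings come from. If a tool later complains about JAVA_HOME not being set, a common way to derive it from the active java binary is the sketch below; the resolved path depends on which JDK package you installed.
# Resolve the symlink behind 'java' and strip the trailing /bin/java
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
echo $JAVA_HOME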
For Java 8 on Ubuntu 18.04:
sudo apt update
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer oracle-java8-set-default
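Note that the webupd8team PPA stopped distributing the Oracle Java 8 installer after Oracle's 2019 license change, so the commands above may no longer work. On Ubuntu, OpenJDK 8 from the standard repositories is a working alternative for Spark:
sudo apt install openjdk-8-jdk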
If the add-apt-repository command is missing, check our guide on installing add-apt-repository on Debian/Ubuntu.
Step 2: Download Apache Spark
Download the latest release of Apache Spark from the downloads page.
As of this update, that is version 2.4.5.
curl -O https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Extract the Spark tarball.
tar xvf spark-2.4.5-bin-hadoop2.7.tgz
Move the Spark folder created after extraction to the /opt/ directory.
sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark
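As a quick sanity check, list the new directory; the binary distribution ships the bin, conf, jars, examples and sbin folders used in the rest of this guide.
ls /opt/spark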
Set Spark environment variables
Open your bashrc configuration file.
vim ~/.bashrc
Add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Activate the changes.
source ~/.bashrc
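To confirm the new PATH entries took effect, check that your shell can now resolve the Spark launchers added above.
which spark-shell
which start-master.sh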
Step 3: Start a standalone master server
You can now start a standalone master server using the start-master.sh command.
# start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out
The process will be listening on TCP port 8080.
# ss -tunelp | grep 8080
tcp   LISTEN   0   1   *:8080   *:*   users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <->
The web UI looks like below.
My Spark URL is spark://ubuntu:7077.
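If you are installing on a remote server, remember that port 8080 (web UI) and port 7077 (master) must be reachable through your firewall. With UFW, for example, they could be opened as shown below; adjust this to whatever firewall you actually run.
sudo ufw allow 8080/tcp
sudo ufw allow 7077/tcp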
Step 4: Start the Spark worker process
The start-slave.sh command is used to start a Spark worker process.
$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out
If the script is not in your $PATH, you can locate it first.
$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh
You can also run the script using its absolute path.
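By default a worker offers all of the machine's cores and most of its RAM to the master. start-slave.sh accepts -c and -m flags to cap this; the values below are only an example.
# Start a worker limited to 2 cores and 2 GB of memory
/opt/spark/sbin/start-slave.sh -c 2 -m 2G spark://ubuntu:7077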
Step 5: Using the Spark shell
Use the spark-shell command to access the Spark shell.
# /opt/spark/bin/spark-shell
19/04/25 21:48:59 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:48:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:49:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://static.13.127.203.116.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1556221755866).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Hello Spark World")
Hello Spark World

scala>
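For a quick non-interactive smoke test you can also pipe an expression into the shell and point it at the standalone master from step 3; spark://ubuntu:7077 below assumes that master URL.
echo 'println(sc.parallelize(1 to 100).sum())' | /opt/spark/bin/spark-shell --master spark://ubuntu:7077
The job should print 5050.0 among the log output.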
If you are more of a Python person, use pyspark instead.
# /opt/spark/bin/pyspark
Python 2.7.15rc1 (default, Nov 12 2016, 14:31:15)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
19/04/25 21:53:44 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:53:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:53:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 2.7.15rc1 (default, Nov 12 2016 14:31:15)
SparkSession available as 'spark'.
>>>
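You can also exercise the whole cluster without a REPL by submitting the bundled SparkPi example via spark-submit; the jar name below assumes the Spark 2.4.5 / Scala 2.11 build downloaded earlier.
/opt/spark/bin/spark-submit \
  --master spark://ubuntu:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 100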
Easily shut down the master and worker Spark processes with the commands below.
$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh
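Alternatively, stop-all.sh from the same directory stops every Spark daemon this machine started in one go.
$SPARK_HOME/sbin/stop-all.sh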