Apache Spark: Difference between revisions

From HPCWIKI
Latest revision as of 17:31, 28 November 2023

Apache Spark

Apache Spark[1] is an open-source project originally developed at UC Berkeley and written primarily in Scala. It is a multi-language, unified engine for large-scale data analytics, data science, and machine learning on single-node machines or clusters.
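Whether an application runs on a single node or on a cluster is selected through the master URL it is started with. The sketch below assumes either local mode or a Spark standalone cluster. The helper names `master_url` and `build_session`, the host name `head-node`, and the application name are hypothetical and used only for illustration.

```python
def master_url(host=None, port=7077, cores=None):
    """Build a Spark master URL.

    host=None  -> local mode on this machine ("local[*]" or "local[N]")
    host given -> standalone cluster master ("spark://host:port")
    """
    if host is None:
        # "local[*]" uses as many worker threads as there are logical cores.
        return "local[*]" if cores is None else f"local[{cores}]"
    # 7077 is the default port of a Spark standalone master.
    return f"spark://{host}:{port}"


def build_session(master, app_name="hpcwiki-sketch"):
    """Create (or reuse) a SparkSession for the given master URL."""
    # Imported lazily so master_url() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


print(master_url())             # single node, all cores
print(master_url(cores=4))      # single node, four worker threads
print(master_url("head-node"))  # standalone cluster (hypothetical host)
```

On a machine with PySpark and Java installed, `build_session(master_url())` would start a local session. Other cluster managers use different master URL schemes that this sketch does not cover.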

Spark supports a wide range of workloads, including batch processing, interactive querying, real-time analytics, machine learning, and graph processing. Unlike platforms that offer limited options or require users to learn a platform-specific language, Spark supports the leading data analytics languages: R, SQL, Python, Scala, and Java.

PySpark

The Spark Python API (PySpark) exposes the Spark programming model to Python. It supports most Apache Spark features, including Spark SQL, DataFrames, MLlib, Spark Core, and Streaming. With PySpark, users write Spark applications in Python and can work directly with Spark's Resilient Distributed Datasets (RDDs). Python code interacts with JVM objects through the Py4J library.
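As an illustration of the RDD, DataFrame, and Spark SQL interfaces mentioned above, the following is a minimal word-count sketch. It assumes local mode. The function names, the application name, and the sample lines are hypothetical. A plain-Python version is included so the expected result can be checked without Spark.

```python
from collections import Counter


def tokenize(line):
    """Split a line into lowercase words (pure Python, no Spark needed)."""
    return line.lower().split()


def word_count_local(lines):
    """Reference result computed without Spark, for comparison."""
    return dict(Counter(word for line in lines for word in tokenize(line)))


def word_count_spark(lines):
    """The same computation expressed with the RDD and DataFrame APIs."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")            # assumption: single-node local mode
        .appName("wordcount-sketch")   # hypothetical application name
        .getOrCreate()
    )
    try:
        # RDD API: distribute the lines, split them, and sum counts per word.
        rdd = spark.sparkContext.parallelize(lines)
        counts = (
            rdd.flatMap(tokenize)
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
        )

        # DataFrame / Spark SQL API: expose the result as a table and query it.
        df = counts.toDF(["word", "n"])
        df.createOrReplaceTempView("counts")
        rows = spark.sql("SELECT word, n FROM counts").collect()
        return {row["word"]: row["n"] for row in rows}
    finally:
        spark.stop()


sample = ["Spark is fast", "spark is general"]
print(word_count_local(sample))
```

On a machine with PySpark and a Java runtime installed, `word_count_spark(sample)` should return the same dictionary as `word_count_local(sample)`. The call is left out of the script so that the sketch can be read and run without a Spark installation.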


This site introduces basic big data analysis using PySpark.

References

1. https://spark.apache.org/