What is Apache Spark?
Apache Spark is an open-source, distributed computing framework designed for big data processing.
It allows you to process massive datasets across multiple nodes while being fast, scalable, and fairly easy to use.
Unlike traditional Hadoop-based solutions, Spark performs most of its operations in memory, making it up to 100x faster than Hadoop MapReduce for certain workloads.
Key Features of Apache Spark:
✔ Lightning Fast: Thanks to in-memory computation.
✔ Distributed Processing: Runs across multiple machines or locally on your computer.
✔ Multi-Language Support: Supports Python (PySpark), Java, Scala, and R.
✔ Integrated Libraries: Includes tools for SQL, machine learning (MLlib), graph processing (GraphX), and streaming.
✔ Easy to Use: Provides an intuitive API for data manipulation.
What is PySpark?
PySpark is the Python API for Apache Spark.
It allows you to leverage the power of Spark while writing Python code.
If you’re already comfortable with pandas, NumPy, or SQL, PySpark will feel quite familiar.
Why Use PySpark Instead of Spark's Native Scala or Java APIs?
- Python-Friendly: Great for those who prefer Python over Scala or Java.
- Data Science & ML Integration: Easily integrates with libraries like TensorFlow and pandas.
- Simplifies Big Data Workflows: Write concise Python code while Spark does the heavy lifting.
Step 1: Prerequisites
Before installing Spark, ensure you have:
✅ Java (JDK 8 or later) installed. Run:
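```bash
java -version
```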
✅ Python 3.7+ installed. Run:
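```bash
python --version   # on some systems: python3 --version
```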
✅ Scala (optional, but useful for Spark development):
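```bash
scala -version
```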
✅ Ensure your system has at least 8GB RAM for smooth performance.
Step 2: Download Apache Spark
- Visit the official Apache Spark downloads page (spark.apache.org/downloads.html).
- Choose the latest stable release.
- Select Pre-built for Apache Hadoop.
- Download the .tgz file and extract it.
Step 3: Install Spark on Windows
1. Extract Spark Files
Extract the downloaded archive into C:\spark.
2. Set Environment Variables
Add the following to your system environment variables:
- SPARK_HOME = C:\spark
- Add %SPARK_HOME%\bin to PATH.
3. Verify Installation
Open PowerShell and run:
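```powershell
spark-shell
```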
If Spark starts, the installation is successful!
Step 4: Install Spark on macOS/Linux
1. Install via Homebrew (macOS only)
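```bash
brew install apache-spark
```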
2. Manually Extract Spark (Linux & macOS)
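For example, assuming Spark 3.5.1 pre-built for Hadoop 3 (adjust the version and file name to match the release you actually downloaded):

```bash
# Older releases move to archive.apache.org/dist/spark/
wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xvzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark   # /opt/spark is just a conventional location
```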
3. Set Environment Variables
Add these lines to ~/.bashrc or ~/.zshrc:
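Assuming Spark was extracted to /opt/spark (adjust SPARK_HOME if you placed it elsewhere):

```bash
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
```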
Then apply changes:
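```bash
source ~/.bashrc   # or: source ~/.zshrc
```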
4. Verify Installation
Run:
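```bash
spark-shell
```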
You should see a Spark REPL session start.
Step 5: Install PySpark
If you plan to use Spark with Python, install PySpark via pip:
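```bash
pip install pyspark
```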
Test PySpark:
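A minimal check: create a SparkSession and print the Spark version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkTest").getOrCreate()
print(spark.version)
spark.stop()
```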
Now, let’s start working with real data!
Now we will feed Spark a set of local documents (like research papers, articles, or logs) and perform basic data processing with PySpark.
Step 1: Prepare Your Data
For this example, we’ll assume you have a directory called documents/ that contains multiple .txt files with research papers or articles.
1. Create a Local Dataset
Make a directory and place some text files in it:
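For example (the file names and contents below are just placeholders):

```bash
mkdir documents
echo "Apache Spark is a distributed computing framework for big data." > documents/doc1.txt
echo "PySpark lets you drive Spark from Python." > documents/doc2.txt
```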
Alternatively, download some real research papers in .txt format.
Step 2: Start a PySpark Session
Open a Python script or Jupyter Notebook and start PySpark:
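The application name is arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DocumentProcessing") \
    .getOrCreate()
```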
This initializes Apache Spark for processing.
Step 3: Load Documents into Spark
Now, we’ll load all the text files in the documents/ directory:
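spark.read.text() loads every line of every file in the directory as one row in a single "value" column:

```python
docs_df = spark.read.text("documents/")
docs_df.show(5, truncate=False)
```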
This will output the first few lines from your documents.
Step 4: Perform Basic Processing
Let’s count the number of lines in our documents:
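```python
line_count = docs_df.count()
print(f"Total lines: {line_count}")
```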
Or filter lines containing specific words:
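```python
from pyspark.sql.functions import col

spark_lines = docs_df.filter(col("value").contains("Spark"))
spark_lines.show(truncate=False)
```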
This keeps only the lines that mention “Spark.”
Step 5: Word Count Example
One of the most common text-processing examples is word count:
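A sketch using the DataFrame API, splitting on whitespace (tweak the pattern if you want to strip punctuation):

```python
from pyspark.sql.functions import explode, split, lower, col

# One row per word, lowercased
words = docs_df.select(explode(split(lower(col("value")), "\\s+")).alias("word"))

word_counts = (
    words.filter(col("word") != "")
         .groupBy("word")
         .count()
         .orderBy(col("count").desc())
)
word_counts.show(20)
```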
This tokenizes the documents, counts word occurrences, and sorts them in descending order.
Now, let’s take things up a notch with advanced text analysis, including TF-IDF (Term Frequency-Inverse Document Frequency) and structured queries with PySpark DataFrames.
(WHAT!? Don’t worry, we will explain it…)
Step 1: Recap – Load Documents into Spark
Before diving into advanced analytics, let’s ensure we have our document dataset loaded.
Start by launching a PySpark session:
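Assuming the same documents/ directory from the previous part:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AdvancedTextAnalysis").getOrCreate()
docs_df = spark.read.text("documents/")
```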
Step 2: Tokenizing Words (Splitting Text into Words)
To analyze text, we first tokenize it into individual words using Spark’s split() function:
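```python
from pyspark.sql.functions import split, explode, lower, col

# Keep the tokens as an array column ("words") for the next steps
tokens_df = docs_df.select(split(lower(col("value")), "\\s+").alias("words"))

# Explode the array into one word per row to inspect the result
tokens_df.select(explode(col("words")).alias("word")).show(10)
```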
This will break sentences into separate words and list them as rows.
Step 3: Removing Stopwords
Common words like “the,” “and,” or “is” don’t add much meaning to text analysis. We can remove them using PySpark’s built-in StopWordsRemover:
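Continuing from the tokens_df created above:

```python
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
filtered_df = remover.transform(tokens_df)
filtered_df.show(5, truncate=False)
```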
This helps in cleaning up the dataset before applying more advanced analytics.
Step 4: TF-IDF – Identifying Important Words
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique to find important words in documents. It assigns higher scores to words that appear frequently in a document but rarely across all documents.
Applying TF-IDF in PySpark
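A sketch using HashingTF and IDF from pyspark.ml.feature, continuing from the filtered_df produced above (numFeatures is a tunable assumption):

```python
from pyspark.ml.feature import HashingTF, IDF

hashing_tf = HashingTF(inputCol="filtered_words", outputCol="raw_features", numFeatures=10000)
featurized_df = hashing_tf.transform(filtered_df)

idf = IDF(inputCol="raw_features", outputCol="tfidf_features")
idf_model = idf.fit(featurized_df)
tfidf_df = idf_model.transform(featurized_df)

tfidf_df.select("filtered_words", "tfidf_features").show(5, truncate=False)
```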
Now, Spark assigns importance scores to words, helping us identify keywords in research papers or the most relevant terms in a document.
Step 5: Structured Queries on Documents
Since Spark supports SQL, let’s treat our documents like a database and run SQL queries on them.
Register DataFrame as a Temporary Table
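```python
docs_df.createOrReplaceTempView("documents")
```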
Example Queries:
Find lines containing “machine learning”:
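```python
spark.sql("""
    SELECT value
    FROM documents
    WHERE LOWER(value) LIKE '%machine learning%'
""").show(truncate=False)
```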
Find the top words in documents:
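One way to do this in pure SQL (splitting on single spaces for simplicity):

```python
spark.sql("""
    SELECT word, COUNT(*) AS count
    FROM (
        SELECT EXPLODE(SPLIT(LOWER(value), ' ')) AS word
        FROM documents
    ) AS words_table
    WHERE word != ''
    GROUP BY word
    ORDER BY count DESC
    LIMIT 20
""").show()
```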
Now, let’s take it a step further and apply machine learning to classify documents and analyze sentiment using PySpark MLlib.
Step 1: Understanding Text Classification & Sentiment Analysis
What is Text Classification?
Text classification assigns categories (labels) to text documents. Examples include:
- Spam detection (spam vs. not spam)
- News categorization (politics, sports, technology, etc.)
- Customer feedback tagging (positive, negative, neutral)
What is Sentiment Analysis?
Sentiment analysis determines the emotional tone of text, typically classifying it as positive, negative, or neutral. It is widely used in:
- Social media monitoring
- Product review analysis
- Customer support automation
Step 2: Preparing the Dataset
For this tutorial, let’s assume we have a dataset of customer reviews stored as reviews.csv:
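A hypothetical sample with two assumed columns, review (the text) and label (1 = positive, 0 = negative), which the rest of this tutorial relies on; adapt the column names to your own data:

```
review,label
"Absolutely love this product, works perfectly!",1
"Terrible experience, would not recommend.",0
"Decent quality for the price.",1
```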
Load the dataset into Spark:
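```python
reviews_df = spark.read.csv("reviews.csv", header=True, inferSchema=True)
reviews_df.show(5, truncate=False)
```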
Step 3: Preprocessing Text Data
Before training a model, we need to convert text into a numerical format.
1. Tokenization
Splitting sentences into words:
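```python
from pyspark.ml.feature import Tokenizer

# "review" is the assumed text column from reviews.csv
tokenizer = Tokenizer(inputCol="review", outputCol="words")
words_df = tokenizer.transform(reviews_df)
```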
2. Removing Stopwords
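```python
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
filtered_df = remover.transform(words_df)
```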
3. TF-IDF Feature Extraction
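```python
from pyspark.ml.feature import HashingTF, IDF

hashing_tf = HashingTF(inputCol="filtered_words", outputCol="raw_features", numFeatures=10000)
featurized_df = hashing_tf.transform(filtered_df)

idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(featurized_df)
tfidf_df = idf_model.transform(featurized_df)
```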
Step 4: Training a Machine Learning Model
We’ll use Logistic Regression for classification.
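A sketch continuing from tfidf_df above (the 80/20 split and seed are arbitrary choices):

```python
from pyspark.ml.classification import LogisticRegression

# Split into training and test sets
train_df, test_df = tfidf_df.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)
```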
Step 5: Evaluating the Model
Now, let’s test our model and check the accuracy.
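```python
predictions = model.transform(test_df)
predictions.select("review", "label", "prediction").show(10, truncate=False)
```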
To measure performance:
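```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy:.2%}")
```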
Step 6: Predicting Sentiment on New Text
To classify new customer reviews:
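A sketch that reuses the preprocessing stages and model fitted above (the example reviews are made up):

```python
new_reviews = spark.createDataFrame(
    [("I absolutely love this product!",), ("This was a waste of money.",)],
    ["review"],
)

# Apply the same tokenizer, stopword remover, TF and IDF stages, then the trained model
new_features = idf_model.transform(
    hashing_tf.transform(remover.transform(tokenizer.transform(new_reviews)))
)
model.transform(new_features).select("review", "prediction").show(truncate=False)
```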
Now, we’re taking things to the next level by integrating MLlib with PHI-2 and Llama.cpp.
This combination gives us a powerful system that can summarize and classify documents and support question-and-answer style interactions.
Step 1: Understanding MLlib in relation to PHI-2 and Llama.cpp
- PHI-2: A lightweight AI model from Microsoft, designed for low-resource LLM tasks.
- Llama.cpp: A high-performance, CPU-friendly framework for running Meta’s Llama models on edge devices.
- MLlib + PHI-2 + Llama.cpp: Combine Spark’s distributed ML capabilities with efficient, local AI inference for handling large-scale NLP, summarization, and text processing tasks.
Step 2: Why Integrate Spark with PHI-2 and Llama.cpp?
| Feature | MLlib | PHI-2 | Llama.cpp |
|---|---|---|---|
| Scalability | ✅ Distributed ML | ❌ Local model | ✅ Efficient execution |
| Ease of Use | ✅ Built-in ML algorithms | ✅ Pre-trained NLP model | ✅ Runs on CPU |
| Low Latency | ❌ Distributed overhead | ✅ Optimized for speed | ✅ Minimal compute requirements |
| AI Workloads | ✅ General ML & NLP | ✅ NLP tasks | ✅ LLM inference |
Real-World Use Cases
🚀 Summarizing Large Datasets (PHI-2 can summarize documents before MLlib classifies them)
📊 AI-Powered Data Analysis (Llama.cpp can generate insights from Spark-based logs)
⚡ Real-Time NLP Processing (Combine all three for distributed inference)
Step 3: Setting Up Spark with PHI-2 and Llama.cpp
1. Install Dependencies
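One possible set of dependencies, assuming PHI-2 is loaded through Hugging Face Transformers and Llama models through the llama-cpp-python bindings:

```bash
pip install pyspark transformers torch llama-cpp-python
```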
2. Initialize Spark
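```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkWithLocalLLMs").getOrCreate()
```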
3. Load Documents into Spark
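Assuming the same documents/ directory as before; wholetext=True keeps each file as a single row instead of one row per line:

```python
docs_df = spark.read.text("documents/", wholetext=True)

# Collect the text to the driver, since PHI-2 and Llama.cpp run locally in this sketch
docs = [row.value for row in docs_df.collect()]
```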
4. Run PHI-2 for Text Summarization
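A minimal sketch using the Transformers text-generation pipeline with the microsoft/phi-2 checkpoint. Since PHI-2 is a general-purpose model, summarization here is done by prompting; the prompt wording, truncation length, and token limits are all assumptions you can tune:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/phi-2")

def summarize(text: str) -> str:
    # Truncate long documents so the prompt fits in the model's context window
    prompt = f"Summarize the following text in two sentences:\n\n{text[:2000]}\n\nSummary:"
    output = generator(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    # The pipeline returns the prompt plus the completion; keep only the completion
    return output[len(prompt):].strip()

summaries = [summarize(doc) for doc in docs]
print(summaries[0])
```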
5. Run Llama.cpp for Question Answering
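A sketch using llama-cpp-python; the GGUF model path is a placeholder for whichever Llama model you have downloaded, and the prompt format is an assumption:

```python
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

question = "What are the main topics covered in these documents?"
context = "\n".join(summaries)  # the PHI-2 summaries from the previous step

response = llm(
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
    max_tokens=200,
    stop=["Question:"],
)
print(response["choices"][0]["text"].strip())
```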
Step 4: Benefits of This Setup
✅ Scalability: Spark processes large-scale data efficiently.
✅ Efficiency: PHI-2 condenses large text through summarization before MLlib processes it.
✅ Edge Deployment: Llama.cpp runs LLM inference on local machines (no GPU required).
✅ AI-Driven Insights: Enables AI-powered NLP tasks directly within Spark.