This is a simplified version of a real-world project I built for a job.
The goal: Create a fully on-prem AI chatbot that can search and retrieve information from a large collection of PDFs.
Unlike cloud-based solutions, everything runs locally, which means no API costs, no data privacy concerns, and full control over the system.
What We're Building
This guide will walk through setting up an AI-powered PDF search system using the following tools:
| Component | Tool |
|---|---|
| Text Extraction | PyMuPDF (fastest) |
| Keyword Search | Elasticsearch |
| Semantic Search | FAISS (or Qdrant) |
| Embedding Model | sentence-transformers |
| Chatbot LLM | Llama 2 (running on local GPUs) |
| User Interface | Streamlit |
A Quick Note About OCR
For this project, I didn't need OCR because all the PDFs already contained selectable text.
However, if you're dealing with scanned PDFs (images instead of text), you'll need OCR (Optical Character Recognition).
For those cases, check out: How to OCR PDFs using pdfplumber and Tesseract.
But for this tutorial, we're assuming text-based PDFs only.
Before we can search or chat with our PDFs, we need to extract the text.
The best way to do this without OCR is PyMuPDF (`fitz`), which is blazing fast and maintains formatting.
Step 1: Install Dependencies
First, install PyMuPDF:
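```bash
pip install pymupdf
```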
Step 2: Extract Text from a PDF
Here's a simple function to extract text from any text-based PDF:
```python
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF using PyMuPDF."""
    doc = fitz.open(pdf_path)
    text = "\n".join([page.get_text("text") for page in doc])
    return text

# Example Usage
pdf_text = extract_text_from_pdf("example.pdf")
print(pdf_text[:500])  # Print first 500 characters
```
Why PyMuPDF?
- Super fast
- Preserves text structure
- Can handle large PDFs without issues
When it won't work
- If the PDF is a scanned image, PyMuPDF won't extract anything
- If you get empty text, your PDF likely needs OCR
Again, if OCR is needed, check out: How to OCR PDFs using pdfplumber and Tesseract.
Step 3: Batch Process a Folder of PDFs
If you have hundreds or thousands of PDFs, you'll want to process them all at once.
Here's how to extract text from every PDF in a folder and store the results in a dictionary:
```python
import os

def extract_text_from_folder(pdf_folder):
    """Extracts text from all PDFs in a folder."""
    extracted_texts = {}
    for filename in os.listdir(pdf_folder):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, filename)
            text = extract_text_from_pdf(pdf_path)
            extracted_texts[filename] = text
    return extracted_texts

# Example Usage
pdf_texts = extract_text_from_folder("pdf_documents")
print(pdf_texts.keys())  # Print the names of processed PDFs
```
What this does:
- Loops through all PDFs in a given folder
- Extracts text and stores it in a dictionary of the form `{filename: extracted_text}`
- Can be used later for search indexing
What We Have So Far
At this point, we can extract text from PDFs, which is the first step toward building our AI-powered search system.
Indexing PDFs in Elasticsearch (Keyword Search)
Step 1: Install Elasticsearch & the Python Client
First, we need to install Elasticsearch and the Python client.
Option 1: Run Elasticsearch Locally (Recommended)
Install Elasticsearch (7.x or 8.x) from elastic.co, then start it:
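For an archive install, starting it typically looks like this (package installs use `systemctl start elasticsearch` instead):
```bash
# Run from the Elasticsearch installation directory
./bin/elasticsearch
```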
Option 2: Run Elasticsearch via Docker
```bash
# xpack.security.enabled=false keeps this local dev setup on plain HTTP without authentication
docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.5.0
```
Now, install the Python client:
```bash
pip install elasticsearch
```
Step 2: Index PDF Text in Elasticsearch
We'll store each PDF's extracted text as a document in Elasticsearch.
1. Connect to Elasticsearch
```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # Change if running remotely

# Check connection
if es.ping():
    print("Connected to Elasticsearch!")
else:
    print("Elasticsearch connection failed.")
```
2. Create an Index for PDFs
```python
INDEX_NAME = "pdf_documents"

# Define mapping (schema)
mapping = {
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "text": {"type": "text"}
        }
    }
}

# Create the index
if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, body=mapping)
    print(f"Index '{INDEX_NAME}' created.")
```
3. Add PDFs to Elasticsearch
```python
def index_pdf(filename, text):
    """Indexes a PDF document in Elasticsearch."""
    doc = {"filename": filename, "text": text}
    es.index(index=INDEX_NAME, body=doc)

# Example usage
index_pdf("example.pdf", "This is a sample PDF content.")
```
Step 3: Search PDFs in Elasticsearch
Now that PDFs are indexed, we can search for keywords.
Example: Search for “machine learning” in PDFs
```python
def search_pdfs(query):
    """Search PDFs using Elasticsearch."""
    search_query = {
        "query": {
            "match": {
                "text": query
            }
        }
    }
    results = es.search(index=INDEX_NAME, body=search_query)
    return results["hits"]["hits"]

# Example usage
results = search_pdfs("machine learning")
for r in results:
    print(f"Found in: {r['_source']['filename']}\nText: {r['_source']['text'][:200]}...\n")
```
Elasticsearch now powers our keyword-based PDF search!
Indexing PDFs in FAISS (Semantic Search)
Elasticsearch works well for exact keyword matches, but it doesn't understand meaning.
To search PDFs based on meaning, we use FAISS (Facebook AI Similarity Search) with text embeddings.
Step 1: Install FAISS & sentence-transformers
```bash
pip install faiss-cpu sentence-transformers
```
Step 2: Generate Embeddings for PDFs
We'll use `sentence-transformers` to convert text into numerical embeddings.
1. Load the Embedding Model
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Fast and accurate
```
2. Convert PDF Text into Embeddings
```python
def embed_text(text):
    """Generates an embedding for a given text."""
    return model.encode(text)

# Example usage
embedding = embed_text("This is a sample text.")
print(embedding.shape)  # Output: (384,)
```
Step 3: Store Embeddings in FAISS
Now, we create a FAISS index to store and search our embeddings.
1. Import FAISS & Create an Index
```python
import faiss
import numpy as np

DIMENSIONS = 384  # Model output size
index = faiss.IndexFlatL2(DIMENSIONS)  # L2 distance index
```
2. Index PDF Embeddings
```python
pdf_texts = {
    "example.pdf": "This document is about deep learning and AI.",
    "sample.pdf": "This paper discusses cloud computing concepts."
}

embeddings = np.array([embed_text(text) for text in pdf_texts.values()])
index.add(embeddings)
print("FAISS index created with", index.ntotal, "documents.")
```
Step 4: Search PDFs in FAISS
Now we can search using semantic similarity.
Search FAISS Using a Query
```python
def search_faiss(query, k=2):
    """Searches FAISS for the most similar PDFs."""
    query_embedding = embed_text(query).reshape(1, -1)
    D, I = index.search(query_embedding, k)  # Retrieve top-k distances and indices
    return I

# Example usage
query = "AI and deep learning"
results = search_faiss(query)
for i in results[0]:
    print("Matched:", list(pdf_texts.keys())[i])
```
FAISS now powers our semantic PDF search!
Here's Part 3, where we integrate Llama 2 to create an AI chatbot that can answer questions based on our PDF data using retrieval-augmented generation (RAG).
Running Llama 2 Locally
Step 1: Install Llama 2
We'll use `llama-cpp-python`, which allows us to run Llama 2 on CPU or GPU.
```bash
pip install llama-cpp-python
```
Tip: if you have a powerful GPU, use the GGUF version for better performance.
Step 2: Download a Llama 2 Model
Go to Meta's Llama 2 page and download a model.
For fast responses, I recommend:
- `llama-2-7b-chat.Q4_K_M.gguf` (quantized 4-bit model)
- `llama-2-13b-chat.Q4_K_M.gguf` (larger but still manageable)
Place the model file in a folder called `models`.
Step 3: Load Llama 2 in Python
```python
from llama_cpp import Llama

# Load Llama 2 model
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")

# Example chat
response = llm("What is machine learning?")
print(response["choices"][0]["text"])
```
Llama 2 is now running locally!
Implementing RAG (Retrieval-Augmented Generation)
By itself, Llama 2 doesn't know about our PDFs.
To make it answer questions based on PDFs, we use retrieval-augmented generation (RAG):
- Search PDFs using Elasticsearch (keyword) and FAISS (semantic search)
- Feed search results into Llama 2 as context
- Ask Llama 2 a question, and it will generate an answer based on the retrieved PDFs
Step 1: Search PDFs Using Elasticsearch & FAISS
We combine both search methods to get the most relevant PDF chunks.
```python
def search_pdfs_rag(query, k=3):
    """Search PDFs using Elasticsearch (keyword) and FAISS (semantic)."""
    # 1. Keyword search (Elasticsearch)
    es_results = search_pdfs(query)[:k]
    # 2. Semantic search (FAISS) - already returns the top-k indices
    faiss_results = search_faiss(query, k)
    # 3. Merge and return results
    combined_results = set([r["_source"]["text"][:500] for r in es_results])
    combined_results.update([list(pdf_texts.values())[i][:500] for i in faiss_results[0]])
    return "\n\n".join(combined_results)
```
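Assuming the Elasticsearch index and the FAISS index from the earlier steps are already populated, you can call it directly:
```python
# Prints a block of merged context snippets drawn from both search backends
print(search_pdfs_rag("deep learning", k=3))
```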
Step 2: Feed Search Results to Llama 2
Now we pass the retrieved text as context to Llama 2.
```python
def chat_with_pdfs(query):
    """Uses RAG to answer questions based on PDF content."""
    context = search_pdfs_rag(query)
    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
    # Raise the completion length so answers aren't cut short by llama-cpp-python's small default
    response = llm(prompt, max_tokens=256)
    return response["choices"][0]["text"]

# Example Usage
print(chat_with_pdfs("What is deep learning?"))
```
Now, Llama 2 can answer questions based on our PDFs!
Building a Simple Chat UI with Streamlit
To make this user-friendly, let's build a web-based chatbot using Streamlit.
Step 1: Install Streamlit
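Install it with pip:
```bash
pip install streamlit
```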
Step 2: Create a Simple Chatbot UI
Create a file `app.py`:
```python
import streamlit as st

st.title("AI Chatbot for PDF Search")

query = st.text_input("Ask a question:")

if query:
    response = chat_with_pdfs(query)
    st.write("### AI Response:")
    st.write(response)
```
Step 3: Run the App
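Run the app from the project folder:
```bash
streamlit run app.py
```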
Now you have a chatbot that searches PDFs and answers questions!
Improving Search Ranking
Right now, our Elasticsearch + FAISS search returns somewhat relevant results, but we can improve ranking & filtering.
Step 1: Boost Keyword Matches in Elasticsearch
By default, Elasticsearch treats all matches equally. We can boost results that contain exact keyword matches.
Update the Elasticsearch Search Query
```python
def search_pdfs_improved(query, k=3):
    """Improves search ranking by boosting keyword matches."""
    search_query = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": {"query": query, "boost": 2.0}}},        # Boost exact matches
                    {"match_phrase": {"text": {"query": query, "boost": 1.5}}}  # Boost phrase matches
                ]
            }
        }
    }
    results = es.search(index="pdf_documents", body=search_query)
    return results["hits"]["hits"][:k]

# Example usage
print(search_pdfs_improved("machine learning"))
```
- Boosts exact and phrase matches
- More relevant results appear at the top
Step 2: Adjust FAISS to Prefer Recent Documents
FAISS ranks purely by vector similarity and doesn't consider document metadata, but we can re-rank results based on recency.
Re-rank FAISS Results by Document Date
```python
def rerank_faiss_results(faiss_results, doc_metadata):
    """Re-ranks FAISS results based on recency."""
    sorted_results = sorted(faiss_results, key=lambda doc: doc_metadata[doc]["date"], reverse=True)
    return sorted_results

# Example usage
metadata = {"example.pdf": {"date": "2024-01-01"}, "old.pdf": {"date": "2019-05-10"}}
print(rerank_faiss_results(["old.pdf", "example.pdf"], metadata))  # "example.pdf" comes first
```
Recent documents now rank higher.
Enhancing the Streamlit UI
Our current chatbot UI is too basic. Let's:
- Improve the layout
- Add chat history
- Show document sources
Step 1: Upgrade the Chat UI
Update `app.py` with a better layout:
```python
import streamlit as st

st.set_page_config(page_title="AI Chatbot for PDFs", layout="wide")
st.title("AI Chatbot for PDF Search")

# Sidebar
with st.sidebar:
    st.header("Settings")
    st.text("Customize your search")

query = st.text_input("Ask a question:")

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

if query:
    response = chat_with_pdfs(query)
    st.session_state.chat_history.append((query, response))

st.write("### AI Response:")
for q, r in st.session_state.chat_history:
    st.write(f"**Q:** {q}")
    st.write(f"**A:** {r}")
    st.write("---")
```
- Keeps chat history
- Better layout with a sidebar
Step 2: Show PDF Sources in Chat
Modify `chat_with_pdfs()` to return sources. This also requires a small change to `search_pdfs_rag()`, sketched after the code below.
```python
def chat_with_pdfs(query):
    """Returns AI response + sources."""
    context, sources = search_pdfs_rag(query, return_sources=True)
    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
    response = llm(prompt)
    return response["choices"][0]["text"], sources
```
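The earlier `search_pdfs_rag()` only returns the merged context, so it needs a small update before this works. Here is a minimal sketch of one way to add a `return_sources` flag, assuming the `search_pdfs`, `search_faiss`, and `pdf_texts` objects defined earlier:
```python
def search_pdfs_rag(query, k=3, return_sources=False):
    """Search PDFs and optionally return the source filenames alongside the context."""
    es_results = search_pdfs(query)[:k]     # keyword hits
    faiss_results = search_faiss(query, k)  # semantic hits (indices into pdf_texts)

    snippets, sources = [], []
    for r in es_results:
        snippets.append(r["_source"]["text"][:500])
        sources.append(r["_source"]["filename"])
    for i in faiss_results[0]:
        filename = list(pdf_texts.keys())[i]
        snippets.append(pdf_texts[filename][:500])
        sources.append(filename)

    context = "\n\n".join(dict.fromkeys(snippets))  # de-duplicate while keeping order
    if return_sources:
        return context, list(dict.fromkeys(sources))
    return context
```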
Now, update `app.py` to show sources:
```python
response, sources = chat_with_pdfs(query)

st.write("### AI Response:")
st.write(response)

st.write("**Sources:**")
for source in sources:
    st.write(f"- {source}")
```
Users see which PDFs were used to generate answers.
Adding PDF Upload Support
Currently, we preload PDFs, but users can't upload new ones. Let's fix that!
Step 1: Add File Upload to Streamlit
Modify `app.py` to allow users to upload PDFs.
```python
import os

uploaded_files = st.file_uploader("Upload PDFs", accept_multiple_files=True, type=["pdf"])

if uploaded_files:
    os.makedirs("pdf_documents", exist_ok=True)  # Make sure the target folder exists
    for uploaded_file in uploaded_files:
        bytes_data = uploaded_file.read()
        # Save file locally
        with open(f"pdf_documents/{uploaded_file.name}", "wb") as f:
            f.write(bytes_data)
        # Extract text and index it
        text = extract_text_from_pdf(f"pdf_documents/{uploaded_file.name}")
        index_pdf(uploaded_file.name, text)
    st.success("Files uploaded and indexed successfully!")
```
Users can now upload PDFs, and they're instantly indexed.
Here's Part 5, where we optimize performance by making Llama 2 run faster, improving FAISS search speed, and scaling to thousands of PDFs efficiently.
Speeding Up Llama 2
By default, Llama 2 can be slow, especially on CPUs. Here's how to run it faster.
Step 1: Use a Quantized Llama 2 Model
Quantization reduces model size and speeds up inference.
Download a Quantized GGUF Model
Go to Meta's Llama 2 page and download:
- `llama-2-7b-chat.Q4_K_M.gguf` (4-bit quantized)
- OR `llama-2-13b-chat.Q4_K_M.gguf` (faster than full precision)
Move it to `models/`.
Step 2: Enable GPU Acceleration (If Available)
If you have a GPU, use `llama-cpp-python` with CUDA.
Install CUDA & llama-cpp-python with GPU Support
```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir
```
Then, modify your Llama 2 loading code:
```python
from llama_cpp import Llama

# Load Llama 2 model with GPU acceleration
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=100)
```
- Massive speed boost on GPUs
- Even CPU inference is faster with quantization
Step 3: Reduce Response Time with Streaming
Right now, Llama 2 waits for the full response before returning anything.
We can stream responses as they're generated for a faster, chat-like feel.
Modify `chat_with_pdfs()` to Stream Responses
```python
def chat_with_pdfs(query):
    """Streams responses from Llama 2 for a faster user experience."""
    context = search_pdfs_rag(query)
    prompt = f"Use the following context to answer:\n\n{context}\n\nQuestion: {query}\nAnswer:"
    for response in llm(prompt, stream=True):
        yield response["choices"][0]["text"]
```
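Because the function is now a generator, callers iterate over the chunks instead of printing a single return value:
```python
# Print the answer as it streams in, chunk by chunk
for chunk in chat_with_pdfs("What is deep learning?"):
    print(chunk, end="", flush=True)
```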
Now responses start appearing immediately instead of arriving all at once!
Optimizing FAISS for Large-Scale Search
FAISS is fast, but it can slow down as we add more PDFs.
Here's how to speed it up for thousands of documents.
Step 1: Use HNSW Indexing Instead of Flat L2
So far we've used brute-force search (`IndexFlatL2`).
For huge datasets, we should use Hierarchical Navigable Small World (HNSW) indexing.
Modify the FAISS Index to Use HNSW
```python
import faiss

DIMENSIONS = 384  # Sentence-Transformer output size
index = faiss.IndexHNSWFlat(DIMENSIONS, 32)  # 32 is the max number of links per node
```
Now FAISS search is much faster for large datasets.
Step 2: Use IVF Indexing for Faster Lookups
Another trick is the Inverted File Index (IVF), which clusters vectors for fast retrieval.
Modify the FAISS Index to Use IVF
```python
num_clusters = 128  # Adjust based on dataset size
quantizer = faiss.IndexFlatL2(DIMENSIONS)
index = faiss.IndexIVFFlat(quantizer, DIMENSIONS, num_clusters)
index.train(embeddings)  # Train on initial dataset
index.add(embeddings)    # Training alone doesn't store vectors; add them after training
```
Speeds up searches by grouping similar documents.
Scaling Elasticsearch for Massive PDF Collections
If you have millions of PDFs, Elasticsearch needs tuning.
Step 1: Disable Refresh for Bulk Indexing
By default, Elasticsearch refreshes after every document insert, slowing down indexing.
Disable Refresh While Indexing
```python
# Turn off automatic refresh while bulk indexing
es.indices.put_settings(index="pdf_documents", body={"refresh_interval": "-1"})

# Bulk index PDFs
for filename, text in pdf_texts.items():
    index_pdf(filename, text)

# Re-enable refresh
es.indices.put_settings(index="pdf_documents", body={"refresh_interval": "1s"})
```
Indexing PDFs is now 5-10x faster.
Step 2: Increase Shard Count for Large Datasets
For huge collections, increase the number of shards.
Modify Index Settings
```python
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 3,
            "number_of_replicas": 1
        }
    }
}

# Note: shard count is fixed at creation time, so this applies when creating a new index
es.indices.create(index="pdf_documents", body=index_settings)
```
Speeds up searches & indexing on large datasets.
Adding Multi-Turn Memory
By default, Llama 2 only answers one question at a time.
To enable multi-turn conversation memory, we need to track past questions and answers.
Step 1: Modify `chat_with_pdfs()` to Include Memory
Modify the chatbot function to store past questions & responses.
```python
import streamlit as st  # st.session_state holds the memory, so this runs inside the Streamlit app

def chat_with_pdfs(query):
    """Uses multi-turn memory for AI conversations."""
    # Retrieve relevant PDFs
    context, sources = search_pdfs_rag(query, return_sources=True)

    # Maintain conversation memory
    if "conversation_history" not in st.session_state:
        st.session_state.conversation_history = []

    # Create conversation context
    conversation_history = "\n".join(st.session_state.conversation_history)

    # Generate response using Llama 2
    prompt = f"""
Previous conversation:
{conversation_history}

Use the following PDF context to answer the question:
{context}

Question: {query}
Answer:
"""
    response = llm(prompt)["choices"][0]["text"]

    # Save conversation
    st.session_state.conversation_history.append(f"Q: {query}\nA: {response}")
    return response, sources
```
Now, the chatbot remembers previous questions!
Step 2: Display Conversation History in the UI
Modify `app.py` to show chat history.
```python
import streamlit as st

st.title("AI Chatbot for PDF Search")

query = st.text_input("Ask a question:")

if query:
    response, sources = chat_with_pdfs(query)
    st.write("### AI Response:")
    st.write(response)

    # Display conversation history
    st.write("### Conversation History:")
    for message in st.session_state.conversation_history[-5:]:  # Show last 5 messages
        st.write(message)

    # Show document sources
    st.write("**Sources:**")
    for source in sources:
        st.write(f"- {source}")
```
Now users see chat history & sources in a clean format.
Fine-Tuning Llama 2 for Better Responses
Currently, Llama 2 isn't optimized for our PDFs.
Fine-tuning makes it much smarter about our documents.
Step 1: Prepare Custom Training Data
Fine-tuning requires examples of questions & correct answers.
We'll use our own PDFs to create a dataset.
```json
[
    {
        "input": "What is machine learning?",
        "output": "Machine learning is a method of data analysis that automates analytical model building."
    },
    {
        "input": "Explain deep learning.",
        "output": "Deep learning is a subset of machine learning that uses neural networks to model complex patterns in data."
    }
]
```
Save this as `training_data.json`.
Step 2: Fine-Tune Llama 2
We'll use Hugging Face's `transformers` to fine-tune Llama 2.
Install Dependencies
```bash
pip install transformers datasets peft
```
Fine-Tune Llama 2
```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, TrainingArguments,
                          Trainer, DataCollatorForLanguageModeling)
from datasets import Dataset
import torch
import json

# Load base model & tokenizer (requires access to the gated Llama 2 weights on Hugging Face)
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default

# Load training data
with open("training_data.json", "r") as f:
    training_data = json.load(f)

# Turn each Q&A pair into a single causal-LM training sequence
texts = [f"Question: {d['input']}\nAnswer: {d['output']}" for d in training_data]
dataset = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Fine-tuning settings (in practice you would usually add a PEFT/LoRA adapter,
# installed above, so a 7B model fits on a single GPU)
training_args = TrainingArguments(
    output_dir="./fine-tuned-llama",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # labels = input_ids
)

trainer.train()
model.save_pretrained("./fine-tuned-llama")
tokenizer.save_pretrained("./fine-tuned-llama")
```
Llama 2 is now fine-tuned on our PDFs!
Deploying the Chatbot Backend with FastAPI
FastAPI is a lightweight, high-performance API framework that will serve as our chatbot's backend.
Step 1: Install FastAPI & Uvicorn
```bash
pip install fastapi uvicorn gunicorn
```
Step 2: Create the FastAPI Server
Create a new file `server.py`:
```python
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load Llama 2 model
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=100)

# Define request schema
class ChatRequest(BaseModel):
    query: str

@app.post("/chat")
def chat(request: ChatRequest):
    """Handles chat requests and returns AI responses."""
    response = llm(request.query)["choices"][0]["text"]
    return {"response": response}
```
Step 3: Run the FastAPI Server
Start the server using Uvicorn:
```bash
uvicorn server:app --host 0.0.0.0 --port 8000
```
Now, our chatbot runs as an API!
Test it with:
```bash
curl -X POST "http://localhost:8000/chat" -H "Content-Type: application/json" -d '{"query": "What is machine learning?"}'
```
Scaling with Gunicorn
Uvicorn runs a single process, which isn't ideal for multiple users.
We use Gunicorn to run multiple workers.
Step 1: Run FastAPI with Gunicorn
```bash
# Note: each worker loads its own copy of the Llama model, so 4 workers need roughly 4x the RAM
gunicorn -w 4 -k uvicorn.workers.UvicornWorker server:app --bind 0.0.0.0:8000
```
Now, our chatbot can handle multiple users at once!
Deploying Streamlit as a Frontend
Now that the API is live, we'll connect it to Streamlit.
Step 1: Modify `app.py` to Call FastAPI
Update `app.py` to fetch chatbot responses via the API.
```python
import streamlit as st
import requests

st.title("AI Chatbot for PDF Search")

query = st.text_input("Ask a question:")

if query:
    response = requests.post("http://localhost:8000/chat", json={"query": query}).json()
    st.write("### AI Response:")
    st.write(response["response"])
```
Step 2: Run Streamlit on a Web Server
Run Streamlit with:
```bash
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```
Now, the chatbot has a web interface!
Running Everything with Supervisor
To keep FastAPI & Streamlit running in the background, use Supervisor.
Step 1: Install Supervisor
```bash
sudo apt install supervisor
```
Step 2: Create the Supervisor Config
Create `/etc/supervisor/conf.d/chatbot.conf`:
```ini
[program:fastapi_server]
command=/usr/bin/gunicorn -w 4 -k uvicorn.workers.UvicornWorker server:app --bind 0.0.0.0:8000
autostart=true
autorestart=true
stderr_logfile=/var/log/fastapi_server.err.log
stdout_logfile=/var/log/fastapi_server.out.log

[program:streamlit_ui]
command=/usr/bin/streamlit run /path/to/app.py --server.port 8501 --server.address 0.0.0.0
autostart=true
autorestart=true
stderr_logfile=/var/log/streamlit_ui.err.log
stdout_logfile=/var/log/streamlit_ui.out.log
```
Reload Supervisor:
```bash
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl start fastapi_server
sudo supervisorctl start streamlit_ui
```
Now, FastAPI & Streamlit start automatically on reboot!
API Authentication with API Keys
Right now, anyone can access our chatbot API. We'll restrict access using API keys.
Step 1: Generate API Keys for Users
Modify `server.py` to store API keys.
```python
API_KEYS = {
    "user1": "abc123",
    "admin": "xyz789"
}

def verify_api_key(api_key: str):
    """Checks if the provided API key is valid."""
    return api_key in API_KEYS.values()
```
Step 2: Require an API Key for Chat Requests
Modify the `/chat` route to require an API key.
```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Same request schema as before, so the query still arrives as a JSON body
class ChatRequest(BaseModel):
    query: str

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Requires an API key for chatbot access."""
    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")
    response = llm(request.query)["choices"][0]["text"]
    return {"response": response}
```
Now, only users with a valid API key can access the chatbot!
Test it with:
```bash
curl -X POST "http://localhost:8000/chat" -H "API-Key: abc123" -H "Content-Type: application/json" -d '{"query": "What is AI?"}'
```
Preventing API Abuse with Rate-Limiting
To prevent spam/bot abuse, we'll limit how often users can query the API.
Step 1: Install Rate-Limiting Middleware
Install `slowapi`, a FastAPI-compatible rate limiter:
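```bash
pip install slowapi
```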
Step 2: Add Rate-Limiting to FastAPI
Modify `server.py`:
```python
from fastapi import Header, HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# app, ChatRequest, verify_api_key, and llm are defined earlier in server.py
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("5/minute")
def chat(request: Request, chat_request: ChatRequest, api_key: str = Header(None)):
    """Limits requests to 5 per minute per client IP."""
    # slowapi needs the raw Request object to identify the client
    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")
    response = llm(chat_request.query)["choices"][0]["text"]
    return {"response": response}
```
Now, users can only make 5 requests per minute!
Test it by sending multiple requests in a short time.
Encrypting User Queries
By default, data is sent in plaintext. Let's encrypt user queries to protect sensitive data.
Step 1: Install the Cryptography Library
```bash
pip install cryptography
```
Step 2: Encrypt User Queries Before Sending
Modify the Streamlit UI (`app.py`):
```python
import streamlit as st
import requests
from cryptography.fernet import Fernet

# Generate an encryption key (only run once, then reuse the same key on both
# the client and the API server, e.g. via an environment variable)
key = Fernet.generate_key()
cipher = Fernet(key)

st.title("Secure AI Chatbot")

query = st.text_input("Ask a question:")

if query:
    encrypted_query = cipher.encrypt(query.encode()).decode()
    response = requests.post("http://localhost:8000/chat", json={"query": encrypted_query})
    # Assumes the server decrypts the query and encrypts its reply with the same key
    decrypted_response = cipher.decrypt(response.json()["response"].encode()).decode()
    st.write("### AI Response:")
    st.write(decrypted_response)
```
Queries and responses are now encrypted in transit, provided the API server shares the same key (see the sketch below).
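For this to work end to end, the FastAPI server has to hold the same Fernet key, decrypt incoming queries, and encrypt its replies. Here is a minimal sketch of that server-side counterpart; the `FERNET_KEY` environment variable is an assumption, so distribute the shared key however fits your setup:
```python
import os
from cryptography.fernet import Fernet
from fastapi import Header, HTTPException

# Hypothetical shared key, e.g. the output of Fernet.generate_key() exported on both machines
cipher = Fernet(os.environ["FERNET_KEY"])

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Decrypts the query, runs the model, and encrypts the answer with the shared key."""
    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")
    plaintext_query = cipher.decrypt(request.query.encode()).decode()
    answer = llm(plaintext_query)["choices"][0]["text"]
    return {"response": cipher.encrypt(answer.encode()).decode()}
```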
Securing Deployment with HTTPS
To enable secure communication, use Let's Encrypt for SSL/TLS encryption.
Step 1: Install Certbot
```bash
sudo apt install certbot python3-certbot-nginx
```
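Then request a certificate for your domain (shown here with the nginx plugin; replace `yourdomain.com` with your real domain):
```bash
sudo certbot --nginx -d yourdomain.com
```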
Step 2: Configure Nginx as a Reverse Proxy
Modify `/etc/nginx/sites-available/chatbot`:
```nginx
server {
    listen 80;
    server_name yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
Restart Nginx:
```bash
sudo systemctl restart nginx
```
Now, the chatbot runs securely over HTTPS!
Logging User Interactions
We'll log every chatbot request to a database for later analysis.
Step 1: Set Up SQLite for Logging
We'll store logs in an SQLite database; Python's built-in `sqlite3` module is all we need, so there's nothing extra to install.
Step 2: Modify FastAPI to Log Chats
Modify `server.py`:
```python
import sqlite3
from datetime import datetime

# Connect to SQLite database (check_same_thread=False lets FastAPI worker threads reuse the connection)
conn = sqlite3.connect("chat_logs.db", check_same_thread=False)
cursor = conn.cursor()

# Create logs table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS chat_logs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT,
        user TEXT,
        query TEXT,
        response TEXT
    )
""")
conn.commit()

def log_chat(user, query, response):
    """Logs chatbot interactions to the database."""
    timestamp = datetime.now().isoformat()
    cursor.execute("INSERT INTO chat_logs (timestamp, user, query, response) VALUES (?, ?, ?, ?)",
                   (timestamp, user, query, response))
    conn.commit()

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Handles chatbot requests and logs them."""
    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")
    response = llm(request.query)["choices"][0]["text"]

    # Log interaction
    log_chat(api_key, request.query, response)
    return {"response": response}
```
Now, all chatbot interactions are logged!
Tracking the Most Common Queries
Now that chats are logged, let's track the most frequently asked questions.
Step 1: Query the Most Common Searches
Modify `server.py` to fetch analytics:
```python
@app.get("/analytics/top-queries")
def top_queries():
    """Returns the top 5 most asked questions."""
    cursor.execute("SELECT query, COUNT(query) as count FROM chat_logs GROUP BY query ORDER BY count DESC LIMIT 5")
    results = cursor.fetchall()
    return {"top_queries": results}
```
Now we can see the top 5 queries!
Test it:
```bash
curl -X GET "http://localhost:8000/analytics/top-queries"
```
Monitoring Response Time
To track how fast the chatbot is responding, we'll log execution time.
Step 1: Modify FastAPI to Track Response Time
Modify `server.py`:
```python
import time

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Logs chatbot response times for performance tracking."""
    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")

    start_time = time.time()
    response = llm(request.query)["choices"][0]["text"]
    end_time = time.time()
    response_time = round(end_time - start_time, 2)

    # Log the response time alongside the answer
    cursor.execute("INSERT INTO chat_logs (timestamp, user, query, response) VALUES (?, ?, ?, ?)",
                   (datetime.now().isoformat(), api_key, request.query, f"{response} (Response Time: {response_time}s)"))
    conn.commit()

    return {"response": response, "response_time": response_time}
```
Now, chatbot response times are tracked!
Test it:
```bash
curl -X POST "http://localhost:8000/chat" -H "API-Key: abc123" -H "Content-Type: application/json" -d '{"query": "How does machine learning work?"}'
```
Displaying Analytics in Streamlit
We'll display usage insights in a dashboard.
Step 1: Modify the Streamlit UI
Update `app.py`:
```python
import sqlite3
import streamlit as st
import requests

st.title("Chatbot Analytics")

# Fetch top queries
response = requests.get("http://localhost:8000/analytics/top-queries").json()
top_queries = response["top_queries"]

# Display analytics
st.write("### Top 5 Most Asked Questions")
for query, count in top_queries:
    st.write(f"- {query} ({count} times)")

# Fetch recent chats
st.write("### Recent Chat Logs")
conn = sqlite3.connect("chat_logs.db")
cursor = conn.cursor()
cursor.execute("SELECT timestamp, user, query FROM chat_logs ORDER BY timestamp DESC LIMIT 5")
recent_chats = cursor.fetchall()

for timestamp, user, query in recent_chats:
    st.write(f"{timestamp} - **{user}** asked: *{query}*")
```
Now, we have a real-time analytics dashboard!
Run it:
```bash
streamlit run app.py --server.port 8502
```