Chatbots have existed for a long time, and advancements in artificial intelligence and machine learning have paved the way for developers to leverage Large Language Models (LLMs) and other techniques to enhance their effectiveness and broaden their use cases.
At Appx, we wanted to leverage LLMs by developing a server-side chatbot capable of answering questions about specific information fed to it through text and PDF files, while keeping a per-user chat history so that each user can maintain a conversation with it.
Since this chatbot would run on a Node.js server, the initial step was to explore open-source LLMs (to avoid licensing fees) and JavaScript libraries that integrate with these models, enabling them to run on a server using JavaScript.
This is where Ollama and LangChain come into play.
Ollama
Ollama is a streamlined open-source platform/tool that allows users to run LLMs locally. It has an ever-growing library of models tailored for various tasks: question answering, embedding, vision, etc.
Ollama also makes downloading, managing and running models seamless through its Command Line Interface (CLI) or Application Programming Interface (API).
To install Ollama on your machine, simply go to the Ollama website and download the appropriate version for your operating system.
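Once the installation finishes, you can confirm the CLI is available by running, for example (the output will simply be the installed version number):
ollama --version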
LangChain
LangChain is an open-source Python/JavaScript framework that streamlines the process of building end-to-end applications powered by LLMs and language-based pipelines. It achieves this through its extensive library of components, which can be chained together to suit your specific project needs.
This makes it a great choice for a wide range of applications such as: question answering systems over different types of data, chatbots, retrieval augmented generation (RAG) applications, etc.
LangChain also supports third-party integrations with other models, platforms and services, allowing you to add extra functionality to your application as needed. For this project, we are going to use Ollama’s third-party integration package.
To install LangChain, its Ollama third-party integration and the community package (used later in this post for the document loaders and the FAISS vector store), go to your project’s root folder and run:
npm i langchain @langchain/core @langchain/ollama @langchain/community
Retrieval Augmented Generation (RAG)
RAG is a method used to enhance the knowledge of LLMs by incorporating external data. Although LLMs can analyze and discuss a wide range of topics, their knowledge is limited to the data they were trained on. This means that they are not able to reason about private data or data created after the model was trained.
A standard application that uses this technique through LangChain consists of two primary components:
- Indexing: Creating a structured representation of the ingested data to make it easier and faster to retrieve. Indexing can be further divided into the following steps:
- Loading the data using Document Loaders, which return Document objects for each document that was loaded.
- Splitting the loaded documents into smaller chunks using Text splitters. This step is beneficial both for indexing, because smaller chunks are easier to search over than large ones, and for feeding data to the model, because large chunks might not fit in its finite context window (think of the context window as the model’s memory).
- Storing and indexing the chunks using a VectorStore and an Embeddings model, so they can be searched over when the user asks the chatbot a question.
- Retrieval and generation:
- Given the user’s question, search the vector store for relevant chunks using a Retriever.
- The LLM generates an answer using a prompt consisting of the question and the retrieved information.
Code examples for these steps will be shown in the section below.
Solution
The next sections will go over the research and development process, from setting up a simple server to implementing a chatbot we can ask questions to through an endpoint using LangChain and Ollama.
1. Choosing a model
Before setting up the server, we need to choose the LLM that will power our chatbot. The model should, of course, perform well, but it should also be open-source to eliminate concerns about licensing fees.
Because Ollama’s model catalog contains many LLMs, we chose the top two models at the time for testing: Meta’s Llama 3.1 and Google’s Gemma 2.
To download these models to your machine, run the following commands in your CLI:
ollama pull llama3.1
ollama pull gemma2
After they are downloaded, you can check the list of models currently on your machine by running:
ollama list
This will give you an output similar to the following:
NAME               ID              SIZE     MODIFIED
gemma2:latest      ff02c3702f32    5.4 GB   3 weeks ago
llama3.1:latest    42182419e950    4.7 GB   3 weeks ago
The name of each model on that list will be the name that you will use to call the model in your code.
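For example, a minimal sketch of instantiating one of the listed models through the LangChain integration introduced above:
import { ChatOllama } from "@langchain/ollama";

// Use the exact name shown by `ollama list`
const llm = new ChatOllama({ model: "llama3.1:latest" });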
2. First tests
Now that the models are downloaded to our machine, a simple RAG system was implemented to test whether they would be able to answer questions about information inside text/PDF documents.
As seen in the code below, this can be done seamlessly using LangChain. This example was adapted from LangChain’s documentation to use Ollama (the snippet uses Gemma 2, but Llama 3.1 can be swapped in) and to load text and PDF files from a documents folder inside the root of your project.
import {ChatOllama} from "@langchain/ollama";
import {PDFLoader} from "@langchain/community/document_loaders/fs/pdf";
import {RecursiveCharacterTextSplitter} from "langchain/text_splitter";
import {ChatPromptTemplate} from "@langchain/core/prompts";
import {createStuffDocumentsChain} from "langchain/chains/combine_documents";
import {DirectoryLoader} from "langchain/document_loaders/fs/directory";
import {OllamaEmbeddings} from "@langchain/ollama";
import {TextLoader} from "langchain/document_loaders/fs/text";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { StringOutputParser } from "@langchain/core/output_parsers";
// Create the document loader
const loader = new DirectoryLoader(
"documents",
{
".pdf": (path) => new PDFLoader(path, {
splitPages: false
}),
".txt": (path) => new TextLoader(path)
}
);
// Create the embeddings model
const embeddings = new OllamaEmbeddings({
model: "gemma2:latest" // or llama3.1:latest
});
// Load the files inside the documents folder
const docs = await loader.load();
// Split the content into chunks and store them in memory
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
// Split the documents
const splits = await textSplitter.splitDocuments(docs);
// Create the vector store that will store the embedded splits
const vectorStore = await MemoryVectorStore.fromDocuments(
splits,
embeddings
);
// Retrieve and generate using the relevant chunks of the loaded documents
const retriever = vectorStore.asRetriever();
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are an assistant for question-answering tasks. Use the context given to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise."],
["human", "Context: {context}. Question: {question} \n Answer:"],
]);
// Change this variable with the question you want to ask depending on the documents you added
const question = "What is task decomposition?";
const retrievedDocs = await retriever.invoke(question);
const llm = new ChatOllama({ model:"gemma2:latest"});
const ragChain = await createStuffDocumentsChain({
llm,
prompt,
outputParser: new StringOutputParser(),
});
const answer = await ragChain.invoke({
question: question,
context: retrievedDocs,
});
console.log(answer)
Node.js Server
Since the server will essentially consist of a single endpoint that will accept a query parameter containing the user’s question, Express.js was an obvious choice for its simplicity when it comes to setting up a server.
The code below shows how to set up such a server. The runRag function represents a wrapper containing the code of the RAG system shown in the previous section (a sketch of such a wrapper is shown after the server code).
import express from 'express';
const app = express();
const port = 3000;
// Query endpoint
app.get('/query', async (req, res) => {
const question = req.query.question;
if (!question) {
return res.status(400).send('Empty or invalid question.');
}
try {
const answer = await runRag(question);
res.send(answer);
} catch (error) {
res.status(500).send(`Error: ${error.message}`);
}
});
// Function that is called when the server starts
app.listen(port, async () => {
  console.log(`Listening on port ${port}...`);
});
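For reference, here is a minimal sketch of what the runRag wrapper could look like. It simply wraps the RAG pipeline from the previous section and assumes the loader, textSplitter, embeddings, prompt and llm objects are set up exactly as shown there:
// Hypothetical wrapper around the RAG code from the previous section
const runRag = async (question) => {
  const docs = await loader.load();
  const splits = await textSplitter.splitDocuments(docs);
  const vectorStore = await MemoryVectorStore.fromDocuments(splits, embeddings);
  const retriever = vectorStore.asRetriever();
  const retrievedDocs = await retriever.invoke(question);
  const ragChain = await createStuffDocumentsChain({
    llm,
    prompt,
    outputParser: new StringOutputParser(),
  });
  return ragChain.invoke({
    question: question,
    context: retrievedDocs,
  });
};
With the server running, the endpoint can then be called with, for example, curl "http://localhost:3000/query?question=What%20is%20task%20decomposition%3F".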
Saving the embeddings into a file
At this stage, the embeddings are stored in memory, which means they are generated every time a question is asked. This approach is not ideal, as we don’t want the user to experience delays while embeddings are being created. Additionally, the larger the volume of text or documents to embed, the longer this process will take. Therefore, it would be more efficient to find a way to:
- Save the embeddings to a file, so they can simply be loaded when a question is asked;
- Recreate the embeddings if the information changes (i.e. a document’s content is changed, or a document is added to/removed from the documents folder).
While researching ways to implement this functionality, we found Facebook AI Similarity Search (FAISS), an open-source library that enables efficient similarity searches between document embeddings.
Additionally, LangChain has an integration with FAISS that lets it function as a VectorStore (much like the MemoryVectorStore), allowing us to save the embeddings to a file and load them when necessary, making it an excellent choice given our needs.
As shown in the code below, saving the embeddings to a file is straightforward to implement: add these two lines of code after splitting the documents into chunks and remove the MemoryVectorStore that was used before.
const vectorStore = await FaissStore.fromDocuments(splits, embeddings);
await vectorStore.save(filename);
Then, to load the embeddings from the created file before sending the question to the chatbot, add these two lines of code:
const vectorStore = await FaissStore.load(filename, embeddings);
const retriever = vectorStore.asRetriever();
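Both snippets assume FaissStore has been imported; in our setup this came from the @langchain/community package (which also relies on the faiss-node bindings being installed), roughly like so:
// Assumed import; requires the faiss-node package alongside @langchain/community
import { FaissStore } from "@langchain/community/vectorstores/faiss";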
Now that there is a way to save the embeddings to a file, we also need a way to recreate the embeddings if the content of the existing documents changes, a document is deleted, or a new document is added.
An effective solution is to read each file in the documents folder, collect the raw binary data of every file into an array, generate an MD5 hash of the combined data and use it as the name of the embeddings file.
Then, whenever the server starts or is restarted, it reads all the documents, generates the MD5 hash and checks whether a file with that name exists. If it does, no changes were made to the documents and the embeddings file is loaded; otherwise, the embeddings are generated and the file is created. The code below shows an implementation of this solution.
let hash = "";
const generateEmbeddingsHash = async () => {
try {
const files = await fsASync.readdir("documents");
let buffers = [];
if (files.length === 0) throw new Error("No files found. Please add some text or PDF files to the ‘documents’ folder.");
for (const file of files) {
const data = await fsASync.readFile(`documents/${file}`);
buffers.push(data);
}
const combinedBuffer = Buffer.concat(buffers);
hash = crypto.createHash('md5').update(combinedBuffer).digest('hex');
} catch (e) {
console.error('Erro: ', e.message);
}
return hash;
}
const createEmbeddings = async () => {
  const hash = await generateEmbeddingsHash();
  // Only (re)generate the embeddings if no file exists for the current documents
  if (!fs.existsSync(hash)) {
    const embeddings = new OllamaEmbeddings({
      model: "llama3.1:latest",
    });
    const loader = new DirectoryLoader(
      "documents",
      {
        ".pdf": (path) => new PDFLoader(path, {
          splitPages: false
        }),
        ".txt": (path) => new TextLoader(path)
      });
    const docs = await loader.load();
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });
    const splits = await textSplitter.splitDocuments(docs);
    const vectorStore = await FaissStore.fromDocuments(splits, embeddings);
    // Save the embeddings to a file named after the documents' hash
    await vectorStore.save(hash);
  }
}
Now we can call the createEmbeddings function on server startup like this:
app.listen(port, async () => {
await createEmbeddings();
console.log("Documents loaded successfully.");
});
The table below shows the answer time (in seconds) when the document embeddings are created per answer and when they are read from a file using the method above.
| Model | Without embeddings file | With embeddings file |
|---|---|---|
| Llama 3.1 | 32.41 | 4.92 |
| Gemma 2 | 37.54 | 5.99 |
As shown in the table above, moving document embedding creation to server startup, rather than performing it for each question, significantly improved response times.
Chat history
An essential functionality for the chatbot is the ability to remember previous user interactions. This is useful in situations where the user wants to ask a follow-up question. To achieve this, we need to have a way to store the user’s questions and the chatbot’s answers.
We used a MySQL database with a messages table where each record has an id, a sessionId that represents a user’s session, a content column that holds the message text, and a type column that indicates whether a given message was a question from a user or an answer from the chatbot. To connect to the database from JavaScript we used the knex library (a minimal setup sketch is shown after the table). The structure of the messages table is shown below:
| Field | Type | Null | Key | Extra |
|---|---|---|---|---|
| id | int(11) | NO | Primary | Auto-increment |
| sessionId | int(11) | NO | – | |
| content | text | YES | – | |
| type | enum('ai', 'human') | NO | – | |
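For context, a minimal knex setup could look like the sketch below; the connection details are placeholders for your own environment:
import knex from "knex";

// Placeholder connection settings; replace with your own
const db = knex({
  client: "mysql2",
  connection: {
    host: "localhost",
    user: "chatbot_user",
    password: "chatbot_password",
    database: "chatbot"
  }
});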
We can set up LangChain to integrate a chat history into our chatbot by adding a placeholder to the prompt array, so that the previous messages are placed before the current question, as shown below.
const llamaPrompt = ChatPromptTemplate.fromMessages([
["system", "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Context: {context}"],
new MessagesPlaceholder("chat_history"),
["human", "Question: {input}"]
]);
Afterwards, we need to change our ragChain to use LangChain’s RunnableWithMessageHistory object, which enables chat history, and create a config object that sets up an id to distinguish between the histories of different users.
const withHistory = new RunnableWithMessageHistory({
runnable: ragChain,
historyMessagesKey: "chat_history",
inputMessagesKey: "input",
config: {configurable: {sessionId: id}},
getMessageHistory: (_sessionId) => messageHistory
});
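The messageHistory object referenced above has to come from somewhere. One possible sketch (the loadMessageHistory helper is hypothetical, but it follows the messages table structure described earlier) is to rebuild it from the database for the given session:
import { ChatMessageHistory } from "langchain/stores/message/in_memory";
import { HumanMessage, AIMessage } from "@langchain/core/messages";

// Hypothetical helper: rebuild the chat history of a session from the messages table
const loadMessageHistory = async (sessionId) => {
  const rows = await db('messages')
    .where({ session_id: sessionId })
    .orderBy('id', 'asc');
  const history = new ChatMessageHistory();
  for (const row of rows) {
    // Questions were stored as 'human' messages and answers as 'ai' messages
    if (row.type === 'human') {
      await history.addMessage(new HumanMessage(row.content));
    } else {
      await history.addMessage(new AIMessage(row.content));
    }
  }
  return history;
};
getMessageHistory can then return the history produced by this helper for the requested sessionId.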
We also need to change the endpoint so that the user’s id is sent alongside the question, and save both the question and the answer to the database, like so:
app.get('/query', async (req, res) => {
  const question = req.query.question;
  const id = req.query.id;
  if (!question) {
    return res.status(400).send("Empty or invalid question.");
  }
  // Save the user's question to the database
  try {
    await db('messages').insert({
      session_id: id,
      content: question,
      type: 'human'
    });
  } catch (error) {
    return res.status(500).send(`Error: ${error.message}`);
  }
  // Ask the chatbot and send the answer to the user
  let answer;
  try {
    answer = await runLLM(question, id);
    res.send(answer);
  } catch (error) {
    return res.status(500).send(`Error: ${error.message}`);
  }
  // Save the chatbot's answer to the database
  try {
    await db('messages').insert({
      session_id: id,
      content: answer,
      type: 'ai'
    });
  } catch (error) {
    // The answer was already sent, so just log the error
    console.error(`Error: ${error.message}`);
  }
});
In summary, this is the flow of the chatbot:
- The user asks a question;
- The previous questions and answers of a given user are obtained from the database (using the id that is passed through the query parameters) and added to the chat history;
- The question is sent to the chatbot;
- The question and answer are saved into the database;
- The answer is sent to the user;
Streaming
Given that a chatbot’s main purpose is to maintain a conversation with a human user, providing a good experience for them is crucial.
Considering the answer times from the first tests, it isn’t reasonable to have a user wait 5-10 seconds for an answer without some sort of visual feedback.
To help mitigate this issue, LangChain provides a stream method, which streams an answer chunk by chunk. This way we can give bits of the answer as they come from the chatbot.
This can be implemented by replacing the invoke method with the stream method, as shown in the code below.
const answer = await ragChain.stream({
question: "What is task decomposition?",
context: retrievedDocs,
});
We can then process the answer chunk by chunk as we see fit using a for await...of loop:
for await (const chunk of answer) {
console.log(chunk);
}
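In the context of the Express server, one way to forward these chunks to the client as they arrive could look like the sketch below (the /query-stream endpoint is hypothetical, and retriever and ragChain are assumed to be set up as in the earlier sections):
// Hypothetical streaming endpoint
app.get('/query-stream', async (req, res) => {
  const question = req.query.question;
  if (!question) {
    return res.status(400).send('Empty or invalid question.');
  }
  try {
    const retrievedDocs = await retriever.invoke(question);
    const stream = await ragChain.stream({
      question: question,
      context: retrievedDocs,
    });
    res.setHeader('Content-Type', 'text/plain; charset=utf-8');
    // Write each chunk to the response as soon as the model produces it
    for await (const chunk of stream) {
      res.write(chunk);
    }
    res.end();
  } catch (error) {
    if (!res.headersSent) {
      res.status(500).send(`Error: ${error.message}`);
    } else {
      res.end();
    }
  }
});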
Embedding models
To test the chatbot’s performance using Llama 3.1 and Gemma2, we extracted the text from our company’s website and asked ChatGPT to generate questions regarding that data, giving us a total of 17 questions.
Afterwards, we created a text file with the same information so the chatbot would be able to create the embeddings and then asked the aforementioned questions.
Overall, the results were not satisfactory, since both LLMs were unable to answer a large portion of the questions. We thought the culprit had to be the embeddings, since they hold the documents’ information. After some research, we found that there are specialized models for embedding creation and that Ollama supports them.
From the embedding models that Ollama has to offer, we chose the three most downloaded at the time of testing: nomic-embed-text, mxbai-embed-large and snowflake-arctic-embed.
Although both LLMs were able to answer most questions using these models, there was still a small portion they couldn’t. During the previous tests we were using a chunk size of 500 and a chunk overlap of 100, so we suspected that some questions still couldn’t be answered because some context was not being considered.
One way to provide more context is to enlarge the chunk size so each chunk contains more information. However, increasing the chunk size too much might dilute the most important details, reducing the precision of searching for similar information between the document and the asked question. Because of this, the chunk size was increased from 500 to 800.
Since we increased the chunk size, we also increased the overlap from 100 to 150 to preserve continuity between chunks. This is because larger chunks cover more text, and the boundaries of each chunk are more likely to contain critical information that should overlap with the next chunk.
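In code, this amounts to updating the splitter configuration used when generating the embeddings; only the sizes change from the earlier example:
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,    // increased from 500 so each chunk carries more context
  chunkOverlap: 150, // increased from 100 to preserve continuity between chunks
});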
This new chunk size/chunk overlap configuration yielded much better results. However, the LLMs’ and embedding models’ performances still couldn’t be compared with each other, because the retrieved context differed between the Llama 3.1 and Gemma 2 runs.
To make sure that, for each embedding model, the context fed to both Llama 3.1 and Gemma 2 was the same, the retrieved context for each question was stored before asking the questions.
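As a rough sketch of that step (the questions array is hypothetical), the retrieved context can be cached once per question and then reused for both models:
// Retrieve the context for every test question once, so both LLMs see the same input
const contextsByQuestion = {};
for (const q of questions) {
  contextsByQuestion[q] = await retriever.invoke(q);
}

// Later, for each LLM being evaluated:
// const answer = await ragChain.invoke({ question: q, context: contextsByQuestion[q] });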
In addition to the performance of both models, we also tracked some additional insights.
| Embedding model | Document embeddings generation time (s) |
|---|---|
| mxbai-embed-large | 8.68 |
| snowflake-arctic-embed | 12.63 |
| nomic-embed-text | 5.70 |
As stated in a section above, the embeddings are generated and stored in a file on server startup, and this generation only occurs if there is no file yet or the contents of the documents folder change. Otherwise, the embeddings file is loaded.
Looking at the table above, for a text document containing all the textual information gathered from our company’s website (3985 words), the document embedding creation process was quite fast, with nomic-embed-text being the fastest and snowflake-arctic-embed being the slowest. While this information does not impact answer times, it serves as an interesting insight into the speed of these embedding models.
To measure the performance of each embedding model/LLM configuration, three answer categories were established:
- Correctly answered
- Partially answered
- Wrongfully answered / couldn’t answer
Llama 3.1
| Embedding model | Correctly Answered | Partial Answer | Wrongfully Answered / Couldn’t Answer |
|---|---|---|---|
| mxbai-embed-large | 13 | 2 | 2 |
| snowflake-arctic-embed | 12 | 1 | 4 |
| nomic-embed-text | 14 | 2 | 1 |
Gemma2
| Embedding model | Correctly Answered | Partial Answer | Wrongfully Answered / Couldn’t Answer |
|---|---|---|---|
| mxbai-embed-large | 15 | 1 | 1 |
| snowflake-arctic-embed | 15 | 1 | 1 |
| nomic-embed-text | 14 | 1 | 2 |
Looking at Llama 3.1, nomic-embed-text performed the best, with the highest number of correct answers and the lowest number of answers in the wrongfully answered/couldn’t answer category. Conversely, snowflake-arctic-embed had the worst performance, with the fewest correct answers and the highest number of questions it answered wrongly or couldn’t answer at all.
Gemma 2’s performance is not only better overall but also more balanced across embedding models, with mxbai-embed-large and snowflake-arctic-embed having the same number of answers in each category. Additionally, Gemma 2 did not answer any questions wrongly; it only had questions where it didn’t know the answer.
To consolidate these results and have a more straightforward metric to evaluate each LLM/embedding model configuration, the information in the tables above was used to calculate a success rate (%), where Nc is the number of correctly-answered questions, Npa is the number of partially-answered questions and Tq is the total number of questions:
Success rate (%) = ((Nc + 0.5 × Npa) / Tq) × 100
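For example, for Llama 3.1 with nomic-embed-text: ((14 + 0.5 × 2) / 17) × 100 ≈ 88.24%.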
Llama 3.1
| Embedding model | Success rate (%) |
|---|---|
| mxbai-embed-large | 82.35 |
| snowflake-arctic-embed | 73.53 |
| nomic-embed-text | 88.24 |
Gemma2
| Embedding model | Success rate (%) |
|---|---|
| mxbai-embed-large | 91.18 |
| snowflake-arctic-embed | 91.18 |
| nomic-embed-text | 85.29 |
From the tables above, the only instance where Llama 3.1 outperformed Gemma 2 was with the nomic-embed-text model, and even then the difference was marginal. For all other embedding models, Gemma 2 consistently demonstrated superior performance, particularly with snowflake-arctic-embed, where it showed a significant 17.65 percentage point advantage.
Answer Times
Although the success rate is a key metric when choosing an LLM/embedding model configuration, answer times also play a big role depending on your application’s needs. Thus, the answer times (in seconds) for each question under each configuration were recorded.
Llama 3.1
| Question | mxbai-embed-large | snowflake-arctic-embed | nomic-embed-text |
|---|---|---|---|
| 1 | 1.61 | 1.50 | 1.64 |
| 2 | 1.58 | 1.24 | 1.56 |
| 3 | 1.64 | 1.52 | 1.67 |
| 4 | 1.44 | 1.30 | 1.36 |
| 5 | 1.59 | 1.28 | 1.55 |
| 6 | 1.55 | 1.45 | 1.62 |
| 7 | 1.68 | 1.47 | 1.46 |
| 8 | 1.47 | 1.05 | 1.42 |
| 9 | 1.58 | 1.21 | 1.61 |
| 10 | 1.67 | 1.56 | 1.35 |
| 11 | 1.58 | 1.48 | 1.49 |
| 12 | 1.75 | 1.47 | 1.75 |
| 13 | 1.30 | 1.33 | 1.20 |
| 14 | 1.34 | 1.20 | 1.33 |
| 15 | 1.74 | 1.48 | 1.50 |
| 16 | 1.73 | 1.14 | 1.48 |
| 17 | 1.67 | 1.54 | 1.67 |
| Total | 26.92 | 23.21 | 25.66 |
Gemma2
| Question | mxbai-embed-large | snowflake-arctic-embed | nomic-embed-text |
|---|---|---|---|
| 1 | 1.99 | 1.77 | 1.99 |
| 2 | 1.92 | 1.52 | 1.90 |
| 3 | 2.04 | 1.88 | 2.07 |
| 4 | 1.76 | 1.76 | 1.76 |
| 5 | 1.95 | 1.56 | 1.92 |
| 6 | 1.90 | 1.77 | 2.02 |
| 7 | 2.07 | 1.80 | 1.76 |
| 8 | 1.90 | 1.26 | 1.75 |
| 9 | 2.00 | 1.57 | 2.03 |
| 10 | 2.10 | 1.97 | 1.75 |
| 11 | 1.98 | 1.80 | 1.75 |
| 12 | 2.16 | 1.80 | 2.11 |
| 13 | 1.62 | 1.62 | 1.54 |
| 14 | 1.62 | 1.44 | 1.57 |
| 15 | 2.14 | 1.78 | 1.76 |
| 16 | 2.14 | 1.43 | 1.75 |
| 17 | 2.09 | 1.93 | 2.02 |
| Total | 33.39 | 28.68 | 31.46 |
Looking at the tables above, Llama 3.1 answered the fastest when using snowflake-arctic-embed and the slowest when using mxbai-embed-large, and Gemma 2 followed the same trend.
Comparing individual answers, Llama 3.1 and Gemma 2 might seem quite similar, because answer times are generally fast. However, looking at the sum of the answer times across all questions, Llama 3.1 is faster than Gemma 2 by quite a large margin across the board.
In summary, it’s a matter of deciding what you want to prioritize for your application’s needs: success rate, answer quality, answer times, or a compromise between these metrics.
Conclusion and Next Steps
This was both an intriguing and rewarding project which gave us a lot of insight into chatbots, how to work with LLMs, and how versatile they can be.
Some ideas that we want to try in the future surrounding this project are:
- Research and implement a way of asking the model a second time in the background before answering the user with “I don’t know, could you please reformulate or ask again?”
- Caching common questions to speed up answering times
- Test the impact of different hardware configurations on answering times
- Run the embedding models multiple times, analyzing which ones result in better answers and if the embeddings change and how much they change