It’s no secret that artificial intelligence is booming. With the advent of ChatGPT in 2022, AI has become a hot topic, and naturally, many developers are increasingly interested in building with it. There is no shortage of resources to learn from and reference.
There is one glaring roadblock right now, though: most AI tutorials and resources focus on Python and vector databases like Pinecone or Supabase PostgreSQL. These are great options, and I recommend learning them, but what about the TypeScript/JavaScript and Node.js crowd who want to experiment?
So, I set myself a challenge. Could I build an AI-based application that leverages OpenAI GPT APIs, works with vectors and doesn’t require Python or a vector database? It turns out the answer to that challenge is: absolutely.
Going even further, I wanted to avoid libraries like Langchain. The way I’ve done things here is the difficult way. You’d use a library like Langchain (or equivalent) in a real-world application instead. I’ve got an example of a Langchain-based solution I created here.
This is a classic example of doing everything the hard way, but it makes a good introduction to working with the OpenAI APIs, because it exposes concepts like cosine similarity that tools and libraries such as Langchain normally handle for you.
P.S. If you’re looking for the code to perform the embedding search, scroll down a bit.
Building a scraper
The first port of call was to create a script that scraped my blog (the one you’re reading now) and make a CSV content file.
As you will see in the file below, I load my blog and then scrape the HTML using Cheerio. It’s a big file because I am a stickler for TypeScript typing. Because we don’t want to send massive bodies of text to the API and incur a huge bill, we scrape the content and chunk it (break large pieces of content up into smaller chunks).
/* eslint-disable prettier/prettier */
import axios from 'axios';
import cheerio from 'cheerio';
import { stringify } from 'csv-stringify';
import { encode } from 'gpt-3-encoder';
import fs from 'fs';

const url = (page: number): string => `https://ilikekillnerds.com/page/${page}`;

interface ChunkedContent {
  content: string;
  tokens: number;
}

interface PostData {
  title: string;
  content: string;
  tokens: number;
  url: string;
  postId: number | null;
  publishDate: string | null;
  chunkId: number;
}

// Scrape a listing page and collect links to the individual posts
const scrapePage = async (page: number): Promise<string[]> => {
  const response = await axios.get(url(page));
  const $ = cheerio.load(response.data);

  const articles = $('article');
  const postLinks: string[] = [];

  articles.each((_i, article) => {
    const link = $(article).find('.entry-title a').attr('href');
    if (link) {
      postLinks.push(link);
    }
  });

  return postLinks;
};

const fetchPostData = async (url: string): Promise<PostData[]> => {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  const title = $('h1.entry-title').text().replace(/\s+/g, ' ');
  const content = $('.entry-content *')
    .not('img, iframe, figure, pre > code, code, video, picture')
    .text()
    .replace(/\s+/g, ' ')
    .replace(/\.([a-zA-Z])/g, '. $1');

  const chunkedContent = chunkContentByTokens(content, 200);

  // Extract post ID
  const bodyClass = $('body').attr('class') ?? '';
  const postIdRegex = /postid-(\d+)/;
  const postIdMatch = postIdRegex.exec(bodyClass);
  const postId = postIdMatch ? parseInt(postIdMatch[1], 10) : null;

  // Extract and format publish date
  const publishedTime = $('meta[property="article:published_time"]').attr(
    'content',
  );
  const publishDate = formatDate(publishedTime);

  // Get the title, content, token length, url, post ID and publish date
  return chunkedContent.map((chunk, index) => ({
    title,
    content: chunk.content,
    tokens: chunk.tokens,
    url,
    postId,
    publishDate,
    chunkId: index + 1,
  }));
};

// This sub function breaks up content using loose token math to ensure scraped content
// isn't too big (or we'll blow out the cost of our AI calls)
const chunkContentByTokens = (
  content: string,
  maxTokens: number,
): ChunkedContent[] => {
  const sentences = content.split('. ');
  const chunkedContent: ChunkedContent[] = [];
  let currentChunk = '';
  let currentTokens = 0;

  for (const sentence of sentences) {
    const sentenceTokens = encode(sentence).length;

    if (currentTokens + sentenceTokens <= maxTokens) {
      currentChunk += `${sentence}. `;
      currentTokens += sentenceTokens;
    } else {
      chunkedContent.push({
        content: currentChunk.trim(),
        tokens: currentTokens,
      });
      currentChunk = `${sentence}. `;
      currentTokens = sentenceTokens;
    }
  }

  if (currentChunk) {
    chunkedContent.push({
      content: currentChunk.trim(),
      tokens: currentTokens,
    });
  }

  return chunkedContent;
};

// Could be replaced with a package or simpler code
const formatDate = (dateString: string | undefined): string | null => {
  if (!dateString) return null;

  const date = new Date(dateString);
  const monthNames = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December',
  ];

  const month = monthNames[date.getMonth()];
  const day = date.getDate();
  const year = date.getFullYear();

  return `${month} ${day}, ${year}`;
};

(async () => {
  let page = 1;
  let processing = true;
  const results: PostData[][] = [];

  // Walk the paginated blog listing (capped at 50 pages)
  while (processing && page !== 51) {
    const postLinks = await scrapePage(page);

    if (postLinks.length === 0) {
      processing = false;
      break;
    }

    for (const link of postLinks) {
      const postData = await fetchPostData(link);
      results.push(postData);
    }

    page++;
  }

  const combinedResults: PostData[] = [];

  for (const result of results) {
    combinedResults.push(...result);
  }

  stringify(
    combinedResults,
    {
      header: true,
      columns: [
        'title',
        'content',
        'url',
        'tokens',
        'postId',
        'publishDate',
        'chunkId',
      ],
    },
    (err, output) => {
      if (err) {
        console.error(err);
        return;
      }

      fs.writeFileSync('posts.csv', output);
    },
  );
})();
As you can see in the generated CSV file in the screenshot below, the same blog post appears multiple times (this is due to our quick and dirty chunking):
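To give you an idea of the shape of the data (the title, URL, ID and date here are made up), the rows look something like this:

title,content,url,tokens,postId,publishDate,chunkId
"Some post title","First ~200-token chunk of the post...","https://ilikekillnerds.com/some-post/",198,1234,"June 1, 2020",1
"Some post title","Second chunk of the same post...","https://ilikekillnerds.com/some-post/",185,1234,"June 1, 2020",2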
Create the vector embeddings
Now that we have the CSV file, the next step is to convert it to vector embeddings. OpenAI provides an API that does this and is quite affordable. I sent up seven years’ worth of blog posts, and I don’t think it cost more than $2 (I think even less).
Once again, I used TypeScript and Node.js to generate the embeddings:
/* eslint-disable prettier/prettier */
import fs from 'fs';
import csvParser from 'csv-parser';
import { Configuration, OpenAIApi } from 'openai';

interface DataRow {
  content: string;
  url: string;
  title: string;
  publishDate: string;
  postId: string;
  chunkId: string;
  [key: string]: string;
}

interface EmbeddingEntry {
  embeddings: number[];
  content: string;
  url: string;
  title: string;
  publishDate: string;
  postId: string;
  chunkId: string;
}

const config = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
  organization: process.env.OPENAI_ORG_ID,
});

const openai = new OpenAIApi(config);

// Load the CSV file produced by the scraper
async function loadData(): Promise<DataRow[]> {
  const data: DataRow[] = [];

  return new Promise((resolve, reject) => {
    fs.createReadStream('posts.csv')
      .pipe(csvParser())
      .on('data', (row: DataRow) => {
        data.push(row);
      })
      .on('end', () => {
        resolve(data);
      })
      .on('error', (error: Error) => {
        reject(error);
      });
  });
}

async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.createEmbedding({
    input: text,
    model: 'text-embedding-ada-002', // The embedding model we use
  });

  return response.data.data[0].embedding;
}

async function generateEmbeddings(): Promise<void> {
  const data = await loadData();
  const embeddings: { [key: string]: EmbeddingEntry } = {};

  for (const row of data) {
    const contentEmbedding = await getEmbedding(row.content);
    const chunkKey = `${row.postId}_${row.chunkId}`;

    embeddings[chunkKey] = {
      embeddings: contentEmbedding,
      content: row.content,
      url: row.url,
      title: row.title,
      publishDate: row.publishDate,
      postId: row.postId,
      chunkId: row.chunkId,
    };

    console.log('Saved embedding for post', row.postId, 'chunk', row.chunkId);

    await new Promise((resolve) => setTimeout(resolve, 200)); // Add delay between requests
  }

  fs.writeFileSync('embeddings.json', JSON.stringify(embeddings));
  console.log('Embeddings saved');
}

generateEmbeddings();
In this step, we iterate over the content in our CSV file and send it to OpenAI to get back the embeddings. You will notice that we then write the contents to a file called embeddings.json, which is going to act as our database.
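Each entry in embeddings.json is keyed by postId_chunkId and stores the embedding vector alongside the scraped metadata. Trimmed down (and with made-up values), an entry looks roughly like this:

{
  "1234_1": {
    "embeddings": [0.0123, -0.0456, 0.0789],
    "content": "First ~200-token chunk of the post...",
    "url": "https://ilikekillnerds.com/some-post/",
    "title": "Some post title",
    "publishDate": "June 1, 2020",
    "postId": "1234",
    "chunkId": "1"
  }
}

The real embeddings array contains 1,536 numbers per chunk for text-embedding-ada-002, which is why this file gets large quickly.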
Building the brains of the operation
Once we have the embeddings, we’ll write some TypeScript to take these embeddings and make them searchable. We want users to be able to ask questions about content. I am using my blog posts, but you could make this work with documents or any other type of content. Admittedly, most of the work was getting the content and chunking it.
You will need to make sure the openai package is installed.
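One thing worth noting: the code below targets the older v3-style Node SDK (the Configuration and OpenAIApi classes). If you are following along, pinning the major version should get you a compatible install (the exact pin is my assumption, not something from the original setup):

npm install openai@3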
import * as fs from 'fs';
import {
  Configuration,
  OpenAIApi,
  ChatCompletionRequestMessageRoleEnum,
} from 'openai';

interface EmbeddingData {
  embeddings: number[];
  content: string;
  title: string;
  url: string;
  publishDate: string;
  postId: string;
  chunkId: string;
}

interface ScoreData {
  [key: string]: {
    postId: string;
    title: string;
    url: string;
    publishDate: string;
    content: string;
    score: number;
  };
}

const SYSTEM_PROMPT = `You are a blog searching chatbot for ilikekillnerds.com. Only answer the question by using the provided context. If you are unable to answer the question using the provided context, say you do not know the answer. The current year is ${new Date().getFullYear()}. When responding, do not refer to the provided content as 'the context' as it is implied already. Also, instead of author, refer to the author as Dwayne.`;

const EMBEDDING_MODEL = 'text-embedding-ada-002';
const COMPLETIONS_MODEL = 'gpt-3.5-turbo';

// Add your own API key and organisation ID here
const OPEN_AI_KEY = '';
const OPEN_AI_ORG = '';

const configuration = new Configuration({
  apiKey: OPEN_AI_KEY,
  organization: OPEN_AI_ORG,
});

const openai = new OpenAIApi(configuration);

// The dot product is the sum of the products of each pair of corresponding components
function dotProduct(vectorA: number[], vectorB: number[]): number {
  return vectorA.reduce((sum, a, index) => sum + a * vectorB[index], 0);
}

// The magnitude (length) of a vector is the square root of the sum of its squared components
function magnitude(vector: number[]): number {
  return Math.sqrt(vector.reduce((sum, value) => sum + value * value, 0));
}

// Cosine similarity: the dot product divided by the product of the magnitudes
function cosineSimilarity(vectorA: number[], vectorB: number[]): number {
  const product = dotProduct(vectorA, vectorB);
  const magnitudeA = magnitude(vectorA);
  const magnitudeB = magnitude(vectorB);

  return product / (magnitudeA * magnitudeB);
}

async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.createEmbedding({
    input: text,
    model: EMBEDDING_MODEL,
  });

  return response.data.data[0].embedding;
}

// A lot of the contents of this file could be replaced with vector database queries
async function search(query: string): Promise<any[]> {
  const embeddingsData = JSON.parse(
    fs.readFileSync('embeddings.json', 'utf8'),
  );

  const queryEmbedding = await getEmbedding(query);

  const scores = Object.entries(embeddingsData).reduce<ScoreData>(
    (acc, [chunkKey, data]: [string, EmbeddingData]) => {
      const contentEmbedding = data.embeddings;
      const content = data.content;
      const title = data.title;
      const url = data.url;
      const publishDate = data.publishDate;
      const postId = data.postId;

      const score = cosineSimilarity(queryEmbedding, contentEmbedding);

      // Keep only the highest-scoring chunk for each post
      if (acc[postId]) {
        if (acc[postId].score < score) {
          acc[postId] = { postId, title, url, publishDate, content, score };
        }
      } else {
        acc[postId] = { postId, title, url, publishDate, content, score };
      }

      return acc;
    },
    {},
  );

  const sortedScores = Object.values(scores).sort(
    (a: { score: number }, b: { score: number }) => b.score - a.score,
  );

  return sortedScores.slice(0, 5);
}

async function getAnswer(query: string): Promise<string> {
  const mostRelevantItem = (await search(query))[0];

  const messages = [
    {
      role: ChatCompletionRequestMessageRoleEnum.System,
      content: `${SYSTEM_PROMPT} \n\n --- This provided blog post was published at the following URL: ${mostRelevantItem.url} on ${mostRelevantItem.publishDate}. When answering the user, provide a link to this post.\n --- ${mostRelevantItem.content}`,
    },
    {
      role: ChatCompletionRequestMessageRoleEnum.User,
      content: query,
    },
  ];

  const response = await openai.createChatCompletion({
    model: COMPLETIONS_MODEL,
    messages,
    max_tokens: 2000,
    temperature: 0.0,
  });

  return response.data.choices[0].message.content;
}
Wow. What a mouthful. Okay, so a few things. Because this is a dependency-free implementation (save for the OpenAI SDK), we are doing some mathematical heavy lifting ourselves. You could replace the cosine similarity and other functions with existing packages, but learning this stuff for yourself is more fun.
What’s with the nerdy math functions?
Let’s break down each function (dotProduct, magnitude and cosineSimilarity). Also, please keep in mind I am not an expert. I am a front-end developer still learning how all this AI stuff works. Feel free to correct me if you spot a mistake in my explanations here.
dotProduct(vectorA: number[], vectorB: number[]): number
This function calculates the dot product of two vectors, vectorA and vectorB. The dot product is the sum of the products of their corresponding components. The function uses the reduce method to calculate this sum, taking an initial value of 0 and then iterating over each element a of vectorA while using its index to access the corresponding element in vectorB. The product of the two elements (a * vectorB[index]) is added to the running sum during each iteration.
magnitude(vector: number[]): number
This function calculates the magnitude of a vector. A vector’s magnitude (or length) is the square root of the sum of the squares of its components. This function uses the reduce method to calculate the sum of squares, starting with an initial value of 0 and then iterating over each element value of the vector. The square of each element (value * value) is added to the current sum during each iteration. Finally, the square root of the sum is calculated using Math.sqrt.
cosineSimilarity(vectorA: number[], vectorB: number[]): number
This function calculates the cosine similarity between two vectors, vectorA and vectorB. Cosine similarity is a measure of similarity between two non-zero vectors, calculated by dividing the dot product of the vectors by the product of their magnitudes. The function first calculates the dot product of the two vectors using the dotProduct function. Then, it calculates the magnitudes of both vectors using the magnitude function. Finally, it calculates the cosine similarity by dividing the dot product by the product of the magnitudes.
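To make that concrete, here is a tiny worked example using made-up three-dimensional vectors (real text-embedding-ada-002 embeddings have 1,536 dimensions):

// Toy vectors, purely for illustration
const a = [1, 2, 3];
const b = [4, 5, 6];

dotProduct(a, b);       // 1*4 + 2*5 + 3*6 = 32
magnitude(a);           // Math.sqrt(1 + 4 + 9) ≈ 3.742
magnitude(b);           // Math.sqrt(16 + 25 + 36) ≈ 8.775
cosineSimilarity(a, b); // 32 / (3.742 * 8.775) ≈ 0.975

A score close to 1 means the two vectors point in almost the same direction, which for embeddings means the two pieces of text are semantically similar.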
The getEmbedding function is self-explanatory. We converted our content into vectors, but when the user asks a question, we also have to turn that question into a vector. This allows us to find the most relevant content, which will be fed to the API later.
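For example (the question below is just a placeholder):

// Inside an async function
const queryEmbedding = await getEmbedding('What is dependency injection?');
// With text-embedding-ada-002 this is an array of 1,536 numbers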
The brain of the operation
Most of the heavy lifting in our code happens in the search function. It’s intimidating to look at, and I’m the one who wrote it. Once again, there are most likely better ways to write this stuff and packages you can leverage. This function searches for the most relevant results based on a given text query using cosine similarity.
- embeddingsData is created by reading and parsing the embeddings.json file. The file is assumed to contain an object with a key for each content chunk (postId_chunkId) and its corresponding embedding data (content, title, URL, publish date, and embeddings).
- The queryEmbedding is obtained by calling the getEmbedding function with the input query string. It is responsible for generating an embedding vector for the input query.
- The scores object is created by iterating over the embeddingsData using the reduce method. For each entry ([chunkKey, data]), the function calculates the cosine similarity between the query embedding (queryEmbedding) and the content embedding (contentEmbedding) using the cosineSimilarity function. It compares the query to our blog posts.
- If there is an existing entry for the current postId in the scores object, the function checks if the new score is higher than the existing score. If the new score is higher, it updates the entry with the new score and related data (postId, title, URL, publish date, content, and score). If there is no existing entry for the current postId, it creates a new entry in the scores object. We do this because we chunk content, so the same blog post might appear multiple times.
- The sortedScores array is created by converting the scores object to an array and then sorting the array in descending order based on the cosine similarity scores.
- The function returns the top 5 results by slicing the sorted array using sortedScores.slice(0, 5). You can modify this to return more or fewer. This makes a great AI-powered search.
This search function performs a content-based search, returning the top five most relevant results for the input query. The search works by comparing the query’s embedding with the precomputed embeddings of the content in the embeddings.json file using cosine similarity.
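Used on its own, search gives you a ranked list of matching posts (the query string here is a made-up example):

(async () => {
  const results = await search('How do I unit test components?'); // hypothetical query
  // Logs the five best-matching posts with their similarity scores
  console.log(results.map((r) => `${r.title} (${r.score.toFixed(3)})`));
})();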
Finally, the getAnswer function is where we produce our response to the provided query. It uses the GPT API to generate an answer based on the most relevant content found through the search function.
- The function calls search(query) to find the most relevant content for the input query string. It then takes the first result (the most relevant item) from the returned array.
- It creates a messages array with two objects:
  - The first object has the "System" role and content that includes the system prompt, the URL and publish date of the most relevant item found, and the content of that item.
  - The second object has the "User" role and content equal to the input query string.
- The function calls openai.createChatCompletion with the necessary parameters to generate an answer using the external AI model. The parameters include the model name, the messages array, a maximum token limit of 2000, and a temperature of 0.0. The temperature parameter controls the randomness of the generated text; a value of 0.0 means the model will produce deterministic output (i.e., the same output for the same input). The higher you go towards 1, the more random the results will be, and the more likely GPT is to hallucinate.
- The AI model returns a response object containing generated choices, each containing a message object. The function extracts the content of the message object from the first choice (response.data.choices[0].message.content) and returns it as the answer.
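Tying it all together, asking a question is a single call (again, the question is a made-up example):

(async () => {
  const answer = await getAnswer('What does Dwayne think about TypeScript?'); // hypothetical question
  console.log(answer); // A GPT-generated answer grounded in the most relevant post, with a link to it
})();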
Conclusion
This gives us a functional Node.js application that leverages the OpenAI GPT APIs without a vector database and with barely any dependencies. Once again, this is a learning exercise, not the recommended approach. Fortunately, Langchain handles a lot of the complexity we implemented above, and I have an example on GitHub here.
One final note: you don’t actually need to divide by the magnitude/norm. OpenAI’s embedding vectors are already normalized, so their magnitudes are 1, and the cosine similarity is simply the dot product.
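If you want to lean on that (it holds for text-embedding-ada-002 vectors, which come back unit-length), the similarity check collapses to a plain dot product:

// Assuming both vectors are already normalized (length 1), as ada-002 embeddings are,
// cosine similarity is just the dot product
function normalizedCosineSimilarity(vectorA: number[], vectorB: number[]): number {
  return vectorA.reduce((sum, a, index) => sum + a * vectorB[index], 0);
}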