It's not hard, it's just new.
How can you, your business unit, and your enterprise use the exciting, emerging field of Generative AI to develop brand-new functionality? And once you’ve identified your use cases, how do you successfully build Generative AI into your applications? How do you scale it to production grade?
This series of blog posts will take you on a journey from absolute beginner (where I was a few months ago) to building a fully functioning, scalable application. Our example Gen AI application will use the Kappa Architecture as the architectural foundation, plus rely on the Kora engine—a cloud-native event streaming platform for Kafka.
Created by Jay Kreps, the co-founder of Confluent and original co-creator of Apache Kafka®, the Kappa Architecture is a data processing framework designed for real-time, stream-processing systems. It simplifies data processing by treating all data as an infinite stream, eliminating the need for batch processing and enabling low-latency stream processing on incoming data.
As we build our application, we’ll tackle some of the hard problems in building Gen AI applications:
Context: The LLM knows nothing about your customer, so how do you create personalized interactions that will resonate with your users?
Memory: LLMs are stateless, meaning they have no ability to remember the last thing you talked about.
Private data: LLMs have been trained on terabytes of public data, but how can you leverage the insights and data from within your organization safely?
Scaling up: It’s great to play around with Generative AI, but how do you use it to meet the demands of large-scale usage?
By using Kappa, we gain core capabilities for solving these hard problems: stream processing tackles context, Kora’s durable, long-term storage solves memory, sink connectors to a vector database handle private data, and, last but not least, scaling is accomplished via the Kora engine and microservices consumption patterns. These all play a role in retrieval-augmented generation (RAG). We’ll leverage learnings from years of distributed computing, publish/subscribe, durable storage, and stream processing as we build our Generative AI app.
In the next blog post, we’ll dive deeper into the Kappa Architecture and learn how its unique characteristics provide the backbone for building our Generative AI application. The third and subsequent posts will each tackle one of the hard problems outlined above, exploring each in greater detail and extending our application with increasing functionality.
In this post, we’ll dive into the basics of GenAI and large language models, and explore how you can integrate them within your own organization.
So, what exactly is this stuff? You’ve probably been inundated with a lot of new concepts lately—GPT, large language models (LLM), Generative AI, prompt engineering, and more.
ChatGPT is a large language model, as are Google’s Bard, Microsoft’s Bing, Anthropic’s Claude, and Meta’s open source Llama 2. ChatGPT is the best known, and currently its name is used almost interchangeably to describe the entire field of LLMs.
An LLM is an artificial intelligence (AI) system that is designed to understand and generate human language. It is essentially a machine learning algorithm that is trained on a massive amount of text data, such as books, articles, and websites, to learn the patterns and structures of natural language. LLMs are often used for tasks such as language translation, text summarization, and question/answering. They can also generate new text based on a given prompt or topic, such as “how do I bake a cake?” or “what are some low-impact exercise ideas?”.
Have you ever wondered what the GPT part of ChatGPT stands for? I know I did, so let’s break down the name a bit. The chat part is obvious: it’s how we interact with the AI using natural language. The GPT part stands for generative pre-trained transformer—let’s dig into that a bit further.
Generative means that it can generate the next best word, pixel, audio note, etc. that makes sense in the context of what the AI is being asked to output. Pre-trained refers to it being trained on millions, billions, or trillions of pieces of information. GPT-4 used an estimated 65,000+ NVIDIA GPUs and 10,000 hours of training, which cost upwards of USD $100 million. Those are some big numbers. Effectively it "knows" what's been passed into it.
Lastly, the transformer is the foundational component in the LLM stack and is used by all current LLMs. It was defined in the foundational paper from Google in 2017, "Attention Is All You Need." Transformers work so well in LLMs because they model the relationship between words and concepts in text. They can distinguish the context of how words are used and ensure that fidelity and nuance are maintained. For example, a word like “novel” can refer to a book or to something that is new or original, and the transformer is able to distinguish the intended meaning. Transformers are also able to learn to represent words and phrases in a more conceptual way, which gives them the ability to generate natural-sounding output.
What makes this new form of general AI so different from the previous narrow AI is the LLM’s ability to transform all of human knowledge and treat it as a language. That means text, audio, video, and images; all of these are transformed into a language, the very thing that homo sapiens used to become the dominant species on the planet. It's what separates us from all the other species and gives us the ability to have a shared understanding of reality, to build on prior experiences (creating knowledge), and to pass on that knowledge without having to first experience it yourself (don't eat red mushrooms).
Over tens of thousands of years and generations of humans, the development of language along with the ability to tell and pass on stories is what gave us the edge over stronger and more able competition. This ability was our superpower and gave us a shared understanding of the world and a shared purpose. Knowledge vastly accelerated with the emergence of writing and the development of the modern printing press.
The abstract concepts of our languages are starting to be better understood by machines, opening up entirely new ways to engage with our laptops, phones, and applications using natural language. And just as importantly, the AI can now communicate back to us in the very same natural language.
As touched on in the intro, many of us have used ChatGPT and are familiar with the question/response conversation format, but that's only a part of the capabilities of what an LLM can do. How do we unlock the full power and really interact and communicate with the AI?
To tap into the wider abilities of AI, you need to familiarize yourself with the prompt. The prompt is how your natural language instructions are passed to, and interpreted by, the LLM.
Think of AI as having multiple assistants, each one with different capabilities. Of course, you could ask each assistant to do anything, but you might not get the best result or even what you'd be expecting. But by nudging the prompt, you can get it to perform these sorts of tasks:
Generation of text, stories, ideas
Conversation in a chat style
Transformation of existing text into a new form: JSON, Markdown, CSV
Translation from one language to another
Analysis
Summarization of long-form content to generate new insights
Fortunately, you don’t need to learn some new programming language or even have any programming background. You interact with the LLMs using the very language that you’ve been using your entire life and that the LLM has been trained on—text. A remarkable aspect of the prompt is our ability to communicate with it like it was a human. It makes sense, of course, the LLM has been trained on billions of pieces of human knowledge and it has learned how we communicate and how we expect information to be presented to us in a casual chat or question/answer style.
Here are a few example prompts to explore a range of interaction styles:
Topic summary prompt: "Please summarize the key points of our previous discussion about Generative AI and the Kappa architecture in 4 bullet points of no more than 2 sentences each."
Question/answer prompt: "Given the following blog post, what are the main benefits of using the Kappa architecture for building Generative AI applications? [blog post]:" <insert blog text>
Story generation prompt: "Let's imagine we are starting a new AI startup. Please generate four startup ideas and for each idea generate a short 3 paragraph story about our journey from initial idea to building an early prototype."
Table generation prompt: "Can you provide a table with 3 columns comparing the differences between Narrow AI, General AI, and Superintelligence?"
Translation prompt: "Translate the following paragraph into Spanish: Generative AI models have the ability to produce new text in a style guided by examples provided to them in natural language."
Sentiment analysis prompt: "Analyze the sentiment of this customer review and specify if it is positive, negative, or neutral: 'The shoes took forever to arrive but they’re the most comfortable shoes I’ve ever run in.'"
Who makes a good prompt engineer? Stephen Wolfram said in a recent interview with Lex Fridman that “a good expository writer is a good prompt engineer”. This is great as you don’t need an ML degree, computer science degree, or even programming knowledge! Yay. All you need to do is be able to “explain stuff well”.
All current LLMs are stateless, meaning that every single interaction is a stand-alone session. You might be wondering how you can have a conversation that appears to be back and forth if there’s no memory. This is where prompt engineering starts to come into play. Think of the prompt as your working space or short-term memory; this is then combined with the long-term memory and knowledge of the LLM.
When working with an LLM, you are effectively starting at ground zero: you have this massive brain of combined knowledge sitting there, waiting to do your bidding. The unique properties of the LLM mean that context, knowledge, and form have all become orthogonal to each other. This provides an amazing amount of potential.
Context equals the question you've stated, all chat history, and perhaps information on who you are or how you'd like to narrow in on specific areas. Knowledge is what the LLM has been trained on and any new information/facts inserted into the prompt. Form is the tone of how you want the information presented back to you.
Let's build on this a bit more with the analogy of a movie with a set decorator and a director—you get to play both parts 🙂. (See, being a prompt engineer is kinda fun!)
Context: Who am I? What's my background, what persona is this information meant for? What have we been talking about so far?
Knowledge: What question am I asking or what task am I putting into action?
Form: Who is going to help me with the task? Einstein, Dr Seuss, Shakespeare, who do I want my persona to embody? How do I want this information presented to me? As a two-page resume, as a rap song by Eminem, as JSON or Markdown? Maybe a table? You get the picture.
If we’ve been having a chat for a bit, your questions and the LLM’s responses are stored to the side and injected into the prompt (behind the scenes) along with newly submitted questions (there's a lot going on behind the curtains!).
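To make this concrete, here is a minimal sketch (in Python, with illustrative names and contents) of how context, knowledge, and form, plus earlier chat turns, might be assembled into a single prompt behind the scenes. The message structure mirrors the chat-style APIs discussed later in this post; it is not any particular vendor’s implementation.

```python
# A minimal sketch of assembling context, knowledge, and form into one prompt,
# with prior chat turns injected behind the scenes. Names are illustrative.

def build_prompt(persona, task, form, chat_history, retrieved_facts=None):
    """Combine context (who you are, what's been said), knowledge (the task
    plus any new facts), and form (how the answer should look)."""
    system = (
        f"You are assisting {persona}. "   # context: who the user is
        f"Present your answer {form}."     # form: tone and structure
    )
    messages = [{"role": "system", "content": system}]
    messages.extend(chat_history)          # context: the short-term memory
    user_turn = task                       # knowledge: the question or task
    if retrieved_facts:                    # knowledge: facts injected into the prompt
        user_turn += "\n\nRelevant information:\n" + "\n".join(retrieved_facts)
    messages.append({"role": "user", "content": user_turn})
    return messages

history = [
    {"role": "user", "content": "What is the Kappa Architecture?"},
    {"role": "assistant", "content": "A stream-first approach to data processing..."},
]
prompt = build_prompt(
    persona="a data engineer new to Generative AI",
    task="Summarize why Kappa suits Gen AI applications.",
    form="as 3 short bullet points",
    chat_history=history,
)
```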
Sometimes the above is sufficient to elicit the information you need in the format you need it in, but sometimes your tasks might be more specific, or maybe you've asked for something entirely new to the LLM. What do you do when you don't get consistently good results even with the above? This is where you can add even more context to the prompt by giving it examples of what you're looking for, which can take the following forms:
Zero-shot prompting: Just the question and any context you've fed in. This is what has been covered above and LLMs are surprisingly good at providing decent results
Single-shot prompting: You provide one example of what you're looking for. This is good if you want a summarization of text and you want it in a certain format like [Headline] and [4 bullet points]
Few-shot prompting: Give 3-5 examples of what good looks like for both the input and the expected output (see the sketch after this list)
“Let’s think step by step” hack: If you're asking a more complex question that has multiple parts, simply putting a "Let’s think step by step" after your question can get the LLM to break things down and come up with better results. (Weird, huh? Almost like humans…)
Chain-of-thought prompting: This is useful when you are asking more complex questions that might involve arithmetic and other reasoning tasks.
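Here is an illustrative few-shot prompt, shown as a Python string so it could be dropped straight into a prompt template. The examples and labels are made up; the point is the pattern: a handful of input/output pairs that show the model the exact format you want before it sees the real input.

```python
# An illustrative few-shot prompt: example input/output pairs teach the model
# the task and the output format, then the real input is appended at the end.

few_shot_prompt = """Classify the sentiment of each review as Positive, Negative, or Neutral.

Review: "The shoes arrived early and fit perfectly."
Sentiment: Positive

Review: "The app crashes every time I open it."
Sentiment: Negative

Review: "Delivery took a week, which was about what I expected."
Sentiment: Neutral

Review: "The shoes took forever to arrive but they're the most comfortable shoes I've ever run in."
Sentiment:"""
```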
For more in-depth information on prompt engineering, I highly recommend reading Lilian Weng's amazing post as well as OpenAI’s examples.
What’s the best way to get started and become familiar with this exciting and fast-paced technology? First, just play around with it and get comfortable with it. You can sign up to ChatGPT for free, Google’s Bard is also free to use as long as you have a Google account, and Microsoft has integrated AI directly into the Bing search engine. Each of the services also includes some handy prompts you can just click on that give a great feel for what’s possible.
Once you’ve had a bit of a play with any of the chat LLMs above, I'd suggest looking at some of the examples posted by OpenAI to give yourself a broader view of what's possible. Become comfortable with the chat interface, how to tune it, how to summarize information to narrow down the response, and how to transform responses into a table or CSV output.
Next, you should start to think about your business and what could be done to improve productivity across different functions in your organization. Use greenfield thinking here. Imagine that you can break down invisible barriers, pick out difficult or repetitive tasks, and come up with a list of ideas.
With an understanding of how LLMs work and some time spent using the tools, you should start thinking about how Generative AI could apply to the list you put together. Consider what new business models might be available to your organization if you had access to Generative AI. What new use cases could be enabled or processes simplified?
Once you have a sense of what's possible, the next step is figuring out how to actually integrate these cutting-edge AI capabilities into your business and workflows.
Now that we've explored the basics of how chatbots and LLMs work, as well as getting hands-on experience using them, let's dive into the different options for integrating AI into your organization.
Here are four main ways you can integrate generative AI into your organization:
The first one is to simply encourage your team to access tools such as ChatGPT, Bard, or Bing (as mentioned above) and, if they are engineers, also encourage them to use code-gen tools such as GitHub Copilot. Copilot integrates into all popular IDEs and acts as a pair programmer, always there to offer suggestions, help figure out code, and even generate entire functions or methods. This can speed up coding and increase productivity. I’ve benefited from ChatGPT immensely for this exact use case. I was pretty rusty with Python when I started this series, so it’s been incredibly helpful to have this encouraging brain keeping me from going off the rails.
At this stage, it’s recommended not to use any company or proprietary data in any interaction with LLMs, as it can be used by the LLM in future training runs. This would most definitely not be okay with your company's data policies or Infosec teams. Let your team know that company data is not okay to upload or copy/paste. Having this as a hard rule removes a lot of risk but still allows for a lot of exploration.
The second way (and the one we'll be spending the most time on in the following blog posts) is to integrate aspects of your applications or line of business with a third-party LLM via APIs: ChatGPT, Bard (in early beta), and Claude 2 are good examples of this style of API integration. Each has a rich API along with documentation that helps to get started quickly. The barrier to entry is very low, and you can choose an API that only bills you per usage.
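As one example of how low the barrier is, here is a minimal sketch of this integration style using OpenAI’s Python client. The model name, environment variable, and messages are assumptions for illustration; check the provider’s documentation for current details, and the pattern is much the same for the other APIs mentioned above.

```python
# A minimal sketch of calling a third-party LLM over its API (OpenAI shown as
# one example). Model name and environment variable are illustrative.

import os
from openai import OpenAI  # pip install openai

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # pay-per-usage model; swap for whichever model you use
    messages=[
        {"role": "system", "content": "You are a concise assistant for our support team."},
        {"role": "user", "content": "Explain the Kappa Architecture in two sentences."},
    ],
    temperature=0.2,  # lower values give more predictable answers
)

print(response.choices[0].message.content)
```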
With API integration, you can also integrate custom or private data into the LLM prompt via semantic search and embeddings, powered by a vector database. Semantic search uses word embeddings to compare the meaning of a query to the meaning of the documents in its index. Your users get more relevant results, even if the query does not contain the exact words that are present in the documents.
Embeddings are numerical representations of objects, such as words, sentences, or even entire documents, in a multi-dimensional space. By placing similar items closer together in vector space, embeddings enable us to evaluate the relationship between different entities.
In the example diagram below, we can visualize how concepts like cat and dog are quite close to each other as they have a stronger similarity to one another than, for example, a spider or a human. On the left, you can see the concepts of vehicle and car which are far removed from the organic life which is grouped on the right.
To perform a semantic search, you would first send your query to the vector database, which transforms your query into embeddings and searches for semantically similar content. Any documents that contain concepts similar to the query are returned, and these results are appended to the prompt. You’re not storing anything in the LLM; you’re only sending small excerpts to provide the best response to your user.
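Here is a simplified sketch of that retrieval step. It uses OpenAI’s embeddings endpoint and an in-memory cosine-similarity search as a stand-in for a real vector database; the embedding model name and documents are illustrative assumptions.

```python
# A simplified retrieval sketch: embed documents and the query, find the
# closest match, and append it to the prompt. A real system would use a vector
# database rather than the in-memory search shown here.

import numpy as np
from openai import OpenAI  # pip install openai numpy

client = OpenAI()

docs = [
    "Our premium plan includes 24/7 phone support.",
    "Refunds are processed within 5 business days.",
    "Kora provides durable, long-term storage for Kafka topics.",
]

def embed(texts):
    """Turn text into embedding vectors (numerical representations of meaning)."""
    result = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in result.data])

doc_vectors = embed(docs)
query = "How long do refunds take?"
query_vector = embed([query])[0]

# Cosine similarity: documents whose meaning is closest to the query score highest.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_doc = docs[int(np.argmax(scores))]

# Append only the most relevant excerpt to the prompt, not the whole knowledge base.
prompt = f"Answer using this context:\n{best_doc}\n\nQuestion: {query}"
```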
We’ll learn more about semantic search, embeddings, and vector databases in detail and build the functionality into our app in the fifth blog post, which covers how to handle private data. If you don’t want to wait until then (and who does?) this topic is partly covered in our recent GPT-4 + Streaming Data blog post.
The third option is to use an open source LLM or train your own model from scratch. Reasons to consider this might be:
You have a knowledge base that is unlikely to be present in existing pre-trained language models
You have very particular tasks to perform that aren't being met by a commercial LLM (think fraud detection or very workflow-specific tasks that can't be generalized)
You are finding that the inference costs of commercial LLMs no longer make business sense
You may also choose to host, though not necessarily train, your own LLM for security reasons. Any data that you pass into a third-party API may be subject to use by the model owner. Maintaining an LLM within your own business can restrict the risk that private information leaks out.
There has been an explosion of open source models such as Llama 2, Mosaic MPT-7B, Falcon, and Vicuna, many of which also provide commercial use licenses. While they aren’t yet comparable to GPT-4, they are definitely catching up to the capabilities of GPT 3.5-turbo (the model used in ChatGPT). Just a few months ago, these models weren’t even close to the level of GPT 3—the Gen AI space is moving at a rocket pace!
When looking at models, you’ll notice things like 7B, 13B, 40B, etc. The “B” represents billions and it marks the number of parameters the model has. The number of parameters in the model determines how complex the model will be and the amount of information it can both process and store. The larger the number of parameters the more sophisticated the model is, but also the more computationally expensive it is to train and run. Modern laptops such as M1 and M2 Macs and Windows machines with newer generation GPUs are able to run lower count models (7B and 13B) with acceptable performance.
Lower-parameter models like 7B and 13B can serve as a great starting point if your use cases aren’t complex, or you need faster inference. If your needs are more complex, you can select models with higher parameter counts to meet the needs of your use case.
Open source models allow you to pick one that is suited to a specific task: chat, translation, coding, customized assistant, etc. There are literally hundreds of permutations of the above models tuned for different tasks. Hugging Face has hundreds of thousands of models to explore and download, and it also maintains a leaderboard that ranks the most powerful LLMs. Once you’ve determined the right model, you can choose the size of the model (number of parameters) and then freely deploy it in any environment, from private data centers to public clouds to a growing number of platforms that help run and manage your LLM.
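To give a feel for how little code this takes, here is a minimal sketch of running a smaller open source model locally with the Hugging Face transformers library. The model name is just an example of a 7B-class model; pick whichever model fits your hardware, task, and license requirements.

```python
# A minimal sketch of running a 7B-class open source model locally.
# pip install transformers torch accelerate

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b-instruct",  # example 7B model; swap for your choice
    trust_remote_code=True,            # some models ship custom code; review before enabling
    device_map="auto",                 # place the model on a GPU if one is available
)

output = generator(
    "Explain what a vector database is in one paragraph.",
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```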
Another path to a custom LLM would be to train it from scratch. To do this, you're looking at an expense of hundreds of thousands or even millions of dollars. Sam Altman has said that the cost to train GPT-4 on terabytes of data over many months ran over USD $100 million. The cost mainly comes from procuring GPUs for compute power, time running those GPUs in the cloud, and the team and specialist skills necessary to do the actual work and monitoring. This expertise will be hard to come by and very expensive, as the Generative AI field is incredibly new and skills are in very high demand. Even experienced ML teams are learning these methods on the fly and will need time to iterate. Costs are definitely coming down, though: MosaicML recently released MPT-7B, a new commercially licensed model that they've stated was trained in 9.5 days with zero human intervention at a cost of roughly USD $200K. The same team said in a recent podcast that they’re trying to cut those costs in half.
Smaller models with 7 billion parameters can be trained for tens of thousands of dollars. Things are moving fast and changing on a monthly basis.
If you are starting with a base-level open source LLM and you need to adapt the model to specific tasks or domains beyond its pre-trained capabilities, then fine-tuning is a great option. Examples of new tasks you can teach the LLM include text summarization, virtual assistants that use the unique voice of your company, text generation within constraints such as legal/medical, recommendation systems that use real-world customer purchases, and document classification.
Another reason you might consider fine-tuning is that by incorporating numerous, highly ranked user interactions, you can better train the LLM to generate highly relevant text with fewer examples and fewer tokens needing to be passed into the prompt. If you are serving many interactions on a daily basis and using a commercial LLM, the savings from reducing the size of the prompt can really add up. OpenAI reports that prompt length can be reduced by up to 90% while maintaining performance.
Fine-tuning an existing LLM costs in the range of USD $1-5K. The cost is so much lower because the amount of data required to fine-tune an LLM is in the range of 100-300 high-quality examples, which starts to give very good task performance. If the initial training set doesn’t provide the results you are looking for, doubling the number of examples will give you a linear increase in performance.
If you are using OpenAI’s commercial API, you’re in luck. In August, OpenAI enhanced the GPT 3.5-turbo model to allow fine-tuning by customers. This is the same powerful and widely used model powering ChatGPT and Microsoft’s Bing AI. Later this year, they’ll release this feature for the GPT-4 model as well.
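As a rough sketch of what that looks like in practice, the flow is: prepare a small JSONL file of example conversations, upload it, and start a fine-tuning job. The file name and example contents below are illustrative; see OpenAI’s fine-tuning guide for the exact data format and current model names.

```python
# A minimal sketch of fine-tuning gpt-3.5-turbo via OpenAI's API.
# File name and training examples are illustrative.

from openai import OpenAI  # pip install openai

client = OpenAI()

# Each line of the JSONL file is one training example in chat format, e.g.:
# {"messages": [{"role": "system", "content": "You are our support assistant."},
#               {"role": "user", "content": "Where is my order?"},
#               {"role": "assistant", "content": "Let me check that for you..."}]}
training_file = client.files.create(
    file=open("fine_tune_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)  # poll until the job finishes, then call the new model by name
```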
The AI revolution has arrived. While we’re still in the early days, the possibilities are limitless as this technology continues rapidly advancing. Now is the time to start exploring how your business can ride this wave to provide next-level customer value and empower employees.
While you’re waiting for the next post in the series, definitely read Confluent’s blog post: GPT-4 + Streaming Data = Real-Time Generative AI. It follows an example of how an airline can integrate with GPT-4 to provide more customer-focused interactions.
Here’s a quick preview of what will be coming up in this Gen AI series:
2. Kappa Architecture: Building next-generation applications requires a new approach. Learn about the unique characteristics of the Kappa Architecture and how to integrate it with LLMs.
3. Context: Taking everything you know about your customer and creating personalized interactions at every touch point. You’ll learn about stream processing and creating an actionable, in-the-moment customer 360 view that will help power the chat experience.
4. Memory: Give your LLM long-term memory by taking advantage of durable storage in Kora. You’ll learn about durable, infinite storage in Kora and how it’s the perfect fit for long-term storage of conversations and user interactions.
5. Private data: Tap into your organization’s critical advantage, its data, and combine with the LLM to generate new insights and knowledge. You’ll learn about Semantic Search, Embeddings, and Vector DBs and how to integrate them into our app.
6. Scaling up: Kora is built to scale. You’ll learn about Kora’s unique scaling capabilities and how to use this to scale our microservice and application as its popularity grows.