Prompt Engineering, RAG, and Fine-tuning: Benefits and When to Use

Video ID: YVWxbHJakgg

YouTube URL: https://www.youtube.com/watch?v=YVWxbHJakgg

Added At: 13-06-25 21:18:59

Processed: No

Sentiment: Neutral

Categories: Education, Tech

Tags: programming, tutorial, AI, machine learning, natural language processing, data retrieval, knowledge injection

Summary

• Prompt engineering: adding dynamic content to prompts, retrieval augmented generation (RAG), and fine-tuning for large language models.
• RAG combines knowledge from external sources with model responses.
• Fine-tuning uses prompt completion pairs to teach intuition and remove unwanted behavior.

Transcript

hey I'm Mark Hennings I'm one of the
founders at entrypoint Ai and I'm
looking forward to talking to you about
prompt engineering retrieval augmented
generation fine tuning how they're
similar how they're different and how
they can work together so let's start
with prompt engineering even though
we're all probably pretty familiar with
this it's a good starting point for how
we can add Rag and fine-tuning to some
prompt engineering that we've already
done a typical prompt has some kind of
priming like you are a plumbing Q&A bot
that answers questions about Plumbing in
a helpful way
and then you might tell it to use you
know what kind of language to use based
on who your audience is um how to handle
edge cases and errors you know to ignore
things that are trying to hijack your
prompt um don't answer questions about
areas that aren't in your expertise
right and then we have this user inquiry
so this is a thing that's changing it's
the dynamic content in our prompt where
it's changing each time we're running
our requests to the
llm and then finally some kind of output
formatting now it doesn't have to be in
this order and in this case the Json
object is pretty basic it's not really
helping us we could just have it export
text um but that's a very common thing
is to want to have your output in some
kind of a consumable structured
format all right now when we add
retrieval augmented generation we're
basically just adding another Dynamic
piece of content to The Prompt it's this
knowledge we want the model to use to
answer the user inquiry so if the
inquiry is how do I fix a leaky fuc it
we're going to do some stuff in the
background we're going to take that
inquiry and we're going to try to find
the right information to answer that
question and then insert it into the
prompt so here we have some knowledge
from the plumers handbook chapter one
fixing common leaks um so if we put that
in there then the model is going to have
the information it needs to answer the
question and this is really important
because llms don't store facts they
store probabilities large language
models for example have been trained on
ACH bunch of
quotes said by people exact quotes but
they are never going to remember those
quotes verbatim they're going to predict
the most likely token and they're going
to return those quotes in kind of a
summarized or paraphrased way because
that information has kind of been
compressed into these probabilities of
what word is going to follow so if we
want a model to actually use very
specific
information then we need to provide it
in the prompt and they are very good at
dealing with information in the prompt
and staying grounded staying true to
that in their responses large language
models are very sensitive to the
information in their prompts and so
providing the right information there is
very powerful it also allows us to
expand the knowledge of the large
language model by pulling in data from
external sources which can be updated in
real time too so here's how that process
works in actually retrieving the
information first we have to set it all
up we need a database that has our
information in it so first we're going
to start with a corpor purpose of text
this could be a bunch of web pages PDFs
books you name it your company's
Internal Documentation or help center um
some kind of information that you want
the large language model to be able to
work with then we're going to split it
up this could be by paragraph
bsection um you know you want to try to
keep related text together so that if
you're inserting it into the model then
you're getting like a whole concept
because you want useful information that
the model can use to answer questions so
there's a lot of ways to split your data
but you need to split it up make a
choice there and then insert it into
your database um along with that text in
your database you're going to convert
the text into an embedding which is this
vector format it's a mathematical
structure um that can actually you
compare can compare one vector to
another vector and judge their
similarity and that's what's going to
allow us to do the knowledge retrieval
here in a minute we use an AI model to
actually generate the embeddings
and then we store the text and the
embeddings in our database now when it
comes time to calling the large language
model we need to retrieve some
information we have this user inquiry
like how do I fix a leaky faucet so we
take that inquiry and we use the same
model that we use to generate embeddings
that are stored in our database to
create an embedding from that inquiry
and then we go to our database and we
search using Vector search comparing the
distance between two vectors to find the
most similar um or relevant results then
we take those results we put them into
our prompt and we send it off for
generation seems easy right well it gets
a little more complicated so the devil's
in the details here and it really is
easy to kind of create a demo that can
work with a specific inquiry and then
knowing that it'll pull up the right
data for that inquiry but if you want
consistent results for all kinds of
different inquiries there's a lot of
optimization steps that need to be added
and I don't have an exhaustive list of
that but I'm going to show you a few so
you can get an
idea the first one is to pre-process
that inquiry into more of a topical
keyphrase that's more likely to match up
with one of your chunks of data in the
database so if the question was how do I
fix a leaky faucet you could use an llm
to summarize it and then your term would
be fix leaky faucet which maybe remove
some of the fluff or some of the
extraneous parts that could be in the
question then we're going to do exactly
what we did before create an embedding
from it search the database but before
we just pass the results directly into
the large language model for Generation
we might use an llm again to say which
of these results is most applicable
because if we're feeding the model bad
information its answer is going to be
off topic or wrong so if we can do an
intermediate step where we ask the llm
to pick the most applicable results
first we can be more confident that our
model is going to be using the right
information then we build our prompt and
we generate um but before we pass the
generated content back to the user we
could add another step and use an llm
again to do self-reflection and say is
this a good and accurate answer if not
rewrite it make it better and this just
gives it one more opportunity to fix any
errors in its response so as you can see
a full-fledged rag process has a lot of
moving parts now let's talk about
fine-tuning so with fine-tuning you're
actually training a foundation model on
examples of your own prompt completion
pairs prompt completion pairs are what
you're giving the model in the prompt
and then what is a good response back
this is what I would want the model to
give me back given this input fine
tuning is really useful when you're
trying to teach intuition where words
fall short let's imagine you're like a
really good writer and you can just get
in a flow State and write amazing
content you've been doing this for
decades but somebody asks you like what
makes you a good writer like what are
the 50 techniques you use to become a
good writer that might be a hard
question to answer and if I were to try
to explain that in a prompt to a large
language model of how it should do good
writing in my particular style I might
struggle with that but if I can write
then I can create prompt completion
pairs that show how I write for a given
topic or how I take a draft and I revise
it into something really amazing and
that's intuition and you can't teach
intuition through a prompt by describing
things with with rules and instructions
but you can teach a model intuition by
giving it examples and having it update
its weights so this is really cool for
baking in your style tone and formatting
to your outputs this actually allows you
to remove a lot of the stuff out of the
prompt because the model already
understands it um that reduces your
prompt length which in turn allows you
to have longer completions and another
cool way to use it is to train a smaller
model to perform at the level of a
higher model because making these models
bigger and bigger is just not
sustainable they get slower and more
expensive the more parameters you add so
even though they become more capable
there's tradeoffs so what we need how we
need to be thinking is selecting the
right model size for the task not just
always using the biggest model fine
tuning also Narrows the range of
possible outputs from your model which
is really helpful in preventing unwanted
Behavior now unfortunately there's a lot
of misconceptions floating around the
internet about fine-tuning and how it
works um so I want to get ahead of some
of those right now people think that
fine-tuning teaches a Model facts as we
discussed models don't really store
facts they store probabilities so while
you might incidentally get back bits and
pieces of your training data um it's not
a guaranteed functionality if you want
the model to reference facts the best
way to do that is to provide it in the
context window in the prompt using uh
technique like rag or any kind of
knowledge retrieval it's also common to
believe that fine tuning requires this
really large data set and that is not
true anymore with the foundation models
that we have today we can get really
cool results with just a handful like 20
examples another one is that it's too
expensive which is just not true maybe
it used to be true but now we have
parameter efficient fine-tuning
techniques and if you can use just 20 to
100 examples and start to get meaningful
results that we're talking pennies and
dollars we're not talking thousands of
dollars or millions of dollars another
one is that it's too complicated which I
totally get it's why we created Point AI
so that you could deal with fine-tuning
at a higher level and not worry about
all the complexities of writing code or
making API calls or the underlying
Hardware you can just focus on your use
case and the training data and then
running that and getting results and
finally people say that it's
incompatible with rag like you have to
make this choice between rag or fine
tuning and I'm going to show you exactly
how they can work together so here's two
fine tuning strategies you can keep in
mind the first strategy is going all out
on quality and in this scenario you take
the largest possible Foundation model
and you train it on your examples to get
better
output think of it as an extension to
fuse shot learning so if you could
provide a couple examples in the prompt
but it starts to get longer and longer
just move those into a training data set
find youe a model and now your model has
been trained on your examples and it's
going to be able to do a better job the
second strategy is to optimize speed and
cost in this scenario we pick a smaller
model and we train it on an example data
set to try to get it to perform at a
higher level um as good as one of the
large models would do with our
engineered prompt it may require a
larger data set especially depending on
how small of a model you want to pick an
optional part of this is reducing your
prompt size along the way so that you
can um you know have a larger context
window and save costs there too so I
mentioned fine-tuning as an extension to
fuse shot learning and here's an example
of fuse shot learning in a prompt where
you have two fuse shot examples which
adds up to 48 tokens for every request
here um and basically this is a sales
lead qualifier so they're trying to
decide if an inquiry from a uh marketing
form on the website is qualified or
unqualified help I just had a pipe break
in my house and there's water everywhere
send someone ASAP this is probably a
pretty good lead um they definitely need
someone they're probably willing to pay
for it um someone offering a small
business package to new customers um it
sounds like spam to me just some junk so
unqualified now unfortunately let's say
I have other scenarios that I want to
cover in my training data for my
particular company and like I think this
lead's qualified and that lead isn't
eventually our prompt is just going to
get longer and longer and it's going to
get more expensive but with fine tuning
we can take those examples out of the
prompt and put them into a data set and
with as few as 20 examples you can
really like I mentioned earlier you can
really start to see the Model Behavior
change so here's what that looks like
now our prompt we don't have those
examples in it anymore but we have our
training data with our examples and
we're able to add as many more examples
as we want and show the model what is a
qualifi lead and what isn't and as we go
on and we find more and more edge cases
we're just adding to our training data
so it becomes this scalable layer
backing our model um that gives us more
Assurance in the type of output we're
going to get now in terms of speed
fine-tuning can make a huge difference
even if you just jump down from GPT 4 to
GPT 3.5 turbo you can see that the
response times for 3.5 turbo are almost
three times faster than GPT 4 for a lot
of user experiences that's going to make
a really big difference smaller models
are also much cheaper so making that
same leap down and fine-tuning it we can
save almost 90% in terms of our cost um
and this adds up so much especially if
you have a large volume of requests okay
so now we understand Rag and we
understand fine tuning and unfortunately
fine tuning doesn't have a super cool
acronym like rag um and there's a lot of
different use cases for fine tuning so
um when they're actually creating a
model they do instruction tuning and
they do safety tuning um so for the type
of fine tuning that I'm talking about
where you're just trying to get better
output for your Generations I think it
would be pretty cool if we call that
tuning augmented generation so then we'd
have Rag and we'd have tag and we could
put them together and we could have a
rag tag team here's a fine-tune model
prompt with
Rag and and it has just the dynamic
content everything else has been baked
into the model the model knows what to
do with the inquiry and the knowledge
and it's going to act like a Q&A bot
whether we tell it to or not we've shown
it what to do through our training data
so let's review we have these three
techniques prompt engineering it's
awesome because it's easy to work with
you can do rapid prototyping it's very
intuitive to just write instructions and
get what you want back rag is really
powerful because it allows you to
connect external data sources you can
have Dynamic knowledge in the prompt
that grounds it to your facts and it's
real time so as you add more information
to your database that can be referenced
by the large language model both prompt
engineering and rag deal with the prompt
so it's limited to your context window
you just can't insert all of your
knowledge base into the prompt you have
to be really selective and get the right
information fine tuning allows you to
narrow the model's Behavior get more
predictable outputs and bake in the
style tone and
formatting uh just like prompt
engineering fine-tuning steers the
behavior of the model and just like rag
fine tuning allows you to apply data and
domain knowledge and your model becomes
more capable because of it and the thing
that they all have in common is that
they all allow you to get better outputs
and they can all work together and be
tools and techniques that you have in
your toolkit for working with large
language models thank you so much for
watching I hope this was a really
helpful over rview of these three
different techniques again I'm the
founder at entrypoint Ai and we've
created a fine-tuning platform to make
fine-tuning a lot easier I'd also love
for you to join our master class we host
weekly on fine tuning um it's a great
way to get hands-on experience
fine-tuning large language models and
see how it works so that you can start
applying it to solve problems with AI in
your life and business