Unlimited AI Agents running locally with Ollama & AnythingLLM

Video ID: 4UFrVvy7VlA

YouTube URL: https://www.youtube.com/watch?v=4UFrVvy7VlA

Added At: 13-06-25 21:18:58

Processed: No

Sentiment: Positive

Categories: Education, Tech

Tags: AI, Machine Learning, Quantization, Agents, LLMs, AnythingLLM, Tutorial

Summary

The video showcases AnythingLLM, a desktop application that lets users run Large Language Models (LLMs) entirely on their own devices. The speaker gives a short primer on quantization and agents, then shows how AnythingLLM can unlock agent capabilities in any LLM served by Ollama. He walks through setting up AnythingLLM with a Q8 model and explains the privacy benefits of keeping everything local.

Transcript

Hey everyone, my name is Timothy Carambat, founder of Mintplex Labs and creator and maintainer of AnythingLLM. Today I'm going to showcase AnythingLLM and how it works, but also show you something that makes Ollama models really powerful: we're going to give agent capabilities to any LLM available on Ollama, so it can search the web, save things to memory, scrape websites, even make charts. I'm going to show you how to unlock all of those abilities just by downloading AnythingLLM and connecting it to Ollama. It'll be really simple, but first I want to share a little bit of education about what Ollama is, what quantization is, and what agents even are.

First, Ollama. If you found this video, you've definitely heard of Ollama, because Ollama is in the title. Ollama is an application you can install on Mac, Windows, and Linux, and it allows you to run LLMs using your own computer's devices: no cloud, no anything like that, so it's totally private. The way this is possible (because Llama 3 is a massive model that would otherwise take dozens of GPUs to run) is through a process called quantization. Quantization is basically how we get these models small enough to run on your CPU or your GPU. I'm not going to get into the weeds of how it works, but in general you should know that quantization is essentially compression of an LLM, and when we get into agents I'll tell you why that's really important.

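To make "compression of an LLM" a bit more concrete, here's a minimal sketch of the core idea behind simple symmetric 8-bit quantization: store each weight as a small integer plus a shared scale factor, and reconstruct approximate floats when needed. This is an illustration of the principle only, not Ollama's actual implementation (Ollama uses GGUF quantization formats such as Q4_0 and Q8_0, which are more sophisticated):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map floats to [-127, 127] plus one scale."""
    scale = np.abs(weights).max() / 127.0          # shared scale for the tensor
    q = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; some precision is permanently lost."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())  # small, but nonzero
```

Fewer bits per weight means a smaller file and less memory, but more rounding error — which is exactly why heavily quantized small models start misbehaving, as discussed below.
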
The next part of this really short lecture is: what is an agent? You have LLMs, and they respond to you with text, right? They don't really do anything. An agent does something. It's an LLM that is able to execute what people call tools, or skills (there's a whole bunch of terminology for it): an LLM that, given your input, doesn't just respond with text. It actually runs some program, interface, or API, gets that information or performs that action, and then comes back to you with the result — your question answered with that tool's supplemental help. It's like RAG, but you're doing things instead of just chatting with a chunk of a document. And you can see that RAG actually sits at the top part of this graph as short-term and long-term memory, which is a common use case for retrieval-augmented generation: chat with your docs, all the same thing.

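In code terms, an agent is just a loop around the model: describe the available tools, let the model either answer or request a tool call, run the tool, and feed the result back. Here's a minimal sketch of that loop; the tool names, the JSON reply format, and the fake model are all hypothetical stand-ins, not AnythingLLM's actual internals:

```python
import json

TOOLS = {
    "web_search": lambda q: f"(top results for {q!r})",  # stub tool for the demo
}

def run_agent(llm, user_input, max_steps=5):
    """llm(messages) -> str. A reply is either plain text (the final answer)
    or JSON like {"tool": name, "arg": string} requesting a tool call."""
    messages = [
        {"role": "system",
         "content": 'Answer directly, or reply with ONLY JSON '
                    '{"tool": ..., "arg": ...} to use one of: ' + ", ".join(TOOLS)},
        {"role": "user", "content": user_input},
    ]
    for _ in range(max_steps):
        reply = llm(messages)
        try:
            call = json.loads(reply)                 # model asked for a tool?
        except ValueError:
            return reply                             # plain text = final answer
        if not isinstance(call, dict) or "tool" not in call:
            return reply
        result = TOOLS[call["tool"]](call["arg"])    # run the tool locally
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": f"Tool result: {result}"}]
    return "(stopped: too many steps)"

# Fake model for illustration: first asks to search, then answers.
def fake_llm(messages):
    if not any("Tool result" in m["content"] for m in messages):
        return json.dumps({"tool": "web_search", "arg": "AnythingLLM"})
    return "AnythingLLM is a local RAG and agent app."

print(run_agent(fake_llm, "What is AnythingLLM?"))
```
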
What we're going to do is get this working for any LLM. You're probably familiar with cloud-based models like OpenAI's, Anthropic's Claude, or Perplexity, where you can say things to the model and sometimes it can go do something, like search the web, which is a very common use case. However, if you're using Ollama and you try to tell your model to search the web, it'll just tell you that it can't do that. Well, now with AnythingLLM, any LLM can be an agent, and can even search the web, and do all of this for free on your computer with 100% privacy. So I'm going to show you how we unlock that today.

The first thing we need to do is find a good model. As I said, any LLM will work with AnythingLLM and its agent capabilities. However, coming back to quantization, there's one detail people tend to overlook when it comes to Ollama: by default, Ollama installs a Q4 quantization. That probably doesn't mean much to you, so here's the rough rule of thumb: Q1 is the most compressed version of a model; Q8 is the least compressed version — though still compressed, not the raw model. If you have a model that's 8 billion parameters and you compress it a lot, down to something like two or three bits per weight, you've taken something that's already small and squeezed it even further, so now you have a pretty bad model. You'll get hallucinations, you'll get weird outputs, it'll go off the rails or not even respond to your questions. All of these become problems when smaller models are quantized very heavily. So what we're going to do today is intentionally download Llama 3 from Ollama, but use the Q8 version, so that it's more robust, the tool calls are more reliable, and the responses are just better. If we were messing with the 70-billion-parameter model, we probably wouldn't download the Q8 — it'd be 70-some gigs — we'd use the Q4 and have a good time, because 70 billion parameters is a lot. I know that sounds very technical, but hopefully you understand why quantization and picking the right model is a use-case science, and it's something you should understand if you're messing with LLMs at all.

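You can sanity-check those size claims with back-of-the-envelope arithmetic: a quantized model file is roughly parameters × bits-per-weight ÷ 8, plus a little overhead for scales and metadata. A quick sketch (the ~0.5 extra bits per weight is a rough allowance for that overhead):

```python
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough quantized model size in GB: params * bits / 8."""
    return params_billions * bits_per_weight / 8

for params in (8, 70):
    for name, bits in (("Q4", 4.5), ("Q8", 8.5)):
        print(f"{params}B {name}: ~{approx_size_gb(params, bits):.1f} GB")

# 8B Q8 lands near the 8.5 GB shown on the Ollama library page;
# 70B Q8 would be ~74 GB, which is why Q4 is the practical choice there.
```
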
If you go to the Ollama library, open Llama 3, and scroll down, you'll see that the 8B tag and the latest tag (which is what downloads by default) are the same; that tag is also matched to the instruct model, which is the same thing, and it is a Q4. So this is a pretty small model, basically the middle of the road between size and performance. But we want really good performance, because we're dealing with agents. So I'm going to find the Q8 version of this model, which you can do by just typing in "q8", and you'll see it right here: it's 8.5 gigs. I'm running on an Intel MacBook Pro, which is pretty bad for inference in general, but I have a Windows computer in the other room, so I'm actually going to run Ollama on that computer and AnythingLLM on this one, all on my private network.

So here I am on my Windows computer, and I have Ollama installed. If I type "ollama", we can see it's running. I need to pull in that Q8 model, and the easiest way to do that is "ollama pull". I already have it downloaded, because I wasn't going to wait while making this video, and you'll see it pulls all of the layers, so we're good to go. The only thing left is "ollama serve" to make sure the server is running — and it's already running. As you can see, I also have ngrok running, tunneling my desktop computer in one room to a connection I can reach from my MacBook in the other room.

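As an aside, if you set up a similar tunnel, you can verify the Ollama server is reachable before pointing AnythingLLM at it. Ollama listens on port 11434 by default and exposes a GET /api/tags endpoint that lists the installed models; the ngrok URL below is a placeholder for whatever forwarding address ngrok gives you:

```python
import requests

# Placeholder: substitute your ngrok forwarding URL (or http://localhost:11434).
OLLAMA_URL = "https://example.ngrok-free.app"

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])  # the pulled Q8 llama3 tag should appear here
```
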
This is where AnythingLLM comes in. AnythingLLM is an all-in-one AI agent and RAG tool that runs on your desktop, fully locally, and connects with pretty much anything you care about. It works on Mac, Windows, and Linux: all you do is go to useanything.com, click download, and pick the right operating system and chip architecture. Since I already have AnythingLLM downloaded, we're going to boot it up, and because I've never run it on this computer, it's going to start with onboarding and ask: what LLM do you want to use? AnythingLLM actually ships with Ollama inside it, so the whole business of setting up Ollama on my Windows computer is completely extraneous if you have a GPU in this device — but I'm on an Intel MacBook, which is really old. So I'm going to use the Ollama external connection, and all I'll do is paste in that address from ngrok, and you'll see my chat models are loaded. I want the Q8 one, and because I know about this model, I know it has an 8,192-token context window. It's really annoying that this information isn't published for every model — you have to go and Google it — but anyway, we'll just continue.

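One related note: when you talk to Ollama directly over its API, the context window is set per request through the num_ctx option, so the 8,192 figure entered during onboarding is the same kind of value you'd pass there. A minimal sketch, reusing the placeholder URL from before and the Q8 tag pulled earlier:

```python
import requests

OLLAMA_URL = "https://example.ngrok-free.app"  # placeholder ngrok URL

resp = requests.post(f"{OLLAMA_URL}/api/generate", json={
    "model": "llama3:8b-instruct-q8_0",   # the Q8 tag from the Ollama library
    "prompt": "Say hello in one sentence.",
    "options": {"num_ctx": 8192},         # context window for this request
    "stream": False,
}, timeout=120)
print(resp.json()["response"])
```
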
You can see this next screen is a privacy overview. We're going to use AnythingLLM's built-in embedder, so everything will embed on this device, and we're going to use the built-in vector database as well, so basically none of my chats or data are leaving my local network at all; everything stays on premises and it all just works very nicely. And of course you can skip the survey — it's totally optional. Let's make a workspace and just call it "sample" for now. The very first thing anyone would want to do is test whether the model works, so let's just say "hello". What this does is send a request to my Windows computer, and Ollama on that machine streams the response back. You can see it works, about as well as you'd expect, and it's fast.

However, while it might be fast (because I'm using a 4090 in the other room), it's still pretty dumb, and the reason we can say that is that it doesn't know anything about what I might want it to know about. For example, AnythingLLM — while people love it and it's great and it's cool — isn't popular enough for an LLM to know about. So if we ask the question "what is AnythingLLM?", it's likely going to make something up, and it says AnythingLLM is an LLM, which is totally wrong. Yeah, this is all a hallucination; none of it is accurate. So what can we do to improve its knowledge of AnythingLLM? Well, the easiest way is RAG, so let's do that first.

So we're going to go upload a document. I actually have AnythingLLM's GitHub README already downloaded as a PDF, so I'm just going to upload that and then move it over to the workspace, so that when I'm in this workspace chatting with Ollama, it will use this set of documents. You can see it was added successfully, so we can close this window. Now let's reset the chat and ask that same question again: what is AnythingLLM? What we hope to see is a response — wow, that was quick — and we get citations, so we can see exactly which chunks were relevant to my query and resulted in the LLM being able to complete the answer. It says AnythingLLM is a full-stack application, blah blah blah, does all this stuff. That is accurate; this is actually factual information. We could also go into the workspace settings, open the vector database options, increase the number of snippets per chat, or change the way documents are deemed relevant.

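Under the hood, "chat with your docs" boils down to embedding chunks of the document, embedding the query, and handing the model the most similar chunks (the "snippets") as context. A toy sketch of that retrieval step — the embed function below is a random stand-in for illustration only, not AnythingLLM's actual built-in embedder:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedder: a real one (e.g. a sentence-transformer) goes here."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)  # deterministic per text
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

chunks = ["AnythingLLM is a full-stack application...",
          "It supports many vector databases...",
          "License: MIT."]
index = np.stack([embed(c) for c in chunks])   # the "vector database"

query = "What is AnythingLLM?"
scores = index @ embed(query)                  # cosine similarity (unit vectors)
top = np.argsort(scores)[::-1][:2]             # top-2 "snippets per chat"
context = "\n".join(chunks[i] for i in top)    # gets prepended to the prompt
print(context)
```
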
But there's actually an easier way to just use LLMs, and that is with agents. As I said before, this is not a capability built into Ollama, and it's not a capability built into Llama 3; this is something we've been able to apply to any LLM that doesn't support function calling. Function calling is how all of this magic works, and now you can unlock it when you use AnythingLLM with any LLM.

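For context, "function calling" normally means a model was trained to emit a structured tool invocation instead of prose. The standard way to retrofit it onto models that weren't trained for it — the video doesn't show AnythingLLM's exact prompt, so this is only the general technique — is to describe the tools in the system prompt, ask for JSON, and parse what comes back:

```python
import json

def tool_call_prompt(tools: dict[str, str]) -> str:
    """Build a system prompt that coaxes any chat model into 'function calling'."""
    lines = [f"- {name}: {desc}" for name, desc in tools.items()]
    return ("You have these tools:\n" + "\n".join(lines) +
            '\nTo use one, reply with ONLY JSON: {"tool": "<name>", "args": {...}}.'
            "\nOtherwise, answer normally.")

def parse_reply(reply: str):
    """Return (tool, args) if the model asked for a tool, else None."""
    try:
        data = json.loads(reply)
        return data["tool"], data.get("args", {})
    except (ValueError, KeyError, TypeError):
        return None  # plain-text answer, no tool requested

print(tool_call_prompt({"web-browsing": "search the web for a query"}))
print(parse_reply('{"tool": "web-browsing", "args": {"query": "AnythingLLM"}}'))
```
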
So in the agent settings, what we want to do is use Ollama. We have Ollama, we have our model, and that's it — we really don't want to use a worse model; I think we have base Llama 3 in there too, but let's stick with the Q8 version. And there are some default skills that come with it. Of course there's RAG and long-term memory; we already saw that's built into AnythingLLM. We have the ability to look at the documents in our workspace, modify them, summarize them, and commit new information to long-term memory just from chatting. We can summarize documents. We can scrape websites — that's a feature built right into AnythingLLM. We can generate charts, though I'll admit this one is a little model-dependent: you could paste in a CSV and say "make a bar chart", and some models kill it, but Llama 3 honestly isn't that great at it. We can generate and save files to the browser, so if we're talking to it and say "hey, can you save that contact information to tim.txt", it'll download it and save it to the desktop on this device.

And then, of course, live web search and browsing. This makes any LLM you download and run locally basically on par with Perplexity, and you can actually do it for free. I'm sure you're thinking, "ah, but I need an API key" — you do, but Google actually offers this service totally for free. You can just click the link we provide, and it opens up Google's Programmable Search Engine setup. You get 100 queries a day, which is honestly pretty good. We do support other search-engine-results providers, but this one is totally free, and anybody with a Google account can sign up. So let's connect mine so we can get web browsing.

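The free tier he's describing is Google's Programmable Search Engine, queried through the Custom Search JSON API (100 queries/day at no cost). If you're curious what a connector does with that API key and search-engine ID, a direct query looks roughly like this — the key and cx values below are placeholders:

```python
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: API key from the Google console
ENGINE_ID = "YOUR_ENGINE_CX"  # placeholder: your Programmable Search Engine ID

resp = requests.get("https://www.googleapis.com/customsearch/v1", params={
    "key": API_KEY,
    "cx": ENGINE_ID,
    "q": "AnythingLLM key features",
    "num": 5,                 # up to 10 results per request
})
for item in resp.json().get("items", []):
    print(item["title"], "->", item["link"])
```
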
Okay, so I have that information put in. I'm going to click update, and now everything is saved. Let's go back to the chat window. Now keep in mind, we already had information about AnythingLLM stored in here, so let's remove it right now. If we reset the chat and ask "what is anything llm", we should again get a made-up response that has nothing to do with the actual tool. But we can get an agent into the loop on this, and the way you do that is by typing "@agent" — or you can click this, and we explain how agents work. So "@agent" is how you invoke it. We'll say: "@agent can you scrape useanything.com" (which is our website) "and tell me the key features". What we hope to see is this model go to useanything.com, scrape it, compile that information — specifically the key features — and hopefully give us back a pretty good text response. And you can see we actually get what I would consider a pretty decent response. But keep in mind this is not in long-term memory, so let's ask the model to remember it for later: "thank you, can you remember that information for later?" What we hope to see is the model recognize this as an available function and say "oh yes, of course": it takes the chat as it is right now, summarizes it, and saves it for later, so that it would work when we ask in a regular chat. And you can see it's done that.

But now let's look at summarization. Summarization is one of the most requested and most used features of AnythingLLM. It's not how RAG works — it's actually a pretty big misunderstanding that you can just upload a document to a vector database and say "summarize my document"; that's just not how vector databases work. But with AnythingLLM, you can do it.

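Why can't the vector database do it? Retrieval only surfaces the handful of chunks most similar to your query, never the whole document. Whole-document summarization is typically done instead with a map-reduce pass over every chunk — summarize each piece, then summarize the summaries. A sketch of that common approach (the chat helper stands in for whatever model call the agent makes; this isn't AnythingLLM's literal code):

```python
def summarize_document(chat, text: str, chunk_chars: int = 4000) -> str:
    """Map-reduce summarization, where chat(prompt) -> str is any LLM call."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Map: summarize every chunk, so no part of the document is skipped.
    partials = [chat(f"Summarize this passage:\n\n{c}") for c in chunks]
    # Reduce: merge the partial summaries into one final summary.
    return chat("Combine these partial summaries into one summary:\n\n"
                + "\n\n".join(partials))

# Toy usage with a fake model that just truncates its input.
fake_chat = lambda prompt: prompt[-80:]
print(summarize_document(fake_chat, "lorem ipsum " * 2000))
```
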
new workspace so we'll just call it
anything
llm and we're going to upload that same
readme document because I've already
embedded in another workspace embedding
is instant and now with no other kind of
inferencing or leading or anything like
that let's just ask the agent can you
summarize readme.pdf which is the name
of the file in the workspace and you can
see it looks at the available documents
founds a document called readme.pdf and
then begins to summarize it again this
is all running locally within my network
because I'm using my Windows computer
but it is summarizing and you can see
that it says it summarized it blah blah
blah did all this stuff mentions it's
MIT licensed that is kind of the quick
preview of what agents can do for any
llm when you put them in anything llm
And while I do recognize that this list of default skills is pretty limited right now, I really want to emphasize that this is just the beginning for AnythingLLM. We're actually going to add the ability for you to define your own agents, like you would in tools like CrewAI or any other agent builder already out there; that will just exist inside AnythingLLM. AnythingLLM plus Ollama can be your go-to for not only RAG but also AI agents that can do things for you. We have a lot more cooking on this front, so I'm really excited to show you this even in its current state. I also want to remind everybody that AnythingLLM is open source. You can use the app I just showed you, right now, today, for free, with no ifs, ands, or buts: you just download it and get it running. The easiest way to support us is actually by starring us on GitHub — we would really appreciate that. I'd also appreciate feedback and suggestions on new tools you'd like to see agents accomplish; we'd love to know what you're working on and how AnythingLLM fits into that flow. So that's it for this short video. I really appreciate your time. Thank you.