Run Local LLMs on Hardware from $50 to $50,000 - We Test and Compare!

Video ID: mUGsv_IHT-g

YouTube URL: https://www.youtube.com/watch?v=mUGsv_IHT-g

Added At: 13-06-25 21:17:14

Processed: No

Sentiment: Positive

Categories: Education, Tech

Tags: AI, LLaMA, language model, natural language processing, machine learning, computing, hardware comparison

Summary

• The host, Dave, demonstrates running a local large language model (Llama, via Ollama) on hardware ranging from a $50 Raspberry Pi to a $50,000 AI workstation.

• He shows how to install Ollama directly on Windows and runs it on different machines with varying levels of performance.

• The Raspberry Pi is not suitable for real-time answers due to its slow processing speed, while the mini PC (Orion Herk) can run Llama relatively quickly but cannot make use of its integrated GPU.

• Dave also demonstrates running Llama on his gaming machine, an M2 Mac Pro, and a high-end AI workstation.

Transcript

Hey, I'm Dave, welcome to my shop! Today we're going to run our own local ChatGPT-style large language model, and we're going to do it on hardware ranging from literally $50 to $50,000, and experience and compare the results.

Recently I did an episode showing you how to use a local large language model to provide ChatGPT-like functionality on your own home machine, but I took a lot of heat in the video comments for two things: one, I demoed it on a Dell Threadripper workstation with dual Nvidia 6000-series GPUs, and two, I used Linux on top of WSL on top of Windows to do it. The major outcries were that I should test it on more budget-friendly hardware, and that I should just install it on Windows. So today we're going to remedy that, as I show you how to do it on a wide range of systems, starting with a Raspberry Pi and working our way up through a mini PC, a conventional gaming machine, an M2 Mac Pro, and then a top-end $50,000 AI workstation from Dell. And when we do it on the Windows systems, I'll show you how to install it directly on Windows without any Linux or WSL2 shenanigans.

Starting at the lowest end,
to see if it's even possible, we'll try to install Ollama on a Raspberry Pi, and just a Pi 4, not even a Pi 5. Now, it does have 8 GB of RAM, which I figured would give it the best chance of actually working. You can't run Windows on a Pi, so we'll drop into Raspbian for a moment to install and run Ollama itself, along with the Llama 3.1 model.

I'm going to start with two console windows, one where I will install using the script for Ollama that I got from the Ollama website, and we'll let it proceed through the download. I'll speed up those lengthy downloads so we can get right into the install, and it will create the users and everything else that's required. Now, as you'll notice, it says it's not using the GPU, because of course the Pi doesn't have a GPU, and so it's going to use the CPU only. You'll get that warning on any machine where you don't have a GPU, and you can also get it if you do have a GPU but the model won't fit into memory.

So let's slide on over to the other console window where we can actually download a model and then run it. First we'll make sure there are no models installed, and then I'm going to pull llama3.1:latest. Now, this is going to take two or three minutes, which I will speed up, because even at about 3 gigabits a second it still takes a while, so yours might take significantly longer if you're on regular internet service. Verifying the SHA digest will also take some time. I'll do a quick ollama list to confirm the model is there, and then we'll run llama3.1:latest.
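That digest verification is essentially a SHA-256 comparison: hash the downloaded blob and check it against the published value from the manifest. A minimal sketch of the idea in Python; the data and digest below are just a standard test vector, not a real model blob:

```python
import hashlib

def verify_digest(data: bytes, expected_hex: str) -> bool:
    """Hash the payload and compare against the expected hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex

# Standard SHA-256 test vector for the string "abc":
expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
print(verify_digest(b"abc", expected))   # True
print(verify_digest(b"abd", expected))   # False -- any corruption changes the hash
```

On a multi-gigabyte model file you would hash in chunks rather than in one read, which is part of why the verification step takes a while on slow storage like the Pi's SD card.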
I'll type in my query, "Analyze the story of Goldilocks for meaning," and we'll let the timer run about 12 seconds until it's done thinking about its answer. But as you can see, once it starts generating its answer it's very, very slow: about one word a second, if that, maybe one word every two seconds. Nobody wants to watch it at that speed, so let's kick it into high gear and watch the CPU graphs as we actually let it produce its answer. As you can see, all four cores are pegged at 100%, the CPU is getting up to about 84°C, and the most active task is of course Ollama itself. We're using about 6 GB of memory, which is not bad, but if you think about it, if you were to run this on a 4 GB Pi it would be even worse, because the model isn't going to fit, or at best it's going to be paging hard. And after several minutes of struggling, everything has settled down again because it's finally done producing its answer.

So I think you'll agree that while the Pi can run it, you can't really use it for real-time answers: it's possible, but not practical. Now, when a dog plays the piano, it's not about how well it does it, it's more that it does it at all, and that's kind of how I feel about large language models on the Pi. I'm impressed that it works at all, but while it's cool as a concept, no one's going to put up with that kind of performance.
So let's move up to a consumer-grade mini PC: the Herk from Orion. The Herk starts at $388, and this one, spec'd at $676, features a Ryzen 9 7940HS chip with a 4 GHz base clock and a 5.2 GHz boost clock. The CPU has a TDP of 65 watts, and the system can draw up to 90 watts out of the box; it features a 140-watt external power supply to make that happen. It uses LPDDR5 SODIMM memory and has a real vapor-chamber cooler on the CPU. It has dual M.2 SSD slots, Wi-Fi 6E, and 2.5 Gb networking. The Herk also features a GPU, the Radeon 780M RDNA 3 iGPU running at 2800 MHz, and it's marketed as an AI mini PC, so let's put it to the test and see what it can do. Along the way, I'll also show you how to install Ollama directly on Windows.
Okay, we'll visit ollama.com and click on Download for Windows, which will kick off the download. That'll proceed pretty quickly on the 5-gigabit internet that I'm fortunate to have after many years of glacial internet service. In any event, we'll click on Open File, and that will launch the installer. I'll speed the installer up here, because we don't want to sit there and watch it, but as soon as it's done, we are able to go into the command line and Ollama away. I'll start with an ollama list to ensure there are no models installed yet, and then we will pull llama3.1:latest. I'll speed this up, because even with fast internet it takes a couple of minutes, so your time and mileage may vary, but it is 5 GB, so keep that in mind when downloading; that manifest also takes a long time to verify. With the model successfully pulled, we can now ollama run llama3.1:latest. That'll spin up Ollama, and we'll be able to type directly to it.
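Besides the interactive prompt, Ollama also listens on a local HTTP API (port 11434 by default), so you can script queries instead of typing them. A minimal sketch using only the Python standard library; it assumes the Ollama server installed above is running and the model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for a single non-streaming generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask(model: str, prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the full answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running, something like:
#   print(ask("llama3.1:latest", "Analyze the story of Goldilocks for meaning."))
```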
Now, I'm not speeding this up; this is the speed of the Herk PC. To me this seems like it's about the same speed as ChatGPT, so it's fully usable, and I think that's a pretty good deal for an under-$400 machine. I'm speeding it up now so we can get through the whole rigamarole of its explanation of Goldilocks, but when we get to the end we can then try something else: we'll ask it to do Little Red Riding Hood, but we'll watch the GPU meter to see how much work it's actually doing. And it looks like it's not doing anything, but the CPU is very busy. So why is that? Well, since there's only 6 GB of dedicated GPU memory, my guess is there's simply not enough memory to load the model into the GPU; it appears to have loaded the model into base memory and is using the CPU. I also noticed that I only have 26 GB of memory, so the video memory must count against your system memory, with the iGPU allocating a certain section of it to itself. Now, I'd like to see if a smaller model might fit into the GPU memory, and therefore run within the GPU, and run even faster.
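As a rule of thumb, a 4-bit-quantized model needs roughly half a byte per weight plus runtime overhead for the KV cache and buffers, which is why a ~5 GB model is a tight squeeze in 6 GB of dedicated GPU memory. Here's the back-of-the-envelope version; the 4.5 bits per weight and 2 GB overhead are my own ballpark figures, not numbers Ollama reports, and as we're about to see, driver support matters as much as size:

```python
def model_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in gigabytes."""
    return params_billion * bits_per_weight / 8

def fits_in_vram(params_billion: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Crude check: weights plus a fixed overhead budget must fit in video memory."""
    return model_gb(params_billion) + overhead_gb <= vram_gb

print(round(model_gb(8), 1))    # 4.5 -- in the ballpark of the ~5 GB llama3.1 pull
print(round(model_gb(405)))     # 228 -- matches the 405B download size
print(fits_in_vram(8, 6))       # False: the 8B model spills out of 6 GB of VRAM
print(fits_in_vram(3, 6))       # True: a ~2 GB model ought to fit
```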
Let's go see if we can install a smaller model. My next step, then, was to pull llama3.2, which is actually a smaller but more up-to-date model, and you can see it's 2 GB instead of 5 GB. As soon as it's done downloading, we'll ask it a similar question and see if it now fits into the GPU memory and therefore executes in the GPU. And despite struggling with the keyboard, I will get llama3.2 to run, and when it comes up we'll type in our simple query again and see what we get.

And it's good news, bad news. The good news is that this model, being smaller, is actually a fair bit faster; it's really snappy. The downside is that it's still not using the GPU, and I'm not sure there's anything I can do about that on the mini PC at this point. Sure enough, if I go back and check the logs of ollama serve, I can see that it says no compatible GPUs were discovered. So why is that? It turns out it's because my card, the 780M, or my iGPU more accurately, is not listed among the compatible AMD products. So it looks like at this point they're not doing the iGPU thing; they're only doing the desktop cards. Even though I'd say the performance of the Herk was admirable for its price, it was disappointing that the iGPU could not be used.

We can leapfrog past that problem by moving to a modern desktop GPU like my own.
This machine is a 3970X Threadripper that I purchased about four years ago. It's rocking 32 cores and 128 GB of RAM, but the single-core speed is still that of a four-year-old PC. It does have one trick up its sleeve, though, in the form of an Nvidia 4080 GPU: nothing but the second best for Dave. Now, hopefully you agree that a second-tier video card married to a four-year-old CPU can serve as a reasonable facsimile for the average contemporary gaming PC. The astute observer might notice in neofetch that the GPU is listed as a Microsoft GPU and not an Nvidia; that's because in this case I'll still be doing this one under WSL 2, and one of the nicest things about the current Linux subsystem on Windows is that it supports passing the GPU through to Linux. So let's see just how fast we can run local inference with a 4080.

Okay, we've done Windows, we've done Linux, now let's do Linux on Windows. This is my Threadripper 3970X, which is sporting an Nvidia 4080, not the later model, the original one. Now, in the left-hand window I'm running nvtop, which is kind of like Task Manager for Nvidia cards and lets you monitor the progress and use of the video card. I've done ollama run llama3.1:latest, and we can now see some spikes on the GPU as it loads the model, and as soon as we give it a question we should be able to see it start using the GPU. It looks like we're using 16 GB of host memory and 100% of the GPU in brief spikes, and now, with the model running, it's averaging around 75% GPU with spikes up to 100. The answer came out really quickly, even with this 5 GB model, so the 4080 does an admirable job of running Ollama.

Now, for my purposes the 4080 is fast enough: it runs local inference as fast as or faster than ChatGPT while using a reasonably competent model, and I think that's about all you can really ask for. But as I'm fond of saying, I'd trade it all for a little more. So let's up the hardware ante another notch or two and see what's possible with even higher-end hardware.
I have a Mac Pro that I do all of the channel's video editing on, and it features the M2 Ultra chip. It's equipped with 128 GB of memory, and the really nice thing about the Apple architecture is that all of that RAM is also available to be allocated as video RAM, meaning we should be able to run even large models with good performance. Here we are on my Mac Pro featuring the M2 Ultra. We'll type in our query, the same one, the story of Goldilocks, and get an analysis of it. We've got Activity Monitor running up at the top there, so we'll be able to see the GPU use; it seems to spike around 50%, and it produces an answer in very rapid fashion. So this is absolutely usable, and actually quite nice, on a Mac Pro with the built-in GPU.

The Mac Pro turned in an impressive performance, but we're not done there. We're next going to step up to an overclocked 96-core Threadripper with an Nvidia 6000 Ada card installed. The CPU is set up to run at more than 800 watts of TDP, and combined with the GPU, this system pulls more than 1,200 watts at the wall.
Now, the model we've been running so far did well enough on the 4080 and the Mac Pro that I don't think there's much value in incremental gains alone, so let's try something new: a much larger model. The Llama 3.1 model comes in three sizes: 8 billion parameters, 70 billion parameters, and 405 billion. Up till now we've been running the 8 billion parameter version. This Threadripper is equipped with 512 GB of RAM, which should be enough memory that we can load the 405 billion parameter version and see how it performs. Let's have a look at whether we can load, run, and test out this enormous local large language model.

Now, the first thing we have to do if we want to run the 405 billion parameter version of the model is, of course, to download it, and it is 228 GB, which means you're looking at several minutes to several hours depending on your internet speed. I don't recommend that you do this unless you actually have a need, and you'll find out why in just a moment.
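"Several minutes to several hours" is easy to quantify: transfer time is just size divided by line rate. A quick estimate for the 228 GB pull; the line speeds are illustrative, and real downloads are often throttled well below them:

```python
def download_minutes(size_gb: float, mbits_per_sec: float) -> float:
    """Minutes to move size_gb gigabytes at a given line rate in megabits/second."""
    return size_gb * 8_000 / mbits_per_sec / 60

for mbps in (100, 1_000, 5_000):  # typical broadband, gigabit, a 5-gigabit line
    print(f"{mbps:>5} Mbit/s: {download_minutes(228, mbps):6.1f} min")
```

So at gigabit speeds the 228 GB pull is a half-hour job, and on 100-megabit service it's an afternoon.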
Once it has successfully downloaded, it will run the hash check to make sure the digest is all correct; this can take quite a while, and as you can see, it burns through a lot of disk activity. And as the model is loaded, you can see the RAM demands go up; I think it will peak close to 200 GB of total memory. Finally the model is running, and we can go ask it a question. So we give it the standard Goldilocks question, and we'll see how fast it rips off an answer. Are you ready? Here it comes. Still thinking. Look at all that CPU usage as it parses and gets the model ready, and now it goes to the GPU, as we see it produce one token every several seconds.

So yeah, it's a very powerful model, and it's very big at 405 billion parameters. It's very impressive that you can run that at home, but as you can see, you can just barely run it at home: this is almost as bad as the regular model running on the Pi, pretty close to it, I would say. So if we've learned nothing else: the size of the model and the complexity of the calculations required to do inference on it have almost as much impact as the actual machine you're running it on, so choose your model wisely. And now I will speed it up by a factor of 11,000 percent so that it runs at about the same apparent rate as the smaller model does on equivalent hardware. Based on the clock in the tray, it looks like it took 30 minutes to finish its answer.
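That 30-minute wall-clock time squares with the observed generation rate: at one token every few seconds, even a modest answer takes forever. The arithmetic, with a rough token count and per-token delay of my own choosing rather than measured values:

```python
def answer_minutes(tokens: int, seconds_per_token: float) -> float:
    """Wall-clock minutes to emit an answer at a fixed generation rate."""
    return tokens * seconds_per_token / 60

# A ~600-token answer at one token every 3 seconds (the 405B model here):
print(answer_minutes(600, 3.0))      # 30.0 minutes
# The same answer at a ChatGPT-like ~20 tokens per second:
print(answer_minutes(600, 1 / 20))   # 0.5 minutes
```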
Okay, I have to admit that made me a little sad, to see a $50,000 workstation brought to its knees like that. And by the way, the big machine is on gracious loan from Dell; I think a lot of you assumed I ponied up the money for an outrageous machine like that, but you know what they say: I didn't get rich by writing a lot of checks. Still, I feel like I have to redeem the machine somehow, as it's unfair to burden it with a load that none of the other machines could even hope to lift. So let's give it a task where it can really shine: the new, leaner, and more efficient Llama 3.2 model.

Okay, let's see what the Threadripper and the 6000 Ada can do with a smaller model, the more efficient Llama 3.2. It is significantly smaller, I believe about 2 GB, and we'll find out once it's downloading. Once we get it installed, we'll try a query against it and see how fast it runs. And to be clear, everything from here on in is real time. So let's enter our standard query, "Analyze the story of Goldilocks for meaning," and we'll see how fast it can produce an answer. As you can see, it rips it off pretty quickly; it's scrolling by. Let's try another one; we'll try Little Red Riding Hood. Yeah, it's incredibly quick; this is a very speedy model. Let's get it to think about it and compare and contrast the two stories. Equally fast. Make up a new story featuring, uh, let's say, Jeff Bezos. "Jeff Bezos' Porridge Predicament." And there you have it: Ollama on everything from a $50 Pi to a $50,000 workstation.
If you like this kind of stuff, or if you found today's episode to be any combination of informative or entertaining, remember I'm mostly in this for the subs and likes, so I'd be honored if you'd consider subscribing to my channel and leaving a like on the video. And if you're already subscribed, thanks, and do be sure to check out the second channel, Dave's Attic, which features the weekly Q&A where I try to answer all of your random questions, including about episodes like this one. Thanks for joining me out here in the shop today. In the meantime and in between time, hope to see you next time, right here in Dave's Garage.