Video ID: mUGsv_IHT-g
YouTube URL: https://www.youtube.com/watch?v=mUGsv_IHT-g
Added At: 13-06-25 21:17:14
Processed: No
Sentiment: Positive
Categories: Education, Tech
Tags: AI, LLaMA, language model, natural language processing, machine learning, computing, hardware comparison
Summary
• The host, Dave, demonstrates running a local large language model (Llama, via Ollama) on hardware ranging from a Raspberry Pi to a high-end AI workstation.
• He shows how to install Ollama directly on Windows and runs it on different machines with varying levels of performance.
• The Raspberry Pi proves too slow for real-time answers, while the mini PC (the Herk from Orion) runs Llama at usable speed but cannot make use of its integrated GPU.
• Dave also demonstrates running Llama on his gaming machine, an M2 Ultra Mac Pro, and a high-end AI workstation.
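The workflow summarized above boils down to a couple of Ollama CLI commands. A minimal sketch of driving that CLI from Python (the helper names here are my own, and it assumes the `ollama` tool from ollama.com is installed, as shown in the video):

```python
import shutil
import subprocess

MODEL = "llama3.1:latest"  # the model Dave pulls in the video (~5 GB)

def ollama_available() -> bool:
    """True if the ollama CLI from ollama.com is on the PATH."""
    return shutil.which("ollama") is not None

def ask(prompt: str, model: str = MODEL) -> str:
    """Run one prompt through `ollama run` and return the reply text."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    if ollama_available():
        print(ask("Analyze the story of Goldilocks for meaning"))
    else:
        print("ollama CLI not found; install it from ollama.com first")
```

The same `ask()` call works unchanged on every machine in the video; only the tokens-per-second changes.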
Transcript
Hey, I'm Dave, welcome to my shop! Today we're going to run our own local ChatGPT-style large language model, and we're going to do it on hardware ranging from literally $50 to $50,000, and experience and compare the results. Recently I did an episode showing you how to use a local large language model to provide ChatGPT-like functionality on your own home machine, but I took a lot of heat in the video comments for two things: one, I demoed it on a Dell Threadripper workstation with dual Nvidia 6000-series GPUs, and two, I used Linux on top of WSL on top of Windows to do it. The major outcries were that I should test it on more budget-friendly hardware, and that I should just install it on Windows. So today we're going to remedy that as I show you how to do it on a wide range of systems, starting with a Raspberry Pi and working our way up through a mini PC, a conventional gaming machine, an M2 Mac Pro, and then a top-end $50,000 AI workstation from Dell. And when we do it on the Windows systems, I'll show you how to install it directly on Windows without any Linux or WSL2 shenanigans. Starting at the lowest end, to see if it's even possible, we'll try to install Ollama on a Raspberry Pi, and just a Pi 4, not even a Pi 5. Now, it does have 8 GB of RAM, which I figured would give it the best chance of actually working. You can't run Windows on a Pi, so we'll drop into Raspbian for a moment to install and run Ollama itself, along with the Llama 3.1 model. I'm going to start with two console windows, one where I will install using the script for Ollama that I got from the Ollama website, and we'll let it proceed through the download. I'll speed up those lengthy downloads so we can get right into the install, and it will create the users and everything else that's required. Now, as you'll notice, it says it's not using the GPU, because of course the Pi doesn't have a GPU, and so it's going to use the CPU only. You'll get that warning on any machine where you don't have a GPU, and you can also get that
if you do have a GPU but the model won't fit into memory. So let's slide on over to the other console window, where we can actually download a model and then run it. First we'll make sure there are no models installed, and I'm going to pull llama3.1:latest. Now, this is going to take two or three minutes, which I will speed up, because even at about 3 gigabits a second it still takes a while, so yours might take significantly longer if you're on regular internet service. Verifying the SHA digest will also take some time. I'll do a quick `ollama list` to confirm the model is there, and then we'll run llama3.1:latest. I'll type in my "Analyze the story of Goldilocks for meaning" and we'll let the timer run about 12 seconds until it's done thinking about its answer. But as you can see, once it starts generating its answer, it's very, very slow: about one word a second, if that, maybe one word every two seconds. Nobody wants to watch it at that speed, so let's kick it into high gear and watch the CPU graphs as we actually let it produce its answer. As you can see, all four cores are pegged at 100%, the CPU is getting up to about 84°C, and the most active task is, of course, Ollama itself. We're using about 6 GB of memory, which is not bad, but if you think about it, if you were to run this on a 4 GB Pi it would be even worse, because it's not going to fit, or best case it's going to be paging hard. And after several minutes of struggling, everything has settled down again because it's finally done producing its answer. And so I think you'll agree that while the Pi can run it, you can't really use it for real-time answers: it's possible, but not practical. Now, when a dog plays the piano, it's not about how well it does it, it's more that it does it at all, and that's kind of how I feel about large language models on the Pi. I'm impressed that it works at all, but while it's cool that it does so as a concept, no one's going to put up with that kind of performance. So let's move up to a
consumer-grade mini PC: the Herk from Orion. The Herk starts at $388, and this one, specced at $676, features a Ryzen 9 7940HS chip with a 4 GHz base clock and a 5.2 GHz boost clock. The CPU has a TDP of 65 watts, and the system can draw up to 90 watts out of the box; it features a 140-watt external power supply to make that happen. It uses LPDDR5 SODIMMs and has a real vapor chamber cooler on the CPU. It has dual M.2 SSD slots, Wi-Fi 6E, and 2.5 Gb networking. The Herk also features a GPU, the Radeon 780M RDNA 3 iGPU running at 2800 MHz, and it's marketed as an AI mini PC, so let's put it to the test and see what it can do. Along the way I'll also show you how to install Ollama directly on Windows. Okay, we'll visit ollama.com and click on Download for Windows, which will kick off the download. That'll proceed pretty quickly on the 5-gigabit internet that I'm fortunate to have after many years of glacial service. In any event, we'll click on Open File and that will launch the installer. I'll speed the installer up here because we don't want to sit there and watch it, but as soon as it's done we are able to go into the command line and Ollama away. I'll start with `ollama list` to ensure there are no models installed yet, and then we will pull llama3.1:latest. I'll speed this up because even with fast internet it takes a couple of minutes, so your time and mileage may vary; it is 5 GB, so keep that in mind when downloading. Also, that manifest takes a long time to verify. With the model successfully pulled, we can now `ollama run llama3.1:latest`. That'll spin up Ollama, and we'll be able to type directly to it. Now, I'm not speeding this up; this is the actual speed of the Herk PC. To me this seems like it's about the same speed as ChatGPT, so it's fully usable, and I think it's a pretty good deal for an under-$400 machine. I'm speeding it up now so we can get through the whole rigamarole of its explanation of Goldilocks, but when we get to the end we can then try something else: we'll ask it to do Little Red Riding Hood,
but we'll watch the GPU meter to see how much work it's actually doing. And it looks like it's not doing anything, yet the CPU is very busy, so why is that? Well, since there's only 6 GB of dedicated GPU memory, there's probably just not enough memory to load the model into the GPU, is my guess. It appears to have loaded the model into base memory and is using the CPU. Now, I also noticed that I only have 26 GB of memory, and so the video memory must count against your system memory; the iGPU allocates a certain section of it to the GPU. Now, I'd like to see if a smaller model might fit into the GPU memory, and therefore run within the GPU and run even faster. Let's go see if we can install a smaller model. My next step, then, was to pull llama3.2, which is actually a smaller but more up-to-date model, and you can see it's 2 GB instead of 5 GB. As soon as it's done downloading, we'll ask it a similar question and see if it now fits into the GPU memory and therefore executes in the GPU. And despite struggling with the keyboard, I will get llama3.2 to run, and when it comes up we'll type in our simple query again and we'll see what we get. And it's good news, bad news. The good news is this model, being smaller, is actually a fair bit faster, and it's really snappy. The downside is it's still not using the GPU, and I'm not sure if there's anything I can do about it on the mini PC at this point. Sure enough, if I go back and check the logs of `ollama serve`, I can see that it says no compatible GPUs were discovered. So why is that? It turns out it's because my card, the 780M, or my iGPU more accurately, is not listed amongst the compatible AMD products. So it looks like at this point they're not doing the iGPU thing; they're only supporting the desktop cards. Even though I'd say the performance of the Herk was admirable for its price, it was disappointing that the iGPU could not be used. We can leapfrog past that problem by moving to a modern desktop GPU like my own. This machine is a
3970X Threadripper that I purchased about four years ago. Now, it's rocking 32 cores and 128 GB of RAM, but the single-core speed is still that of a four-year-old PC. It does have one trick up its sleeve, though, in the form of an Nvidia 4080 GPU; nothing but the second best for Dave. Now, hopefully you agree that a second-tier video card married to a four-year-old CPU can serve as a reasonable facsimile for the average contemporary gaming PC. The astute observer might notice in neofetch that the GPU is listed as a Microsoft GPU and not an Nvidia; that's because in this case I'll still be doing this one under WSL2, and one of the nicest things about the current Linux subsystem on Windows is that it supports passing the GPU through to Linux. So let's see just how fast we can run local inference with a 4080. Okay, we've done Windows, we've done Linux; now let's do Linux on Windows. This is my Threadripper 3970X, which is sporting an Nvidia 4080, not the later model, the original one. Now, in the left-hand window I'm running nvtop, which is kind of like Task Manager for Nvidia cards and allows you to monitor the activity and use of the video card. I've done `ollama run llama3.1:latest`, and we can now see some spikes on the GPU as it's loading the model, and as soon as we give it a question we should be able to see it burn through and start using the GPU. It looks like we're using 16 GB of host memory and 100% of the GPU in brief spikes, and now, with the model running, it's averaging around 75% GPU with spikes up to 100. The answer came out really quickly, even with this 5 GB model, so the 4080 does an admirable job of running Ollama. Now, for my purposes the 4080 is fast enough: it runs local inference as fast as or faster than ChatGPT while using a reasonably competent model, and I think that's about all you can really ask for. But as I'm fond of saying, I'd trade it all for a little more, so let's up the hardware ante another notch or two and see what's possible with even higher-end hardware. I have a
Mac Pro that I do all of the channel's video editing on, and it features the M2 Ultra chip. It's equipped with 128 GB of memory, and the really nice thing about the Apple architecture is that all of that RAM is also available to be allocated as video RAM, meaning we should be able to run even large models with good performance. Here we are on my Mac Pro, which is featuring the M2 Ultra. We'll type in our query, the same one, the story of Goldilocks, and get an analysis of that. We've got Activity Monitor running up at the top there, so we'll be able to see the GPU use; it seems to spike around 50%, and it produces an answer in very rapid fashion. So this is absolutely usable, and actually quite nice on a Mac Pro with the built-in internal GPU. The Mac Pro turned in an impressive performance, but we're not done there. We're next going to step up to an overclocked 96-core Threadripper with an Nvidia RTX 6000 Ada card installed. The CPU is set up to run at more than 800 watts of TDP, and combined with the GPU, this system pulls more than 1,200 watts at the wall. Now, the model we've been running so far did well enough on the 4080 and the Mac Pro that I don't think there's much value in incremental gains alone, so let's try something new: a much larger model. The Llama 3.1 model comes in three sizes: 8 billion parameters, 70 billion parameters, and 405 billion. Up till now we've been running the 8-billion-parameter version of this model. This Threadripper is equipped with 512 GB of RAM, which should be enough memory that we can load the 405-billion-parameter version and see how it performs. Let's have a look at whether we can load, run, and test out this enormous local large language model. Now, the first thing we have to do if we want to run the 405-billion-parameter version of the model is, of course, to download it, and it is 228 GB, which means you're looking at several minutes to several hours depending on your internet speed. I don't recommend that you do this unless you actually have a need, and you'll find
out why in just a moment. Once it is successfully downloaded, it will run the hash and check to make sure the digest is all correct. This can take quite a while, and as you can see, it burns through a lot of disk activity. And as the model is loaded, you can see the RAM demands go up; I think it will peak close to 200 GB of total memory. And finally the model is now running, and we can go and ask it a question. So we give it the standard Goldilocks question and we'll see how fast it rips off an answer. Are you ready? Here it comes. Still thinking: look at all that CPU usage as it parses and gets the model ready, and now it goes to the GPU, as we see it produce one token every several seconds. So yeah, it's a very powerful model, it's very big, 405 billion parameters, and it's very impressive that you can run that at home, but as you can see, you can just barely run that at home. This is almost as bad as the regular model running on the Pi; it's actually pretty close to it, I would say. So if we've learned nothing else: the size of the model, and the complexity of the calculations required to operate it and do inference on it, have almost as much impact as the actual machine you're running it on. So choose your model wisely. And now I will speed it up by a factor of 11,000 percent so that it runs about the same as the smaller model does on the equivalent hardware, and based on the clock in the tray, it looks like it took 30 minutes to finish its answer. Okay, I have to admit that made me a little sad, to see a $50,000 workstation brought to its knees like that. And by the way, the big machine is on gracious loan from Dell; I think a lot of you assumed I ponied up the money for an outrageous machine like that, but you know what they say: I didn't get rich by writing a lot of checks. Still, I feel like I have to redeem the machine somehow, as it's unfair to burden it with a load that none of the other machines could even hope to lift. So let's give it a task where it can really shine: the new, leaner, and more
efficient llama 3.2 model. Okay, let's see what the Threadripper and the RTX 6000 Ada can do with a smaller model, the more efficient llama 3.2. It is significantly smaller, I believe about 2 GB, and we'll find out once it is downloading. Once we get it installed, we'll try a query against it and see how fast it runs. And to be clear, everything from here on in is real time. So let's enter our standard query, "Analyze the story of Goldilocks for meaning," and we'll see how fast it can produce an answer. As you can see, it rips it off pretty quickly; it's scrolling by. Let's try another one: Little Red Riding Hood. Yeah, it's incredibly quick; this is a very speedy model. Let's get it to think about it and compare and contrast the two stories. Equally fast. Make up a new story featuring, let's say, Jeff Bezos: "Jeff Bezos' Porridge Predicament." And there you have it: Ollama on everything from a $50 Pi to a $50,000 workstation. If you like this kind of stuff, or if you found today's episode to be any combination of informative or entertaining, remember I'm mostly in this for the subs and likes, so I'd be honored if you'd consider subscribing to my channel and leaving a like on the video. And if you're already subscribed, thanks! And do be sure to check out the second channel, Dave's Attic, which features the weekly Q&A where I try to answer all of your random questions, including about episodes like this one. Thanks for joining me out here in the shop today. In the meantime and in between time, hope to see you next time, right here in Dave's Garage.
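The closing lesson of the video, that model size matters about as much as the hardware, can be put into numbers with a back-of-the-envelope rule: a model's footprint is roughly its parameter count times the bits stored per weight. A quick sketch (my own estimate, not from the video; ~4.5 bits per weight is an assumption about the quantization Ollama ships by default):

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Rough on-disk/in-RAM footprint: parameters x bits per weight, in GB."""
    return params * bits_per_weight / 8 / 1e9

# llama3.1 8B: in the ballpark of the ~5 GB pull Dave sees
print(round(model_size_gb(8e9, 4.5), 1))    # -> 4.5

# llama3.1 405B: matches the 228 GB download on the big Threadripper
print(round(model_size_gb(405e9, 4.5)))     # -> 228
```

The same arithmetic explains the Herk's iGPU problem: even the 8B model won't fit in a 6 GB dedicated-VRAM carve-out once the KV cache and overhead are added, which is why a smaller model like llama 3.2 is the usual escape hatch on constrained machines.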