
What is the best budget GPU for local LLM training?

6 Posts
7 Users
0 Reactions
56 Views
0
Topic starter

Okay, so I'm hitting a wall here. I have about $450, maybe $500 if I skip takeout for a month, and I need a GPU like yesterday for a research project on fine-tuning small models for medical terminology. I'm based in a small town in Ohio and shipping is taking forever lately, so I need to make a decision tonight to get it by the weekend. My logic was to just grab a used 3060 because everyone says its 12GB of VRAM is the king of budget builds, but then I started looking at the 4060 Ti 16GB and now I'm spiraling.

The problem is, I read that the 4060 Ti has a really narrow memory bus, so even though it has more VRAM it might actually be slower for training passes than the older cards? But some people on Reddit say VRAM capacity is literally the only thing that matters if you don't want out-of-memory errors every five minutes. I looked at RTX 3080s too, but the 10GB version seems like a waste for LLMs even if the raw speed is better, right? I just don't want to spend my entire savings on something that will bottleneck me when I try to run LoRA or QLoRA stuff.

I'm mostly looking at 7B models, maybe 13B if I can squeeze it in. Is the 3060 12GB still the play in 2024, or am I just being cheap? I saw someone mention dual P40s, but I don't have the cooling setup or the patience to mess with server cards and weird drivers, honestly. I just need something that works out of the box with PyTorch and won't catch fire in my mid-tower case while I leave it running overnight. What would you actually put your own money on if you had to start training tomorrow on a shoestring budget?


6 Answers
12

> is the 3060 12gb still the play in 2024 or am i just being cheap?

Honestly, I've had issues with the 3060 lately because 12GB is becoming a real bottleneck even for basic 7B fine-tuning once you want any decent context length. The memory bus on the RTX 4060 Ti 16GB is indeed narrow, which makes it feel sluggish during heavy training passes compared to higher-end silicon. Still, if I had to put my own money down today for a mid-tower setup, I would begrudgingly pick the 4060 Ti 16GB over an EVGA RTX 3060 XC 12GB. Capacity is king for LLMs: that extra 4GB is the difference between actually running a 13B QLoRA and crashing with OOM errors. The 3080 10GB is a waste of time for this specific use case despite the speed. Get the 16GB card and live with the slower bus.
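Since the OOM-vs-capacity point keeps coming up, here's a back-of-envelope sketch of why 13B QLoRA is tight on 12GB. The 0.5 bytes/param figure (4-bit NF4 weights) and the flat overhead numbers are loud assumptions, not measurements; real usage depends on context length, adapter rank, and optimizer.

```python
# Back-of-envelope QLoRA memory check. Assumptions (not measured numbers):
# 4-bit quantized base weights ~= 0.5 bytes per parameter, plus a flat
# overhead for LoRA adapters, optimizer state, and activations, which in
# practice grows with context length.
def qlora_vram_gb(params_billion: float, overhead_gb: float = 3.0) -> float:
    weights_gb = params_billion * 0.5  # 4-bit base model weights
    return weights_gb + overhead_gb

for size in (7, 13):
    # show a short-context (3 GB) vs longer-context (6 GB) overhead guess
    print(f"{size}B: ~{qlora_vram_gb(size, 3.0):.1f}-{qlora_vram_gb(size, 6.0):.1f} GB")
```

On these guesses, a 13B QLoRA sits right at the edge of 12GB once context grows, which is exactly the "extra 4GB" argument above.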


12

Building on the earlier suggestion, you absolutely need to prioritize that VRAM headroom! I saw your post earlier and had to jump in because I love this specific puzzle. Honestly, if you can stretch the budget a tiny bit or find a deal, the RTX 4060 Ti 16GB is a game changer for budget LLM work! Forget the noise about memory bus speed. When you're fine-tuning with LoRA or QLoRA, hitting an out-of-memory error is the absolute worst vibe, and that extra 4GB over the 3060 is massive. It lets you push the context window so much further, which is fantastic for medical data. Plus, the power efficiency is amazing! It stays way cooler in a mid-tower than an old power-hungry 3080 ever would. If you want something that just works with PyTorch out of the box and won't turn your room into a sauna, that's my pick for sure!


3

To add to the point above: the VRAM cap really is the hard ceiling here. I spent years messing with professional server racks before going back to a DIY desktop, and the RTX 4060 Ti 16GB is the only thing that makes sense for 13B models on a budget. The narrow bus hurts, but OOM errors kill your flow far faster than a slow epoch does. Quick tip: drop your batch size to 1 when you hit memory pressure; the bandwidth penalty just makes each pass slower, it never hard-crashes the run.
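The batch-size-1 tip pairs naturally with gradient accumulation, a standard pattern in PyTorch training loops: you keep the effective batch size while shrinking per-step activation memory. The specific numbers below are hypothetical, just to show the arithmetic.

```python
# Sketch: trade wall-clock time for VRAM via gradient accumulation.
micro_batch = 1         # what actually fits in VRAM per forward/backward pass
grad_accum_steps = 16   # accumulate this many micro-batches per optimizer step

# Gradients are summed across micro-batches before each optimizer update,
# so the update behaves like one large batch while only one micro-batch's
# activations live in memory at a time.
effective_batch = micro_batch * grad_accum_steps
print(effective_batch)  # 16
```

You pay for this in steps per epoch, not in memory, which is exactly the trade you want on a narrow-bus 16GB card.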


3

Saving this thread


1

Honestly, I would just stick with NVIDIA for this. Anything else is a headache for local training. You really can't go wrong with their midrange stuff because the software support is just better. Before you pull the trigger though... what kind of power supply are you rocking in that mid-tower? Don't want you to buy a card and then realize you need a new PSU too.


1

Building on the earlier suggestion, the memory bus issue is something that has been driving me absolutely insane lately. It's honestly exhausting trying to find a middle ground when NVIDIA keeps gimping the bus width on these mid-range cards. I spent weeks comparing memory bandwidth specs, because it feels like every time we get a win on VRAM capacity, they throttle the speed at which that data actually moves. It makes the whole shopping process a nightmare when you just want to get your work done without feeling overcharged for crippled silicon. If you're serious about those 7B and 13B models, here is why capacity usually wins over speed for training:

  • The memory bus on the RTX 4060 Ti 16GB is definitely narrow at 128-bit, but for LLM training, an OOM error is a hard stop, whereas a slow bus just means the epoch takes longer.
  • An RTX 3060 12GB is the absolute floor for what you're doing. Anything under 12GB is basically a paperweight for fine-tuning.
  • The RTX 3080 10GB GDDR6X is faster for gaming, but that 10GB limit is a dealbreaker for PyTorch LLM workloads. You'll run out of space before the speed ever matters. I went through this same struggle last year and ended up frustrated by how few budget options actually make sense. It's a tough spot, but sticking to 16GB is the only way to stay sane with 13B models.
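To put the bus-width complaint in numbers, here's a rough comparison of how long one full sweep of VRAM takes at each card's published memory bandwidth. The bandwidth figures are the usual published specs; treat the milliseconds as illustrative, not a training benchmark.

```python
# A narrow bus slows every pass; it never causes an out-of-memory crash.
# (VRAM in GB, published memory bandwidth in GB/s)
cards = {
    "RTX 4060 Ti 16GB": (16, 288),  # 128-bit bus
    "RTX 3060 12GB":    (12, 360),  # 192-bit bus
    "RTX 3080 10GB":    (10, 760),  # 320-bit bus
}

for name, (vram_gb, bw_gbps) in cards.items():
    sweep_ms = vram_gb / bw_gbps * 1000  # time to touch all of VRAM once
    print(f"{name}: ~{sweep_ms:.0f} ms per full-VRAM sweep")
```

The 4060 Ti's sweep comes out roughly 4x slower than the 3080's, which is the "slower epoch" cost described above, versus the hard stop of an OOM.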

