
What is the best budget GPU for local LLM training?

6 Posts
7 Users
0 Reactions
56 Views
0
Topic starter

Okay, so I'm hitting a wall here. I have about $450, maybe $500 if I skip takeout for a month, and I need a GPU like yesterday for a research project on fine-tuning small models for medical terminology. I'm based in a small town in Ohio and shipping is taking forever lately, so I need to make a decision tonight to get it by the weekend. My logic was to just grab a used 3060 because everyone says its 12GB of VRAM is the king of budget builds, but then I started looking at the 4060 Ti 16GB and now I'm spiraling.

The problem is, I read that the 4060 Ti has a really narrow memory bus, so even though it has more VRAM it might actually be slower for training passes than the older cards? But some people on Reddit say VRAM capacity is literally the only thing that matters if you don't want out-of-memory errors every five minutes. I looked at RTX 3080s too, but the 10GB version seems like a waste for LLMs even if the raw speed is better, right? I just don't want to spend my entire savings on something that will bottleneck me when I try to run LoRA or QLoRA stuff.

I'm mostly looking at 7B models, maybe 13B if I can squeeze it in. Is the 3060 12GB still the play in 2024, or am I just being cheap? I saw someone mention dual P40s, but I don't have the cooling setup or the patience to mess with server cards and weird drivers, honestly. I just need something that works out of the box with PyTorch and won't catch fire in my mid-tower case while I leave it running overnight. What would you actually put your own money on if you had to start training tomorrow on a shoestring budget?


6 Answers
12

> is the 3060 12gb still the play in 2024 or am i just being cheap?

Honestly, I've had issues with the 3060 lately because 12GB is becoming a real bottleneck even for basic 7B fine-tuning once you want any decent context length. The memory bus on the RTX 4060 Ti 16GB is indeed narrow, which makes it feel sluggish during heavy training passes compared to higher-end silicon. Still, if I had to put my own money down today for a mid-tower setup, I would begrudgingly pick the 4060 Ti 16GB over an EVGA RTX 3060 XC 12GB. Capacity is king for LLMs: that extra 4GB is the difference between actually running a 13B QLoRA and crashing with OOM errors. The 3080 10GB is a waste of time for this specific use case despite the speed. Get the 16GB card and live with the slower bus.
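Since the OOM-vs-capacity point keeps coming up, here's a back-of-envelope sketch of why 13B QLoRA is tight on 12GB. The 0.5 bytes/param figure (4-bit NF4 weights) and the flat overhead numbers are loud assumptions, not measurements; real usage depends on context length, adapter rank, and optimizer.

```python
# Back-of-envelope QLoRA memory check. Assumptions (not measured numbers):
# 4-bit quantized base weights ~= 0.5 bytes per parameter, plus a flat
# overhead for LoRA adapters, optimizer state, and activations, which in
# practice grows with context length.
def qlora_vram_gb(params_billion: float, overhead_gb: float = 3.0) -> float:
    weights_gb = params_billion * 0.5  # 4-bit base model weights
    return weights_gb + overhead_gb

for size in (7, 13):
    # show a short-context (3 GB) vs longer-context (6 GB) overhead guess
    print(f"{size}B: ~{qlora_vram_gb(size, 3.0):.1f}-{qlora_vram_gb(size, 6.0):.1f} GB")
```

On these guesses, a 13B QLoRA sits right at the edge of 12GB once context grows, which is exactly the "extra 4GB" argument above.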


12

Building on the earlier suggestion, you absolutely need to prioritize that VRAM headroom! I saw your post earlier and had to jump in because I love this specific puzzle. Honestly, if you can stretch the budget a tiny bit or find a deal, the RTX 4060 Ti 16GB is a game changer for budget LLM work! Forget the noise about memory bus speed. When you're fine-tuning with LoRA or QLoRA, hitting an out-of-memory error is the absolute worst vibe, and that extra 4GB over the 3060 is massive. It lets you push the context window so much further, which is fantastic for medical data. Plus, the power efficiency is amazing! It stays way cooler in a mid-tower than an old power-hungry 3080 ever would. If you want something that just works with PyTorch out of the box and won't turn your room into a sauna, that's my pick for sure!


3

To add to the point above: the VRAM cap really is the hard ceiling here. I spent years messing with professional server racks before going back to a DIY desktop, and the RTX 4060 Ti 16GB is the only thing that makes sense for 13B models on a budget. The narrow bus hurts, but OOM errors kill your flow far faster than a slow epoch does. Quick tip: drop your batch size to 1 when you hit memory pressure; the bandwidth penalty just makes each pass slower, it never hard-crashes the run.
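The batch-size-1 tip pairs naturally with gradient accumulation, a standard pattern in PyTorch training loops: you keep the effective batch size while shrinking per-step activation memory. The specific numbers below are hypothetical, just to show the arithmetic.

```python
# Sketch: trade wall-clock time for VRAM via gradient accumulation.
micro_batch = 1         # what actually fits in VRAM per forward/backward pass
grad_accum_steps = 16   # accumulate this many micro-batches per optimizer step

# Gradients are summed across micro-batches before each optimizer update,
# so the update behaves like one large batch while only one micro-batch's
# activations live in memory at a time.
effective_batch = micro_batch * grad_accum_steps
print(effective_batch)  # 16
```

You pay for this in steps per epoch, not in memory, which is exactly the trade you want on a narrow-bus 16GB card.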


3

Saving this thread


1

Honestly, I would just stick with NVIDIA for this. Anything else is a headache for local training. You really can't go wrong with their midrange stuff because the software support is just better. Before you pull the trigger though... what kind of power supply are you rocking in that mid-tower? Don't want you to buy a card and then realize you need a new PSU too.


1

Building on the earlier suggestion, the memory bus issue is something that has been driving me absolutely insane lately. It's honestly exhausting trying to find a middle ground when NVIDIA keeps gimping the bus width on these mid-range cards. I spent weeks comparing memory bandwidth specs, because it feels like every time we get a win on VRAM capacity, they throttle the speed at which that data actually moves. It makes the whole shopping process a nightmare when you just want to get your work done without feeling overcharged for crippled silicon. If you're serious about those 7B and 13B models, here is why capacity usually wins over speed for training:

  • The memory bus on the RTX 4060 Ti 16GB is definitely narrow at 128-bit, but for LLM training, an OOM error is a hard stop, whereas a slow bus just means the epoch takes longer.
  • An RTX 3060 12GB is the absolute floor for what you're doing. Anything under 12GB is basically a paperweight for fine-tuning.
  • The RTX 3080 10GB GDDR6X is faster for gaming, but that 10GB limit is a dealbreaker for PyTorch LLM workloads. You'll run out of space before the speed ever matters. I went through this same struggle last year and ended up frustrated by how few budget options actually make sense. It's a tough spot, but sticking to 16GB is the only way to stay sane with 13B models.
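To put the bus-width complaint in numbers, here's a rough comparison of how long one full sweep of VRAM takes at each card's published memory bandwidth. The bandwidth figures are the usual published specs; treat the milliseconds as illustrative, not a training benchmark.

```python
# A narrow bus slows every pass; it never causes an out-of-memory crash.
# (VRAM in GB, published memory bandwidth in GB/s)
cards = {
    "RTX 4060 Ti 16GB": (16, 288),  # 128-bit bus
    "RTX 3060 12GB":    (12, 360),  # 192-bit bus
    "RTX 3080 10GB":    (10, 760),  # 320-bit bus
}

for name, (vram_gb, bw_gbps) in cards.items():
    sweep_ms = vram_gb / bw_gbps * 1000  # time to touch all of VRAM once
    print(f"{name}: ~{sweep_ms:.0f} ms per full-VRAM sweep")
```

The 4060 Ti's sweep comes out roughly 4x slower than the 3080's, which is the "slower epoch" cost described above, versus the hard stop of an OOM.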

