
Running a local LLM with Ollama


Ollama makes it incredibly easy to run large language models locally on your own machine. For this example, we’re going to use the Qwen2.5:1.5b model, which is particularly interesting because it offers good performance while being lightweight enough to run efficiently on most hardware, such as a regular laptop (as in my case).

Just to give more context, I’m currently running this 1.5b model on a roughly 7.5-year-old laptop with an NVIDIA GeForce GTX 960M, which has 4GB of GDDR5 memory, 640 CUDA cores, and 80.16 GB/s of memory bandwidth. The laptop also has 16GB of RAM and a 2.6GHz Intel Core i7, but this is less relevant since the model runs fully on the NVIDIA GPU.
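
If you want to check which GPU and how much VRAM your own machine has before picking a model size, nvidia-smi (shipped with the NVIDIA driver) can report it. A minimal query, assuming an NVIDIA GPU and driver are present:

# Name and total VRAM of the installed NVIDIA GPU
nvidia-smi --query-gpu=name,memory.total --format=csv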

Why Choose a Local LLM?

Running models locally with Ollama offers several advantages:

  • Privacy: Your data never leaves your machine
  • No API Costs: No per-token charges or monthly subscriptions
  • Offline Capability: Works without internet connection
  • Customization: Full control over the model and its responses
  • Low Latency: No network overhead for faster responses

What is Qwen2.5:1.5b?

Qwen2.5:1.5b is a 1.5 billion parameter model from Alibaba Cloud’s Qwen family. Despite its smaller size, it delivers impressive results for tasks like text generation, code completion, and conversational AI while requiring minimal system resources.


Installing Ollama

First, let’s install Ollama on your system:

macOS

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com and run it.
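
Whichever platform you use, you can confirm the install worked from a terminal:

ollama --version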


Running Qwen2.5:1.5b

Once Ollama is installed, running the Qwen2.5:1.5b model is straightforward:

# Pull the model
ollama pull qwen2.5:1.5b

# Run the model
ollama run qwen2.5:1.5b

The first time you run this command, Ollama will download the model (approximately 1GB). Subsequent runs will be much faster.

Interactive Chat

ollama run qwen2.5:1.5b
>>> Hello! How can you help me today?
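
Inside the interactive session, Ollama also accepts slash commands (run /? for the full list). For example, /set verbose prints timing stats after each reply, and /bye exits the session:

>>> /set verbose
>>> /bye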

Single Query

echo "Explain quantum computing in simple terms" | ollama run qwen2.5:1.5b

Code Generation

echo "Write a Python function to calculate fibonacci numbers" | ollama run qwen2.5:1.5b

Ollama also provides a REST API that you can use in your applications:

# Start Ollama server
ollama serve

# Make API calls
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Write a hello world program in Go"
}'
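
By default, /api/generate streams the answer back as a sequence of JSON objects. If you prefer a single JSON response, set "stream": false; and for multi-turn conversations there is an /api/chat endpoint that takes a message history:

# Single JSON response instead of a stream
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Write a hello world program in Go",
  "stream": false
}'

# Chat endpoint with a message history
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "What is the capital of Australia?"}
  ],
  "stream": false
}'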

Performance Tips

  1. System Resources: The 1.5b model runs comfortably on systems with 4GB+ RAM
  2. GPU Acceleration: If you have a compatible GPU, Ollama will automatically use it for faster inference
  3. Model Management: Use ollama list to see installed models and ollama rm <model> to remove unused ones, as shown below
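
A quick reference for the model management commands mentioned above (ollama ps is also handy to see what is currently loaded):

ollama list              # installed models and their sizes on disk
ollama ps                # models currently loaded into memory
ollama rm qwen2.5:1.5b   # remove a model you no longer need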

More:

Here is an example of the output, using --verbose. As you can see, it took about 10 seconds to answer a simple question, although most of that was the one-off 8.7s model load; the eval rate was 16.74 tokens per second. Modern NVIDIA cards can generate at ~200 tokens per second, however, that would cost you around $3000 for a single video card :)

ollama run qwen2.5:1.5b "What is the capital of Australia?" --verbose

The capital city of Australia is Canberra.

total duration:       10.0650759s
load duration:        8.6905296s
prompt eval count:    36 token(s)
prompt eval duration: 824.9321ms
prompt eval rate:     43.64 tokens/s
eval count:           9 token(s)
eval duration:        537.5474ms
eval rate:            16.74 tokens/s

Here is an example on another laptop, a MacBook Pro M1, and as you can see it is significantly faster, with an eval rate of 86.54 tokens per second.

ollama run qwen2.5:1.5b "What is the capital of Australia?" --verbose

The capital of Australia is Canberra.

total duration:       1.137187459s
load duration:        940.9695ms
prompt eval count:    36 token(s)
prompt eval duration: 102.911875ms
prompt eval rate:     349.81 tokens/s
eval count:           8 token(s)
eval duration:        92.438959ms
eval rate:            86.54 tokens/s

Why is the MacBook faster than a dedicated GeForce GTX 960M video card?

  • The M1 has a unified memory architecture (with up to 400GB/s of bandwidth on the M1 Max) and can hold the entire 1.5B parameter model in fast unified memory, while the GTX 960M, with only 4GB of VRAM, must shuffle data between slow system RAM and VRAM as soon as a model outgrows it.

To give more context, here is the same question, but now running on the old laptop (NVIDIA GeForce GTX 960M) with the qwen3:8b model, which at roughly 5GB no longer fits in the 960M’s 4GB of VRAM and partially spills into system RAM:

ollama run qwen3:8b "What is the capital of Australia?" --verbose

Thinking...
Okay, the user is asking for the capital of Australia. Let me recall what I know about Australian geography.
Australia is a country, and I remember that its capital is a city. Wait, isn't it Canberra? But I should make sure
I'm not confusing it with another city. Let me think. Sydney is a major city, but that's the largest city, not the
capital. Melbourne is another big city, but again, not the capital. Then there's Brisbane, Perth, Adelaide, etc. So
the capital is Canberra. Wait, why is it called Canberra? I think it's named after the British prime minister,
maybe? Or perhaps it's a combination of two words. Oh right, it's named after the Australian states of New South
Wales and Victoria, combining parts of their names. But the key point is that the capital is Canberra. I should
confirm that there's no other city that's the capital. Sometimes people might confuse it with Sydney, but Sydney is
the largest city. So the answer is Canberra. Let me just double-check in my mind. Yes, the capital of Australia is
indeed Canberra. I think that's correct. No, wait, maybe I should mention that it's the seat of government, and that
it's not the largest city. That way, the user can understand the difference between the capital and the largest
city. Alright, I'm confident that the answer is Canberra.
...done thinking.

The capital of Australia is **Canberra**. It is a planned city located in the Australian Capital Territory (ACT),
which is a federal territory. While Sydney is the largest city and a major economic hub, Canberra serves as the
political and administrative center of the country.

total duration:       1m52.1215941s
load duration:        63.3456ms
prompt eval count:    17 token(s)
prompt eval duration: 2.8612691s
prompt eval rate:     5.94 tokens/s
eval count:           346 token(s)
eval duration:        1m49.1950338s
eval rate:            3.17 tokens/s 
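
A simple way to confirm where a model ended up is to run ollama ps in another terminal while the model is still loaded; its PROCESSOR column reports the split (e.g. 100% GPU, or a mixed CPU/GPU split once a model spills into system RAM):

# Shows loaded models and how they are split between CPU and GPU
ollama ps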

Cheers,
Rodolfo