71.01 Nvidia Jetson and Ollama

How to get Ollama up and running on an Nvidia Xavier NX


Photo by DAVIS VARGAS on Unsplash

Running a local LLM on a Single-Board Computer (SBC) at 10 W is fun and cheap. The results? Good enough (for fun, not for profit). This is a personal reminder of how to get it done.

Intro - Skip this

After my Master's degree I wanted to try some ML models, especially those I could write about. Since my laptop wasn't powerful enough, I bought a Xavier developer kit with some of the money I earned from tech writing.

Years later, I re-discovered the SBC gathering dust in my apartment and decided to give it a new life as my personal LLM server. Ollama, Docker and open source models let me have fun for a weekend.

Flashing the Nvidia Xavier

The SBC works out of the box with an SD card as primary storage and boot device. This is an easy way to start, but it's terribly slow and storage space is limited. Instead, since Linux for Tegra (L4T) version 4.4 you can install an NVMe disk directly on the board and use it as the primary boot device.

To flash the latest supported L4T (35.5.0 for the Xavier, but check your model) onto the SBC you can follow the standard procedure, or take the fast and simple route using the JetsonHacks GitHub repo instructions.

Important: the JetsonHacks repo flashes JetPack 5.1.2. If you are using an NVMe disk it will be partitioned as a 16 GB disk; to use the full space you have to edit the nvsdkmanager_flash.sh file to use flash_l4t_external.xml instead of flash_l4t_nvme.xml (lines 136 and 141) (src).
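
If you want to script that tweak, here is a minimal sketch, assuming the JetsonHacks scripts have already fetched the L4T tooling (the location of nvsdkmanager_flash.sh may differ in your tree):

# Swap the 16 GB NVMe layout for the full-disk one (affects lines 136 and 141)
sed -i 's/flash_l4t_nvme\.xml/flash_l4t_external\.xml/g' nvsdkmanager_flash.sh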

![Remember to set Recovery Mode](img/posts/xavier-pins.png "Nvidia Xavier recovery mode")

After flashing, the Ubuntu 20.04 installation will ask you to set the size of the main partition.

Things to do after flashing

Install CUDA packages

Install CUDA and other ML tooling

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda-repo-ubuntu2004-12-3-local_12.3.2-545.23.08-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-3-local_12.3.2-545.23.08-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
sudo apt-get install \
 cuda-toolkit-12-3 \
 nvidia-jetpack \
 python3-libnvinfer-dev \
 python2.7-dev \
 python-dev \
 python-py \
 python-attr \
 python-funcsigs \
 python-pluggy \
 python-pytest \
 python-six \
 uff-converter-tf \
 libtbb-dev
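
To sanity-check the toolkit afterwards, you can call nvcc directly; the path below assumes the CUDA 12.3 packages installed above:

# nvcc is not on PATH by default
/usr/local/cuda-12.3/bin/nvcc --version
export PATH=/usr/local/cuda-12.3/bin:$PATH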

Set up Wake-on-LAN

Create a service descriptor so the machine can be woken remotely after suspending it (sudo systemctl suspend). Put the following contents in the file /etc/systemd/system/wol.service:

[Unit]
Description=Enable Wake On Lan

[Service]
Type=oneshot
ExecStart = /sbin/ethtool --change eth0 wol g

[Install]
WantedBy=basic.target

Set up the service:

sudo systemctl daemon-reload
sudo systemctl enable wol.service
sudo systemctl status wol
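
With the service enabled you can suspend the board and wake it from another machine on the LAN. A quick sketch, assuming the wakeonlan package and a placeholder MAC address (read the real one from ip link show eth0 on the Jetson):

# run on any other host in the same network
sudo apt install wakeonlan
wakeonlan AA:BB:CC:DD:EE:FF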

Use jetson-stats

Install jtop

sudo apt install python3-pip
sudo pip3 install -U jetson-stats
sudo reboot now

After installing, a good CPU/GPU/power balance for performance is mode 5, which gives you 4 cores (1.9 GHz) / 3 GPU TPCs (510 MHz) / 10 W with quiet cooling. You can set it with jtop or from the CLI:

sudo nvpmodel -m 5
sudo jetson_clocks

In my personal tests, using the GPU to serve the Ollama LLMs required setting the cooling to manual at 80% or more (5051 RPM).
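
jtop can set the fan interactively; if you prefer the shell, here is a sketch that should work on JetPack 5 (the sysfs path is an assumption and varies between releases, so verify it on your board):

# ~80% duty cycle (204 out of 255); path may differ on your JetPack release
echo 204 | sudo tee /sys/devices/platform/pwm-fan/hwmon/hwmon*/pwm1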

Running Ollama

Nvidia introduced Jetson containers as part of its cloud-native strategy: they allow running containers that use the GPU (discrete cards and the onboard GPU) to accelerate execution. For Jetson, the jetson-containers GitHub repo maintains a series of ML/AI containers compatible with several L4T kernels.

Start by installing the containers support:

git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

Then you can run several tools; in our case, the Ollama container as a daemon:

jetson-containers run -d --name ollama dustynv/ollama:r35.4.1
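
Once the container is up you can pull and try a model through it. tinyllama below is just an example of a small model that fits in the 8 GB of shared memory; swap in whatever you prefer:

docker exec -it ollama ollama pull tinyllama
docker exec -it ollama ollama run tinyllama "Hello there"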

The run command can be set up as a systemd service by creating a file /etc/systemd/system/ollama.service with the following contents:

[Unit]
Description=Starts Ollama docker server

[Service]
User=username
WorkingDirectory=/home/username
Restart=always
RestartSec=10
#Type=oneshot
Environment="ARGS=-d --name ollama dustynv/ollama:r35.4.1"
ExecStart = jetson-containers run $ARGS

[Install]
WantedBy=basic.target

And enabling the service:

sudo systemctl daemon-reload
sudo systemctl enable ollama.service
sudo systemctl status ollama

Demo

Once you have installed all the dependencies, a simple call to the Ollama API results in heavy GPU usage:

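For reference, a minimal call to the standard Ollama HTTP API (default port 11434; tinyllama stands in for whichever model you pulled) will light up the GPU in jtop:

curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Why is the sky blue?",
  "stream": false
}'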

Conclusion

Despite being old hardware (6 years?), the Nvidia Xavier (8 GB) handles the new small LLMs. Next time I'll set up some benchmarks to check speed and quality(?).