Ask HN: Any way to borrow compute from Apple M1
49 points by 2Gkashmiri 8 months ago | 45 comments
Hi.

I have a friend who owns an M1 Max. I would like to "borrow" his GPU for Llama 3 or SD. Is there a way for me to use his compute when it's idle? I do not want to remote into his machine; an easy local API would be fine (I could use Tailscale/ZeroTier) and then access the API that way.




For Llama 3, just ask him to install Ollama and serve the model. Ollama manages memory automatically and frees the model when it isn't being used; whenever you make a call to the API (do let your friend know before you do this), Ollama will load the model back into memory.

Not sure whether there's anything similar for SD, though.
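
For the Llama side, calling the API remotely is just an HTTP POST. A minimal client-side sketch in Python, assuming Ollama is reachable over Tailscale (the hostname below is a placeholder, and the friend may need to start the server with OLLAMA_HOST=0.0.0.0 since Ollama binds to localhost by default):

    import requests

    # Placeholder Tailscale hostname; Ollama's default port is 11434.
    url = "http://friends-m1-max:11434/api/generate"
    payload = {
        "model": "llama3",  # whatever tag the friend has pulled
        "prompt": "Explain unified memory in one paragraph.",
        "stream": False,    # return a single JSON object instead of a stream
    }

    resp = requests.post(url, json=payload, timeout=300)
    resp.raise_for_status()
    print(resp.json()["response"])

The model gets loaded on the first request after being idle, so expect that first call to take noticeably longer.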


This, plus connect via Tailscale and you can access it from anywhere (assuming your friend's laptop is online).


There are tons of options - https://github.com/anderspitman/awesome-tunneling. I will advocate for zrok.io as I work on its parent project, OpenZiti. zrok is open source and has a free SaaS.


zrok (https://zrok.io/), an alternative to ngrok, does access management too. It's like Tailscale but can give access to a specific service.


> i do not want to remote into his machine

> tailscale/zerotier

Same thing, isn't it?

In any case it wouldn't be hard for you to just have an account on his machine, tailscale being perhaps the simplest setup. SSH in, cook his laptop at your leisure.


With Tailscale you could access just the port serving the model's API (presumably via Ollama or similar), so the friend wouldn't have to grant any access beyond that.


I had to do this fairly recently to make krita-diffusion available for my friends and family who don't have a 3090 Ti lying around. Probably the simplest way would be to run a local HTTP service on your friend's M1 that is SSH-tunneled to a server that you'll access over HTTP. On the server you'll need to reverse-proxy the tunneled port to a public address and port.

You make HTTP requests to the shared server, those get proxied via the SSH tunnel to his machine, and the client on his machine can decide when/whether to run the workload; a rough sketch of that client is below.
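
Here's that client piece sketched in Python (everything here is a placeholder: the port, the load threshold, and run_workload standing in for the actual SD/LLM call), using a plain stdlib HTTP server bound to localhost so only the SSH tunnel can reach it:

    import json
    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    MAX_LOAD = 2.0  # assumed "machine is busy" threshold

    def machine_is_idle() -> bool:
        # 1-minute load average; os.getloadavg() works on macOS/Linux.
        return os.getloadavg()[0] < MAX_LOAD

    def run_workload(request_body: bytes) -> dict:
        # Placeholder: forward to a local Ollama/SD endpoint, shell out to a script, etc.
        return {"status": "ok"}

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            if not machine_is_idle():
                self.send_response(503)  # busy right now; caller should retry later
                self.end_headers()
                return
            result = run_workload(body)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(result).encode())

    if __name__ == "__main__":
        # Bind to localhost only; the SSH reverse tunnel exposes it to the shared server.
        HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()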


The M1 can't really handle SD: inference times are closer to a minute, and with SDXL you can feel the machine straining under it. The battery depletes quickly, and the machine often freezes up completely for a second if you're trying to do other things at the same time (M1 Max, 32GB).

Think you'd be way better off just paying for a service designed for this, or renting a GPU from a provider set up for it; the cost won't be that significant.


Use Ollama's API.


Or OpenWebUI on top of it if you want an acceptable UI experience.


cool


This. Works fine. The M1 can run most small models (phi3, gemma, etc.) at usable speeds even with just 8GB of RAM.


This! Tailscale plus Ollama API will definitely do the job


An off-topic question: are Apple's M-series chips any good at current AI/ML work? How do they compare with dedicated GPUs?


I have an M3 chip in my laptop; it has more memory than my 4090, but it's still way slower at inference. So as long as the model fits in memory, Nvidia GPUs are going to be way faster just because they have more/faster compute cores.

Of course, if the model fits in memory on your M chip and doesn't on your Nvidia card, the M chip wins by default. However, I would say, if you load a 70B model on your M chip, while it WILL work, the tokens/sec will be slow as fuck... so it kinda doesn't matter anyway.
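
Back-of-the-envelope, since decoding is mostly memory-bandwidth-bound: every generated token has to stream the whole model through memory, so bandwidth divided by model size gives a rough ceiling on tokens/sec. The figures below are assumptions for illustration, not benchmarks:

    model_size_gb = 40    # ~70B parameters at 4-bit quantization (assumed)
    m_chip_bw_gbs = 400   # M1 Max-class unified memory bandwidth, approx.
    gpu_bw_gbs = 1000     # 4090-class GDDR6X bandwidth, approx.

    print(f"M chip ceiling: ~{m_chip_bw_gbs / model_size_gb:.0f} tokens/s")
    print(f"GPU ceiling:    ~{gpu_bw_gbs / model_size_gb:.0f} tokens/s")

Real throughput lands below those ceilings, which is roughly why a 70B model "works" on the M chip but crawls.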


The latest Nvidia drivers offer an option to start using system memory when VRAM is insufficient. It certainly slows things down, but it does work. It's not perfect in my experience, but it is an alternative for large models.


Sounds great! Do you have a source, maybe a wiki?



How does LLM inference impact battery life? For SD it can be 5% battery per image at times.


Mid-tier gaming GPU performance, but (potentially) access to gobs of memory (if running on a host with gobs of memory) owing to the unified memory design. For certain use cases which require loading huge datasets but don't necessarily require massive compute (i.e. inference on large models) they can be cost-competitive relative to something like an H100.


Compared to a GPU like a 4090 with equal VRAM it probably won't fare well, as others point out, but it far outperforms any CPU. On an M1 Ultra MacBook Pro I was seeing around 40 tokens/second with llama3:7B vs 9 tokens/second on various Intel servers/desktops with sufficient RAM.


I've only tried it on my M1, running Llama-3 via Ollama. It works, but it's slow to the point where it's not really usable. Maybe there are smaller models you can run that will perform better.


What size model did you try, and how much memory does your M1 have? See my other comment; my experience has been that llama3 was very fast on an M1.


I was just running `ollama run llama3`, so that would be 8B parameters, on an 8GB M1 Air.

Maybe it's just my expectations, but it seems rather slow to process queries: somewhere between 10 and 40 tokens per second, though that very much depends on the prompt.

My main complaint is the time between when the prompt is entered and when output starts.
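
If you want to separate that initial wait (prompt processing) from generation speed, a minimal sketch against Ollama's streaming API can time the two, assuming the default local port and the llama3 tag:

    import json
    import time

    import requests

    url = "http://localhost:11434/api/generate"  # assumed: default local Ollama
    payload = {"model": "llama3", "prompt": "Why is the sky blue?", "stream": True}

    start = time.time()
    first_token_at = None
    chunks = 0  # each streamed chunk is roughly one token

    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            part = json.loads(line)
            if part.get("response"):
                chunks += 1
                if first_token_at is None:
                    first_token_at = time.time()
            if part.get("done"):
                break

    if first_token_at is not None:
        print(f"time to first token: {first_token_at - start:.2f}s")
        print(f"~{chunks / (time.time() - first_token_at):.1f} tokens/s after that")

On an 8GB machine, part of that initial delay can also just be the model being loaded (or swapped back in), which Ollama does lazily.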


If Apple offers these in a data center that's open access, it's game over for NVDA.


Based on which fantasy premise?


The premise where these are readily available for mass purchase, have a hardware and software stack that already works reliably, and have a lower energy footprint than other offerings

and somewhat competitive on cost, but that won't be the main selling point, just availability.


For the purposes we're discussing, they're nowhere near competitive with Nvidia.


Lower energy, but they're also slow. You'd need to stack up enough of them to reach the same performance level and then consider the purchase and running costs. I bet it's not close on that.

Availability, maybe, but as you've noted, there's zero availability for data center environments. Those volumes would also then fall onto TSMC/Samsung/whoever, where Nvidia is stuck as well.


I'm aware Apple is beholden to TSMC's capacity too,

sadly


They aren't even in the same league, computing-wise.


This was bouncing around the last few days and could work if you have a few devices as well as the M1 (though I'm not sure it's able to work over the internet as opposed to a local network): https://github.com/exo-explore/exo

Otherwise set up Ollama's API


You mean the llama.cpp server API, right? Ollama keeps taking credit for things it puts a thin wrapper around, and it's seriously annoying.


I don't, in the same way I don't say that I'm taking an internal-combustion-engine bus to get somewhere; what powers the bus is not relevant to the solution.


What’s the llama.cpp CLI equivalent of `ollama pull`?


wget <huggingface link>?


I feel like Ollama just came out, and now y'all are doing model-based laptop resource sharing.

Should I take this as an indicator that embedded GenAI is moving quite quickly?

(Also just wanted to say I find this thread incredibly cool generally, some very interesting stuff going on!!! :D )


The entire point of embedded models is that you can run them locally. If it's going to take an internet round trip anyway, then what's the point of connecting to your friend's laptop over a cloud GPU or a managed service like ChatGPT-4o?


Presumably a cloud GPU is not $0?


It won't cost actual money?


If you are on the same network, try https://pinokio.computer


With "borrow" in scare quotes, do you intend for him to be aware of his generosity?


Probably just because in the full phrasing, "i would like to 'borrow' his gpu", omitting the scare quotes would paint the picture of their friend unplugging/unsoldering their GPU and lending it to the author.


My friend Kevin was once building a graphics engine, and to test it on various GPUs he'd borrow them from Best Buy and return them within the 30-day window. Since there was a restocking fee, it seemed like a non-harmful and clever way to test a bunch of configurations.


so Kevin is why we can't have nice things....



