Three.js From Zero · Article 14

Hand-Tracked Floating Windows — Vision OS in Your Browser

A wall of macOS-style app windows (Safari, Xcode, Music, TV, Pages, Keynote, Numbers, Photos) floating in 3D. Webcam-driven hand tracking via MediaPipe — hover to highlight, pinch+drag to move, two-hand pinch-spread to resize, air-tap to open/close. Plus toolbar toggles, expandable camera-with-skeleton mini-cam, and scroll-wheel fallback.

← threejs-from-zero Project P14 Viral Build
Project P14 · Three.js From Zero · Single-Page Build

A wall of macOS-style app windows — Safari, Xcode, Music, TV, Pages, Keynote, Numbers, Photos — floating in 3D. Your webcam sees your hands. Hover to highlight. Pinch + drag to move and reorder the stack. Pinch + spread with both hands to resize. Quick pinch on any window — air-tap — and it opens to focus mode; the others dim and push back. Air-tap empty space (or hit ESC) to close. Mouse + scroll-wheel work too. Zero installs. Zero servers. One page.

VISION OS · HAND TRACKED
Mode: mouse camera off
Hand:
Pinch:
Hover:
Open:
FPS:
👋 Hold your hand in front of the camera (palm toward the lens, ~30 cm away, well lit).
Then: pinch + hold to drag, quick pinch to open / close, both hands pinching to resize. ESC closes.

↑ This is the entire project running on this page. No build step. View source to see all 900 lines.

STEP 01The shape of the thing

We're shipping two layers that have to dance together. Each one is simple on its own. It's the interface between them that earns the wow.

  1. A 3D scene with eight Mac-style productivity windows in a soft 4×2 grid. Each is a plane carrying a CanvasTexture with animated content — Safari (Apple Newsroom), Xcode (SwiftUI typing itself line by line), Music (Apple Music with a live waveform), TV (Foundation banner + Trending Shows grid), Pages (a document typing itself), Keynote (slide thumbnails + a pulsing live slide), Numbers (Q4 forecast spreadsheet with a growth bar chart), and Photos (12-tile library).
  2. A hand tracker driven by MediaPipe's Hand Landmarker (a ~10 MB ML model that loads from a CDN and runs entirely on-device via WebGL). It hands us 21 landmark points per frame at 30+ fps.

The bridge between them is a single THREE.Vector3 — the hand cursor. Its position comes from landmark 8 (index fingertip). Its pinch state comes from the distance between landmarks 4 (thumb tip) and 8. Smooth it with a One-Euro filter, derive a depth from hand size, drive everything off it. The rest is polish.

The 21-landmark model

IndexLandmarkWhy it matters here
0WristStable anchor for palm orientation if you need it
4Thumb tipHalf of the pinch pair
5Index MCP (knuckle)Hand size proxy → depth estimate
8Index fingertipThe cursor. Other half of pinch.
12, 16, 20Middle / Ring / Pinky tipsFuture gestures (peace sign, fist, etc.)

STEP 02Scene, camera, renderer, lights

Nothing exotic. A perspective camera at (0, 0, 3.5). A canvas inside the demo container. antialias: true. SRGBColorSpace. The background is a soft purple gradient — easy with scene.background set to null and a CSS gradient on the wrapper, which is what we do on this page.

import * as THREE from 'three';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(55, 16/10, 0.1, 50);
camera.position.set(0, 0, 3.5);

const renderer = new THREE.WebGLRenderer({
  canvas: document.getElementById('scene'),
  antialias: true,
  alpha: true,
});
renderer.setPixelRatio(Math.min(devicePixelRatio, 2));
renderer.outputColorSpace = THREE.SRGBColorSpace;

// Soft ambient + a single key light — windows mostly carry their own colour
// via the CanvasTexture, so lighting is just for the subtle highlights.
scene.add(new THREE.AmbientLight(0xffffff, 0.6));
const key = new THREE.DirectionalLight(0xc084fc, 0.5);
key.position.set(2, 3, 5);
scene.add(key);

We pick FOV 55° deliberately — wider than cinema, narrower than a fisheye. Fits the 4×2 wall at z = 3.5 with breathing room.

STEP 03The mock window — a plane with a painted canvas

A window is the simplest possible thing: a PlaneGeometry wearing a MeshBasicMaterial whose map is a CanvasTexture. The texture is a 2D canvas we draw into per frame for animated content (clock, code, waveform), once for static content (photos, weather).

function createWindow({ width, height, draw, animated }) {
  const canvas = document.createElement('canvas');
  canvas.width = 1024;
  canvas.height = Math.round(1024 * height / width);

  const texture = new THREE.CanvasTexture(canvas);
  texture.colorSpace = THREE.SRGBColorSpace;
  texture.anisotropy = 4;

  const mat = new THREE.MeshBasicMaterial({
    map: texture,
    transparent: true,
  });

  const mesh = new THREE.Mesh(
    new THREE.PlaneGeometry(width, height),
    mat,
  );
  mesh.userData = { canvas, ctx: canvas.getContext('2d'), texture, draw, animated };
  return mesh;
}

The trick: every window has its own draw(ctx, w, h, time, state) function. Inside that function we paint a rounded background, a title bar, traffic-light buttons, and the window's unique content. Look at the source on this page — drawBrowser, drawCode, drawMusic, drawTerminal, drawWeather, drawPhotos, drawClock, drawMail — each is fifty to a hundred lines of native 2D canvas. No frameworks, no DOM.

Why CanvasTexture and not iframes? An iframe is a separate document that you can not render into a WebGL texture without painful security plumbing. A 2D canvas is just pixels we own, painted however we like, and uploaded to the GPU once per frame. For mock content, it wins by a mile.

Drawing a rounded glass window background

function roundRect(ctx, x, y, w, h, r) {
  ctx.beginPath();
  ctx.moveTo(x + r, y);
  ctx.arcTo(x + w, y, x + w, y + h, r);
  ctx.arcTo(x + w, y + h, x, y + h, r);
  ctx.arcTo(x, y + h, x, y, r);
  ctx.arcTo(x, y, x + w, y, r);
  ctx.closePath();
}

function drawFrame(ctx, w, h, accent) {
  ctx.clearRect(0, 0, w, h);
  // Glassy body
  const grad = ctx.createLinearGradient(0, 0, 0, h);
  grad.addColorStop(0, 'rgba(28, 24, 44, 0.92)');
  grad.addColorStop(1, 'rgba(15, 12, 26, 0.92)');
  ctx.fillStyle = grad;
  roundRect(ctx, 0, 0, w, h, 36);
  ctx.fill();

  // Top hairline highlight = "glass edge"
  ctx.strokeStyle = 'rgba(255, 255, 255, 0.12)';
  ctx.lineWidth = 2;
  ctx.stroke();

  // Title bar
  ctx.fillStyle = 'rgba(255, 255, 255, 0.04)';
  roundRect(ctx, 0, 0, w, 70, 36);
  ctx.fill();

  // Three traffic lights
  ['#ff5f57', '#febc2e', '#28c840'].forEach((c, i) => {
    ctx.fillStyle = c;
    ctx.beginPath();
    ctx.arc(36 + i * 28, 36, 10, 0, Math.PI * 2);
    ctx.fill();
  });

  // Accent pill (top right) — gives the window its identity colour
  ctx.fillStyle = accent;
  ctx.fillRect(w - 96, 28, 52, 4);
}

Every drawX(ctx,...) calls drawFrame first, then paints its unique content into the body region below the title bar.

STEP 04Loading MediaPipe — the whole ML model in one CDN line

MediaPipe Tasks Vision ships a HandLandmarker that runs entirely in your browser. No server. No upload. The model is ~10 MB and gets cached by the browser after the first load.

// Click handler on the Start Camera button:
async function startCamera() {
  // Dynamic import — MediaPipe is heavy, don't pay for it until the user opts in.
  const mp = await import(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.14/vision_bundle.mjs'
  );
  const vision = await mp.FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.14/wasm'
  );

  handLandmarker = await mp.HandLandmarker.createFromOptions(vision, {
    baseOptions: {
      modelAssetPath:
        'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/1/hand_landmarker.task',
      delegate: 'GPU',   // 2–3× faster than CPU on most hardware
    },
    runningMode: 'VIDEO',
    numHands: 1,
    minHandDetectionConfidence: 0.5,
    minHandPresenceConfidence: 0.5,
    minTrackingConfidence: 0.5,
  });

  // Now grab the webcam.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { width: 640, height: 480, facingMode: 'user' },
    audio: false,
  });
  videoEl.srcObject = stream;
  await videoEl.play();

  detectLoop();
}

Notice numHands: 1. We only need one. The model can do two but each one costs another ~5 ms per frame.

The detection loop itself is one line plus a guard:

function detectLoop() {
  if (videoEl.readyState >= 2) {
    const result = handLandmarker.detectForVideo(videoEl, performance.now());
    if (result.landmarks.length > 0) {
      onHand(result.landmarks[0]);    // 21 points
    } else {
      onNoHand();
    }
  }
  requestAnimationFrame(detectLoop);
}

STEP 05From landmark to 3D cursor

MediaPipe gives you each landmark as { x, y, z } in image-normalized coordinates. x goes right, y goes down, z is relative depth (smaller = closer to camera). Origin top-left, all in [0, 1].

Three.js wants world coordinates where y goes up and the visible area is bounded by the camera's frustum. Two conversions to remember:

  1. Mirror x. The webcam image isn't mirrored by default — but humans expect their right hand to appear on the right. So we flip: worldX = (1 - 2·lm.x) · halfFrustumWidth.
  2. Flip y. worldY = (1 - 2·lm.y) · halfFrustumHeight.

The frustum size at z = 0 for a perspective camera at distance d with vertical FOV fov:

function frustumAt(z) {
  const d = camera.position.z - z;
  const h = 2 * d * Math.tan((camera.fov * Math.PI / 180) / 2);
  const w = h * camera.aspect;
  return { w, h };
}

function handToWorld(lm8) {
  const { w, h } = frustumAt(0);
  return new THREE.Vector3(
    (1 - 2 * lm8.x) * w / 2,
    (1 - 2 * lm8.y) * h / 2,
    0,    // for now — we'll add depth in step 6
  );
}

That's the entire bridge. Hand goes in, world position comes out.

STEP 06One-Euro filter — the secret to "feels real"

Raw MediaPipe output jitters by 2–3 pixels per frame even when your hand is dead still. Drop it on a 3D cursor and the cursor looks drunk. Naive lerp (prev = prev + (raw - prev) * 0.3) fixes jitter but introduces lag — fast motion feels rubbery.

The One-Euro filter (Casiez et al., 2012) is the canonical fix. It uses a low-pass that adapts its cutoff based on velocity: still hand = heavy smoothing, fast hand = barely any. Forty lines, zero dependencies, transforms the entire feel of the demo.

class OneEuroFilter {
  constructor(minCutoff = 1.0, beta = 0.0, dCutoff = 1.0) {
    this.minCutoff = minCutoff;
    this.beta = beta;
    this.dCutoff = dCutoff;
    this.x = null;
    this.dx = 0;
    this.lastTime = null;
  }
  filter(value, time) {
    if (this.lastTime === null) {
      this.x = value; this.lastTime = time; return value;
    }
    const dt = Math.max((time - this.lastTime) / 1000, 1e-6);
    this.lastTime = time;
    const dxRaw = (value - this.x) / dt;
    const aD = this.alpha(dt, this.dCutoff);
    this.dx = aD * dxRaw + (1 - aD) * this.dx;
    const cutoff = this.minCutoff + this.beta * Math.abs(this.dx);
    const a = this.alpha(dt, cutoff);
    this.x = a * value + (1 - a) * this.x;
    return this.x;
  }
  alpha(dt, cutoff) {
    const r = 2 * Math.PI * cutoff * dt;
    return r / (r + 1);
  }
}

// One filter per axis
const fx = new OneEuroFilter(1.4, 0.015);
const fy = new OneEuroFilter(1.4, 0.015);
const fz = new OneEuroFilter(1.2, 0.010);

Tune minCutoff to control "still hand" smoothness (lower = smoother). Tune beta to control "fast hand" responsiveness (higher = snappier). Our defaults (1.4 / 0.015) are intentionally smooth — you can crank beta to 0.05 for snappier flicks.

STEP 07Pinch detection — the gesture that earns the demo

The whole interaction model rests on one number: the 3D distance between landmark 4 (thumb tip) and landmark 8 (index tip), normalized by hand size so it works regardless of how close your hand is to the camera.

function pinchAmount(landmarks) {
  const a = landmarks[4];   // thumb tip
  const b = landmarks[8];   // index tip
  const dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  const raw = Math.sqrt(dx*dx + dy*dy + dz*dz);

  // Normalize by hand size = distance from wrist (0) to index MCP (5)
  const wrist = landmarks[0], mcp = landmarks[5];
  const handSize = Math.hypot(wrist.x - mcp.x, wrist.y - mcp.y, wrist.z - mcp.z);

  return raw / handSize;   // 0 = fingers touching, ~1 = open hand
}

We don't want a binary on/off — we want hysteresis. A clean pinch fires when the ratio drops below 0.30. It releases when it climbs back above 0.45. The gap between thresholds is what stops jittery on/off flickering when your fingers are hovering near the threshold:

let pinching = false;
function updatePinch(amount) {
  if (!pinching && amount < 0.30) pinching = true;
  else if (pinching && amount > 0.45) pinching = false;
  return pinching;
}
The hysteresis trick is everywhere in UX. Mouse drag thresholds, double-click windows, gamepad trigger pull — anywhere a continuous input drives a discrete state, use two thresholds with a gap. Otherwise the boundary jitters and the interaction feels broken.

STEP 08Hover, grab, drag — the state machine

Three states. One window pointer.

StateHand 1Hand 2What happens
IDLEopenClosest window scales to 1.22×, soft halo. Others rest.
GRABclosed (just transitioned)Particle burst. Window locks to hand 1, render-order maxed. homePos.z bumps above all others (rerank).
DRAGclosed (held + moved)Window XY follows hand 1, smoothed.
SCALEclosedclosedWindow resizes to baseScale × (interHandDist / startDist), clamped [0.4, 3.0]. Position follows midpoint of both hands.
OPENquick pinch + release (no drift)Air-tap. Window centres + zooms to 2.4×, others dim to 30% and push back in z. ESC or a second air-tap closes.
RELEASEopenanyDrag commit — window stays at new XY/scale, with a small "release" burst.

Closest-window selection is a simple linear scan — eight windows, no spatial index needed.

function findHovered(cursor) {
  let best = null, bestDist = SELECT_RADIUS;
  for (const w of windows) {
    const d = Math.hypot(w.position.x - cursor.x, w.position.y - cursor.y);
    if (d < bestDist) { best = w; bestDist = d; }
  }
  return best;
}

The grab transition fires exactly once — on the frame pinch goes from open to closed. That's where you spawn the particle burst:

const wasPinching = pinching;
pinching = updatePinch(pinchAmount(lm));

if (!wasPinching && pinching && hovered) {
  grabbed = hovered;
  grabOffset.subVectors(grabbed.position, cursor);
  spawnParticleBurst(cursor, hovered.userData.accent);
} else if (wasPinching && !pinching) {
  grabbed = null;
}

if (grabbed) {
  // Drag — follow cursor, keep z forward
  grabbed.userData.targetPos.set(
    cursor.x + grabOffset.x,
    cursor.y + grabOffset.y,
    0.6,
  );
}

Notice targetPos — every window animates to a target each frame (position.lerp(targetPos, 0.18)). Direct assignment would snap and feel rigid. Lerping at 0.18 is the sweet spot: snappy enough to feel responsive, smooth enough to look natural.

Air-tap → open / close

The same pinch gesture has to mean two different things: a slow pinch (held + moved) is a drag, and a quick pinch (closed and reopened in under ~280 ms without moving) is an air-tap. The classifier is two numbers:

const AIR_TAP_MAX_MS    = 280;
const AIR_TAP_MAX_DRIFT = 0.10;   // world units

// On pinch-down:
pinchStartTime = performance.now();
pinchStartPos.copy(cursor);
pinchMaxDrift  = 0;

// Every frame while pinching:
if (pinching) {
  pinchMaxDrift = Math.max(pinchMaxDrift, cursor.distanceTo(pinchStartPos));
}

// On pinch-up:
const dur = performance.now() - pinchStartTime;
const wasAirTap = dur < AIR_TAP_MAX_MS && pinchMaxDrift < AIR_TAP_MAX_DRIFT;

If wasAirTap is true, we don't commit a drag — we toggle the window's openState. If the air-tap landed on empty space (no window hovered), we use it to close whatever's currently open instead. Same gesture, dual purpose, no button needed.

if (wasAirTap) {
  if (grabbed)            toggleOpen(grabbed);
  else if (openedWindow)  toggleOpen(openedWindow);  // air-tap empty = close
} else if (grabbed) {
  // regular release — commit the new position
}

The open pose

"Open" isn't fullscreen — it's a focus mode. The opened window animates to (0, 0, 0.6) at scale ~2.4×. Every other window dims to 30% opacity and pushes back in z. The whole scene gives the focused app the spotlight, Mission-Control style.

// Per window, per frame:
win.userData.openAmount += (win.userData.openTarget - win.userData.openAmount) * 0.12;
const openA = win.userData.openAmount;

// Blend home position with centred target as openAmount rises
win.userData.targetPos.set(
  homeX * (1 - openA) + 0   * openA,
  homeY * (1 - openA) + 0   * openA,
  hoverZ * (1 - openA) + 0.6 * openA - otherOpenAmount * 0.3,
);

// Scale and opacity:
const openBump  = 1 + openA * 1.4;          // up to 2.4×
const backScale = 1 - otherOpenAmount * 0.25;
win.scale.setScalar(userScale * selBump * openBump * backScale);
win.material.opacity = 1 - otherOpenAmount * 0.7;

The otherOpenAmount = "how open is any other window?". As one window opens, the others dim and shrink proportionally — a single number drives the whole background fade. Press ESC or air-tap anywhere to close.

Two-hand pinch-spread → scale

The same model is doing all the work. Flip numHands to 2, give each hand its own One-Euro filters + cursor visuals, and you have everything you need for a Vision-Pro-style resize gesture.

// In the MediaPipe options:
numHands: 2,

// Process both detected hands per frame:
if (result.landmarks.length > 0) onHand(result.landmarks[0]);
if (result.landmarks.length > 1) onHand2(result.landmarks[1]);

The scale state machine is tiny. Enter when hand 1 is already grabbing AND hand 2 starts pinching. Record the inter-hand distance + the window's current scale. While both hands stay pinched, update the scale by the distance ratio:

let scaling = false;
const scaleStart = { dist: 1, baseUserScale: 1 };

const canScale = grabbed && hand2Visible && pinching && pinching2;

if (canScale && !scaling) {
  scaling = true;
  scaleStart.dist = Math.max(cursor.distanceTo(cursor2), 0.05);
  scaleStart.baseUserScale = grabbed.userData.userScale || 1;
  spawnBurst(cursor2, grabbed.userData.type.accent, 24);
} else if (scaling && !canScale) {
  scaling = false;
}

if (scaling) {
  const dist  = Math.max(cursor.distanceTo(cursor2), 0.05);
  const ratio = dist / scaleStart.dist;
  const target = THREE.MathUtils.clamp(
    scaleStart.baseUserScale * ratio, 0.4, 3.0,
  );
  const cur = grabbed.userData.userScale ?? 1;
  grabbed.userData.userScale = cur + (target - cur) * 0.25;

  // Window centres between the two pinch points — Vision-Pro feel
  const mid = new THREE.Vector3().addVectors(cursor, cursor2).multiplyScalar(0.5);
  grabbed.userData.targetPos.x = mid.x;
  grabbed.userData.targetPos.y = mid.y;
}

The smoothing step (cur + (target - cur) * 0.25) is what stops the window from popping every frame. Without it, even a sub-millimetre jitter on either fingertip shows up as a visible scale-pulse. With it, slow movements glide and fast movements feel snappy. The clamp [0.4, 3.0] stops users from shrinking a window into oblivion or growing it past the viewport.

Mouse + touchpad: scroll wheel = scale

Not every viewer will use two hands. Same code path, different input:

demoEl.addEventListener('wheel', (e) => {
  const target = grabbed || hovered;
  if (!target) return;
  e.preventDefault();
  // exp(-deltaY * k) gives symmetric zoom-in / zoom-out
  const factor = Math.exp(-e.deltaY * 0.0015);
  const cur = target.userData.userScale ?? 1;
  target.userData.userScale = THREE.MathUtils.clamp(cur * factor, 0.4, 3.0);
}, { passive: false });
Why an exponential factor? Linear scaling (scale += deltaY * k) is asymmetric — scrolling up by N then down by N doesn't return you to the start, because the same delta represents a bigger ratio when the value is small. exp(-deltaY * k) makes "scroll up 100" and "scroll down 100" exactly cancel. Same trick map apps use for pinch-zoom.

STEP 09Visual wow — glow halos and particle bursts

The interaction logic is done. Everything from here on is what makes the demo screenshot-worthy.

Soft halo behind each window

A second plane, slightly larger than the window, behind it. It carries a radial-gradient canvas texture tinted to the window's accent colour. Material is AdditiveBlending, opacity tied to the window's "selection" value (0 → 1).

function makeHaloTexture() {
  const c = document.createElement('canvas');
  c.width = c.height = 256;
  const ctx = c.getContext('2d');
  const grad = ctx.createRadialGradient(128,128,0,128,128,128);
  grad.addColorStop(0, 'rgba(255,255,255,1)');
  grad.addColorStop(0.4, 'rgba(255,255,255,0.3)');
  grad.addColorStop(1, 'rgba(255,255,255,0)');
  ctx.fillStyle = grad;
  ctx.fillRect(0,0,256,256);
  return new THREE.CanvasTexture(c);
}

Reuse the same texture across all eight halos. Cheap. Looks expensive.

The hand cursor itself

A small sphere with an emissive material plus its own halo plus a trail of fading spheres behind it. The cursor's colour shifts with state — cyan when idle, white when hovering a window, hot orange when pinched. Each shift is a colour lerp toward the target — never a snap.

const idleColor  = new THREE.Color(0x22d3ee);
const hoverColor = new THREE.Color(0xffffff);
const grabColor  = new THREE.Color(0xff8a3d);

const target = pinching ? grabColor : (hovered ? hoverColor : idleColor);
cursor.material.color.lerp(target, 0.18);
cursorHalo.material.color.lerp(target, 0.18);

The particle burst

A single THREE.Points object. Pre-allocate the worst-case number of particles. Keep a pool. On grab, activate N of them at the cursor with random outward velocities. Each frame, advance position, fade size + opacity. When a particle's life hits zero, return it to the pool.

const MAX = 200;
const positions = new Float32Array(MAX * 3);
const velocities = new Float32Array(MAX * 3);
const lives = new Float32Array(MAX);
const sizes = new Float32Array(MAX);
const colors = new Float32Array(MAX * 3);

// On grab:
function spawnBurst(at, color) {
  for (let n = 0; n < 40; n++) {
    const i = findDead();
    if (i === -1) break;
    positions[i*3+0] = at.x;
    positions[i*3+1] = at.y;
    positions[i*3+2] = at.z;
    const theta = Math.random() * Math.PI * 2;
    const speed = 0.5 + Math.random() * 1.5;
    velocities[i*3+0] = Math.cos(theta) * speed;
    velocities[i*3+1] = Math.sin(theta) * speed;
    velocities[i*3+2] = (Math.random() - 0.5) * 0.6;
    lives[i] = 0.6 + Math.random() * 0.4;
    // ...colour into colors[i*3+...]
  }
}

The "wow" of the burst comes from three things: the count (40 is enough), the speed spread (some fast, some slow, all the difference), and the colour matching the window you grabbed.

STEP 10Mouse fallback & ship it

Not every viewer will enable their camera. The demo should still work. Map mouse position to the same cursor target. Left mouse button = pinch. Done. Same state machine, same effects.

demo.addEventListener('pointermove', (e) => {
  if (cameraActive) return;   // hand-tracker takes over once camera starts
  const r = demo.getBoundingClientRect();
  const nx = (e.clientX - r.left) / r.width;
  const ny = (e.clientY - r.top) / r.height;
  const { w, h } = frustumAt(0);
  rawCursor.set(
    (nx * 2 - 1) * w / 2,
    (1 - ny * 2) * h / 2,
    0,
  );
});
demo.addEventListener('pointerdown', () => { mousePinching = true; });
demo.addEventListener('pointerup',   () => { mousePinching = false; });

Mouse mode is on by default. Camera takes over when you click Start Camera.

The viral checklist

This demo is built to be shared. Here's what makes it land:

  • Screenshot in 5 seconds. The wall of glowing glass windows works as a still image. Open in a fresh tab → take a screenshot → it's already a tweet.
  • Reels-friendly motion. Pinch-and-drag is 2 seconds of footage with clear visual cause and effect. Loop it. Done.
  • The two-hand spread is the money shot. Pinch both hands on a window, pull them apart — the window grows between your hands. People stop scrolling.
  • Open the app you're selling. Air-tap on the window that matches your audience (Numbers for finance, Xcode for devs, Keynote for designers) — the focused mode reads as "your software, but spatial." Three seconds of footage = the pitch.
  • "Wait, that's just a webpage?" Always under-promise. Don't say "AI" — let people watch the hand work, then drop the link.
  • Customise the windows for your audience. Selling to designers? Make one window Figma. To devs? VS Code. To gamers? Steam. The "windows" are just drawX functions — twenty lines each.
  • Hashtags that pull traffic: #threejs #webgl #mediapipe #handtracking #visionpro #webcam #generativeart #creativecoding
  • Caption template:
    "Built a Vision-OS-style hand interface in a single HTML file. Webcam → MediaPipe → Three.js. No app, no install. Tutorial linked. 👋"

Performance notes & gotchas

  • HTTPS required for getUserMedia. Localhost is exempt. If your hosted demo doesn't get a camera prompt, that's the cause 9 times out of 10.
  • Inference cost. Hand Landmarker is ~8–15 ms/frame on integrated GPU, ~25–40 ms on CPU. Always use delegate: 'GPU'.
  • Detect every frame. Don't try to "save cycles" by running inference every other frame. The latency cost on a 60Hz display is more noticeable than the FPS gain.
  • Texture updates. Animated window content sets texture.needsUpdate = true every frame. Static windows draw once and never touch the texture again.
  • Mirror the video element with CSS, not WebGL. transform: scaleX(-1) on the <video> in the mini-camera. The landmarks themselves stay unmirrored — we mirror in the world-coords math (step 5).
  • Dispose on unmount. Single-page demos leak GPU memory if you navigate away without calling geometry.dispose() / material.dispose() / renderer.dispose(). We do it in the demo's cleanup listener.
  • Hand jitter when occluded. If your hand goes off-screen, the model still guesses for a few frames. Use the onNoHand branch to decay the cursor toward idle and release any grab.

Exercises

  1. Rotate with two hands. Scale is the magnitude change between the two pinch points. Rotation is the angle change. Capture atan2 of the vector between the two cursors at scale-start, and apply the delta as a Z rotation on the grabbed window. Now you have a full manipulate gesture (move + scale + rotate).
  2. Window stack. Sort windows by z when one's grabbed — fan them out into a Mac Mission Control–style layout.
  3. Real iframe content. Replace one CanvasTexture window with a CSS3DRenderer iframe so it shows an actual page. Bonus: pinch-and-pan that iframe.
  4. Voice commands. Wire SpeechRecognition alongside hand tracking. Say "expand" while pinching → window fills the screen.
  5. Persistence. Save window positions to localStorage so layouts survive a refresh.
  6. Air-tap. Add a quick-pinch gesture (open → closed → open within 250ms) that fires a click instead of a grab. Now you have buttons.

What's next

Project P14 is intentionally single-page. Plug the same hand-tracking core into: