Track entities across videos

Follow a single subject (a person, vehicle, or object) across multiple videos. Jockey correlates visual identity across your collection and reconstructs a chronological timeline of appearances.

Cross-video entity tracking is a core capability in private beta. Behavior and accuracy may evolve as the system improves.

What you’ll build

A timeline of a subject’s appearances across multiple videos, with timestamps, locations, and context for each sighting.

Prerequisites

Complete the Quickstart to create a knowledge store with at least one item in ready status.
Read Create a response to understand the request and response format.

When to use this

Tracking a person across multiple camera angles or footage sources
Building a chronological timeline of a subject’s appearances
Identifying every moment where a specific entity appears in a collection

How it works

Describe the subject you want to track in plain language. Jockey correlates visual identity across all videos in your knowledge store and returns a timeline of appearances with timestamps and context. All tracking goes through the same POST /responses endpoint, so no special configuration is needed.

Be specific about the subject. Physical descriptions, clothing, and distinguishing features improve accuracy. “The person in the blue hoodie” works better than “the suspect.”
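
As a minimal sketch of the request described above, the snippet below builds a JSON body for POST /responses from a plain-language subject description. The helper name build_tracking_request is hypothetical, and the example description is made up for illustration. The field names (model, input, knowledge_store_id) are the ones used in the examples on this page. The snippet only constructs the body and does not send it.

```python
def build_tracking_request(subject_description: str, store_id: str) -> dict:
    """Build a request body for POST /responses (hypothetical helper)."""
    prompt = (
        f"Track {subject_description} across all videos. "
        "Give me a chronological timeline."
    )
    return {
        "model": "jockey1.0",
        "input": [{"type": "message", "role": "user", "content": prompt}],
        "knowledge_store_id": store_id,
    }


# A specific description gives the model more to match on than a vague one.
body = build_tracking_request(
    "the person in the blue hoodie carrying a red backpack",
    "your_knowledge_store_id",
)
print(body["input"][0]["content"])
```

Keeping the description as a single argument makes it easy to compare how a vague and a specific phrasing affect the results.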

Track with structured output

Use a schema to get machine-readable tracking results. The schema captures each appearance with a timestamp, video reference, location, and description of what the subject is doing.

import requests
import json

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.3"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
STORE_ID = "your_knowledge_store_id"

# Schema for the result: one timeline entry per appearance of the subject.
tracking_schema = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "timeline": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "timestamp": {"type": "string"},
                    "video_reference": {"type": "string"},
                    "location": {"type": "string"},
                    "action": {"type": "string"}
                }
            }
        },
        "summary": {"type": "string"}
    }
}

response = requests.post(
    f"{BASE_URL}/responses",
    headers=HEADERS,
    json={
        "model": "jockey1.0",
        "instructions": "You are a security analyst. Prioritize temporal accuracy. Flag any low-confidence identifications.",
        "input": [
            {"type": "message", "role": "user", "content": "Track the person in the blue hoodie across all cameras. Give me a chronological timeline."}
        ],
        "knowledge_store_id": STORE_ID,
        "text": {"format": {"type": "json_schema", "name": "entity_tracking", "schema": tracking_schema}}
    }
)
response.raise_for_status()  # Stop early on HTTP errors.

result = response.json()
session_id = result["session_id"]  # Reused for follow-up turns.

# The structured result arrives as a JSON string in each message's content.
for output in result["output"]:
    if output["type"] == "message":
        for content in output["content"]:
            data = json.loads(content["text"])
            print(f"Subject: {data['subject']}")
            print(f"Summary: {data['summary']}\n")
            for entry in data["timeline"]:
                print(f"  [{entry['timestamp']}] {entry['video_reference']}")
                print(f"    Location: {entry['location']}")
                print(f"    Action: {entry['action']}")
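
The schema above does not mark any property as required, so a defensive consumer may want to tolerate missing fields. The sketch below groups timeline entries by video while preserving the order in which they were returned. group_by_video is a hypothetical helper, the sample data (including its timestamp format) is made up for illustration, and the "unknown" fallback is an assumption rather than documented behavior.

```python
from collections import defaultdict


def group_by_video(data: dict) -> dict:
    """Group timeline entries by video_reference, keeping returned order.

    Hypothetical helper: missing fields fall back to "unknown" because the
    schema on this page does not declare any property as required.
    """
    grouped = defaultdict(list)
    for entry in data.get("timeline", []):
        video = entry.get("video_reference", "unknown")
        grouped[video].append(
            {
                "timestamp": entry.get("timestamp", "unknown"),
                "location": entry.get("location", "unknown"),
                "action": entry.get("action", "unknown"),
            }
        )
    return dict(grouped)


# Illustrative data only, not real API output; the timestamp format is assumed.
sample = {
    "subject": "person in the blue hoodie",
    "timeline": [
        {"timestamp": "00:01:12", "video_reference": "cam_a", "location": "lobby", "action": "enters"},
        {"timestamp": "00:03:40", "video_reference": "cam_b", "action": "waits"},
        {"timestamp": "00:05:02", "video_reference": "cam_a", "location": "exit", "action": "leaves"},
    ],
}

for video, sightings in group_by_video(sample).items():
    print(video, [s["timestamp"] for s in sightings])
```

Grouping by video can make it easier to review each source separately before reading the combined timeline.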

Refine with follow-up turns

Use the session_id from the first response to drill into specific appearances without starting over.

response = requests.post(
    f"{BASE_URL}/responses",
    headers=HEADERS,
    json={
        "model": "jockey1.0",
        "session_id": session_id,
        "input": [
            {"type": "message", "role": "user", "content": "Tell me more about what happened at the third timestamp. What was happening around the subject?"}
        ],
        "knowledge_store_id": STORE_ID
    }
)
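
The follow-up request above does not set a schema, so the reply is presumably plain text rather than JSON. As a sketch, assuming the follow-up response has the same output, content, and text shape as the parsing loop earlier on this page, the text could be collected as follows. collect_text is a hypothetical helper and the sample dictionary is illustrative, not real API output.

```python
def collect_text(result: dict) -> str:
    """Join the text of every message item in a response body.

    Assumes the output -> content -> text shape used by the parsing loop
    earlier on this page; items without text are skipped.
    """
    parts = []
    for output in result.get("output", []):
        if output.get("type") != "message":
            continue
        for content in output.get("content", []):
            text = content.get("text")
            if text:
                parts.append(text)
    return "\n".join(parts)


# Illustrative response body, not real API output.
sample_result = {
    "session_id": "example_session",
    "output": [
        {"type": "message", "content": [{"text": "The subject pauses near the entrance."}]},
    ],
}
print(collect_text(sample_result))
```

With a live call, you would pass response.json() to the helper instead of the sample dictionary.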

Limitations

Visual-only identification. Voice-based matching is not supported in this phase.
Single entity per request. To track multiple subjects, start a separate conversation for each one.
Accuracy depends on video quality. Camera angles, lighting, and how distinctive the subject is all affect results.
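
Because each request tracks one entity, one way to follow several subjects is to build a separate first-turn request per subject and keep the resulting session_id values apart. The following is a minimal sketch under that assumption. requests_for_subjects is a hypothetical helper, the subject descriptions are made up, and nothing is sent over the network.

```python
def requests_for_subjects(subjects: list[str], store_id: str) -> dict:
    """Return one independent request body per subject (hypothetical helper).

    No session_id is set, so each body would start its own conversation.
    """
    bodies = {}
    for subject in subjects:
        bodies[subject] = {
            "model": "jockey1.0",
            "input": [
                {
                    "type": "message",
                    "role": "user",
                    "content": f"Track {subject} across all cameras. Give me a chronological timeline.",
                }
            ],
            "knowledge_store_id": store_id,
        }
    return bodies


bodies = requests_for_subjects(
    ["the person in the blue hoodie", "the white delivery van"],
    "your_knowledge_store_id",
)
print(len(bodies))
```

Each body can then be posted on its own, and the session_id from each response stored under the matching subject for later follow-up turns.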

Variations

Different domains: Change the role in instructions to “video editor” or “documentary researcher” to shift the emphasis
Highlight reel: “Create a highlight reel of this subject’s best moments”
Multi-episode: Track a recurring character across an episode series