Building search for a multi-language video platform is tricky, especially when your content spans Japanese, Korean, Chinese, Vietnamese, and Thai. Here's how I built it for TopVideoHub using SQLite FTS5.
Why SQLite FTS5?
- Zero infrastructure — No Elasticsearch cluster to manage
- Embedded — Part of SQLite, no additional dependencies
- Fast — Sub-millisecond queries on tens of thousands of documents
- Good enough — For a content platform, FTS5 relevance ranking is sufficient
Creating the FTS Table
-- Create FTS5 virtual table for video search
CREATE VIRTUAL TABLE IF NOT EXISTS video_search USING fts5(
title,
channel_title,
description,
content='videos',
content_rowid='id',
tokenize='unicode61 remove_diacritics 2'
);
-- Triggers to keep FTS in sync with main table
CREATE TRIGGER video_search_insert AFTER INSERT ON videos BEGIN
INSERT INTO video_search(rowid, title, channel_title, description)
VALUES (new.id, new.title, new.channel_title, new.description);
END;
CREATE TRIGGER video_search_delete AFTER DELETE ON videos BEGIN
INSERT INTO video_search(video_search, rowid, title, channel_title, description)
VALUES ('delete', old.id, old.title, old.channel_title, old.description);
END;
CREATE TRIGGER video_search_update AFTER UPDATE ON videos BEGIN
INSERT INTO video_search(video_search, rowid, title, channel_title, description)
VALUES ('delete', old.id, old.title, old.channel_title, old.description);
INSERT INTO video_search(rowid, title, channel_title, description)
VALUES (new.id, new.title, new.channel_title, new.description);
END;
The CJK Challenge
CJK (Chinese, Japanese, Korean) languages don't use spaces between words. In English, "hello world" is clearly two tokens. In Japanese, "人気動画" (popular video) has no word boundaries.
FTS5's unicode61 tokenizer handles this by treating each CJK character as an individual token. This means:
- Searching for "人気" matches any text containing both "人" and "気" adjacent
- It's not perfect linguistic tokenization, but it works well for search
Search Implementation
class VideoSearch {
public function __construct(
private readonly \PDO $db
) {}
public function search(string $query, int $limit = 20, int $offset = 0): array {
$query = $this->sanitizeQuery($query);
if (empty($query)) {
return [];
}
// Use BM25 ranking for relevance scoring
// Weight: title (10), channel (5), description (1)
$sql = "
SELECT v.*,
bm25(video_search, 10.0, 5.0, 1.0) AS rank
FROM video_search
JOIN videos v ON v.id = video_search.rowid
WHERE video_search MATCH ?
ORDER BY rank
LIMIT ? OFFSET ?
";
$stmt = $this->db->prepare($sql);
$stmt->execute([$query, $limit, $offset]);
return $stmt->fetchAll(\PDO::FETCH_ASSOC);
}
private function sanitizeQuery(string $query): string {
// Remove FTS5 special characters to prevent injection
$query = trim($query);
$query = preg_replace('/["*(){}\[\]^~\\\\]/', ' ', $query);
$query = preg_replace('/\s+/', ' ', $query);
if (empty($query)) {
return '';
}
// For CJK: use each character as a token
// For Latin: use each word as a prefix match
$tokens = [];
$words = explode(' ', $query);
foreach ($words as $word) {
if (mb_strlen($word) > 0) {
if ($this->isCJK($word)) {
// CJK: exact character matching
$tokens[] = '"' . $word . '"';
} else {
// Latin: prefix matching
$tokens[] = $word . '*';
}
}
}
return implode(' ', $tokens);
}
private function isCJK(string $text): bool {
// Check if text contains CJK characters
return (bool) preg_match('/[\x{4E00}-\x{9FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{AC00}-\x{D7AF}]/u', $text);
}
}
Search with Region Filtering
public function searchInRegion(
string $query,
string $region,
int $limit = 20
): array {
$query = $this->sanitizeQuery($query);
if (empty($query)) return [];
$sql = "
SELECT v.*, bm25(video_search, 10.0, 5.0, 1.0) AS rank
FROM video_search
JOIN videos v ON v.id = video_search.rowid
JOIN video_regions vr ON vr.video_id = v.id
WHERE video_search MATCH ?
AND vr.region = ?
ORDER BY rank
LIMIT ?
";
$stmt = $this->db->prepare($sql);
$stmt->execute([$query, $region, $limit]);
return $stmt->fetchAll(\PDO::FETCH_ASSOC);
}
Caching Search Results
Search results are cached for 6 hours since trending data doesn't change rapidly:
public function cachedSearch(string $query, int $limit = 20): array {
$cacheKey = 'search:' . md5($query . ':' . $limit);
$cached = $this->cache->get($cacheKey);
if ($cached !== null) {
return $cached;
}
$results = $this->search($query, $limit);
$this->cache->set($cacheKey, $results, ttl: 21600);
return $results;
}
Testing CJK Search
Here's how I verify CJK search works correctly:
// Japanese: search for popular music
$results = $search->search('人気音楽');
assert(count($results) > 0);
// Korean: search for K-pop
$results = $search->search('케이팝');
assert(count($results) > 0);
// Mixed: English and Japanese
$results = $search->search('YOASOBI 夜に駆ける');
assert(count($results) > 0);
// Vietnamese with diacritics
$results = $search->search('nhạc việt');
assert(count($results) > 0);
Performance
On TopVideoHub with ~50,000 videos across 9 regions:
- Search query execution: 2-5ms
- With caching: 0.1ms (cache hit)
- Memory usage: negligible (SQLite manages its own memory)
SQLite FTS5 is not Elasticsearch. It doesn't do fuzzy matching, synonyms, or sophisticated NLP. But for a video platform where users search by title and channel name, it's more than sufficient — and infinitely simpler to operate.
Top comments (0)