Panoramica Tecnica v1.0Technical Overview v1.0

Sign Language
Speaker

Un'applicazione web che osserva i segni della Lingua dei Segni Italiana (LIS), li riconosce come parole, compone una frase italiana grammaticalmente corretta e la pronuncia ad alta voce.

A web application that observes Italian Sign Language (LIS) gestures, recognizes them as words, composes a grammatically correct Italian sentence and speaks it out loud.

FastAPIMediaPipeDTWOpenAIElevenLabsWebSocket

La Pipeline in SintesiThe Pipeline at a Glance

Fase 1Phase 1

MediaPipe

LandmarkLandmarks

→

Fase 2Phase 2

Feature 134-D

VettoreVector

→

Fase 3Phase 3

DTW

RiconoscimentoRecognition

→

Fase 4Phase 4

OpenAI LLM

FraseSentence

→

Fase 5Phase 5

ElevenLabs

TTS

→

Fase 6Phase 6

Browser

Real-time

Fase 1Phase 1

Estrazione dei Landmark con MediaPipe Holistic

Landmark Extraction with MediaPipe Holistic

Ogni fotogramma in ingresso viene passato attraverso MediaPipe Holistic, che esegue contemporaneamente tre sotto-modelli:

Each incoming frame is passed through MediaPipe Holistic, which runs three sub-models simultaneously:

ModelliModels

BlazePose + Mani + Face MeshBlazePose + Hands + Face Mesh

▼

BlazePose - 33 landmark del corpo (spalle, gomiti, polsi, fianchi)

Modello della mano - 21 landmark per mano (sinistra e destra indipendenti)

Face mesh - 468 landmark del viso (solo overlay, non riconoscimento)

BlazePose - 33 body landmarks (shoulders, elbows, wrists, hips)

Hand model - 21 landmarks per hand (left and right tracked independently)

Face mesh - 468 face landmarks (overlay only, not used for recognition)

Coordinate nello spazio immagine normalizzato 0..1. Frame senza mano riempiti con NaN.

Coordinates in normalized 0..1 image space. Frames missing a hand are filled with NaN.

Config

Configurazione di EsecuzioneRuntime Configuration

▼

ParametroParameter	ValoreValue
model_complexity	1
smooth_landmarks	True
min_detection_confidence	0.4
min_tracking_confidence	0.4

Fase 2Phase 2

Vettore di Feature per Fotogramma

Per-Frame Feature Vector

I landmark grezzi vengono convertiti in un vettore a 134 dimensioni, invariante rispetto a traslazione e scala.

Raw landmarks are converted into a 134-dimensional feature vector, invariant to translation and scale.

134-D

Composizione del VettoreVector Composition

▼

2 flag di visibilita - mano sinistra/destra visibile

Posizioni polsi relative - rispetto alla mezzeria spalle/fianchi, scalate

21 landmark per mano - relativi al polso, codifica forma indipendente dalla posizione

2 visibility flags - left/right hand visible

Relative wrist positions - anchored to shoulder/hip midpoint, shoulder-width scaled

21 landmarks per hand - wrist-relative, encoding hand shape independently of position

Output: Matrice (T, 134) dove T = numero fotogrammi del segno.

Output: Matrix (T, 134) where T = number of frames in the sign.

Fase 3Phase 3

Riconoscimento dei segni(Dynamic Time Warping)

Gesture Recognition (Dynamic Time Warping)

Ogni segno conosciuto ha un template di riferimento. Il sistema confronta la sequenza in ingresso con ogni template tramite DTW.

Each known sign has a stored reference template. The system compares the input sequence against every template using DTW.

DTW

Come Funziona il MatchingHow Matching Works

▼

Il DTW trova l'allineamento ottimale tra due serie temporali a velocita diverse, accumulando distanze euclidee frame per frame lungo il miglior percorso monotono. Risultato normalizzato per lunghezza del percorso.

DTW finds the optimal alignment between two time series at different speeds, accumulating frame-by-frame Euclidean distances along the best monotonic path. Result normalized by path length.

Criterio di match:Match criterion: distance < 0.35 → match affidabilereliable match

Segm.Segm.

Segmentazione del Flusso ContinuoContinuous Stream Segmentation

▼

I confini dei segni vengono rilevati tracciando un segnale di energia di movimento: i picchi separati da valli individuano i segmenti candidati. Ogni segmento viene confrontato tramite DTW e la parola migliore aggiunta alla coda.

Sign boundaries are detected via a motion energy signal: peaks separated by valleys identify candidate segments. Each segment is DTW-matched and the best word is added to the queue.

Fase 4Phase 4

Composizione della Frase con un LLM

Sentence Composition with an LLM

I segni riconosciuti si accumulano come lista ordinata. Alla riproduzione, la lista viene inviata a OpenAI gpt-4o-mini con prompt vincolato.

Recognized signs accumulate as an ordered list. On playback, the list is sent to OpenAI gpt-4o-mini with a constrained prompt.

Esempio - Input dalla codaExample - Queue input

[ Noi, Tecnologia, Vita, Migliorare, Guerra, Serve, Non ]

RegoleRules

Vincoli del System PromptSystem Prompt Constraints

▼

Usare ogni parola dell'input - nessuna puo essere omessa

Preservare l'ordine, eccezione LIS: negazione precede il sostantivo in italiano

Solo collante grammaticale - articoli, preposizioni, congiunzioni

Una sola frase - senza virgolette, senza commenti

Use every word from the input - none may be omitted

Preserve order, with LIS exception: negation must precede the noun in Italian

Only grammatical glue - articles, prepositions, conjunctions

One sentence only - no quotes, no commentary

Temperatura:Temperature: 0.2

Fase 5Phase 5

Sintesi Vocale con ElevenLabs

Speech Synthesis with ElevenLabs

La frase viene inviata all'API TTS di ElevenLabs con il modello eleven_multilingual_v2.

The sentence is sent to the ElevenLabs TTS API using the eleven_multilingual_v2 model.

Voice

Impostazioni VocaliVoice Settings

▼

ParametroParameter	ValoreValue
stability	0.5
similarity_boost	0.75

Restituisce audio MP3, codificato base64, inviato al browser sullo stesso WebSocket come JSON tipo speech.

Returns MP3 audio, base64-encoded, sent to the browser over the same WebSocket as a JSON speech message.

Fase 6Phase 6

Consegna in Tempo Reale al Browser

Real-Time Browser Delivery

Un endpoint WebSocket FastAPI trasmette due payload per ogni fotogramma:

A FastAPI WebSocket endpoint streams two payloads per processed frame:

Payload WebSocketWebSocket Payload

▼

JPEG binario - fotogramma corrente

JSON - coordinate landmark, gesto rilevato, coda parole, metadati sessione

Binary JPEG - current frame

JSON - landmark coordinates, detected gesture, word queue, session metadata

Il frontend disegna il video su un canvas, sovrappone i landmark e mostra le parole come chip. L'utente richiede la riproduzione e il backend esegue LLM + TTS.

The frontend draws video on a canvas, overlays landmarks and displays words as chips. The user triggers playback and the backend runs LLM + TTS.

RiferimentoReference

Stack TecnologicoTechnology Stack

LivelloLayer	TecnologiaTechnology
Backend	Python, FastAPI, WebSocket, OpenCV, NumPy
Computer Vision	MediaPipe Holistic (BlazePose + manihands + visoface)
RiconoscimentoRecognition	DTW custom su spazio 134-DCustom DTW on 134-D feature space
LinguaggioLanguage	OpenAI gpt-4o-mini
VoceVoice	ElevenLabs eleven_multilingual_v2 TTS
Frontend	Vanilla JS, HTML5 Canvas, WebSocket, HTML5 Audio

Sign LanguageSpeaker