Welcome! In this project, one of the most important tools is MediaPipe, Google's framework for real-time hand, face, and pose tracking.
Recognizing hand signs isn't easy. We need to know where the hand is and what shape it makes. Raw pixels don't help much—but MediaPipe does. It detects 21 keypoints per hand, along with pose and face landmarks, giving us accurate (x, y, z) coordinates in real time.
MediaPipe is already listed in your requirements.txt, but if you need to install it manually:

```bash
pip install mediapipe==0.10.9
```

(Note: version 0.10.9 is the one pinned in the project's requirements.txt. Installing the exact version used by the project helps avoid compatibility issues.)
Once installed, we can import the library in our Python code. In this project, we also use OpenCV (cv2) to handle the camera feed.
```python
import cv2
import mediapipe as mp
```

We use the Holistic model (`mp.solutions.holistic`), a comprehensive model that detects all of the following at once:
- Hands (left + right)
- Face mesh
- Body pose
Let's see how we initialize the MediaPipe Holistic model used in the project code:
```python
mp_holistic = mp.solutions.holistic      # Access the Holistic model
mp_drawing = mp.solutions.drawing_utils  # Drawing utilities for visualization

# Initialize the Holistic model for use
with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    # The model is now ready to process images
    # ... rest of your code to process frames ...
    pass
```

Here:
- `mp.solutions.holistic` gives us access to the Holistic model.
- `mp.solutions.drawing_utils` provides helper functions to draw the detected landmarks on our images, which is great for visualization.
- We create an instance of the `Holistic` model. The `min_detection_confidence` and `min_tracking_confidence` parameters tell the model how sure it needs to be to consider a detection valid (values range from 0.0 to 1.0).
Once the model is initialized, we can feed it an image (like a frame from our camera feed) using its `process()` method. The provided code wraps this in a helper function, `mediapipe_detection`:
```python
def mediapipe_detection(image, model):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB for MediaPipe
    image.flags.writeable = False                   # Mark read-only to improve performance
    results = model.process(image)                  # Run the detection
    image.flags.writeable = True                    # Make the image writeable again
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)  # Convert back to BGR for OpenCV display
    return image, results
```

This function takes an image (a NumPy array representing the frame) and the MediaPipe model. It processes the image and returns the image (with write permissions restored and its color order converted back) and the `results` object, which contains all the detection information.
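The two color conversions above only reorder channels: OpenCV stores images as BGR, while MediaPipe expects RGB. As a rough illustration (using plain NumPy rather than OpenCV), reversing the channel axis performs the same reordering that `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` does for a 3-channel image:

```python
import numpy as np

# A tiny 1x2 "image": one blue pixel and one red pixel, in BGR order
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last (channel) axis converts BGR -> RGB
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # [0, 0, 255]: the blue pixel, now in RGB order
print(rgb[0, 1].tolist())  # [255, 0, 0]: the red pixel, now in RGB order
```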
The results object returned by model.process(image) holds the key information.
For the Holistic model, it can contain detections for face, pose, left hand, and right hand. We are primarily interested in the hand landmarks for this project.
The hand landmark data is stored in `results.left_hand_landmarks` and `results.right_hand_landmarks`. If a hand is detected, each of these holds 21 landmark objects (accessible via its `.landmark` attribute); if not, the attribute is `None`. Each landmark has `x`, `y`, and `z` coordinates: `x` and `y` are normalized to the image width and height, and `z` estimates the landmark's relative depth.
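To make that structure concrete, here is a small sketch using a hand-built stand-in object. In the real code, `results` comes from `model.process(image)`; the `SimpleNamespace` objects and the coordinate values here are invented purely to mimic its shape:

```python
from types import SimpleNamespace

# Hypothetical stand-in mimicking MediaPipe's results object:
# a detected left hand with one landmark, and no right hand (None)
landmark = SimpleNamespace(x=0.42, y=0.55, z=-0.01)
results = SimpleNamespace(
    left_hand_landmarks=SimpleNamespace(landmark=[landmark]),
    right_hand_landmarks=None,
)

# Always check for None before reading landmarks: a hand may not be in frame
if results.left_hand_landmarks:
    first = results.left_hand_landmarks.landmark[0]
    print(first.x, first.y, first.z)  # 0.42 0.55 -0.01

if results.right_hand_landmarks is None:
    print("No right hand detected in this frame")
```

This `None` check is the same pattern the project's `extract_keypoints` function relies on, as we'll see below.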
A simple way to see what was detected is to draw these landmarks back onto the original image using the mp_drawing utilities.
```python
mp_hands = mp.solutions.hands  # Needed for mp_hands.HAND_CONNECTIONS below

def draw_landmarks(image, results):
    # Draw left hand landmarks (draw_landmarks safely skips None landmark lists)
    mp_drawing.draw_landmarks(
        image, results.left_hand_landmarks, mp_hands.HAND_CONNECTIONS)
    # Draw right hand landmarks
    mp_drawing.draw_landmarks(
        image, results.right_hand_landmarks, mp_hands.HAND_CONNECTIONS)
    # Note: The full project code also draws face and pose landmarks!
    # We only show hands here for simplicity in this chapter.
    # For the full drawing code, see the original notebook and draw_styled_landmarks.
```

(Note: While the project uses `mp_holistic.POSE_CONNECTIONS` and `mp_face_mesh.FACEMESH_TESSELATION` for other parts, using `mp_hands.HAND_CONNECTIONS` directly here makes it visually clear which connections are specific to the hands.)
When you run code like this with a camera feed showing your hands, you'll see dots (the landmarks) and lines (the connections between them) drawn on your hands in the video window.
Here's a simple diagram showing the core process with MediaPipe:
```mermaid
sequenceDiagram
    participant Cam as Camera
    participant Frame as Image Frame
    participant Model as MediaPipe Holistic Model
    participant Res as Detection Results
    Cam->>Frame: Capture frame
    Frame->>Model: Process image
    Model->>Model: Detect hands and landmarks
    Model->>Res: Return results object
    Res->>Frame: (Contains x,y,z coordinates)
    Frame->>Frame: (Optional) Draw landmarks
    Frame->>You: Display image
    Note over Res: Results contain lists of landmarks <br/> for detected body parts (hands, face, pose).
    Note over Res: Each landmark has x, y, z coordinates <br/> and sometimes visibility data.
```
MediaPipe does all the heavy lifting of analyzing the image to find these specific points in real-time. It handles different lighting conditions, hand orientations, and backgrounds remarkably well.
From each frame:
- results.left_hand_landmarks
- results.right_hand_landmarks
- results.pose_landmarks
- results.face_landmarks

The project's `extract_keypoints` function collects all of these, substituting zeros when a part isn't detected, and flattens them into a single vector:
```python
import numpy as np

def extract_keypoints(results):
    # Pose landmarks (x, y, z, visibility), or zeros if not detected
    pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten() if results.pose_landmarks else np.zeros(33*4)
    # Face landmarks, or zeros if not detected
    face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten() if results.face_landmarks else np.zeros(468*3)
    # Left hand landmarks, or zeros if not detected
    lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() if results.left_hand_landmarks else np.zeros(21*3)
    # Right hand landmarks, or zeros if not detected
    rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() if results.right_hand_landmarks else np.zeros(21*3)
    # Combine all landmarks into a single flat NumPy array
    return np.concatenate([pose, face, lh, rh])
```

The total number of values collected per frame is:
- Pose: 33 points * 4 values (x, y, z, visibility) = 132 values
- Face: 468 points * 3 values (x, y, z) = 1404 values
- Hands (Left and Right): 2 * (21 points * 3 values (x, y, z)) = 2 * 63 = 126 values
Total values per frame: 132 + 1404 + 126 = 1662 values.
This single NumPy array of 1662 numbers is the core data output from MediaPipe for each frame that we will use in the next steps of our project.
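As a quick sanity check on this arithmetic, the zero-fallback vectors from `extract_keypoints` (the shapes used when nothing is detected in a frame) already add up to the full 1662-value frame:

```python
import numpy as np

# The same zero-fallback shapes used in extract_keypoints when nothing is detected
pose = np.zeros(33 * 4)   # 33 pose points x (x, y, z, visibility) = 132 values
face = np.zeros(468 * 3)  # 468 face points x (x, y, z) = 1404 values
lh   = np.zeros(21 * 3)   # 21 left-hand points x (x, y, z) = 63 values
rh   = np.zeros(21 * 3)   # 21 right-hand points x (x, y, z) = 63 values

keypoints = np.concatenate([pose, face, lh, rh])
print(keypoints.shape)  # (1662,)
```

Because the fallbacks match the detected-landmark shapes exactly, every frame yields a vector of the same length whether or not anything was in view, which is what lets us stack frames into fixed-size sequences later.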
In this chapter, we learned about the MediaPipe library and how it acts as our "eyes" for hand sign detection. We saw that it can detect and track key points (landmarks) on hands (and other body parts) in real-time from video or images. The primary output we get from MediaPipe is the spatial coordinates (x, y, z) of these landmarks, providing a simplified numerical representation of the hand pose in each frame.
This raw data is crucial, but a single frame's pose isn't enough to recognize a dynamic hand sign, which is often a sequence of movements.
In the next chapter, we will learn how we handle these sequences of keypoints over time to capture the full motion of a hand sign.
Ready to see how we turn individual frame data into sequences? Let's go!