Tutorial on Image Stitching: From Theory to Practice
This guide explains the core concepts behind our image stitching implementation. We focus on the “why” and “how” of registering, warping, and blending images. Note: This tutorial primarily discusses the simple case of stitching two images, but the concepts extend to multiple images (see Appendix C).
1. Registering Images
To stitch two images together, we first need to “register” them—that is, figure out how they align.
- Feature Detection: We use algorithms like SIFT (Scale-Invariant Feature Transform) to find distinctive keypoints (like corners or high-contrast spots) in both images.
- Feature Matching: Each keypoint comes with a “descriptor” (a vector describing its local neighborhood). We match keypoints across images by finding descriptors with the smallest distance (e.g., using K-Nearest Neighbors and Lowe’s ratio test).
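import cv2

# Detect SIFT keypoints and compute their descriptors in both images
detector = cv2.SIFT_create()
pts1, des1 = detector.detectAndCompute(image1, None)
pts2, des2 = detector.detectAndCompute(image2, None)
# Match using KNN (query = Image 2, train = Image 1), keeping the 2 nearest neighbors
bf_matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = bf_matcher.knnMatch(des2, des1, k=2)
# Apply Lowe's ratio test: keep a match only if it clearly beats the runner-up
good_matches = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]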
- Transformation Estimation: Given these matching point pairs, we estimate a transformation matrix that maps points from Image 2’s coordinate space to Image 1’s coordinate space.
2. Why Homography?
Generally, the transformation between two images of a 3D scene taken from different viewpoints is complex and depends heavily on the distance to the objects (depth) in the scene. This relationship is typically described using Epipolar Geometry (e.g., via the Fundamental Matrix).
However, there are two specific cases where the transformation between pixel coordinates in two images can be perfectly described by a much simpler matrix called a Homography:
- Planar Scene: The cameras are viewing a completely flat, 2D plane (e.g., taking a picture of a painting, a document, or a completely flat wall).
- Pure Camera Rotation: The cameras are located at the exact same point in 3D space, but are rotated and/or zoomed. There is no translation (movement) of the camera’s optical center.
Image stitching falls under the second case, pure camera rotation: the images of a panorama are usually taken by rotating the camera about (approximately) its optical center with no translation, like a photographer standing still and turning, or a camera on a tripod.
Because there is no translation, the 3D depth of the scene doesn’t matter, and the mapping between the two images is a pure homography. (See Appendix A for the mathematical proof).
Using RANSAC (Random Sample Consensus) along with our matched features, we can robustly estimate this homography matrix $H$, which maps pixels from Image 2 to Image 1.
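To make the estimation step concrete, here is a minimal NumPy sketch of RANSAC wrapped around a Direct Linear Transform (DLT) fit. It is an illustration under stated assumptions, not the repository's implementation: the function names `fit_homography`, `project`, and `ransac_homography`, the 3-pixel threshold, and the iteration count are all hypothetical choices, and the DLT omits the coordinate normalization a production version would use.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform: H such that dst ~ H @ src, from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The (approximate) null vector is the last right singular vector.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    if abs(H[2, 2]) < 1e-12:
        return None  # cannot be normalized (typically a degenerate sample)
    return H / H[2, 2]

def project(H, pts):
    """Apply H to an (N, 2) array of points and return the (N, 2) result."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def ransac_homography(src, dst, n_iters=500, thresh=3.0, seed=0):
    """Return (H, inlier_mask) where H maps src points onto dst points."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        # 1. Fit a candidate model to a random minimal sample of 4 pairs.
        idx = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        if H is None:
            continue
        # 2. Count the pairs whose reprojection error is below the threshold.
        with np.errstate(all="ignore"):
            err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        # 3. Keep the largest consensus set seen so far.
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 4:
        raise ValueError("RANSAC found no consistent homography")
    # 4. Refit on all inliers of the best model.
    return fit_homography(src[best], dst[best]), best
```

Here `src` would hold the matched point coordinates from Image 2 and `dst` the corresponding coordinates from Image 1, both as `(N, 2)` float arrays. With OpenCV, the same step is usually a single call to `cv2.findHomography(src, dst, cv2.RANSAC)`, which returns the matrix together with an inlier mask.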
3. Warping the Images
Once we have our homography $H$, we must “warp” Image 2 so it aligns with Image 1.
Redefining the Canvas Boundary
To ensure both images fit into a single canvas without cropping:
- We calculate the new coordinates of Image 2’s corners using $H$.
- We find the global minimum and maximum coordinates ($x_{\min}$, $y_{\min}$, $x_{\max}$, $y_{\max}$) across both the original Image 1 and the warped Image 2.
- If the minimum $x$ or $y$ is negative, it means the warped image extends above or to the left of Image 1. We must introduce a Translation Matrix (Offset) $T$ to shift everything into positive coordinates:
$$T = \begin{bmatrix} 1 & 0 & -x_{\min} \\ 0 & 1 & -y_{\min} \\ 0 & 0 & 1 \end{bmatrix}$$
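The three steps above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming image sizes are given as `(width, height)` and that the homography maps Image 2 into Image 1's frame; the function name `canvas_from_homography` is hypothetical.

```python
import numpy as np

def canvas_from_homography(H, size1, size2):
    """Return (T, canvas_w, canvas_h) for stitching Image 2 onto Image 1.

    H maps Image 2 pixels into Image 1's frame; sizes are (width, height).
    """
    w1, h1 = size1
    w2, h2 = size2
    # Step 1: push Image 2's corners through H (homogeneous coordinates).
    corners2 = np.array([[0, 0, 1], [w2, 0, 1], [w2, h2, 1], [0, h2, 1]], dtype=float).T
    warped = H @ corners2
    warped = warped[:2] / warped[2]
    # Step 2: global bounds over Image 1's corners and the warped corners.
    corners1 = np.array([[0, 0], [w1, 0], [w1, h1], [0, h1]], dtype=float).T
    all_pts = np.hstack([corners1, warped])
    x_min, y_min = np.floor(all_pts.min(axis=1)).astype(int)
    x_max, y_max = np.ceil(all_pts.max(axis=1)).astype(int)
    # Step 3: translation that moves (x_min, y_min) to the canvas origin.
    T = np.array([[1, 0, -x_min], [0, 1, -y_min], [0, 0, 1]], dtype=float)
    return T, int(x_max - x_min), int(y_max - y_min)
```

Because Image 1's own corner (0, 0) is included in the bounds, the minimum is never positive, so the resulting offset only ever shifts content right or down.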
Applying the Warps
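To warp Image 1 onto the new canvas, we just apply the translation:
$$I_1' = \mathrm{warp}(I_1,\; T)$$
To warp Image 2, we combine the homography and the translation:
$$I_2' = \mathrm{warp}(I_2,\; T H)$$
Here $T$ is the offset matrix (H_offset in the code) and $H$ is the homography (H_2to1).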
# Combine homography with canvas offset
H_final = H_offset @ H_2to1
# Warp images onto the shared canvas
warped1 = cv2.warpPerspective(image1, H_offset, (canvas_w, canvas_h))
warped2 = cv2.warpPerspective(image2, H_final, (canvas_w, canvas_h))
(Under the hood, warpPerspective performs inverse warping: iterating over the new canvas coordinates, applying the inverse matrix to find the source coordinate, and using bilinear interpolation to sample the color).
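To illustrate what inverse warping means, here is a simplified NumPy sketch of the idea. It is not OpenCV's actual implementation: the function name `warp_inverse` is hypothetical, it handles only single-channel images of at least 2x2 pixels, it fills pixels with no source data with zeros, and it assumes the homography sends no canvas pixel to infinity.

```python
import numpy as np

def warp_inverse(src, H, out_w, out_h):
    """Warp grayscale image `src` with H (source -> canvas) by inverse mapping."""
    H_inv = np.linalg.inv(H)
    # Homogeneous coordinates of every pixel on the output canvas.
    xs, ys = np.meshgrid(np.arange(out_w), np.arange(out_h))
    canvas = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    # Map each canvas pixel back into the source image.
    sx, sy, sw = H_inv @ canvas
    sx, sy = sx / sw, sy / sw
    h, w = src.shape
    valid = (sx >= 0) & (sx <= w - 1) & (sy >= 0) & (sy <= h - 1)
    # Top-left neighbor of each sample point, clipped so x0 + 1 and y0 + 1 stay in range.
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    ax, ay = sx - x0, sy - y0
    # Bilinear interpolation between the four surrounding source pixels.
    out = ((1 - ax) * (1 - ay) * src[y0, x0] + ax * (1 - ay) * src[y0, x0 + 1]
           + (1 - ax) * ay * src[y0 + 1, x0] + ax * ay * src[y0 + 1, x0 + 1])
    return np.where(valid, out, 0.0).reshape(out_h, out_w)
```

Iterating over the destination rather than the source guarantees that every canvas pixel receives exactly one value, which avoids the holes that forward mapping would leave.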
4. Simple Blending: Defining the Mask
Once warped, we have two aligned images. To combine them, we need to decide which image to use for each pixel on the canvas. We do this using a Mask ($M$).
What is a Mask?
A mask is a grayscale image of the same size as our canvas where:
- A value of 1 (White) means “Use Image 1”.
- A value of 0 (Black) means “Use Image 2”.
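The final image is calculated as:
$$I_{\text{final}} = M \odot I_1' + (1 - M) \odot I_2'$$
where $M$ is the mask, $I_1'$ and $I_2'$ are the warped images, and $\odot$ is per-pixel multiplication.
# Blend the warped images with the mask (1 = use Image 1, 0 = use Image 2).
# For color images, give the mask a trailing channel axis first: mask = mask[..., None]
final_image = mask * warped1 + (1 - mask) * warped2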
Defining the Seam
For a beginner, the simplest mask is a Binary Mask that splits the overlap right down the middle. If Image 1 is on the left and Image 2 is on the right, we find the horizontal center of the overlapping region and create a mask that is 1 to the left of that line and 0 to the right.
While simple, this often leaves a visible “seam” because the two images might have slightly different brightness or colors. To solve this, we can use more advanced techniques like Laplacian Pyramids (see Appendix B).
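The binary mask described above can be sketched as follows. This is a minimal sketch, assuming Image 1 lies to the left of Image 2 and that we have boolean coverage maps `valid1` and `valid2` marking which canvas pixels each warped image actually fills (for example, obtained by warping an all-ones image). The names `binary_seam_mask` and `blend` are hypothetical.

```python
import numpy as np

def binary_seam_mask(valid1, valid2):
    """Build a 0/1 mask that splits the overlap at its horizontal center.

    valid1, valid2: boolean (H, W) arrays of canvas pixels covered by each image.
    Assumes Image 1 is on the left and Image 2 on the right.
    """
    overlap_cols = np.where((valid1 & valid2).any(axis=0))[0]
    mask = np.zeros(valid1.shape, dtype=float)
    if overlap_cols.size:
        # Vertical seam through the middle column of the overlap.
        seam_x = (overlap_cols.min() + overlap_cols.max() + 1) // 2
        mask[:, :seam_x] = 1.0
    # Outside the overlap, always use whichever image has data.
    mask[valid1 & ~valid2] = 1.0
    mask[~valid1] = 0.0
    return mask

def blend(warped1, warped2, mask):
    """Apply final = mask * warped1 + (1 - mask) * warped2."""
    if warped1.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over color channels
    return mask * warped1 + (1.0 - mask) * warped2
```

Forcing the mask to follow the coverage maps outside the overlap keeps black, empty canvas regions of one image from overwriting valid pixels of the other.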
Appendix A: Proof that Pure Camera Rotation is a Homography
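Let a 3D point be $\mathbf{X} = (X, Y, Z)^\top$. A camera projects this 3D point onto a 2D pixel coordinate $\mathbf{x}$ (in homogeneous coordinates) using the camera intrinsic matrix $K$ and its rotation $R$ and translation $\mathbf{t}$:
$$\lambda\, \mathbf{x} = K (R\, \mathbf{X} + \mathbf{t})$$
where $\lambda$ is a scale factor (the projective depth). We write $K_1$ and $K_2$ for the intrinsics of the two views; they are equal unless the zoom changes.
If the first camera is at the origin with no rotation ($R = I$, $\mathbf{t} = \mathbf{0}$), its projection equation is:
$$\lambda_1 \mathbf{x}_1 = K_1 \mathbf{X}$$
which gives $\mathbf{X} = \lambda_1 K_1^{-1} \mathbf{x}_1$.
If the second camera shares the exact same center ($\mathbf{t} = \mathbf{0}$) but is rotated by $R$, its projection is:
$$\lambda_2 \mathbf{x}_2 = K_2 R\, \mathbf{X}$$
Substituting $\mathbf{X}$ from the first equation into the second:
$$\lambda_2 \mathbf{x}_2 = \lambda_1 K_2 R K_1^{-1} \mathbf{x}_1$$
Since homogeneous coordinates are scale-invariant, the scalar $\lambda_1 / \lambda_2$ doesn’t change the 2D point. Therefore, the pixels are related by a linear transformation matrix:
$$\mathbf{x}_2 \sim H\, \mathbf{x}_1, \qquad H = K_2 R K_1^{-1}$$
This proves that for pure camera rotation, the mapping between the two images is purely a homography $H$, entirely independent of the depth of the 3D point $\mathbf{X}$! (Here $H$ maps Image 1 to Image 2; the matrix used in the main text maps Image 2 to Image 1 and is its inverse, $H^{-1} = K_1 R^\top K_2^{-1}$, which is also a homography.)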
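The result can also be checked numerically. The sketch below uses made-up values (an 800-pixel focal length, a 10° rotation about the vertical axis, identical intrinsics for both views) and confirms that a single matrix built from the intrinsics and the rotation maps the first view's pixels to the second view's pixels for points at very different depths.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation matrix about the camera's vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, X):
    """Project 3D points X (N, 3) with a camera at the origin rotated by R."""
    x = (K @ R @ X.T).T
    return x[:, :2] / x[:, 2:3]

# Assumed example values, not taken from any real camera.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = rotation_y(np.deg2rad(10.0))
H = K @ R @ np.linalg.inv(K)  # homography from view 1 to view 2

# Random 3D points in front of the camera, with depths from 1 to 100 units.
rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 100.0, size=200)
X = np.column_stack([rng.uniform(-0.4, 0.4, 200) * depth,
                     rng.uniform(-0.3, 0.3, 200) * depth,
                     depth])

x1 = project(K, np.eye(3), X)  # pixels in the unrotated view
x2 = project(K, R, X)          # pixels in the rotated view

# Map view-1 pixels through H and compare against the true view-2 pixels.
x1_h = np.hstack([x1, np.ones((len(x1), 1))]) @ H.T
x2_from_H = x1_h[:, :2] / x1_h[:, 2:3]
max_error = np.abs(x2_from_H - x2).max()
```

If the second camera were also translated, no single 3x3 matrix would reproduce `x2` for points at mixed depths, which is the parallax that makes handheld panoramas of nearby objects hard to align.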
Appendix B: Advanced Blending (Laplacian Pyramids)
To eliminate visible seams, we use Multi-band Blending:
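- Seam Finding (Distance Transform): To avoid artifacts from the sharp image boundaries, we compute a “distance transform” for both images, which gives each pixel its distance to the nearest border of its own image. We place the blending seam where the two distances are equal. This line runs through the middle of the overlap, where both images are farthest from their own edges and therefore have the most reliable data.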
- Pyramid Decomposition: We break both images and the weight mask into Gaussian and Laplacian Pyramids.
- Multi-scale Blending: We blend the Laplacian levels scale-by-scale. This allows us to blend low-frequency color changes over a wide area while keeping high-frequency details sharp and localized.
- Reconstruction: Collapsing the blended levels creates a seamless, professional result.
# Build pyramids from the warped (canvas-aligned) images, so they match the mask's size
lp1 = build_laplacian_pyramid(warped1)
lp2 = build_laplacian_pyramid(warped2)
gm = build_gaussian_pyramid(weight_mask)
# Blend scale-by-scale
blended_lp = [m * l1 + (1 - m) * l2 for l1, l2, m in zip(lp1, lp2, gm)]
# Reconstruct
result = reconstruct_from_pyramid(blended_lp)
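For readers who want to see what the three helpers might look like, here is a deliberately simplified NumPy sketch. These are hypothetical stand-ins, not the repository's implementations: they replace the usual Gaussian blur with a 2x2 box average, upsample by pixel repetition, work on single-channel images, and assume both image dimensions are divisible by `2 ** levels`. A real implementation would more likely build on `cv2.pyrDown` and `cv2.pyrUp`.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (stand-in for blur + subsample)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double the resolution by repeating each pixel 2x2."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_gaussian_pyramid(img, levels=3):
    """List of progressively smaller, smoother copies of img (finest first)."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels):
        pyr.append(downsample(pyr[-1]))
    return pyr

def build_laplacian_pyramid(img, levels=3):
    """Each level stores the detail lost between two Gaussian levels."""
    gp = build_gaussian_pyramid(img, levels)
    lp = [gp[i] - upsample(gp[i + 1]) for i in range(levels)]
    lp.append(gp[-1])  # the coarsest level keeps the low-frequency residual
    return lp

def reconstruct_from_pyramid(lp):
    """Collapse the pyramid: upsample the coarse image and add back each detail level."""
    img = lp[-1]
    for detail in reversed(lp[:-1]):
        img = upsample(img) + detail
    return img
```

Because each Laplacian level stores exactly what the matching upsampling step loses, collapsing an unblended pyramid returns the original image. The blending effect comes from the mask pyramid: at coarse levels the mask edge is spread over many pixels, so low frequencies mix gradually while fine detail still switches sharply at the seam.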
Visual Comparison
As seen below, Laplacian blending (right) effectively hides the exposure differences and seam artifacts that are visible in simple binary blending (left).
[Figure: simple binary blending (left) vs. Laplacian blending (right)]
Appendix C: Multi-Image Stitching Strategies
When stitching more than two images, we have two primary implementation options in this repository:
1. Planar Multi-Image Stitching
This approach extends the 2-image homography logic. We select a central image as the “Anchor” reference and match all other images directly to it.
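# Select the middle image as the anchor (reference frame)
ref_index = len(images) // 2
# Match every other image to the same anchor
for i in range(len(images)):
    if i == ref_index:
        continue  # the anchor maps to itself (identity)
    # pts_moving / pts_anchor: matched points of image i and the anchor (matching step omitted)
    H, inlier_mask = cv2.findHomography(pts_moving, pts_anchor, cv2.RANSAC)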
Limitations:
- Range Limitation: This only works for short sequences (3-5 images). If the first and last images do not overlap with the central anchor, feature matching will fail.
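- Geometric Distortion: For wide fields of view, the planar projection stretches content more and more toward the edges, and the stretching becomes unbounded (“infinite stretching”) as the field of view approaches 180°.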
- Alignment Strategy: A more robust strategy for long sequences is “Chaining” (matching adjacent pairs $(I_1, I_2)$, $(I_2, I_3)$, ...), but this is prone to “drift” (accumulated errors).
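As a sketch of the chaining idea, the pairwise homographies between neighbors can be composed into one homography per image that maps it into the anchor's frame. The function name `chain_to_anchor` and the convention that `pairwise[i]` maps image `i + 1` into image `i` are assumptions made for this example.

```python
import numpy as np

def chain_to_anchor(pairwise, ref_index):
    """Compose adjacent-pair homographies into per-image maps to the anchor.

    pairwise[i] maps image i+1 into image i's frame (assumed convention).
    Returns a list where entry i maps image i into image ref_index's frame.
    """
    n = len(pairwise) + 1
    to_ref = [None] * n
    to_ref[ref_index] = np.eye(3)
    # Images after the anchor: H(i -> ref) = H(i-1 -> ref) @ H(i -> i-1).
    for i in range(ref_index + 1, n):
        to_ref[i] = to_ref[i - 1] @ pairwise[i - 1]
    # Images before the anchor: H(i -> ref) = H(i+1 -> ref) @ inv(H(i+1 -> i)).
    for i in range(ref_index - 1, -1, -1):
        to_ref[i] = to_ref[i + 1] @ np.linalg.inv(pairwise[i])
    # Normalize so the bottom-right entry is 1.
    return [H / H[2, 2] for H in to_ref]
```

Each composed matrix is a product of several estimates, so a small error in one pair propagates to every image beyond it. This is the drift mentioned above, and it is one reason to anchor on the middle image: the longest chain is then only half the sequence.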
2. Cylindrical Multi-Image Stitching
This is the ideal model for panoramas taken with a rotating camera. We first project each planar image onto a cylinder before aligning them.
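The Math: We project planar coordinates $(x, y)$, measured from the image center, onto a cylinder with radius $f$ (the focal length in pixels):
$$x' = f \arctan\!\left(\frac{x}{f}\right), \qquad y' = \frac{f\, y}{\sqrt{x^2 + f^2}}$$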
Once all images are projected onto the cylinder, aligning them becomes a simple matter of finding the pure 2D translation between the overlapping regions:
# do SIFT detections and get matched points
# ...
# Project planar image to cylindrical surface
cylindrical_img = cv2.remap(planar_img, x_src, y_src, cv2.INTER_LINEAR)
# transform points coords to cylindrical space
# ...
# Align in cylindrical space (Pure 2D translation)
shift = np.median(pts_ref - pts_moving, axis=0)
H = np.array([[1, 0, shift[0]], [0, 1, shift[1]], [0, 0, 1]])
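The `x_src` and `y_src` maps passed to `cv2.remap` above are not defined in the snippet. One way to build them, sketched here under assumptions, is to invert the standard cylindrical projection $x' = f \arctan(x / f)$, $y' = f y / \sqrt{x^2 + f^2}$: for every pixel of the cylindrical output, compute which planar pixel it should sample. The function name `cylindrical_maps` is hypothetical, and the principal point is assumed to sit at the image center.

```python
import numpy as np

def cylindrical_maps(width, height, f):
    """Sampling maps for cv2.remap: cylindrical output pixel -> planar source pixel.

    f is the focal length in pixels; the principal point is assumed to be
    the image center.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    theta = (xs - cx) / f                     # angle around the cylinder axis
    x_src = f * np.tan(theta) + cx            # inverse of x' = f * arctan(x / f)
    y_src = (ys - cy) / np.cos(theta) + cy    # inverse of y' = f * y / sqrt(x^2 + f^2)
    # cv2.remap expects single-precision float maps.
    return x_src.astype(np.float32), y_src.astype(np.float32)
```

The maps depend only on the image size and the focal length, so they can be computed once and reused for every image in the sequence.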
Benefits & Limitations:
- Benefit: Once on the cylinder, the relationship between images taken by a camera rotating about its vertical axis becomes (approximately) a pure 2D translation $(t_x, t_y)$, which is much simpler to estimate and prevents edge stretching.
- Limitation (Wavy Boundaries): The top and bottom edges of the panorama appear wavy because straight lines in the original images become curves on the cylinder. Standard software uses Auto-Cropping to fix this.
- Limitation (Focal Length): This method requires an accurate estimate of the camera’s focal length ($f$) to project the curves correctly.
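The focal length needed here is expressed in pixels, not millimeters. Two common ways to obtain a starting estimate are sketched below, assuming an ideal pinhole camera with no lens distortion. The function names are hypothetical, and the inputs (field of view, sensor width) would have to come from the camera's specifications or the image metadata.

```python
import math

def focal_from_fov(image_width_px, horizontal_fov_deg):
    """Focal length in pixels from the horizontal field of view (pinhole model)."""
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    return (image_width_px / 2.0) / math.tan(half_fov)

def focal_from_sensor(focal_length_mm, sensor_width_mm, image_width_px):
    """Focal length in pixels from the lens focal length and the sensor width."""
    return focal_length_mm * image_width_px / sensor_width_mm
```

Both formulas describe the same geometry: half the image width subtends half the field of view at a distance of one focal length. An estimate obtained this way is only as good as the specifications behind it, so it is best treated as an initial value.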