3D Gaussian splatting (3DGS) has demonstrated significant potential in audio-driven talking head synthesis. However, despite notable advances in speed and fidelity, current methods still suffer from inaccurate lip movements and facial artifacts. To address these issues, we propose LMTalker, a sparse landmark-guided 3DGS method that applies facial landmarks for the first time in 3DGS-based talking head synthesis. Our method explicitly leverages sparse facial landmarks to guide the deformation of dense Gaussians, effectively reducing inconsistencies between the input audio and facial dynamics and thereby improving lip-movement accuracy and facial fidelity. Furthermore, we use facial landmarks hierarchically to achieve region-specific generation. By integrating audio information, we enhance clarity and reduce artifacts in the inner mouth region. Experimental results demonstrate that our method surpasses existing methods in fidelity and lip-movement accuracy while maintaining high rendering speed.
Overview of our proposed framework. We begin with the canonical Gaussians $\mathcal{G}_c$, obtained through a smooth-and-merge initialization. During training, we use landmarks as guidance to jointly learn the deformations of the head and inner mouth Gaussians through two deformation modules. For each Gaussian, $\Delta\mu$ is calculated as a weighted sum of the displacements of its $K$ nearest landmarks. The deformed Gaussians are rendered and stitched together to produce the final result $I'_{full}$. During inference, we adopt the pre-trained audio2motion model to generate the landmark motion that guides the Gaussian deformations.
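The weighted-sum rule for $\Delta\mu$ can be illustrated with a minimal sketch. The source does not specify how the weights are obtained, so the normalized inverse-distance weights used here are an assumption made purely for illustration, as are the function name `landmark_guided_offsets` and its parameters; this is not the deformation module of the proposed method.

```python
import math


def landmark_guided_offsets(gaussian_means, landmarks_canonical,
                            landmarks_deformed, k=3, eps=1e-8):
    """Return one (dx, dy, dz) offset per Gaussian.

    Each offset is a weighted sum of the displacements of the k landmarks
    nearest to that Gaussian in canonical space.
    ASSUMPTION: weights are normalized inverse distances; the source only
    says "weighted sum" and does not define the weights.
    """
    # Displacement of every landmark from canonical to deformed position.
    displacements = [
        tuple(d - c for c, d in zip(lm_c, lm_d))
        for lm_c, lm_d in zip(landmarks_canonical, landmarks_deformed)
    ]

    offsets = []
    for mu in gaussian_means:
        # Distance from this Gaussian centre to each canonical landmark.
        dists = [math.dist(mu, lm_c) for lm_c in landmarks_canonical]
        # Indices of the k nearest landmarks.
        nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        # Inverse-distance weights, normalized to sum to 1 (assumed scheme).
        raw = [1.0 / (dists[i] + eps) for i in nearest]
        total = sum(raw)
        weights = [w / total for w in raw]
        # Weighted sum of the selected landmark displacements, per axis.
        offset = tuple(
            sum(w * displacements[i][axis] for w, i in zip(weights, nearest))
            for axis in range(3)
        )
        offsets.append(offset)
    return offsets
```

Because the weights are normalized, a rigid translation of all landmarks moves every Gaussian by exactly that translation, while a Gaussian close to a single landmark follows that landmark almost exclusively.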
Visual comparison with representative methods, including Wav2Lip[1], AD-NeRF[6], GeneFace[11], GaussianTalker[21], TalkingGaussian[22], and GaussianTalker[23]. Keyframe syllables are annotated below the ground truth (GT). Red boxes highlight poor image quality and incorrect lip movements in the generated images. The dashed box shows our comparison with GaussianTalker[23], which, unlike the other methods, uses audio clips from the training set.