On Tuesday, Microsoft Research Asia unveiled VASA-1, an AI model that can create a synchronized animated video of a person talking or singing from a single photo and an existing audio track. In the future, it could power virtual avatars that render locally and don’t require video feeds—or allow anyone with similar tools to take a photo of a person found online and make them appear to say whatever they want.

  • simple@lemm.ee · 11 upvotes · 7 months ago

    That lip sync is scary good. It’s still a little off (the teeth are weirdly stretchy), but nobody would notice it’s a deepfake at first glance.

    Seems very similar to Nvidia’s idea of sending only a moving photo for video calls to reduce the bandwidth needed. Very nice.

    • Aatube@kbin.melroy.org · 4 upvotes · edited · 7 months ago

      We’d need better optimization and more powerful processing on ye average laputopu for that to happen.