Whisper speech-to-text for digitized historical audiovisual materials
The cultural heritage community is actively exploring the use of Whisper to improve the accessibility of audiovisual materials. As a freely available model capable of automated speech recognition across multiple languages, it has the potential to let institutions create transcripts for large collections cost-effectively. As a result of the massive media digitization project at Indiana University, we have a collection of over 350,000 digitized items that represent a range of physical formats, from wax cylinders to audiocassettes, motion picture film, VHS, and more. Owned by many units across the Indiana University system, the content varies in genre and includes field recordings, raw unedited TV footage, and home recordings, in addition to lectures and events, broadcast TV and radio, and educational recordings. As we think about the accessibility of this collection, we are asking whether Whisper can be leveraged to provide transcripts across this diversity of formats and genres.
This talk will share the outcomes of testing several Whisper models against 58 items from this collection, as well as initial work applying Whisper to a specific collection. Finally, we’ll point out how to get involved and learn more about ongoing work with automated speech recognition in the community.