Recommendations for audio and video clip length for using Amazon Polly text-to-speech voice narration

  • Question
  • Updated 2 months ago
I'm working on some proof-of-concept videos using Camtasia 9 and Amazon Polly text-to-speech narration (mp3 files). Is anyone else working with text-to-speech narration? It's a lengthy manual process, and I think I'm making the audio/video clips too short. Amazon Polly seems to work better with sentences that are more or less 10 words long (it's easier to manage SSML tags for pauses, pronunciation, etc.). I've been editing some Camtasia projects that were done previously, where the audio/video clips were very long (2-3 minutes). I can manage the synchronization with the Amazon Polly files as long as they're in English, but I had to hand off the projects when we were using Amazon Polly files for other languages, because they were too difficult to sync.
So my question is, how long should the audio/video clips be when you're planning on using Amazon Polly or any other text-to-speech narration?

Thanks,

Gina

Gina Fevrier

  • 13 Posts
  • 1 Reply Like

Posted 2 months ago


Daniela

  • 17 Posts
  • 8 Reply Likes
We are using TTS voice narration in our videos. We generally use one mp3-file per sentence because that makes it easy to position them exactly where we need to, when editing the video.
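
The one-file-per-sentence approach could be scripted roughly as follows. This is only a sketch under assumptions: the helper names (`split_sentences`, `clip_filename`, `synthesize_clips`) and the voice "Joanna" are illustrative and not something from this thread, the sentence splitter is deliberately naive (it will break on abbreviations like "e.g."), and the Polly call needs `boto3` plus configured AWS credentials.

```python
import re
from typing import List


def split_sentences(script: str) -> List[str]:
    """Split a narration script on ., ! or ? followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    return [p for p in parts if p]


def clip_filename(index: int, prefix: str = "clip") -> str:
    """Zero-padded names keep the clips in script order in a file listing."""
    return f"{prefix}_{index:03d}.mp3"


def synthesize_clips(script: str, voice_id: str = "Joanna") -> List[str]:
    """Write one mp3 per sentence with Amazon Polly; returns the file names."""
    import boto3  # imported lazily so the helpers above work without it

    polly = boto3.client("polly")
    names = []
    for i, sentence in enumerate(split_sentences(script), start=1):
        response = polly.synthesize_speech(
            Text=sentence, OutputFormat="mp3", VoiceId=voice_id
        )
        name = clip_filename(i)
        with open(name, "wb") as f:
            f.write(response["AudioStream"].read())
        names.append(name)
    return names
```

Numbered file names make it easier to drop the clips onto the timeline in script order.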

In our workflow we start off with the script and then produce the mp3 files. Then we move to Camtasia, creating screen recordings and sometimes using callouts, animations and the like to cover other kinds of content, such as concepts or general workflows. Since I know how long an mp3 file is, I can time the steps in the screen recording and the animations exactly to the length I need to keep the visual content in sync with the audio content.
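
As a rough illustration of timing visuals to known clip lengths, the sketch below computes where each clip would start on the timeline. The function names and the half-second gap are assumptions for the example, and the durations are assumed to have been read off beforehand (for instance from the editor or a media tool).

```python
from typing import List


def clip_start_times(durations: List[float], gap: float = 0.5) -> List[float]:
    """Start time (seconds) of each clip when placed back to back with a gap."""
    starts = []
    t = 0.0
    for d in durations:
        starts.append(round(t, 3))
        t += d + gap
    return starts


def total_length(durations: List[float], gap: float = 0.5) -> float:
    """Overall narration length: all clips plus the gaps between them."""
    if not durations:
        return 0.0
    return round(sum(durations) + gap * (len(durations) - 1), 3)


# Example: three sentence clips of 2.0 s, 3.5 s and 1.25 s
starts = clip_start_times([2.0, 3.5, 1.25])
```

Each screen-recording step or animation can then be stretched or trimmed to end at the next clip's start time.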

We have determined that our TTS voice sounds more natural when we vary sentence lengths regularly in the script and try to use active, verb-based sentences, at least in English and in German. Sentences with a lot of nouns, especially longer ones, and passive verbs tend to sound more robotic. We write our scripts in a way that accommodates the voice's quirks. Therefore, I think that writing a script for a human voice should differ from writing a script for a TTS voice.

Gina Fevrier

  • 13 Posts
  • 1 Reply Like
Hi Daniela,

I think it's easier to use shorter audio clips as well. I'll try your idea about varying sentence length.  We do use active voice. Writing a script for TTS definitely has more requirements than a more casual script with a human voice.  

Here's an example of a video we created with a Spanish Amazon Polly voice (software overview). https://youtu.be/kyJIJMX285w

I don't know if your German videos are publicly available, but it would be great to see an example.

Thanks very much for your feedback,

Gina

Joe Morgan

  • 5631 Posts
  • 2923 Reply Likes

Hi Gina,

I have a program called Text Speaker. I don’t know how extensive your text-to-speech work is, but if this is something you’re going to be doing a lot of, you might be better served by Text Speaker.

I purchased the program to give voices to animated characters because they can sound so realistic.

https://www.deskshare.com/text-to-speech-software.aspx

I was checking out Amazon Polly text-to-speech this morning, and I see they use the same voices as Text Speaker. Text Speaker costs $30, and you must purchase the voices for $30 apiece. There are a few foreign languages available. You can download a free trial version of the program that lasts “forever”. You can get the voices for 30 days on a trial basis. You can play back and export 200 words in trial mode, as often as you wish; you just cannot exceed 200 words. There are other restrictions that make the program frustrating to use in trial mode, but you might want to give it a whirl.

With Text Speaker you can insert pauses. What happens with these computer-generated voices is that the words tend to run together and sound very mechanical, or the voice has an unnatural pitch shift.

In Text Speaker you’re essentially working in a Word document. If you save your work, you save it as a Word document.

Anyway, you insert your text and play it back in real time. When the words run together and it sounds phony, you can usually just insert a comma to break up a sentence. There are times when changing 1 or 2 words makes it sound better. In some of those cases, the original words didn’t sound quite right in the first place. LOL

You can export the audio in mp3 or .wav. A .wav is a much better choice for creating video than MP3.

I’ve never used it for text-to-speech in Camtasia, but you could literally do the entire script for a video in one shot. With the insert-pause feature, you can insert audio pauses of X seconds, which makes it easy to cut the audio into individual segments in the editor and reposition them. For instance, if you inserted 3-second pauses between certain sentences/segments, they would be easy to spot in the timeline because no waveform would exist there. You could cut the audio at that point, slide it to where it belongs, and so on.
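
Spotting those pauses could also be automated. The sketch below is one possible way to do it, not a feature of Text Speaker or Camtasia: it scans a 16-bit mono WAV for long runs of near-silent samples and reports them as cut points. The function names, the amplitude threshold of 200 and the 2-second minimum are assumptions chosen for illustration.

```python
import array
import sys
import wave
from typing import List, Sequence, Tuple


def read_mono_16bit(path: str) -> Tuple[array.array, int]:
    """Load samples and sample rate from a 16-bit mono WAV file."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected a 16-bit mono WAV file")
        samples = array.array("h")
        samples.frombytes(w.readframes(w.getnframes()))
        rate = w.getframerate()
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def find_silences(
    samples: Sequence[int], rate: int, threshold: int = 200, min_seconds: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of quiet runs of at least min_seconds."""
    silences = []
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) <= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and (i - run_start) >= min_seconds * rate:
                silences.append((run_start / rate, i / rate))
            run_start = None
    # A quiet run may extend to the very end of the file.
    if run_start is not None and (len(samples) - run_start) >= min_seconds * rate:
        silences.append((run_start / rate, len(samples) / rate))
    return silences
```

Calling `find_silences(*read_mono_16bit("narration.wav"))` (a hypothetical file name) would list the gaps where the narration can be split.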

I took your question and inserted it in Text Speaker. I inserted a few commas and made a couple of corrections to make the voice I chose sound good. Here’s a link to the audio file if you would like to hear it.

http://www.mediafire.com/file/4arfk5ecg55itzy/gina.WAV/file

Regards, Joe


Gina Fevrier

  • 13 Posts
  • 1 Reply Like
Hi Joe,

Thanks for all of the information. I listened to the video on the Deskshare site. (I couldn't download the WAV sample you created, because the site was also trying to download software.) I do think that Amazon Polly has more natural-sounding voices, at least from the sample video. With Amazon Polly's SSML tags, you can insert pauses, change pronunciation, etc. And so far for us it has been free. Here are the pricing details.

A drawback with Amazon Polly is that for now you cannot export WAV files. And the mp3 files don't work in Camtasia 9 or 2018 (but they do in Camtasia 8).  It would be interesting to know if you can import the Text Speaker WAV files into Camtasia (if you do, please let us know what Camtasia version you're using).
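
A possible workaround, offered as an untested assumption rather than something confirmed in this thread: the Polly API lists `pcm` among its output formats (raw 16-bit signed little-endian mono audio, 16000 Hz by default as far as I know), and raw PCM can be wrapped in a WAV header with Python's standard `wave` module. The function name `pcm_to_wav_bytes` is made up for this sketch.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM audio in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)          # mono
        w.setsampwidth(2)          # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


# Example with one second of silence standing in for Polly's PCM output
wav_data = pcm_to_wav_bytes(b"\x00\x00" * 16000)
```

The resulting bytes can be written to a `.wav` file and imported like any other WAV clip; whether Camtasia accepts it would still need to be checked.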

The main issue I'm trying to resolve is the length of the audio/video files in Camtasia when designing for text-to-speech.  So far I'd prefer to use sentences about 10 words or less.  It's easier to synchronize with the video clips.

Gina

kayakman, Champion

  • 6243 Posts
  • 1828 Reply Likes
just curious ... but regarding "And the mp3 files don't work in Camtasia 9 or 2018" ...

have you tried importing them in free Audacity, and exporting as WAV, to use in Camtasia?

Joe Morgan

  • 5631 Posts
  • 2923 Reply Likes
This file is hosted on MediaFire: http://www.mediafire.com/file/4arfk5ecg55itzy/gina.WAV/file
There's no reason you can't download it and listen to it. It has absolutely nothing to do with any software. It's Amy, British accent.

I don't know much about the Amazon site. I watched a YouTube video about it this morning, and at least one of the voices comes from Deskshare/Text Speaker. Amazon didn't actually create that one.
Brian, British accent, was the one featured in this video: https://youtu.be/j77ZwRHh3BE
I purchased that voice.

But these .wav files play in any version of Camtasia: 8, 9, and 2018.


Gina Fevrier

  • 13 Posts
  • 1 Reply Like
Hi Joe,
I still don't want to open the WAV file.  All sorts of pop-ups are happening.
Thanks very much for the info and for confirming that the Text Speaker WAV files work for Camtasia 8, 9, and 2018.

Gina Fevrier

  • 13 Posts
  • 1 Reply Like
Hi kayakman,
I should have mentioned that I was able to run the Amazon Polly mp3s through Audacity and import into Camtasia 9. It just requires extra steps.  So Robert at TechSmith Support entered a defect.

Thanks,

Gina
 

Joe Morgan

  • 5631 Posts
  • 2923 Reply Likes
So Gina Fevrier,

I don't know why you're getting pop-ups; MediaFire doesn't work like that.

However, here's that audio uploaded to Vimeo. It's easier for anyone interested to hear it this way.

I probably should have done this in the first place




Gina Fevrier

  • 13 Posts
  • 1 Reply Like
Hi Joe,

"Amy" sounds great!  She speaks very slowly, though.  In Amazon Polly you can change the speaking rate with SSML tags.
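
For anyone curious what that looks like, here is a small sketch that builds an SSML string with a speaking rate and a trailing pause. `<speak>`, `<prosody rate="...">` and `<break time="..."/>` are standard SSML tags that Polly accepts; the helper name `ssml_sentence` and its defaults are just assumptions for the example.

```python
from xml.sax.saxutils import escape


def ssml_sentence(text: str, rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap one sentence in SSML with a speaking rate and an optional pause."""
    # escape() protects characters such as & and < that would break the markup
    body = f'<prosody rate="{rate}">{escape(text)}</prosody>'
    if pause_ms > 0:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"


# A slightly slower sentence followed by a half-second pause
ssml = ssml_sentence("Save & close.", rate="90%", pause_ms=500)
```

The string would be sent to Polly as SSML input instead of plain text.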

I don't know why Text Speaker charges so much when Amazon Polly is so cheap.

Thanks,

Gina

Joe Morgan

  • 5631 Posts
  • 2923 Reply Likes
Same with Text Speaker: you can change the speed of the voice. If you look at my image, you will see Speed, Volume and Pitch at the bottom of the editor, to the right of the selected voice profile. That's her default speed.
Text Speaker was around long before Amazon text-to-speech. I suspect they may have purchased Deskshare's copyrights. Amazon can afford to go cheaper. Like I said, I don't know much about the program.
I'm willing to bet they have Amy, British accent, as well.
It's a shame they don't have .wav export.

I like the standalone program myself. I'm not sure I'd like being dependent on Amazon and the internet to function. To me, .wav is important.