So my question is, how long should the audio/video clips be when you're planning on using Amazon Polly or any other text-to-speech narration?
In our workflow we start off with the script and then produce the MP3 files. Then we move to Camtasia, creating screen recordings and sometimes using callouts, animations and the like to cover other kinds of content, such as concepts or general workflows. Since I know how long each MP3 file is, I can time the steps in the screen recording and the animations to exactly the length I need, so the visual content stays in sync with the audio.
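As a rough illustration of that timing step, here is a minimal Python sketch that turns a list of narration clip lengths into start and end times for the matching visuals. The function name `cue_times` and the example durations are made up for illustration; the real lengths would come from your own audio files.

```python
from itertools import accumulate


def cue_times(durations_s):
    """Return (start, end) in seconds for each clip played back to back."""
    ends = list(accumulate(durations_s))
    starts = [0.0] + ends[:-1]
    return list(zip(starts, ends))


# Hypothetical clip lengths in seconds, one per narration segment.
for start, end in cue_times([4.5, 6.0, 3.25]):
    print(f"visual step runs from {start:.2f}s to {end:.2f}s")
```

Each visual step then gets the same length as its narration clip, which is the idea behind timing the recording to the audio.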
We have found that our TTS voice sounds more natural when we vary sentence lengths regularly in the script and use active, verb-based sentences, at least in English and in German. Sentences with a lot of nouns, especially longer ones, and passive verbs tend to sound more robotic. We write our scripts in a way that accommodates the voice's quirks. Therefore, I think that a script written for a human voice should differ from a script written for a TTS voice.
I have a program called Text Speaker. I don’t know how extensive your text-to-speech work is, but if this is something you are going to be doing a lot of, you might be better served by Text Speaker.
I purchased the program to give voices to animated characters, because the voices can sound so realistic.
I was checking out Amazon Polly text-to-speech this morning, and I see they use the same voices as Text Speaker. Text Speaker costs $30, and you must purchase the voices for $30 apiece. There are a few foreign languages available. You can download a free trial version of the program that you can use “forever”, and you can get the voices for 30 days on a trial basis. In trial mode you can play back and export 200 words at a time. You can do this as often as you wish; you just cannot exceed 200 words. There are other restrictions that make the program frustrating to use in trial mode, but you might want to give it a whirl.
With Text Speaker you can insert pauses. What happens with these computer-generated voices is that the words tend to run together and sound very mechanical, or the voice has an unnatural pitch shift.
In Text Speaker you’re essentially working in a Word document. If you save your work, you save it as a Word document.
Anyway, you insert your text and play it back in real time. When the words run together and it sounds phony, you can usually just insert a comma to break up a sentence. There are times when changing one or two words makes it sound better. In some of those cases, the original words didn’t sound quite right in the first place. LOL
You can export the audio as MP3 or WAV. WAV is a much better choice than MP3 for creating video.
I’ve never used it for text-to-speech in Camtasia, but you could literally do the entire script for a video in one shot. With the insert-pause feature you can insert audio pauses of X seconds, making it easy to cut the audio into individual segments in the editor and reposition them. For instance, if you inserted 3-second pauses between certain sentences or segments, they would be easy to spot in the timeline because no waveform would exist there. You could cut the audio at that point, slide it to where it belongs, and so forth.
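If you wanted to find those pauses without eyeballing the timeline, something like the following Python sketch could list them. It is only a sketch under some assumptions: the export is a mono, 16-bit PCM WAV file, and the `threshold` value for "quiet" is a guess that would need tuning. Both function names are hypothetical, not part of Text Speaker or Camtasia.

```python
import array
import sys
import wave


def read_mono_16bit_wav(path):
    """Return (samples, sample_rate) for a mono, 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit WAV files")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, sample_rate


def find_silent_gaps(samples, sample_rate, min_gap_s=2.0, threshold=200):
    """Return (start_s, end_s) for each quiet run of at least min_gap_s seconds."""
    gaps = []
    run_start = None
    # The extra None marks the end of the audio so a trailing pause is closed too.
    for i, sample in enumerate(list(samples) + [None]):
        quiet = sample is not None and abs(sample) <= threshold
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            if (i - run_start) / sample_rate >= min_gap_s:
                gaps.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    return gaps


# Usage with a real export (the file name is an assumption):
# samples, rate = read_mono_16bit_wav("narration.wav")
# print(find_silent_gaps(samples, rate, min_gap_s=2.0))
```

Each pair it returns is a place where you could cut the audio in the editor.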
I took your question and inserted it in Text Speaker. I inserted a few commas and made a couple of corrections to make the voice I chose sound good. Here’s a link to the audio file if you would like to hear it.
Thanks for all of the information. I listened to the video on the Deskshare site. (I couldn't download the WAV sample you created, because the site was also trying to download software.) I do think that Amazon Polly has more natural-sounding voices, at least judging from the sample video. With Amazon Polly's SSML tags, you can insert pauses, change pronunciation, etc. And so far it has been free for us. Here are the pricing details.
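To give an idea of what the pause tags look like, here is a small Python sketch that joins script segments into one SSML string with a fixed `<break>` between them. The helper name `build_ssml` and the sample sentences are made up for illustration; `<speak>` and `<break time="..."/>` are standard SSML elements.

```python
from xml.sax.saxutils import escape


def build_ssml(segments, pause_s=3):
    """Join script segments into one SSML string with a pause between them."""
    pause = f'<break time="{pause_s}s"/>'
    # escape() keeps characters like & and < from breaking the markup.
    body = pause.join(escape(segment.strip()) for segment in segments)
    return f"<speak>{body}</speak>"


print(build_ssml(["Open the file.", "Save & close."], pause_s=3))
```

The resulting string has to be submitted to Polly as SSML rather than as plain text for the tags to take effect.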
A drawback with Amazon Polly is that for now you cannot export WAV files. And the mp3 files don't work in Camtasia 9 or 2018 (but they do in Camtasia 8). It would be interesting to know if you can import the Text Speaker WAV files into Camtasia (if you do, please let us know what Camtasia version you're using).
The main issue I'm trying to resolve is the length of the audio/video files in Camtasia when designing for text-to-speech. So far I prefer sentences of about 10 words or fewer, because they are easier to synchronize with the video clips.
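A quick way to check a script against that rule of thumb might look like the Python sketch below. It is a rough sketch only: the sentence splitting is naive (it splits after `.`, `!` or `?` followed by whitespace), and the function name `long_sentences` and the sample script are made up for illustration.

```python
import re


def long_sentences(script, max_words=10):
    """Return (word_count, sentence) for every sentence over max_words."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged


script = (
    "Click Save. Then open the export dialog and choose the format "
    "that your video editor can import without problems."
)
for count, sentence in long_sentences(script):
    print(f"{count} words: {sentence}")
```

Sentences it flags are candidates for splitting before generating the audio.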