Carla King interviews Ryan Scott: How to create audiobooks with synthetic (AI) voices, including your own voice.
Nonfiction Authors Association Podcast | May 24, 2023
“I think, for some authors, they do want to read their own book, but find it pretty challenging to sit down and read their whole entire book on audio. It is a very time-consuming process, and I think authors maybe aren’t necessarily, ‘Hey, I would like my voice, but I don’t necessarily consider myself to be a professional reader.’ We could still take their voice and allow them to read their book through the power of AI.”
-Ryan Scott
About Ryan Scott
Ryan Scott is an entrepreneur who makes software for publishers and authors. He helped build the Internet marketing industry in its early days and is now focused on using AI to help authors find success with their books in the marketplace of ideas. Scott has built numerous AI applications, including one for automatic blog post creation, a chatbot service, and custom software for clients. His latest is Audie.ai, a website where authors can quickly and automatically create an audiobook from a manuscript. Ryan lives in the Austin area with his wife, two kids, and their dog named Fluffles.
Nonfiction Authors Podcast: Ryan Scott
Find the video podcast, show notes, links, quotes, and podcast transcript below.
Listen and subscribe to our podcast wherever you listen to podcasts. Watch the video interview on Facebook, LinkedIn, Twitter, or on our YouTube Channel where you can subscribe to our playlist.
Got feedback on our podcast? Want us to consider a guest or topic? Please let us know on our feedback form here.
Show Notes
In this episode…
- How the advancement of AI will help you reach a wider audience.
- How AI Speech Technology works.
- Why you would use a synthetic voice instead of your own.
- How to upload your own voice on Audie.ai.
- How to use Audie.ai to create an audiobook.
- How to use voice tags for your audiobooks.
- How AI databases are developing and feeding into systems.
- The cost breakdown of audiobook creation with synthetic voices.
Transcript
[00:00:00] Carla King: Hello and welcome to the Nonfiction Authors Podcast, where we interview experts who help you learn how to write, market, publish, promote, and profit from your book. The podcast is brought to you by NonfictionAuthorsAssociation.com, which is the home of NonfictionWritersConference.com. We have several membership levels, all of which offer discounts on our live courses and so many other benefits. Find out more at NonfictionAuthorsAssociation.com/join.
Hi everybody, and welcome to this week’s edition of the Nonfiction Authors Podcast. I’m your host, Carla King, and today our guest is Ryan Scott, who’s here to talk about synthetic or AI voices for your audiobooks and other content. Let me introduce our guest, and then we’ll start the conversation.
Ryan has a long history in technology and the internet. He began with co-developing opt-in email marketing, then created Causecast–a platform that empowers large corporations to make a difference through volunteering and giving. In recent years, though, his focus shifted toward cryptocurrencies and augmented and virtual reality, and he’s working to demystify AI and make it accessible for everyday marketers, entrepreneurs, and of course, authors.
Most recently, Scott has been building AI applications, including one for automatic blog post creation, and a chatbot service. His latest is Audie.ai, a website where authors can quickly and automatically create an audiobook from a manuscript.
Hi, Ryan, welcome to the podcast.
[00:01:40] Ryan Scott: Thank you for having me.
[00:01:41] Carla: I’m excited to discuss the latest in AI. And specifically, today, synthetic voices, as regards your text-to-speech platform, Audie.ai, which I’ve looked at and tested a little bit, and it’s awesome. When did you launch Audie.ai, by the way?
[00:01:59] Ryan: I feel like we’re in a continuous process of launching. And in fact, after we launched, we found out right away what the issues were that people were having, and what they really needed. You can’t know what people really need until you put it out there for them. And we found very quickly that some of the features that we had planned were needed sooner rather than later.
And so we kept it out there, but we also went back to the drawing board and created some of those features. For example, the ability to do multiple voices in one document was a very popular idea with folks. So we went back and created that. And then I would say we kind of relaunched it yesterday.
[00:02:34] Carla: Kind of like books, and beta readers, and drafts and all.
[00:02:37] Ryan: It’s exactly like it. It’s exactly like it. But the only difference–we really need a beta test on the public. Because we don’t know, either, until everybody starts using it, really. What are the things that really resonate with folks, and how do we double down on those?
[00:02:50] Carla: And it’s developing so quickly. I read something in the AI Breakfast newsletter a few weeks ago, actually, that I’m gonna quote here: ‘We may be in the last few months of AI-generated audio being distinguishable from authentic recordings.’ What do you think that means for authors and readers?
[00:03:10] Ryan: Well, I think it really provides an opportunity for authors to reach a wider audience.
A lot of authors don’t necessarily have the time or the money to go and get a book read by a professional. And I think that there’s still room for professionals to be doing the reading, but what about the folks that can’t afford it, and don’t even know where to find the readers? I think Audie.ai really provides an opportunity to get your audiobook without having to go through that long process or that big expense.
And I think, for some authors, they do want to read their own book, but find it pretty challenging to sit down and read their whole entire book on audio. It is a very time-consuming process, and I think authors maybe aren’t necessarily, ‘Hey, I would like my voice, but I don’t necessarily consider myself to be a professional reader.’ We could still take their voice and allow them to read their book through the power of AI.
[00:04:00] Carla: Well, I’m raising my hand on that one, because I have tried to narrate my own books, and my voice just gets tired. So I gave up. Let’s wait on the voice cloning, because I really wanna talk about that and just start with–how does AI speech technology actually work?
[00:04:23] Ryan: There’s a couple layers to the process. One is just the ability to synthesize somebody’s voice. And there’s been many years of research through folks that get paid a lot more than me to figure this stuff out–who have been working tirelessly for many, many years to come up with synthetic voices. And I think even today, you’ll still see synthetic voices on YouTube videos, for example. They don’t sound very realistic, and are completely expressionless, and kind of sound stilted and awkward.
And that really was the state of the art until very, very recently. But as technology continues to advance, and we have larger and larger computers, and have more memory, and they run faster and faster, we’re able to put a lot more data, a lot more samples of people speaking into the computer, for it to look at. It’s what’s called machine learning.
It’s less where you’re typing code and saying exactly what to do, and more observing what it’s seeing in the data, and trying to recognize patterns. Much in the same way that a human brain works–that is really what it’s based on. So it’s kind of learning like a person.
[00:05:27] Carla: So then it’s learning not only the words and the grammar, but also intonations, for instance?
[00:05:34] Ryan: Yes. Inflection. Context. You can read a sentence many different ways. And in English, you could change the meaning quite significantly with your tone. In terms of an audio book–reading it in different ways will just sound–it’ll sound more natural.
You’re going in it with different pacing at different parts of the sentence. And the AI now can do that. It sounds quite realistic simply because of that. I think people listen for that syntheticness, and want to hear things being repeated the same way. And when it doesn’t do that, they’re very surprised.
[00:06:04] Carla: I want to go back to something you said about why authors would use a synthetic voice instead of their own, or an actor. I think, for a lot of fiction, people want to use actors, because they really act out the parts, right? And I know AI speech technology isn’t coming with that as well. But for the nonfiction author, I was thinking about the difference between my memoir, which is a creative nonfiction project, and needs to be acted, maybe, by me. And my self-publishing guide, which is a straight instructional manual, which doesn’t need to be acted at all. So that seems, to me, perfect for an AI project.
[00:06:47] Ryan: Yes. That is definitely natural. Although when I’m in there playing with the voices, I do find that there’s some of them that are particularly good for nonfiction books. And they’ve got a very direct tone.
Some that might be good for a nonfiction book about a war would be different than one about something else entirely–business, for example. And some voices are more commanding, some voices are more deep and strong. I guess the one thing that ties them together is just the realism behind it, and that you can listen to it for quite some time, like you could a human voice.
But I think–to address one other thing that you said, if you don’t mind–the ability to have it acted was something that I thought, too, would be very tough to do. But there’s two ways that that’s approached, and I’ve heard some very convincing voice acting done through the AI. Part is in picking the right voices that had the right tone to begin with. The other is the context of what’s written. The AI can pick up on the context, and start to add excitement, just like a human reader would do. I would say it’s still a little shy of being Hollywood actor level, but the second part of this is what’s coming out soon–the ability to add not only the context part of it, but to add clues about what’s going on.
So even out of context, somebody could yell something in an excited way. We’re not going to pick it up from context clues, but a human would pick it up, and say it in a more excited way. So you’ll be able to add those types of things in, to cue the AI up so that they should read something in an excited way, or in a way that’s kind of different than the context.
[00:08:25] Carla: So perhaps in a memoir–or any creative nonfiction book that has emotion in it, or a story in it–there would be clues in the story about setting, and situation, that would cue the synthetic voice to read it in a depressed way, or an excited way, without having to program some special code into the writing?
[00:08:54] Ryan: Yeah. So what you talked about there was the context side of it. And as you’re reading a book, you do pick up the context, of course. And when you’re reading in your head, you kind of know how to read it. But in the case of the AI, it can pick up those clues as well. But when there’s cases where you do want to override what the AI thinks it is, you will be able to without writing code. You just click like, ‘Hey, Go depressed mode right over here.’
We don’t have the ability yet to do it with the emotion, but people are doing it with the voices: ‘Oh, I really want this voice here. I want this voice there.’ They’re spending the time to actually tag their books, even though that could be pretty time-consuming.
[00:09:29] Carla: Okay, so let’s go through that. So you go to Audie.ai and you upload your manuscript. I would say you probably strip out the copyright page and all of that. You choose a voice. How many voices are there to choose from now? How many do you think there will be by the end of 2023?
[00:09:49] Ryan: That’s a great question. So we have about a hundred voices in there. Some of them have been put in as samples, and some of them have been generated programmatically, based on different parameters. And we create new ones. Every day or so, we pop another one in there, based on–really–what the folks using it have asked for. Like, ‘I need somebody who has a New York voice,’ or that type of thing. But then, what’s becoming just more and more popular–especially as folks find out about it–is they want to clone their own voices.
‘So I really do want to read my own book, I just don’t have that kind of energy to keep it up for that long.’ And so that is something that we’ve been doing quite a bit. So those voices go in privately. So the author can only have access to their own voice. Matter of fact, we did an anthology. We’ve done four anthology books that had 10 authors in each. So they spent the time to get their voices from YouTube videos, where they were speaking on stage, or doing a webinar or something, and cloned them, put them in there, and then we just assign them to the correct spot. And then the book is read in 10 different voices.
[00:10:51] Carla: What if I haven’t been a public speaker? Is there a place where I can program in my voice? Or do I have to create a YouTube video or something?
[00:11:01] Ryan: Oh, that’s a great question. There’s going to be the ability right on the site–this should be up there in a day or two–to clone your voice. So you just hit a button, and then you speak for a few minutes. We really only need about five minutes of audio to be able to clone a voice.
[00:11:16] Carla: Really? So you could just artificially create intonations? Or that thing that I just said– ‘Really!’ –so I was surprised. Or, ‘Let’s talk seriously now,’ right? You get that from just five or ten minutes?
[00:11:30] Ryan: You can. Look, if you–in your five minutes–never do that type of thing where you say, ‘Really!’–it’s not going to pick up on it, of course. But if we took this interview, and I pulled your voice out of it, it would pick up on that type of thing. And that’s actually an interesting experiment that we should try.
[00:11:46] Carla: Okay. Let’s do it. I’m excited. Okay, I have my manuscript. I go to the website, I upload the text. It takes a while to process?
[00:11:57] Ryan: Oh no, not at all. It really takes just seconds. You can even select some of the text and say, ‘I’d like to test it,’ to make sure it’s coming in properly.
[00:12:03] Carla: So then let’s back up. So then it appears on the screen. Does it appear with a bunch of tools on the left or the right? What are the options?
[00:12:14] Ryan: All the text is in the middle. And then on the right hand side, you can see your chapters. And there’s a button to generate text for each one of the chapters.
So you don’t do the whole book at one time, and then you wait, and then you realize, ‘Oh, I should have changed something,’ or whatever. That wastes a lot of time for the AI. But then there’s also a little button where you can click to select the voices. There– if you have multiple voices–you could just click in your text where you want the next voice to be, and hit the voice, and it puts a little tag in there. It says, ‘Carla speaks here’, and then at a particular part, ‘Ryan speaks here.’ And then that’s it. You just hit generate and it generates. To do 10 minutes of audio, it takes about a minute to process. It’s very, very quick. So you could do your whole book in a very short amount of time.
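The inline tagging Ryan describes (‘Carla speaks here,’ ‘Ryan speaks here’) can be modeled as markers embedded in the manuscript text that split it into per-voice passages. This sketch assumes a hypothetical `[voice: NAME]` tag format–the interview doesn’t show Audie.ai’s actual syntax:

```python
import re

def split_by_voice_tags(text, default_voice="Narrator"):
    """Split manuscript text into (voice, passage) segments.

    Tags like [voice: Carla] are a hypothetical format; Audie.ai's
    actual tag syntax isn't shown in the interview.
    """
    segments = []
    voice = default_voice
    # Split on tags, keeping each tag's voice name via the capture group.
    parts = re.split(r"\[voice:\s*([^\]]+)\]", text)
    # parts alternates: text, voice, text, voice, text, ...
    for i, part in enumerate(parts):
        if i % 2 == 1:          # odd indices are voice names
            voice = part.strip()
        elif part.strip():      # even indices are passages
            segments.append((voice, part.strip()))
    return segments
```

Each `(voice, passage)` pair would then be sent to the synthesizer with the matching voice selected, which is why per-chapter generation (rather than the whole book at once) makes it cheap to fix a mis-assigned tag.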
[00:13:02] Carla: Okay. So the feature that’s coming–it’ll probably be out by the time this podcast is published–is I could have a paragraph, or a chapter spoken by one person, click that. So that is the slow process of assigning voices. So if you have dialogue, for instance. In my memoir, I have dialogue. So I could actually have a male, South African voice do the male South African guy in my book, right?
[00:13:27] Ryan: That’s exactly right. And in fact, that slow process of tagging is another part that we are able to automate using AI. And so that’s not here yet, but I was literally working on it today, and it’s just incredible to see. You feed it to something like ChatGPT and say, ‘identify all the speakers and put a tag in front of their name.’ You give it no context whatsoever–it figures it all out and it puts the tags in front. And so we’ll have that ability on the website.
[00:13:55] Carla: Wait a minute. So I put my–or you help me put my manuscript in ChatGPT?
[00:14:03] Ryan: No, you would just come to our website and you would load the manuscript. It would load it up, and then you’d hit a button that would say, ‘identify speakers,’ and it would just put all the little voice tags in front of all of them.
And then you would just need to say, ‘Oh, I want this one to use this voice. This character uses this voice.’
[00:14:20] Carla: Does it look at dialogue tags? How does it know who’s talking?
[00:14:26] Ryan: Well, the same way a human does, right? When they’re reading. It really just reads it.
[00:14:34] Carla: And as long as the author has made it clear–even when they don’t tag it. Hemingway is famous for not tagging. There’s no ‘he said, she said, they said.’ He doesn’t identify them, because he is a skilled writer.
[00:14:48] Ryan: And you know what? ChatGPT is actually very good at picking that out–seems to be as good as a human.
I did some tests and I deliberately made some text that was confusing. I didn’t identify one of the people at all. It was just something about a mother, and then there was a voice from the other room saying, ‘Hey, I heard that.’ And that should come from the mother, and it identified that it was the mother.
So I was pretty impressed with that. I took a segment of a book from one of our clients, and I didn’t take enough of it, and at the end it said, ‘I’m unable to identify who the speaker is.’ So yeah, it just talks to you like a human. It’s a little eerie. Very powerful.
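The speaker-identification step Ryan describes–feeding the text to a model like ChatGPT with an instruction to tag each speaker–comes down to prompt construction. The wording below is an illustrative guess, not the actual prompt Audie.ai uses:

```python
def build_speaker_tagging_prompt(passage: str) -> str:
    """Compose an instruction for a large language model (e.g. ChatGPT)
    asking it to tag each line of dialogue with its speaker.

    The exact prompt Audie.ai uses isn't public; this wording is an
    illustrative guess based on Ryan's description.
    """
    instructions = (
        "Identify every speaker in the passage below. "
        "Insert a tag of the form [voice: NAME] immediately before "
        "each piece of dialogue, inferring the speaker from context "
        "(dialogue tags, alternation, and narrative cues). "
        "If a speaker cannot be determined, use [voice: Unknown].\n\n"
    )
    return instructions + passage
```

The model’s tagged output could then be parsed the same way as manually placed voice tags, and the ‘Unknown’ fallback mirrors Ryan’s anecdote of the model admitting it couldn’t identify a speaker from too short a segment.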
[00:15:28] Carla: So there’s a lot of technologies involved, and you’re probably using some of those behind the scenes. A big database. What database are you using for that? And you just said you are using–behind the scenes–ChatGPT for that. Can you talk about how those are developing and feeding into your systems and other people’s systems?
[00:15:48] Ryan: Yes, and I would say–like most online applications–we’re built on the shoulders of giants. And Audie.ai came about because I saw ChatGPT, of course, and some of the power that was there to work with long documents. And I also saw there was a lot of different voice synthesizers coming out. They don’t really work in the sense of–you could type your whole book into them and then hit submit, and then you have a book. It didn’t have that capability. They were more like, ‘Here, you could type in a little box and we’ll convert it into audio for you.’ They work through an API so that our system can then talk to their system. So we created a system that allows an author to use something like that in a way that would be more familiar to an author.
That’s called Eleven Labs, and out of everything that we’ve ever seen, it’s far and away the best.
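Because voice synthesizers such as Eleven Labs accept requests through an API rather than whole books at once, a system like the one Ryan describes presumably splits the manuscript into request-sized pieces before synthesis. A minimal sketch, with an assumed per-request character limit and breaks only on paragraph boundaries:

```python
def chunk_for_api(text: str, max_chars: int = 2500):
    """Split a long manuscript into chunks no longer than max_chars,
    breaking on paragraph boundaries so no paragraph is cut mid-sentence.

    The 2,500-character limit is a placeholder; real synthesis APIs
    publish their own per-request limits.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

The resulting audio segments would be concatenated back in order, which is what makes the whole manuscript feel like one submission to the author even though the underlying API never sees the full book.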
[00:16:40] Carla: I did test that early on–Eleven. A friend of mine alerted me to them and I was like, ‘Oh, that sounds a lot more real than any of the other text to speech.’
[00:16:49] Ryan: Yeah. It gives you goosebumps sometimes. And I think with some of the voices, you really want to hear them. It sounds kind of delicious in your ear, you know?
[00:16:59] Carla: Well, of course–they’ve done that on purpose. And we’re used to it now. I mean, Siri’s talking to us all the time. My dad has a British woman Siri. It’s whatever you like.
I do want to make clear that this tool that you have–it does the audio. And then do you get it in good shape, so then you can upload it to Findaway or ACX or whatever? How do you take it from there?
[00:17:33] Ryan: Yeah, that’s a good question. There’s very standard formats everybody wants to receive it in, and they’re all MP3 files. They’re all the music files, essentially, that we’ve been using for years. And they are just numbered with the chapter as we’ve found them, and then the name of the chapter ‘.mp3,’ and those files can be directly uploaded anywhere. You can just get them right from the site. You just hit, ‘download my files.’
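Ryan’s description of the download format–files numbered by chapter, then the chapter name, then ‘.mp3’–can be sketched as a small naming helper. The zero-padding and separator here are assumptions, not Audie.ai’s documented scheme:

```python
def chapter_filenames(chapter_titles):
    """Produce numbered MP3 filenames of the form '01 - Title.mp3':
    chapter number, then chapter name, then '.mp3', per Ryan's
    description. The exact separator and padding are assumptions.
    """
    return [
        f"{i:02d} - {title}.mp3"
        for i, title in enumerate(chapter_titles, start=1)
    ]
```

Numbered-prefix names like these sort correctly in any file browser, which is why retailers such as Findaway and ACX can ingest them directly.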
[00:18:00] Carla: Great. And I know Amazon–we went to that earlier–has forbidden artificial voices, but pretty soon they’re not gonna be able to tell.
[00:18:10] Ryan: They’re not going to be able to tell, and I think it’s going to be very hard to resist this. The top fastest growing segment of books are AI-written or AI-contributed. And if people want to read them, they’re going to be selling them. And in fact, they are selling the AI books–AI-written books. So I completely understand why they wouldn’t have wanted these AI-read books. In the past, they did sound terrible. I think they have a quality standard they need to adhere to, or it really lowers the value of their entire catalog. But I think that day has come to an end.
[00:18:46] Carla: It has. And I just do want to mention the caveat–this podcast episode is about how authors can use synthetic voices and AI. And I know there’s a lot of legal, political, and ethical issues being talked about right now around AI, and rights, and voice training. Like my website being used for training data when it doesn’t fall into fair use, and all of that. And I will definitely have another episode that has a lawyer in it to discuss that.
But for now, we have these cool tools, which I always appreciate. Authors are a huge market, and there are books being written right now that have a lot of potential for audio. And it doesn’t cost that much to create an AI generated audio book. Can you talk about the price ranges, and the word counts, and the actual cost for the author?
[00:19:43] Ryan: Sure. So we charge based on usage, which is very much in line with how we’re charged. So if the user–the author, let’s say–is using it to do a lot of tests, those can rack up some charges. And by rack up, I mean they are still incredibly small compared to having a human do it.
So the average book would be–say maybe it was around 500,000 characters, which I know isn’t the normal way to talk about a book, but that’s kind of how we do in the tech world–would be maybe $330. I mean, that’s the size of our biggest account. And so that’s enough to do a book, and do a bunch of testing on it as well, and making changes, and really kind of spending your time to do it right.
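Back-solving from the figures Ryan gives–roughly 500,000 characters for about $330–yields an illustrative per-character rate. This sketch is purely for intuition; real pricing varies by provider, plan, and how much testing you do:

```python
def estimate_cost(char_count: int, rate_per_million: float = 660.0) -> float:
    """Rough usage-based cost estimate for audiobook synthesis.

    The default rate is back-solved from the interview's example
    (a ~500,000-character book costing about $330) and is purely
    illustrative, not a published price.
    """
    return char_count / 1_000_000 * rate_per_million
```

For comparison, a human-narrated audiobook is typically priced per finished hour, so a character-based estimate like this is mainly useful for budgeting test runs and re-generations.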
[00:20:21] Carla: I see. So you’re paying Eleven Labs and the ChatGPT databases. What is it called?
[00:20:28] Ryan: OpenAI’s. Paying OpenAI, that’s for sure.
[00:20:30] Carla: You’re paying OpenAI. You’re paying Eleven Labs, and so you’re paying for time. So we’re paying for that time. But under $300.
[00:20:39] Ryan: From what I’ve seen of the usual process of doing an audiobook, it’s quite expensive.
[00:20:44] Carla: Yeah, it is. And besides books, what do you think the other potential for audio is now? I was thinking about just making your blog posts audio, for example.
[00:20:57] Ryan: Yes. I think that’s a great use case, and we’ve heard about that quite a bit. So folks are using it for that. I think you need to be multimodal as a marketer. You need to be multimodal. And as a book author, you kind of are a marketer as well, right?
And so yeah, taking your blog posts and having audio for them–very important. People are using it to put voiceovers on their videos. Who can talk like a radio announcer? Well, Audie.ai can do it, but I certainly can’t do it. So they’re using it for things like that as well. And it’s, again, indistinguishable from a human. Especially in that kind of use case, where you’re putting some music and using it for that.
But one use case that actually is still audiobooks that I was kind of surprised about–I just, today, was looking up the history of audiobooks, and found that it started in the 1930s. The American Foundation for the Blind and the Library of Congress, they had a project–The Adult Blind Project. They were the first to produce audiobooks.
One of our very first clients that came to us is an organization called Christian Recording Service. And what they do is they take copyrighted books–and they’re allowed to do this, the government allows them to do it–take those books, turn them into audio books, and then they provide both braille and audiobooks to the folks that can’t see as well, which I thought was really a great use for this kind of technology.
Here is a use case that I think is important. A podcast. You can max out–in a country–how many listeners that you have, but you can really expand your audience by converting it into other languages. It’s in the early stages right now, but yes, it’s happening. And it’ll be–say Carla, you were doing it and you were speaking Afrikaans, or something like that. You could absolutely do that, and it would sound just like you. It would act like you, but you’d be speaking a completely different language.
[00:22:41] Carla: That’s kind of amazing. Because I do have a lot of readers in Holland, Germany, and Sweden. And China. So I could literally read my own memoir in all of those languages.
[00:22:56] Ryan: Yes. And it would take you five minutes of talking.
[00:22:59] Carla: So I want to know when that happens. Stay tuned.
[00:23:06] Ryan: I think, with marketers–especially because of the rise of AI– we’re seeing them automate a lot of their marketing processes. And so a natural part of that is you’re creating ads in an automated way. And when you want an automated process, you’re going to need to automate the audio part of it as well. It’s just a natural part of it. So I think that’s the biggest thing right now. There’s obviously a ton of different ways that it could be used.
[00:23:31] Carla: What are you most excited about?
[00:23:33] Ryan: I kind of like the idea of the old-timey radio plays, where they’re reading with a ton of expression. And even, I think, adding audio sound effects and things like that. I think it just–I don’t know, maybe it shows my age, but it’s just sort of very evocative. And I kind of like the idea of being able to create them without having to assemble a gigantic team to make it happen.
How much can you do as one person–it’s gotten to be so much. You could make your own productions, and that to me is very exciting.
[00:24:08] Carla: You could produce your own play with sound effects and that’s great. I mean, I listen to audio probably more now than I read. Because I’m doing a lot of driving or walking or housework–I think a lot of people are. And since we look at our screens so much–and I look at my screen so much. I have my Kindle, but I’m reluctant to open that up, because it’s more screen time. And I really love the voice acting as well. I really do. So I’m really glad you have the time to talk with us, and I hope you’ll come back and give us an update.
[00:24:40] Ryan: I’d love to.
[00:24:43] Carla: And in the meantime, where do we find you and your progress and all of this?
[00:24:46] Ryan: I think the best place to go would be Audie.ai. And that’s it.
[00:24:49] Carla: Thanks again, and we’ll talk later about more cool new stuff in the future–in the very near future, I hope.
[00:24:58] Ryan: Thanks Carla.
[00:24:59] Carla: And thank you to our listeners for joining us today and every week. For a list of guests and topics, just check our schedule on the site, use your favorite search engine, or better yet, sign up for our mailing list at NonfictionAuthorsAssociation.com.
Quotes from our guest
“I think, for some authors, they do want to read their own book, but find it pretty challenging to sit down and read their whole entire book on audio. It is a very time-consuming process, and I think authors maybe aren’t necessarily, ‘Hey, I would like my voice, but I don’t necessarily consider myself to be a professional reader.’ We could still take their voice and allow them to read their book through the power of AI.”
“The ability to have [the book] acted was something that I thought, too, would be very tough to do. But there’s two ways that that’s approached, and I’ve heard some very convincing voice acting done through the AI. Part is in picking the right voices that had the right tone to begin with. The other is the context of what’s written. The AI can pick up on the context, and start to add excitement, just like a human reader would do. I would say it’s still a little shy of being Hollywood actor level, but the second part of this is what’s coming out soon–the ability to add not only the context part of it, but to add clues about what’s going on.”
“Yes, and I would say–like most online applications–we’re built on the shoulders of giants. And Audie.ai came about because I saw ChatGPT, of course, and some of the power that was there to work with long documents. And I also saw there was a lot of different voice synthesizers coming out. They don’t really work in the sense of–you could type your whole book into them and then hit submit, and then you have a book. It didn’t have that capability. They were more like, ‘Here, you could type in a little box and we’ll convert it into audio for you.’ They work through an API so that our system can then talk to their system. So we created a system that allows an author to use something like that in a way that would be more familiar to an author.”
We want to hear from you!
Who do you want us to interview? What topics would you like to explore? Take this short survey to let us know!