We were in our early twenties and we were bold. We knew that everyone in the world was working on speech synthesizers of better quality. And there was us: five guys from Gdańsk struggling to make ends meet, recalls Łukasz Osowski, a co-creator of the Amazon voice assistant, in conversation with Monika Redzisz.
Monika Redzisz: I remember how surprised I was when I learned that Alexa, which is currently the most popular voice assistant in the world, had been largely created in Tricity based on the Polish speech synthesizer called Ivona. When did you create it? And how did Ivona turn into Alexa?
Łukasz Osowski*: The idea came when I was studying at the Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology. As a fourth-year student I took an interest in speech recognition. It was 2000. At that point, it was more or less clear what a speech recognition system should look like, although no good product had yet been launched. A friend and I built a simple speech recognition system as a student project. I realized that it was very difficult, so I decided to take on something easier but still underdeveloped: speech synthesis. I found an open-source system from Edinburgh called Festival. It was great, so I decided that speech synthesis would be the subject of my master’s thesis.
Were there any decent speech synthesizers back then?
A few. But they sounded terrible, much like robots in old science fiction movies. It was hard to understand what they were saying. Festival made it possible to improve the quality.
But then another important thing happened… One of our professors gave us personality tests that had been developed in the United States in the 1960s for soldiers coming back from Vietnam. The tests were supposed to help the soldiers find their place in society. It turned out that I was predisposed to start my own venture.
Didn’t you expect that?
No, it came as a surprise to me. But I thought it would be worth following the recommendation, as it might increase my chances of doing something in life that I could be satisfied with. Speech synthesis was a very cool topic to explore: not only could I write about it in my master’s thesis, but I could also make it the core business of my own company. It was to be my ideal personal development path. When we started to work on the project, there were three of us: me, Michał Kaszczuk, a friend I had met at the university, and another friend, who backed out after several months. Part of that solution was the basis of my thesis, but Michał and I would work 12 to 14 hours a day, and after a few months we came up with the first version of the product. We called it Spiker. A couple of months later Michał defended his master’s thesis, which was also about speech synthesizers.
I understand that at that time no one had heard of voice assistants.
No, but in my head I had a vision straight from sci-fi movies, like “2001: A Space Odyssey”, in which you had a computer that could talk and understand, i.e. a prototype of a voice assistant.
We wanted to develop both speech synthesis and speech recognition, since both were necessary to create a friendly assistant. But that challenge would have been beyond us, and we would also have needed financial support. So we focused on synthesis.
What was your synthetic speech like back then?
Better than others available on the market, although it was far from natural or pleasant to the ear. We set up a company and started selling Spiker. Our only customers were the blind and the visually impaired. Spiker was used to read out the text displayed on a computer screen, in e-mails and on websites. Although we managed to sell several dozen copies of our product, our income was low.
How did you earn a living?
We were students so our parents provided for us and our costs were minimal.
Did you have an idea of what other purposes it might serve? What were your expectations?
Michał and I came to the conclusion that we would never be able to build a prosperous company, or even earn a living, if we sold the product to such a small group of customers. That was when we decided to work on a completely new quality of synthesis. We wanted a quality that would appeal not only to the blind and the visually impaired but also to everyone else. We realized that there were many situations in which sighted people couldn’t read text on a screen, for instance while driving, at bus and train stations, at airports, in streetcars and buses. Our goal was to create a speech synthesizer that people could listen to anywhere without noticing a significant difference between the words spoken by the machine and those spoken by a voice actor. The only problem was that nobody had ever built anything like that before.
Why did you call it Ivona?
Its original name was Ivo, which is an abbreviation of “intelligent voice”. But then we realized that since we wanted to create a voice that would sound as human as possible, that voice should also have a human name.
Our work on Ivona began in 2002. It was rough. We had already graduated from the university, so we had to get by somehow on the money we made from selling copies of Spiker. Unfortunately, we also had to employ other people to strengthen the team responsible for developing Ivona. We managed to survive, but for about four years we teetered on the edge of bankruptcy.
Weren’t there any investors willing to invest in such an innovative project?
No. Those were hard times for investment projects; stock exchanges around the world had just seen the dot-com bubble burst. People were suspicious of any IT venture. Luckily for us, our technology was slowly proving that it might be successful. I can still remember the moment when Ivona spoke for the first time. Wow! Compared to Spiker, Ivona was a great leap forward.
We used many more recordings. We hired a voice actor and worked with him for several days in a studio. That laid the foundation for the speech synthesizer. The synthesizer’s utterance is composed of fragments of the recordings. The more recordings we have at our disposal, the better the speech quality. Spiker used about 1.5 thousand words; Ivona used several hundred times as many. However, the most difficult part was developing an algorithm that would search for the relevant recordings, cut them and seamlessly glue them together to form a new utterance. And what we cared about was not only words but also intonation, pauses and stress, which make an utterance understandable. It’s about the differences between questions and statements but also about the melody of speech. Each sentence has its own melody; in complex sentences it’s even more complicated.
That was one of the reasons why we used, for example, Fourier analysis and a number of artificial intelligence techniques, such as decision trees, neural networks and fuzzy logic. We had to create algorithms that would learn everything by themselves and that would use that knowledge in the synthesizer. Their role was to propose a natural melody for a given sentence.
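The search-and-glue step described in this answer can be illustrated with a minimal sketch of unit selection, a generic textbook technique for concatenative synthesis. It is not Ivona's actual algorithm, which the interview does not detail. The names `Unit`, `target_cost`, `join_cost` and `select_units` are hypothetical, and pitch alone stands in for the "melody" of a sentence. The sketch assumes every requested fragment has at least one recording in the inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded fragment (hypothetical, simplified representation)."""
    label: str          # the word or sound this fragment covers
    pitch_start: float  # pitch in Hz at the start of the fragment
    pitch_end: float    # pitch in Hz at the end of the fragment


def target_cost(unit, wanted_pitch):
    # How far the fragment's average pitch is from the desired melody.
    return abs((unit.pitch_start + unit.pitch_end) / 2 - wanted_pitch)


def join_cost(left, right):
    # How audible the seam would be: the pitch jump between neighbours.
    return abs(left.pitch_end - right.pitch_start)


def select_units(targets, inventory):
    """Pick one recording per target so that total cost is minimal.

    targets:   list of (label, wanted_pitch) pairs, in utterance order
    inventory: dict mapping label -> list of candidate Unit recordings
    Returns (total_cost, chosen_units). Uses dynamic programming
    (a Viterbi-style search) over all candidate sequences.
    """
    if not targets:
        return 0.0, []
    best = None  # list of (cost_so_far, path) for each candidate of the previous target
    for label, wanted_pitch in targets:
        layer = []
        for unit in inventory[label]:
            cost = target_cost(unit, wanted_pitch)
            if best is None:
                layer.append((cost, [unit]))
            else:
                # Cheapest way to reach this unit from any previous candidate.
                prev_cost, prev_path = min(
                    ((c + join_cost(p[-1], unit), p) for c, p in best),
                    key=lambda item: item[0],
                )
                layer.append((prev_cost + cost, prev_path + [unit]))
        best = layer
    return min(best, key=lambda item: item[0])


if __name__ == "__main__":
    inventory = {
        "good": [Unit("good", 100, 120), Unit("good", 100, 180)],
        "morning": [Unit("morning", 118, 110), Unit("morning", 200, 190)],
    }
    cost, path = select_units([("good", 110), ("morning", 115)], inventory)
    print(cost, [(u.pitch_start, u.pitch_end) for u in path])
```

The target cost rewards fragments that fit the intended intonation, while the join cost penalizes audible seams; a real system would use many more features than pitch and a far larger inventory.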
Computer modeling of something that has never been written down must be difficult.
Very. Around 2000, forecasts predicted that natural speech synthesis, considered to be one of the most difficult artificial intelligence challenges, would be achieved around 2010. We managed it in 2006. Our synthesis did not differ significantly from the recordings of voice actors.
At that time, Microsoft, Google and Apple were also working on similar solutions. They had much more money and people than your company. How is it possible that you surpassed them?
I often think about it. Well, we were in our early twenties and we were bold. We were certainly determined. It was our venture; we worked for ourselves. We knew that everyone in the world was working on speech synthesizers of better quality. And there was us: five guys from Gdańsk struggling to make ends meet. We were almost completely cut off from the world; we couldn’t attend scientific conferences, and we had no money to hire more people. Although Michał and I were reading publications and developing Ivona by devising and testing new ideas, it was extremely hard for us to tell how far other teams had progressed.
So, when did you learn that you were the best?
In 2006. We knew how fluently Ivona could speak and we were very excited. We also knew that she was good, but we didn’t know how good. How could we measure Ivona’s quality and compare it to what other centers were doing? Then we learned that the team which had created the Festival speech synthesizer was organizing the Blizzard Challenge. The goal was to develop a synthesizer based on recordings provided by the organizers. “A perfect opportunity to see how good we really are,” we thought. We received several hours of voice actor recordings, which we had to use to quickly build a synthesizer that spoke English. Then we were sent a text containing several hundred sentences. We ran our newly completed synthesizer on the sentences, saved the output as MP3 audio files and sent them to the organizers. We had only several days to do all that. The results were to be announced several weeks later at a conference in the United States, but long before that we got a call from Alan Black, the supervisor of the competition, who asked us to come to the US to attend the ceremony.
We knew that we had done well. We flew to the States. We won. Ivona proved to be the best and outdid two IBM teams, from New York and Israel, the Microsoft team, and universities from Tokyo, Beijing and Barcelona. That was a breakthrough for us. We gained confidence in our ability to create an innovative product in Poland and to beat international giants despite our limited resources.
What changes did it trigger?
After we came back to Poland, we sent a press release to the Polish Press Agency. Fifteen minutes later someone called us and asked if we could prove it. We referred the caller to Professor Alan Black, and two hours later all the newspapers wrote that the Poles were the best in the world. That was the turning point. We sold Ivona to a number of customers: PKP, the Polish railway operator, decided to use it on trains; municipal transport authorities wanted to play it in streetcars, buses and trolleybuses; the Polish army wished to test it in simulation systems; there were also manufacturers of telephone systems and, unsurprisingly, the blind and the visually impaired. Our revenues leapt from 100 thousand zlotys to several million in just a few years. We started thinking about expanding beyond Poland. In 2008 we brought Ivona to the American market. We found cool clients, for example BlackBerry, which used to be one of the biggest manufacturers of telephones, and Barnes & Noble, a famous online bookseller and an Amazon competitor. In 2010 we were approached by Amazon, which was looking for a speech synthesizer for its Kindle. We soon discovered that Amazon didn’t in fact want to buy Ivona; they were interested in purchasing our company. They wanted us to stay within the corporation and to develop speech synthesis for a completely new product: the voice assistant. We had thought about it several years earlier, and now we could carry out the project with financial support from Amazon.
Didn’t you feel bad about selling your own company?
Michał and I gave this a lot of thought. We decided that it was the right moment to go for it.
Because back then we were in a very dangerous situation. After getting a foothold in the American market, we stepped on the toes of what was then the biggest provider of speech synthesis and speech recognition in the world. The corporation generated revenues of 2 billion dollars and its voice was used by Apple’s Siri. When we took a BlackBerry contract from them, they decided to fight us. They were notorious for being one of the most aggressive companies in the United States. They would sue small companies that had crossed them in order to take them over. Between 1995 and 2010 they took over several dozen companies.
But they needed a reason to sue somebody, right?
They can sue you for anything, for instance for an alleged infringement of someone else’s intellectual property rights. In the United States, you have to defend yourself even if it is obvious that you are right; if you don’t, you may be down for the count. Defending yourself, especially against commercial claims, will cost you a fortune. Small businesses can’t afford it. In the period we are talking about, the American giant could easily bring several actions at the same time. Companies that didn’t have enough money to stand up to the corporation in court were offered a settlement, after which they were taken over and destroyed. That was the fate of VLingo, one of our US-based clients. We weren’t safe either.
How did you know that?
We got a call from none other than their CEO. We didn’t pick up the phone, but everything was dead obvious: we were in their crosshairs and they could sue us at any moment. If they did, we wouldn’t stand a chance. Around that time Amazon approached us. We informed other companies of our intention to sell. We received many offers from around the world. Those several months were the most hectic period of my entire life. Michał and I flew several hundred thousand miles and visited Korea, the United States and other places around the globe. We held talks. Amazon’s offer was the best. We were guaranteed that the Ivona team would continue to develop speech synthesis and, additionally, work on a new, incredible product.
We decided to open our office in Gdańsk, where close to one thousand people are currently employed. When we joined Amazon, the Alexa team consisted of about a dozen people. Including ourselves, we created a team of 50 members. Michał and I created and managed the Amazon Development Center in Tricity; we reported to a deputy of Jeff Bezos. We created Alexa in Gdańsk, although work was also carried out in Seattle and, for a few months, in Boston, Cambridge and several other places around the world. Over a couple of years the Alexa team grew to thousands of engineers. Alexa is used by several hundred million people and has been built into thousands of different devices.
Has the Ivona brand disappeared completely?
Phasing out our brand and replacing it with Amazon’s over the following years was a natural process. The business name of our company was changed from “Ivona” to “Amazon Development Center Poland” and our synthesizer is now called “Amazon Polly”. Amazon Polly has been fitted with new technologies, which have made it even more efficient. Considering the whole situation, I feel that things couldn’t have gone better.
Why did you leave?
I guess it was against my nature: I don’t like being supervised by anyone. I am satisfied with what I have achieved. I also wanted to spend more time with my family (I have five kids) and to enjoy my hobby, which is sailing small, fast boats. Two years ago I decided to do something new. Together with my friends Tomek, who developed Gmail at Google Switzerland, and Piotr, who is a doctor and researcher at the Medical University of Gdańsk, I agreed to take on the question of lifespan and of how long we can stay healthy throughout our lives. It took us several months to build a team of researchers, doctors, psychologists, dieticians and engineers. Our motto is: “Extend Poles’ healthy lives by one million years”. At the end of June, in Poland, we are going to present our product: Vika, an application that will help people to monitor their health and to enjoy it for longer. Can you think of a better activity than looking for a way to a longer and healthier life?
What does this solution consist in?
Our lifestyle is crucial for our health. Our health depends on so many factors: nutrition, physical activity, sleep, vaccines and medical checkups. We can extend our lifespan and enjoy a healthier life, but neither medicine nor drugs will do it for us. Atherosclerosis, hypertension and diabetes result from poor nutrition and lack of physical activity; no cure is going to prevent those. We have to act and change our lifestyle.
We all know the theory but it is hard to change your lifestyle in practice. Do you think that things may get easier?
Yes. The average Spaniard enjoys seven more years of healthy life than the average Pole. The same goes for Italians. Those nations are well known for their joyful and open lifestyle. Our research team has examined thousands of people and identified the main factors that allow people to stay healthy for longer. We believe that Poles can do that too! Vika will first ask you a number of questions. Based on your answers, the application can estimate how long you will stay healthy. It will then send you on health missions, during which you will complete new tasks and challenges and advance to new levels of a healthy lifestyle. Your reward will be days, months and even years of being healthy. Over time, you will develop positive eating, exercise and sleeping habits.
Is it going to be a sort of a virtual trainer?
I would rather say that our application is like a game developed by trainers, psychologists and dieticians. In this game you can win back your life.
A game of life?
That is a good slogan.
And yet… there are so many applications to choose from. Aren’t you afraid that you may find it difficult to convince people to use it?
We commissioned a public opinion poll in which we asked the following question: “Would you install a free mobile application using the results of your medical checkups to extend the period of your life during which you stay healthy and fit?” Almost 70 percent of respondents answered “Strongly agree” or “Somewhat agree”. I think that anyone would love to see what the future holds for them. Once they discover how beneficial it may be, they will want to boost their health. Vika is now being tested by several hundred testers. We know that they are devoted players trying to improve their health. It really works!
*Łukasz Osowski – graduate of the Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology. Co-founder of IVONA Software, a company purchased by Amazon. Co-creator of Ivona speech synthesis technologies. A year ago he set up Lab4Life, a company developing an application helping people to enjoy their healthy lives for longer.