Human infants acquire language with little formal teaching, but machines need large amounts of annotated data, which makes the development of natural language processing technology challenging. Although deep learning has already had tremendous success in natural language processing, the training is usually supervised. Only large companies can collect large amounts of labeled data to build high-quality automatic speech recognition (ASR) systems and then build systems that understand spoken language based on the ASR transcriptions. Can a machine learn human language without human teaching? We assess this possibility using a new deep learning technology: the generative adversarial network (GAN).
The National Taiwan University (NTU) Speech Lab is developing a series of unsupervised deep learning technologies based on the GAN to enable machines to understand natural language without labeled data. The GAN is a recent approach to training models in which a generator and a discriminator compete, and this competition improves the quality of what the generator produces. It is primarily an unsupervised learning method. Typical supervised learning requires paired data. For example, to transform a photograph into an anime-style drawing with the typical machine learning approach, one must provide the machine with many photos, each paired with its transformation result, for training. With the GAN, however, the generator learns to generate an anime face from a photo, and the discriminator determines whether the generator’s output looks like an anime face. Paired data are therefore not needed, which is a critical step toward unsupervised learning. Recently, the GAN has shown impressive results in image generation, and many new ideas, techniques, and applications have been developed based on it. Here, we introduce applications of the GAN to speech signal processing and natural language processing.
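The competition between a generator and a discriminator can be illustrated with a minimal sketch. The toy below is not the lab's implementation. It assumes 1-D Gaussian "real" data, a one-parameter generator, a logistic discriminator, and the commonly used non-saturating generator objective, and every name in it is hypothetical. Note that real and generated samples are never paired with each other.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def discriminator_loss(d_real, d_fake):
    """Cross-entropy: real samples should score near 1, generated ones near 0."""
    eps = 1e-12  # guards against log(0)
    real_term = sum(math.log(d + eps) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d + eps) for d in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def train_toy_gan(steps=200, batch=64, lr=0.05, real_mean=4.0, seed=0):
    """Alternate discriminator and generator updates on unpaired 1-D samples."""
    rng = random.Random(seed)
    b = 0.0          # generator parameter (assumed form): x_fake = z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    loss = None
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        loss = discriminator_loss(d_real, d_fake)  # measured before the update

        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        grad_w = sum((1.0 - d) * x for d, x in zip(d_real, real))
        grad_w -= sum(d * x for d, x in zip(d_fake, fake))
        grad_c = sum(1.0 - d for d in d_real) - sum(d_fake)
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator step: gradient ascent on log D(fake), the non-saturating loss.
        grad_b = sum((1.0 - sigmoid(w * x + c)) * w for x in fake) / batch
        b += lr * grad_b
    return b, w, c, loss


if __name__ == "__main__":
    b, w, c, loss = train_toy_gan()
    print("generator offset:", b, "last discriminator loss:", loss)
```

At the start, the discriminator outputs 0.5 for every sample, so its loss equals 2 ln 2. The first updates then push the generator offset `b` away from 0 in the direction of the real data. Over longer runs, GAN training may oscillate rather than settle, so the final value should not be read as a guaranteed result.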
Unsupervised Abstractive Summarization [1]
Summarization is the generation of a summary that describes the core ideas of a document. “Abstractive” summarization means that the summary is not directly extracted from the input document but is written by the machine in its own words. A typical sequence-to-sequence model learns to generate abstractive summaries by reading documents together with the corresponding human-written summaries. However, training a summarizer that performs reasonably well generally requires millions of training examples, which limits the applicability of the technology. With the GAN, a generator learns to shorten an input document, and a discriminator checks whether the generator’s output is readable. In this way, the generator learns to generate summaries without supervision.
Sentiment-Controllable Chatbot [2]
A chatbot is designed to chat with human users about any subject in daily life. The conventional chatbot is also based on a sequence-to-sequence model and generates meaningful responses given the user’s input. It is generally emotionless, which is a major limitation of today's chatbots because emotions play a critical role in human social interactions, especially in chatting. Therefore, we wish to train a chatbot whose responses have scalable sentiment, controlled by setting the mode for chatting. For example, given the input “How was your day today?”, the chatbot should not only generate a reasonable response but also answer “Today was wonderful” or “Today was terrible” depending on the sentiment that has been set. This can be achieved with the GAN, which can transform the written style of a response, for example from negative to positive. The techniques mentioned here may be extended to conversational style adjustment, so that the machine imitates the conversational style of someone the user is familiar with, making the chatbot friendlier or more personable.
Unsupervised Voice Conversion [3]
Voice conversion (VC) converts speech signals from one acoustic domain to another while keeping the linguistic content unchanged. Transforming the voice of Speaker A into that of Speaker B is a typical example of VC. Conventionally, this requires a large amount of parallel data. That is, Speakers A and B must read hundreds of sentences with the same content to teach the machine how to convert between their voices, which is not practical. We propose a new GAN-based approach to VC that needs only audio recordings of Speakers A and B. The speakers do not have to read the same sentences, and they do not even have to speak the same language.
Unsupervised Speech Recognition [4,5]
We propose a completely unsupervised speech recognition framework that needs only unrelated speech utterances and text sentences for model training. In preliminary experiments, it achieves an unsupervised phoneme recognition accuracy of 36%. This is the first attempt at reaching the goal of completely unsupervised speech recognition. With this technology, a machine can learn a new language in a novel linguistic environment with little supervision. Imagine an intelligent assistant bought by a Taiwanese-speaking family. Although it does not understand Taiwanese at first, it automatically learns the new language by hearing people speak it.
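To make a figure such as the 36% phoneme recognition accuracy concrete, the sketch below shows one common way such a number can be computed. The text above does not state the exact metric, so the definition used here (accuracy = 1 minus the phoneme error rate, with errors counted by edit distance) is an assumption, and the phoneme sequences are made-up illustrations rather than experimental data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between a reference and a hypothesis sequence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_accuracy(references, hypotheses):
    """Corpus-level accuracy, assumed here to be 1 - phoneme error rate."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 1.0 - errors / total


# Hypothetical phoneme sequences, for illustration only.
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ah", "t"], ["d", "ao"]]
print(phoneme_accuracy(refs, hyps))  # one substitution + one deletion over 6 phonemes
```

Under this definition, many inserted phonemes can push the value below zero, which is why some reports quote the error rate instead.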
1. Wang, Y., and Lee, H. (2018). Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks. Submitted to EMNLP, 2018.
2. Lee, C., Wang, Y., Hsu, T., Chen, K., Lee, H., and Lee, L. (2018). Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis. ICASSP, 2018.
3. Chou, J., Yeh, C., Lee, H., and Lee, L. (2018). Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations. INTERSPEECH, 2018.
4. Liu, D., Chen, K., Lee, H., and Lee, L. (2018). Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings. INTERSPEECH, 2018.
5. Chen, Y., Shen, C., Huang, S., and Lee, H. (2018). Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only. arXiv:1803.10952 [cs.CL].
Assistant Professor, Department of Electrical Engineering