Date of Award

4-30-2021

Document Type

Thesis

Degree Name

Bachelor of Science

Department

Computer Science

First Advisor

Dr. Douglas Szajda

Second Advisor

Dr. Jon Park

Abstract

The field of voice processing has seen great advancements thanks in part to the rise of deep learning. However, applying these deep learning techniques to an audio input space leads to an interesting result not commonly found in other input domains. Namely, common techniques for generating auditory adversarial samples using gradient-based optimization have been observed to have extremely low transferability, even between models with the same structure. This implies an inherent difference in the latent representations of audio samples that may be worth investigating in the pursuit of a more resilient and interpretable voice processing framework. Our core contribution is an investigation of the decision-making processes of modern voice processing implementations. Specifically, we are interested in explaining the impact of audio input features on the alphabetic character outputs of a modern speech-to-text system such as DeepSpeech2. We investigate this with the aid of the Local Interpretable Model-agnostic Explanations (LIME) technique, applied to an appropriate and contextually aware representation of the problem space. For every alphabetic character, we select audio samples centered on that character and use them as inputs to the voice processing system. The model's predictions on these inputs are explained via LIME, and the resulting letter-use clusters are aggregated for analysis. With an understanding of the reasoning behind the classification of characters, we will be able to better understand why attacks succeed or fail, develop novel attacks, and better defend voice processing systems against adversarial attacks in general.
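
As a rough illustration of the kind of LIME-style explanation described in the abstract, the sketch below perturbs an audio clip by silencing time segments, queries a black-box model for a single character score, and fits a proximity-weighted linear surrogate whose coefficients serve as per-segment importances. It is not the thesis's implementation: the segmentation into equal time slices, the kernel width, the ridge term, and the stand-in model `toy_char_score` (used here instead of DeepSpeech2) are all assumptions made for illustration.

```python
import numpy as np

def explain_segments(audio, predict_fn, n_segments=10, n_samples=500,
                     kernel_width=0.25, seed=0):
    """LIME-style importance of each time segment for a scalar model output."""
    rng = np.random.default_rng(seed)
    # Equal-width segment boundaries along the sample axis (an assumption).
    bounds = np.linspace(0, len(audio), n_segments + 1).astype(int)
    # Binary masks: 1 keeps a segment, 0 silences it.
    masks = rng.integers(0, 2, size=(n_samples, n_segments))
    masks[0] = 1  # first row is the unperturbed input
    outputs = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = audio.copy()
        for s in range(n_segments):
            if mask[s] == 0:
                perturbed[bounds[s]:bounds[s + 1]] = 0.0
        outputs[i] = predict_fn(perturbed)
    # Proximity weights: perturbations closer to the original count more.
    distance = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Weighted ridge regression: outputs ~ intercept + masks @ coef.
    X = np.hstack([np.ones((n_samples, 1)), masks])
    A = X.T @ (weights[:, None] * X) + 1e-3 * np.eye(n_segments + 1)
    b = X.T @ (weights * outputs)
    coef = np.linalg.solve(A, b)
    return coef[1:]  # drop the intercept

def toy_char_score(audio):
    """Hypothetical stand-in for a model's score for one character.

    It only looks at samples 3000..3999, so segment 3 should dominate.
    """
    return float(np.mean(np.abs(audio[3000:4000])))

audio = np.random.default_rng(1).normal(size=10_000)
importance = explain_segments(audio, toy_char_score)
print(np.round(importance, 3))
```

In this toy setup the surrogate assigns nearly all of the weight to the one segment the stand-in model actually reads. With a real speech-to-text system, `predict_fn` would instead return the model's probability for the character of interest, and the segments would more plausibly be defined over time-frequency features than over raw sample ranges.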
