publications | Nathan Pruyne

2024

HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW

Christodoulos Benetatos, Frank Cwitkowitz, Nathan Pruyne, and 4 more authors

In ISMIR 2024 Late Breaking Demos, 2024


HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations. HARP renders endpoint-defined controls and processed audio in-plugin, meaning users can explore a variety of cutting-edge deep learning models without ever leaving the DAW. In the 2.0 release we introduce support for MIDI-based models and audio/MIDI labeling models, provide a streamlined pyharp Python API for model developers, and implement numerous interface and stability improvements. Through this work, we hope to bridge the gap between model developers and creatives, improving access to deep learning models by seamlessly integrating them into DAW workflows.
Fine-Grained and Interpretable Neural Speech Editing

Max Morrison, Cameron Churchwell, Nathan Pruyne, and 1 more author

In INTERSPEECH 2024, 2024


Fine-grained editing of speech attributes—such as prosody (i.e., the pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants—is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representations that entangle two or more of these attributes, prohibiting their use in fine-grained, disentangled editing. In this paper, we demonstrate the first disentangled and interpretable representation of speech with comparable subjective and objective vocoding reconstruction accuracy to Mel spectrograms. Our interpretable representation, combined with our proposed data augmentation method, enables training an existing neural vocoder to perform fast, accurate, and high-quality editing of pitch, duration, volume, timbral correlates of volume, pronunciation, speaker identity, and spectral balance.
Crowdsourced and Automatic Speech Prominence Estimation

Max Morrison, Pranav Pawar, Nathan Pruyne, and 2 more authors

In 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024


The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation as well as how neural prominence estimation improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.
Machine learning-guided discovery of gas evolving electrode bubble inactivation

Jack R. Lake, Simon Rufer, Jim James, and 7 more authors

Nanoscale, 2024


The adverse effects of electrochemical bubbles on the performance of gas-evolving electrodes are well known, but studies on the degree of inactivation caused by adhered bubbles, and on how inactivation changes during bubble evolution, are limited. We study electrode inactivation caused by oxygen evolution while using surface engineering to control bubble formation. We find that assuming inactivation of the entire projected area, as is currently believed, is a poor approximation that leads to non-physical results. Using an image-based, machine learning bubble detection method to analyze large quantities of experimental data, we show that bubble impacts are small for surface-engineered electrodes that promote high bubble projected areas while maintaining low direct bubble contact. We thus propose a simple methodology for more accurately estimating the true extent of bubble inactivation, which is closer to the area directly in contact with the bubbles.

2023

Cross-domain Neural Pitch and Periodicity Estimation

Max Morrison, Caedon Hsieh, Nathan Pruyne, and 1 more author

In arXiv preprint arXiv:2301.12258, 2023


Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely used neural pitch and periodicity estimators to achieve state-of-the-art performance on both speech and music. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle both speech and music data (i.e., cross-domain estimation) without performance degradation. Our estimator implementations run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU (approaching the speed of state-of-the-art DSP-based pitch estimators) or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at github.com/interactiveaudiolab/penn.
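
As a rough illustration of what an entropy-based periodicity measure can look like, the sketch below maps one frame's categorical distribution over pitch bins to a value in [0, 1]: a flat (high-entropy) distribution gives a value near 0, and a sharply peaked one gives a value near 1. The normalization by the log of the bin count, the function names, and the 0.5 threshold are assumptions made for this example. They are not taken from the abstract and are not the penn API.

```python
import math


def entropy_periodicity(probs):
    """Return 1 - H(p) / log(N) for one frame's distribution over N pitch bins.

    Assumption for illustration: periodicity is one minus normalized entropy.
    """
    total = sum(probs)
    normalized = [p / total for p in probs]
    # Entropy in nats; bins with zero probability contribute nothing.
    entropy = -sum(p * math.log(p) for p in normalized if p > 0.0)
    return 1.0 - entropy / math.log(len(normalized))


def is_voiced(probs, threshold=0.5):
    """Hypothetical voiced/unvoiced decision: threshold the periodicity."""
    return entropy_periodicity(probs) > threshold


if __name__ == "__main__":
    flat = [0.25, 0.25, 0.25, 0.25]    # no clear pitch
    peaked = [0.97, 0.01, 0.01, 0.01]  # one dominant pitch bin
    print(entropy_periodicity(flat), is_voiced(flat))
    print(entropy_periodicity(peaked), is_voiced(peaked))
```

In practice the threshold would be tuned on held-out data rather than fixed.
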
Segmentation of tomography datasets using 3D convolutional neural networks

Jim James*, Nathan Pruyne*, Tiberiu Stan, and 6 more authors

Computational Materials Science, 2023


Dendritic microstructures are ubiquitous in nature and are the primary solidification morphologies in metallic materials. Techniques such as X-ray computed tomography (XCT) have provided new insights into dendritic phase transformation phenomena. However, manual identification of dendritic morphologies in microscopy data can be both labor intensive and potentially ambiguous. The analysis of 3D datasets is particularly challenging due to their large sizes (terabytes) and the presence of artifacts scattered within the imaged volumes. In this study, we trained 3D convolutional neural networks (CNNs) to segment 3D datasets. Three CNN architectures were investigated, including a new version of FCDenseNet which we extended to 3D. We show that using hyperparameter optimization (HPO) and fine-tuning techniques, both 2D and 3D CNN architectures outperform the previous state of the art. The 3D U-Net architecture trained in this study produced the best segmentations according to quantitative metrics (intersection-over-union of 95.56% and a boundary displacement error of 0.58 pixels), while 3D FCDense produced the smoothest boundaries and best segmentations according to visual inspection. The trained 3D CNNs are able to segment entire 852 × 852 × 250 voxel 3D volumes in only ∼60 s, thus hastening the progress towards a deeper understanding of phase transformation phenomena such as dendritic solidification.
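
For readers unfamiliar with the intersection-over-union metric quoted above, here is a minimal sketch of how it is commonly computed for a binary segmentation. The function name and the flat-list mask representation are choices made for this example, not the paper's evaluation code; a 3D volume would simply be flattened voxel by voxel before the comparison.

```python
def intersection_over_union(predicted, target):
    """IoU between two binary masks given as equal-length sequences of 0/1."""
    predicted = list(predicted)
    target = list(target)
    if len(predicted) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(predicted, target) if p and t)
    union = sum(1 for p, t in zip(predicted, target) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(intersection_over_union(prediction, ground_truth))
```
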