Speech Toolkit
================
**Malaya-Speech** is a Speech-Toolkit library for bahasa Malaysia, powered by TensorFlow and PyTorch.
Documentation
--------------
Documentation for the stable release is available at https://malaya-speech.readthedocs.io/en/stable/
Installing from PyPI
----------------------------------
::

    $ pip install malaya-speech
It will automatically install all dependencies except for TensorFlow and PyTorch, so you can choose the TensorFlow and PyTorch CPU / GPU builds that suit your environment.
Only **Python >= 3.6.0**, **TensorFlow >= 1.15.0**, and **PyTorch >= 1.10** are supported.
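After installing your preferred TensorFlow and PyTorch builds, a short sanity check can confirm the minimums above are met. This is a minimal sketch; it only verifies that the imports resolve and prints the backend versions.

.. code-block:: python

    # Minimal sanity check: TensorFlow and PyTorch are installed separately,
    # so verify their versions meet the supported minimums stated above.
    import tensorflow as tf
    import torch

    import malaya_speech  # fails if malaya-speech or its dependencies did not install

    print('tensorflow:', tf.__version__)  # expected >= 1.15.0
    print('torch:', torch.__version__)    # expected >= 1.10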
Development Release
---------------------------------
Install from the ``master`` branch,

::

    $ pip install git+https://github.com/huseinzol05/malaya-speech.git
We recommend using **virtualenv** for development.

Documentation for the development release is available at https://malaya-speech.readthedocs.io/en/latest/
Features
--------
- **Age Detection**, detect age in speech using Finetuned Speaker Vector.
- **Speaker Diarization**, diarizing speakers using Pretrained Speaker Vector.
- **Emotion Detection**, detect emotions in speech using Finetuned Speaker Vector.
- **Force Alignment**, generate a time-aligned transcription of an audio file using RNNT, Wav2Vec2 CTC and Whisper Seq2Seq.
- **Gender Detection**, detect genders in speech using Finetuned Speaker Vector.
- **Clean Speech Detection**, detect clean speech using Finetuned Speaker Vector.
- **Language Detection**, detect hyperlocal languages in speech using Finetuned Speaker Vector.
- **Language Model**, KenLM, masked language models (BERT, RoBERTa) and GPT2 for ASR decoder scoring.
- **Multispeaker Separation**, multispeaker separation using FastSep on 8 kHz audio.
- **Noise Reduction**, reduce multilevel noises using STFT UNET.
- **Speaker Change**, detect changing speakers using Finetuned Speaker Vector.
- **Speaker Overlap**, detect overlapping speakers using Finetuned Speaker Vector.
- **Speaker Vector**, calculate similarity between speakers using Pretrained Speaker Vector.
- **Speech Enhancement**, enhance voice activities using Waveform UNET.
- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
- **Speech-to-Text**, End-to-End Speech to Text for Malay, Mixed (Malay, Singlish) and Singlish using RNNT, Wav2Vec2 CTC and Whisper Seq2Seq.
- **Super Resolution**, Super Resolution 4x for Waveform using ResNet UNET and Neural Vocoder.
- **Text-to-Speech**, Text to Speech for Malay and Singlish using Tacotron2, FastSpeech2, FastPitch, GlowTTS, LightSpeech and VITS.
- **Vocoder**, convert Mel to Waveform using MelGAN, Multiband MelGAN and Universal MelGAN Vocoder.
- **Voice Activity Detection**, detect voice activities using Finetuned Speaker Vector.
- **Voice Conversion**, Many-to-One and Zero-shot Voice Conversion.
- **Hybrid 8-bit Quantization**, provide hybrid 8-bit quantization for all models to reduce inference time by up to 2x and model size by up to 4x.
- **Real-time interface**, provide PyAudio and TorchAudio streaming interfaces for real-time inference.
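The features above follow a common pattern: load an audio file, load a pretrained model from the relevant module, then call it on the waveform. The sketch below illustrates that pattern for the speaker-vector module; the helper, module, and model names shown (``malaya_speech.load``, ``malaya_speech.speaker_vector.deep_model``, ``'speakernet'``, ``quantized``, ``vectorize``) are assumptions for illustration, not a guaranteed API, so check the documentation above for the exact calls.

.. code-block:: python

    # Minimal usage sketch; module, function, and model names below are assumptions
    # for illustration, refer to https://malaya-speech.readthedocs.io/ for the exact API.
    import malaya_speech

    # load a waveform from disk (assumed helper returning (samples, sample_rate))
    y, sr = malaya_speech.load('speech/example.wav')

    # load a pretrained speaker-vector model (assumed module and model name),
    # optionally in hybrid 8-bit quantized form for faster inference
    model = malaya_speech.speaker_vector.deep_model('speakernet', quantized=True)

    # embed the utterance; embeddings from different speakers can then be
    # compared with cosine similarity (method name assumed)
    embedding = model.vectorize([y])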
Pretrained Models
------------------
Malaya-Speech also releases pretrained models; see `malaya-speech/pretrained-model <https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model>`_.
- **Wave UNET**, Multi-Scale Neural Network for End-to-End Audio Source Separation, https://arxiv.org/abs/1806.03185
- **Wave ResNet UNET**, added ResNet style into Wave UNET, no paper produced.
- **Wave ResNext UNET**, added ResNext style into Wave UNET, no paper produced.
- **Deep Speaker**, An End-to-End Neural Speaker Embedding System, https://arxiv.org/pdf/1705.02304.pdf
- **SpeakerNet**, 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification, https://arxiv.org/abs/2010.12653
- **VGGVox**, speaker recognition model from VoxCeleb: a large-scale speaker identification dataset, https://arxiv.org/pdf/1706.08612.pdf
- **GhostVLAD**, Utterance-level Aggregation For Speaker Recognition In The Wild, https://arxiv.org/abs/1902.10107
- **Conformer**, Convolution-augmented Transformer for Speech Recognition, https://arxiv.org/abs/2005.08100
- **ALConformer**, A lite Conformer, no paper produced.
- **Jasper**, An End-to-End Convolutional Neural Acoustic Model, https://arxiv.org/abs/1904.03288
- **Tacotron2**, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884
- **FastSpeech2**, Fast and High-Quality End-to-End Text to Speech, https://arxiv.org/abs/2006.04558
- **MelGAN**, Generative Adversarial Networks for Conditional Waveform Synthesis, https://arxiv.org/abs/1910.06711
- **Multi-band MelGAN**, Faster Waveform Generation for High-Quality Text-to-Speech, https://arxiv.org/abs/2005.05106
- **SRGAN**, Modified version of SRGAN to do 1D Convolution, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, https://arxiv.org/abs/1609.04802
- **Speech Enhancement UNET**, https://github.com/haoxiangsnr/Wave-U-Net-for-Speech-Enhancement
- **Speech Enhancement ResNet UNET**, Added ResNet style into Speech Enhancement UNET, no paper produced.
- **Speech Enhancement ResNext UNET**, Added ResNext style into Speech Enhancement UNET, no paper produced.
- **Universal MelGAN**, Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains, https://arxiv.org/abs/2011.09631
- **FastVC**, Faster and more accurate Voice Conversion using Transformer, no paper produced.
- **FastSep**, Faster and more accurate Speech Separation using Transformer, no paper produced.
- **wav2vec 2.0**, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
- **FastSpeechSplit**, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
- **Sepformer**, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
- **FastSpeechSplit**, Faster and more accurate Speech Split Conversion using Transformer, no paper produced.
- **HuBERT**, Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, https://arxiv.org/pdf/2106.07447v1.pdf
- **FastPitch**, Parallel Text-to-speech with Pitch Prediction, https://arxiv.org/abs/2006.06873
- **GlowTTS**, A Generative Flow for Text-to-Speech via Monotonic Alignment Search, https://arxiv.org/abs/2005.11129
- **BEST-RQ**, Self-supervised learning with random-projection quantizer for speech recognition, https://arxiv.org/pdf/2202.01855.pdf
- **LightSpeech**, Lightweight and Fast Text to Speech with Neural Architecture Search, https://arxiv.org/abs/2102.04040
- **VITS**, Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, https://arxiv.org/abs/2106.06103
- **Squeezeformer**, An Efficient Transformer for Automatic Speech Recognition, https://arxiv.org/abs/2206.00888
- **Whisper**, Robust Speech Recognition via Large-Scale Weak Supervision, https://cdn.openai.com/papers/whisper.pdf
- **Emformer**, Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition, https://arxiv.org/abs/2010.10759
References
-----------
If you use our software for research, please cite:
::

    @misc{MalayaSpeech,
      author = {Husein, Zolkepli},
      title = {Malaya-Speech, Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning TensorFlow},
      year = {2020},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/huseinzol05/malaya-speech}}
    }
Acknowledgement
----------------
Thanks to KeyReply for the private V100s cloud and Mesolitica for the private RTXs cloud used to train Malaya-Speech models.