ODT - thesis topic proposal: Gyires-Tóth Bálint: Joint Audio-Visual-Textual Analysis and ...

Joint Audio-Visual-Textual Analysis and Modeling of Heterogeneous Data with Deep Learning

THESIS TOPIC PROPOSAL

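Institution: Budapesti Műszaki és Gazdaságtudományi Egyetem (Budapest University of Technology and Economics)
Discipline: computer science
Doctoral school: Informatikai Tudományok Doktori Iskola (Doctoral School of Informatics)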

Supervisor: Gyires-Tóth Bálint
Location (Hungarian page): Távközlési és Médiainformatikai Tanszék (Department of Telecommunications and Media Informatics)
Location abbreviation: TMIT

Description of the research topic:

Driven by the rapid growth in available data, the rise of high-performance GPUs and recent advances in neural networks, deep learning has attracted considerable attention among machine learning techniques. The many layers of a deep architecture extract increasingly abstract representations of the input data (which is based on real-life observations), and the network uses these representations to make predictions or classifications efficiently.
State-of-the-art solutions for audio, image and video segmentation, classification and recognition are generally based on deep learning. Building blocks such as the various types of deep convolutional and recurrent neural networks can learn descriptive features of audio, image and video content at many levels of representation. This approach has been shown to outperform the previously used feature extraction methods, and on some tasks it can even surpass the accuracy of human annotators.
Audio and visual information is often complemented or accompanied by textual information, which may come in various forms, including precise labels, textual descriptions or even free text. Deep neural networks, including Long Short-Term Memory (LSTM) networks, are capable of extracting information from such sources. Combining the features extracted from the audio/visual data with the computed ‘meaning’ of the textual information can increase the modeling capacity of deep neural networks.
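As an illustration of what combining features can mean in the simplest case, the following minimal sketch concatenates one feature vector per modality and passes the result through a single linear layer with a softmax. It is not the method proposed in this topic: the function names, the feature dimensions and the random, untrained weights are all assumptions made for the example, and in practice the feature vectors would come from trained networks such as a CNN or an LSTM.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(audio_feat, visual_feat, text_feat, weights, biases):
    """Feature-level fusion: concatenate the modality features, then apply one linear layer."""
    fused = audio_feat + visual_feat + text_feat  # list concatenation
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical feature vectors standing in for the outputs of modality-specific networks.
rng = random.Random(0)
audio = [rng.gauss(0, 1) for _ in range(4)]
visual = [rng.gauss(0, 1) for _ in range(6)]
text = [rng.gauss(0, 1) for _ in range(3)]

num_classes = 3
fused_dim = len(audio) + len(visual) + len(text)
# Untrained random weights (one row per class), used only to make the sketch runnable.
weights = [[rng.gauss(0, 0.1) for _ in range(fused_dim)] for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = fuse_and_classify(audio, visual, text, weights, biases)
print(probs)  # one probability per class
```

Concatenation is only one possible fusion strategy. It is shown here because it makes explicit that the classifier sees all three modalities at once.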
The goal of this PhD research is to develop novel deep learning methods for jointly analyzing and modeling audio, visual and textual information. The effectiveness of the developed methods must be demonstrated in at least one application scenario, such as (1) speech synthesis, (2) analysis of medical images together with textual information on patients with skin diseases, or (3) sentiment analysis.
The research can be conducted in either English or Hungarian. Public and private databases and high-performance GPUs are available for training the models.

The possible research tasks of the PhD student are the following:
- Review the related scientific literature, including the basic building blocks of deep neural networks and recent results in deep learning based classification.
- Design and implement baseline systems that analyze audio, visual and textual data from heterogeneous sources separately, using basic deep learning algorithms, and enhance them with novel deep learning methods (e.g. adversarial, dilated convolutional or deep ensemble models).
- Conduct research on the joint analysis of audio, visual and textual data with deep learning. Propose a novel method with improved modeling capacity.
- Demonstrate the effectiveness of the results in at least one application scenario.
- Evaluate the results both objectively and subjectively.

Required language skills: English
Further requirements:
Basic programming and mathematical skills

Number of students who can be accepted: 1

Application deadline: 2017-06-26