Deep Learning algorithms aim to learn feature hierarchies, with features at higher levels of the hierarchy formed by composing lower-level features. Computationally stained slides could help automate the time-consuming process of slide staining, but Shah said the ability to de-stain and preserve images for future use is the real advantage of the deep learning techniques. This course concerns the latest techniques in deep learning and representation learning, focusing on supervised and unsupervised deep learning, embedding methods, metric learning, convolutional and recurrent nets, with applications to computer vision, natural language understanding, and speech recognition. Deep Learning is one of the most highly sought-after skills in tech. Deep Learning is Large Neural Networks. Lecture slides for Chapter 4 of Deep Learning (www.deeplearningbook.org) by Ian Goodfellow, last modified 2017-10-14, cover numerical concerns for implementations of deep learning algorithms. We will be giving a two-day short course on Designing Efficient Deep Learning Systems at MIT in Cambridge, MA on July 20-21, 2020.
Probability Theory is a great tool to reason about uncertainty.
Bayesians quantify subjective uncertainty.
Frequentists quantify inherent randomness in the long run.
People seem to interpret probability as beliefs and hence are Bayesians.

We formulate our prior beliefs about how the \( x \) might be generated.
We collect some data of already generated \( x \): $$ \mathcal{D}_\text{train} = (x_1, ..., x_N) $$
We update our beliefs regarding what kind of data exist by incorporating the collected data.
We can now make predictions about unseen data.
And collect some more data to improve our beliefs.

We'll assume random variables have densities and are described by them:
\(p(X=x)\) (\(p(x)\) for short): the probability density function.
\(\text{Pr}[X \in A] = \int_{A} p(X=x) dx\): the distribution function.
In general, several random variables \(X_1, ..., X_N\) have a joint density \(p(x_1, ..., x_N)\).
It describes the joint probability $$\text{Pr}(X_1 \in A_1, ..., X_N \in A_N) = \int_{A_1} ... \int_{A_N} p(x_1, ..., x_N) dx_N ... dx_1 $$
If (and only if) random variables are independent, the joint density is just a product of individual densities.
Vector random variables are just a bunch of scalar random variables.
For 2 and more random variables you should be considering their joint distribution.

\(\mathbb{E}_{p(x)} X = \int x p(x) dx\): the expectation (mean).
Expectation is linear: \( \mathbb{E} [\alpha X + \beta Y] = \alpha \mathbb{E} X + \beta \mathbb{E} Y \)
Variance: \( \mathbb{V} X = \mathbb{E} [X^2] - (\mathbb{E} X)^2 = \mathbb{E}(X - \mathbb{E} X)^2 \)

\(X\) is said to be Uniformly distributed over \((a, b)\) (denoted \(X \sim U(a, b)\)) if its probability density function is $$ p(x) = \begin{cases} \tfrac{1}{b-a}, & a < x < b \\ 0, &\text{otherwise} \end{cases} \quad\quad \mathbb{E} U = \frac{a+b}{2} \quad\quad \mathbb{V} U = \frac{(b-a)^2}{12} $$
\(X\) is called a Multivariate Gaussian (Normal) random vector with mean \(\mu \in \mathbb{R}^n\) and positive-definite covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\) (denoted \(X \sim \mathcal{N}(\mu, \Sigma)\)) if its joint probability density function is $$ p(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \exp\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) $$
\(X\) is said to be Categorically distributed with probabilities \(\pi = (\pi_1, ..., \pi_K)\), \(\pi_k \ge 0\), \(\sum_{k=1}^K \pi_k = 1\) (denoted \(X \sim \text{Cat}(\pi)\)) if its probability mass function is $$ p(X = k) = \pi_k $$
\(X\) is called a Bernoulli random variable with probability (of success) \(\pi \in [0, 1]\) (denoted \(X \sim \text{Bern}(\pi)\)) if its probability mass function is $$ p(X = 1) = \pi \Leftrightarrow p(x) = \pi^{x} (1-\pi)^{1-x} $$ (yes, this is a special case of the categorical distribution).

Joint density on \(x\) and \(y\) defines the marginal densities $$ p(x) = \int p(x, y) dy \quad\quad p(y) = \int p(x, y) dx $$
Knowing the value of \(y\) can reduce uncertainty about \(x\), expressed via the conditional density $$ p(x|y) = \frac{p(x, y)}{p(y)} $$
Thus $$ p(x, y) = p(y|x) p(x) = p(x|y) p(y) $$
Suppose we have two jointly Gaussian random variables \(X\) and \(Y\), where \(\rho_{xy}\) is their covariance: $$(X, Y) \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_x \\ \mu_y \end{array} \right], \left[\begin{array}{cc}\sigma^2_x & \rho_{xy} \\ \rho_{xy} & \sigma^2_y\end{array}\right]\right)$$
Then one can show that the marginals and conditionals are also Gaussian $$ p(x) = \mathcal{N}(x \mid \mu_x, \sigma^2_x) $$ $$ p(y) = \mathcal{N}(y \mid \mu_y, \sigma^2_y) $$ $$p(x|y) = \mathcal{N}\left(x \mid \mu_x + \tfrac{\rho_{xy}}{\sigma_y^2} (y - \mu_y), \sigma^2_x - \tfrac{\rho_{xy}^2}{\sigma_y^2}\right)$$
If we're interested in \(y\), then these distributions are called …

We assume some data-generating model $$p(y, \theta \mid x) = p(y \mid x, \theta) p(\theta) $$
We obtain some observations \( \mathcal{D} = \{(x_n, y_n)\}_{n=1}^N \).
We seek to make predictions regarding \(y\) for previously unseen \(x\), having observed the training set \(\mathcal{D}\).

Often we can't compute the true posterior \(p(\theta \mid \mathcal{D})\), so we use an approximate posterior \(q(\theta \mid \Lambda)\).
We don't need the exact true posterior to fit it: $$ \text{KL}(q(\theta | \Lambda) || p(\theta | \mathcal{D})) = \log p(\mathcal{D}) - \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)} $$
Since \(\log p(\mathcal{D})\) does not depend on \(\Lambda\), minimizing this KL divergence is the same as maximizing the expectation term.
Hence we seek parameters \(\Lambda_*\) maximizing the following objective (the ELBO) $$ \Lambda_* = \text{argmax}_\Lambda \left[ \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} = \mathbb{E}_{q(\theta|\Lambda)} \log p(\mathcal{D}|\theta) - \text{KL}(q(\theta|\Lambda)||p(\theta)) \right]$$
The approximate posterior predictive \( q(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta) q(\theta \mid \Lambda_*) d\theta \) can't be computed analytically either, but we can sample from \(q\) to get Monte Carlo estimates of it: $$ q(y \mid x, \mathcal{D}) \approx \hat{q}(y|x, \mathcal{D}) = \frac{1}{M} \sum_{m=1}^M p(y \mid x, \theta^m), \quad\quad \theta^m \sim q(\theta \mid \Lambda_*) $$
What if we want to tune dropout rates \(p\)?

Recall the objective for variational inference $$ \mathcal{L}(\Lambda) = \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} \to \max_{\Lambda} $$
We'll be using a well-known optimization method, so we need a (stochastic) gradient \(\hat{g}\) of \(\mathcal{L}(\Lambda)\) that is unbiased.

Generator network and inference network essentially give us an autoencoder.
Inference network encodes observations into latent code.
Generator network decodes latent code into observations.
Can infer high-level abstract features of existing objects.
Uses a neural network to amortize inference.

Bayesian methods are useful when we have a low data-to-parameters ratio.
Bayesian methods can:
Impose useful priors on Neural Networks, helping discover solutions of special form.
Provide better predictions.
Provide Neural Networks with uncertainty estimates (uncovered).
Neural Networks help us make more efficient Bayesian inference.
Uses a lot of math.
Active area of research.
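The Monte Carlo estimate of the approximate posterior predictive, \(\hat{q}(y|x, \mathcal{D}) = \frac{1}{M} \sum_{m} p(y \mid x, \theta^m)\) with \(\theta^m \sim q(\theta \mid \Lambda_*)\), can be sketched in a few lines. The setup below is an assumption made purely for illustration: a Bernoulli coin model, and an approximate posterior \(q(\pi) = \text{Beta}(8, 4)\), which is what a \(U(0, 1)\) prior would give after 7 heads and 3 tails. The helper name `mc_posterior_predictive` is hypothetical. In this conjugate case the exact answer is known, so the estimate can be checked against it.

```python
import random


def mc_posterior_predictive(likelihood, sample_q, num_samples, rng):
    """Average p(y | x, theta^m) over samples theta^m drawn from q."""
    total = 0.0
    for _ in range(num_samples):
        theta = sample_q(rng)        # theta^m ~ q(theta | Lambda_*)
        total += likelihood(theta)   # p(y | x, theta^m)
    return total / num_samples


# Assumed toy setup: q(pi) = Beta(8, 4) for the probability of heads.
ALPHA, BETA = 8.0, 4.0
rng = random.Random(0)

# Predictive probability that the next flip is heads: p(y = 1 | pi) = pi.
estimate = mc_posterior_predictive(
    likelihood=lambda pi: pi,
    sample_q=lambda r: r.betavariate(ALPHA, BETA),
    num_samples=20000,
    rng=rng,
)

# For a Beta distribution the expectation of pi is available in closed form.
exact = ALPHA / (ALPHA + BETA)
print(f"MC estimate: {estimate:.4f}, exact: {exact:.4f}")
```

For a neural network, `likelihood` would be a forward pass with the sampled weights, and no closed-form reference value would be available.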
Can we drop unnecessary computations for easy inputs?
Minimum Description Length for VAE: Alice wants to transmit \(x\) as compactly as possible to Bob, who knows only the prior \(p(z)\) and the decoder weights.
How do we backpropagate through samples \(\theta_i\)?

The gradient estimate \(\hat{g}\) must be unbiased: \(\mathbb{E} \hat{g} = \nabla_\Lambda \mathcal{L}(\Lambda) \)
Problem: We can't just take \(\hat{g} = \nabla_\Lambda \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)} \), as the samples themselves depend on \(\Lambda\) through \(q(\theta|\Lambda)\).
Remember the expectation is just an integral, and apply the log-derivative trick $$ \nabla_\Lambda q(\theta | \Lambda) = q(\theta | \Lambda) \nabla_\Lambda \log q(\theta|\Lambda) $$ $$ \nabla_\Lambda \mathcal{L}(\Lambda) = \int q(\theta|\Lambda) \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)} \nabla_\Lambda \log q(\theta | \Lambda) d\theta = \mathbb{E}_{q(\theta|\Lambda)} \left[ \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} \nabla_\Lambda \log q(\theta | \Lambda) \right] $$
The term that comes from differentiating \(\log q(\theta|\Lambda)\) inside the integrand vanishes, because \(\mathbb{E}_{q(\theta|\Lambda)} \nabla_\Lambda \log q(\theta|\Lambda) = \nabla_\Lambda \int q(\theta|\Lambda) d\theta = 0\).
Though general, this gradient estimator has too much variance in practice.

We assume the data is generated using some (partially known) classifier \(\pi_{\theta}\): $$ p(y \mid x, \theta) = \text{Cat}(y | \pi_\theta(x)) \quad\quad \theta \sim p(\theta) $$
The true posterior is intractable $$ p(\theta \mid \mathcal{D}) \propto p(\theta) \prod_{n=1}^N p(y_n \mid x_n, \theta) $$
Approximate it using \(q(\theta | \Lambda)\): $$ \Lambda_* = \text{argmax}_\Lambda \left[ \mathbb{E}_{q(\theta | \Lambda)} \sum_{n=1}^N \log p(y_n | x_n, \theta) - \text{KL}(q(\theta | \Lambda) || p(\theta)) \right] $$
Essentially, instead of learning a single neural network that would solve the problem, we …
\(p(\theta)\) encodes our preferences on which networks we'd like to see.
Let \(q(\theta_i | \Lambda)\) be s.t. …

We thank the Orange-Keyrus-Thalès chair for supporting this class.
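The log-derivative (score-function) gradient estimator discussed above can be checked numerically on a toy problem. The sketch below rests on assumptions that are not in the slides: a one-dimensional \(q(\theta \mid \mu) = \mathcal{N}(\mu, 1)\) and a fixed integrand \(f(\theta) = \theta^2\) standing in for the log-ratio, so that \(\mathbb{E}_q f = \mu^2 + 1\) and the exact gradient is \(2\mu\). The helper name is hypothetical. It illustrates both properties claimed for the estimator: it is unbiased, and its per-sample variance is large.

```python
import random
import statistics


def score_function_grad_samples(f, mu, num_samples, rng):
    """Per-sample estimates f(theta) * d/dmu log q(theta | mu), q = N(mu, 1)."""
    samples = []
    for _ in range(num_samples):
        theta = rng.gauss(mu, 1.0)   # theta ~ q(theta | mu)
        score = theta - mu           # d/dmu log N(theta | mu, 1)
        samples.append(f(theta) * score)
    return samples


rng = random.Random(0)
mu = 1.5

# Assumed toy integrand; it does not depend on mu, unlike the ELBO integrand.
grads = score_function_grad_samples(lambda t: t * t, mu, 100_000, rng)

grad_estimate = statistics.fmean(grads)    # should be close to 2 * mu
grad_variance = statistics.variance(grads)  # variance of a single-sample estimate
exact_grad = 2.0 * mu

print(f"estimate: {grad_estimate:.3f}, exact: {exact_grad:.3f}")
print(f"per-sample variance: {grad_variance:.1f}")
```

Even in this one-dimensional case, the variance of a single-sample estimate is several times the squared gradient, which is why many samples are needed before the average becomes useful.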
All the code in this repository is made available under the MIT license. lectures-labs is maintained by m2dsupsdlclass.

Lectures and labs:
Convolutional Neural Networks for Image Classification
Deep Learning for Object Detection and Image Segmentation
Sequence to sequence, attention and memory
Expressivity, Optimization and Generalization
Imbalanced classification and metric learning
Unsupervised Deep Learning and Generative models
Demo: Object Detection with pretrained RetinaNet with Keras
Backpropagation in Neural Networks using Numpy
Neural Recommender Systems with Explicit Feedback
Neural Recommender Systems with Implicit Feedback and the Triplet Loss
Fine Tuning a pretrained ConvNet with Keras (GPU required)
Bonus: Convolution and ConvNets with TensorFlow
ConvNets for Classification and Localization
Character Level Language Model (GPU required)
Transformers (BERT fine-tuning): Joint Intent Classification and Slot Filling
Translation of Numeric Phrases with Seq2Seq
Stochastic Optimization Landscape in PyTorch
We assume the two-phase data-generating process:
First, we decide upon high-level abstract features of the datum \(z \sim p(z)\).
Then, we unpack these features using Neural Networks into an actual observable \(x\) using the (learnable) generator \(f_\theta\).
This leads to the following model \(p(x, z) = p(x|z) p(z)\) where $$ p(x|z) = \prod_{d=1}^D p(x_d | f_\theta(z)) $$ $$ p(z) = \mathcal{N}(z | 0, I) $$ and \(f_\theta\) is some neural network.
We can sample new \(x\) by passing samples \(z\) through the generator once we learn it.
We would like to maximize the log-marginal density of observed variables \(\log p(x)\).
Intractable integral: \( \log p(x) = \log \int p(x|z) p(z) dz \)
Introduce an approximate posterior \(q(z|x)\): $$ q(z|x) = \mathcal{N}(z|\mu_\Lambda(x), \Sigma_\Lambda(x))$$
where \(\mu_\Lambda, \Sigma_\Lambda\) are generated from the observation \(x\) by an auxiliary inference network.
Invoking the ELBO we obtain the following objective $$ \tfrac{1}{N} \sum_{n=1}^N \left[ \mathbb{E}_{q(z_n|x_n)} \log p(x_n | z_n) - \text{KL}(q(z_n|x_n)||p(z_n)) \right] \to \max_{\theta, \Lambda} $$

This course is being taught as part of Master Datascience Paris.
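The per-datum VAE objective above, \(\mathbb{E}_{q(z|x)} \log p(x | z) - \text{KL}(q(z|x)||p(z))\), can be estimated as sketched below. Several choices here are assumptions rather than anything the slides specify: a diagonal covariance for \(q(z|x)\), a Bernoulli likelihood for each \(x_d\), and the tiny hand-written `toy_encoder` and `toy_decoder` functions, which are hypothetical stand-ins for the inference and generator networks. The KL term uses the standard closed form for a diagonal Gaussian against \(\mathcal{N}(0, I)\). The reconstruction term is a Monte Carlo average over samples of \(z\).

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def bernoulli_log_lik(x, probs, eps=1e-12):
    """log p(x | z) = sum_d log Bern(x_d | probs_d), with clamped probabilities."""
    total = 0.0
    for x_d, p in zip(x, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += x_d * math.log(p) + (1 - x_d) * math.log(1.0 - p)
    return total


def elbo_estimate(x, encoder, decoder, rng, num_samples=1):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z)) for one datum."""
    mu, log_var = encoder(x)
    recon = 0.0
    for _ in range(num_samples):
        # Draw z ~ q(z|x) as mu + sigma * eps with eps ~ N(0, 1).
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, log_var)]
        recon += bernoulli_log_lik(x, decoder(z))
    return recon / num_samples - kl_to_standard_normal(mu, log_var)


# Hypothetical stand-ins for the inference and generator networks.
def toy_encoder(x):
    s = sum(x) / len(x)
    return [s - 0.5, 0.5 - s], [-1.0, -1.0]  # (mu, log of diagonal variances)


def toy_decoder(z):
    weights = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
    return [1.0 / (1.0 + math.exp(-(w[0] * z[0] + w[1] * z[1])))
            for w in weights]


rng = random.Random(0)
x = [1, 0, 1]
elbo = elbo_estimate(x, toy_encoder, toy_decoder, rng, num_samples=100)
print(f"ELBO estimate for x = {x}: {elbo:.3f}")
```

In an actual VAE the two toy functions would be neural networks, and this quantity, averaged over a minibatch, would be maximized with an automatic-differentiation library.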
The uncertainty of a prediction of \(y\) for an unseen \(x\) is quantified by the posterior predictive distribution $$ p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta) p(\theta \mid \mathcal{D}) d\theta $$
This requires us to know the posterior distribution on model parameters \(p(\theta \mid \mathcal{D})\), which we obtain using Bayes' rule $$ p(\theta \mid \mathcal{D}) = \frac{p(\theta) \prod_{n=1}^N p(y_n \mid x_n, \theta)}{\int p(\theta) \prod_{n=1}^N p(y_n \mid x_n, \theta) d\theta} $$

Suppose the model is \(y \sim \mathcal{N}(\theta^T x, \sigma^2)\), with \( \theta \sim \mathcal{N}(\mu_0, \sigma_0^2 I) \).
Suppose we observed some data from this model \( \mathcal{D} = \{(x_n, y_n)\}_{n=1}^N \) (generated using the same \( \theta^* \)).
We don't know the optimal \(\theta\), but the more data we observe, …
The posterior predictive would also be Gaussian, with \(\mu_N\) the posterior mean of \(\theta\): $$ p(y|x, \mathcal{D}) = \mathcal{N}(y \mid \mu_N^T x, \sigma_N^2) $$

Suppose we observe a sequence of coin flips \((x_1, ..., x_N, ...)\), but don't know whether the coin is fair $$ x \sim \text{Bern}(\pi), \quad \pi \sim U(0, 1) $$
First, we infer the posterior distribution on the hidden parameter \(\pi\) having observed \(x_{