Machine Learning - think "learning a machine", not "learning by a machine"

Types of Learning

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

Examples

Supervised:
- map a bitmap of a handwritten digit to a digit class {0,1,...,9,not-a-digit}
- map credit card transaction authorization requests to Pr(fraudulent)
- map a patch of lung X-ray to Pr(cancerous)

Unsupervised:
- figure out the different letters in some unknown alphabet
- figure out the 3D structure of the world from seeing lots of movies
- figure out the structure of English from hearing many spoken sentences

Reinforcement Learning:
- figure out which moves are good (i.e., lead to winning) in chess

Success Stories of Machine Learning

- PAPnet - a multi-billion dollar success that has saved tens of thousands of women's lives. Maps an image of a plated-out cell from a stained Pap smear slide to a rating of abnormality; the top-ranked abnormal cells in the slide are presented to a human operator.
- Handwritten digit recognition for postal zip-code sorting
- Handwriting recognition for PDAs
- Speech recognition - once built from hand-crafted rules, now dominated by machine learning techniques, especially HMMs (hidden Markov models). Enables replacement of phone menu trees, hands-free phone dialing, word-based searching of a corpus of audio data, etc.
- Fraud detection for credit cards

Example Learning Theorem!

A simple example of theory in machine learning: we will define a very simple learning situation and then analyze it. Typically we are concerned with performance on new (unseen) inputs. Let us assume that inputs x are drawn iid from some unknown distribution p(x), and that future inputs will be drawn from the same distribution.

One-sided error: assume there is some threshold x_c, and that all x < x_c should be classified "okay" while all x >= x_c should be classified "not okay". Imagine that incorrectly classifying an x that is really okay as not okay (a false negative) is costly but not very costly, while incorrectly classifying an x that is actually not okay as okay is a major disaster.
(Think of classifying fruits as to whether they are poisonous: it is acceptable to throw away a non-poisonous fruit, but disastrous to eat a poisonous one.)

Consider the following "online" learning system. At each time step we get a fruit, with a 1D measurement x(t). We need to decide whether x(t) is okay. After making our decision, we are told the correct classification. We would like a simple algorithm that (a) never makes a false positive error, (b) minimizes the number of false negative errors, and (c) has a provable bound on the number of false negative errors it will make.

Consider the following algorithm. We maintain a threshold \hat{x}_c(t), initialized at \hat{x}_c(1) = -\infty, and classify x(t) as okay iff x(t) <= \hat{x}_c(t). If we make a mistake (which had better be a false negative) we raise the threshold: \hat{x}_c := x(t).

Theorem: the expected number of false negatives after seeing the T-th okay (true positive) exemplar is about log(T).

Proof: let T(i) be the time we see the i-th positive exemplar. If we get x(T(t)) wrong, it can only be because we have never seen such a large positive exemplar. Since the inputs are iid, the chance that the t-th positive exemplar is the largest positive exemplar seen so far is 1/t. So the probability of getting the t-th positive exemplar wrong is 1/t, and summing, the expected number wrong up to the T-th is 1/1 + 1/2 + 1/3 + ... + 1/T, which is approximately log(T). (Meaning the natural logarithm.)

We start with supervised learning. Let us define some terminology. We are learning a machine that looks like this:

y = h(x; w)

where x is the input (e.g., sensory input), w are the free parameters of the machine, which we must set properly to make it match the training data, and y is the output of the machine given input x and parameter setting w. We need names for the actual output of the machine given input x_p (which we call y_p) and the desired output of the machine given input x_p (which we call d_p).

Input Space X, x \in X.

Output Space Y, y \in Y.
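Before moving on, the threshold-raising algorithm above is easy to simulate. The sketch below is illustrative, not from the notes: the function name and the choice of a uniform input distribution with true threshold x_c = 0.5 are assumptions. It counts false negatives and compares against the harmonic/log bound from the theorem.

```python
import math
import random

def run_threshold_learner(xs, x_c):
    """Online one-sided learner: classify x as "okay" iff x <= current
    threshold estimate; on a mistake, raise the estimate to x.
    Returns the number of false negatives made on truly-okay inputs."""
    x_hat = -math.inf          # \hat{x}_c(1) = -infinity
    false_negatives = 0
    for x in xs:
        okay_pred = (x <= x_hat)   # our classification
        okay_true = (x < x_c)      # the true classification
        if okay_true and not okay_pred:
            false_negatives += 1   # false negative: raise the threshold
            x_hat = x
        # a false positive (not-okay classified as okay) is impossible,
        # since x_hat never rises above the largest okay input seen
        assert not (okay_pred and not okay_true)
    return false_negatives

random.seed(0)
xs = [random.random() for _ in range(10_000)]  # iid draws from U(0,1)
mistakes = run_threshold_learner(xs, x_c=0.5)
n_okay = sum(x < 0.5 for x in xs)
# the theorem: expected mistakes after T okay inputs is roughly ln(T)
print(mistakes, math.log(n_okay))
```

The count of mistakes should come out close to ln(n_okay) (around 8 or 9 here), since a mistake occurs exactly when an okay input is a new running maximum among okay inputs.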
We distinguish between the space of possible machine outputs and the space of desired (target) outputs: these may differ. So y \in Y but d \in D.

Exemplars: examples of inputs with corresponding outputs. We say x_p \mapsto d_p is an "exemplar", or more colloquially a "training pattern."

Weight Space W, w \in W. Can be continuous, discrete, or both in potentially complicated ways.

Hypothesis Space H. Parametrized by w, perhaps over-parametrized: h_w \in H \subset (X \rightarrow Y). H is that subset of (X \rightarrow Y) that can actually be constructed by setting w appropriately. So H = { the mappings x \mapsto h(x; w) such that w \in W }.

We use standard variable names for members of these spaces. Sometimes this standardization conflicts with other sorts of standardization, in which case we punt in the interest of clarity.

Linear Classifier (aka Perceptron)

- input space is R^n
- output space is {-1,+1}, corresponding to a no/yes dichotomy. (Why {-1,+1} instead of {0,1}? The symmetry will prove useful in simplifying some formulas later.)
- weight space is R^(n+1)
- hypothesis space is linear decision surfaces: if w.x > theta then y = +1, else y = -1

The division between the +1 region and the -1 region is w.x = theta. (The boundary, in input space, between different outputs is called the separation surface.) Note that changing theta keeps the direction of the separating plane the same, merely translating it. If we set theta = 0 then the separating plane goes through the origin, and is perpendicular to w if w is regarded as a vector in input space. This means that even for nonzero theta, the surface is perpendicular to the vector w. Note that we can switch the "sense" of the separation surface, i.e. make the +1 region into the -1 region and vice versa, by changing the sign of w, i.e. the sign of all entries of w. (Exercise: does theta also need to be adjusted?)
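A minimal sketch of this decision rule (the function name and the sample vectors are illustrative assumptions, not from the notes):

```python
def perceptron_classify(w, theta, x):
    """Linear decision rule: y = +1 if w.x > theta, else y = -1."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if dot > theta else -1

w = [2.0, 3.0]
print(perceptron_classify(w, 4.0, [3.0, 1.0]))   # w.x = 9 > 4,  so +1
print(perceptron_classify(w, 4.0, [0.0, 1.0]))   # w.x = 3 <= 4, so -1
# raising theta translates the separation surface without rotating it,
# so a point can change class while the surface direction stays fixed:
print(perceptron_classify(w, 10.0, [3.0, 1.0]))  # w.x = 9 <= 10, so -1
```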
Note that the mapping from (w,theta) to a "hypothesis" is many-to-one, as for instance in 2D (w1=2, w2=3, theta=4) and (w1=4, w2=6, theta=8) give the same mapping from x to y. In fact, scaling w and theta by some alpha>0 will leave the hypothesis unchanged.
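The many-to-one claim is easy to check numerically. A small sketch (the random test points and function name are illustrative assumptions): the two parameter settings from the 2D example above, which differ by the scaling alpha = 2, agree everywhere we look.

```python
import random

def classify(w1, w2, theta, x1, x2):
    """2D linear classifier: +1 if w.x > theta, else -1."""
    return 1 if w1 * x1 + w2 * x2 > theta else -1

random.seed(1)
points = [(random.uniform(-10, 10), random.uniform(-10, 10))
          for _ in range(1000)]
# (w1=2, w2=3, theta=4) vs. the alpha=2 scaling (w1=4, w2=6, theta=8):
# 2*x1 + 3*x2 > 4 holds exactly when 4*x1 + 6*x2 > 8, so the two
# parameter settings define the same hypothesis.
for x1, x2 in points:
    assert classify(2, 3, 4, x1, x2) == classify(4, 6, 8, x1, x2)
print("same hypothesis on all test points")
```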

Perceptron References
- Wikipedia Perceptron entry
- Perceptron Learning Rule chapter from the textbook Neural Network Design by Martin T. Hagan, Howard B. Demuth, and Mark H. Beale, ISBN 0-9717321-0-8