The greatest thing about the human mind is the ability to detect patterns. This pattern detection and connection capability is in essence the holy grail of artificial intelligence. I’ve noticed such a pattern across a broad spectrum of technical endeavors associated with the 1-norm. The 1-norms latest foray into my consciousness comes from “compressed sensing,” but it resonates with a bunch of other things. These patterns are always trying to tell us, “there is something here worth paying attention to”.
This is all the more amazing since the pattern detection capability that humans have is probably more related to our staying alive as nomadic hunters, 10’s of thousands of years ago. In those days the pattern would say, “there’s something to eat” or “something is over there that might kill me” or “here is a good place to sleep” and so on. Now the same wiring is providing increased acuity applied toward abstract constructs like functional spaces. Go figure.
I’m trying to write something up about compressed sensing and related fields, but don’t feel up to the task quite yet. So instead of nursing that as a bonus this week I’ll just give my brain dump of what is intriguing me, and what tendrils are connected to other areas of science that are perhaps far from compressed sensing at first blush.
- L1 is the sharpest and least convex norm.
- “When a traveler reaches a fork in the road, the ℓ1 -norm tells him to take either one way or the other, but the ℓ2 -norm instructs him to head off into the bushes. ” – John F. Claerbout and Francis Muir, 1973. I If you think about the rest of the post, it might be clear what this means.
- Error and convergence for discontinuous (hyperbolic) problems should be measured in the L1 norm. The L1 norm is very closely related to compactness arguments and the total variation norm. Total variation concepts have been essential in numerical approximations to hyperbolic conservation equations, and image processing. At the close of this post it might be worth keeping this firmly in mind.
- The least squares problem is ubiquitous in mathematics, physics and engineering largely because it is so easy to compute. L2 is related to energy, which adds to the appeal. People apply L2 almost without thought as it is built into numerous computing packages and it is an introductory topic. This ubiquity is unfortunate because of all the bad things about L2 minimization framework. Among these bad things are the numerous unstated assumptions that come along for the ride and aren’t even known (or acknowledged) by those that use it. In many applications L1 is more appropriate and useful, but computing solutions to the L1 minimization is inherently nonlinear and there aren’t any non-trivial algorithms. Trivial introductory algorithms exist for L2. Some minimization is better than none, but too much begins and ends with least squares.
- Until recently I pondered what sort of math takes place in fractional “norms”. I’ve received a partial answer with the compressed sensing work. Little “L” norms aren’t norms because of lack of convexity, but can be computed. L0 tells one how sparse a set is, that is it measures all the non-zero entries. Because of the lack of convexity it isn’t very computable (NP-hard!). The cool thing is that with “high probability” minimizing in L1 can give the answer to the same question. This is the key to the utility of compressed sensing.
- The L1 norm is related to robust statistics. In other words, if you construct estimates using L1 minimization you are not susceptible to outliers in the data. If instead you do the standard least squares estimator (L2), you are very susceptible to outliers in the data, that is if you have a corrupt data point and do a L2 fit, your result can be influenced greatly, L1 will cope by ignoring the outlier. It turns out that most of the outlier detection schemes are predicated on assumptions from least squares, and assumptions associated with Gaussian statistics.
- I found that a L1 fit to local data was a nice way of differencing hyperbolic PDEs. It gave a much-improved solution compared to L2. When in doubt my suggestion is solve the problem using L1 minimization first. You’ll probably have to do it outside of MS Excel!
- In analyzing the convergence of numerical approximations to PDE (i.e., solution verification), the L1 norm gives by far the best solutions. If you were to use on norm to fit your data I believe L1 is the best.
- Moving onto the ideas in the frame of compressed sensing, I came across the LASSO, which is a way of regularizing least squares minimization (which you might want or need to do if your system is under-determined or terribly conditioned). The classical regularization is Tikhonov’s where a L2 penalty term is added to the matrix you solve (basically adding a term to the diagonal). LASSO is instead a L1 based penalty. Like most L1 things it works better once you solve the system. There is more, and this is where things start to get really really cool.
- LASSO is really a way to select models. Usually the regularization as discussed in the previous point uses a small penalty, because you want the problem you solve to be very close to the original problem. What if instead you make the penalty large? With the L1 regularization you start to find that the minimized solution starts to set most of the solution to zero. Ultimately as the penalty becomes large enough everything is zero except a single value. This usually happens in a well-defined sequence and the ordering compared to penalty size tells you what variables are most important to the data. You can then choose how complex you want to your model to be and which variables to keep and which to throw away. I’m playing around with this concept right now.
- Compressed sensing arose from signal processing work using wavelets. A couple of techniques arose “matching pursuit” basically a greedy algorithm to find out what basis was the best representation of an image, and “basis pursuit” which is an all-at-once algorithm that finds the sparsest representation. A certain version of basis pursuit that applies to noisy data, “basis pursuit denoising (BPDN)” is basically identical to LASSO. The big difference is that LASSO is generally applied to over-determined problems and BPDN is applied to under-determined problems. Both methods apply the magic of the L1 norm to do the selections. These pursuits are just another model selection problem.
- Finally compressed sensing can be discussed albeit briefly. This concept arose from basis pursuit and has a phenomenal mathematical basis from the assurances associated with using L1 to get the L0 solution of the sparsest basis. Much is made of the RIP, the restricted isometry property, which characterizes systems that compressed sensing can tackle although it has been noted that the theory is a bit pessimistic compared to what can be achieved in practice. It looks like an awesome bit of math that has massive implications on signal processing, graphics, and data analysis.
I came across the area from a blog post on the most influential papers in statistics in the last decade. David Donoho’s paper introducing the concept has about 8000 citations in less than 10 years! I met David last year at a workshop, and had no idea about why he was so well known, he just gave a really nice talk on reproducible software in an educational setting.
If you really want an introduction to compressed sensing read Terry Tao’s wonderful, wonderful post (http://terrytao.wordpress.com/2007/04/13/compressed-sensing-and-single-pixel-cameras/) on the single pixel camera. His math papers on related topics are pretty great too!
I hope this was interesting and perhaps inspiring. There is cool new math out there, and better yet it is powerful and useful.