Keynotes

 
Geoff Hinton
Google and University of Toronto

Title: "Dark Knowledge"
Abstract: A simple way to improve classification performance is to average the predictions of a large ensemble of different classifiers. This is great for winning competitions but requires too much computation at test time for practical applications such as speech recognition. In a widely ignored paper in 2006, Caruana and his collaborators showed that the knowledge in the ensemble could be transferred to a single, efficient model by training the single model to mimic the log probabilities of the ensemble average. This technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers.  For example, the ensemble may give a BMW a probability of one in a billion of being a garbage truck but this is still far greater (in the log domain) than its probability of being a carrot. This "dark knowledge", which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier. I will describe a new variation of this technique called "distillation" and will show some surprising examples in which good classifiers over all of the classes can be learned from data in which some of the classes are entirely absent, provided the targets come from an ensemble that has been trained on all of the classes.  I will also show how this technique can be used to improve a state-of-the-art acoustic model and will discuss its application to learning large sets of specialist models without overfitting. This is joint work with Oriol Vinyals and Jeff Dean.
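
As an illustration of the distillation idea described above, here is a minimal NumPy sketch (not code from the talk; the temperature value and toy logits are illustrative assumptions): a small "student" model is trained to match the class probabilities that the ensemble produces at a high softmax temperature, so that the tiny relative probabilities of the wrong answers carry signal.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Softened probabilities: dividing logits by a temperature > 1 spreads
    # probability mass onto the improbable classes ("dark knowledge").
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    # Cross-entropy between the softened teacher (ensemble) distribution
    # and the student's softened distribution.
    soft_targets = softmax(teacher_logits, temperature)
    log_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return -(soft_targets * log_student).sum(axis=-1).mean()

# Toy logits for one image over three classes, e.g. (BMW, garbage truck, carrot):
# the teacher gives both wrong classes tiny but very different probabilities.
teacher_logits = np.array([[9.0, 1.0, -3.0]])
student_logits = np.array([[8.0, -2.0, -2.0]])
print(distillation_loss(student_logits, teacher_logits))
```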
 
Christopher Manning
Stanford University

Title: Distributed representations of language are back
Abstract: Distributed representations of human language content and structure had a brief boom in the 1980s, but the boom quickly faded, and the past 15 years have been dominated by continued use of categorical representations of language, albeit with probabilities or weights placed over elements of these categorical representations. However, the last five years have seen a resurgence, with highly successful use of distributed representations, often in the context of "neural" or "deep learning" models.  One great success has been distributed word representations, and I will look at some of our recent work and that of others on better understanding word representations and how they can be thought of as global matrix factorizations, which makes them much more similar to methods in the traditional literature. Then I will turn to the use of distributed representations in parsing, showing how a dependency parser can gain in accuracy and speed by using distributed representations of not only words but also part-of-speech tags and dependency labels.  Joint work with Danqi Chen and Jeffrey Pennington.
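
A minimal sketch of the matrix-factorization view of word vectors mentioned above (an illustration under simple assumptions, not the methods from the talk; the toy corpus, window size, and log reweighting are made up): build a word-word co-occurrence matrix, reweight the counts, and take a truncated SVD of the whole matrix, whose leading singular vectors give low-dimensional word vectors.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the log"]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sentence in tokens for w in sentence})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric word-word co-occurrence counts within a +/-2 word window.
counts = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, w in enumerate(sentence):
        for j in range(max(0, i - 2), min(len(sentence), i + 3)):
            if j != i:
                counts[index[w], index[sentence[j]]] += 1

# Reweight the raw counts, then factorize the global matrix at once:
# rows of U scaled by the singular values give the word vectors.
M = np.log1p(counts)
U, S, _ = np.linalg.svd(M)
dim = 2
word_vectors = U[:, :dim] * np.sqrt(S[:dim])
for word, vec in zip(vocab, word_vectors):
    print(word, vec.round(2))
```
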
Kevin Murphy
Google

Title: Extracting declarative and procedural knowledge from documents and videos on the web
Abstract: We describe how we built a very large probabilistic database of declarative facts, called "Knowledge Vault", by applying "machine reading" to the web. This approach extends previous work, such as NELL and YAGO, by leveraging existing knowledge bases as a form of "prior". We also discuss our nascent efforts to extract procedural knowledge from videos on the web. This requires training visual detectors from weakly labeled data. We give an example in which we attempt to interpret cooking videos by aligning the video frames to the steps of a recipe.
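
To make the notion of a probabilistic fact database concrete, here is a small schematic sketch (hypothetical names, fields, and numbers, not the Knowledge Vault implementation): each extracted (subject, predicate, object) triple carries a confidence from the extractor, which can be fused with a prior derived from an existing knowledge base. The odds-based fusion rule is a naive illustration of combining an extraction score with a prior, not the system's actual model.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    extractor_confidence: float   # belief from the "machine reading" extractor
    prior: float                  # plausibility under an existing knowledge base

    def posterior(self) -> float:
        # Naive fusion of extractor confidence and prior (illustrative only).
        odds = (self.extractor_confidence / (1 - self.extractor_confidence)) * \
               (self.prior / (1 - self.prior))
        return odds / (1 + odds)

fact = Fact("Barack Obama", "born_in", "Honolulu",
            extractor_confidence=0.9, prior=0.8)
print(round(fact.posterior(), 3))
```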
 

Ben Recht
UC Berkeley

Title: Machine Learning Pipelines at Scale
Abstract: Recent advances linking machine learning and systems research have enabled the rapid solution of common data analysis problems such as regression, classification, and clustering on terabytes of data.  However, complex tasks that chain together multiple information modalities, learning algorithms, and algorithmic backends are still difficult to engineer at large scales. This talk describes some of the challenges facing complex machine learning systems and notes some possible ways forward via collaborative research between systems engineers and machine learning theorists.  In particular, I will describe many of the nascent projects at the Berkeley AMPLab focused on simplifying the engineering, testing, and deployment of such machine learning pipelines.
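
As a toy illustration of what "machine learning pipeline" means here (a scikit-learn sketch on made-up data, not the AMPLab tooling the talk describes): several stages (featurization, dimensionality reduction, and a classifier) are chained into one object that is fit and applied end to end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("featurize", TfidfVectorizer()),                # text -> sparse feature vectors
    ("reduce", TruncatedSVD(n_components=2)),        # dimensionality reduction
    ("classify", LogisticRegression(max_iter=1000)), # final learning algorithm
])

# Tiny made-up dataset: label 0 = space, label 1 = cars.
texts = ["rockets launch into orbit", "astronauts aboard the shuttle",
         "the engine and the transmission", "new car brake pads"]
labels = [0, 0, 1, 1]

pipeline.fit(texts, labels)
print(pipeline.predict(["shuttle engine test"]))
```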