Publication Title

Proceedings of the Seventeenth International Conference on Machine Learning


Many collections of data do not come packaged in a form amenable to the ready application of machine learning techniques. Nevertheless, there has been only limited research on the problem of preparing raw data for learning, perhaps because widespread differences between domains make generalization difficult. This paper focuses on one common class of raw data, in which each entity of interest is itself a collection of smaller pieces of homologous data. We present a technique for processing such collections into high-dimensional vectors suitable for many learning algorithms, including clustering, nearest neighbors, and boosting. We demonstrate the method by using it to implement similarity metrics in two different domains: natural images and measurements from ocean buoys in the Pacific.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


© Nicholas Howe


Author’s submitted manuscript.



