Advances of Machine Learning in Theory & Applications





Research Projects

Each AMALTHEA REU team worked on a chosen research topic. Here you will find descriptions of these topics along with the posters that the teams presented during AMALTHEA's Symposium at the end of the summer experience. Click a poster to view it at higher resolution. More details on each team's research topic can be found in its technical report (TR; see the link below each project description). Finally, some of this research was published in conference and journal papers, which you can find on the Publications page.




Poster by Kelvin Cardona; Summer 2007

Title: A Grid Based System for Data Mining Using MapReduce (2007)
By: Kelvin Cardona
Graduate mentor(s): Jimmy Secretan
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: We discuss a Grid data mining system based on the MapReduce paradigm of computing. The MapReduce paradigm emphasizes system automation of fault tolerance and redundancy, while keeping the programming model for the user very simple. MapReduce is built closely on top of a distributed file system, which allows efficient distributed storage of large data sets and allows computation to be scheduled close to this data. Many machine learning algorithms can be easily integrated into this environment. We explore the potential of the MapReduce paradigm for general large-scale data mining. We offer several modifications to the existing MapReduce scheduling system to bring it from a cluster environment to a campus grid that includes desktop PCs, servers and clusters. We provide an example implementation of a machine learning algorithm (the Probabilistic Neural Network) in MapReduce form. We also discuss a MapReduce simulator that can be used to develop further enhancements to the MapReduce system. We provide simulation results for two newly proposed scheduling algorithms, designed to improve MapReduce processing on the grid. These scheduling algorithms provide increased storage efficiency and increased job processing speed when used in a heterogeneous grid environment. This work will be used in the future to produce a fully functioning implementation of the MapReduce runtime system for a grid environment, which will enable easy, data-intensive parallel computing for machine learning with little to no additional hardware investment.
TR: Cardona, K., Secretan, J., Georgiopoulos, M., and Anagnostopoulos, G.C. (2007) A Grid Based System for Data Mining Using MapReduce, Technical Report TR-2007-02, The AMALTHEA Program, Summer 2007. [PDF]
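The abstract above mentions a Probabilistic Neural Network (PNN) expressed in MapReduce form. The following is a minimal single-process sketch of how such a decomposition could look. It is not the implementation from the technical report. The function names (`map_shard`, `reduce_class`, `pnn_classify`), the Gaussian kernel width `sigma`, and the equal-prior decision rule are all assumptions made for illustration. Each map call scores one shard of the training data against a query point, and the reduce step combines the per-class partial sums.

```python
import math
from collections import defaultdict


def map_shard(shard, x, sigma):
    """Map step (hypothetical): emit per-class partial kernel sums for one shard.

    shard is a list of (features, label) pairs; x is the query point.
    """
    partial = defaultdict(lambda: [0.0, 0])
    for features, label in shard:
        sq_dist = sum((a - b) ** 2 for a, b in zip(features, x))
        partial[label][0] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        partial[label][1] += 1
    return [(label, (s, n)) for label, (s, n) in partial.items()]


def reduce_class(values):
    """Reduce step (hypothetical): average the kernel responses for one class."""
    total = sum(s for s, _ in values)
    count = sum(n for _, n in values)
    return total / count


def pnn_classify(shards, x, sigma=1.0):
    """Run map over every shard, group by class label, reduce, and pick the best class.

    Assumes equal class priors, so the class with the largest mean kernel
    response wins.
    """
    grouped = defaultdict(list)
    for shard in shards:
        for label, value in map_shard(shard, x, sigma):
            grouped[label].append(value)
    scores = {label: reduce_class(values) for label, values in grouped.items()}
    return max(scores, key=scores.get), scores


# Two shards standing in for training data stored on two different machines.
shards = [
    [((0.0, 0.0), "a"), ((5.0, 5.0), "b")],
    [((0.1, 0.0), "a"), ((5.0, 5.1), "b")],
]
label, scores = pnn_classify(shards, (0.2, 0.1))
print(label)
```

Because each shard contributes only a per-class sum and count, the map calls are independent of one another, which is the property that lets a MapReduce runtime schedule them next to the data they read.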




Poster by Amy Hoover; Summer 2007

Title: NEAT Drummer: Interactive Evolutionary Computation for Drum Pattern Generation (2007)
By: Amy Hoover
Faculty mentor(s): Dr. Kenneth Stanley
Abstract: A major challenge in computer-generated music is breaking the barrier between musical novelty and musical quality. Typically, computer music generators produce either genre-specific patterns that lack innovation or patterns that are given too much freedom and lack cohesion. In an attempt to both constrain the musical search space and produce novel rhythms, a program called NEAT Drummer is introduced. NEAT Drummer uses the NeuroEvolution of Augmenting Topologies (NEAT) method to evolve neural networks that produce compelling drum patterns. To constrain the musical search space, NEAT Drummer accepts a base rhythm or motif from the user and, through Interactive Evolutionary Computation (IEC), complexifies that pattern with each successive generation. This work discusses the concepts behind how NEAT Drummer understands and manipulates a base rhythm, which is either predefined by the user through a basic interface or defined by MIDI music file information.
TR: Hoover, A.K., and Stanley, K.O. (2007) NEAT Drummer: Interactive Evolutionary Computation for Drum Pattern Generation, Technical Report TR-2007-03, The AMALTHEA Program, Summer 2007. [PDF]




Poster by Andrew Stiles, Brandon Schmitt & Tad Gertz; Summer 2007

Title: Testing and Improvement of the Triple Scoring Method for Applications of Wake-up Word Technology (2007)
By: Andrew Stiles, Brandon Schmitt & Tad Gertz
Graduate mentor(s): Tudor Klein
Faculty mentor(s): Dr. Veton Kepuska
Abstract: Constant monitoring of an individual’s voice, with near-perfect recognition of a specific word and consistent rejection of all other words, can be realized with Wake-Up Word (WUW) Speech Recognition (SR) technology. The algorithm shown here has the potential to add robustness even in a speaker-independent environment, and provides much better results for single-word recognition than current industry and academic standards such as Microsoft SAPI and HTK, respectively. With a Triple Scoring Method (TSM) implemented with Hidden Markov Models (HMM) in the feature domain, the WUW modeling results are found to be far superior in single-word recognition, providing a 15166.15% increase in correct recognition on the Callhome corpus over HTK and a 1303.78% increase over Microsoft SDK.
TR: Stiles, A., Schmitt, B., Gertz, F., Klein, T., and Kepuska, V. (2007) Testing and Improvement of the Triple Scoring Method for Applications of Wake-up Word Technology, Technical Report TR-2007-04, The AMALTHEA Program, Summer 2007. [PDF]




Poster by Maria Garcia & Jason Beck; Summer 2007

Title: A Backward Adjusting Strategy for the C4.5 Decision Tree Classifier (2007)
By: Maria Garcia & Jason Beck
Graduate mentor(s): Mingyu Zhong
Faculty mentor(s): Prof. Michael Georgiopoulos
Abstract: In machine learning, decision trees are employed extensively in solving classification problems. Producing a decision tree classifier involves two main steps. The first step is to grow the tree using a set of data, referred to as the training set. The second step is to prune the tree; this step produces a smaller tree with better generalization (smaller error on unseen data). The goal of this project is to insert an additional adjustment phase between the growing and pruning phases of a well-known decision tree classifier, the C4.5 decision tree. This additional step reduces the error rate (improving the generalization of the tree) by adjusting the non-optimal splits created in the growing phase of the C4.5 classifier. As a byproduct of our work, we also discuss how the decision tree produced by C4.5 is affected by changes to the C4.5 default parameters, such as CF (confidence factor) and MS (minimum number of split-off cases), and emphasize that CF and MS parameter values different from the default values lead to C4.5 trees of much smaller size and smaller error.
TR: Beck, J.R., Garcia, M.E., Zhong, M., Georgiopoulos, M., and Anagnostopoulos, G.C. (2007) A Backward Adjusting Strategy for the C4.5 Decision Tree Classifier, Technical Report TR-2007-01, The AMALTHEA Program, Summer 2007. [PDF]
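For readers unfamiliar with the growing phase that the abstract above refers to, the sketch below computes the gain ratio, the criterion C4.5 uses to rank candidate splits while growing a tree. It covers only that standard criterion for a discrete attribute. It does not implement the backward adjusting step from the technical report, nor pruning or the CF and MS parameters. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting on a discrete attribute.

    values[i] is the attribute value of case i, labels[i] is its class.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy remaining after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


labels = [0, 0, 1, 1]
informative = ["x", "x", "y", "y"]    # separates the classes perfectly
uninformative = ["x", "y", "x", "y"]  # tells us nothing about the class
print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

A split chosen greedily by this criterion at one node is not guaranteed to be the best choice for the tree as a whole, which is the kind of non-optimal split the adjustment phase described in the abstract is meant to revisit.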