Close
Home
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Flymodeller: an interactive machine learning platform for automatic fly behavior annotation
(USC Thesis Other)
Flymodeller: an interactive machine learning platform for automatic fly behavior annotation
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Flymodeller: an interactive machine learning platform for automatic fly behavior annotation Hetian Chen Master of Science (BIOSTATISTICS) University of Southern California May 13, 2016 1 ……………………………………………………………………………Table of Contents Abstract.……………………………………………………………………………………3 Introduction..………………………………………………………………………………3 Behavior Classification….……..……………………………………………………4 Support Vector Machines….………….……………………………………………..5 Random Forests…….……………….……………………………………………..10 Boosting……….…………………..……………………………………………… 11 KNN……..……….……………..………………………………………………… 12 Methods…………..………………………………………………………………………13 Flymodeller overview………………………………………………………………13 Behavior annotation..………………… ……………………………………………14 Feature selection……………………………………………………………………14 Train/Evaluate/Predict………………………………………………………………17 Working with different videos………………………………………………………17 Results……..………….…….……………………………………………………………18 Discussion…………….…………………………………….……………………………22 References……………..…………………………………………………………………26 2 Abstract Studies of animal behavior patterns and genetic determinants of those behaviors have recently received increased attention from researchers. Since manual annotation is labor- intensive, automatic animal behavior annotation software is beginning to emerge. A popular such tool is the JAABA(Kabra, et al., 2013) package, produced by Janelia Farms [inc.], which is a machine learning system for automatically computing interpretable, quantitative measures of animal behavior. In this thesis we present Flymodeller, an improvement to JAABA. It is an automatic animal behavior annotator that has faster computation speed, more flexible choices of machine learning algorithms, and provides a more efficient pipeline for training and prediction across multiple videos. Introduction The fruit fly, Drosophila melanogaster, is a model organism that has been studied in the scientific community for more than a century, since Thomas Hunt Morgan began using it for genetic studies in 1907(Garland, 1966). It has a relatively simple genome, which has been sequenced, is easy to care for, and has a short generation time that provides for rapid development. As such, the fruit fly serves as a good research model. However, flies are not solitary animals, they often form groups when eating, seeking mates, or laying eggs. When animals live in groups, they can unlock complex behaviors simply by interacting with their neighbors according to basic rules. Locusts can form crop-devastating, sun-blocking swarms(Lyons, 2012); starlings can form beautiful murmurations(Craig, et al., 2009); fish can solve problems as a shoal(Tan, et al., 2011). However, very little is known about the nature of these fascinating phenomena. What’s the innate cause of collective behavior? How, at an individual level, do the interactions of animals in a large group happen? Such knowledge might be applied to change the group dynamics, thereby influencing the behavior of animals. Therefore, studying collective behavior can be very practical and useful. For high-throughput study of animal behavior, a key is to properly video record the animals, and accurately extract animal information contained in the video. Recently, computer-vision techniques have emerged for automatically tracking the animals, transforming video data into trajectories of their positions over time. However, there is still a big gap between that trajectory data and the resulting animal behavior. Transforming the trajectory to meaningful quantitative measurements, such as speed, rotation angle, or 3 size, and building predictive models to infer behavior from these quantitative measurements, are two important steps in analyzing such data. Machine learning and traditional statistical methods are two ways to build prediction models. They have much in common, but follow different approaches [2] . Traditional statistical methods are more likely to adopt a data modeling perspective. For example, the analysis might start by proposing a stochastic data model, or linear regression model, for the 'black box’ between predictors and outcome. A common data model framework is that data are generated by independent draws from: Response variables = f(predictor variables, random noise, parameters) The values of the parameters are estimated from the data and the model then used to both better understand the process that resulted in the data and for prediction (e.g.,: linear regression, logistic regression, the Cox model, etc). In contrast, the algorithmic modeling culture is popular in machine learning. The analysis here considers the inside of the black box as complex and unknown, and not necessarily of interest. The approach is to find a function f(x) ----an algorithm that operates on data x - to predict responses y (e.g., decision trees, boosting, neural nets, etc). Therefore, it is popularly said in the statistics community that “machine learning is statistics minus any checking of models and assumptions”(Brian, 2004). Machine learning has developed rapidly in fields outside statistics. It can be used both on large complex data sets and sometimes as a more accurate and informative alternative to data modeling on smaller data sets(Breiman, 2001). In this section, we introduce the popular machine learning techniques that we will be employing in this thesis for the purpose of fly behavior classification. In that context, the underlying data model is often of no interest, but accurate annotation of data is crucial. The algorithms we discuss are support vector machines, boosting, random forests, and k-nearest neighbor classifiers. But before we do that, we discuss the goal of fly behavior classification. Behavior Classification: In machine learning and statistics, classification is the problem of identifying to which of a set of categories (‘sub-populations’) a new observation belongs, on the basis of a training 4 set of data containing observations (or ‘instances’) whose category membership is known. This will allow high-throughput analysis of fly video data (compared to existing approaches in which investigators manually record behavior). Normally, the training set is a set of input vectors x1,…,xn in R n , and the corresponding categories y1,…,yn are a subset of all possible categories. In this thesis, we are training classifiers to identify fly behavior of interest, and we focus on “chasing” as an example of this. Chasing is the act in which one fly chases another. Therefore, the outcome variable has two classes – “chasing” and “not chasing”—and labels are thus binary {+1,-1}. The input vectors are quantitative measures (speed, orientation, size, etc.) computed from the trajectories of flies. First, a predictive model is constructed using supervised machine learning methods, and then this model is used to classify new data points. We now introduce the machine learning approaches implemented in our work. Support Vector Machines[SVMs]: The concept of SVMs was first introduced in 1979(Vapnik, 1979) based on statistical learning theory developed and proposed by Vapnik and Chervonenkis in 1974(Vapnik, 1974) SVMs are computational algorithms that construct a hyperplane, or a set of hyperplanes, in a high or infinite dimensional space. Intuitively, separation between two linearly separable classes is achieved by any hyperplane that provides no misclassification for all data points of any of the considered classes. This approach is called linear classification. However, there are many hyperplanes that might classify the same set of data as can be seen in figure 1 (based on a figure from Thomé, 2012). SVMs are an approach in which the objective is to find the best separation hyperplane. That is, the 5 Figure 1. Separation hyperplanes. H1 does not separate the two classes; H2 separates but with a very tiny margin between the classes and H3 separates the two classes with much better margin than H2 hyperplane that provides the highest margin distance between the nearest points of the two classes. That is the reason why SVM is also called maximum margin classifier. This approach, in general, guarantees that the larger the margin is the lower is the generalization error of the classifier(Thomé, 2012). To find the hyperplane with the largest margin, we first define the hyperplanes as: , where w and b are, respectively, the vector normal to the hyperplane and the displacement of that vector relative to the origin. We classify data for which f(x) <-1 as y = -1 and data for which f(x) > 1 as y = 1. The margin between these two hyperplanes: f(x) = 1 and f(x) = -1, is then defined as: where ||w|| denotes the Euclidean norm of vector w. Therefore the decision boundary can be found by solving the following constrained optimization problem: However, if the data is not absolutely linearly separable, the above approach would fail, since no misclassified observations are allowed. To solve this problem, we introduce a slack variable (εi) to provide some freedom to the system, allowing some samples to not respect the original equations. This is called the “soft-margin” support vector machine(Vapnik, 1995), and the optimization problem becomes: where i is a non-negative slack variable that measures the degree of misclassification of the data xi, C is a regularization term, which provides a way to control overfitting, and n is the number of observations. Another way to deal with non-linear separable data is to use a non-linear decision boundary. 6 f (x)=w T x+b m= 2 ||w || minimize 1 2 ||w|| 2 subject to y i (w T x i +b)≥1 minimize 1 2 ||w|| 2 +C ξ i i=1 n ∑ subject to y i (w T x i +b)≥1−ξ i ,ξ i ≥ 0 The key idea is to transform data points (xi) to a higher dimension (φ(xi)) in which the data become linearly separable. This transformation process is called feature mapping. A polynomial feature mapping example is as follows: As shown above, mapping the given data from n (with n=2) dimension to m (with m=3) dimensional space helps to achieve linear separability. However, in practice, this mapping process is very computationally expensive when m is large. To avoid computing the coordinates of data in the higher dimensional space explicitly, we employ a “kernel trick”, which we now explain in detail. First we need some definitions. Defining ɸ as a non-linear feature mapping function, then the above soft-margin SVM optimization problem becomes: To solve this problem, we can form the Lagrangian: where ai, i, are Lagrange multipliers. Solving the above equation, we get: 7 φ :R 2 →R 3 (x 1 ,x 2 )→ φ (z 1 ,z 2 ,z 3 ) whereφ(x 1 ,x 2 )= (x 1 2 , 2x 1 x 2 ,x 2 2 )is a polynomial feature mapping function. minimize 1 2 ||w|| 2 +C ξ i i=1 n ∑ subject to y i (w T φ(x i )+b)≥1−ξ i ,ξ i ≥ 0 w= a i y i i=1 n ∑ φ(x i ), a i ≥ 0 λ i =C−a i ≥ 0 a i y i = 0 i=1 n ∑ L(w,b,ξ,a,λ)=C ξ i i=1 n ∑ + 1 2 ||w || 2 + a i i=1 n ∑ (1−y i (w T φ(x i )+b)−ξ i )− λ i i=1 n ∑ ξ i ,a i ≥ 0,λ i ≥ 0 ∇L(w,b,ξ,a,λ)= 0 Substituting the above equations back into the Lagrangian, we transform the original optimization problem to: This optimization problem is convex, and thus can be solved by many different algorithms such as Coordinate ascent(Vapnik, 1995) and Sequential minimal optimization (SMO) (Shalev-Shwartz, et al., 2013). To identify b, we need to use the Lagrangian again. At the optimal solution, we must have: If ai < C, then i = C - ai> 0, and thus i must be 0 to make i i = 0. If ai >0, then this observation will contribute to the calculation of w. Data points with ai >0 are thus called “support vectors”. Taken together, the optimal b can be computed by taking a majority vote of the support vectors as: where m is the number of support vectors. With the above definitions, we finally are ready to define the “kernel tricks”. Suppose we have fit our model’s parameters to a training set, and wish to make a prediction at a new input x. We would then calculate w T ɸ(x)+b, and predict y = 1 if this quantity is bigger than 0. But if we substitute w with , we get: 8 L(a,λ)= a i i=1 n ∑ − 1 2 a i a j y i y j i,j=1 n ∑ φ(x i )φ(x j ) max a L(a,λ) subject to0≤a i ≤C, a i y i = 0 i=1 n ∑ λ i ξ i = 0 a i (1− y i (w T φ(x i )+b)−ξ i )= 0 1− y i (w T φ(x i )+b)= 0⇒b= y i −w T φ(x i ) as y i ∈{−1,1} b= 1 m (y i −w T φ(x i )) i=1 m ∑ w T φ(x)+b= ( a i y i φ(x i )) T i=1 n ∑ φ(x)+ 1 m (y j −( a i y i φ(x i )) T i=1 n ∑ φ(x j )) j=1 m ∑ = a i y i < i=1 n ∑ φ(x i ),φ(x)>+ 1 m (y j − a i y i <φ(x i ), i=1 n ∑ φ(x j )>) j=1 m ∑ w= a i y i i=1 n ∑ φ(x i ) where <x1,x2> denotes the inner product of vector x1,x2, which is equal to with k being the dimension of vector x1 and x2. As we can see from the above equation, in order to make a prediction we only need to compute the inner product between the new observation x and support vectors as well as the inner product between the support vectors. We define a function K(x,z), such that K(x,z) = <ɸ(x),ɸ(z)>. Therefore, the above equation becomes: In this case, K(x,z) is called a kernel function (which must be positive semi-definite), and each kernel function corresponds to a specific feature mapping function. For example, if we define K(x,z) = (xz) 2 , then for x and z ∊ R 2 ,we have: As shown above, K(x,z) = (xz) 2 corresponds to a feature mapping from R 2 to R 3 . If we replace <ɸ(x),ɸ(z)> with K(x,z), we are able to compute the inner product of features in 3 dimensional space without computing the corresponding coordinates of the 2D features in the 3D space first. This is called the “kernel trick”, which helps reduce the computational complexity. The computation time saved by the kernel trick becomes significant when the features are mapped to a very high dimensional space. Let’s illustrate this by applying a simple RBF kernel function K(x,y) to points in R 2 . 9 x 1i i=1 k ∑ x 2i w T φ(x)+b= a i y i K( i=1 n ∑ x i ,x)+ 1 m (y j − a i y i K(x i , i=1 n ∑ x j )) j=1 m ∑ K(x,z)= (x 1 z 1 +x 2 z 2 ) 2 = x 2 1 z 2 1 +x 2 2 z 2 2 + 2x 1 z 1 x 2 z 2 = (x 2 1 , 2x 1 x 2 ,x 2 2 ) T (z 2 1 , 2z 1 z 2 ,z 2 2 ) =φ(x) T φ(z) whereφ :R 2 →R 3 (x 1 ,x 2 )→ (x ' 1 ,x ' 2 ,x ' 3 ):= (x 2 1 , 2x 1 x 2 ,x 2 2 ) K(x,y)= exp(−(x−y) T (x−y)) = exp(−x 2 1 +2x 1 y 1 −y 2 1 −x 2 2 +2x 2 y 2 −y 2 2 ) = exp(−x T x)exp(y T y)exp(2x T y) = exp(−x T x)exp(y T y) (2x T y) n n! n=0 ∞ ∑ = (( 2exp(−x T x) n! n x) T ( 2exp(y T y) n! n y)) n n=0 ∞ ∑ Therefore, the above RBF kernel corresponds to a mapping from 2 dimensional space to infinite dimensional space. We resort to the kernel trick to do the mapping implicitly since it’s computationally impossible to map explicitly. Random Forests: Random forests [RFs](Breiman, 2001) were originally conceived as a method of combining several CART(Breiman, 1984) (“Classification and Regression Tree”) style decision trees using bagging(Breiman, 1996). Theearly development of RFs was influenced by the random subspace method of Ho(Ho, 1998), the approach of random split selection from Dietterich(Dietterich, 2000) and the work of Amit & Geman(Amit Y , et al., 1997) on feature selection. Several of the core ideas used in RFs are also present in the early work of Kwokt & Carter (Kwokt, et al., 1988) on ensembles of decision trees. A random forest is a classifier consisting of a collection of decision trees {h(x,k) k = 1,2, ….}, where the {k} are independent identically distributed random vectors and each tree casts a unit vote to determine the most popular class among all trees in the forest, given input x. For a given number of trees, T, here is how such a system is trained: 1. Sample N cases at random with replacement to create a subset of the data (see top layer of figure 2). The subset should be about 66% of the total set. 2. At each node: 2.1. For some number m (see below), m predictor variables are selected at random from all the predictor variables. 2.2. The predictor variable that provides the best split, according to some objective function, is used to perform a binary split on that node. 2.3. At the next node, choose another m variables at random from all predictor variables and do the same. Depending upon the value of m, there are three slightly different systems: 1. Random splitter selection: m =1 2. Breiman’s bagger: m = total number of predictor variables 3. Random forest: m << number of predictor variables. Brieman suggests three possible values for m: ½√m, √m, and 2√m 10 When new input is entered into the system, it is run down all of the trees. The result may either be an average or weighted average of all of the terminal nodes that are reached, or, in the case of categorical variables, a voting majority. Random forest runtimes are quite fast, and they are able to deal with unbalanced and missing data. The weaknesses of Random Forests are that when used for regression they cannot predict beyond the range in the training data, and that they may over-fit data sets that are particularly noisy(Khan, et al., 2014). Boosting: Boosting is a class of machine learning methods based on the idea that a combination of simple classifiers (obtained by a weak learner, see below) can perform better than any of the simple classifiers alone. A weak learner is a learning algorithm capable of producing classifiers with a probability of error strictly (but only slightly) less than that produced by random guessing (0.5, in the binary case) (Ferreira, et al., 2012). The idea of building ensembles of classifiers has gained popularity in the last decade(Kucheva, 2004); the rationale is that it may be easier to train several simple classifiers and combine them into a more complex classifier than to learn a single complex classifier. As shown in Figure 3, Hm(x) denotes the m-th weak binary classifier and x is some input pattern to be classified. There are many ways to combine these single weak classifiers. For example, assuming that classifier errors are independent of each other, a majority vote combination should yield a lower error rate than any of the individual classifier. Therefore, the key to boosting is to find out the appropriate weight for each weak learner. Considering a weighted linear combination of the outputs of the weak classifiers, the ensemble prediction function H is given by: 11 H(x)=sign( a m H m (x) m=1 M ∑ ) Figure 3. The concept of ensemble of classifiers. The outputs of the weak learners Hm(x) with m ∈ {1,…,M} are combined to produce the output of the ensemble of classifiers given by H(x) where a1,…., aM is a set of weights (a simple majority vote results if all the weights are equal). Among the many different ways in which ensembles of classifiers can be trained and combined, boosting techniques exhibit, in addition to good practical performance, several theoretical and algorithmic features that makes them particularly attractive(Hastie, et al., 2001; Meir et al., 2006; Schapire, 2002). Essentially, boosting consists of repeatedly using the base weak learning algorithm, on differently weighted versions of the training data, yielding a sequence of weak classifiers that are combined as in Figure 2. The weighting of each instance in the training data, at each round of the algorithm, depends on the accuracy of the previous classifiers, thus allowing the algorithm to focus its attention on those samples that are still incorrectly classified. There are several variants of boosting algorithms, that differ in their choice of base learners and in the criteria used for updating the weights of the training samples. AdaBoost is arguably the best known boosting algorithm. Some popular variants are Real AdaBoost(Friedman, et al., 1998), LogitBoost(Friedman, et al., 2000), Gentle AdaBoost(Schapire, et al., 1999). K-Nearest Neighbor: K-nearest-neighbor (KNN) classification is one of the most fundamental and simple classification methods. The concept of a nearest neighbor decision rule was first proposed by Cover & Hart in 1967(Cover, 1967). As a non-parametric lazy learning algorithm, KNN makes no assumptions on the underlying data distribution, which make it one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. KNN classification assumes that data points are in a metric space, and thus the points have a notion of distance (e.g., Euclidean distance). The training data consists of a set of vectors and a class label associated with each vector. We also define a single number, k. k determines how many neighbors (defined based on the distance metric) influence the classification. For the purpose of classification, an object is classified by a majority vote of the “k” nearest neighbors (based on particular distance measure) in the training set. In summary, KNN is a simple and powerful algorithm. Since no training or tuning of complex parameters is needed, new training examples can be added easily. However, it is relatively expensive and slow in terms of speed and memory usage compared with other 12 algorithms. For a training set of m examples and d dimensions, the running time to determine the classification of a new sample is O(md). In this case, pre-sorting training examples into fast data structures, or computing only an approximate distance, may significantly improve algorithm performance. Methods Flymodeller overview In this thesis we report on the development of our “Flymodeller” software package. Flymodeller is a machine learning based system to annotate and model fly behaviors. It is written in Python under the OpenCV and Qt5 framework. This platform inherits most of the features from JAABA (Kabra, et al., 2013), but makes important improvements in terms of faster computation speed, more flexible choices of machine learning algorithms, and more efficient pipeline to train or predict across multiple videos (most experiments consist of more than one video). The input data of Flymodeller are trajectories of flies obtained from image processing. These trajectory data will then be transformed into efficient general-purpose quantitative measures—“per-frame” features that describe the state of the animal in the current frame (e.g. size, orientation, speed). Flymodeller can compute the same set of per-frame features as JAABA does (some 50+ features) at a computational speed that is 28% faster. This performance increase is achieved by converting the original python code to C through Cython. Another difference between JAABA and Flymodeller is that JAABA requires computing all the “per-frame” features beforehand. However, we are not likely to use all 13 Figure 4. The concept of KNN classifier. Samples in red, blue and green belongs to three classes respectively. With Euclidean distance as the distance measure, and k=5 as the number of neighbors, Xu is classified as red by majority vote of the nearest 5 points (4 points in red and 1 point in green). these features in the training phase. Therefore, to save time, Flymodeller allows the user to start annotating the video without computing those time-consuming “per-frame” features, and only computes them when they are actually needed in the model. Behavior annotation To obtain the training data, we first manually label a number of bouts of behavior of interest (e,g, chase), where we are certain the fly is performing that behavior, and a few nearby periods in which the behavior is not being performed. As shown in Figure 5, Flymodeller allows the user to add new behaviors, or switch between multiple behaviors of interest, during the labeling process. This ability makes the labeling process more efficient than that of JAABA. Feature selection Before we train the model, it is important to screen out irrelevant predictors, so far as is possible. In theory, we can select all per-frame features as predictors, and let the algorithm 14 Figure 5. Flymodeller panel to label, train and predict for one fly. assign appropriate weight to them. However, this approach can lead to very high dimensional datasets, which negatively influences the speed and accuracy of the algorithm. Though we can later consider using dimension reduction techniques (e.g.principal components) to reduce the dimension, it is still best to begin by excluding obviously irrelevant features. First of all, we can remove irrelevant features by our understanding of the behavior of interest and the information captured by the per-frame features. Per-frame features are grouped into 4 categories: Appearance, Locomotion, Social, and Identity. “Appearance” features capture the appearance of the fly, and contains information such as size, or the eccentricity of the animal based on the ellipse fitted to the animals. (Flies are recognized in the image processing software by having an ellipse fitted to them.) “Locomotion” features capture the locomotive properties of the animal and describe aspects such as the speed, change in orientation, and location of center of rotation. “Social” features capture the social properties of the animals, such as the distance to the closest animal, and speed of movement towards the closest animal. “Identity” features identify the nearest animal based on different criteria, such as the distance between the centers, or distance between the ‘nose ‘of the current animal and any point on the ellipse of the other animal. These are used as intermediate features to compute various social per-frame features. It seems reasonable to exclude “Appearance” features from the “chasing behavior” model, so we do so. Flymodeller provides two sample t-test/Wilcoxon Rank Sum tests and Box-plots to help the user screen out further irrelevant features (See Figure 6). If a per-frame feature is not at all different between “behavior” and “non-behavior” groups, then it may not contribute to the model. Since most behaviors consist of a movement pattern that occurs within some consecutive number of frames, we also need features that can represent the movement of animals across a number of frames. Therefore, a general set of “window features” (e.g. mean, max, min, std) that provide temporal context from a window around the current frame were computed. These window features can help describe the distribution of per-frame features in time windows around the current frame. Obviously, when the window radius is set as 0, the window-feature is the same as the per-frame feature. In this thesis, we selected 0 and 5 as our window radius. 15 16 Figure 6. Flymodeller panel to select per-frame and window Figure 7. Flymodeller panel to select algorithms and set Figure 8. Flymodeller panel to show cross validation results. Train/Evaluate/Predict The last step before we train the classifier is to select algorithms. In contrast to JAABA, Flymodeller provides the user with a large choice of algorithms, including support vector machines (SVM), boosting, logistic regression (Logistic), (See Figure 7). These algorithms are implemented by the open source Python machine learning package, Scikit- learn(Pedregosa et al., 2011). For the sake of simplicity, default hyper-parameters for each algorithms are pre-configured for the user. Training results, including selected hyper- parameters, a confusion matrix, accuracy, sensitivity, specificity and an ROC curve for each algorithm are shown in a number of pop-up windows. We can then save the model in a file by clicking the “save” button. (See Figure 8). If several models are saved, they can be loaded to Flymodeller (Menu Bar->Edit-> Import Models). We can then use these models to classify all flies for the current project (Menu Bar->Classifier-> Predict All Flies). Prediction results will be saved as .csv file in the current project folder. Moreover, as shown in Figure 5, to view the prediction results of different algorithms, we can load several of them into Flymodeller at the same time, which enables a side-by-side comparison. Working with different videos In some cases, a behavior may be rare, and we are unable to collect enough data to train a classifier in one video. In other cases, we have a good model and want to apply it to multiple videos. To make the above steps more efficient, we also developed two plugins for Flymodeller: multiTrain and multiPredict. As shown in Figure 9, multiTrain makes it possible to easily combine labels from different videos to train a single classifier. multiPredict helps the user apply different classifiers to multiple videos. 17 Figure 9. Flymodeller plugins: multiTrain and multiPredict Results We start by comparing the most time-consuming step of both Flymodeller and JAABA — the time spent computing the necessary pre-frame features for model building. JAABA requires all per-frame features to be computed before any labeling and training. For a sample trajectory input file with 18 files and 17999 frames for each fly, it takes around 50 minutes to generate all 54 per-frame features (7 Appearance features, 19 Locomotion features, 24 Social features, and 4 Identity features). In contrast, Flymodeller only takes ~7 minutes to prepare the necessary per-frame features to train models. Flymodeller works more efficiently due to its different workflow and because of optimizations to feature computation carried out through Cython. First, Flymodeller enables the user to start labeling frames with Appearance and Locomotion features pre-computed, which takes only 30 seconds. Second, Flymodeller only computes the Social features that are chosen by users. Third, for some extremely time consuming features, Flymodeller only computes the data for the flies with labels, because data for other flies won’t be used in the model building process. In this sample analysis, 13 out of 24 Social features are selected for the following two sample t-test/Wilcoxon Rank Sum tests to determine whether these features are candidates for inclusion in the final model. This process takes only 6 minutes. To test our platform we obtained 4553 frames of data labeled by flymodeller: 2999 of them are “chasing” frames, and the remainder are “non-chasing” frames. We then train a classifier to identify this behavior. Two different datasets are used to train the “chasing” classifier. For the first dataset, we selected almost all Locomotion and Social features, thus getting a dataset of size 4553 × 80. For the second dataset, we first excluded features with p-value < 0.05 when testing for effects on labeled behavior, then further excluded highly correlated features to prevent collinearity issues in model building. The size of the second dataset is 4553 × 24. To compare our platform with JAABA, we used JAABA to label the same 4553 frames, obtaining the same 4553 × 80 and 4553 × 24 datasets. The following tables show the performance of the various algorithms implemented by Flymodeller, and the Boosting algorithm implemented by JAABA, on the two different datasets (4553× 80 and 4553 × 24). We compared a Support Vector Machine that used a linear model, K-Nearest Neighbors with k = 5, Boosting implemented by both Flymodeller and JAABA, Random Forest analysis, and Logistic Regression algorithms. To evaluate performance we look at the average 3-fold cross validation scores: Accuracy, Sensitivity, Specificity, Precision, F score, and Area Under Curve [AUC]. Here, “Accuracy” is defined as the number of true positives and true negatives divided by the 18 total number of classifications. We primarily focus on Accuracy and AUC. Since it is an unbiased estimate of overall performance, Accuracy is a reasonable measure to use to provide an overall rank of the algorithms. AUC is the area under the operating characteristic curve, which is created by plotting the sensitivity (true positive rate) vs 1- specificity (false positive rate) at various cut points. In the context of classification, we naturally set the cut point at 0.5, which means we classify outcomes with probability less than 0.5 as negative outcomes (e.g. “not chasing”) and those higher than 0.5 as positive outcomes (e.g. “chasing”). For an ideal model, we want both high sensitivity and high specificity. However, there is a trade off between sensitivity and specificity as increasing the cut point leads to higher sensitivity but lower specificity, and vise-versa if we decrease the cut point. ROC analysis provides a more general way to evaluate the performance of the algorithms. Higher values of AUC means the algorithm is more likely to rank a randomly chosen positive example higher than a randomly chosen negative example. Results are shown in figures 11, 13, and 15. 19 Table 1. Performance of algorithms on data 4553 × 24 Table 2. Performance of algorithms on data 4553 × 80 a Measure was not available As shown in Table 1, among the six algorithms, the Support Vector Machine (using a linear kernel) gives the highest accuracy (0.853) and AUC (0.914). Based on the cross validation scores, the overall performance of the five algorithms is ranked as: SVM(linear kernel) > Logistic Regression > FlyModeller Boosting ≈ JAABA Boosting > KNN (k=5) > Random Forest . The highest specificity scores (0.866) are obtained by SVM (linear kernel), which reflects very good performance in terms of predicting when chasing does not occur. Flymodeller Boosting gives the highest sensitivity scores (0.867), which indicates that a great percentage of chasing behaviors are correctly identified. The highest AUC score (0.914) is obtained by SVM (linear kernel). The interpretation of this is that the probability that the classifier will assign a higher score to a randomly chosen “chasing” example than to a randomly chosen non-chasing example is 0.914. If we compare the cross validation scores of datasets 4553 × 80 (Table 2) and 4553 × 24 (Table 1), we find that excluding “seemingly unimportant” per-frame features does not negatively effect model performance. In contrast, there are many advantages associated with the now smaller dataset. First of all, it allows for faster computation. This fact is most clearly seen in the SVM (linear kernel). In our example, it completes in just a few seconds with 24 features in the model, but takes 30 minutes to converge when 80 features were used. Second, we obtain much more parsimonious models at the cost of only a small decrease in accuracy in some cases. The performance of KNN and Boosting decreased slightly after excluding the “unimportant” features. In contrast, the overall performance of Random Forest and Logistic Regression is actually increased. 20 Figure10. confusion matrix for different algorithms on data 4553 × 24 To make these two datasets more comparable, we reduced the dimension of dataset 4553 × 80 to 4553 × 24 by Principal Component Analysis (PCA). We have seen that the performance of the Random Forest and Logistic Regression on dataset 4553 × 80 is worse than that on dataset 4553 × 24. We wish to investigate whether the performance of these algorithms on dataset 4553 × 80 after PCA improves and possibly even becomes better than that for the original 4553 × 20 dataset. As shown in Table 3 (Boosting results by JAABA are not available since JAABA is not able to perform PCA), PCA does help improve the performance of Random Forest and Logistic Regression, but the improvement is not great. The accuracy and AUC scores of these two algorithms on dataset 4553 × 80 after PCA are still lower than the original dataset 4553 × 20. The conclusion from this is that excluding irrelevant features from model fitting by using prior knowledge can be helpful in this example at least. Even though machine learning algorithms are supposed to be robust, blindly throwing in features all at once is not necessarily a smart strategy. Finally, to demonstrate Flymodeller’s ability to work with multiple videos, the Boosting classifiers we trained in the above example using Flymodeller and JAABA were applied to 3 different videos, each of which contains 18 flies with 17999 frames each. With Flymodeller’s “multiPredict” plugin (Figure 9), users can select model and features once 21 Figure 11. ROC for different algorithms on data 4553 × 24 for use all three videos. It takes 30 minutes for Flymodeller to complete the prediction. In contrast, applying classifiers one-at-a-time using JAABA is much more labor intensive. First, the user must wait for about 50 minutes for each video project to compute all pre- frame features, with most of them being unused. Second, the user must select a model, per- frame features and window features three times. In total, it takes JAABA 3 hours to complete the prediction. Thus, Flymodeller has great advantages compared to JAABA when working with multiple videos. Discussion In this thesis, we presented Flymodeller as a tool to perform automatic fly behavior annotation. We demonstrated how to use it to 1) obtain a training data set, 2) select features, 3) train models, 4) evaluate performance, and 5) make predictions. Most importantly, we also discussed the limitations of the most popular (indeed only) existing program JAABA and how these limitations are improved upon by Flymodeller. We then gave a demonstration of an example application, demonstrating that machine learning algorithms can be used to successfully predict fly behavior, and how to select features 22 Figure 12. confusion matrix for different algorithms on data 4553 × 80 23 Figure 13. ROC for different algorithms on data 4553 × 80 Figure 14. confusion matrix for different algorithms on data 4553 × 80 after PCA wisely. The algorithms we compared were: Support Vector Machines, Boosting(JAABA and Flymodeller), Random Forest, K-Nearest Neighbors, and Logistic Regression. This provides a major step forward for investigators who wish to conduct high-throughput analysis of behavioral data. 24 Table 3. Performance of algorithms on data 4553 × 80 after PCA a Measure was not available Figure 15. ROC for different algorithms on data 4553 × 80 after PCA A key goal in behavioral analyses is the ability to annotate large quantities of behavioral data. For a one hour video containing 18 flies, we have 10,800 frames for each fly, and in total 194,400 frames to label. It is clearly impossible to do this manually even for a single video, which makes automatic annotation software vital. JAABA made the concept of automatic behavior annotation feasible by introducing interpretable, quantitative measures (per-frame features) to capture animal dynamics in each frame, and “window” features which provide temporal context for windows around the each frame. However, its slow computation speed, limited algorithm options, and the inconvenience of using it to work with multiple videos make JAABA inefficient for high throughput analyses. These issues motivated us to develop Flymodeller, as an improvement to JAABA. As shown in the results, Flymodeller was 7 times faster (7 minutes versus 50 minutes) than JAABA when training a “chasing” classifier, and 6 times faster (30 minutes versus 3 hours) when making predictions for multiple videos using a given classifier. When comparing machine learning methods, in both 4553× 80 and 4553 × 24 data sets, Boosting implemented by Flymodeller and JAABA had very similar results, as would be expected. Moreover, since it contains more algorithm options, Flymodeller is more likely to be able to build good behavioral classifiers. In general the machine learning algorithms all performed relatively well. Among these five learning algorithms, SVM with linear kernel gave best cross-validation scores but was also most influenced by feature dimension. As expected, adding “unimportant” features did not help improve the model much and reduced performance in some algorithms. Another important point is the correct way to calculate PCs in this context. Given a dataset whose dimension we wish to reduce, we cannot perform the PCA first and then use the newly generated dataset to do cross-validation. This is incorrect since the testing data then “leaks” into the training data, since the PCs are calculated from the two combined. Thus, we would utilize information that should be unknown during validation. The correct way to perform PCA in a cross-validation setting is as follows: calculate the principal components, W, on the training dataset first, and then project training (TrW) and testing datasets (TeW) to the lower dimension, both using these principle components, where Tr and Te represent the training and testing matrix respectively. Only in this way can we guarantee that the data are projected to the same axes, and thus be comparable. In conclusion, we have described the Flymodeller platform that we have developed. We have shown how Flymodeller can be used as an interactive machine learning platform for 25 automatic fly behavior annotation. It significantly extends the options offered by the industry standard software tool in this context: JAABA. We demonstrated that with appropriate feature selection many of the machine learning algorithms performed well when classifying “chasing” behavior. SVM with linear kernel gave the best overall performance on our example data, but a more rigorous analysis is needed to determine whether this remains true for other behaviors, other fly genotypes, or other experimental conditions. We note that a subset of the results reported in this thesis appeared in the publication(Chen, et al., 2015). References Amit, Y . and Geman, D. Shape quantization and recognition with randomized trees. Neural Computation, 9:1545–1558, 1997. Breiman, L., Friedman, J., Stone, C., and Olshen, R. Classification and Regression Trees. CRC Press LLC, 1984. Breiman, L. Bagging predictors. Machine Learning, 24(2):123– 140, 1996. Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001. Breiman, Leo. Statistical Modeling: the two cultures. Statistical science, 2001, V ol.16,No. 3,199-231. Brian D. Ripley (about the difference between machine learning and statistics) useR! 2004, Vienna (May 2004). Chen, H; Marjoram, P; Foley, B; Ardekani, R. Trajectory evaluation and behavioral scoring using JAABA in a noisy system. Emerging Trends in Image Processing,Computer Vision and Pattern Recognition (ISBN: 9780128020920), Chapter 17. (March, 2015) Cover TM, Hart PE. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27. doi: 10.1109/TIT.1967.1053964 Craig, Adrian; Feare, Chris (2009). "Family Sturnidae (Starlings)". In del Hoyo, Josep; Elliott, Andrew; Christie, David. Handbook of the Birds of the World. V olume 14: Bush- shrikes to Old World Sparrows. Barcelona: Lynx Edicions. pp. 654–709. ISBN 978-84-96553-50-7. 26 Dietterich, T. G. An experimental comparison of three meth- ods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, pp. 139–157, 2000. Ferreira, Artur and Figueiredo, Ma ́rio, Boosting Algorithms: A Review of Methods, Theory, and Applications. Ensemble Machine Learning, Chapter 3. pp. 35-85, 2012. Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert (1998). "Additive Logistic Regression: A Statistical View of Boosting". CiteSeerX: 10.1.1.51.9525. Friedman, Jerome, Hastie, Trevor and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2), 2000. 337–407. Garland E. Allen. Thomas Hunt Morgan and the Problem of Sex Determination, 1903-1910. Proceedings of the American Philosophical Society. V ol. 110, No. 1 (Feb. 18, 1966), pp. 48-57. Ho, T. K. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998. Khan, Sushmita Laila, and Nusrat Jahan Feroz. Analysis and manipulation of the data set obtained from the study of pre-primary education development using machine learning. Diss. BRAC University, 2014. Kabra, Mayank, et al. "JAABA: interactive machine learning for automatic annotation of animal behavior." nature methods 10.1 (2013): 64-67.APA. Kwokt, S. W. and Carter, C. Multiple decision trees. In Uncer- tainty in Artificial Intelligence, pp. 213–220, 1988. Kuncheva, L. Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004. Hastie, T, Tibshirani, R, and J. Friedman. The Elements of Statistical Learning. Springer, 2nd edition, 2001. Lyons, Chuck (5 February 2012). "1874: The Year of the Locust". HistoryNet. Retrieved 2015-06-17. Meir, R and Ra ̈ tsch, G. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Advanced Lectures on Machine Learning. Springer Verlag, 2006. Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830, 2011. 27 Platt, J. C. Sequential minimal optimization: A fast algorithm for training support vector machines. Microsoft Res. Redmond WA, Tech. Rep. MSR-TR-98-14, 1998. Schapire, Robert E., and Yoram Singer. "Improved boosting algorithms using confidence- rated predictions." Machine learning 37.3 (1999): 297-336. Shalev-Shwartz, Shai, and Tong Zhang. "Stochastic dual coordinate ascent methods for regularized loss." The Journal of Machine Learning Research 14.1 (2013): 567-599. Schapire, R. The boosting approach to machine learning: An overview. In Nonlinear Esti- mation and Classification, Berkeley, 2002. Springer. Thomé, Antonio Carlos Gay. SVM Classifiers – Concepts and Applications to Character Recognition. Advances in Character Recognition. November 7, 2012. Vapnik, V . N., and A. Ya. Chervonenkis. Teoriya raspoznavaniya obrazov: Statisticheskie problemy obucheniya. (in Russian) [Theory of pattern recognition: Statistical problems of learning]. Moscow: Nauka, 1974. Vapnik, V .. Estimation of Dependences Based on Empirical Data [in Russian]. Moscow: Nauka, 1979. Vapnik, V . (1995). "Support-vector networks". Machine Learning 20 (3): 273. doi:10.1007/ BF00994018. Tan, Ying, Shi, Yuhui, Chai, Yi, Wang, Guoyin. Advances in Swarm Intelligence, Part II: Second International Conference, ICSI 2011, Chongqing, China, June 12-15, 2011, Proceedings. 28
Abstract (if available)
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Analysis of factors associated with breast cancer using machine learning techniques
PDF
Animal behavior pattern annotation and performance evaluation
PDF
Nonlinear modeling and machine learning methods for environmental epidemiology
PDF
Sequential analysis of large scale data screening problem
PDF
Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis
PDF
DNA methylation review and application
PDF
Psychometric study of an English version of Perceived Stress Scale in minority adolescents
PDF
Machine learning paradigms for behavioral coding
PDF
Novel statistical and computational methods for analyzing genome variation
PDF
Predicting autism severity classification by machine learning models
PDF
Using artificial neural networks to estimate evolutionary parameters
PDF
Machine learning approaches for downscaling satellite observations of dust
PDF
An analysis of conservation of methylation
PDF
Automatic tracking of flies and the analysis of fly behavior
PDF
Bone mineral density is associated with carotid atherosclerosis in healthy postmenopausal women: a longitudinal analysis of randomized clinical trial data
PDF
Disease risk estimation from case-control studies with sampling
PDF
Comparison of participant and study partner predictions of cognitive impairment in the Alzheimer's disease neuroimaging initiative 3 study
PDF
The application of machine learning in stock market
PDF
Shortcomings of the genetic risk score in the analysis of disease-related quantitative traits
PDF
Evaluating the effects of testing framework and annotation updates on gene ontology enrichment analysis
Asset Metadata
Creator
Chen, Hetian
(author)
Core Title
Flymodeller: an interactive machine learning platform for automatic fly behavior annotation
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Publication Date
02/18/2016
Defense Date
11/12/2015
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
animal behavior,boosting,Drosophila melanogaster,image processing,KNN,machine learning,OAI-PMH Harvest,random forests,SVMs
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Marjoram, Paul (
committee chair
), Mack, Wendy (
committee member
), Siegmund, Kimberly (
committee member
)
Creator Email
071cht@gmail.com,hetianch@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-208432
Unique identifier
UC11278547
Identifier
etd-ChenHetian-4112.pdf (filename),usctheses-c40-208432 (legacy record id)
Legacy Identifier
etd-ChenHetian-4112.pdf
Dmrecord
208432
Document Type
Thesis
Format
application/pdf (imt)
Rights
Chen, Hetian
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
animal behavior
boosting
Drosophila melanogaster
image processing
KNN
machine learning
random forests
SVMs