Animal Behavior Pattern Annotation and Performance Evaluation

Ye Meng
Professor Paul Marjoram
March 24, 2014

Table of Contents

Abstract
Introduction
Classification
    K-Nearest Neighbors
    Boosting
    Support Vector Machines
Training/Validation Dataset
Principal Component Analysis
Data Preprocessing
Methods
    JAABA Software
    Algorithms
    Performance Evaluation
Results
Discussion
References
Abstract

Learning animal behavior patterns, such as the social interactions of flies, is of great interest, both in terms of improving our understanding of how those patterns emerge and as a precursor to looking for genetic determinants of those behaviors. Manual annotation is labor-intensive, so automated techniques are much needed. A popular tool for semi-automatic annotation of fly behavior, based upon video tracking data, is the JAABA software package. In this thesis we propose methods based on machine learning techniques to refine the analysis of sample data from a JAABA document. Machine learning algorithms are applied to behavior outcome classification, since they are ideal for large datasets. We do this using a self-written Python package. In our case, flies are characterized via approximately 50 per-frame features. Those features capture relevant aspects of the tracking data. While no single feature can predict behavior by itself, combinations of features are able to do so. We demonstrate how to do this, thereby offering scope for improving both the accuracy and the speed of classification in future.
 

 
Introduction

Group behaviors of large aggregations of animals, and the interactions between them, are fascinating natural phenomena. Particularly interesting is the situation in which animals self-organize their behavior into complex patterns with no need of external stimulus or control. Such phenomena are known as "emergent behaviors". However, very little is known about the nature of such interactions. How do simple animals exhibit complex behaviors? Are there leaders who trigger such behaviors, or do all animals act under the same set of rules? Such knowledge might reasonably be applied to influence the behavior of animals, such as leading a school of fish away from polluted waters, say, or to control aggregates of robotic individuals, such as drones. For reasons such as this, understanding group behaviors is a worthwhile goal to aim at.
Since we cannot typically ask a fish why it is schooling, or a fly why it is on one food patch rather than another, we need another way to get inside the creature's mind. We can't ask it what it is doing, but we can tell a story that we believe might capture the underlying processes. We can then use a simulation model to explore whether simulated animals following that set of rules do in fact exhibit the behavior we observe in real data.
A key problem here is how to go about relating observed behavior from animals in the field or lab to the simulated behavior observed in our simulated animals. In days gone by, such an analysis involved large numbers of researchers, with large numbers of clipboards, even greater patience, and a willingness to observe animals for long periods of time, taking copious notes on what they saw. Such a strategy is clearly not viable, or desirable, in the modern era of "Big Data". For this reason, we look for more automated ways to proceed.
Machine learning might indicate the way patterns of animal behavior emerge from local rules of interaction among individuals. Flies are experimental animals that are simple to cultivate, and they exhibit social behavior. Using video cameras, it is possible to investigate their interactions and distances from each other in each frame during their time in the field of vision. We see that groups of flies are formed, and that the nature of those groups changes with genotype, for example. Other prominent examples of aggregation behavior are bird flocks, fish schools, and mammal herds. Apart from its obvious relevance in genetics, neurobiology, and evolutionary biology, collective behavior is a key concept in many other fields of science, including economics, control theory, and the social sciences. There are open-source software tools that allow biologists to encode their intuition about the structure of behavior and to transform the records of motion recorded by tracking technology into higher-order, scientifically meaningful statistics of behavior.
Traditionally, statistical analysis relies upon an underlying theoretical model, or set of assumptions, and usually the analysis is designed around that principal theory, for example to test specific hypotheses about a data set of interest. In contrast, with Machine Learning (ML) we take a different perspective, in that we have an observed outcome and our goal is to find a way to predict that outcome using the underlying data. In our context, the outcome is whether a fly performs a given behavior or not in a given frame, and the underlying data is the combination of features that describe the animal's behavior at that moment. With machine learning, we typically do not have to make underlying assumptions, such as a normality assumption in a linear regression. What makes ML a particular field is partly its goals and problems, but also the large set of tools, techniques, and strategies that it involves. It is characterized by the massive use of algorithms and computational resources to deal with large sets of data, high numbers of variables, and complex data structures.
In this section we introduce the popular machine learning techniques that we will be employing in this thesis for the purpose of behavior classification (the behavior being viewed as categorical data for our purposes). Those algorithms are K-nearest neighbors, boosting, and support vector machines. But first, we define the goal of "classification".
Classification:

A classification problem in our context consists of a set of input vectors $x_1, \dots, x_n$ in $\mathbb{R}^d$ and corresponding output labels $y_1, \dots, y_n$. In the context of flies, in the case of the behavior of fly chasing, for example, there are two classes, 'chasing' and 'not chasing', and labels are thus binary, $\{+1, -1\}$. The input vectors will be automatically recorded properties of the observed flies (position, orientation, size, etc.). A predictive model, or set of rules, is constructed, and new data points are predictively classified according to the sign of the resulting predictor. We compare three classification approaches, K-nearest neighbors, boosting, and SVM, in terms of error occurrence.
K-Nearest Neighbors (k-NN)

Among all machine learning algorithms, k-nearest neighbors (KNN) is the simplest and most intuitive method. The concept of a nearest-neighbor decision rule was first proposed by Cover and Hart in 1967 (Cover & Hart, 1967). The algorithm starts by assuming the existence of a set of training data for which classifications are already known. For the purpose of classification, an object is classified by a majority vote of its neighbors in the training data, with the object being assigned to the class most common among those k nearest neighbors. For example, for some new point x, if k = 1, then the object x is simply assigned to the class of its single nearest neighbor in the training data. In summary, the KNN algorithm labels each new point with the class that is in the majority among its neighbors.

 
Figure 1. (a) In this 2-dimensional example, the nearest point to x is a red training instance; thus, x will be labeled as red. (b) A decision boundary formed by KNN rules.

 
The KNN rule can determine the label of every new point in the space and gives rise to a decision boundary that partitions the feature space into different regions (Figure 1). In contrast to other techniques, like support vector machines, which discard all non-support vectors, most of the lazy algorithms, and in particular KNN, make a decision based on the entire training data set. One straightforward extension is not to give one vote to each of the neighbors. For example, a very common variant is weighted KNN, where each point has a weight that is typically calculated as a function of its distance, for instance

$$w_i(x) = \frac{1}{\|x - x_i\|^2}, \quad i \in N_k(x).$$

 
More formally, assuming we are working in the space $\mathbb{R}^d$, to compute the predictor at a point x we have to define a neighborhood $N_k(x)$ corresponding to the set of the k closest observations to x among the training data in $\mathbb{R}^d$.
 
 

 
The classification rule implicitly computes the decision boundary by majority vote. Aggregate every neighbor's vote:

$$v_c = \sum_{i \in N_k(x)} \mathbb{1}(y_i = c), \quad \forall c \in [C],$$

then label according to the majority:

$$\hat{y}(x) = \arg\max_{c \in [C]} v_c.$$
 
In the binary classification case, the borderline between the two classes is in general a very irregular curve. KNN is a non-parametric method that is robust against outliers and has strong guarantees of "doing the right thing" (Hastie, Tibshirani, & Friedman, 2001). The only unknown parameter is k. As k increases, it yields a smoother decision region that cuts across noisy points (Bishop, 2006).
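As an illustration, a classifier of this kind can be fitted with scikit-learn, the library used for the analyses later in this thesis. The sketch below is illustrative only: the data are synthetic stand-ins for the per-frame features and chasing labels, k = 5 with distance weighting is an assumed rather than tuned setting, and scikit-learn's distance weighting uses the inverse distance, a close cousin of the inverse squared distance shown above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 frames, 5 per-frame features each,
# with binary labels (+1 = chasing, -1 = not chasing).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# weights="distance" gives a weighted-KNN variant: each of the k
# neighbors votes with weight inversely related to its distance.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X, y)

# A new frame is labeled by the (weighted) majority vote of its neighbors.
x_new = rng.normal(size=(1, 5))
print(knn.predict(x_new))
```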
 

 
 
Boosting (GentleBoost)

Boosting is an iterative technique that learns an accurate classifier by repeatedly adding a set of weak rules over a series of rounds. The algorithm implements an iterative procedure that demands more accuracy in the next iteration at precisely the points at which the previous predictor had the worst performance. In the long run, boosting will obtain an efficient predictor. Boosting is based on the question posed by Kearns (Kearns, 1988): can a set of weak learners create a single, stronger learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification. Schapire's affirmative answer (Schapire, 1990) to Kearns' question has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.
 

 
The details of boosting are as follows. The classifier begins with a random prediction, whose accuracy will typically be around 0.5 for a binary classifier. A more accurate predictor is generated from the current weak rules by calculating a distribution of weights over the training set. Initially, all training set members are given equal weight. After each iteration, the weights of incorrectly classified data increase, so that the classifier is forced to concentrate more on the harder examples. The algorithm proceeds as follows:
 

  8
 

 
1.
 Initialize
 distribution
 weight
 at
 for
 iteration
 1:
 D
!
(i)
 =
 1/m
 ,
 t
 =
 1,
 .
 .
 .,
 T
 
2.
 Find
 weak
 h
!
:
 X
 →  {1,
 +1}
 with
 least
 error
 ϵ
!

 =Pr
!~!
!
[h
!
(x
!
≠ y
!
)]
 
3.
 Choose
 α  
!
=
 1-­‐2ϵ
!
,
 Update:
 

 
D
!!!
(i)=
1
1+exp(margin
!
)
 ×  
1
Z
!

 
where
 Z
!

 is
 a
 normalization
 factor
 that
 makes
 D
!!!

 sum
 to
 1.
 
4.
 Output
 the
 final
 hypothesis:
 
H
(!)
= sign( α  
!
!
!!!
h
!
(x
!
))
 
To sum up in words: the algorithm first forms a large set of simple features and initializes the weights for the training examples. For T rounds we repeat the following: first, train a classifier using a single feature and evaluate the training error for each of the available features; next, choose the classifier with the lowest error, add that classifier to the set of weak learners, and then update the weights of the training examples according to how accurately they are now classified (the most poorly classified getting the highest weights). Finally, form the strong classifier as the linear combination of the T weak classifiers.
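As a sketch of this loop in practice, scikit-learn's AdaBoost implementation (a sibling of GentleBoost, not the identical weight-update rule used by JAABA) boosts depth-1 decision stumps, each of which splits on a single feature; the data and the number of rounds below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data: rows are frames, columns are features.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = np.where(X[:, 0] - 0.5 * X[:, 3] > 0, 1, -1)

# By default AdaBoostClassifier boosts depth-1 trees ("decision stumps"),
# i.e. weak single-feature rules; T = 100 rounds re-weight the training
# examples so that later stumps focus on the points misclassified so far.
booster = AdaBoostClassifier(n_estimators=100)
booster.fit(X, y)

# The final strong classifier is a weighted vote of the T weak learners.
print(booster.score(X, y))
```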
 

 

 
Support Vector Machines (SVMs)

The SVM was first introduced in 1992 (Boser et al., 1992). It has become popular because of its success in handwritten digit recognition (Burges, 1998). SVMs are "maximum margin classifiers". This means that, among all the hyperplanes that separate the training data into two classes (ideally, all the positively classified data are on one side and all the negatively classified data are on the other side), there exists only one hyperplane in $\mathbb{R}^d$ such that this specific hyperplane maximizes the margin from it to the two classes (Figure 2).
 

 

Figure 2. The last boundary gives the maximum margin solution.
 
Intuitively this hyperplane is the best boundary, as the classifier is the farthest possible from all cases, and it will thereby generalize well to new data points that lie slightly outside of the observed boundary. The hypothesis is

$$f(x) = w \cdot x + b,$$
 

 
where the optimal hyperplane is denoted by $f(x) = 0$. The goal is to find the specific parameters $w \in \mathbb{R}^d$ and $b$ that minimize an objective like

$$\sum_{i=1}^{n} \ell(w \cdot x_i + b, y_i) + \|w\|^2, \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1,$$
 

 
where the first term is the training error and the second term is the complexity term $\|w\|^2$, which controls the margin (minimizing $\|w\|^2$ maximizes the margin), assuming we can separate the data perfectly. But this only applies to linearly separable cases; for cases which are not linearly separable, the objective changes to

$$\|w\|^2 + C \sum_{i=1}^{n} \xi_i^p, \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1 - \xi_i, \ \xi_i \geq 0,$$
 

 
where $p$ is either 1 ("hinge loss") or 2 ("quadratic loss") (Chapelle, 2007). Since a single straight line may be insufficient to separate the classes, SVMs rely at this point on the so-called kernel trick to increase the separation between classes. The kernel trick makes use of a kernel $k(x, y)$ that measures similarities between elements $x$ and $y$ and needs to fulfill certain properties to be applicable (it must be positive semi-definite). A frequently used kernel is the Gaussian kernel:

$$k(x, y) = \exp(-\sigma \|x - y\|^2),$$
 

 
where $\sigma$ is a hyper-parameter and $\|\cdot\|$ is the Euclidean norm. Using the Gaussian kernel implies that the data is embedded in an infinite-dimensional space and all operations are applied in this space (Chapelle, 2007).
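For concreteness, a Gaussian-kernel SVM of this kind can be fitted with scikit-learn's SVC; in its parameterization, gamma plays the role of $\sigma$ above. The values of C and gamma and the toy data here are illustrative assumptions, not the settings used for the fly data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with a non-linear class boundary (a circle), which a single
# straight line cannot separate, motivating the kernel trick.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = np.where(np.sum(X ** 2, axis=1) > 2.0, 1, -1)

# kernel="rbf" is the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2);
# C is the soft-margin penalty on the slack variables.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.predict(X[:5]))
```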
 

 

 
Training/Validation Dataset

Far better results can be obtained by adopting a machine learning approach in which three distinct datasets, training data, test data, and a cross-validation set, are used to tune parameters. The training data consists of explanatory features $x_1, \dots, x_n$ in $\mathbb{R}^d$ and labeled, pre-classified targets $y_1, \dots, y_n$ known in advance. The first and simplest form of the machine learning algorithm is established on the basis of the training set. After the learning phase, the test set is used to determine the identity of new data. The ability to correctly categorize new examples that differ from those used for training is known as 'generalization' (Bishop, 2006).
 
 

 
This procedure is used to avoid overfitting: a methodological mistake that arises if one both learns the parameters of a prediction function and then tests it on the same data. A model that would just repeat the labels of the samples that it has just seen would have a perfect score, but would likely fail to predict anything useful on yet-unseen data. The model is then refined in the cross-validation step. Here, extra parameters are fitted until the estimator appears to perform optimally.
 
 

 
 
These algorithms exploit "cross-validation". The technique of K-fold cross-validation involves taking the available data and partitioning it into K equal groups. We use K - 1 of the groups to train and fit a set of models, which are then validated on the remaining group. This procedure is then iterated over all K possible held-out groups, indicated in Figure 3 by the red blocks, and the performance scores from the K runs are then averaged.

Figure 3. The 5-fold cross-validation technique.
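A minimal sketch of this procedure with scikit-learn is given below; the 4 folds mirror the split used later in this thesis, while the classifier and data are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for the labeled frames.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 8))
y = np.where(X[:, 0] > 0, 1, -1)

# cv=4 partitions the data into 4 groups: each fold trains on 3/4 of
# the data, scores on the held-out 1/4, and the 4 scores are averaged.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=4)
print(scores, scores.mean())
```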
 

 
Principal Component Analysis (PCA)

When developing a successful SVM forecast while dealing with high-dimensional data, for instance if there are thousands of features in the data, $x^{(i)} \in \mathbb{R}^{1197}$, the features are likely to be highly correlated and redundant in the amount of information they carry about the behaviors. For this reason a first step of feature extraction is often added. Principal Component Analysis (PCA) is by far one of the most commonly used approaches for this. It linearly transforms the original inputs into new, uncorrelated features. By compressing the data using just a subset of the PCs, PCA data reduction will generally improve the performance and speed of the ML algorithm, which is itself trying to find a lower-dimensional surface onto which to project the data with the least squared projection error. In general, if the data has N dimensions, the goal is to reduce it to k dimensions. We aim to find k vectors, $u^{(1)}, u^{(2)}, \dots, u^{(k)}$, onto which to project the data so as to minimize the projection error. To apply this approach, we first evaluate the covariance matrix $\Sigma$ and find its eigenvectors and eigenvalues, and then compute the eigenvectors in the original data space using normalization rescaling. The number k, called the number of components, can be viewed as a PCA parameter. Commonly, we pick the smallest value of k for which the percentage of variance explained is 99% (d'Aspremont, El Ghaoui, Jordan, & Lanckriet, 2004). In other words, we must have

$$\frac{\frac{1}{n} \sum_{i=1}^{n} \|x^{(i)} - x_{\mathrm{approx}}^{(i)}\|^2}{\frac{1}{n} \sum_{i=1}^{n} \|x^{(i)}\|^2} \leq 0.01.$$
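In practice, the criterion above corresponds to retaining 99% of the variance, which scikit-learn's PCA can apply directly; the feature matrix below is a generic placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix: 500 observations of 50 correlated features.
rng = np.random.default_rng(4)
base = rng.normal(size=(500, 10))
X = base @ rng.normal(size=(10, 50))  # 50 features spanned by 10 factors

# A float n_components in (0, 1) asks PCA for the smallest k whose
# cumulative explained variance reaches that fraction, i.e. the 99%
# criterion above (average squared projection error of at most 1%).
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```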
 

Data Preprocessing (feature scaling)

Since the ranges of values of raw data vary widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by a metric such as Euclidean distance. If one of the features has a broad range of values, the distance will likely be governed by this particular feature. Therefore, the range of every feature should be normalized so that each feature contributes approximately proportionately to the final distance. We define

$$x' = \frac{x}{\max(x)},$$

where $x'$ is the rescaled value, $x$ is the original value, and $\max(x)$ is the maximum value of $x$ among all frames of data.
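Applied to a whole feature matrix (one row per frame, one column per feature), this rescaling is a one-liner; the sketch assumes all feature maxima are positive and nonzero.

```python
import numpy as np

# Features on very different scales, as in the raw per-frame data.
rng = np.random.default_rng(5)
X = rng.uniform(0.001, 100.0, size=(1000, 6))

# Divide each column (feature) by its maximum over all frames, so that
# every rescaled feature contributes comparably to distance computations.
X_scaled = X / X.max(axis=0)
print(X_scaled.max(axis=0))  # each column now peaks at 1.0
```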
 
 

 
Methods

JAABA Software

Our analysis uses data that were drawn from the JAABA study. We focus on a particular behavior: that of one fly chasing a nearby fly. We analyze the data of Kabra, Robie, Rivera-Alba, Branson, & Branson (2013). According to the supplementary methods section of that paper, 20 Drosophila melanogaster (10 males and 10 females) were reared in standard vials on dextrose-based medium, under moderate temperature, with fresh food and a starvation treatment (Kabra et al., 2013). Flies were recorded by camera, and the trajectories of the motion of the 20 flies were automatically tracked using Ctrax (Branson et al., 2009), software that assigns both a fly identity and a label of the body. The JAABA software transforms trajectory outputs into a novel, efficient, general-purpose representation by computing a suite of 'per-frame' features that describe the state of the animal in the current frame (e.g., size, orientation). From these, JAABA computes a general set of window features that provide temporal context around each frame, for example the min, std, and mean of given per-frame features. In this study, we use both per-frame and window features.
To get the training data, we used JAABA to manually label a number of bouts of "chasing", where we were certain the fly was performing the behavior, as well as a couple of nearby bouts of "not chasing", in which the fly was not. See Figure 4 for a screenshot of this labeling process. There are several "label tricks" involved in such labeling of behaviors. "Chasing" is a label applied when the fly suddenly turns around and accelerates in another fly's direction. If a fly is just passing by another fly and shows no tendency to move towards the target fly, then it is not considered to be chasing. However, since the labeling is performed manually, it is somewhat arbitrary, in the sense that the starting point at which to label the chasing behavior is vague: there is no strict distance between two flies that we can use to tell when the behavior actually begins.
 

 

 
Figure 4. The JAABA panel used to observe a fly's behavior, label it, train, and predict.
 
Algorithms

In total, we obtained four sets of data, of 839, 2743, 5144, and 10523 frames respectively, out of a total of 27375 frames. All frames are quantified using 1197 features. In this experiment, data were always labeled by the same person, Ye Meng, in the same computing environment, so that the way we obtained training data and decided whether a given behavior had occurred or not was consistent across time and would not confuse the classification process. In a first data-processing step, we transformed the data into rescaled features to standardize their range. We then fitted these data using boosting, Gaussian-kernel SVM with PCA, and k-nearest neighbors, using the scikit-learn machine learning Python modules.
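Putting the preceding pieces together, the fitting procedure can be sketched as a scikit-learn pipeline; the dataset is a synthetic stand-in for a real frames-by-features matrix, and the component count and kernel settings are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in for one dataset: frames x features (e.g. 839 x 1197).
rng = np.random.default_rng(6)
X = rng.uniform(0.001, 100.0, size=(839, 1197))
y = np.where(X[:, 0] > 50.0, 1, -1)

# Rescale each feature by its maximum, reduce the dimension with PCA,
# then fit a Gaussian-kernel SVM, mirroring the steps described above.
X_scaled = X / X.max(axis=0)
model = Pipeline([
    ("pca", PCA(n_components=150)),
    ("svm", SVC(kernel="rbf")),
])
print(cross_val_score(model, X_scaled, y, cv=4).mean())
```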
 
 

 
Performance Evaluation

We first divided the data into training and cross-validation sets, with a ratio of 3:1, and performed 4-fold cross-validation. In this process, we randomly split the data into 4 subsets, train a classifier on 3/4 of the data, and then estimate the error rate on the remaining 1/4 subset; that is how the cross-validation score is measured. To assess algorithm performance, we ranked the algorithms by the comparative accuracy scores each generated under cross-validation. We calculated the error rates for model performance evaluation across the 4 cross-validation iterations, and averaged these 4 scores to obtain the final measures. Specifically, the performance measures are accuracy[1], MSE[2], precision[3], specificity[4], recall rate (sensitivity)[5], and F score[6], almost all of which were computed from the confusion table. These are defined below. We used these measures to rank the relative performances of the ML algorithms.
_________________________________________________________________________________________________________
[1] Accuracy: the proportion of all cases (positive and negative) that are correctly classified.
[2] MSE: for binary outcomes, quantifies the difference between the predicted probabilities and the ground-truth target values.
[3] Precision: the proportion of cases predicted positive that are truly positive.
[4] Specificity (sometimes called the true negative rate): measures the proportion of negatives which are correctly identified.
[5] Sensitivity: measures the proportion of actual positives which are correctly identified.
[6] F score: the harmonic mean of precision and sensitivity.
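For reference, all of these measures can be computed from the confusion table with scikit-learn; the labels and predictions below are placeholders.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Placeholder ground-truth labels and classifier predictions.
y_true = np.array([1, 1, 1, -1, -1, 1, -1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, 1, 1, -1, -1, 1, -1])

# With labels sorted ascending (-1, +1), ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))
print("precision  ", precision_score(y_true, y_pred))
print("sensitivity", recall_score(y_true, y_pred))  # true positive rate
print("specificity", tn / (tn + fp))                # true negative rate
print("F score    ", f1_score(y_true, y_pred))
```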
 

Results:

The following tables show performance on the four different datasets, which, we recall, vary according to the amount of data they contain. Each dataset has the same number of features capturing fly properties, 1197, meaning our data are of sizes 839x1197, 2743x1197, 5144x1197, and 10523x1197. Through PCA, each dataset was first reduced from a dimension of 1197 to a lower dimension of 150. We chose this number of components because, as the plot in Figure 5 shows, the gain in explained variance from adding components beyond 150 is smallest; 150 is approximately the minimum number of components that explains the greatest percentage of the data's total variance. PCA also greatly helped increase the computing speed, thereby reducing the analysis time.
 
 

 
Figure 5. Selection of the number of PCA components.
 
We fit boosting, logistic regression, SVM with a linear kernel, SVM with a Gaussian kernel, and KNN, and summarize performance using the following average 4-fold cross-validation scores: accuracy rate, MSE, specificity, precision, sensitivity (recall), and F score. We decide which algorithm performs best by comparing accuracy rates, since accuracy gives an unbiased estimate of overall performance and can therefore be used to give an overall ranking of the algorithms. Since our classifier is binary, we can also evaluate a classifier in a direct and natural way using the area under the receiver operating characteristic (ROC) curve, which is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) against the fraction of false positives out of the total actual negatives (FPR = false positive rate). Within each algorithm, we compare the performance of the data before and after the feature scaling process described in "Data Preprocessing". Results are shown in Tables 1-4.
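A minimal sketch of producing one such ROC curve and its area with scikit-learn follows; the decision scores and labels are placeholders.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder ground truth and classifier decision scores.
y_true = np.array([1, 1, -1, 1, -1, -1, 1, -1])
scores = np.array([0.9, 0.7, 0.6, 0.55, 0.4, 0.3, 0.8, 0.1])

# roc_curve sweeps the decision threshold, returning the FPR and TPR
# pairs that trace the curve; auc integrates TPR over FPR.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", auc(fpr, tpr))
```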
 

 

| Measure     | Boosting (after / before) | Logistic regression (after / before) | SVM linear (after / before) | SVM Gaussian (after / before) | kNN (after / before) |
|-------------|---------------------------|--------------------------------------|-----------------------------|-------------------------------|----------------------|
| Accuracy    | 0.799 / 0.868 | 0.902 / 0.818 | 0.867 / 0.758 | 0.893 / 0.614 | 0.759 / 0.751 |
| MSE         | 0.152 / 0.109 | 0.077 / 0.160 | 0.107 / 0.209 | 0.061 / 0.285 | 0.186 / 0.195 |
| Specificity | 0.635 / 0.830 | 0.865 / 0.810 | 0.788 / 0.699 | 0.853 / 0.0 [a] | 0.432 / 0.587 |
| Precision   | 0.771 / 0.849 | 0.905 / 0.830 | 0.836 / 0.769 | 0.878 / 0.614 | 0.709 / 0.717 |
| Sensitivity | 0.943 / 0.914 | 0.930 / 0.856 | 0.951 / 0.885 | 0.943 / 1.0 | 0.994 / 0.945 |
| F score     | 0.828 / 0.875 | 0.913 / 0.830 | 0.877 / 0.794 | 0.896 / 0.732 | 0.805 / 0.793 |

Table 1. Performance before and after rescaling on data of size 839x1197.
[a] Measure was not defined for some labels.
 

 

 

| Measure     | Boosting (after [b] / before) | Logistic regression (after / before) | SVM linear (after / before) | SVM Gaussian (after / before) | kNN (after / before) |
|-------------|-------------------------------|--------------------------------------|-----------------------------|-------------------------------|----------------------|
| Accuracy    | 0.913 / 0.869 | 0.945 / 0.933 | 0.949 / 0.926 | 0.958 / 0.619 | 0.906 / 0.882 |
| MSE         | 0.068 / 0.104 | 0.049 / 0.062 | 0.043 / 0.063 | 0.036 / 0.245 | 0.067 / 0.078 |
| Specificity | 0.841 / 0.777 | 0.896 / 0.877 | 0.907 / 0.865 | 0.916 / 0.0 [a] | 0.761 / 0.725 |
| Precision   | 0.923 / 0.866 | 0.945 / 0.926 | 0.951 / 0.921 | 0.968 / 0.619 | 0.873 / 0.856 |
| Sensitivity | 0.933 / 0.919 | 0.964 / 0.963 | 0.963 / 0.959 | 0.964 / 1.0 [a] | 0.985 / 0.967 |
| F score     | 0.927 / 0.892 | 0.954 / 0.943 | 0.957 / 0.938 | 0.966 / 0.759 | 0.925 / 0.907 |

Table 2. Performance before and after rescaling on data of size 2743x1197.
[a] Measure was not defined for some labels. [b] Overflow encountered.
 

 

 

| Measure     | Boosting (after / before [b]) | Logistic regression (after / before) | SVM linear (after / before) | SVM Gaussian (after / before) | kNN (after / before) |
|-------------|-------------------------------|--------------------------------------|-----------------------------|-------------------------------|----------------------|
| Accuracy    | 0.914 / 0.892 | 0.913 / 0.901 | 0.915 / 0.910 | 0.932 / 0.595 | 0.898 / 0.868 |
| MSE         | 0.065 / 0.086 | 0.081 / 0.095 | 0.070 / 0.078 | 0.052 / 0.245 | 0.073 / 0.095 |
| Specificity | 0.872 / 0.841 | 0.877 / 0.842 | 0.883 / 0.860 | 0.907 / 0.0 [a] | 0.777 / 0.726 |
| Precision   | 0.911 / 0.897 | 0.914 / 0.905 | 0.919 / 0.913 | 0.934 / 0.595 | 0.863 / 0.839 |
| Sensitivity | 0.943 / 0.920 | 0.936 / 0.934 | 0.935 / 0.937 | 0.948 / 1.0 [a] | 0.979 / 0.959 |
| F score     | 0.927 / 0.908 | 0.925 / 0.918 | 0.926 / 0.924 | 0.941 / 0.743 | 0.917 / 0.894 |

Table 3. Performance before and after rescaling on data of size 5144x1197.
[a] Measure was not defined for some labels. [b] Overflow encountered.

 

 

 

| Measure     | Boosting (after / before) | Logistic regression (after [b] / before [b]) | SVM linear (after / before) | SVM Gaussian (after / before) | kNN (after / before) |
|-------------|---------------------------|----------------------------------------------|-----------------------------|-------------------------------|----------------------|
| Accuracy    | 0.936 / 0.924 | 0.937 / 0.940 | 0.943 / 0.942 | 0.955 / 0.616 | 0.943 / 0.924 |
| MSE         | 0.048 / 0.059 | 0.060 / 0.056 | 0.047 / 0.049 | 0.034 / 0.268 | 0.043 / 0.059 |
| Specificity | 0.938 / 0.919 | 0.940 / 0.933 | 0.948 / 0.937 | 0.967 / NaN [a] | 0.916 / 0.891 |
| Precision   | 0.900 / 0.880 | 0.887 / 0.900 | 0.900 / 0.897 | 0.942 / 0.0 | 0.872 / 0.844 |
| Sensitivity | 0.913 / 0.901 | 0.931 / 0.937 | 0.934 / 0.940 | 0.921 / 0.0 | 0.964 / 0.946 |
| F score     | 0.905 / 0.889 | 0.906 / 0.916 | 0.914 / 0.916 | 0.930 / 0.0 | 0.915 / 0.891 |

Table 4. Performance before and after rescaling on data of size 10523x1197.
[a] Measure was not defined for some labels. [b] Overflow encountered.
 

 

 
Among the five algorithms, the Gaussian-kernel SVM gives the highest cross-validation scores (0.893, 0.958, 0.932, 0.955). Based on the cross-validation scores, the overall performance of the five algorithms can be ranked as: Gaussian SVM > (logistic regression = linear SVM) > boosting > kNN. Without scaling, the Gaussian SVM algorithm performed much less well: accuracy fell from 0.958 to 0.619 for the 2743x1197 data and from 0.932 to 0.595 for the 5144x1197 data, so rescaling roughly doubled the accuracy. Thus, feature scaling appears to contribute a great deal to the Gaussian-kernel SVM. Since the experiment was a binary classification test, sensitivity and specificity need to be considered when assessing misclassification. The highest specificity scores are obtained by the Gaussian SVM (0.853, 0.985, 0.979, 0.964), which reflects very good performance in terms of predicting when chasing does not occur. K-nearest neighbors generates the highest sensitivity scores (0.994, 0.983, 0.986, 0.986), which, on the other hand, indicates that a great percentage of chasing behaviors are correctly identified (Figures 6 and 7).
 
 

Figure 6. Confusion matrix for the SVM with Gaussian kernel on the 10523x1197 data.

Figure 7. Confusion matrix for kNN on the 10523x1197 data.
All five algorithms show the same pattern as the data scale increases: the bigger the dataset, the better the model the algorithm builds, resulting in better predicted outcomes and higher scores. Under the same PCA procedure, the Gaussian SVM took somewhat more processing time than the other algorithms, and changes to the length of the data affected its speed most noticeably, with about a four-fold increase in run-time on my computer. The following ROC plots, each with its area under the curve computed, reflect the ranking of the five algorithms across the different data scales (Figures 8-11).
 

 

 

Figure 8. ROC plots for the different algorithms under 4-fold cross-validation on the rescaled 839x1197 data: Gaussian SVM mean ROC area = 0.9799, logistic regression = 0.9704, linear SVM = 0.9663, boosting = 0.9469, kNN = 0.9127.
 

 

 

 

Figure 9. ROC plots for the different algorithms under 4-fold cross-validation on the rescaled 2743x1197 data: Gaussian SVM mean ROC area = 0.9895, logistic regression = 0.9837, linear SVM = 0.9816, kNN = 0.9669, boosting = 0.9658.
 

 

 

Figure 10. ROC plots for the different algorithms under 4-fold cross-validation on the rescaled 5144x1197 data: Gaussian SVM mean ROC area = 0.9820, linear SVM = 0.9751, logistic regression = 0.9740, kNN = 0.9701, boosting = 0.9698.
 

 

Figure 11. ROC plots for the different algorithms under 4-fold cross-validation on the rescaled 10523x1197 data: Gaussian SVM mean ROC area = 0.9881, linear SVM = 0.9846, logistic regression = 0.9837, kNN = 0.9819, boosting = 0.9782.
 

 
Discussion:

In this thesis we explored the issue of whether machine learning algorithms can be used to predict an animal behavior and, if so, which ML methods seem most effective at the task. To date there has been only one such example published in the literature (Kabra et al., 2013). Our application was to video data recorded for Drosophila melanogaster, and the behavior of interest was 'chasing', a behavior that is common among flies, often as a precursor to aggression. We compared five machine learning models: gentle boosting, support vector machines with a linear kernel, support vector machines with a Gaussian kernel, logistic regression, and k-nearest neighbors. The models were then fitted to data from four video recordings of differing lengths.
 

 
In general, the machine learning models all fitted these high-dimensional data sets, in which we used 1197 features to summarize the animals, relatively well. Among the five learning algorithms, the SVM with a Gaussian kernel gives robustly accurate rates for the purpose of classification, while the others give similar predictions on validation data before and after scaling. Scaling nearly doubles the performance of the SVM with a Gaussian kernel. K-nearest neighbors, as a lazy algorithm, performs less well, but its processing speed on my computer was the best of the methods considered. As might be expected, the performance of all algorithms tended to improve as the size of the dataset increased. However, the story was more nuanced than that, as sometimes the performance of an algorithm (in terms of sensitivity and specificity) would decrease for a given, larger dataset. Why did this occur in our case? One possibility is that it might be due to inaccuracies in the behavior labeling. Another possibility is that the longer video contained instances in which the behavior was more difficult to predict because it was atypical in some way. We note that even when the data scale is at its largest, 10523x1197, the specificities and sensitivities are not the highest. We would expect more experimental replicates to provide more robust estimates of overall performance.
 

 
I chose to use 4-fold cross-validation so that there is a 3:1 ratio of training to cross-validation data, as suggested by Andrew Ng of Stanford University. Although I retained 150 principal components, PCA failed to reduce the data efficiently unless rescaling was also applied. This is a reflection of the fact that the feature data are quite unbalanced, varying from 0.001 to 100. For these data, the choice of SVM kernel (Gaussian or linear) also does not appear to make a significant difference.
 
 

 
Overall, we have shown that ML algorithms can be used to annotate the fly behavior of chasing based upon automatically generated video imaging. Many of the machine learning algorithms performed this task well, with a key first step appearing to be that of feature scaling, to ensure that all features vary over the same scale. The single best-performing algorithm appeared to be the SVM with a Gaussian kernel, but a more rigorous analysis is needed to determine whether this remains true for other behaviors, other fly genotypes, or other experimental conditions. The high complexity of, and time commitment involved in, such an analysis prevents the inclusion of such a comprehensive study here.
 

References:

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Boser, B. E., et al. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5, 144-152.

Branson, K., Robie, A. A., Bender, J., Perona, P., & Dickinson, M. H. (2009). High-throughput ethomics in large groups of Drosophila. Nature Methods, 6, 451-457. doi:10.1038/nmeth.1328

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Kluwer Academic Publishers, 1-43.

Chapelle, O. (2007). Training a support vector machine in the primal (Vol. 19).

Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27. doi:10.1109/TIT.1967.1053964

d'Aspremont, A., El Ghaoui, L., Jordan, M. I., & Lanckriet, G. R. G. (2004). A direct formulation for sparse PCA using semidefinite programming. NIPS.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S., & Branson, K. (2013). JAABA: interactive machine learning for automatic annotation of animal behavior. Nature Methods, 10(1), 64-67. doi:10.1038/nmeth.2281

Kearns, M. (1988). Thoughts on hypothesis boosting. Unpublished manuscript (Machine Learning class project, December 1988).

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197-227.