Animal Behavior Pattern Annotation and Performance Evaluation

Ye Meng

Professor Paul Marjoram

March 24, 2014
Table of Contents

Abstract
Introduction
    Classification
    K-Nearest Neighbors
    Boosting
    Support Vector Machines
    Training/Validation Dataset
    Principal Component Analysis
    Data Preprocessing
Methods
    JAABA Software
    Algorithms
    Performance Evaluation
Results
Discussion
References
Abstract

Learning animal behavior patterns, such as the social interactions of flies, is of great interest both in terms of improving our understanding of how those patterns emerge and as a precursor for looking for genetic determinants of those behaviors. Manual annotation is labor-intensive, so automated techniques are much needed. A popular tool for semi-automatic annotation of fly behavior, based upon video tracking data, is the JAABA software package. In this thesis we propose methods based on machine learning techniques to refine the analysis of sample data from a JAABA document. Machine learning algorithms are applied to behavior outcome classification, since they are ideal for large datasets. We do this using a self-written Python package. In our case, flies are characterized via approximately 50 per-frame features. Those features capture relevant aspects of the tracking data. While no feature can predict behavior in itself, combinations of features are able to do so. We demonstrate how to do so, thereby offering scope for improving the accuracy as well as speed of classification in future.
Introduction

Group behaviors of large aggregations of animals, and the interactions between them, are fascinating natural phenomena. Particularly interesting is the situation in which animals self-organize behaviors into complex patterns with no need of external stimulus or control. Such phenomena are known as "emergent behaviors". However, very little is known about the nature of such interactions. How do simple animals exhibit complex behaviors? Are there leaders who trigger such behaviors, or do all animals act under the same set of rules? Such knowledge might reasonably be applied to influence the behavior of animals, such as leading a school of fish away from polluted waters, say, or to control aggregates of robotic individuals, such as drones. For reasons such as this, understanding group behaviors is a worthwhile goal to aim at.
Since we cannot typically ask a fish why it is schooling, or a fly why it is on one food patch rather than another, we need another way to get inside the creature's mind. We can't ask it what it is doing, but we can tell a story that we believe might capture the underlying processes. We can then use a simulation model to explore whether simulated animals following that set of rules do in fact exhibit the behavior we observe in real data.
A key problem here is how to go about relating observed behavior from animals in the field or lab to the simulated behavior observed in our simulated animals. In days gone by, such an analysis involved large numbers of researchers, with large numbers of clip-boards, even greater patience, and a willingness to observe animals for long periods of time taking copious notes on what they saw. Such a strategy is clearly not viable, or desirable, in the modern era of "Big Data". For this reason, we look for more automated ways to proceed.
Machine learning might indicate the way patterns of animal behavior emerge from local rules of interaction among the individuals. Flies are experimental animals that are simple to cultivate, and they exhibit social behavior. Using video cameras, it is possible to investigate their interactions and distances from each other in each frame during their time in the field of vision. We see that groups of flies are formed, and that the nature of those groups changes with genotype, for example. Other prominent examples of aggregation behavior are bird flocks, fish schools and mammal herds. Apart from its obvious relevance in genetics, neurobiology and evolutionary biology, collective behavior is a key concept in many other fields of science, including economics, control theory, and social sciences.
There are open-source software tools that allow biologists to encode their intuition about the structure of behavior and to transform the records of motion recorded by tracking technology into higher-order, scientifically meaningful statistics of behavior.

Traditionally, statistical analysis relies upon an underlying theoretical model, or set of assumptions, and usually the analysis is designed around that principal theory, for example to test specific hypotheses about a data set of interest. On the contrary, with Machine Learning [ML] we take a different perspective, in that we have an observed outcome and our goal is to find a way to predict that outcome using the underlying data. In our context, the outcome is whether a fly performs a given behavior or not in a given frame, and the underlying data is the combination of features that describe the animal's behavior at that moment. With machine learning, we typically do not have to make underlying assumptions, such as a normality assumption in a linear regression.

What makes ML a particular field is partially its goals and problems, but also the large set of tools, techniques and strategies that it involves. It is characterized by the massive use of algorithms and computational resources to deal with large sets of data, high numbers of variables and complex data structures.
In this section we introduce the popular Machine Learning techniques that we will be employing in this thesis for the purpose of behavior classification (behavior is viewed as categorical data for our purposes). Those algorithms are K-nearest neighbors, Boosting and Support Vector Machines. But first, we define the goal of "classification".
Classification

A classification problem in our context consists of a set of input vectors $x_1, \dots, x_n$ in $\mathbb{R}^n$ and corresponding output labels $y_1, \dots, y_n$. In the context of flies, in the case of the behavior of fly chasing, for example, there are two classes, 'chasing' and 'not chasing', and labels are thus binary $\{+1, -1\}$. The input vectors will be automatically recorded properties of the observed flies (position, orientation, size, etc.). A predictive model, or set of rules, is constructed, and new data points are predictively classified according to the sign of the resulting predictor. We compare three classification approaches, K-nearest neighbor, boosting and SVM, in terms of error occurrence.
K-Nearest Neighbors (k-NN)

Among all machine learning algorithms, k-nearest neighbors (KNN) is the simplest and most intuitive method. The concept of a nearest neighbor decision rule was first proposed by Cover & Hart in 1967 (Cover & Hart, 1967). The algorithm starts by assuming the existence of a set of Training Data, for which classifications are already known. For the purpose of classification, an object is classified by a majority vote of its neighbors in the Training Data, with the object being assigned to the class most common among those k nearest neighbors. For example, for some new point x, if k = 1, then the object x is simply assigned to the class of its single nearest neighbor in the Training Data. In summary, the KNN algorithm says new points have to be labeled with the class that is in the majority among their neighbors. In Figure 1a, in this 2-dimensional example, the nearest point to x is a red training instance; thus, x will be labeled as red. Figure 1b shows a decision boundary formed by KNN rules. The KNN rule can determine the label of every new point in the space and give rise to a decision boundary that partitions the feature space into different regions (Figure 1).

In contrast to other techniques, like Support Vector Machines, which discard all non-support vectors, most of the lazy algorithms, and in particular KNN, make a decision based on the entire training data set. One straightforward extension is not to give 1 vote to all the neighbors. For example, a very common thing to do is weighted KNN, where each point has a weight which is typically calculated as a function of its distance, for example

$$w_i(x) = \frac{1}{\|x - x_i\|^2}, \qquad x_i \in N_k(x).$$
More formally, assuming we are working in the space $\mathbb{R}^d$, to compute the predictor at a point x we have to define a neighborhood $N_k(x)$ corresponding to the set of the k closest observations to x among the Training Data in $\mathbb{R}^d$. The classification rule implicitly computes the decision boundary by majority vote. Aggregate every neighbor's vote:

$$v_c = \sum_{i \in N_k(x)} \mathbb{1}(y_i = c), \qquad \forall c \in [C]$$

Label according to the majority:

$$\hat{y} = f(x) = \arg\max_{c \in [C]} v_c$$
In the binary classification case, the borderline between the two classes is in general a very irregular curve. KNN is a non-parametric method that is robust against outliers and has strong guarantees of "doing the right thing" (Hastie, Tibshirani, & Friedman, 2001). The only unknown parameter is k. Increasing k yields a smoother decision region that cuts across noisy points (Bishop, 2006).
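As a concrete illustration of the k-NN rule, including the distance-weighted vote, the following is a minimal sketch using scikit-learn, which this thesis's analysis also relies on. The arrays here are randomly generated stand-ins for the per-frame features and chasing/not-chasing labels, not the actual fly data.

```python
# A minimal sketch of (weighted) k-NN with scikit-learn. X_train, y_train
# and X_new are illustrative stand-ins, not the thesis data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                        # 200 frames, 5 features
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # binary labels
X_new = rng.normal(size=(10, 5))

# weights="distance" gives each of the k neighbors a vote proportional to
# the inverse of its distance (the weighted-KNN extension above);
# weights="uniform" recovers the plain majority vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print(knn.predict(X_new))
```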
Boosting (GentleBoost)

Boosting is an iterative technique that learns an accurate classifier by repeatedly adding weak rules over a series of rounds. The algorithm implements an iterative procedure that requires more accuracy in the next iteration at precisely the points at which the previous predictor had the worst performance. In the long run, boosting will obtain an efficient predictor. Boosting is based on the question posed by Kearns (Kearns, 1988): can a set of weak learners create a single, stronger learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification. Schapire's affirmative answer (Schapire, 1990) to Kearns' question has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting. The details of Boosting are as follows.
The classifier first begins with a random prediction, where accuracy will typically be around 0.5 for a binary classifier. A more accurate predictor is generated from the current weak rules by calculating a distribution of weights over the training set. Initially, all training set members are given equal weight. After each iteration, the weights of incorrectly classified data will increase so that the classifier is forced to concentrate more on the harder examples. The algorithm proceeds as follows:
1. Initialize the distribution weights for iteration 1: $D_1(i) = 1/m$, for $t = 1, \dots, T$.

2. Find the weak hypothesis $h_t: X \to \{-1, +1\}$ with least error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.

3. Choose $\alpha_t = 1 - 2\epsilon_t$ and update

$$D_{t+1}(i) = \frac{1}{1 + \exp(\mathrm{margin}_i)} \times \frac{1}{Z_t},$$

where $Z_t$ is a normalization factor that makes $D_{t+1}$ sum to 1.

4. Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
To sum up in words, the algorithm first forms a large set of simple features and initializes the weights for the training examples. For T rounds we repeat the following: first, train a classifier using a single feature and evaluate the training error for each of the available features; next, choose the classifier with the lowest error, add that classifier to the set of weak learners, and then update the weights of the training examples according to how accurately they are now classified (the most poorly classified getting the highest weights). Finally, form the strong classifier as the linear combination of the T weak classifiers.
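As a minimal sketch of this loop, the code below trains single-feature decision stumps under an evolving weight distribution. Note that it uses the standard AdaBoost weight update as a simpler stand-in for the GentleBoost update given above, and the data layout (a frames-by-features matrix with labels in {-1, +1}) is assumed for illustration.

```python
# A sketch of boosting with decision stumps (AdaBoost-style reweighting,
# not the exact GentleBoost update used by JAABA).
import numpy as np

def fit_stumps(X, y, T=25):
    """X: (m, d) feature matrix; y: labels in {-1, +1}."""
    m, d = X.shape
    D = np.full(m, 1.0 / m)                    # step 1: uniform weights
    learners = []
    for _ in range(T):
        # step 2: pick the stump (feature, threshold, polarity) with the
        # least weighted error under the current distribution D
        best = (np.inf, 0, 0.0, 1)
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] > thr, s, -s)
                    err = D[pred != y].sum()
                    if err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = np.where(X[:, j] > thr, s, -s)
        # step 3: upweight the misclassified examples, then renormalize
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        learners.append((j, thr, s, alpha))
    return learners

def predict(learners, X):
    # step 4: the final hypothesis is the sign of the weighted vote
    votes = sum(a * np.where(X[:, j] > thr, s, -s) for j, thr, s, a in learners)
    return np.sign(votes)
```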
Support Vector Machines (SVMs)

SVM was first introduced in 1992 (Boser et al., 1992). It has become popular because of its success in handwritten digit recognition (Burges, 1998). SVMs are "maximum margin classifiers". This means that, among all the hyperplanes that separate the training data into two classes (ideally, all the positively classified data are on one side and all the negative classifications are on the other side), there exists only one hyperplane in $\mathbb{R}^d$ that maximizes the margin from it to the two classes (Figure 2).

Figure 2. The last boundary gives the maximum margin solution.

Intuitively this hyperplane is the best boundary, as the classifier is the farthest possible from all cases, and it will thereby generalize against new data points that lie slightly outside of the observed boundary. The hypothesis is:
$$f(x) = w \cdot x + b,$$

where the optimal hyperplane is denoted by f(x) = 0. The goal is to find the specific parameters $w \in \mathbb{R}^d$ and b that minimize an objective like

$$\sum_{i=1}^{n} \ell(w \cdot x_i + b, y_i) + \|w\|^2$$

under the constraints $y_i(w \cdot x_i + b) \geq 1$, where the first term is the training error and the second term is the complexity term, $\|w\|^2$, which controls the margin, assuming we can separate the data perfectly. But this only applies to linearly separable cases; for cases which are non-linearly separable, the objective changes to

$$\|w\|^2 + C \sum_{i=1}^{n} \xi_i^p$$

under the constraints $y_i(w \cdot x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, where p is either 1 ("hinge loss") or 2 ("quadratic loss") (Chapelle, 2007).
Since a single straight line may be insufficient to separate classes, SVMs rely on the so-called kernel trick to increase the separation between classes. The kernel trick makes use of a kernel k(x, y) that measures similarities between elements x, y and needs to fulfill certain properties to be applicable (it must be positive semi-definite). A frequently used kernel is the Gaussian kernel:

$$k(x, y) = \exp(-\sigma \|x - y\|^2),$$

where $\sigma$ is a hyper-parameter and $\|\cdot\|$ is the Euclidean norm. Using the Gaussian kernel implies that the data is embedded in an infinite dimensional space and all operations are applied in this space (Chapelle, 2007).
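A minimal sketch of the soft-margin, Gaussian-kernel SVM described above, using scikit-learn's SVC: gamma plays the role of $\sigma$ in the kernel and C is the slack penalty from the non-separable objective. The dataset and parameter values are illustrative, not the thesis settings.

```python
# A sketch of an RBF-kernel SVM on illustrative synthetic data.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.1)  # k(x, y) = exp(-gamma * ||x - y||^2)
clf.fit(X, y)
print(clf.score(X, y))                     # training accuracy
```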
Training/Validation Dataset

Far better results can be obtained by adopting a machine learning approach in which three distinct datasets (training data, test data and a cross-validation set) are used to tune parameters. The training data consists of explanatory features $x_1, \dots, x_n$ in $\mathbb{R}^n$ and labeled, pre-classified targets $y_1, \dots, y_n$ known in advance. The first and simplest form of the machine learning algorithm is established on the basis of the training set.
After the learning phase, the test set is used to determine the identity of new data. The ability to correctly categorize new examples that differ from those used for training is known as 'generalization' (Bishop, 2006). This procedure is used to avoid overfitting: a methodological mistake that arises if one both learns the parameters of a prediction function, and then tests it, on the same data. A model that would just repeat the labels of the samples that it has just seen would have a perfect score but would likely fail to predict anything useful on yet-unseen data.
The model is then refined in the cross-validation step. Here, extra parameters are fitted until the estimator appears to perform optimally. These algorithms exploit "cross-validation". The technique of K-fold cross-validation involves taking the available data and partitioning it into K equal groups. We use K - 1 of the groups to train and fit a set of models that are then validated on the remaining group. This procedure is then iterated for all K possible held-out groups, indicated in Figure 3 by the red blocks, and the performance scores from the K runs are then averaged.

Figure 3. 5-fold cross-validation technique.
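A minimal sketch of this K-fold procedure with scikit-learn, with K = 5 to match Figure 3; the dataset here is an illustrative stand-in.

```python
# K-fold cross-validation: each group serves once as the validation fold,
# and the K fold scores are averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())  # per-fold accuracies and their average
```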
Principal Component Analysis (PCA)

When developing a successful SVM forecast on high dimensional data, for instance when there are thousands of features in the data, $X^{(i)} \in \mathbb{R}^{1197}$, the features are likely to be highly correlated and to overlap in the amount of information that they carry about the behaviors. For this reason a first step of feature extraction is often added. Principal Component Analysis (PCA) is by far one of the most commonly used approaches for this. It linearly transforms the original inputs into new, uncorrelated features. By compressing the data using just a subset of the PCs, PCA data reduction will generally improve the performance and speed of the ML algorithm, which is itself trying to find a lower dimensional surface onto which to project the data with less squared projection error.
In general, if the data has N dimensions, the goal is to reduce it to k dimensions. We aim to find k vectors, $u^{(1)}, u^{(2)}, \dots, u^{(k)}$, onto which to project the data so as to minimize the projection error. To apply this approach, we first evaluate the covariance matrix $\Sigma$ and find its eigenvectors and eigenvalues, and then compute the eigenvectors in the original data space using normalization rescaling. The number k, called the number of components, can be viewed as a PCA parameter. Commonly, we pick the smallest value of k for which the percentage of variance explained is 99% (d'Aspremont et al., 2004). In other words, we must have:

$$\frac{\frac{1}{n}\sum_{i=1}^{n} \|x^{(i)} - x^{(i)}_{\mathrm{approx}}\|^2}{\frac{1}{n}\sum_{i=1}^{n} \|x^{(i)}\|^2} \leq 0.01$$
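A minimal sketch of this 99% variance-retained rule using scikit-learn's PCA; the random matrix is an illustrative stand-in for the feature data.

```python
# Choose the smallest k whose components retain >= 99% of the variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 100))
pca = PCA().fit(X)
ratios = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(ratios, 0.99) + 1)   # smallest k retaining >= 99%
X_reduced = PCA(n_components=k).fit_transform(X)
print(k, X_reduced.shape)
```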
Data Preprocessing (feature scaling)

Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by a metric such as Euclidean distance. If one of the features has a broad range of values, the distance will likely be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
We define

$$x' = \frac{x}{\max(x)},$$

where x' is the rescaled value, x is the original value, and max(x) is the maximum value of x among all frames of data.
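A minimal sketch of this rescaling in NumPy; X is an illustrative frames-by-features array.

```python
# Divide each feature by its maximum across all frames: x' = x / max(x).
import numpy as np

X = np.abs(np.random.default_rng(0).normal(size=(1000, 8))) * 100
X_scaled = X / X.max(axis=0)    # per-feature rescaling
print(X_scaled.max(axis=0))     # every feature now tops out at 1.0
```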
Methods

JAABA Software

Our analysis uses data that were drawn from the JAABA document. We focus on a particular behavior: that of one fly chasing a nearby fly. We analyze the data of Kabra, Robie, Rivera-Alba, Branson, & Branson (2013). According to the supplementary methods section of that paper, 20 Drosophila melanogaster (10 males and 10 females) were reared in standard vials on dextrose-based medium, under moderate temperature, fresh food and starvation treatment (Kabra et al., 2013). Flies were recorded by camera, and the trajectory outputs of the motion of the 20 flies were automatically tracked using Ctrax (Branson et al., 2009), software that assigns both a fly identity and a label of the body. The JAABA software transforms trajectory outputs into a novel, efficient general-purpose representation by computing a suite of 'per-frame' features that describe the state of the animal in the current frame (e.g., size, orientation). From these, JAABA computes a general set of window features that provide temporal context around each frame, for example the min, std, and mean of given per-frame features. In this study, we use both per-frame and window features.
To get the training data, we used JAABA to manually label a number of bouts of "chasing" where we were certain the fly was performing the behavior, as well as a couple of nearby bouts of "not chasing", in which the fly was not. See Figure 4 for a screenshot of this labeling process.
There are several "label tricks" when conducting such labeling of behaviors. "Chasing" is a label applied when the fly is suddenly turning around and accelerating in another fly's direction. If a fly is just passing by another fly, and it does not have a tendency to move towards the target fly, then it is not considered to be chasing. However, since the labeling is performed manually, it is quite arbitrary in the sense that the starting point at which to label the chasing behavior is vague; there is no strict distance between two flies that we can use to tell when the behavior actually begins.

Figure 4. JAABA panel for observing one fly's behavior, labeling, training and predicting.
Algorithms

In total, we obtained four sets of data, one of 839 frames, one of 2743, one of 5144 and one of 10523 frames, out of a total of 27375 frames. All frames are quantified using 1197 features. In this experiment, data were always labeled by the same person, Ye Meng, in the same computing environment, so that the way we obtained training data and decided whether a given behavior had occurred or not was consistent across time and would not confuse the classification process. In a first data processing step, we transformed the data into rescaled features to standardize their range. We then fitted these data using boosting, Gaussian kernel SVM with PCA, and k-nearest neighbors, using the scikit-learn machine learning Python modules.
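The fitting workflow just described (rescale, reduce with PCA, classify) can be sketched as a scikit-learn Pipeline. This is an illustration under assumed settings, not the exact thesis code: MaxAbsScaler is used because it divides each feature by its maximum absolute value, matching the rescaling formula above, and the synthetic data stands in for one of the four labeled frame sets.

```python
# Rescale -> PCA(150) -> Gaussian SVM, evaluated with 4-fold CV.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=839, n_features=1197, random_state=0)
model = Pipeline([
    ("scale", MaxAbsScaler()),
    ("pca", PCA(n_components=150)),
    ("svm", SVC(kernel="rbf")),
])
print(cross_val_score(model, X, y, cv=4).mean())
```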
Performance Evaluation

We first divided the data into training and cross-validation data with a ratio of 3:1, and performed 4-fold cross-validation. In this process, we randomly split the set of data into 4 subsets, train a classifier on 3/4 of the data, and then estimate the error rate on the remaining 1/4 subset. That is how the cross-validation score is measured. To assess algorithm performance, we ranked the algorithms by the comparative accuracy scores that each generated after cross-validation. We calculated the error rates for model performance evaluation across the 4 cross-validation iterations, and averaged these 4 scores to obtain the final measures. Specifically, the performance measures are accuracy [1], MSE [2], precision [3], specificity [4], recall rate (sensitivity) [5], and F score [6], almost all of which were computed from the confusion table. These are defined below. We used these measures to rank the relative performances of the ML algorithms.
[1] Accuracy: the proportion of all predictions, positive and negative, that are correct.
[2] MSE: for binary outcomes, quantifies the difference between the predicted probabilities and the ground-truth target values.
[3] Precision: the proportion of predicted positives that are actual positives.
[4] Specificity (sometimes called the true negative rate): measures the proportion of negatives which are correctly identified.
[5] Sensitivity: measures the proportion of actual positives which are correctly identified.
[6] F score: the harmonic mean of precision and recall.
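A minimal sketch of computing these confusion-table measures with scikit-learn on illustrative predictions; specificity has no dedicated helper, so it is computed from the confusion matrix directly.

```python
# Confusion-table performance measures for a binary classifier.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))
print("F score:    ", f1_score(y_true, y_pred))
```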
Results

The following tables show performance on the four different datasets, which we recall vary according to the amount of data they contain. Each dataset has the same number of features capturing fly properties, 1197, meaning our data are of sizes 839x1197, 2743x1197, 5144x1197 and 10523x1197.
Through PCA, each dataset was first reduced from a dimension of 1197 to a lower dimension of 150. We chose this as the number of components since, as the plot shows, the increment in explained variance from adding further components levels off around 150 (Figure 5); it is approximately the minimum number of components that explains the greatest percentage of the data's total variance. PCA also greatly helped increase the computing speed, thereby reducing the analysis time.

Figure 5. PCA number-of-components selection.
We fit boosting, logistic regression, SVM with a linear kernel, SVM with a Gaussian kernel, and KNN, and summarize performance using the following average 4-fold cross-validation scores: accuracy rate, MSE, specificity, precision, sensitivity (recall), and F score. We decide which algorithm performs better by comparing accuracy rates, since accuracy gives unbiased estimates of overall performance and might therefore be used to give an overall rank of the algorithms.
Since our classifier is binary, we can evaluate the optimal classifier in a direct and natural way, using the area under the receiver operating characteristic (ROC) curve, which is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) vs. the fraction of false positives out of the total actual negatives (FPR = false positive rate).
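A minimal sketch of building an ROC curve and its area with scikit-learn, from illustrative decision scores such as any of the classifiers above could produce.

```python
# ROC curve (TPR vs. FPR over thresholds) and its area.
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]  # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", auc(fpr, tpr))
```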
Within each algorithm, I compare the performance of the data before and after the feature scaling process described in "Data Preprocessing". Results are shown in Tables 1-4.
| Measure | Boosting (after rescale) | Boosting (before) | Logistic (after) | Logistic (before) | Linear SVM (after) | Linear SVM (before) | Gaussian SVM (after) | Gaussian SVM (before) | kNN (after) | kNN (before) |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy rate | 0.799 | 0.868 | 0.902 | 0.818 | 0.867 | 0.758 | 0.893 | 0.614 | 0.759 | 0.751 |
| MSE | 0.152 | 0.109 | 0.077 | 0.160 | 0.107 | 0.209 | 0.061 | 0.285 | 0.186 | 0.195 |
| Specificity | 0.635 | 0.830 | 0.865 | 0.810 | 0.788 | 0.699 | 0.853 | 0.0 [a] | 0.432 | 0.587 |
| Precision | 0.771 | 0.849 | 0.905 | 0.830 | 0.836 | 0.769 | 0.878 | 0.614 | 0.709 | 0.717 |
| Sensitivity | 0.943 | 0.914 | 0.930 | 0.856 | 0.951 | 0.885 | 0.943 | 1.0 | 0.994 | 0.945 |
| F score | 0.828 | 0.875 | 0.913 | 0.830 | 0.877 | 0.794 | 0.896 | 0.732 | 0.805 | 0.793 |

Table 1. Performance before and after rescaling on data of size 839x1197. [a] Measure was not defined for some labels.
| Measure | Boosting (after rescale) [b] | Boosting (before) | Logistic (after) | Logistic (before) | Linear SVM (after) | Linear SVM (before) | Gaussian SVM (after) | Gaussian SVM (before) | kNN (after) | kNN (before) |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy rate | 0.913 | 0.869 | 0.945 | 0.933 | 0.949 | 0.926 | 0.958 | 0.619 | 0.906 | 0.882 |
| MSE | 0.068 | 0.104 | 0.049 | 0.062 | 0.043 | 0.063 | 0.036 | 0.245 | 0.067 | 0.078 |
| Specificity | 0.841 | 0.777 | 0.896 | 0.877 | 0.907 | 0.865 | 0.916 | 0.0 [a] | 0.761 | 0.725 |
| Precision | 0.923 | 0.866 | 0.945 | 0.926 | 0.951 | 0.921 | 0.968 | 0.619 | 0.873 | 0.856 |
| Sensitivity | 0.933 | 0.919 | 0.964 | 0.963 | 0.963 | 0.959 | 0.964 | 1.0 [a] | 0.985 | 0.967 |
| F score | 0.927 | 0.892 | 0.954 | 0.943 | 0.957 | 0.938 | 0.966 | 0.759 | 0.925 | 0.907 |

Table 2. Performance before and after scaling on data of size 2743x1197. [a] Measure was not defined for some labels. [b] Overflow encountered.
| Measure | Boosting (after rescale) | Boosting (before) [b] | Logistic (after) | Logistic (before) | Linear SVM (after) | Linear SVM (before) | Gaussian SVM (after) | Gaussian SVM (before) | kNN (after) | kNN (before) |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy rate | 0.914 | 0.892 | 0.913 | 0.901 | 0.915 | 0.910 | 0.932 | 0.595 | 0.898 | 0.868 |
| MSE | 0.065 | 0.086 | 0.081 | 0.095 | 0.070 | 0.078 | 0.052 | 0.245 | 0.073 | 0.095 |
| Specificity | 0.872 | 0.841 | 0.877 | 0.842 | 0.883 | 0.860 | 0.907 | 0.0 [a] | 0.777 | 0.726 |
| Precision | 0.911 | 0.897 | 0.914 | 0.905 | 0.919 | 0.913 | 0.934 | 0.595 | 0.863 | 0.839 |
| Sensitivity | 0.943 | 0.920 | 0.936 | 0.934 | 0.935 | 0.937 | 0.948 | 1.0 [a] | 0.979 | 0.959 |
| F score | 0.927 | 0.908 | 0.925 | 0.918 | 0.926 | 0.924 | 0.941 | 0.743 | 0.917 | 0.894 |

Table 3. Performance before and after scaling on data of size 5144x1197. [a] Measure was not defined for some labels. [b] Overflow encountered.
| Measure | Boosting (after rescale) | Boosting (before) | Logistic (after) [b] | Logistic (before) [b] | Linear SVM (after) | Linear SVM (before) | Gaussian SVM (after) | Gaussian SVM (before) | kNN (after) | kNN (before) |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy rate | 0.936 | 0.924 | 0.937 | 0.940 | 0.943 | 0.942 | 0.955 | 0.616 | 0.943 | 0.924 |
| MSE | 0.048 | 0.059 | 0.060 | 0.056 | 0.047 | 0.049 | 0.034 | 0.268 | 0.043 | 0.059 |
| Specificity | 0.938 | 0.919 | 0.940 | 0.933 | 0.948 | 0.937 | 0.967 | NaN [a] | 0.916 | 0.891 |
| Precision | 0.900 | 0.880 | 0.887 | 0.900 | 0.900 | 0.897 | 0.942 | 0.0 | 0.872 | 0.844 |
| Sensitivity | 0.913 | 0.901 | 0.931 | 0.937 | 0.934 | 0.940 | 0.921 | 0.0 | 0.964 | 0.946 |
| F score | 0.905 | 0.889 | 0.906 | 0.916 | 0.914 | 0.916 | 0.930 | 0.0 | 0.915 | 0.891 |

Table 4. Performance before and after scaling on data of size 10523x1197. [a] Measure was not defined for some labels. [b] Overflow encountered.
Among the five algorithms, the Gaussian kernel SVM gives the highest cross-validation scores (0.893, 0.958, 0.932, 0.955). Based on the cross-validation scores, the overall performance of the five algorithms can be ranked as: Gaussian SVM > (Logistic Regression = Linear SVM) > Boosting > kNN. Without scaling, the Gaussian SVM algorithm performed much less well: accuracy fell from 0.958 to 0.619 for the 2743x1197 data and from 0.932 to 0.595 for the 5144x1197 data, so rescaling almost doubled it. Thus, feature scaling appears to contribute a lot to the Gaussian kernel SVM.
Since the experiment was a binary classification test, sensitivity and specificity need to be considered for misclassification. The highest specificity scores are obtained by the Gaussian SVM (0.853, 0.985, 0.979, 0.964), which reflects very good performance in terms of predicting when chasing does not occur. K-Nearest Neighbors generates the highest sensitivity scores (0.994, 0.983, 0.986, 0.986), which, on the other hand, indicates that a great percentage of chasing behaviors are correctly identified (Figures 6 and 7).

Figure 6. Confusion matrix for SVM with Gaussian kernel on data 10523x1197.

Figure 7. Confusion matrix for kNN on data 10523x1197.
All 5 algorithms show the same pattern as the data scale increases: the bigger the data, the better the algorithms build the model, resulting in better predicted outcomes and higher scores. Under the same PCA procedure, Gaussian SVM took somewhat more processing time than the other algorithms. Increases in the length of the data affected the speed of Gaussian SVM most visibly, with about a four-fold increase in run-time on my computer.
The following ROC plots, with the area under each curve computed, reflect the ranks of the five algorithms across the different data scales (Figures 8-11).

Figure 8. ROC plots for the different algorithms under 4-fold cross-validation on rescaled data 839x1197: Gaussian SVM mean ROC area = 0.9799, logistic regression mean ROC area = 0.9704, linear SVM mean ROC area = 0.9663, boosting mean ROC area = 0.9469, kNN mean ROC area = 0.9127.
Figure 9. ROC plots for the different algorithms under 4-fold cross-validation on rescaled data 2743x1197: Gaussian SVM mean ROC area = 0.9895, logistic regression mean ROC area = 0.9837, linear SVM mean ROC area = 0.9816, kNN mean ROC area = 0.9669, boosting mean ROC area = 0.9658.
Figure 10. ROC plots for the different algorithms under 4-fold cross-validation on rescaled data 5144x1197: Gaussian SVM mean ROC area = 0.9820, linear SVM mean ROC area = 0.9751, logistic regression mean ROC area = 0.9740, kNN mean ROC area = 0.9701, boosting mean ROC area = 0.9698.
Figure 11. ROC plots for the different algorithms under 4-fold cross-validation on rescaled data 10523x1197: Gaussian SVM mean ROC area = 0.9881, linear SVM mean ROC area = 0.9846, logistic regression mean ROC area = 0.9837, kNN mean ROC area = 0.9819, boosting mean ROC area = 0.9782.
Discussion

In this thesis we explored the issue of whether machine learning algorithms can be used to predict an animal behavior and, if so, which ML methods seem most effective at the task. To date there has only been one such example published in the literature (Kabra et al., 2013). Our application was to video data recorded for Drosophila melanogaster, and the behavior of interest was 'chasing', a behavior that is common among flies, often as a precursor to aggression.
We compared five machine learning models: GentleBoost, Support Vector Machine with linear kernel, Support Vector Machine with Gaussian kernel, Logistic Regression and k-Nearest Neighbors. Each was then fit to four video recordings of differing lengths. In general the machine learning models all fitted relatively well on these high-dimensional data sets, in which we used 1197 features to summarize the animals. Among these five learning algorithms, SVM with Gaussian kernel gives robustly accurate rates for the purpose of classification, while the others give similar predictions on validation data before and after scaling. Scaling nearly doubles the performance of SVM with Gaussian kernel. K-nearest neighbors, as a lazy algorithm, performs less well, but its processing speed on my computer was the best of these methods.
As might be expected, performance of all algorithms tended to improve as the size of the dataset increased. However, the story was more nuanced than that, as sometimes performance of an algorithm (in terms of sensitivity and specificity) would decrease for a given, larger dataset. Why did this occur in our case? One possibility is that it might be due to inaccuracies in the behavior labeling. Another possibility is that the longer video contained instances in which the behavior was more difficult to predict, because it was atypical in some way. We note that even when the data scale is at its largest, 10523x1197, the specificities and sensitivities are not the highest. We would expect more experimental replicates to provide more robust estimates of overall performance.
I chose to use 4-fold cross-validation so that there is a 3:1 ratio of training to cross-validation data, as suggested by Andrew Ng of Stanford University. Though I retained 150 principal components, PCA failed to reduce the data efficiently without also including rescaling. This is a reflection of the fact that the feature data are quite unbalanced, varying from 0.001 to 100. For these data, the choice of kernel for the support vector machine (Gaussian or linear) also does not appear to make a significant difference.
Overall, we have shown that ML algorithms can be used to annotate the fly behavior of chasing based upon automatically generated video imaging. Many of the machine learning algorithms performed this task well, with a key first step appearing to be that of feature scaling, to ensure that all features vary over the same scale. The single best performing algorithm appeared to be the SVM with Gaussian kernel, but a more rigorous analysis is needed to determine whether this remains true for other behaviors, other fly genotypes, or other experimental conditions. The high complexity and time commitment involved in such an analysis prevents the inclusion of such a comprehensive study here.
References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Boser, B. E., et al. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5, 144-152.

Branson, K., Robie, A. A., Bender, J., Perona, P., & Dickinson, M. H. (2009). High-throughput ethomics in large groups of Drosophila. Nature Methods, 6, 451-457. doi:10.1038/nmeth.1328

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Kluwer Academic Publishers, 1-43.

Chapelle, O. (2007). Training a support vector machine in the primal (Vol. 19).

Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27. doi:10.1109/TIT.1967.1053964

d'Aspremont, A., El Ghaoui, L., Jordan, M. I., & Lanckriet, G. R. G. (2004). A direct formulation for sparse PCA using semidefinite programming. NIPS.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S., & Branson, K. (2013). JAABA: interactive machine learning for automatic annotation of animal behavior. Nature Methods, 10(1), 64-67. doi:10.1038/nmeth.2281

Kearns, M. (1988). Thoughts on hypothesis boosting. Unpublished manuscript (Machine Learning class project, December 1988).

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197-227. Boston, MA: Kluwer Academic Publishers.