Advanced Theano

Profiling

  • To enable time profiling of compiled functions, use the Theano flag profile=True
  • To also enable memory profiling, use the flag profile_memory=True
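
These flags are usually passed through the THEANO_FLAGS environment variable (e.g. THEANO_FLAGS=profile=True,profile_memory=True). As a minimal sketch (assuming the standard theano.config attributes of the same names), they can also be set from Python before any function is compiled:

import theano

theano.config.profile = True         # time profiling of compiled functions
theano.config.profile_memory = True  # also profile memory used by intermediate results

Theano then prints a profile for each compiled function when the process exits, like the output below.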

Theano output:

Function profiling
==================
  Message: train
  Time in 10000 calls to Function.__call__: 7.171231e+00s
  Time in Function.fn.__call__: 6.686692e+00s (93.243%)
  Time in thunks: 6.511275e+00s (90.797%)
  Total compile time: 6.550491e-01s
    Theano Optimizer time: 5.976810e-01s
       Theano validate time: 1.260662e-02s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.649593e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  87.0%    87.0%       5.665s       2.83e-04s     C     20000        2   <class 'theano.tensor.blas_c.CGemv'>
  11.5%    98.4%       0.746s       7.46e-06s     C     100000       10   <class 'theano.tensor.elemwise.Elemwise'>
   0.7%    99.1%       0.045s       2.27e-06s     C     20000        2   <class 'theano.tensor.basic.Alloc'>
   0.5%    99.6%       0.030s       1.01e-06s     C     30000        3   <class 'theano.tensor.elemwise.DimShuffle'>
   0.2%    99.8%       0.013s       1.34e-06s     C     10000        1   <class 'theano.tensor.elemwise.Sum'>
   0.2%   100.0%       0.012s       6.00e-07s     C     20000        2   <class 'theano.tensor.opt.Shape_i'>
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  87.0%    87.0%       5.665s       2.83e-04s     C     20000        2   CGemv{inplace}
   6.9%    93.9%       0.452s       4.52e-05s     C     10000        1   Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(
   1.8%    95.7%       0.116s       1.16e-05s     C     10000        1   Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i
   1.7%    97.4%       0.109s       1.09e-05s     C     10000        1   Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 
   0.7%    98.1%       0.045s       2.27e-06s     C     20000        2   Alloc
   0.3%    98.4%       0.020s       1.02e-06s     C     20000        2   InplaceDimShuffle{x}
   0.2%    98.6%       0.015s       1.50e-06s     C     10000        1   Elemwise{sub,no_inplace}
   0.2%    98.8%       0.014s       1.42e-06s     C     10000        1   Elemwise{gt,no_inplace}
   0.2%    99.1%       0.013s       1.34e-06s     C     10000        1   Sum
   0.2%    99.3%       0.013s       1.29e-06s     C     10000        1   Elemwise{neg,no_inplace}
   0.2%    99.4%       0.012s       6.00e-07s     C     20000        2   Shape_i{0}
   0.2%    99.6%       0.010s       9.84e-07s     C     10000        1   InplaceDimShuffle{1,0}
   0.1%    99.7%       0.010s       9.58e-07s     C     10000        1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   0.1%    99.8%       0.007s       6.95e-07s     C     10000        1   Elemwise{Cast{float64}}
   0.1%    99.9%       0.005s       5.46e-07s     C     10000        1   Elemwise{inv,no_inplace}
   0.1%   100.0%       0.005s       4.88e-07s     C     10000        1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  51.0%    51.0%       3.319s       3.32e-04s   10000     7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
  36.0%    87.0%       2.345s       2.35e-04s   10000    18 CGemv{inplace}(w, TensorConstant{-0.1}, x.T, Elemwise{Composite{[Composite{[Compo
   6.9%    93.9%       0.452s       4.52e-05s   10000    13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_
   1.8%    95.7%       0.116s       1.16e-05s   10000    16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, n
   1.7%    97.4%       0.109s       1.09e-05s   10000    14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwis
   0.5%    97.9%       0.031s       3.13e-06s   10000    12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
   0.2%    98.1%       0.015s       1.50e-06s   10000     4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
   0.2%    98.3%       0.014s       1.42e-06s   10000    15 Elemwise{gt,no_inplace}(Elemwise{ScalarSigmoid{output_types_preference=transfer_t
   0.2%    98.5%       0.014s       1.40e-06s   10000     5 Alloc(TensorConstant{0.0}, Shape_i{0}.0)
   0.2%    98.7%       0.013s       1.34e-06s   10000    17 Sum(Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i
   0.2%    98.9%       0.013s       1.33e-06s   10000     0 InplaceDimShuffle{x}(b)
   0.2%    99.1%       0.013s       1.29e-06s   10000    11 Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   0.2%    99.3%       0.010s       9.84e-07s   10000     2 InplaceDimShuffle{1,0}(x)
   0.1%    99.4%       0.010s       9.58e-07s   10000     9 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuff
   0.1%    99.6%       0.007s       7.11e-07s   10000     6 InplaceDimShuffle{x}(Shape_i{0}.0)
   0.1%    99.7%       0.007s       6.95e-07s   10000     8 Elemwise{Cast{float64}}(InplaceDimShuffle{x}.0)
   0.1%    99.8%       0.006s       6.18e-07s   10000     1 Shape_i{0}(x)
   0.1%    99.8%       0.006s       5.82e-07s   10000     3 Shape_i{0}(y)
   0.1%    99.9%       0.005s       5.46e-07s   10000    10 Elemwise{inv,no_inplace}(Elemwise{Cast{float64}}.0)
   0.1%   100.0%       0.005s       4.88e-07s   10000    19 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.1}, Sum.0
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Function profiling
==================
  Message: predict
  Time in 1 calls to Function.__call__: 4.870892e-04s
  Time in Function.fn.__call__: 4.608631e-04s (94.616%)
  Time in thunks: 4.491806e-04s (92.217%)
  Total compile time: 7.993293e-02s
    Theano Optimizer time: 7.383800e-02s
       Theano validate time: 2.010584e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.319906e-03s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.2%    94.2%       0.000s       4.23e-04s     C        1        1   <class 'theano.tensor.blas_c.CGemv'>
   4.0%    98.2%       0.000s       1.81e-05s     C        1        1   <class 'theano.tensor.elemwise.Elemwise'>
   0.7%    98.9%       0.000s       3.10e-06s     C        1        1   <class 'theano.tensor.basic.Alloc'>
   0.6%    99.5%       0.000s       2.86e-06s     C        1        1   <class 'theano.tensor.elemwise.DimShuffle'>
   0.5%   100.0%       0.000s       2.15e-06s     C        1        1   <class 'theano.tensor.opt.Shape_i'>
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  94.2%    94.2%       0.000s       4.23e-04s     C        1        1   CGemv{inplace}
   4.0%    98.2%       0.000s       1.81e-05s     C        1        1   Elemwise{Composite{[Composite{[Composite{[Composite{[GT(scalar_sigmoid
   0.7%    98.9%       0.000s       3.10e-06s     C        1        1   Alloc
   0.6%    99.5%       0.000s       2.86e-06s     C        1        1   InplaceDimShuffle{x}
   0.5%   100.0%       0.000s       2.15e-06s     C        1        1   Shape_i{0}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  94.2%    94.2%       0.000s       4.23e-04s      1     3 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
   4.0%    98.2%       0.000s       1.81e-05s      1     4 Elemwise{Composite{[Composite{[Composite{[Composite{[GT(scalar_sigmoid(i0), i1)]}
   0.7%    98.9%       0.000s       3.10e-06s      1     2 Alloc(TensorConstant{0.0}, Shape_i{0}.0)
   0.6%    99.5%       0.000s       2.86e-06s      1     0 InplaceDimShuffle{x}(b)
   0.5%   100.0%       0.000s       2.15e-06s      1     1 Shape_i{0}(x)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Function profiling
==================
  Message: Sum of all printed profiles at exit
  Time in 10001 calls to Function.__call__: 7.171718e+00s
  Time in Function.fn.__call__: 6.687153e+00s (93.243%)
  Time in thunks: 6.511724e+00s (90.797%)
  Total compile time: 7.349820e-01s
    Theano Optimizer time: 6.715190e-01s
       Theano validate time: 1.461720e-02s
    Theano Linker time (includes C, CUDA code generation/compiling): 3.081584e-02s

  [...]

Compilation pipeline

[Image: Theano compilation pipeline (pipeline1.png)]

Inplace optimization

  • There are two types of in-place operations:
    • An op that returns a view of its inputs (e.g. reshape, in-place transpose)
    • An op that writes its output into the memory space of one of its inputs
  • This allows some memory optimizations
  • An Op must tell Theano whether it works in place (see the sketch below)
  • In-place Ops add constraints on the order of execution
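
As an illustration (a minimal sketch, not from the original slides, assuming Theano's destroy_map / view_map Op attributes), a custom Op declares its in-place behaviour so Theano can reason about memory reuse and scheduling:

import theano.tensor as tt
from theano.gof import Op, Apply

class AddOneInplace(Op):
    # destroy_map says: output 0 overwrites input 0, so every other consumer
    # of that input must be scheduled before this node runs.
    destroy_map = {0: [0]}
    # (an Op returning a view would instead declare view_map = {0: [0]})

    def make_node(self, x):
        x = tt.as_tensor_variable(x)
        return Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        (x,) = inputs
        x += 1                      # write directly into the input's buffer
        output_storage[0][0] = x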

Conditions

IfElse

  • Build a condition over symbolic variables.
  • The IfElse Op takes a boolean condition and two variables to compute as inputs.
  • While the Switch Op evaluates both ‘output’ variables, the IfElse Op is lazy and only evaluates one variable, according to the condition.

IfElse Example: Comparison with Switch

from __future__ import absolute_import, print_function, division
import time

import numpy as np

import theano
from theano import tensor as tt
from six.moves import xrange
from theano.ifelse import ifelse

a, b = tt.scalars('a', 'b')
x, y = tt.matrices('x', 'y')

z_switch = tt.switch(tt.lt(a, b), tt.mean(x), tt.mean(y))
z_lazy = ifelse(tt.lt(a, b), tt.mean(x), tt.mean(y))

f_switch = theano.function([a, b, x, y], z_switch)
f_lazyifelse = theano.function([a, b, x, y], z_lazy)

val1 = 0.
val2 = 1.
big_mat1 = np.ones((10000, 1000))
big_mat2 = np.ones((10000, 1000))

n_times = 10

tic = time.clock()
for i in xrange(n_times):
    f_switch(val1, val2, big_mat1, big_mat2)
print('time spent evaluating both values %f sec' % (time.clock() - tic))

tic = time.clock()
for i in xrange(n_times):
    f_lazyifelse(val1, val2, big_mat1, big_mat2)
print('time spent evaluating one value %f sec' % (time.clock() - tic))

The IfElse Op spends less time (about half) than Switch, since it computes only one of the two variables instead of both.

>>> python ifelse_switch.py 
time spent evaluating both values 0.230000 sec
time spent evaluating one value 0.120000 sec

Note that the IfElse condition is a boolean while the Switch condition is a tensor, so Switch is more general.

It is important to use linker='vm' or linker='cvm'; otherwise IfElse will compute both variables and take the same computation time as the Switch Op. The linker is not currently set to ‘cvm’ by default, but it will be in the near future.
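
For instance (a sketch reusing the symbols from the example above, and assuming Theano's Mode/linker interface), the linker can be chosen per function when calling theano.function, or globally with THEANO_FLAGS=linker=cvm:

from theano import Mode

# Compile with the lazy VM linker so that ifelse evaluates only the needed branch.
f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                               mode=Mode(linker='vm'))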

Loops

Scan

  • A general form of recurrence, which can be used for looping.
  • Reduction and map (looping over the leading dimensions) are special cases of scan
  • You ‘scan’ a function along some input sequence, producing an output at each time-step
  • The function can see the previous K time-steps of its output
  • sum() could be computed by scanning the z + x(i) function over a list, given an initial state of z=0 (see the sketch after this list)
  • Often a for loop can be expressed as a scan() operation, and scan is the closest that Theano comes to looping.
  • The advantages of using scan over for loops:
    • The number of iterations can be part of the symbolic graph
    • Minimizes GPU transfers, if a GPU is involved
    • Computes gradients through sequential steps
    • Slightly faster than using a for loop in Python with a compiled Theano function
    • Can lower the overall memory usage by detecting the actual amount of memory needed
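
As a small sketch of the sum() bullet above (assumed code, not from the original slides), scanning z + x(i) over a vector with initial state z = 0 gives a running sum whose last element is the total:

import numpy as np
import theano
import theano.tensor as tt

x = tt.vector("x")

# step(x_i, z): x_i is the current sequence element, z the previous output
results, updates = theano.scan(fn=lambda x_i, z: z + x_i,
                               sequences=x,
                               outputs_info=tt.constant(0., dtype=x.dtype))
f_sum = theano.function([x], results[-1])

print(f_sum(np.arange(5).astype(theano.config.floatX)))  # 10.0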

Scan Example: Computing pow(A,k)

from __future__ import absolute_import, print_function, division
import theano
import theano.tensor as tt
from six.moves import xrange

k = tt.iscalar("k")
A = tt.vector("A")


def inner_fct(prior_result, A):
    return prior_result * A
# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
                              outputs_info=tt.ones_like(A),
                              non_sequences=A, n_steps=k)

# Scan has provided us with A**1 through A**k.  Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]

power = theano.function(inputs=[A, k],
                        outputs=final_result,
                        updates=updates)

print(power(list(range(10)), 2))
#[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]

Scan Example: Calculating a Polynomial

from __future__ import absolute_import, print_function, division
import numpy as np

import theano
import theano.tensor as tt

coefficients = tt.vector("coefficients")
x = tt.scalar("x")
max_coefficients_supported = 10000

# Generate the components of the polynomial
full_range = tt.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                  coeff * (free_var ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, full_range],
                                  non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function(inputs=[coefficients, x],
                                       outputs=polynomial)

test_coeff = np.asarray([1, 0, 2], dtype=np.float32)
print(calculate_polynomial(test_coeff, 3))
# 19.0

Exercise 4

  • Run both examples
  • Modify and execute the polynomial example to have the reduction done by scan

Exercise 5

  • In the last exercises, do you see a speedup with the GPU?
  • Where does it come from? (Use profile=True)
  • Is there something we can do to speed up the GPU version?

Printing/Drawing Theano graphs

Consider the following logistic regression model:

>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> rng = numpy.random
>>> # Training data
>>> N = 400
>>> feats = 784
>>> D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
>>> training_steps = 10000
>>> # Declare Theano symbolic variables
>>> x = T.matrix("x")
>>> y = T.vector("y")
>>> w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
>>> b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
>>> x.tag.test_value = D[0]
>>> y.tag.test_value = D[1]
>>> # Construct Theano expression graph
>>> p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
>>> prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
>>> # Compute gradients
>>> xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
>>> cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
>>> gw,gb = T.grad(cost, [w,b])
>>> # Training and prediction function
>>> train = theano.function(inputs=[x,y], outputs=[prediction, xent], updates=[[w, w-0.01*gw], [b, b-0.01*gb]], name = "train")
>>> predict = theano.function(inputs=[x], outputs=prediction, name = "predict")

We will now make use of Theano’s printing features to compare the unoptimized graph (prediction) to the optimized graph (predict).

Pretty Printing

>>> theano.printing.pprint(prediction) 
'gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))), TensorConstant{0.5})'

Debug Print

The graph before optimization:

>>> theano.printing.debugprint(prediction) 
Elemwise{gt,no_inplace} [@A] ''
 |Elemwise{true_div,no_inplace} [@B] ''
 | |DimShuffle{x} [@C] ''
 | | |TensorConstant{1} [@D]
 | |Elemwise{add,no_inplace} [@E] ''
 |   |DimShuffle{x} [@F] ''
 |   | |TensorConstant{1} [@D]
 |   |Elemwise{exp,no_inplace} [@G] ''
 |     |Elemwise{sub,no_inplace} [@H] ''
 |       |Elemwise{neg,no_inplace} [@I] ''
 |       | |dot [@J] ''
 |       |   |x [@K]
 |       |   |w [@L]
 |       |DimShuffle{x} [@M] ''
 |         |b [@N]
 |DimShuffle{x} [@O] ''
   |TensorConstant{0.5} [@P]

The graph after optimization:

>>> theano.printing.debugprint(predict) 
Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}} [@A] ''   4
 |CGemv{inplace} [@B] ''   3
 | |Alloc [@C] ''   2
 | | |TensorConstant{0.0} [@D]
 | | |Shape_i{0} [@E] ''   1
 | |   |x [@F]
 | |TensorConstant{1.0} [@G]
 | |x [@F]
 | |w [@H]
 | |TensorConstant{0.0} [@D]
 |InplaceDimShuffle{x} [@I] ''   0
 | |b [@J]
 |TensorConstant{(1,) of 0.5} [@K]

Picture Printing of Graphs

pydotprint requires graphviz and either pydot or pydot-ng.

The graph before optimization:

>>> theano.printing.pydotprint(prediction, outfile="pics/logreg_pydotprint_prediction.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_prediction.png
[Image: unoptimized prediction graph (logreg_pydotprint_prediction1.png)]

The graph after optimization:

>>> theano.printing.pydotprint(predict, outfile="pics/logreg_pydotprint_predict.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_predict.png
[Image: optimized predict graph (logreg_pydotprint_predict1.png)]

The optimized training graph:

>>> theano.printing.pydotprint(train, outfile="pics/logreg_pydotprint_train.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_train.png
[Image: optimized training graph (logreg_pydotprint_train1.png)]

Debugging

  • Run with the Theano flag compute_test_value = {'off', 'ignore', 'warn', 'raise'}
    • Runs the code as the graph is created
    • Allows you to find bugs earlier (e.g. a shape mismatch)
    • Makes it easier to identify where the problem is in your code
    • Uses the values of constants and shared variables directly
    • For pure symbolic variables, use x.tag.test_value = numpy.random.rand(5, 10) (see the sketch after this list)
  • Run with the flag mode=FAST_COMPILE
    • Few optimizations
    • Runs Python code (better error messages; can be debugged interactively in the Python debugger)
  • Run with the flag mode=DebugMode
    • 100-1000x slower
    • Tests all optimization steps from the original graph to the final graph
    • Checks many things that an Op should/shouldn’t do
    • Executes both the Python and C code versions
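
A minimal sketch of the test-value mechanism (assumed usage; the shapes below are arbitrary): with compute_test_value enabled, a shape mismatch is reported at the line that builds the bad node, not later inside the compiled function:

import numpy
import theano
import theano.tensor as T

theano.config.compute_test_value = 'raise'   # fail as soon as a node cannot be evaluated

x = T.matrix('x')
W = T.matrix('W')
x.tag.test_value = numpy.random.rand(5, 10).astype(theano.config.floatX)
W.tag.test_value = numpy.random.rand(20, 3).astype(theano.config.floatX)  # wrong shape on purpose

y = T.dot(x, W)   # raises here, at graph-construction time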

Known limitations

  • The compilation phase is distinct from the execution phase
    • Use a_tensor_variable.eval() to make this less visible (see the small sketch after this list)
  • Compilation time can be significant
    • Amortize it by using functions on big inputs or by reusing functions
  • Execution overhead
    • We have worked on this, but more work is needed
    • So a graph needs a certain number of operations per call to be worthwhile
  • Compilation time is superlinear in the size of the graph
    • Hundreds of nodes is fine
    • Disabling a few optimizations can speed up compilation
    • Too many nodes usually indicates a problem with the graph
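
A small sketch of the eval() shortcut mentioned above (assumed usage): it compiles and runs the expression in one call, caching the compiled function, which hides the compile/execute split for quick interactive checks:

import numpy as np
import theano.tensor as tt

a = tt.vector('a')
b = (a ** 2).sum()

# eval() compiles a function for `b` on first use and reuses it afterwards.
print(b.eval({a: np.asarray([1., 2., 3.], dtype=a.dtype)}))  # 14.0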