15.1.1.2.1.10. cobra.flux_analysis.sampling

Module implementing flux sampling for cobra models.

New samplers should derive from the abstract HRSampler class where possible to provide a uniform interface.

15.1.1.2.1.10.1. Module Contents

cobra.flux_analysis.sampling.mp_init(obj)[source]

Initialize the multiprocessing pool.

cobra.flux_analysis.sampling.shared_np_array(shape, data=None, integer=False)[source]

Create a new numpy array that resides in shared memory.

Parameters:
  • shape (tuple of ints) – The shape of the new array.
  • data (numpy.array) – Data to copy to the new array. Has to have the same shape.
  • integer (boolean) – Whether to use an integer array. Defaults to False which means float array.
cobra.flux_analysis.sampling._step(sampler, x, delta, fraction=None, tries=0)[source]

Sample a new feasible point from the point x in direction delta.

class cobra.flux_analysis.sampling.HRSampler(model, thinning, nproj=None, seed=None)[source]

The abstract base class for hit-and-run samplers.

Parameters:
  • model (cobra.Model) – The cobra model from which to generate samples.
  • thinning (int) – The thinning factor of the generated sampling chain. A thinning of 10 means samples are returned every 10 steps.
  • nproj (int > 0, optional) – How often to reproject the sampling point into the feasibility space. Avoids numerical issues at the cost of lower sampling. If you observe many equality constraint violations with sampler.validate you should lower this number.
  • seed (int > 0, optional) – The random number seed that should be used.
model

cobra.Model – The cobra model from which the samples get generated.

thinning

int – The currently used thinning factor.

n_samples

int – The total number of samples that have been generated by this sampler instance.

retries

int – The overall of sampling retries the sampler has observed. Larger values indicate numerical instabilities.

problem

collections.namedtuple – A python object whose attributes define the entire sampling problem in matrix form. See docstring of Problem.

warmup

a numpy matrix – A matrix of with as many columns as reactions in the model and more than 3 rows containing a warmup sample in each row. None if no warmup points have been generated yet.

nproj

int – How often to reproject the sampling point into the feasibility space.

seed

positive integer, optional – Sets the random number seed. Initialized to the current time stamp if None.

fwd_idx

np.array – Has one entry for each reaction in the model containing the index of the respective forward variable.

rev_idx

np.array – Has one entry for each reaction in the model containing the index of the respective reverse variable.

__init__(model, thinning, nproj=None, seed=None)[source]

Initialize a new sampler object.

__build_problem()

Build the matrix representation of the sampling problem.

generate_fva_warmup()[source]

Generate the warmup points for the sampler.

Generates warmup points by setting each flux as the sole objective and minimizing/maximizing it. Also caches the projection of the warmup points into the nullspace for non-homogeneous problems (only if necessary).

_reproject(p)[source]

Reproject a point into the feasibility region.

This function is guaranteed to return a new feasible point. However, no guarantees in terms of proximity to the original point can be made.

Parameters:p (numpy.array) – The current sample point.
Returns:A new feasible point. If p was feasible it wil return p.
Return type:numpy.array
_random_point()[source]

Find an approximately random point in the flux cone.

_is_redundant(matrix, cutoff=None)[source]

Identify rdeundant rows in a matrix that can be removed.

_bounds_dist(p)[source]

Get the lower and upper bound distances. Negative is bad.

sample(n, fluxes=True)[source]

Abstract sampling function.

Should be overwritten by child classes.

batch(batch_size, batch_num, fluxes=True)[source]

Create a batch generator.

This is useful to generate n batches of m samples each.

Parameters:
  • batch_size (int) – The number of samples contained in each batch (m).
  • batch_num (int) – The number of batches in the generator (n).
  • fluxes (boolean) – Whether to return fluxes or the internal solver variables. If set to False will return a variable for each forward and backward flux as well as all additional variables you might have defined in the model.
Yields:

pandas.DataFrame – A DataFrame with dimensions (batch_size x n_r) containing a valid flux sample for a total of n_r reactions (or variables if fluxes=False) in each row.

validate(samples)[source]

Validate a set of samples for equality and inequality feasibility.

Can be used to check whether the generated samples and warmup points are feasible.

Parameters:samples (numpy.matrix) – Must be of dimension (n_samples x n_reactions). Contains the samples to be validated. Samples must be from fluxes.
Returns:A one-dimensional numpy array of length containing a code of 1 to 3 letters denoting the validation result:
  • ’v’ means feasible in bounds and equality constraints
  • ’l’ means a lower bound violation
  • ’u’ means a lower bound validation
  • ’e’ means and equality constraint violation
Return type:numpy.array
class cobra.flux_analysis.sampling.ACHRSampler(model, thinning=100, nproj=None, seed=None)[source]

Artificial Centering Hit-and-Run sampler.

A sampler with low memory footprint and good convergence.

Parameters:
  • model (a cobra model) – The cobra model from which to generate samples.
  • thinning (int, optional) – The thinning factor of the generated sampling chain. A thinning of 10 means samples are returned every 10 steps.
  • nproj (int > 0, optional) – How often to reproject the sampling point into the feasibility space. Avoids numerical issues at the cost of lower sampling. If you observe many equality constraint violations with sampler.validate you should lower this number.
  • seed (int > 0, optional) – Sets the random number seed. Initialized to the current time stamp if None.
model

cobra.Model – The cobra model from which the samples get generated.

thinning

int – The currently used thinning factor.

n_samples

int – The total number of samples that have been generated by this sampler instance.

problem

collections.namedtuple – A python object whose attributes define the entire sampling problem in matrix form. See docstring of Problem.

warmup

a numpy matrix – A matrix of with as many columns as reactions in the model and more than 3 rows containing a warmup sample in each row. None if no warmup points have been generated yet.

retries

int – The overall of sampling retries the sampler has observed. Larger values indicate numerical instabilities.

seed

positive integer, optional – Sets the random number seed. Initialized to the current time stamp if None.

nproj

int – How often to reproject the sampling point into the feasibility space.

fwd_idx

np.array – Has one entry for each reaction in the model containing the index of the respective forward variable.

rev_idx

np.array – Has one entry for each reaction in the model containing the index of the respective reverse variable.

prev

numpy array – The current/last flux sample generated.

center

numpy array – The center of the sampling space as estimated by the mean of all previously generated samples.

Notes

ACHR generates samples by choosing new directions from the sampling space’s center and the warmup points. The implementation used here is the same as in the Matlab Cobra Toolbox [2]_ and uses only the initial warmup points to generate new directions and not any other previous iterates. This usually gives better mixing since the startup points are chosen to span the space in a wide manner. This also makes the generated sampling chain quasi-markovian since the center converges rapidly.

Memory usage is roughly in the order of (2 * number reactions)^2 due to the required nullspace matrices and warmup points. So large models easily take up a few GB of RAM.

References

[1]Direction Choice for Accelerated Convergence in Hit-and-Run Sampling David E. Kaufman Robert L. Smith Operations Research 199846:1 , 84-95 https://doi.org/10.1287/opre.46.1.84
[2]https://github.com/opencobra/cobratoolbox
__init__(model, thinning=100, nproj=None, seed=None)[source]

Initialize a new ACHRSampler.

__single_iteration()
sample(n, fluxes=True)[source]

Generate a set of samples.

This is the basic sampling function for all hit-and-run samplers.

Parameters:
  • n (int) – The number of samples that are generated at once.
  • fluxes (boolean) – Whether to return fluxes or the internal solver variables. If set to False will return a variable for each forward and backward flux as well as all additional variables you might have defined in the model.
Returns:

Returns a matrix with n rows, each containing a flux sample.

Return type:

numpy.matrix

Notes

Performance of this function linearly depends on the number of reactions in your model and the thinning factor.

cobra.flux_analysis.sampling._sample_chain(args)[source]

Sample a single chain for OptGPSampler.

center and n_samples are updated locally and forgotten afterwards.

class cobra.flux_analysis.sampling.OptGPSampler(model, processes, thinning=100, nproj=None, seed=None)[source]

A parallel optimized sampler.

A parallel sampler with fast convergence and parallel execution. See [1]_ for details.

Parameters:
  • model (cobra.Model) – The cobra model from which to generate samples.
  • processes (int) – The number of processes used during sampling.
  • thinning (int, optional) – The thinning factor of the generated sampling chain. A thinning of 10 means samples are returned every 10 steps.
  • nproj (int > 0, optional) – How often to reproject the sampling point into the feasibility space. Avoids numerical issues at the cost of lower sampling. If you observe many equality constraint violations with sampler.validate you should lower this number.
  • seed (int > 0, optional) – Sets the random number seed. Initialized to the current time stamp if None.
model

cobra.Model – The cobra model from which the samples get generated.

thinning

int – The currently used thinning factor.

n_samples

int – The total number of samples that have been generated by this sampler instance.

problem

collections.namedtuple – A python object whose attributes define the entire sampling problem in matrix form. See docstring of Problem.

warmup

a numpy matrix – A matrix of with as many columns as reactions in the model and more than 3 rows containing a warmup sample in each row. None if no warmup points have been generated yet.

retries

int – The overall of sampling retries the sampler has observed. Larger values indicate numerical instabilities.

seed

positive integer, optional – Sets the random number seed. Initialized to the current time stamp if None.

nproj

int – How often to reproject the sampling point into the feasibility space.

fwd_idx

np.array – Has one entry for each reaction in the model containing the index of the respective forward variable.

rev_idx

np.array – Has one entry for each reaction in the model containing the index of the respective reverse variable.

prev

numpy.array – The current/last flux sample generated.

center

numpy.array – The center of the sampling space as estimated by the mean of all previously generated samples.

Notes

The sampler is very similar to artificial centering where each process samples its own chain. Initial points are chosen randomly from the warmup points followed by a linear transformation that pulls the points towards the a little bit towards the center of the sampling space.

If the number of processes used is larger than one the requested number of samples is adjusted to the smallest multiple of the number of processes larger than the requested sample number. For instance, if you have 3 processes and request 8 samples you will receive 9.

Memory usage is roughly in the order of (2 * number reactions)^2 due to the required nullspace matrices and warmup points. So large models easily take up a few GB of RAM. However, most of the large matrices are kept in shared memory. So the RAM usage is independent of the number of processes.

References

[1]Megchelenbrink W, Huynen M, Marchiori E (2014) optGpSampler: An Improved Tool for Uniformly Sampling the Solution-Space of Genome-Scale Metabolic Networks. PLoS ONE 9(2): e86587. https://doi.org/10.1371/journal.pone.0086587
__init__(model, processes, thinning=100, nproj=None, seed=None)[source]

Initialize a new OptGPSampler.

sample(n, fluxes=True)[source]

Generate a set of samples.

This is the basic sampling function for all hit-and-run samplers.

n : int
The minimum number of samples that are generated at once (see Notes).
fluxes : boolean
Whether to return fluxes or the internal solver variables. If set to False will return a variable for each forward and backward flux as well as all additional variables you might have defined in the model.
Returns:Returns a matrix with n rows, each containing a flux sample.
Return type:numpy.matrix

Notes

Performance of this function linearly depends on the number of reactions in your model and the thinning factor.

If the number of processes is larger than one, computation is split across as the CPUs of your machine. This may shorten computation time. However, there is also overhead in setting up parallel computation so we recommend to calculate large numbers of samples at once (n > 1000).

__getstate__()[source]

Return the object for serialization.

cobra.flux_analysis.sampling.sample(model, n, method="optgp", thinning=100, processes=1, seed=None)[source]

Sample valid flux distributions from a cobra model.

The function samples valid flux distributions from a cobra model. Currently we support two methods:

  1. ‘optgp’ (default) which uses the OptGPSampler that supports parallel
    sampling [1]_. Requires large numbers of samples to be performant (n < 1000). For smaller samples ‘achr’ might be better suited.

or

  1. ‘achr’ which uses artificial centering hit-and-run. This is a single process method with good convergence [2]_.
Parameters:
  • model (cobra.Model) – The model from which to sample flux distributions.
  • n (int) – The number of samples to obtain. When using ‘optgp’ this must be a multiple of processes, otherwise a larger number of samples will be returned.
  • method (str, optional) – The sampling algorithm to use.
  • thinning (int, optional) – The thinning factor of the generated sampling chain. A thinning of 10 means samples are returned every 10 steps. Defaults to 100 which in benchmarks gives approximately uncorrelated samples. If set to one will return all iterates.
  • processes (int, optional) – Only used for ‘optgp’. The number of processes used to generate samples.
  • seed (positive integer, optional) – The random number seed to be used. Initialized to current time stamp if None.
Returns:

The generated flux samples. Each row corresponds to a sample of the fluxes and the columns are the reactions.

Return type:

pandas.DataFrame

Notes

The samplers have a correction method to ensure equality feasibility for long-running chains, however this will only work for homogeneous models, meaning models with no non-zero fixed variables or constraints ( right-hand side of the equalities are zero).

References

[1]Megchelenbrink W, Huynen M, Marchiori E (2014) optGpSampler: An Improved Tool for Uniformly Sampling the Solution-Space of Genome-Scale Metabolic Networks. PLoS ONE 9(2): e86587.
[2]Direction Choice for Accelerated Convergence in Hit-and-Run Sampling David E. Kaufman Robert L. Smith Operations Research 199846:1 , 84-95