Selected works and preprints | ||
2007 | ||
[EN] A.N. Gorban and O. Radulescu
The concepts of distributed robustness and r-robustness, proposed by biologists to explain a variety of stability phenomena in molecular biology, are analysed. Then the robustness of the relaxation time is discussed, using a chemical reaction description of genetic and signalling networks. First, the following result for linear networks is obtained: for large multiscale systems with a hierarchical distribution of time scales, the variance of the inverse relaxation time (as well as the variance of the stationary rate) is much lower than the variance of the separate constants. Moreover, it can tend to 0 faster than 1/n, where n is the number of reactions. Similar phenomena are valid in the nonlinear case as well. As a numerical illustration, a model of the signalling network for the important transcription factor NFkB is used. | ||
[EN] A.N. Gorban and A.Y. Zinovyev In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 348 distinct bacterial genomes available in Genbank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. A new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of an analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, which is also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to the real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter. | ||
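The coordinates named in this abstract can be illustrated with a minimal sketch. It assumes an in-frame coding sequence written in uppercase A, C, G, T only; the function names are hypothetical and are not taken from the paper. Four nucleotides at three codon positions give 12 frequencies, and the three normalization constraints (one per position) leave 9 independent coordinates. The second function computes the mean-field (complete independence) approximation, in which a codon frequency is the product of its three position-specific nucleotide frequencies.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"

def position_specific_frequencies(coding_seq):
    """Return {(position, nucleotide): frequency} for codon positions 0, 1, 2."""
    usable = len(coding_seq) - len(coding_seq) % 3  # drop an incomplete trailing codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one full codon")
    counts = Counter((i % 3, coding_seq[i]) for i in range(usable))
    return {(pos, nt): counts[(pos, nt)] / n_codons
            for pos in range(3) for nt in NUCLEOTIDES}

def mean_field_codon_usage(freqs):
    """Approximate each codon frequency by the product of position-specific frequencies."""
    return {a + b + c: freqs[(0, a)] * freqs[(1, b)] * freqs[(2, c)]
            for a, b, c in product(NUCLEOTIDES, repeat=3)}

# Toy example: four codons ATG GCG AAA TAA.
freqs = position_specific_frequencies("ATGGCGAAATAA")
approx = mean_field_codon_usage(freqs)
```

Comparing `approx` with the observed codon frequencies of a genome is one simple way to see how far real codon usage departs from the independence model.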
2006 | ||
[EN] Ovidiu Radulescu, Alexander N. Gorban, Sergei Vakulenko, Andrei Zinovyev
We review several mathematical methods that allow modules and hierarchies with several levels of complexity to be identified in biological systems. These methods are based either on the properties of the input-output characteristic of the modules or on global properties of the dynamics, such as the distribution of timescales or the stratification of attractors with variable dimension. We also discuss the consequences of the hierarchical structure for the robustness of biological processes. Stratified attractors lead to Waddington-type canalization effects. Successive application of the many-to-one mapping relating parameters of different levels in a hierarchy of models (analogous to the renormalization operation from statistical mechanics) leads to concentration and robustness of those properties that are common to many levels of complexity. Examples such as the response of the transcription factor NFκB to signalling and the segmentation patterns in the development of Drosophila are used as illustrations of the theoretical ideas. | ||
2005 | ||
[EN] A.N. Gorban, T.G. Popova, A.Yu. Zinovyev
Three results are presented. First, we prove the existence of a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in Genbank in August 2004, and explain its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, our 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that there are four basic "pure" types of this model observed in nature: "parallel triangles", "perpendicular triangles", the degenerate case and the flower-like type. | ||
[EN] A.N. Gorban, T.G.Popova, A.Yu. Zinovyev Coding information is the main source of heterogeneity (non-randomness) in the sequences of microbial genomes. The heterogeneity corresponds to a cluster structure in triplet distributions of relatively short genomic fragments (200-400bp). We found a universal 7-cluster structure in microbial genomic sequences and explained its properties. We show that codon usage of bacterial genomes is a multi-linear function of their genomic G+C-content with high accuracy. Based on the analysis of 143 completely sequenced bacterial genomes available in Genbank in August 2004, we show that there are four "pure" types of the 7-cluster structure observed. All 143 cluster animated 3D-scatters are collected in a database which is made available on our web-site (http://www.ihes.fr/~zinovyev/7clusters). The findings can be readily introduced into software for gene prediction, sequence alignment or microbial genomes classification. | ||
[EN] A.N. Gorban, M. Kudryashev, T. Popova
What are proteins, the working parts of the protein machines of living cells, made from? To answer this question, we need a technology to disassemble proteins into elementary functional details and to prepare a lumped description of such details. This lumped description might have multiple material realizations (in amino acids). Our hypothesis is that an informational approach to this problem is possible. We propose a method of hierarchical classification that makes the primary structure of a protein maximally non-random, and compare the resulting classifications with other classifications. The first step of the suggested research program is realized: the analysis of the protein binary alphabet in comparison with other amino acid classifications. | ||
[EN] A.N. Gorban, A.Yu. Zinovyev
In this paper, we give a tutorial for undergraduate students studying statistical methods and/or bioinformatics. The students learn how data visualization can help in the analysis of genomic sequences. Students start with a fragment of the genetic text of a bacterial genome and analyze its structure. By means of principal component analysis they "discover" that the information in the genome is encoded by non-overlapping triplets. Next, they learn to find gene positions. This exercise on principal component analysis and K-means clustering provides an opportunity for active study of basic bioinformatics notions. MATLAB program listings are given in the Appendix. | ||
2004 | ||
[EN] A.N. Gorban, A.Yu. Zinovyev
In special coordinates (codon position-specific nucleotide frequencies) bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 175 known bacterial genomes (Genbank, March 2005) belong to these lines with high accuracy, and these two lines are certainly different. The results of PCA of codon usage and the accuracy of the mean-field (context-free) approximation are presented. The first two principal components correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter. | ||
[EN] A.N. Gorban, T.G. Popova, A.Yu. Zinovyev Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the "in-phase" triplet distributions of relatively short genomic fragments (200-400bp). We found a universal 7-cluster structure in bacterial genomic sequences and explained its properties. We show that codon usage of bacterial genomes is a multi-linear function of their genomic G+C-content with high accuracy. Based on the analysis of 143 completely sequenced bacterial genomes available in Genbank in August 2004, we show that there are four "pure" types of the 7-cluster structure observed. All 143 cluster animated 3D-scatters are collected in a database that is available on our web-site: http://www.ihes.fr/~zinovyev/7clusters. The findings can be readily introduced into any software for gene prediction, sequence alignment or classification of bacterial genomes. | ||
[EN] Gorban, A.N. If we find a representation of an infinite-dimensional dynamical system as a nonlinear kinetic system with conservation of supports of distributions, then (after some additional technical steps) we can state that the asymptotics is finite-dimensional. This conservation of support has a quasi-biological interpretation, inheritance (if a gene was not present initially in an isolated population without mutations, then it cannot appear at a later time). These quasi-biological models can describe various physical, chemical, and, of course, biological systems. The finite-dimensional asymptotics demonstrates effects of "natural" selection. Estimates of the asymptotic dimension are presented. The support of an individual limit distribution is almost always small, but the union of such supports can be the whole space even for one solution. The following situation is possible: a solution is a finite set of narrow peaks that become narrower and move more slowly with time. It is possible that these peaks do not tend to fixed positions but continue moving, so that the path covered tends to infinity as $t \to \infty$. The drift equations for the motion of the peaks are obtained. Various types of stability are studied. As an example, models of cell division self-synchronization are studied. The appropriate construction of the notion of typicalness in infinite-dimensional spaces is discussed, and the "completely thin" sets are introduced. | ||
2003 | ||
[EN] A. Yu. Zinovyev, A. N. Gorban, T. G. Popova
A self-training technique for automated gene recognition, both in entire genomes and in unassembled ones, is proposed. It is based on a simple measure (namely, the vector of frequencies of non-overlapping triplets in a sliding window), and needs neither predetermined information nor preliminary learning. The sliding window length is the only tuning parameter. It should be chosen close to the average exon length typical of the DNA text under investigation. An essential feature of the proposed technique is preliminary visualization of the set of vectors in the subspace of the first three principal components. It was shown that the distribution of DNA sites has a bullet-like structure with one central cluster (corresponding to non-coding sites) and three or six flank ones (corresponding to protein-coding sites). The bullet-like structure revealed in the distribution seems to be a very interesting illustration of triplet usage in a DNA sequence. The method was examined on several genomes (the mitochondrion of P.wickerhamii, the bacterium C.crescentus and the primitive eukaryote S.cerevisiae). The percentage of correctly predicted nucleotides exceeds 90%. | ||
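The measure described in this abstract, the vector of frequencies of non-overlapping triplets in a sliding window, can be sketched as follows. This is a minimal illustration rather than the authors' software: the function name, the default `window` and `step` values, and the choice to skip triplets containing symbols other than A, C, G, T are all assumptions. The principal component projection and clustering steps are not shown.

```python
from collections import Counter
from itertools import product

# Fixed order of the 64 triplets, so that every vector has the same layout.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]

def window_triplet_vectors(seq, window=300, step=30):
    """Return (start, 64-dimensional frequency vector) for each sliding window."""
    if window < 3 or step < 1:
        raise ValueError("window must be >= 3 and step must be >= 1")
    vectors = []
    for start in range(0, len(seq) - window + 1, step):
        fragment = seq[start:start + window]
        # Non-overlapping triplets, read from the first symbol of the window.
        counts = Counter(fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3))
        total = sum(counts[t] for t in TRIPLETS)  # ignores ambiguous symbols such as N
        if total:
            vectors.append((start, [counts[t] / total for t in TRIPLETS]))
    return vectors

# Toy periodic "gene": shifting the window by one symbol changes the reading phase.
vectors = window_triplet_vectors("ATG" * 40, window=30, step=1)
```

In the toy example the window at position 0 sees only ATG, the window at position 1 sees only TGA, and the window at position 2 sees only GAT, which is the phase dependence that separates the coding clusters from one another.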
[EN] A. N. Gorban, A. Yu. Zinovyev, T. G. Popova Motivation: In several recent papers new algorithms were proposed for detecting coding regions without requiring a learning dataset of already known genes. In this paper we studied the cluster structure of several genomes in the space of codon usage. This allowed us to interpret some of the results obtained in other studies and to propose a simpler method, which is, nevertheless, fully functional. Results: Several complete genomic sequences were analyzed, using visualization of tables of triplet counts in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters, corresponding to protein-coding information in three possible phases in one of the two complementary strands and in the non-coding regions. Awareness of the existence of this structure allows development of methods for the segmentation of sequences into regions with the same coding phase and non-coding regions. This method may be completely unsupervised or may use some external information. Since the method does not need extraction of ORFs, it can be applied even to unassembled genomes. Accuracy calculated at the base-pair level (both sensitivity and specificity) exceeds 90%. This is no worse than methods such as HMM, while the method has the advantage of being much simpler and clearer. Availability: The software and datasets are available at http://www.ihes.fr/~zinovyev/bullet | ||
[EN] A. N. Gorban, A. Y. Zinovyev, D.C. Wunsch The method of elastic maps allows efficient construction of 1D, 2D and 3D non-linear approximations to principal manifolds with different topologies (piece of plane, sphere, torus, etc.) and projection of data onto them. We describe the idea of the method and demonstrate its applications in the analysis of genetic sequences. | ||
[EN] Alexander N. Gorban, Andrei Yu. Zinovyev, Tatyana G. Popova Motivation: In several recent papers new algorithms were proposed for detecting coding regions without requiring a learning dataset of already known genes. In this paper we interpret some of these results and propose a simpler method. | ||
[RU] Gorban A.N., Popova T.G., Sadovsky M.G.
The aim of this work is to study the relationship between the structure of a nucleotide sequence and the taxonomic position of its bearer. Classifications of bacterial 16S RNA nucleotide sequences are studied. A correlation is shown to exist between the taxonomic position of the bearers and the information structure of bacterial 16S RNA nucleotide sequences. Two sequences were considered close in structure if their frequency dictionaries were close in the Euclidean metric. A procedure for transforming a frequency dictionary is proposed that reveals features of the information structure of a symbol sequence. A comparative study of classifications based on real and transformed frequency dictionaries is carried out. Information-significant sites, the main factors distinguishing the resulting classes, are identified. The classification of real frequency dictionaries of thickness 3 correlates best with genus: as a rule, a genus is entirely included in one class, and exceptions are rare. In the hierarchical classification based on transformed frequency dictionaries, one or two taxonomic groups were separated at each stage. The structural differences between the resulting classes lie in the rare or, conversely, frequent (compared with the expected) occurrence of some words, the number of which is small. | ||
[EN] Gorban A.N., Zinovyev A.Yu., Popova T.G.
An approach that uses the idea of a distinguished coding phase in explicit form for the identification of protein-coding regions in a whole genome has been proposed. For several genomes an optimal window length for averaging the GC-content function and calculating codon frequencies has been found. A self-training procedure based on clustering in the multidimensional space of triplet frequencies is proposed. | ||
[EN] Gorban A.N., Zinovyev A.Yu., Popova T.G.
An overview of statistical methods of gene identification is given. Particular attention is paid to methods that do not need a training set of already known genes. After the analysis, several statistical approaches are proposed for computational exon identification in whole genomes. For several genomes an optimal window length for averaging the GC-content function and calculating codon frequencies has been found. A self-training procedure based on clustering in the multidimensional space of codon frequencies is proposed. | ||
[EN] Gorban A.N., Popova T.G., Sadovsky M.G.
The classifications of bacterial 16S RNA sequences developed over the real and transformed frequency dictionaries have been studied. Two sequences were considered to be close to each other when their frequency dictionaries were close in the Euclidean metric. A procedure to transform a dictionary is proposed that makes clear some features of the information pattern of a symbol sequence. A comparative study of classifications developed over the real frequency dictionaries vs. the transformed ones has been carried out. A correlation between the information pattern of nucleotide sequences and the taxonomy of the bearer of the sequence was found. The sites with high information value, which were the main factors of the difference between the classes in a classification, are found. The classification of nucleotide sequences developed over the real frequency dictionaries of thickness 3 reveals the best correlation with the genus of bacteria. As a rule, a set of sequences of the same genus is included entirely in one class, and exceptions occur rarely. A hierarchical classification yields one or two taxonomic groups at each level of the classification. An unexpectedly frequent or unexpectedly rare occurrence (in comparison to the expected one) of some sites within a sequence makes the basic difference between the structure patterns of the resulting classes; the number of those sites is not large. Further investigations are necessary in order to compare the sites revealed with those determined by other methodologies. | ||
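The closeness criterion in this abstract can be sketched in a few lines. The sketch assumes that a frequency dictionary of thickness 3 is the set of all overlapping words of length 3 in a linear sequence together with their frequencies; the function names are hypothetical, and the dictionary transformation and the classification procedure from the paper are not reproduced.

```python
from collections import Counter
from math import sqrt

def frequency_dictionary(seq, thickness=3):
    """Frequencies of all overlapping words of the given length in seq."""
    n_words = len(seq) - thickness + 1
    if n_words <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + thickness] for i in range(n_words))
    return {word: count / n_words for word, count in counts.items()}

def euclidean_distance(dict_a, dict_b):
    """Euclidean distance between two dictionaries; absent words count as frequency 0."""
    words = set(dict_a) | set(dict_b)
    return sqrt(sum((dict_a.get(w, 0.0) - dict_b.get(w, 0.0)) ** 2 for w in words))

# Two sequences with no words in common are at distance sqrt(2).
dist = euclidean_distance(frequency_dictionary("AAAA"), frequency_dictionary("CCCC"))
```

A distance matrix built this way for a set of 16S RNA sequences could then be passed to any standard clustering routine.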
[RU] Gorban A.N., Khlebopros R.G.
A popular-science book on the optimality principle in evolution, the methodology of mathematical modelling, Darwin, and natural selection. | ||
[RU] Bugaenko N.N., Gorban A.N., Sadovsky M.G.
The problem of determining the information capacity of nucleotide sequences is considered. Expressions are obtained for reconstructing higher-order frequency dictionaries from lower-order ones. Features of the information characteristics of real nucleotide sequences that reliably distinguish them from random texts are described. | ||
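The abstract does not state the reconstruction expressions. As an illustration only, the sketch below assumes the maximum-entropy (Markov-type) estimate, in which the frequency of a triplet abc is approximated by f(ab) f(bc) / f(b). The function names are hypothetical. Words are counted in a linear sequence, so the estimates sum to 1 only approximately.

```python
from collections import Counter

def word_frequencies(seq, length):
    """Frequencies of all overlapping words of the given length in seq."""
    n_words = len(seq) - length + 1
    if n_words <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + length] for i in range(n_words))
    return {word: count / n_words for word, count in counts.items()}

def reconstruct_triplets(pairs, singles):
    """Estimate triplet frequencies as f(ab) * f(bc) / f(b) (assumed formula)."""
    estimate = {}
    for left, f_left in pairs.items():
        for right, f_right in pairs.items():
            if left[1] == right[0]:  # the two pairs must overlap in the middle symbol
                estimate[left + right[1]] = f_left * f_right / singles[left[1]]
    return estimate

seq = "ACGT" * 50
estimated = reconstruct_triplets(word_frequencies(seq, 2), word_frequencies(seq, 1))
observed = word_frequencies(seq, 3)
```

The gap between `observed` and `estimated` for a real sequence indicates how much information the triplet dictionary carries beyond what the pair dictionary already determines.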
[RU] Gorban A.N., Smirnova E.V., Cheusova E.P.
Processing of long-term observations in a comparative analysis of populations and groups living in different ecological conditions (in the Far North and in the middle latitudes of Siberia) leads to the conclusion that correlations between physiological parameters carry the most information about the degree of adaptation of a population to extreme or simply changed conditions. Using an automatic classification algorithm, data on the state of blood plasma lipid metabolism under various loads are analysed. It is shown that in this particular case there is no clustering and the first mode of correlation increase takes place. | ||