The sequence logo has been used for more than three decades to depict the consensus and diversity of sequence motifs. However, the in-line sequence logo cannot reveal intra-motif dependencies and is therefore insufficient to fully characterize sequence motifs, as many studies have demonstrated (Man et al, 2001; Bulyk et al, 2004).
CircularLogo (circular sequence logo) is a web application for creating circular sequence logos for DNA (or RNA) motifs. By virtue of the circular layout, CircularLogo not only displays nucleotide stacks like the traditional sequence logo, but also depicts intra-motif dependencies using linked arches.
CircularLogo takes JSON or FASTA format files as input and generates circular sequence logos in PNG, JPEG, and TIFF formats (see details below).
When the input file is in FASTA format, CircularLogo provides two basic metrics (mutual information and the chi-square statistic) to quantify intra-motif dependencies. However, CircularLogo is primarily a motif visualization tool. When more sophisticated statistical models are needed to quantify intra-motif dependencies, users need to perform that analysis themselves first and then prepare a JSON format input file.
CircularLogo can be accessed from our webserver or built from source code by following the instructions below.
Users can build the CircularLogo webserver on a laptop or other local computer by following the instructions below.
These packages need to be installed before installing CircularLogo:
Use pip to install these prerequisites:
pip install Django
pip install biopython
pip install numpy
pip install scipy
Download CircularLogo source code from here.
unzip circularlogo_VERSION.zip
cd circularlogo_VERSION
python manage.py runserver 0.0.0.0:8000 #start webserver
You can quit the CircularLogo webserver at any time by pressing Ctrl-C.
CircularLogo takes JSON or FASTA format files as input.
A JSON file is useful when you have prior knowledge about intra-motif dependencies and your goal is to visualize them.
JSON (JavaScript Object Notation) is a plain text format. You can create a JSON file manually or edit an existing one. Note that CircularLogo can also export a JSON file (to serve as a template) when you input a FASTA file.
A JSON format file contains 3 sections to describe a circular motif:
{
"id":"AR",
"background":{"key":["A","T","C","G"],"val":[0.25,0.25,0.25,0.25]},
"pseudocounts":{"key":["A","T","C","G"],"val":[0.25,0.25,0.25,0.25]},
"nodes":[
{"index":0,"label":"1","bit":0.779265366373,"base":["C","T","G","A"],"freq":[0.006,0.043,0.347,0.604]},
{"index":1,"label":"2","bit":1.57103613124,"base":["C","T","A","G"],"freq":[0.012,0.017,0.032,0.939]},
{"index":2,"label":"3","bit":0.609476284231,"base":["G","C","T","A"],"freq":[0.027,0.053,0.388,0.532]},
{"index":3,"label":"4","bit":1.79237941693,"base":["C","T","G","A"],"freq":[0.001,0.012,0.012,0.976]},
{"index":4,"label":"5","bit":1.7608900955,"base":["A","G","T","C"],"freq":[0.001,0.012,0.017,0.97]},
{"index":5,"label":"6","bit":1.35037781295,"base":["C","G","T","A"],"freq":[0.017,0.017,0.079,0.888]},
{"index":6,"label":"7","bit":0.00719165322846,"base":["A","G","C","T"],"freq":[0.187,0.249,0.28,0.285]},
{"index":7,"label":"8","bit":0.0651518291752,"base":["G","C","A","T"],"freq":[0.146,0.202,0.305,0.347]},
{"index":8,"label":"9","bit":-0.00870652334553,"base":["A","G","C","T"],"freq":[0.233,0.238,0.259,0.269]},
{"index":9,"label":"10","bit":1.43879070277,"base":["G","C","A","T"],"freq":[0.001,0.017,0.084,0.898]},
{"index":10,"label":"11","bit":1.72492316015,"base":["C","T","A","G"],"freq":[0.006,0.006,0.022,0.965]},
{"index":11,"label":"12","bit":1.7608900955,"base":["G","C","A","T"],"freq":[0.001,0.012,0.017,0.97]},
{"index":12,"label":"13","bit":0.525556517288,"base":["C","G","A","T"],"freq":[0.043,0.068,0.341,0.548]},
{"index":13,"label":"14","bit":1.34114459935,"base":["G","A","T","C"],"freq":[0.017,0.043,0.048,0.893]},
{"index":14,"label":"15","bit":0.668493029832,"base":["G","A","C","T"],"freq":[0.006,0.068,0.383,0.543]}
],
"links":[
{"source":3,"target":4,"value":162.623389175},
{"source":3,"target":11,"value":162.623389175},
{"source":3,"target":10,"value":160.757409794},
{"source":4,"target":11,"value":160.757409794},
{"source":4,"target":10,"value":158.901739691},
{"source":10,"target":11,"value":158.901739691},
{"source":1,"target":3,"value":151.613079897},
{"source":1,"target":4,"value":149.808956186},
{"source":1,"target":11,"value":149.808956186},
{"source":1,"target":10,"value":148.015141753},
{"source":3,"target":9,"value":138.375966495},
{"source":4,"target":9,"value":136.65431701},
{"source":9,"target":11,"value":136.65431701},
{"source":3,"target":13,"value":136.056378866},
{"source":9,"target":10,"value":134.942976804},
{"source":3,"target":5,"value":134.778028351},
{"source":4,"target":13,"value":134.34503866},
{"source":11,"target":13,"value":134.34503866},
{"source":4,"target":5,"value":133.076997423},
{"source":5,"target":11,"value":133.076997423},
{"source":10,"target":13,"value":132.644007732},
{"source":5,"target":10,"value":131.386275773},
{"source":1,"target":9,"value":126.571842784},
{"source":1,"target":13,"value":124.324420103},
{"source":1,"target":5,"value":123.118234536},
{"source":9,"target":13,"value":112.40689433},
{"source":6,"target":14,"value":10.5821520619},
{"source":2,"target":8,"value":9.30380154639},
{"source":6,"target":12,"value":9.16978092784},
{"source":8,"target":12,"value":9.12854381443}
]
}
- index: indicates the order (anticlockwise) of the residues along the circular track. Index starts from 0.
- label: indicates the label of each position.
- bit: indicates Shannon’s information content of residues at a position. The method to calculate bit is described here.
- base: denotes the array of residues. The order of the residues must match the order of the frequencies in freq. For example, in the position with index = 0, the frequency for each residue is “C” = 0.006, “T” = 0.043, “G” = 0.347, “A” = 0.604.
- freq: denotes the frequency of each residue.
- source: indicates the “index” of source node (or the start position of the link).
- target: indicates the “index” of target node (or the end position of the link).
- value: indicates the strength of the link between the source and target nodes. The values associated with links are linearly mapped to the “width” of the linked ribbons in the output image. We generally advise using the smaller index as “source” and the larger as “target”, but this is not mandatory since the layout of the motif graph is circular.
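Before uploading a hand-edited JSON file, it can help to check it against the field descriptions above. The following is a minimal sketch of such a check, not part of CircularLogo itself: the function name `validate_motif`, the tolerance `tol`, and the specific rules (unique indexes, matching base/freq lengths, frequencies summing to about 1, links pointing at existing indexes) are assumptions drawn from the descriptions in this section.

```python
def validate_motif(motif, tol=0.01):
    """Return a list of problems found in a motif dict (hypothetical helper)."""
    problems = []
    nodes = motif.get("nodes", [])
    links = motif.get("links", [])

    # Assumption: each node index should appear only once.
    indexes = [node.get("index") for node in nodes]
    if len(set(indexes)) != len(indexes):
        problems.append("node indexes are not unique")

    for node in nodes:
        base, freq = node.get("base", []), node.get("freq", [])
        # base and freq must line up element by element.
        if len(base) != len(freq):
            problems.append(f"node {node.get('index')}: base/freq lengths differ")
        # Rounded frequencies may not sum to exactly 1, hence the tolerance.
        elif abs(sum(freq) - 1.0) > tol:
            problems.append(f"node {node.get('index')}: freq sums to {sum(freq):.3f}")

    valid = set(indexes)
    for link in links:
        # source and target refer to node indexes, not labels.
        for end in ("source", "target"):
            if link.get(end) not in valid:
                problems.append(f"link {link}: unknown {end} index")
        if not isinstance(link.get("value"), (int, float)):
            problems.append(f"link {link}: value must be a number")
    return problems


# Tiny illustrative motif: two positions taken from the example, one made-up link.
example = {
    "id": "demo",
    "nodes": [
        {"index": 0, "label": "1", "bit": 0.78,
         "base": ["C", "T", "G", "A"], "freq": [0.006, 0.043, 0.347, 0.604]},
        {"index": 1, "label": "2", "bit": 1.57,
         "base": ["C", "T", "A", "G"], "freq": [0.012, 0.017, 0.032, 0.939]},
    ],
    "links": [{"source": 0, "target": 1, "value": 12.5}],
}
print(validate_motif(example))
```

A file on disk could be read with Python's standard `json.load` and passed to the same function.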
CircularLogo will transform the above JSON format file into a circular logo (below), in which the width of each link reflects the strength of dependence between nucleotide stacks.
A FASTA file is useful when you have motif sequences (e.g. from a de novo motif search of your ChIP-seq peaks) and your goal is to explore the intra-motif dependencies.
FASTA is also a plain text format (see the example below). When preparing a FASTA file, it is a good idea to extend each motif site by X nucleotides both upstream and downstream. The extended regions are genomic background, and the dependencies calculated from these nucleotides (usually very weak) arguably represent background noise.
>site1
GGTACAGTTTGTACA
>site2
AATACAGATTGTTCT
>site3
AATACAGAGTGTACT
>site4
AGAACATAATGTACA
>site5
AGTACACTCTGTAAT
>site6
GGAACATTTTGTTTT
>site7
AGCACAAGATGTTCT
>site8
AGTACTTCCTGTTCC
>site9
GGTACACTGTGTACT
>site10
GGTACAAACTGTTCT
...
In this scenario, the input FASTA file is automatically transformed into the JSON format motif representation, with the intra-motif dependencies measured by the chi-square statistic or mutual information. In addition to these two metrics, many other statistical methods have been developed to quantify intra-motif dependencies.
To use these methods, users need to customize the JSON file (which can be exported from the CircularLogo webserver).
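To make the FASTA-to-JSON step more concrete, here is a minimal sketch that builds a "nodes" list from aligned motif sequences. It is an illustration under stated assumptions, not CircularLogo's own code: the helper names `read_fasta` and `build_nodes` are hypothetical, the bit value is the plain uncorrected information content (log2(4) minus the Shannon entropy), and no pseudocounts or background correction are applied, so the numbers need not match those CircularLogo exports. Only the ten sites shown above are used.

```python
import math
from collections import Counter

FASTA_TEXT = """>site1
GGTACAGTTTGTACA
>site2
AATACAGATTGTTCT
>site3
AATACAGAGTGTACT
>site4
AGAACATAATGTACA
>site5
AGTACACTCTGTAAT
>site6
GGAACATTTTGTTTT
>site7
AGCACAAGATGTTCT
>site8
AGTACTTCCTGTTCC
>site9
GGTACACTGTGTACT
>site10
GGTACAAACTGTTCT
"""


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def build_nodes(sequences, alphabet="ACGT"):
    """Build a 'nodes' list from equal-length motif sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all motif sequences must have the same length")
    nodes = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            raise ValueError(f"no A/C/G/T residues at position {pos}")
        freqs = {b: counts[b] / total for b in alphabet}
        # Assumption: uncorrected information content, log2(4) - entropy.
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        ordered = sorted(alphabet, key=lambda b: freqs[b])  # rarest first
        nodes.append({
            "index": pos,
            "label": str(pos + 1),
            "bit": math.log2(len(alphabet)) - entropy,
            "base": ordered,
            "freq": [freqs[b] for b in ordered],
        })
    return nodes


sequences = [seq for _, seq in read_fasta(FASTA_TEXT)]
nodes = build_nodes(sequences)
print(len(nodes), nodes[3]["base"], nodes[3]["freq"])
```

The residues are sorted from rarest to most frequent, mirroring the ordering seen in the example JSON file.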
Although both chi-square and mutual information measure dependency, they are different metrics and should be used under different conditions.
Essentially, the chi-square statistic measures the “co-occurrence” of nucleotides at two different positions. Chi-square is therefore better suited to measuring dependency between two “conserved” (i.e. less variable) positions: for two highly variable positions, the frequencies of dinucleotides (which are used to measure “co-occurrence” and to calculate the chi-square statistic) are close to background (i.e. 1/16), and the chi-square statistic is close to 0.
In contrast, mutual information measures the reduction in uncertainty about the nucleotide frequencies at one position given some knowledge of the nucleotide frequencies at another position. Mutual information is therefore better suited to measuring dependency between two highly variable positions: for a highly conserved position dominated by a particular nucleotide, the entropy of that position, and hence its mutual information with any other position, approaches 0 bits.
When the chi-square metric is used, the significance of the dependency between two positions is evaluated using the chi-square test. The chi-square metric is calculated as below:
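As an illustration, the Pearson chi-square statistic for a pair of motif positions can be sketched in Python as follows. This assumes the standard test of independence, in which expected dinucleotide counts are derived from the observed single-position counts; CircularLogo's own computation may differ (for instance by using a uniform 1/16 background). The function name `chi_square` and the toy input are hypothetical.

```python
from collections import Counter


def chi_square(sequences, i, j):
    """Return (statistic, degrees of freedom) for positions i and j."""
    # Observed dinucleotide counts across all motif sites.
    pairs = Counter((seq[i], seq[j]) for seq in sequences)
    n = sum(pairs.values())

    # Marginal nucleotide counts at each of the two positions.
    row, col = Counter(), Counter()
    for (a, b), count in pairs.items():
        row[a] += count
        col[b] += count

    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n  # count expected under independence
            stat += (pairs[(a, b)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


# Toy input: two perfectly dependent columns.
sites = ["AA", "CC", "GG", "TT"] * 5
stat, dof = chi_square(sites, 0, 1)
print(stat, dof)
```

With SciPy installed (it is among the prerequisites listed earlier), a p-value for the statistic can be obtained with `scipy.stats.chi2.sf(stat, dof)`.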
When the mutual information metric is used, the significance of the dependency between two positions is evaluated using Chebyshev’s inequality. For example, if the observed mutual information exceeds the value expected under the random background model by K standard deviations, then P <= 1/K**2.
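The following minimal sketch shows how mutual information and the Chebyshev bound fit together. The random background model is not specified in this document, so the sketch assumes one: the nucleotides of one column are shuffled repeatedly to estimate the mean and standard deviation of mutual information under independence. The function names, the number of shuffles, and the seed are all hypothetical choices.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b)))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def chebyshev_p_bound(xs, ys, shuffles=200, seed=0):
    """Upper bound on the p-value of the observed MI (assumed shuffle null)."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    null = []
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # breaks any dependency between the columns
        null.append(mutual_information(xs, shuffled))
    mean, stdev = statistics.mean(null), statistics.pstdev(null)
    # No usable bound when the null has no spread or MI is not above its mean.
    if stdev == 0 or observed <= mean:
        return 1.0
    k = (observed - mean) / stdev
    return min(1.0, 1.0 / k ** 2)  # Chebyshev: P <= 1/K**2


# Toy columns: the second column is an exact copy of the first.
column = list("ACGT") * 10
print(mutual_information(column, column))
print(chebyshev_p_bound(column, column))
```

Chebyshev's inequality makes no assumption about the shape of the null distribution, which is why the resulting bound is conservative compared with an exact test.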
Users can download the FASTA file representing HNF6 DNA from here, or simply click the “Load FASTA Demo” button.
Method = Mutual Information. P-value cutoff = 1 (Displays all links without filtering).
Method = Mutual Information. P-value cutoff = 1 (Displays all links without filtering). Focus on node 33 (use the “focus on node” drop-down list to select 33).
Method = Mutual Information. P-value cutoff = 1 (Displays all links without filtering). Focus on node 5 (use the “focus on node” drop-down list to select 5).
Method = Mutual Information. P-value cutoff = 0.01 (CircularLogo automatically filters out links with p-value > 0.01).
Method = Mutual Information. P-value cutoff = 0.01 (Filters out links with p-value > 0.01). Manually filter out other weak links and focus on node 33.
CircularLogo is distributed under the GNU General Public License (GPLv2).
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA