The software described in this document is furnished under a license agreement. The software may be used
or copied only under the terms of the license agreement. No part of this manual may be photocopied or
reproduced in any form without prior written consent from The MathW orks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation
by, for, or through the federal government of the United States. By accepting delivery of the Program
or Documentation, the government hereby agrees that this software or documentation qualifies as
commercial computer software or commercial computer software documentation as such terms are used
or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and
conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern
theuse,modification,reproduction,release,performance,display,anddisclosureoftheProgramand
Documentation by the federal government (or other entity acquiring for or through the federal government)
and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the
government’s needs or is inconsistent in any respect with federal procurement law, the government agrees
to return the Program and Docu mentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
The MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History
September 2003 Online onlyNew for Version 1.0 ( Release 13SP1+)
June 2004Online onlyRevised for Version 1.1 (Release 14)
November 2004Online onlyRevised for Version 2.0 (Release 14SP1+)
March 2005Online onlyRevised for Version 2.0.1 (Release 14SP2)
May 2005Online onlyRevised for Version 2.1 (Release 14SP2+)
September 2005 Online onlyRevised for Version 2.1.1 (Release 14SP3)
November 2005Online onlyRevised for Version 2.2 (Release 14SP3+)
March 2006Online onlyRevised for Version 2.2.1 (Release 2006a)
May 2006Online onlyRevised for Version 2.3 (Release 2006a+)
September 2006 Online onlyRevised for Version 2.4 (Release 2006b)
March 2007Online onlyRevised for Version 2.5 (Release 2007a)
April 2007Online onlyRevised for Version 2.6 (Release 2007a+)
September 2007 Online onlyRevised for Version 3.0 (Release 2007b)
March 2008Online onlyRevised for Version 3.1 (Release 2008a)
October 2008Online onlyRevised for Version 3.2 (Release 2008b)
March 2009Online onlyRevised for Version 3.3 (Release 2009a)
September 2009 Online onlyRevised for Version 3.4 (Release 2009b)
March 2010Online onlyRevised for Version 3.5 (Release 2010a)
Data Formats and Databases
Sequence Alignments
Sequence Utilities and Statistics
Protein Property Analysis
Phylogenetic Analysis
Microarray Data Analysis
Microarray Data Storage
Mass Spectromet ry Data Analysis
Graph Theory Functions
Graph Visualization
Statistical Learning and Visualization
Prototyping and Development Environment
Data Visualization
Algorithm Sharing and Application Deployment
Using Spreadsheet Link EX with Bioinformatic Data
.......................................1-5
........................................1-5
.................................1-5
.................................1-6
............................1-8
........................1-8
.............................. 1-10
..................... 1-10
.......................... 1-11
.............................. 1-12
........................... 1-12
........................... 1-13
.................... 1-14
............................ 1-17
............................... 1-18
................ 1-18
............1-19
................................1-20
........1-20
..1-22
Creating get Functions
What Are get Functions?
Creating the getpubmed Function
Extracting Information from Large Multi-Entry Text
Overview o f Example
Determining Nucleotide Content
Reading Sequence Information
Determining Nucleotide Compositi on
Determining C odon Composition
Open Reading Frames
Amino Acid Conversion and Composition
..............................2-2
.....................2-2
......................2-4
.................2-6
.....................2-9
............................. 2-12
..............2-15
..1-37
viContents
Example: Sequence Alignment
Overview o f Example
Finding a Model Organism to Study
Retrieving Sequence Information from a Public
Database
Searching a Public Database for Related Genes
Locating Protein Coding Sequences
Comparing Amino Acid Sequences
Sequence Tool
Overview o f the Sequence Tool
Importing a Sequence
Viewing Nucleotide Sequence Information
Searching for Words
Exploring Open Reading Frames
Viewing Amino Acid Sequence Statistics
Closing the Sequence Tool
References
Overview o f the Multiple Sequence Alignment Viewer
Loading Sequence Data and Viewing the Phylogenetic
Tree
Selecting a Subset of Data from the Phylogenetic Tree
Aligning Multiple Sequences
Adjusting Multiple Sequence Alignments Manually
Closing the Multiple Sequence Alignment Viewer
Storing and Managing Short-Read Sequence Data in
Objects
Overview
Representing Sequence and Quality Data in a BioRead
Object
Representing Sequence, Quality, and Alignment/Mapping
Data in a B ioMap Object
Retrieving Information from a BioRead or BioMap
Object
Setting Information in a BioRead or BioMap Object
Determining Coverage of a Reference Sequence
Constructing Sequence Alignments to a Reference
Sequence
Filtering Read Sequences Using SAM Flags
..........................................2-52
........................ 2-54
.........................................2-59
........................................2-59
.........................................2-60
......................... 2-62
.........................................2-66
.........2-69
......................................2-71
............2-72
...2-52
...2-53
.....2-55
.......2-58
.....2-68
Microarray Analysis
3
Storing and Managing Gene Expression Data in
Objects
Representing Expression Data Values in DataMatrix
Objects
Overview of DataMatrix Objects
Constructing DataMatrix Objects
Getting and Setting Properties of a DataMatrix Object
Accessing Data in DataM atrix Objects
Representing Expression Data Values in ExptData
Objects
.........................................3-2
.........................................3-5
.....................3-5
....................3-6
................3-8
.........................................3-11
...3-7
vii
Overview o f ExptData Objects .......................3-11
Constructing ExptData Objects
Using Properties of an ExptData Object
Using Methods of an ExptData Object
References
Representing Sample and Feature Metadata in
MetaData Objects
Overview o f MetaData Objects
Constructing MetaData Objects
Using Properties of a MetaData Object
Using Methods of a MetaData Object
Representing Experiment Information in a MIAME
Object
Overview o f MIAME Objects
Constructing MIAME Objects
Using Properties of a MIAME Object
Using Methods of a MIAME Object
Representing All Data and Information in an
ExpressionSet Object
Overview o f ExpressionSet Objects
Constructing ExpressionSet Objects
Using Properties of an ExpressionSet Object
Using Methods of an ExpressionSet Object
.......................................3-14
............................... 3-15
..........................................3-22
............................ 3-27
...................... 3-12
...............3-12
................. 3-13
....................... 3-15
...................... 3-16
................ 3-19
................. 3-20
........................ 3-22
....................... 3-22
.................. 3-25
................... 3-25
................... 3-27
.................. 3-29
...........3-30
.............3-30
viiiContents
Example: Visualizing Microarray Data
Overview o f the Mouse Example
Exploring the Microarray Data Set
Spatial Images of Microarray Data
Statistics of the Microarrays
Scatter Plots of Microarray Data
Example: Analyzing Gene Expression Profiles
Overview o f the Yeast Example
Exploring the Data Set
Filtering Genes
Clustering Genes
Principal Component Analysis
...................................3-61
............................. 3-57
..................................3-64
..................... 3-33
........................ 3-46
..................... 3-48
...................... 3-57
....................... 3-68
..............3-33
................... 3-34
................... 3-36
.......3-57
Phylogenetic Analysis
4
Overview of Phylogenetic Analysis ..................4-2
Example: Building a Phylogenetic Tree
Overview o f the Primate Example
Searching NCBI for Phylogenetic Data
Creating a Phylogenetic Tree for Five Species
Creating a Phylogenetic Tree for Twelve Species
Exploring the Phylogenetic Tree
Phylogenetic Tree Tool Reference
Overview o f the Phylogenetic Tree Tool
Opening the Phylogenetic Tree Tool
File Menu
Tools Menu
Window Menu
Help Menu
• “Using Spreadsheet Link EX with Bioinformatic Data” on page 1-22
• “Creating get Functions” on page 1-29
• “Extracting Information from Large Multi-Entry Text Files” on page 1-34
1
1 Getting Started
Product Overview
Features
The Bioinformatics Toolbox™ product extends the MATLAB®environment
to provide an integrated software environment for genome and proteome
analysis. Scientists and engineers can answer questions, solve problems,
prototype new algorithms, and build applications for drug discovery and
design, genetic engineering, and biological research. An introduction to these
features will help you to develop a conceptual model for working with the
toolbox and your biological data.
The Bioinformatics Toolbox product includes many functions to help you
with genome and proteome analysis. Most functions are implemented in the
MATLAB programming language, with the source available for you to view.
This open environment lets you explore and customize the existing to olbox
algorithms or develop your own.
In this section...
“Features” on page 1-2
“Expected Users” on page 1-3
1-2
You can use t h e basic bioinformatic functions provided with this toolbox
to create more complex algorithms and applications. These robust and
well-tested functions are the functions that you would otherwise have to
create yourself.
Toolbox features and functions fall within these categories:
• Data formats and databases — Connect to Web-accessible databases
containing genomic and proteomic data. Read and conv ert between
multiple data formats.
• Sequence analysis — Determine the statistical characteristics of a
sequence, align two sequences, a nd multiply align several sequences.
Model patterns in biological sequences using hidden Markov model (HMM)
profiles.
• Phylogenetic analysis — C reate and manipulate phylogenetic tree data.
Product Overview
• Microarray data analysis — Read, normalize, and visualize microarray
data.
• Mass spectrometry data analysis — Analyze and enhance raw mass
spectrometry data.
• Statistical learning — Classify and identify features in data sets with
statistical learning tools.
• Programming interface — Use other bioinformatic software (B io P erl
and BioJava) within the MATLAB environment.
The field of bioinformatics is rapidly growing and will become increasingly
important as biology becomes a more an al ytical scie nce. The toolbox pro vi de s
an open environment that you can customize for development and deploymen t
of the analytical tools you will need.
• Prototype and develop algorithms — Prototype new ideas in an open
and extensible environment. Develop algorithms using efficient string
processing and statistical functions, view the source code for existing
functions, and use the code as a template for customizing, improving,
or creating your own functions. See “Prototyping and Development
Environment” on page 1-19.
• Visualize data — Visualize sequences and alignments, gene expression
data, phylogenetic trees, mass spectrometry data, protein structure,
and relationships between data with interconnected graphs. See “Data
Visualization” on page 1-20.
• Share and deploy applications — Use an interactive GUI builder to
develop a custom graphical front end for your data analysis programs.
Create standalone applications that run sepa rately from the MATLAB
environment. See “Algorithm Sharing and Application Deployment” on
page 1-20.
1-3
1 Getting Started
Expected Users
The Bioinformat
and research sci
published ones
ics Toolbox product is intended for computational biologists
entists who need to develop new algorithms or implement
, visualize results, and createstandaloneapplications.
• Industry/Pro
supported by e
whowanttocre
industries.
• Education/P
and teachin
and student
programmin
While the t
to be a comp
However, t
prototyp
g genome and proteome analysis techniques. Educators
s can concentrate on bioinformatic algorithms instead of
g basic functions such as reading and writing to files.
oolbox includes many bioinformatic functions, it is not intended
lete set of tools for scientists to analyze their biological data.
he MATLAB environment is ideal for rapidly designing and
ing the tools you need.
fessional — Increasingly, drug discovery methods are being
ngineering practice. This toolbox supports tool builders
ate applications for the biotechnology and pharmaceutical
rofessor/Student — This toolbox is well suited for learning
1-4
Installation
Installation
In this section...
“Installing” on page 1-5
“Required Software” on page 1-5
“Optional Software” on page 1-6
Installing
Install the Bioinformatics Toolbox software from a DV D or W eb release using
the MathWorks™ Installer. For more information, see the Installation Guide
foryourplatform.
Required Software
The Bioinformatics Toolbox software requires the following MathWorks
products to be installed on your computer.
Require
MATLABProvide
Stat
dSoftware
istics Toolbox™
Descrip
softwa
Toolbo
Versio
res MATLAB Version 7.10 on the Release
requi
2010a
ides basic statistics and probability functions
Prov
by the functions of the Bioinformatics
used
box software.
Tool
sion 3.5 of the Bioinformatics Toolbox software
Ver
uires Statistics Toolbox Version 7.3 on the
req
ease 2010a DVD.
Rel
tion
s a command-line interface and integrated
re environment for the Bioinformatics
x software.
n 3.5 of the Bioinformatics Toolbox software
DVD.
1-5
1 Getting Started
Optional Software
MATLAB and the Bioinformatics Toolbox software environment is open and
extensible. In this environment you can interactively explore ideas, prototype
new algorithms, and develop complete solutions to problems in bioinformatics.
MATLAB facilitates computation, visualization, prototyping, and deployment.
Using the Bioinformatics Toolbox software with other MATLAB toolboxes and
products will allow you to do advanced algorithm development and solve
multidisciplinary problems.
Optional Software
Parallel Computing
Toolbox™
Signal Processing
Toolbox™
Image Processing
Toolbox™
SimBiology
Optimization
Toolbox™
Neural Network
Toolbox™
Database Toolbox™
®
Description
Perform parallel bioinformatic computations on
multicore computers and computer clusters . For
an example of batch processing through parallel
computing, see the Batch Processing of Spectra
Using Distributed Computing demo.
Process signal data from bioanalytical
instrumentation. Examples include acquisition
of fluores cence data for DNA sequence analyzers,
fluorescence data for microarray scanners, and
mass spectrometric data from protein analyses.
Create complex and custom image processing
algorithms for data from microarray scanners.
Model, simulate, and analyze biochemical systems.
Use nonlinear optimization to predict the
secondary structure of proteins and the structure
of other biological macromolecules.
Use neural networks to solve problems where
algorithms are not available. For example, you
can train neural networks for pattern recognition
using large sets of sequence data.
Create your own i n-house databases for sequence
data with custom annotations.
1-6
Installation
Optional Software
MATLAB
®
Compiler™
Description
Create standalone applications from MATLAB
GUI applications, and create dynamic link
libraries from MATLAB functions to use with any
programming environment.
MATLAB®Builder™NECreate COM objects to use with any COM-based
programming environment.
MATLAB Builder JAIntegrate MATLAB applications into your
organization’s Java™ programs by creating a Java
wrapper around the application.
Spreadsheet Link™EXConnect Microsoft Excel with the MATLAB
WorkspacetoexchangedataandtouseMATLAB
computational and visualization functions. For
more information, see “Using Spreadsheet Link
EX with Bioinformatic Data” on page 1-22.
1-7
1 Getting Started
Features and Functions
In this section...
“Data Formats and Databases” on page 1-8
“Sequence Alignments” on page 1-10
“Sequence Utilities and Statistics” on page 1-10
“Protein Property Analysis” on page 1-11
“Phylogenetic Analysis” on page 1-12
“Microarray Data Analysis” on page 1-12
“Microarray Data Storage” on page 1-13
“Mass Spectrometry Data Analysis” on page 1-14
“Graph Theory Functions” on page 1-17
“Graph Visualization” on page 1-18
“Statistical Learning and Visualization” on page 1-18
1-8
“Prototyping and Development Environment” on page 1-19
“Data Visualization” on page 1-20
“Algorithm Sharing and Application Deployment” on page 1-20
Data Formats and Databases
The toolbox accesses many of the databases on the W eb and other online data
sources. It allows you to copy data into the MATLAB Workspace, and read
and write to files with standard bioinformatic formats. It also reads many
commongenomefileformats,sothatyoudonothavetowriteandmaintain
your own file readers.
Web-based databases — You can directly access public databases on the
Web and copy sequence and gene expression information into the MATLAB
environment.
The sequence databases currently supported are GenBank
GenPept (
(
getembl), and Protein Data Bank (PDB) (getpdb). You can also access data
getgenpept), European Molecular Biology Laboratory (EMBL)
®
(getgenbank),
Features and Functions
from the NCBI Gene Expression Omnibus (GEO ) Web site by using a single
function (
getgeodata).
Get multiply aligned sequences (
profiles (
gethmmprof), and phylogenetic tree data (gethmmtree)fromthe
gethmmalignment), hidden Markov model
PFAM database.
Gene Ontology database — Load the database from the Web into
a gene ontology object (
ontology with methods for the geneont object (
Read data from instruments — Read d ata generated from gene
sequencing instruments (
(
jcampread), and Agilent
scfread, joinseq, traceplot), mass spectrometers
®
microarray s canners (agferead).
Reading data formats — The toolbox provides a number of functions for
reading data from common bioinformatic file formats.
• Sequence data: GenBank (
(
emblread), PDB (pdbread), and FASTA (fastaread)
• Multiply aligned sequences: ClustalW and GCG formats (
genbankread), GenPept (genpeptread), EMBL
multialignread)
• Gene expression data from microarrays: Gene Expression Omnib us (GEO)
data (
geosoftread), GenePix
galread), SPOT data (sptread), Affymetrix
and ImaGene
®
results files (imageneread)
®
data in GPR and GAL files (gprread,
®
GeneChip®data ( affyread),
• Hidden Markov model profiles: PFAM -HM M file (
pfamhmmread)
Writing data formats — The functions for getting data from the Web
include the option to save the data to a file. However, there is a function to
write data to a file using the FASTA format (
BLAST searches — Request Web-based BLAST searches (
the results from a search (
BLAST formatted report file (
getblast) and read results from a previously saved
blastread).
fastawrite).
blastncbi), get
1-9
1 Getting Started
The MATLAB environment has built-in support for other industry-standard
file formats including Microsoft Excel and comma-separated-value (CSV) files.
Additional functions perform ASCII and low-level binary I/O, allowing you to
develop custom functions for working with any data format.
Sequence Alignments
You can select from a list of analysis methods to compare nucleotide or amino
acid sequences using pairwise or multiple sequence alignment functions.
Pairwise sequence alignment — Efficient implementations of standard
algorithms such as the Needleman-Wunsch (
(
swalign) algorithms for pairwise sequence alignment. The toolbox also
includes standard scoring matrices such as the PAM and BLOSUM
families of matrices (
sequence similarities with
Biological codes — Look up the letters or numeric equivalents for
commonly used biological codes (
revgeneticcode).
aminolookup, baselookup, geneticcode,
Sequence Utilities and Statistics
You can manipulate and analyze your sequences to gain a deeper
understanding of the physical, chemical, and biological characteristics of
your data. Use a graphical user interface (GUI) with many of the sequence
functions in the toolbox (
seqtool).
Features and Functions
Sequence conversion and manipulation — The toolbox provides routines
for common operations, such as converting DNA or RNA sequences to amino
acid sequences, that are basic to working w ith nucleic acid and protein
sequences (
codonbias, cpgisland, oligoprop), search for specific patterns within a
sequence (
(
seqshoworfs). In addition, you can create random sequences for test cases
(
randseq).
seqshowwords, seqwordcount), or search for open reading frames
Sequence utilities — Determine a consensus sequence from a set of multiply
aligned amino acid, nucleotide sequences (
profile (
seqprofile). Format a sequence for display (seqdisp)orgraphically
show a sequence alignment with frequency data (
seqconsensus,orasequence
seqlogo).
Additional MATLAB functions efficiently handle string operations with
regular expressions (
sequence and search through a library for string matches (
regexp, seq2regexp) to look for specific patterns in a
seqmatch).
Look for possible cleavage sites in a DNA/RNA sequence by searching for
palindromes (
palindromes).
Protein Property Analysis
You can use a collection of protein analysis methods to extract information
from your data. You can determine protein characteristics and simulate
enzyme cleavage reactions. The toolbox provides functions to calculate various
properties of a protein sequence, such as the atomic composition (
molecular weight (
cleave a protein with an enzyme (
and Ramachandran plots for PDB data (
molweight), and isoelectric point (isoelectric). You can
cleave, rebasecuts) and create distance
pdbdistplot, ramachandran). The
toolbox contains a graphical user interface for protein analysis (
and plotting 3-D protein and other molecular structures with information
from molecule m odel files, such as PDB files (
molviewer).
atomiccomp),
proteinplot)
1-11
1 Getting Started
Amino acid sequence u tilities — Calculate amino acid statistics for a
sequence (
aacount) and get information about character codes (aminolookup).
Phylogenetic Analysis
You can use functions for phylogenetic tree building and analysis. There is
also a GUI to draw phylograms (trees).
Phylogenetic tree data — Read and write Newick-formatted tree files
(
phytreeread, phytreewrite) into the MATLAB Workspace as phylogenetic
tree objects (
Create a phylogenetic tree — Calculate the pairwise distance between
biological sequences (
dndsml), build a phylogenetic tree from pairwise distances (seqlinkage,
seqneighjoin, reroot), and view the tree in an interactive GUI that allows
you to view, edit, and explore the data (
allows you to prune branches, reorder, rename, and explore distances.
phytree).
seqpdist), estimate the substitution rates (dnds,
phytreetool or view). This GUI also
1-12
Phylogenetic tree object methods — You can access the functionality
of the
(
the patristic distances between pairs of leaf nodes (
phytreetool GUI using methods for a phylogenetic tree object
phytree). Get property values (get) and node names (getbyname). Calculate
pdist, weights)
and draw a phylog enetic tree object in a MATLAB Figure window as a
phylogram, cladogram, or radial treeplot (
selecting branches and leaves using a specified criterion (
and removing nodes (
Newick-formatted strings (
prune). Co mpare trees (getcanonical) and use
getnewickstr).
plot). Manipulate tree data by
select, subtree)
Microarray Data Analysis
The MATLAB environment is widely used for microarray data analysis,
including reading, filtering, normalizing, and visualizing microarray data.
However, the standard normalization and visualization toolsthatscientists
use can be difficult to implement. The toolbox includes these standard
functions:
Microarray data — Read Affymetrix GeneChip files (
data (
sptread) and Agilent microarray scanner files (agferead). Read GenePix
affyread)andplot
Features and Functions
GPR files (gprread)andGALfiles(galread). Get Gene Expression Omni bus
(GEO)datafromtheWeb(
(
geosoftread).
getgeodata) and read GEO data from files
A utility function (
reader functions (
magetfield) extracts data from one of the microarray
gprread, agferead, sptread, imageneread).
Microarray normalization and filtering —Thetoolboxprovidesa
number of methods for normalizing microarray data, such as low ess
normalization (
multiple arrays (
clean raw data before analysis (
generangefilter, genevarfilter), and calculate the range and variance of
values (
exprprofrange, exprprofvar).
malowess) and mean normalization (manorm), or across
quantilenorm). You can use filtering functions to
geneentropyfilter, genelowvalfilter,
Microarray visualization — The toolbox contains routines for visualizing
microarray data. These routines include spatial plots of microarray data
(
clustergram, redgreencmap). You can create 2-D scatter plots of
principal components from the microarray data (
mairplot). You can also view clustered expression
mapcaplot).
Microarray utility functions — Use the following functions to work
with Affymetrix GeneChip data sets. Get library information for a probe
(
probelibraryinfo), gene information from a probe set (probesetlookup),
and probe set values from CEL and CDF information (
Show probe set information from NetAffx™ Analysis Center (
and plot probe set values (
probesetplot).
probesetvalues).
probesetlink)
The toolbox accesses statistical routines to perform cluster analysis and
to visualize the results, and you can view your data through statistical
visualizations such as dendrograms, classification, and regression trees.
Microarray Data Storage
The toolbox includes functions, objects, and methods for creating, storing, and
accessing microarray data.
The object constructor function,
object to encapsulate data and metadata from a microarray experiment. A
DataMatrix, lets you create a DataMatrix
1-13
1 Getting Started
DataMatrix object stores experimental data in a matrix, with rows typically
corresponding to gene names or probe identifiers, and columns typically
corresponding to sample identifiers. A DataMatrix object also stores
metadata, including the gene names or probe identifiers (as the row names)
and sample identifiers (as the column names).
You can reference microarray expressionvaluesinaDataMatrix object the
same way you reference data in a MATLAB array, that is, by using linear or
logical indexing. Alternately, you can reference this experimental data by
gene (probe) identifiers and sample identifiers. Indexing by these identifiers
lets you quickly and conveniently access subsets of the data without having
to maintain additional index arrays.
Many MATLAB operators and arithmetic functions are available to
DataMatrix objects by means of methods. These methods let you modify,
combine, compare, analyze, plot, and access information from DataMatrix
objects. Additionally, you can easily extend the functionality by using
general element-wise functions,
accessing the properties of a DataMatrix object.
dmarrayfun and dmbsxfun, and by manually
1-14
Note For more information on creating and using DataMatrix objects, see
“Representing Expression Data Values in DataMatrix Objects” on page 3-5.
Mass Spectrometry Data Analysis
The mass spectrometry functions preprocess and classify raw data from
SELDI-TOF and MALDI-TOF spectrometers and use statistical learning
functions to identify patterns.
Reading raw data — Load raw mass/charge and ion intensity data from
comma-separated-value (CSV) files, or read a JCAMP-DX-formatted file with
mass spectrometry data (
You can also have data in TXT files and use the
Preprocessing raw data — Resample high-resolution data to a lower
resolution (
the baseline (
msresample) where the extra data points are not needed. Correct
msbackadj). Align a spectrum to a set of reference masses
jcampread) into the MATLAB environ ment.
importdata function.
Features and Functions
(msalign) and visually verify the alignment (msheatmap). N ormalize the
area betw ee n spectra for comparing (
msnorm), and filter out noise (mslowess
and mssgolay).
Spectrum analysis — Load spectra into a GUI (
msviewer) for selecting mass
peaks and further analysis.
The following graphic illustrates the roles of the various mass spectrometry
functions in the toolbox.
1-15
1 Getting Started
mzXML File
mzxmlread
mzXML Structure
mzxml2peaks
Peak Lists
(Centroided Data)
mspeaksmsppresample
Raw
Data
Reconstructed
Data
Semicontinuous Signal
msresample
msdotplot
msheatmap
msviewer
Plot
Plot
Mass
Spectra
Viewer
1-16
Features and Functions
Graph Theory Fun
Graph theory fun
sparse matrices
the matrix repr
represent the a
edge. Graph al
if a
NaN or an In
information w
algorithms l
notattheval
Sparse matr
• Directed Gr
(column) i
in the diag
values.
• Undirect
real or lo
stored in
• Direct A
with zer
require
aDAGwi
ment of a DAG, it does not guarantee a DAG. An algorithm expecting
ctions in the toolbox apply basic graph theory algorithms to
. A sparse matrix represents a graph, any nonzero entries in
esent the edges of the graph, and the values of these entries
ssociated weight (cost, distance, length, or capacity) of the
gorithms that use the weight information will cancel the edge
f
is found. Graph algorithms that do not use the weight
ill consider the edge if a
ook only at the connectivity described by the sparse matrix and
ues stored in the sparse matrix.
ices can represent four types of graphs:
aph — Sparse matrix, either double real or logical. Row
ndex indicates the source (target) of the edge. Self-loops (values
onal) a re allowed, although most of the algorithms ignore these
ed Graph — Lower triangle of a sparse matrix, either double
gical. An algorithm expecting an undirected graph ignores values
the upper triangle of the sparse matrix and values in the diagonal.
cyclic Graph (DAG) — Sparse matrix, double real or logical,
o values in the diagonal. W hile a zero-valued diagonal is a
ll not test for cycles because this will add unwanted complexity.
ctions
NaN or an Inf is found, because these
• Spanni
connec
There
all fo
will
Grap
can i
h algorithm for DAG, however testing if a graph is acyclic is expensive
pat
pared to the algorithm. Therefore, it is important to select a graph theory
com
ction a nd properties appropriate for the type of the graph represente d by
fun
r input matrix. If the algorithm receives a graph type that differs from
you
at it expects, it will eith er:
wh
ng Tree — Undirected graph with no cycles and with one
ted component.
are no attributes attached to the graphs; sparse matrices representing
ur types of graphs can be passed to any graph algorithm. All functions
return an error on nonsquare sparse matrices.
h algorithms do not pretest for graph properties because such tests
ntroduce a time penalty. For example, there is an efficient shortest
1-17
1 Getting Started
• Return an error when it reaches an inconsistency. For example, if you pass
a cyclic graph to the
the
method property.
• Produce an invalid result. For example, if you pass a directed graph to a
function with an algorithm that expects an undirected graph, it will ignore
valuesintheuppertriangleofthesparsematrix.
The toolbox includes functions, objects, and methods for creating, viewing,
and manipulating graphs such as interactive maps, hierarchy plots, and
pathways. This allows you to view relationships between data.
The object constructor function (
hold graph data. Methods o f the biograph object let you calculate the position
of nodes (
edges (
and find relations between the nodes (
and
algorithms to the biograph object.
Various properties of a biograph object let you programmatically change the
properties of the rendered graph. You can customize the node representation,
for example, drawing pie charts inside every node (
you can associate your own callback functions to nodes and edges of the graph,
for example, opening a Web page with more information about the nodes
(
NodeCallback and EdgeCallback).
dolayout), draw the graph (view), get handles to the nodes and
getnodesbyid and getedgesbynodeid) to further query information,
getrelatives). There are also methods that apply basic graph theory
biograph) lets you create a biograph object to
getancestors, getdescendants,
CustomNodeDrawFcn). Or
1-18
Statistical Learning and Visualization
You can classify and identify features in data sets, set up cross-validation
experiments, and compare different classification methods.
Features and Functions
The toolbox provides functions that build on the classification and statistical
learning tools in the Statistics Toolbox software (
treefit).
classify, kmeans,and
These functions include imputation tools (
classifiers (
(
knnclassify).
svmclassify, svmtrain) and K-nearest neighbor classifiers
Other functions include set up of cross-validation experiments (
knnimpute), support vector m achine
crossvalind)
and comparison of the perform ance of different classification methods
(
classperf). In addition, there are tools for selecting diversity and
discriminating features (
rankfeatures, randfeatures).
Prototyping and Development Environment
The MATLAB environment lets you prototype and develop algorithms and
easily compare alternatives.
• Integrated environment — Explore biological data in an environment
that integrates programm i ng and visualization. Create reports and plots
with the built-in functions for mathematics, graphics, and statistics.
• Open environment — Access the source code for the toolbox functions.
The toolbox includes many of the basic bioinformatics functions you will
need to use, and it includes prototypes for some of the more advanced
functions. Modify these functions to create your own custom solutions.
• Interactive programming language — Test your ideas by typing
functions that are interpreted interactively with a language whose basic
data element is an array. The arrays do not require dimensioning and
allow you to solve many technical computing problems,
Using matrices for sequences or groups of sequences allows you to work
efficiently and not worry about writing loops or other programming controls.
• Programming tools — Use a visual debugger for algorithm development
and refinement and an algorithm performance profiler to accelerate
development.
1-19
1 Getting Started
Data Visualizat
You can visually
sequences, gene
protein charac
create custom g
can also creat
®
Adobe
PostSc
Algorithm Sh
The open MATL
other users
With the add
applicatio
of MATLAB B
applicati
• Share alg
algorith
platform
MATLAB e
Environ
• Deploy M
enviro
to crea
MATLAB
ns independent of the MATLAB environment, and, with the addition
ons w ithin other programming environments.
ms created in the MATLAB language across all supported
ment (GUIDE).
nment using GUIDE, and then use MATLAB Compiler software
te a standalone GUI application that runs separately from the
expression data from microarrays, and plot nucleic acid and
teristics. The 2-D and volume visualization features let you
raphical representations of multidimensional data sets. You
e montages and overlays, and export finished graphics to an
®
ript
image file or copy directly into Microsof t®PowerPoint®.
aring and Application Deployment
AB environment lets you share your analysis solutions with
, and it includes tools to create custom software applications.
ition of MATLAB Compiler software, you can create standalone
uilder NE software, you can create GUIs and standalone
orithms with other users — You can share data analysis
s by giving files to other users. You can also create GUIs within the
nvironment using the Graphical User Interface Development
ATLAB GUIs — Create a GUI within the MATLAB
environment.
ion
1-20
• Creat
• Crea
• Crea
• Cre
e dynamic link libraries (DLLs) — Use MATLAB Compiler
are to create DLLs for your functions, and then link these libraries to
softw
programming environments such as C and C++.
other
te COM objects — Use MATLAB Builder NE software to create
COM o
(Vis
cre
spr
au
bjects, and then use a COM-compatible programming environment
ual Basic
te Excel add-ins — Use MATLAB Builder EX software to
ate Excel add-in functions, and then use these functions with Excel
eadsheets.
ate Java classes — Use MATLAB Builder JA software to
tomatically generate Java classes from algorithms written in the
®
) to create a standalone application.
Features and Functions
MATLAB programming language. You can run these classes outside the
MATLAB environment.
1-21
1 Getting Started
Using Spreadsheet Link EX with Bioinformatic Data
If you have bioinformatic data in a Microsoft Excel spreadsheet, you can
use Spreadsheet Link EX software to connect Excel with the MATLAB
Workspace to exchange data and to use MATLAB and Bioinformatics Toolbox
computational and visualization functions.
Note The following example assumes you have Spreadsheet Link EX
software installed on your system.
The file used in th e following example contains data from DeRisi, J.L., Iyer,
V.R., and Brown, P.O. (Oct. 24, 1997). Exploring the metabolic and genetic
control of gene expression on a genomic scale. Science 278(5338), 680–686.
PMID: 9381177. The data has been filtered using the steps described in the
Gene Expression Profile Analysis demo.
1 If you have not already done so, modify your system path to include the
MATLAB root directory as described in “Modifying Your System Path” in
the Spreadsheet Link EX documentation.
1-22
2 If you have not already done so, configure the software as described in
“Configuring the Spreadsheet Link EX Software” in the Spreadsheet Link
EX documentation.
3 Close MATLAB if you have it open.
4 Start Microsoft Excel. MATLAB and Spreadsheet Link EX software open
also.
5 From Excel, open the following file provided with the Bioinformatics
Note The notation matlabroot is the MATLAB root directory, which is
the directory where the MATLAB softw are is installed on your system.
Using Spreadsheet Link™ EX with Bioinformatic Data
Note When the Security Warning appears, click Enable Macros.
6 Note that cells J5, J6, J7, and J12 of the spreadsheet contain formulas
using Spreadsheet Link EX functions
MLPutMatrix and MLEvalString.
Tip To view a cell’s formula, select the cell, then view the formula in the
formula bar
at the top of the Excel w indow .
Cells J5, J6, and J7 each create a MATLAB variable from the data
in the spreadsheet, and cell J12 runs the Bioinformatics Toolbox
clustergram function using these three variables as input. For more
information on adding formulas using Spreadsheet Link EX functions, see
“Entering Functions into Worksheet Cells” in the Spreadsheet Link EX
documentation.
1-23
1 Getting Started
Cells J5, J6, and J7 contain formulas
that use the MLPutMatrix function
to create three MATLAB variables.
Cell J12 contains a formula
that uses the MLEvalString function
to run the Bioinformatics Toolbox function
clustergram.
1-24
Cell J17 contains a formula
that uses a macro function,
Clustergram, created in
Visual Basic Editor.
Note that cell J17 contains a formula using a macro function Clustergram,
which was created in the Visual Basic Editor. R unning this macro does the
same as the formulas in cells J5, J6, J7, and J12. To view the
macro function, select Tools > Macro > Visual Basic Ed itor.Formore
information on creating macros using Visual Basic Editor, see “Examples:
Using Spreadsheet Link EX Functions in Macros” in the Spreadsheet Link
EX documentation.
Clustergram
Using Spreadsheet Link™ EX with Bioinformatic Data
7 Run the formula in cell J17 to ana ly z e and visualize the data:
a Select cell J17.
b Press F2.
c Press Enter.
The macro function
(
data, Genes,andTimeSteps) and displaying a Clustergram window
Clustergram runs creating three MATLAB variables
containing dendrograms and a heat map of the data.
8 Edit the formulas in cells J5 and J6 to analyze a subset of the data. Do
this by editing the formulas’ cell ranges to include data for only the first
30 genes:
a Select cell J5,thenpressF2 to display the formula for editing. Change
H617 to H33,thenpressEnter.
1-25
1 Getting Started
b Select cell J6,thenpressF2 to display the formula for editing. Change
A617 to A33,thenpressEnter.
9 Run the formulas in cells J5, J6, J7, and J12 to analyze and visualize
a subset of the data.
a Select cell J5,pressF2,thenpressEnter.
b Select cell J6,pressF2,thenpressEnter.
c Select cell J7,pressF2,thenpressEnter.
d Select cell J12,pressF2,thenpressEnter.
1-26
Using Spreadsheet Link™ EX with Bioinformatic Data
10 Use the commands in the Spreadsheet Link EX toolbar to interact with
the data:
a C lick-drag to select cells B5 through H7,clickputmatrix in the toolbar,
type YAGenes for the variable name, then click OK.Thevariable
YAGenes is added to the MATLAB Workspace as a 3-by-7 matrix.
b Click evalstring in the toolbar, type plot(YAGenes ) for the command,
then click OK. A Figure window displays a plot of the data.
Note Make sure you use the ' (transpose) symbol when plotting the
data in this step. You need to transpose the data in
YAGenes so that it
plots as three genes o ve r seven time interv als.
c Select cell J20,thenclickgetfigure in the toolbar. The figure is added
to the spreadsheet
1-27
1 Getting Started
1-28
Creating get Functions
In this section...
“WhatAregetFunctions?”onpage1-29
“Creating the getpubmed Function” on page 1-30
What Are get Functions?
Bioinformatics Toolbox includes several get functions that retrieve
information from various Web databases. Additionally, with some basic
MATLAB programming skills, you can create your own get function to
retrieve information from a specific Web database.
The following procedure illustrates how to create a function to retrieve
information from the NCBI PubMed database and read the information into
a MATLAB structure. The NCBI PubMed database contains biomedical
literature citations and abstracts.
Creating get Functions
1-29
1 Getting Started
Creating the get
The following pr
using the MATLAB
information fr
MATLAB structu
Specifically
to the PubMed d
structure ar
found by the s
identifier,
The functio
the user of t
number of r
1 From MATLA
ocedure shows you how to create a function named
om PubMed literature searches and write the data to a
, this function will take one or more search terms, submit them
atabase for a search, then return a MATLAB structure or
ray, with each structure containing information for an article
earch. The returned information will include a PubMed
publication date, title, abstract, authors, and citation.
n will also include property name/property value pairs that let
he function limit the search by publication date and limit the
ecords returned.
B, open the MATLAB Editor by selecting File > New >
pubmed Function
getpubmed
Editor. This function will retrieve citation and abstract
re.
M-File.
2 Define the getpubmed function, its input arguments, and return values by
typing:
function pmstruct = getpubmed(searchterm,varargin)
% GETPUBMED Search PubMed database & write results to MATLAB structure
1-30
3 Add code to do some basic error checking for the required input SEARCHTERM.
% Error checking for required input SEARCHTERM
if(nargin<1)
error('GETPUBMED:NotEnoughInputArguments',...
'SEARCHTERM is missing.');
end
4 Create variables for the two property name/property value pairs, and set
their default values.
% Set default settings for property name/value pairs,
% 'NUMBEROFRECORDS' and 'DATEOFPUBLICATION'
maxnum = 50; % NUMBEROFRECORDS default is 50
pubdate = ''; % DATEOFPUBLICATION default is an empty string
% Parsing the property name/value pairs
num_argin = numel(varargin);
for n = 1:2:num_argin
arg = varargin{n};
switch lower(arg)
% If NUMBEROFRECORDS is passed, set MAXNUM
case 'numberofrecords'
maxnum = varargin{n+1};
% If DATEOFPUBLICATION is passed, set PUBDATE
case 'dateofpublication'
pubdate = varargin{n+1};
end
end
6 You access the PubMed database through a search URL, which submits
a search term and options, and then returns the search results in a
specified f ormat. This search URL is comprised of a base URL and defined
parameters. Create a variable containing the base URL of the PubMed
database on the NCBI Web site.
% Create base URL for PubMed db site
baseSearchURL = 'http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
7 Create variables to contain five defined parameters that the getpubmed
function will use, namely, db (database), term (search term), report (report
type, such as MEDLINE
®
), format (format type, such as text), and dispmax
(maximum number of records to display).
% Set db parameter to pubmed
dbOpt = '&db=pubmed';
% Set term parameter to SEARCHTERM and PUBDATE
% (Default PUBDATE is '')
termOpt = ['&term=',searchterm,'+AND+',pubdate];
1-31
1 Getting Started
% Set report parameter to medline
reportOpt = '&report=medline';
% Set format parameter to text
formatOpt = '&format=text';
% Set dispmax to MAXNUM
% (Default MAXNUM is 50)
maxOpt = ['&dispmax=',num2str(maxnum)];
8 Create a variable containing the search URL from the variables created
9 Use the urlread function to submit the search URL, retrieve the search
results, and return the results (as text in the MEDLINE report type) in
medlineText, a character array.
1-32
medlineText = urlread(searchURL);
10 Use the MATLAB regexp function and regular expressions to parse and
extract the information in
medlineText into hits, a cell array, where each
cell contains the MEDLINE-formatted text for one article. The first input
is the character array to search, the second input is a search expression,
which tells the
while the third input,
regexp function to find all records that start with PMID-,
'match',tellstheregexp function to return the
actual records, rather than the positions of the records.
When you are done, your file should look similar to the
getpubmed.m
file included with the Bioinformatics Toolbox software. The sample
getpubmed.m file, including help, is located at:
matlabroot\toolbox\bioinfo\biodemos\getpubmed.m
Note The notation matlabroot is the MATLAB root directory, which is
the directory where the MATLAB softw are is installed on your system.
1-33
1 Getting Started
Extracting Information from Large Multi-Entry Text Files
In this section...
“Overview” on page 1-34
“What Files Can You Access?” on page 1-34
“Before You B eg in” on page 1-35
“Creating a BioIndexedFile Object to Access Your Source File” on page 1-36
“Determining the Number of Entries in Your Source File” on page 1-37
“Retrieving Entries from Your Source File” on page 1-37
“Reading Entries from Your Source File” on page 1-38
Overview
Many biological experiments produce huge data files that are difficult to
access due to their size, which can cause memory issues when reading the file
into the MATLAB Workspace. You can construct a
to access the contents of a large text file (up to 4 GB) containing nonuniform
size entries, such as sequences, annotations, and cross-references to data sets.
The
BioIndexedFile object lets you quickly and efficiently access this data
without loading the source file into memory.
BioIndexedFile object
1-34
You can use the
subset of entries when the source file is too big to fit into memory. You can
access entries using indices or keys. You can read and parse one or more
entries using provided interpreters or a custom interpreter function.
Use the
• Access a subset of the entries for validation or further analysis.
• Parse entries using a custom interpreter function.
BioIndexedFile object in conjunction with your large source file to:
BioIndexedFile object to access individual entries or a
What Files Can You Access?
You can use the BioIndexedFile object to access large text files up to 4 GB
in size.
Extracting Information from Large Multi-Entry Text Files
Your source file can have these application-specific formats:
• FASTA
• FASTQ
• SAM
Your source file can also havethesegeneralformats:
• Table — Tab-delimited table with multiple columns. Keys can be in any
column. Rows with the same key are considered separate entries.
can be in any column. Contiguous rows with the same key are considered
a single entry. Noncontiguous rows with the same key are considered
separate entries.
• Flat — Flat file with concatenated entries separated by a character string,
typically
entrybyawhitespace.
//. Within an entry, the key is separated from the rest of the
Before You B egin
Before constructing a BioIndexedFile object, locate your source file on your
hard drive or a local network.
When you construct a
first time, you also create an auxiliary index file, wh ich by default is saved
to the same location a s your source file. However, if your source file is in a
read-only location, you can specify a different location to save the index file.
Tip If you construct a BioIndexedFile object from your source file on
subsequent occasions, it takes advantage of the existing index file, which
saves time. However, the index file must be in the same location or a location
specified by the subsequent construction syntax.
BioIndexedFile object from your source file for the
1-35
1 Getting Started
Tip If insufficient memory is not an issue when accessing your source file,
you m ay want to try a n appropriate read function, such as
importing data from GenBank files. For a complete list of read functions
in the Bioinformatics Toolbox, see “Data Formats and Databases” in the
Bioinformatics Toolbox Reference.
genbankread,for
Additionally, several read functio ns such as
and
sffread include a Blockread property, which lets you read a subset of
entries from a file, thus saving memory.
fastaread, fastqread, samread,
Creating a BioIndexedFile Object to Access Your
Source File
To construct a BioIndexedFile object from a multi-row table file:
1 Create a variable containing the full absolute path of your source file. For
your source file, use the
Bioinformatics Toolbox software.
sourcefile = which('yeastgenes.sgd');
2 Use the BioIndexedFile constructor function to construct a
BioIndexedFile object from the yeastgenes.sgd source file, which is a
multi-row table file. Save the index file in the Current Folder. Indicate that
thesourcefilekeysareincolumn3. Also, indicate that the header lines in
the source file are prefaced w ith
The BioIndexedFile constructor function constructs gene2goObj,a
BioIndexedFile object, and also creates an index file with the same name
as the source file, but with an IDX extension. It stores this index file in the
Current Folder because we specified this location. However, the default
location for the index file is the same location as the source file.
Extracting Information from Large Multi-Entry Text Files
CautionDo not modify the index file. If you modify it, you can get invalid
results. Also, the constructor function cannot use a modified index file to
construct future objects from the associated source file.
Determining the Number of Entries in Your Source File
To determine the number of entries in your source file, u se the NumEntries
property of the gene2goObj BioIndexedFile object.
gene2goObj.NumEntries
ans =
6476
Note For a list and description of all properties of a BioIndexedFile object,
see
BioIndexedFile class.
Retrieving Entries from Your Source File
Retrieve entries from your source file using either:
• Theindexoftheentry
• The entry key
Retrieving Entries Using Indices
Use the getEntryByIndex method to retrieve a subset of entries from your
source file that correspond to specified indices. For e xample, retrieve the first
12 entries from the
Use the getEntryByKey method to retrieve a subset of entries from your
source file that are associated with specified keys. For example, retrieve all
entries with keys of AAC1 and AAD10 from the
The output subset_entries is a single string of concatenated entries.
Because the keys in the
method returns all entries that have a key of AAC1 or AAD10.
yeastgenes.sgd source file are not unique, this
yeastgenes.sgd source file:
Reading Entries from Your Source File
The BioIndexedFile object includes a read method, which you can use to
read and parse a subset of entries from your source file. The
parses the entries using an interpreter function specified by the
property of the BioIndexedFile object.
Setting the Interpreter Property
Before using the read method,makesuretheInterpreter property of the
BioIndexedFile object is set appropriately.
read method
Interpreter
1-38
If you constructed a
BioIndexedFile object from
...
A source file with an
application-specific format
(FASTA,FASTQ,orSAM)
A source file with a table,
multi-row table, or flat format
The Interpreter property ...
By default is a handle to a function
appropriate for that file type and
typically d oe s not require you to change
it.
By default is [], which means the
interpreter is an anonymous function
in which the output is equivalent to the
input. You can change this to a handle
to a function that accepts a single string
of one or more concatenated entries
and returns a structure or an array of
structures containing the interpreted
data.
Extracting Information from Large Multi-Entry Text Files
TherearetwowaystosettheInterpreter property of the BioIndexedFile
object:
• When constructing the
BioIndexedFile object, use the Interpreter
property name/property value pair
• After constructing the
BioIndexedFile object, set the Interpreter
property
Note For more information on setting the Interpreter property of a
BioIndexedFile object, see BioIndexedFile class.
Reading a Subset of Entries
The read method reads and parses a subset of entries that you specify using
either entry indices or keys.
Example
To quickly find all the gene ontology (GO) terms associated with a particular
gene because the entry keys are gene names:
1 Set the Interpreter property of the gene2goObj BioIndexedFile object
to a handle to a function that reads entries and returns only the column
containing the GO term. In this case the interpreter is a handle to an
anonymous function that accepts strings and extracts strings that start
with the characters
Sequence analysis is the process you use to find in formation about a nucleotide
or amino acid sequence using com putational methods. Common tasks in
sequence analysis are identifying genes, determining the similarity of two
genes, determining the protein coded by a gene, and determining the function
of a gene by finding a similar gene in another organism with a known function.
• “Example: Sequence Statistics” on page 2-2
• “Example: Sequence Alignment” on pa ge 2-18
• “Sequence Tool” on page 2-38
2
• “Multiple Sequence Alignment Viewer” on page 2-52
• “Storing and Managing Short-Read Sequence D ata in Objects” on page 2-59
2 Sequence Analysis
Example: Sequence Statistics
In this section...
“Overview of Example” on page 2 -2
“Determining Nucleotide Content” on page 2-2
“Reading Sequence Information” on page 2-4
“Determining Nucleotide Composition” on page 2-6
“Determining Codon Composition” on page 2-9
“Open Reading Frames” on page 2-12
“Amino Acid Conversion and Composition” on page 2-15
Overview of Example
After sequencing a piece of DNA, one of the first tasks is to investigate the
nucleotide content in the sequence. Starting with a DNA sequence, this
example uses sequence statistics functions to determine mono-, di-, and
trinucleotide content, and to locate open reading frames.
2-2
Determining Nucleotide Content
The following procedure illustrates how to use the MATLAB Help browser
to search the Web for information. In this example you are interested in
studying the human mitochondrial genome. While many g enes that code for
mitochondrial proteins are found in the cell nucleus, the mitochondrial has
genes that code for proteins used to produce energy.
First research information about the human mitochondria and find the
nucleotide se quence for the genome. Next, look at the nucleotide content for
the entire sequence. And finally, determine open reading frames and extract
specific gene sequences.
1 Use the MATLAB Help browser to explore the Web. In the MATLAB
Command Window, type
web('http://www.ncbi.nlm.nih.gov/')
Example: Sequence Statistics
A separate browser window opens with the home page for the NC BI Web
site.
2 Search the NCBI Web site for information. For example, to search for the
human mitochondrion genome, from the Search list, select
the for box, enter
mitochondrion homo sapiens.
Genome,andin
The NCBI Web search returns a list of links to relevant pages.
3 Select a result page. For example, click the link labeled NC_001807.
The MATLAB Help browser displays the NCBI page for the human
mitochondrial genome.
2-3
2 Sequence Analysis
2-4
Readi
The fo
apubl
envi
acce
inte
envi
The
Gen
ng Sequence Information
llowing procedure illustrates how to find a nucleotide sequence in
ic database and read the sequence information into the MATLAB
ronment. Many public databases for nucleotide sequences are
ssible from the Web. The MATLAB Command Window provides an
grated enviro nment for bringing sequence informatio n into the MATLAB
ronment.
consensus sequence for the human mitochondrial genome has the
Bank accession number
NC_001807. Since the whole GenBank entry is
Example: Sequence Statistics
quite large and you might only be interested in the sequence, you can get
just the sequence information.
1 Get sequence information from a Web database. For example, to retrieve
sequence information for the human mitochondrial genome, in the
MATLAB Command W indow, type
2 If you don’t have a Web connection, you can load the data from a MAT file
included with the Bioinformatics Toolbox software, using the command
load mitochondria
The load function loads the sequence mitochondria into the MATLAB
Workspace.
3 Get information about the sequence. Type
whos mitochondria
Information about the size of the sequence displays in the MATLAB
Command Window.
NameSizeBytesClass
mitochondria1x1657133142char array
Grand total is 16571 elements using 33142 bytes
2-5
2 Sequence Analysis
Determining Nuc
The following pr
dimers, and then
sequence with a
parts of the seq
indicate poss
before a gene.
After you rea
the sequence
characteris
mitochondr
on page 2-4.
1 Plot monom
the MATLAB
ntdensity(mitochondria)
This grap
ocedure illustrates how to determine the monomers and
visualize data in graphs and bar plots. Sections of a DNA
high percent of A+T nucleotides usually indicate intergenic
uence, while low A+T and higher G+C nucleotide percentages
ible genes. Many times high CG dinucleotide content is located
d a sequence into the MATLAB environment, you can use
statistics functions to determine if your sequence has the
tics of a protein-coding region. This procedure uses the human
ial genome as an example. See “Reading Sequence Information”
er densities and combined monomer densities in a graph. In
Command Window, type
h shows that the genome is A+T rich.
leotide Composition
2-6
Example: Sequence Statistics
2 Count the nucleotides using the basecount function.
basecount(mitochondria)
A list of nucleotide counts is shown for the 5’-3’ strand.
ans =
A: 5113
C: 5192
G: 2180
T: 4086
3 Count the nucleotides in the reverse complement of a sequence using the
seqrcomplement function.
basecount(seqrcomplement(mitochondria))
2-7
2 Sequence Analysis
As expected, the nucleotide counts on the reverse complement strand are
complementary to the 5’-3’ strand.
ans =
A: 4086
C: 2180
G: 5192
T: 5113
4 Use the function basecount with the chart option to visualize the
nucleotide distribution.
basecount(mitochondria,'chart','pie');
A pie chart displays in the MATLAB Figure w indow.
2-8
5 Count the dimers in a sequence and display the information in a bar chart.
dimercount(mitochondria,'chart','bar')
Example: Sequence Statistics
Determining Codon Composition
The following procedure illustrates how to look at codons for the six reading
frames. Trinucleotides (codon) codeforanaminoacid,andthereare64
possible codons in a nucle otide sequence. Knowing the percent of codons
in your sequence can be helpful when you are comparing with tables for
expected codon usage.
After you read a sequence into the MATLAB environment, you can analyze the
sequence for codon composition. This procedure uses the human mitochondria
genome as an example. See “Reading Sequence Information” on page 2-4.
1 Count codons in a nucleotide sequence. In the MATLAB Command
Window, type
codoncount(mitochondria)
The codon counts for the first reading frame displays.
title(sprintf('Codons for reverse frame %d',frame));
end
2-10
Heat maps display all 64 codons in the 6 reading frames.
Example: Sequence Statistics
2-11
2 Sequence Analysis
2-12
Open Reading Frames
The following procedure illustrates how to locate the open reading frames
using a specific genetic code. Determining the protein-coding sequence for a
eukaryotic gene can be a difficult task because introns (noncoding sections)
are mixed with exons. However, prokaryotic genes generally do not have
introns and mRNA sequences have the introns removed. Identifying the
start and stop codons for translation determines the protein-coding section,
or open reading frame (ORF), in a sequence. Once you know the ORF for a
gene or mRNA, you can translate a nucleotide sequence to its corresponding
amino acid sequence.
After you read a sequence into the MATLAB environment, you can analyze
the sequence for open reading frames. This procedure uses the human
mitochondria genome as an example. Se e“ReadingSequenceInformation”
on page 2-4.
1 Display open reading frames (ORFs) in a nucleotide sequence. In the
MATLAB Command Window, type:
seqshoworfs(mitochondria);
Example: Sequence Statistics
If you compare this output to the genes shown on the NCBI page for
NC_001807, there are fewer genes than expected. This is because vertebrate
mitochondria use a genetic code slightly different from the standard genetic
code. For a list of genetic codes, see the “Genetic Code” table in the
aa2nt
reference p age in the Bioinformatics Toolbox Reference.
2 Display ORFs using the Vertebrate Mitochondrial code.
Notice that there are now two large OR Fs on the first reading frame. One
starts at position 4471 and the other starts at 5905. These correspond to
the genes ND2 (NADH dehydrogenase subunit 2 [Homo sapiens] ) and
COX1 (cytochrome c oxidase subunit I) genes.
3 Find the corresponding stop codon. ThestartandstoppositionsforORFs
getgenpept function retrieves the published conversion from the NCBI
The
tabase and reads it into the MATLAB Workspace.
da
2-15
2 Sequence Analysis
3 Count the amino acids in the protein sequence.
aacount(ND2AASeq, 'chart','bar')
A bar graph displays. Notice the high content for leucine, threonine and
isoleucine, and also notice the lack of cysteine and aspartic acid.
2-16
4 Determine the atomic composition and molecular weight of the prote in .
atomiccomp(ND2AASeq)
molweight (ND2AASeq)
The following disp la ys in the MAT LAB Workspace:
ans =
C: 1818
H: 3574
N: 420
O: 817
S: 25
ans =
Example: Sequence Statistics
3.8960e+004
If this sequence was unknown, you could use this information to identify
the protein by comparing it with the atomic composition of other proteins
in a database.
2-17
2 Sequence Analysis
Example: Sequence Alignment
In this section...
“Overview of Example” on page 2-18
“Finding a Model Organism to Study” on page 2-18
“Retrieving Sequence Information from a Public Database” on page 2-20
“Searching a Public Database for Related Genes” on page 2-23
“Locating Protein Coding Sequences” on page 2-25
“Comparing Amino Acid Sequences” on page 2-29
Overview of Example
Determining the similarity between two sequences is a common task in
computational biology. Starting with a nucleotide sequence for a human gene,
this exam pl e uses alignment algorithms to locate and verify a corresponding
gene in a model organism.
2-18
Finding a Model Or g anism to Study
The following procedure illustrates how to use the MATLAB Help browser
to search the Web for information. In this example, you are interested in
studying Tay-Sachs disease. Tay-Sachs is an autosomal recessive disease
caused by the absence of the enzyme beta-hexosaminidase A (Hex A). This
enzyme is responsible for the breakdown of gangliosides (GM2) in brain and
nerve cells.
First, research information about Tay-Sachs and the enzyme that is a ssociated
with this disease, then find the nucleotide sequence for the human gene
that codes for the enzyme, and finally find a corresponding gene in another
organism to use as a model for study.
1 Use the MATLAB Help browser to explore the Web. In the MATLAB
Command window, type
web('http://www.ncbi.nlm.nih.gov/')
Example: Sequence Alignment
The MATLAB Help browser opens with the home page for the NCBI Web
site.
2 Search the NCBI Web site for information. For example, to search for
Tay-Sachs, from the Search list, select
box, enter
Tay-Sachs.
NCBI Web Site,andinthefor
The NCBI Web search returns a list of links to relevant pages.
3 Select a result page. For example, click the link labeled Tay-Sachs
Disease.
A page in the genes and diseases section of the NCBI Web site opens. This
section provides a comprehensive introduction to medical genetics. In
particular, this page contains an introduction and pictorial representation
of the e nzyme Hex A and its role in the metabolism of the lipid GM2
ganglioside.
2-19
2 Sequence Analysis
2-20
4 After completing your research, you have co ncluded the following:
The gene HEXA codes for the alpha subunit of the dimer enzyme
hexosaminidase A (Hex A), while the gene HEXB codes for the beta subunit
of the enzyme. A third gene, GM2A, codes for the activator protein GM2.
However, it is a mutation in the gene HEXA that causes Tay-Sachs.
Retrieving Sequence Information from a Public
Database
The following procedure illustrates how to find the nucleotide sequence for a
human gene in a public database and read the sequence information into the
MATLAB environment. Many public databases for nucleotide sequences (for
example, GenBank, EMBL-EBI) are accessible from the Web. The MATLAB
Command Window with the MATLAB Help browser provide an integrated
environment for searching the Web and bringing sequence information into
the MATLAB environment.
After you locate a sequence, you need to move the sequence data into the
MATLAB Workspace.
Example: Sequence Alignment
1 Open the MATLAB Help browser to the NCBI Web site. In the MATLAB
Command Widow, type
web('http://www.ncbi.nlm.nih.gov/')
The MATLAB He lp browser w indow opens with the NCBI home page .
2 Search for the gene you are interested in studying. For example, from the
Search list, select
Nucleotide,andinthefor box enter Tay-Sachs.
The search returns entries for the genes that code the alpha and beta
subunits of the enzyme hexosaminidase A (Hex A), and the gene that codes
the activator enzyme. The NCBI reference for the human gene HEXA has
accession number
NM_000520.
2-21
2 Sequence Analysis
2-22
3 Get sequ
sequenc
ence data into the MATLAB environment. For example, to get
e information for the human gene HEXA, type
humanHEXA = getgenbank('NM_000520')
Note B
chara
The h
lank spaces in GenBank accession numbers use the underline
cter. Entering
'NM 00520' returns the wrong entry.
uman gene is loaded into the MATLAB Workspace as a structure.
humanHEXA =
LocusName: 'NM_000520'
LocusSequenceLength: '2255'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'PRI'
LocusModificationDate: '13-AUG-2006'
Definition: 'Homo sapiens hexosaminidase A (alpha polypeptide) (HEXA), mRNA.'
Accession: 'NM_000520'
Version: 'NM_000520.2'
GI: '13128865'
Project: []
Keywords: []
Segment: []
Source: 'Homo sapiens (human)'
SourceOrganism: [4x65 char]
Reference: {1x58 cell}
Comment: [15x67 char]
Features: [74x74 char]
CDS: [1x1 struct]
Sequence: [1x2255 char]
SearchURL: [1x108 char]
RetrieveURL: [1x97 char]
Example: Sequence Alignment
Searching a Public Database for Related Genes
The following procedure illustrates how to find the nucleotide sequence for
a mouse gene related to a human gene, and read the sequence information
into the MATLAB environment. The sequence and fun ction of many gen es
is conserved during the evolution of species through homologous genes.
Homologous genes are genes that have a common ancestor and similar
sequences. One goal of searching a pu blic database is to find similar genes.
If you are able to locate a sequence in a database that is similar to your
unknown gene or protein, it is likely that the function and characteristics of
the known and unknown genes are the same.
After finding the nucleotide sequence for a human gene, you can do a BLAST
search or search in the genome of another organism for the corresponding
gene. This procedure uses the mouse genome as an example.
1 Open the MATLAB Help browser to the NCBI Web site. In the MATLAB
Command window, type
web('http://www.ncbi.nlm.nih.gov')
2-23
2 Sequence Analysis
2 Search the nucleotide database for the gene or protein you are interested in
studying. For example, from the Search list, select
for box e nter
hexosaminidase A.
Nucleotide,andinthe
The search returns entries for the mouse and human genomes. The NCBI
reference for the mouse gene HEXA has accession number
AK080777.
2-24
3 Get sequence information for the mouse gene into the MATLAB
environment. Type
mouseHEXA = getgenbank('AK080777')
Example: Sequence Alignment
The mouse gene sequence is loaded into the MATLAB Workspace as a
structure.
mouseHEXA =
LocusName: 'AK080777'
LocusSequenceLength: '1839'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'HTC'
LocusModificationDate: '02-SEP-2005'
Definition: [1x150 char]
Accession: 'AK080777'
Version: 'AK080777.1'
GI: '26348756'
Project: []
Keywords: 'HTC; CAP trapper.'
Segment: []
Source: 'Mus musculus (house mouse)'
SourceOrganism: [4x65 char]
Reference: {1x8 cell}
Comment: [8x66 char]
Features: [33x74 char]
CDS: [1x1 struct]
Sequence: [1x1839 char]
SearchURL: [1x107 char]
RetrieveURL: [1x97 char]
Locating Protein Coding Sequences
The following procedure illustrates how to convert a sequence from nucleotides
to amino acids and identify the open reading frames. A nucleotide sequence
includes regulatory sequences before and after the protein coding section. By
analyzing this sequence, you can determine the nucleotides that code for
the amino acids in the final protein.
After you have a list of genes you are interested in studying, you can
determine the protein coding sequences. This procedure uses the human gene
HEXA and mouse gene HEXA as an example.
2-25
2 Sequence Analysis
1 If you did not retrieve gene data from the Web, you can load example data
from a MAT-file included with the Bioinformatics Toolbox software. In the
MATLAB Command window, type
load hexosaminidase
The structures humanHEXA and mouseHEXA load into the MATLAB
Workspace.
2 Locate open reading frames (ORFs) in the human gene. For example, for
the human gene HEXA, type
humanORFs = seqshoworfs(humanHEXA.Sequence)
seqshoworfs
creates the output structure humanORFs. This structure
contains the position of the start and stop codons for all open reading
frames (ORFs) on each reading frame.
humanORFs =
1x3 struct array with fields:
Start
Stop
The Help browser opens displaying the three reading frames with the
ORFs colored blue, red, and green. Notice that the longest ORF is in the
first reading frame.
2-26
Example: Sequence Alignment
3 Locate open reading frames (ORFs) in the mouse gene. Type:
2-27
2 Sequence Analysis
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
seqshoworfs
mouseORFs =
1x3 struct array with fields:
creates the structure mouseORFS.
Start
Stop
The m o use gene shows the longest ORF on the first reading frame.
2-28
Example: Sequence Alignment
Comparing Amino
The following pr
functions to com
functions to lo
alignment func
are using amin
After you have
you can conve
their corres
for similari
displays the global alignment of the two sequences in
the Help browser. Notice that the calculated identity between the two
sequences is
60%.
Example: Sequence Alignment
2-31
2 Sequence Analysis
The alignment is very good between amino acid position 69 and 599, after
which the two sequences appear to be unrelated. Notice that there is a
stop (
*) in the sequence at this point. If you shorten the sequences to
include only the amino acids that are in the protein you might g et a better
alignment. Include the amino a cid positions from the first methionine (
the first stop (
4 Trim the sequence from the first start amino acid (usually M)tothefirst
stop (
*) and then try alignment again. Find the indices for the stops in
*) that occurs after the first methionine.
M)to
the sequences.
humanStops = find(humanProtein == '*')
humanStops =
41599611713722730
mouseStops = find(mouseProtein == '*')
2-32
mouseStops =
539557574606
Looking at the amino acid sequence for humanProtein,thefirstM is at
position 70, and the first stop after that position is actually the second
stop in the sequence (position 599). Looking at the amino acid sequence
for
mouseProtein,thefirstM is at position 11, and the first stop after that
position is the first stop in the sequence (position 557).
5 Truncate the sequences to include only amino acids in the protein and
The result from first truncating a nucleotide sequence before converting it
to an amino acid sequence is the same as the result from truncating the
amino acid sequence after conversion. See the result in step 6.
An alternative method to working with subsequences is to use a local
alignment function with the nontruncated sequences.
2-35
2 Sequence Analysis
8 Locally align the two amino acid seque nces using a Smith-Waterman
“Viewing Nucleotide Sequence Information” on page 2-40
“Searching for Words” on page 2-42
“Exploring Open Reading Frames” on page 2-44
“Viewing Amino Acid Sequence Statistics” on page 2-47
“Closing the Sequence Tool” on page 2-51
“References” on page 2-51
Overview of the Sequence Tool
The Sequence Tool window integrates many of the sequence functions in the
toolbox. Instead of entering commands in the MATLAB Command Window,
you can se le ct and enter options.
2-38
Importing a Sequence
The first step when analyz in g a nucleotide or amino acid sequence is to import
sequence information into the MATLAB environment. The Sequence Tool can
connect to Web databases such as NCBI and EMBL and read information into
the MATLAB environment.
The following procedure illustrates how to retrieve sequence information from
the NCBI database on the Web. This example uses the GenBank accession
number NM_000520, which is the human gene HEXA that is associated with
Tay-Sachs disease.
1 In the MATLAB Command Window, type
seqtool
The Sequence Tool window opens without a sequence loaded. Notice that
the panes to the right and bottom are blank.
Sequence Tool
2 To retrieve a sequence from the NCBI database, select File > Download
Sequence from > NCBI.
The Download Sequence from NCBI dialog box opens.
3 In the Enter Sequence box, type an accession number for an NCBI
database e ntry, for example, NM_000520.ClicktheNucleotide option
button, and then click OK.
The MATLAB software accesses the NCBI database on the Web, loads
nucleotide sequence information for the accession number you entered,
and calculates some basic statistics.
2-39
2 Sequence Analysis
2-40
Viewin
After y
infor
for OR
1 In the
info
2 Now click Features. The right pane displays NCBI feature information,
including index numbers for a gene and any CDS sequences.
3 Click ORF to show the search results for ORFs in the six reading frames.
g Nucleotide Sequence Information
ou import a sequence into the Sequence Tool window, you can read
mation stored with the sequence, or y ou can view graphic representations
Fs and CDSs.
left pane tree, click Comments. The right pane displays general
rmation about the sequence.
Sequence Tool
4 Click Annotated CDS to show the protein coding part of a nucleotide
sequence.
2-41
2 Sequence Analysis
2-42
Search
The fol
seque
and pa
1 Selec
2 In the Find Word dialog box, type a sequence word or pattern, for example,
atg,andthenclickFind.
ing for Words
lowing procedure illustrates how to search for characteristic words and
nce patterns. You will search for sequence patterns like the TATAA box
tterns for specific restriction enzymes.
t Sequence > Find Word.
Sequence Tool
The Sequence T
selected wor
d.
ool window s earches and displays the location of the
3 Clear the display by clicking the Clear Word Selection buttonon
the toolbar.
2-43
2 Sequence Analysis
Exploring Open R
The following pr
nucleotide sequ
of a nucleotide
coding part of a
amino a cid seq
1 In the left pan
TheSequenceToolwindowdisplaystheORFsforthesixreadingframesin
the lower-rig ht pane. Hover the cursor over a frame to display information
about it.
2 Click the longest ORF on reading frame 2.
The ORF is highlighted to indicate the part of the sequence that is selected.
ocedure illustrates how to identify the protein coding part of a
ence and copy it into a new view. Identifying coding sections
sequence is a common bioinformatics task. After locating the
sequence,youcancopyittoanewview,translateittoan
uence, and continue with your analysis.
e, click ORF.
eading Frames
2-44
3 Right-click the selected ORF and then select Export to Workspace.In
theExporttoMATLABWorkspacedialogbox,typeavariablename,for
example, NM_000520_ORF_2,thenclickExport.
The NM_000520_ORF_2 variable is added to the MATLAB Workspace.
4 Select File > Import from Workspace. Type the name of a variable
with an exported ORF, for example, NM_000520_ORF_2,andthenclick
Import.
The Sequence Tool window adds a tab at the bottom for the new sequence
while leaving the original sequence open.
Sequence Tool
2-45
2 Sequence Analysis
2-46
5 In the left pane, click Full Translation. Select Display > Amino Acid
Residue Display > One Letter Code.
The Sequence Tool window displays the amino acid sequence below the
nucleotide sequence.
Sequence Tool
Viewin
The fol
an ORF
acid s
This e
alpha
1 Sele
The Download Sequence from NCBI dialog box opens.
g Amino Acid Sequence Statistics
lowing procedure illustrates how to view an amino acid sequence for
located in a nucleotide sequence. You can import your own amino
equence, or you can get a protein sequence from the GenBank database.
xample uses the GenBank accession number NP_000511.1, which is the
subunit for a human enzyme associated with Tay-Sachs disease.
ct File > Download Sequence from > NCBI.
2-47
2 Sequence Analysis
2 In the Enter Sequence box, type an accession number for an NCBI
database entry, for example, NP_000511.1.ClicktheProtein option
button, and then click OK.
The MATLAB software accesses the NCBI database on the Web and loads
amino acid sequence information for the accession number you entered.
Function, Hydrophobicity, Structure,orTaylor. Fo r example, select
Function.
The display colors change to highlight charge information about the amino
acid residues. The following table shows color legends for the amino acid
color schemes.
2-49
2 Sequence Analysis
2-50
Amino Acid Color Schem eColor Leg end
Charge• Acidic — Red
• Basic — Light Blue
• Neutral — Black
tion
Func
• Acidic — Red
• Basic — Light Blue
• Hydropobic, nonpolar — Black
• Polar, uncharged — Green
Loading...
+ hidden pages
You need points to download manuals.
1 point = 1 manual.
You can buy points or you can get point for every manual you upload.