1.  Data used by PathwayLinker

PathwayLinker stores all data as directory names. For the data structure see the documentation. The following zip file contains the names of a large number of nested directories:

  » (136 MB zip file)


2.  Perl program used for compiling the data sets of PathwayLinker

To run this Perl script we used the shell command shown below, which contains data file names and parameters. The versions of the external files used for compiling PathwayLinker's database can also be downloaded below, together with links to the current versions of these files.

   » View/download the Perl script used for compiling the data set used by PathwayLinker


2.1  How to run the Perl script

Below is a shell command for Linux / Unix / Mac OS X computers. Some options for Windows computers:

  • installing Linux in a virtual machine (VMware Workstation or VirtualBox), or using a Unix-like environment such as Cygwin
  • logging into a Linux computer with PuTTY or Xmanager

# ========= shell command ========

./ \
./data/biogrid/BIOGRID-ALL-3.0.64.mitab.txt.gz \
"6239 7227 9606" \
./data/biogrid/ \
./data/ccsbgenetic/genetic.txt.gz \
./data/ccsbwi8/wi8.txt.gz \
./data/droidcuragen/DroID_curagen_yth.txt.gz \
./data/droidfinley/DroID_finley_yth.txt.gz \
./data/droidgenetic/DroID_genetic.txt.gz \
./data/droidhybrigenics/DroID_hybrigenics_yth.txt.gz \
./data/droidotherphysical/DroID_otherphysical.txt.gz \
./data/hprd/HPRD_Release9_041310/BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt.gz \
./data/string/protein.links.detailed.v8.3.txt.gz \
400 \
"./data/kegg/*" \
./data/kegg_sign-pathways.csv \
./data/reactome/uniprot_2_pathways.stid.txt.gz \
./data/reactome/reactome--signalingPathways--shortNames.txt \
"./data/signalink/*-FULL-node.csv.gz" \
./data/signalink/SignaLink--pathwayInfo.txt \
"./data/uniprot/uniprot_*--ac.txt.gz" \
"./data/uniprot/uniprot_*_invertebrates--ac--cg--orf.txt.gz" \
"./data/uniprot/uniprot_*--ac_id_de_gn.txt.gz" \
./data/flybase_reference_report_pages/ \
./data/PathwayLinkerDb \
./data/unconvertedProteinIds.txt \
./data/acList.txt \
>o 2>e&


2.2  External data files used by the Perl script

Version of the external data file used for PathwayLinker | Current version of the external data file
BIOGRID-ALL-3.0.64.mitab.txt.gz (12 MB) (73 MB) | Download from BioGrid
CCSB genetic: genetic.txt.gz (7 kB); CCSB WI8: wi8.txt.gz (22 kB) | Download from CCSB
DroID_curagen_yth.txt.gz (446 kB); DroID_finley_yth.txt.gz (60 kB); DroID_genetic.txt.gz (151 kB); DroID_hybrigenics_yth.txt.gz (27 kB); DroID_otherphysical.txt.gz (46 kB) | Download from DroID
 | Download from HPRD
STRING: protein.links.detailed.v8.3.txt.gz (1.6 GB) | Download from STRING
KEGG: (9 kB) (11 kB) (27 kB) | Download from KEGG:
KEGG pathways and pathway names, signaling pathways marked: kegg_sign-pathways.csv (13 kB) | (same file)
Reactome: uniprot_2_pathways.stid.txt.gz (49 kB) | Download from Reactome
Reactome: Lists of signaling pathway member proteins | (same file)
SignaLink: Lists of signaling pathway member proteins (nodes): SignaLink-cel-nodes.csv (10 kB); SignaLink-dme-nodes.csv (7 kB); SignaLink-hsa-nodes.csv (12 kB) | Download from SignaLink
SignaLink: Pathway ID to name mapping: SignaLink--pathwayInfo.txt | (same file)
UniProtKB plain text data files by taxonomic division (on July 21, 2009) | Download from UniProt


2.3  Web services used by the Perl script


Perl script quantifying how significantly two interacting proteins (on average) function in the same signaling pathway(s)

Computation of the functional similarity Z-score for one report

For a given report, consider all selected sources of interactions (e.g., BioGrid and HPRD) and all selected sources of signaling pathways (e.g., KEGG and SignaLink). The following analysis is performed separately for each source of signaling pathways.

  1. For each protein of the selected organism list all signaling pathways to which that protein belongs.
  2. Define the similarity, s, between the signaling pathway memberships of two proteins as the Jaccard index of their pathway lists, i.e., the number of pathways shared by the two proteins divided by the number of pathways to which at least one of them belongs.
  3. Consider all protein pairs that interact through at least one of the selected interaction types. Restrict this list to the pairs in which at least one of the two proteins is a member of at least one signaling pathway according to the given source (database) of signaling pathways. The number of such protein pairs is N. The average of s over these N pairs is a1.
  4. Compute the average, a2, of s over N random protein pairs, in which each protein is selected randomly from all proteins of the given organism such that in each pair at least one of the two selected proteins is a member of at least one signaling pathway.
  5. Repeat step 4 100 times, computing a2 each time, and denote the average and the standard deviation of the a2 values by A and σ, respectively.
  6. The Z-score    Z = (a1 - A) / σ    quantifies how significantly a signaling pathway member protein and another protein that interacts with it via at least one of the selected interaction types function in the same signaling pathway(s) according to the given signaling pathway source (database).
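
The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl script that was used: the function names and input layout are hypothetical, the two proteins of a random pair are assumed to be distinct, the similarity of two proteins without any pathway is taken as 0, and σ is computed as the sample standard deviation (the text does not say which variant was used).

```python
import random
import statistics


def jaccard(a, b):
    """Jaccard index of two sets; taken as 0.0 when both sets are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(pairs, pathways):
    """Average Jaccard index of the pathway memberships over a list of protein pairs."""
    return statistics.mean(
        jaccard(pathways.get(p, set()), pathways.get(q, set())) for p, q in pairs
    )


def random_pairs(proteins, pathways, n, rng):
    """Step 4: draw n random pairs with at least one pathway member in each pair."""
    pairs = []
    while len(pairs) < n:
        p, q = rng.sample(proteins, 2)  # two distinct proteins (assumption)
        if pathways.get(p) or pathways.get(q):
            pairs.append((p, q))
    return pairs


def z_score(interactions, proteins, pathways, repeats=100, seed=0):
    """Return Z = (a1 - A) / sigma for one source of signaling pathways.

    interactions: list of (protein, protein) tuples
    proteins:     list of all proteins of the organism
    pathways:     dict mapping a protein to the set of its pathways
    """
    rng = random.Random(seed)
    # Step 3: keep interacting pairs with at least one pathway member.
    kept = [(p, q) for p, q in interactions if pathways.get(p) or pathways.get(q)]
    a1 = mean_similarity(kept, pathways)
    # Steps 4 and 5: repeat the random control and summarise the a2 values.
    a2 = [
        mean_similarity(random_pairs(proteins, pathways, len(kept), rng), pathways)
        for _ in range(repeats)
    ]
    A, sigma = statistics.mean(a2), statistics.stdev(a2)  # sample std. dev. (assumption)
    return (a1 - A) / sigma  # step 6; fails if all a2 values are identical
```

A positive Z means that interacting pairs share signaling pathways more often than random pairs do. The sketch raises an error when no interacting pair survives step 3 or when σ is zero; a production script would have to handle both cases.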

Perl script

To run the Perl script quantifying the functional similarity of proteins we used as input the data directories above and the shell command shown below, which contains data file names and parameters.

   » Download the Perl script used for quantifying the functional similarity of interacting proteins

We used this shell command to run the Perl script ('1' in line 6 indicates that only UniProtKB/Swiss-Prot proteins are used; '0' in the same line means that UniProtKB/TrEMBL proteins are also allowed):

./ \
<directory_containing_PathwayLinker_data_(see_above)> \
"./data/uniprot/uniprot_*ac.txt.gz" \
1001 \
100 \
1 \
./data/stats/<names-of-z-score-output-files-start-with-this-string--> \

Output data file (Z scores)

This data file contains TAB-separated fields. Right-click on the link below, select 'save as', and save the file. Then right-click on the saved file and open it with MS Excel, OpenOffice, etc. Only reviewed (Swiss-Prot) proteins were used for this analysis.

  » Download data file (TAB-delimited text, 40 kB)


3.  Autocomplete function on the search page

On the search page of PathwayLinker (in the input field named "Proteins") an autocomplete function makes it easier for the user to enter gene/protein names and identifiers.

3.1  Input data for the autocomplete tree

The database of suggestions is extracted from two UniProt data files: uniprot_sprot_human.dat.gz (55.5 MB) and uniprot_sprot_invertebrates.dat.gz (22.3 MB). Note that while PathwayLinker allows both Swiss-Prot and TrEMBL proteins, the autocomplete tree contains only data from Swiss-Prot.

3.2  Download the autocomplete tree and use it for a search box on your own webpage

    » Download the database of suggestions (for C. elegans, D. melanogaster, and H. sapiens).

The autocomplete input box and its popup were implemented with jQuery UI.

  • suggest.html provides a minimal example, containing a single autocomplete input box
  • is the Perl script to query for the prefixes. It is called using AJAX by suggest.html.

The Perl script ( uses the Tree::Prefix package:

3.3  How to build the autocomplete database on your own

To build the database of words (completions) manually, run the shell script, which is the main program running the following Perl scripts:

Download all the scripts here: autocomplete_build.tar.gz.

This package was written to generate a database structure and query interface for automatic completions. First, all possible completion strings are entered, then they are handled by this package. The main steps in generating this data set are the following:

  1. enter strings
  2. generate prefix tree as a Perl hash
  3. remove useless branches (containing only a few elements) from the tree
  4. save hash into a directory structure (e.g. when pla is entered, navigate to the directory p/l/a)
  5. use directory structure for the query
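
The five steps above can be illustrated with the following Python sketch. It is not the Perl package itself, and several details are assumptions made for the example: a flat dictionary keyed by prefix stands in for the nested Perl hash, "useless" branches are taken to be prefixes with fewer than min_size completions, prefixes are limited to max_depth characters, the file name completions.txt is hypothetical, and all strings are assumed to contain only characters that are safe in directory names.

```python
import os
from collections import defaultdict


def build_prefix_index(words, max_depth=3):
    """Steps 1-2: map every prefix (up to max_depth characters) to its completions."""
    index = defaultdict(set)
    for word in words:
        key = word.lower()
        for depth in range(1, min(len(key), max_depth) + 1):
            index[key[:depth]].add(word)
    return index


def prune(index, min_size=2):
    """Step 3: drop prefixes that have fewer than min_size completions (assumption)."""
    return {prefix: words for prefix, words in index.items() if len(words) >= min_size}


def save(index, root):
    """Step 4: store the completions of prefix 'pla' in the directory root/p/l/a."""
    for prefix, words in index.items():
        directory = os.path.join(root, *prefix)
        os.makedirs(directory, exist_ok=True)
        # "completions.txt" is a hypothetical file name chosen for this sketch.
        with open(os.path.join(directory, "completions.txt"), "w", encoding="utf-8") as fh:
            fh.write("\n".join(sorted(words)) + "\n")


def query(root, typed):
    """Step 5: look up the completions by walking to the directory of the typed prefix."""
    path = os.path.join(root, *typed.lower(), "completions.txt")
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()
```

With this layout a query needs no database server: the typed prefix is turned directly into a path, so answering a request costs one file read. For example, after saving the pruned index built from PLA2G4A, PLAU, PLK1 and TP53, query(root, "pla") reads root/p/l/a/completions.txt and returns the two names starting with PLA.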
