Introduction

geanno is a Python package that, given a list of genomic intervals, annotates them with arbitrary genomic regions of interest as well as with genes. Both the list of intervals to be annotated and the database regions of interest must be encoded in a bed-like format.

Installation

The Python package geanno is available on the Python Package Index (PyPI) and can be installed via pip:

pip install geanno

Alternatively, you can get the sources from the project's GitHub page:

https://github.com/HiDiHlabs/geanno

Example

Base file

The base file of intervals, which is annotated against a database of bed-like files containing genomic intervals of interest, must itself be bed-like. It suffices that the first three columns follow bed conventions, i.e.

  • 1st column: Chromosome

  • 2nd column: Start position (0-based)

  • 3rd column: End position (1-based)

Base interval entries may contain an arbitrary number of additional columns.

cat base.bed
#chrom      start   end
4       128887787       128887839
4       188862197       188862251
4       185746125       185746231
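The column conventions above can be checked before annotation. The following is a minimal sketch using only the Python standard library. The helper name read_base is hypothetical and not part of geanno, and the sketch assumes the file is tab-separated with an optional header line starting with "#".

```python
import csv
import io

# Inline copy of the base.bed example, so the sketch runs without a file on disk.
BASE_TEXT = (
    "#chrom\tstart\tend\n"
    "4\t128887787\t128887839\n"
    "4\t188862197\t188862251\n"
    "4\t185746125\t185746231\n"
)


def read_base(handle):
    """Parse a bed-like table into (chrom, start, end, extra_columns) tuples.

    Hypothetical helper for illustration; geanno does its own parsing.
    """
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip blank lines and the optional '#'-prefixed header line.
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Assumption: start must be non-negative and must not exceed end.
        if start < 0 or end < start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        # Any further columns are kept untouched.
        rows.append((chrom, start, end, fields[3:]))
    return rows


base_rows = read_base(io.StringIO(BASE_TEXT))
print(base_rows[0])
```

To check a real file, pass an open file handle instead of the io.StringIO object, e.g. open("base.bed", newline="").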

Database file

The database file is a tab-separated file containing information about the genomic regions of interest that are to be annotated to the base file. The database file contains the following information:

  • FILENAME: Absolute path to the file

  • REGION.TYPE: E.g. protein.coding.genes, Enhancers, …

  • SOURCE: E.g., Cell type from which regions are derived

  • ANNOTATION.BY: SOURCE | NAME

  • MAX.DISTANCE: Maximal distance between base and database interval, such that the database interval is annotated to the base interval.

  • DISTANCE.TO: The location to which the distance shall be computed. Can be START | END | MID | REGION

  • N.HITS: Integer value defining the number of elements from the database that shall be annotated to the base

  • NAME.COL: If ANNOTATION.BY == NAME, you can define the column (0-based) in which the name is stored. If NAME.COL == NA, the 4th column is assumed to contain the name.

The first line of the tsv file must contain the column identifiers listed above!
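To make MAX.DISTANCE and DISTANCE.TO more concrete, here is one possible reading of these fields as a small sketch. It is an illustration only and not geanno's implementation. The function names are hypothetical, intervals are treated as closed for simplicity, and the distance is unsigned, whereas the distances geanno reports in its results appear to be signed.

```python
def target_bounds(db_start, db_end, distance_to):
    """Return the (low, high) bounds of the target selected by DISTANCE.TO."""
    if distance_to == "START":
        return db_start, db_start
    if distance_to == "END":
        return db_end, db_end
    if distance_to == "MID":
        mid = (db_start + db_end) // 2
        return mid, mid
    if distance_to == "REGION":
        return db_start, db_end
    raise ValueError(f"unknown DISTANCE.TO value: {distance_to!r}")


def distance(base_start, base_end, db_start, db_end, distance_to):
    """Unsigned gap between the base interval and the target; 0 if they overlap.

    Assumption: this is only one plausible definition of the distance.
    """
    low, high = target_bounds(db_start, db_end, distance_to)
    if high < base_start:  # target lies entirely left of the base interval
        return base_start - high
    if low > base_end:  # target lies entirely right of the base interval
        return low - base_end
    return 0


def within_max_distance(base, db, distance_to, max_distance):
    """True if the database interval would be kept under this reading."""
    return distance(base[0], base[1], db[0], db[1], distance_to) <= max_distance


print(distance(100, 200, 250, 300, "START"))
print(within_max_distance((100, 200), (150, 400), "REGION", 0))
```

Under this reading, MAX.DISTANCE = 0 combined with DISTANCE.TO = REGION keeps only database intervals that overlap the base interval.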

cat database.tsv
FILENAME        REGION.TYPE     SOURCE  ANNOTATION.BY   MAX.DISTANCE    DISTANCE.TO     N.HITS  NAME.COL
E045_15_coreMarks_dense_7_Enh.bed    Enhancer.Roadmap        E045    SOURCE  0       REGION  1 NA
E036_15_coreMarks_dense_7_Enh.bed    Enhancer.Roadmap        E036    SOURCE  0       REGION  1 NA
protein_coding_genes.bed gencode19.protein.coding.TSS    gencode.v19     NAME    200000  START   1 6
enhancer_promoter_links_neutrophils.bed      PCHiC.neutrophils       neutrophils     NAME    10000   REGION  1     7
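Because a missing or misspelled column identifier is easy to overlook, it can help to check the database file before handing it to geanno. The sketch below uses only the standard library. The helper read_database is hypothetical, and the checks on ANNOTATION.BY and DISTANCE.TO simply mirror the values listed in the field descriptions.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "FILENAME", "REGION.TYPE", "SOURCE", "ANNOTATION.BY",
    "MAX.DISTANCE", "DISTANCE.TO", "N.HITS", "NAME.COL",
]

# Two rows taken from the database.tsv example, joined with tabs.
DATABASE_TEXT = "\n".join([
    "\t".join(REQUIRED_COLUMNS),
    "\t".join(["E045_15_coreMarks_dense_7_Enh.bed", "Enhancer.Roadmap",
               "E045", "SOURCE", "0", "REGION", "1", "NA"]),
    "\t".join(["protein_coding_genes.bed", "gencode19.protein.coding.TSS",
               "gencode.v19", "NAME", "200000", "START", "1", "6"]),
]) + "\n"


def read_database(handle):
    """Read and validate a database table (hypothetical helper, not geanno API)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing column identifiers: {missing}")
    entries = []
    for row in reader:
        if row["ANNOTATION.BY"] not in {"SOURCE", "NAME"}:
            raise ValueError(f"bad ANNOTATION.BY: {row['ANNOTATION.BY']!r}")
        if row["DISTANCE.TO"] not in {"START", "END", "MID", "REGION"}:
            raise ValueError(f"bad DISTANCE.TO: {row['DISTANCE.TO']!r}")
        # Numeric fields are converted so that typos fail early.
        row["MAX.DISTANCE"] = int(row["MAX.DISTANCE"])
        row["N.HITS"] = int(row["N.HITS"])
        entries.append(row)
    return entries


entries = read_database(io.StringIO(DATABASE_TEXT))
print(entries[1]["REGION.TYPE"], entries[1]["MAX.DISTANCE"])
```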

Database bed-like files

head -n 3 E045_15_coreMarks_dense_7_Enh.bed
10      179800  180000  7_Enh   0       .       179800  180000  255,255,0
10      182800  183200  7_Enh   0       .       182800  183200  255,255,0
10      265600  266000  7_Enh   0       .       265600  266000  255,255,0
head -n 3 E036_15_coreMarks_dense_7_Enh.bed
10      132200  132400  7_Enh   0       .       132200  132400  255,255,0
10      133400  133600  7_Enh   0       .       133400  133600  255,255,0
10      152600  153000  7_Enh   0       .       152600  153000  255,255,0
head -n 3 protein_coding_genes.bed
#chrom  start   end     ensembl.id      score   strand  hugo.name
1       69091   70008   ENSG00000186092.4       NA      +       OR4F5
1       134901  139379  ENSG00000237683.5       NA      -       AL627309.1
1       367640  368634  ENSG00000235249.1       NA      +       OR4F29
head -n 3 enhancer_promoter_links_neutrophils.bed
#oeChr  oeStart oeEnd   oeName  baitChr baitStart       baitEnd baitName
1       1150970 1156235 .       1       850619  874081  AL645608.1;RP11-54O7.3;SAMD11
1       1000704 1005126 .       1       903641  927394  C1orf170;PLEKHN1
1       1150970 1156235 .       1       903641  927394  C1orf170;PLEKHN1
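The NAME.COL values in database.tsv refer to 0-based columns of these files: 6 selects hugo.name in protein_coding_genes.bed and 7 selects baitName in enhancer_promoter_links_neutrophils.bed. The following sketch shows that lookup, including the documented fallback to the 4th column when NAME.COL is NA. The helper read_names is hypothetical and assumes tab-separated input.

```python
import csv
import io

# First two rows of protein_coding_genes.bed from the example above.
GENES_TEXT = (
    "#chrom\tstart\tend\tensembl.id\tscore\tstrand\thugo.name\n"
    "1\t69091\t70008\tENSG00000186092.4\tNA\t+\tOR4F5\n"
    "1\t134901\t139379\tENSG00000237683.5\tNA\t-\tAL627309.1\n"
)


def read_names(handle, name_col="NA"):
    """Return the name of each interval, chosen by a 0-based NAME.COL value.

    Hypothetical helper; 'NA' falls back to the 4th column (index 3).
    """
    index = 3 if name_col == "NA" else int(name_col)
    names = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and '#'-prefixed header lines.
        if not fields or fields[0].startswith("#"):
            continue
        names.append(fields[index])
    return names


hugo_names = read_names(io.StringIO(GENES_TEXT), "6")
default_names = read_names(io.StringIO(GENES_TEXT), "NA")
print(hugo_names)
print(default_names)
```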

Code example

import geanno
import pandas as pnd

database_filename = "database.tsv"
base_filename = "base.bed"
results_filename = "results.bed"

# Create a new GenomicRegionAnnotator instance
gra = geanno.Annotator.GenomicRegionAnnotator()

# load base
gra.load_base_from_file(base_filename)

# load database
gra.load_database_from_file(database_filename)

# Annotate base against all database genomic region files
gra.annotate()

# Retrieve annotated base intervals as pandas.DataFrame instance
annotated_base_df = gra.get_base()

# Write annotated base intervals to disk
annotated_base_df.to_csv(results_filename, sep="\t", index=False)

Annotated results file

cat results.bed
#chrom  start   end     Enhancer.Roadmap        gencode19.protein.coding.TSS    PCHiC.neutrophils
4       128887787       128887839       E036(0) MFSD8(-637)     PGRMC2(0);PGRMC2(1001);PGRMC2(2881);PGRMC2(7442)
4       188862197       188862251       NA      ZFP42(54674)    NA
4       185746125       185746231       NA      ACSL1(-1740)    ACSL1(3864)
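Each annotation cell in the results above has the form LABEL(distance), with multiple hits separated by ";" and NA marking the absence of a hit. For downstream filtering, such cells can be split into (label, distance) pairs. The sketch below is inferred from the example output only. The helper parse_cell is hypothetical, and it assumes that labels themselves contain no ";".

```python
import re

# One hit: a label followed by a signed integer distance in parentheses.
ENTRY_PATTERN = re.compile(r"^(?P<label>.+)\((?P<distance>-?\d+)\)$")


def parse_cell(cell):
    """Split one annotation cell into a list of (label, distance) tuples."""
    if cell == "NA":
        return []
    hits = []
    for entry in cell.split(";"):
        match = ENTRY_PATTERN.match(entry)
        if match is None:
            raise ValueError(f"unexpected annotation entry: {entry!r}")
        hits.append((match.group("label"), int(match.group("distance"))))
    return hits


print(parse_cell("PGRMC2(0);PGRMC2(1001);PGRMC2(2881);PGRMC2(7442)"))
print(parse_cell("MFSD8(-637)"))
print(parse_cell("NA"))
```

Applied to the first results row, this yields four PGRMC2 hits for the PCHiC.neutrophils column and a single MFSD8 hit at distance -637 for the gene column.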