API

vcf.Reader

class vcf.Reader(fsock=None, filename=None, compressed=False, prepend_chr=False)[source]

Reader for a VCF v4.0 file; an iterator that returns _Record objects.

alts = None

ALT fields from header

fetch(chrom, start, end=None)[source]

Fetch records from a Tabix-indexed VCF (requires pysam). If start and end are both specified, return an iterator over the positions in that range; if end is not specified, return the individual _Call at start, or None.

filters = None

FILTER fields from header

formats = None

FORMAT fields from header

infos = None

INFO fields from header

metadata = None

metadata fields from header (each value is a string or a hash, depending on the field)

next()[source]

Return the next record in the file.
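Because a Reader is an iterator of _Record objects, it can be consumed with an ordinary for loop. The following is a minimal sketch rather than part of the library: count_by_chrom is a hypothetical helper, and it relies only on the CHROM property documented for _Record.

```python
from collections import Counter


def count_by_chrom(records):
    """Count records per chromosome.

    `records` is any iterable of objects exposing a CHROM attribute,
    for example a vcf.Reader. Hypothetical helper, not a library function.
    """
    counts = Counter()
    for record in records:
        counts[record.CHROM] += 1
    return counts
```

With the vcf module installed, the expected call would look like count_by_chrom(vcf.Reader(filename='example.vcf')), where example.vcf is a placeholder path.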

vcf.Writer

class vcf.Writer(stream, template, lineterminator='\r\n')[source]

VCF Writer

close()[source]

Close the writer

flush()[source]

Flush the writer

write_record(record)[source]

write a record to the file
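A common use of the writer is to copy a subset of records from a reader. The sketch below assumes only the write_record() and flush() methods listed above; write_matching and its keep predicate are hypothetical names introduced for illustration.

```python
def write_matching(records, writer, keep):
    """Write every record for which keep(record) is true.

    `writer` needs only write_record() and flush(), as documented for
    vcf.Writer. `keep` is a caller-supplied predicate (hypothetical).
    Returns the number of records written.
    """
    written = 0
    for record in records:
        if keep(record):
            writer.write_record(record)
            written += 1
    writer.flush()
    return written
```

The caller remains responsible for calling close() on the writer once it is finished with it.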

vcf.model._Record

class vcf.model._Record(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample_indexes, samples=None)[source]

A set of calls at a site. Equivalent to a row in a VCF file.

The standard VCF fields CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO and FORMAT are available as properties.

The list of genotype calls is in the samples property.

aaf

The allele frequency of the alternate allele. NOTE 1: The calculation is not attempted if there is more than one alternate allele. NOTE 2: The denominator is calculated from _called_ genotypes only.
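To make the denominator note concrete, here is a small sketch of an alternate allele frequency computed from called genotypes only. It assumes diploid samples, a single alternate allele, and the gt_type codes 0, 1, 2 and None; the function name and the None return for sites with no called samples are choices of this sketch, not documented library behavior.

```python
def alt_allele_frequency(gt_types):
    """Alternate allele frequency from per-sample genotype codes.

    Codes: 0 = hom_ref, 1 = het, 2 = hom_alt, None = uncalled.
    Assumes diploid samples and a single alternate allele.
    """
    called = [g for g in gt_types if g is not None]
    if not called:
        return None  # sketch-only choice: undefined without called samples
    alt_alleles = sum(called)  # a het adds 1 alt allele, a hom_alt adds 2
    return alt_alleles / (2 * len(called))
```

For the codes [0, 1, 2, None], three samples are called and carry three alternate alleles out of six, giving 0.5.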

alleles = None

list of alleles. [0] = REF, [1:] = ALTS

call_rate

The fraction of genotypes that were actually called.

end = None

1-based end coordinate

genotype(name)[source]

Lookup a _Call for the sample given in name

get_hets()[source]

The list of het genotypes

get_hom_alts()[source]

The list of hom alt genotypes

get_hom_refs()[source]

The list of hom ref genotypes

get_unknowns()[source]

The list of unknown genotypes
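The four get_* methods each return one group of calls. As an illustration of that grouping, the sketch below partitions calls by their gt_type code in a single pass; split_by_gt_type and the dictionary keys are hypothetical names, and the calls only need a gt_type attribute.

```python
def split_by_gt_type(calls):
    """Group calls into hom_ref / het / hom_alt / unknown lists.

    Each call needs a gt_type attribute holding 0, 1, 2 or None.
    """
    names = {0: "hom_refs", 1: "hets", 2: "hom_alts", None: "unknowns"}
    groups = {name: [] for name in names.values()}
    for call in calls:
        groups[names[call.gt_type]].append(call)
    return groups
```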

is_deletion

Return whether or not the INDEL is a deletion

is_indel

Return whether or not the variant is an INDEL

is_monomorphic

Return True for reference calls

is_snp

Return whether or not the variant is a SNP

is_sv

Return whether or not the variant is a structural variant

is_sv_precise

Return whether the SV coordinates are mapped to 1 bp resolution.

is_transition

Return whether or not the SNP is a transition
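A transition is a substitution between the two purines (A and G) or between the two pyrimidines (C and T); any other single-base change is a transversion. A minimal sketch of that test, using a hypothetical standalone function rather than the property itself:

```python
# Unordered base pairs that count as transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(ref, alt):
    """True if the single-base change ref -> alt is A<->G or C<->T."""
    return frozenset((ref.upper(), alt.upper())) in TRANSITIONS
```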

nucl_diversity

pi_hat (estimation of nucleotide diversity) for the site. This metric can be summed across multiple sites to compute regional nucleotide diversity estimates. For example, pi_hat for all variants in a given gene.

Derived from: John Gillespie, “Population Genetics: A Concise Guide”, 2nd ed., p. 45.
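As a worked illustration, the sketch below assumes pi_hat is the small-sample corrected heterozygosity n/(n-1) * 2pq, where p is the alternate allele frequency, q = 1 - p, and n is the number of sampled chromosomes (two per called diploid sample). That formula, the function name, and the None return are assumptions of this sketch; the library's own implementation is not shown in this reference.

```python
def pi_hat(aaf, num_called):
    """Per-site nucleotide diversity estimate (sketch).

    aaf        -- alternate allele frequency p
    num_called -- number of called diploid samples
    """
    n = 2.0 * num_called  # sampled chromosomes
    if n < 2:
        return None  # sketch-only choice: undefined without called samples
    p = aaf
    q = 1.0 - p
    return (n / (n - 1.0)) * (2.0 * p * q)


def regional_pi_hat(sites):
    """Sum per-site estimates over (aaf, num_called) pairs."""
    return sum(pi_hat(aaf, called) for aaf, called in sites)
```

Summing the per-site values, as regional_pi_hat does, corresponds to the regional estimate described above.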
num_called

The number of called samples

num_het

The number of heterozygous genotypes

num_hom_alt

The number of genotypes homozygous for the alt allele

num_hom_ref

The number of genotypes homozygous for the ref allele

num_unknown

The number of unknown genotypes

samples = None

list of _Calls for each sample ordered as in source VCF

start = None

0-based start coordinate
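The mix of a 0-based start and a 1-based end gives a half-open style interval. The sketch below assumes the interval spans the REF allele, so that start = POS - 1 and end = start + len(REF); this derivation and the function name are assumptions for illustration, not taken from this reference.

```python
def site_coordinates(pos, ref):
    """Return (start, end) for a site.

    pos -- the 1-based POS column
    ref -- the REF allele string
    start is 0-based; end is 1-based and inclusive.
    """
    start = pos - 1
    end = start + len(ref)
    return start, end
```

For POS 100 and REF "ACG", the REF bases occupy 1-based positions 100 to 102, so the result is (99, 102).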

sv_end

Return the end position for the SV

var_subtype

Return the subtype of variant.

- For SNPs and INDELs, yield one of: [ts, tv, ins, del]
- For SVs, yield either “complex” or the SV type defined in the ALT fields (removing the brackets). E.g.:

<DEL> -> DEL
<INS:ME:L1> -> INS:ME:L1
<DUP> -> DUP

The logic is meant to follow the rules outlined in the following paragraph at:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

“For precisely known variants, the REF and ALT fields should contain the full sequences for the alleles, following the usual VCF conventions. For imprecise variants, the REF field may contain a single base and the ALT fields should contain symbolic alleles (e.g. <ID>), described in more detail below. Imprecise variants should also be marked by the presence of an IMPRECISE flag in the INFO field.”
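The bracket removal used for symbolic ALT alleles can be sketched as follows. sv_subtype is a hypothetical helper, and returning None for a non-symbolic allele is a choice made for this sketch only.

```python
def sv_subtype(alt):
    """Return the SV type named by a symbolic ALT allele such as "<DEL>".

    Returns None when `alt` is not written in <...> form.
    """
    if alt.startswith("<") and alt.endswith(">"):
        return alt[1:-1]
    return None
```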

var_type

Return the type of variant, one of [snp, indel, unknown]. TODO: support SVs.

vcf.model._Call

class vcf.model._Call(site, sample, data)[source]

A genotype call, a cell entry in a VCF file

called

True if the GT is not ./.

data

Dictionary of data from the VCF file

gt_alleles

The numbers of the alleles called at a given sample

gt_bases

The actual genotype alleles. E.g. if VCF genotype is 0/1, return A/G
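The translation from allele numbers to bases is an index lookup into the alleles list ([0] = REF, [1:] = ALTs). A minimal sketch, assuming the GT separator is either "/" or "|" and that uncalled genotypes map to None (both are assumptions of this example):

```python
def genotype_bases(gt, alleles):
    """Translate a GT string such as "0/1" into bases such as "A/G".

    alleles -- list with REF at index 0 followed by the ALT alleles.
    """
    sep = "|" if "|" in gt else "/"
    indexes = gt.split(sep)
    if "." in indexes:
        return None  # uncalled allele, e.g. "./."
    return sep.join(str(alleles[int(i)]) for i in indexes)
```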

gt_type

The type of genotype:

- hom_ref = 0
- het = 1
- hom_alt = 2 (we don't track which ALT)
- uncalled = None
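The genotype codes can be derived from the GT string alone. The sketch below is one plausible way to do so: genotype_type is a hypothetical name, and treating any genotype with a missing allele as uncalled is an assumption.

```python
def genotype_type(gt):
    """Classify a GT string as 0 (hom_ref), 1 (het), 2 (hom_alt) or None."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # uncalled, e.g. "./."
    if len(set(alleles)) > 1:
        return 1     # het: the alleles differ
    if alleles[0] == "0":
        return 0     # hom_ref
    return 2         # hom_alt, whichever ALT it is
```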

is_het

Return True for heterozygous calls

is_variant

Return True if not a reference call

phased

A boolean indicating whether or not the genotype is phased for this sample

sample

The sample name

site

The _Record for this _Call

vcf.model._AltRecord

class vcf.model._AltRecord(type, **kwargs)[source]

An alternative allele record: either replacement string, SV placeholder, or breakend

type = None

String to describe the type of variant, by default “SNV” or “MNV”, but can be extended to any of the types described in the ALT lines of the header (e.g. “DUP”, “DEL”, “INS”...)

vcf.model._Substitution

class vcf.model._Substitution(nucleotides, **kwargs)[source]

A basic ALT record, where a REF sequence is replaced by an ALT sequence

sequence = None

Alternate sequence

vcf.model._SV

class vcf.model._SV(type, **kwargs)[source]

An SV placeholder

vcf.model._SingleBreakend

class vcf.model._SingleBreakend(orientation, connectingSequence, **kwargs)[source]

A single breakend

vcf.model._Breakend

class vcf.model._Breakend(chr, pos, orientation, remoteOrientation, connectingSequence, withinMainAssembly, **kwargs)[source]

A breakend which is paired to a remote location on or off the genome

chr = None

The chromosome of breakend’s mate.

connectingSequence = None

The breakpoint’s connecting sequence.

orientation = None

The orientation of breakend. If the sequence 3’ of the breakend is connected, True, else if the sequence 5’ of the breakend is connected, False.

pos = None

The coordinate of breakend’s mate.

remoteOrientation = None

The orientation of breakend’s mate. If the sequence 3’ of the breakend’s mate is connected, True, else if the sequence 5’ of the breakend’s mate is connected, False.

withinMainAssembly = None

If the breakend mate is within the assembly, True, else False if the breakend mate is on a contig in an ancillary assembly file.
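To show how these attributes relate to the text of an ALT field, here is a sketch that splits a paired-breakend string in VCF 4.1 notation (for example "G]17:198982]") into the six fields. BreakendFields and parse_paired_breakend are hypothetical stand-ins, and the mapping of bracket position to the two orientation flags is one reading of the attribute descriptions above, not the library's confirmed parsing logic.

```python
import re
from dataclasses import dataclass

_BRACKETS = re.compile(r"[\[\]]")


@dataclass
class BreakendFields:
    """Stand-in for the documented _Breakend attributes (hypothetical)."""
    chr: str
    pos: int
    orientation: bool
    remoteOrientation: bool
    connectingSequence: str
    withinMainAssembly: bool


def parse_paired_breakend(alt):
    """Split a paired-breakend ALT such as "G]17:198982]" into fields."""
    # Splitting on the two brackets yields: text before, mate, text after.
    before, mate, after = _BRACKETS.split(alt)
    chrom, pos = mate.rsplit(":", 1)
    within_main = not chrom.startswith("<")
    if not within_main:
        chrom = chrom[1:-1]  # "<ctg1>" -> "ctg1"
    # Assumed: a leading bracket means the sequence 3' of the breakend is kept.
    orientation = alt[0] in "[]"
    # Assumed: "[" means the sequence 3' of the mate is the part connected.
    remote_orientation = "[" in alt
    sequence = after if orientation else before
    return BreakendFields(chrom, int(pos), orientation,
                          remote_orientation, sequence, within_main)
```

A string that does not contain exactly two brackets makes the unpacking raise ValueError, which is acceptable for a sketch but would need explicit handling in real code.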