Filtering a VCF file based on some properties of interest is a common enough operation that PyVCF offers an extensible script. vcf_filter.py does the work of reading input, updating the metadata and filtering the records.
Filters sites with only low quality variants.
It is possible to have a high site quality with many low quality calls. This filter demands at least one call be above a threshold quality.
Filter sites that look like correlated sequencing errors.
Some sequencing technologies, notably pyrosequencing, produce mutation hotspots where there is a constant level of noise, producing some reference and some heterozygote calls.
This filter computes a Bayes Factor for each site by comparing the binomial likelihood of the observed allelic depths under:
The test value is the log of the bayes factor. Higher values are more likely to be errors.
Note: this filter requires rpy2
You can reuse this work by providing a filter class, rather than writing your own filter. For example, lets say I want to filter each site based on the quality of the site. I can create a class like this:
import vcf.filters
class SiteQuality(vcf.filters.Base):
'Filter sites by quality'
name = 'sq'
@classmethod
def customize_parser(self, parser):
parser.add_argument('--site-quality', type=int, default=30,
help='Filter sites below this quality')
def __init__(self, args):
self.threshold = args.site_quality
def __call__(self, record):
if record.QUAL < self.threshold:
return record.QUAL
This class subclasses vcf.filters.Base which provides the interface for VCF filters. The docstring and name are metadata about the parser. The docstring provides the help for the script, and the first line is included in the FILTER metadata when applied to a file.
The customize_parser method allows you to add arguments to the script. We use the __init__ method to grab the argument of interest from the parser. Finally, the __call__ method processes each record and returns a value if the filter failed. The base class uses the name and threshold to create the filter ID in the VCF file.
To make vcf_filter.py aware of the filter, you can either use the local script option or declare an entry point. To use a local script, simply call vcf_filter:
$ vcf_filter.py --local-script my_filters.py ...
To use an entry point, you need to declare a vcf.filters entry point in your setup:
setup(
...
entry_points = {
'vcf.filters': [
'site_quality = module.path:SiteQuality',
]
}
)
Either way, when you call vcf_filter.py, you should see your filter in the list of available filters:
usage: vcf_filter.py [-h] [--no-short-circuit] [--no-filtered]
[--output OUTPUT] [--local-script LOCAL_SCRIPT]
input filter [filter_args] [filter [filter_args]] ...
Filter a VCF file
positional arguments:
input File to process (use - for STDIN) (default: None)
optional arguments:
-h, --help Show this help message and exit. (default: False)
--no-short-circuit Do not stop filter processing on a site if any filter
is triggered (default: False)
--output OUTPUT Filename to output [STDOUT] (default: <open file
'<stdout>', mode 'w' at 0x1002841e0>)
--no-filtered Output only sites passing the filters (default: False)
--local-script LOCAL_SCRIPT
Python file in current working directory with the
filter classes (default: None)
sq:
Filter sites by quality
--site-quality SITE_QUALITY
Filter sites below this quality (default: 30)
Utilities for VCF files.
Trim a list of sequences by removing the longest common suffix while leaving all of them at least one character in length.
Standard convention with VCF is to place an indel at the left-most position, but some tools add additional context to the right of the sequences (e.g. samtools). These common suffixes are undesirable when comparing variants, for example in variant databases.
>>> trim_common_suffix('TATATATA', 'TATATA')
['TAT', 'T']
>>> trim_common_suffix('ACCCCC', 'ACCCCCCCC', 'ACCCCCCC', 'ACCCCCCCCC')
['A', 'ACCC', 'ACC', 'ACCCC']
This script converts a VCF file from wide format (many calls per row) to a long format (one call per row). This is useful if you want to grep per sample or for really quick import into, say, a spreadsheet:
$ vcf_melt < vcf/test/gatk.vcf
SAMPLE AD DP GQ GT PL FILTER CHROM POS REF ALT ID info.AC info.AF info.AN info.BaseQRankSum info.DB info.DP info.DS info.Dels info.FS info.HRun info.HaplotypeScore info.InbreedingCoeff info.MQ info.MQ0 info.MQRankSum info.QD info.ReadPosRankSum
BLANK 6,0 6 18.04 0/0 0,18,211 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA12878 138,107 250 99.0 0/1 1961,0,3049 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA12891 169,77 250 99.0 0/1 1038,0,3533 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA12892 249,0 250 99.0 0/0 0,600,5732 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA19238 248,1 250 99.0 0/0 0,627,6191 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA19239 250,0 250 99.0 0/0 0,615,5899 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA19240 250,0 250 99.0 0/0 0,579,5674 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
BLANK 13,4 17 62.64 0/1 63,0,296 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA12878 118,127 246 99.0 0/1 2396,0,1719 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA12891 241,0 244 99.0 0/0 0,459,4476 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA12892 161,85 246 99.0 0/1 1489,0,2353 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA19238 110,132 242 99.0 0/1 2561,0,1488 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA19239 106,135 242 99.0 0/1 2613,0,1389 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA19240 116,126 243 99.0 0/1 2489,0,1537 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731