All tools described here have an online help, as shown by typing the command name and then
This tool converts a multi-sample VCF file, read from stdin, to a simpler file which I call “freqSum”. The first lines of an example are:
#CHROM POS REF ALT IND1(2) IND2(2) IND1(2) IND2(2) 1 10539 C A 0 0 0 0 1 11008 C G 0 0 0 0 1 11012 C G 0 0 0 0 1 13110 G A 1 1 0 0 1 13116 T G 0 0 0 0
Here, the first line is a header line that describes the columns and individuals. The number in brackets behind each individual denotes the number of chromosomes sampled, which can be higher than 2 if individuals are grouped together (see groupFreqSum below).
-1 denotes missing data. The tool fails if you pass indels or multi-allelic SNPs. You should therefore combine this script with a filtering step, for example using bcftools (v1.2):
bcftools view -m2 -M2 -c1 -v snps -S <sample_list.txt> <vcf_file> | vcf2FreqSum > output.txt
And here is the help text output using
vcf2FreqSum version 1.2.7: a program to convert a multi-sample VCF, read from stdin, into a freqSum file Usage: vcf2FreqSum [-n|--names NAMES] Available options: -h,--help Show this help text -n,--names NAMES A comma-separated list of names to replace the VCF names with, if given.
This tools groups freqSum columns (read from stdin) into sample groups, giving the total allele count of all samples in that group. Here is the help file (type
groupFreqSum version 1.2.7: group columns of a freqSum file into groups, summing up the allele counts Usage: groupFreqSum ([-g|--group GROUP] | [-f|--groupFile GROUPFILE]) [-m|--missingThreshold MISSINGTHRESHOLD] Available options: -h,--help Show this help text -g,--group GROUP A group definition of the form groupname:sample1,sample2,..., where groupname is the name for the group, and sample1, sample2 denote columns of the incoming freqSum file. Those can also be already groups. This option can be given several times, once for each new group. Note that columns not listed in any group are not output, so if you want to keep individual columns as they are you need to define degenerate groups with just one column and the same name as before. Note that explicitly a sample can appear in multiple groups. -f,--groupFile GROUPFILE A file with two columns: First the sample or column name, and second the group name. This is a convenient name to input a complex setup with many columns and many groups. Note that a sample can occur in multiple groups. -m,--missingThreshold MISSINGTHRESHOLD Sets the level of missingness needed to declare a group as missing. If set to 0, it means that if any sample in a group has missing data at a site, declare that group at that site as missing. 1 means that only if all in a group have missing data, declare the group as missing. Default=0.0
The most convenient way to use this tool is by providing a group file, which has a simple layout of the form:
individual1 groupA individual2 groupA individual3 groupB individual4 groupB ...
The group definitions in the group file do not have to be ordered by group. This may be useful if you want to alphabetically list all individuals.
And an individual can actually appear in multiple groups, in which case the genotypes enter both groups. This can be useful in some contexts where you for example want to test both individuals as well as grouped rare allele sharing with some reference populations.
You can leave individuals ungrouped, by putting them into their own group. That way, groupFreqSum will essentially copy them over into the output without grouping them.
The output from
groupFreqSum will still be a freqSum file, but with groups instead of individuals given in the columns (and/or individuals if specified each within their own group). The advantage of this data format is that it is very general with respect to individuals vs. groups. In the first example above (in vcf2FreqSum), the output contained a single column for each individual, with allele counts not exceeding 2 (for homozygous non-ref genotypes). However, once you used groupFreqSum, you end up with columns describing allele counts at least partly in groups. The format is still the same. If a single individual in a group has missing data in the input, by default the entire group in the output will be declared missing. You can change this behaviour using
--missingThreshold as described in the help above.
This tools merges two freqSum files. It takes four arguments, as shown by typing
Usage: mergeFreqSum freqSumFile1 freqSumFile2 [--conditionOn POP] [--fillHomRef] mergeFreqSum version 1.2.7: a tool to merge two freqSumFiles into one. Available options: -h,--help Show this help text freqSumFile1 file 1, put - for stdin freqSumFile2 file 2 --conditionOn POP three options: 1 (keep only sites that are present in the first file); 2 (keep only sites that are present in the second file), Both (keep only sites that are present in both files). By default, no conditioning happens, so the union of both files is output, with missing sites flagged as missing or homRef according to the --fillHomRef option --fillHomRef treat sites that are missing in one file as hom-ref instead of missing
This tool merges two data sets, interleaving sites if necessary. If a site is present in one file but not the other, the sites in the latter data set are added as missing (by default), or as hom-ref if the
--fillHomRef flag is set.
Note that both freqSum files are expected to be ordered, first by chromosome and then by position.
downSampleFreqSum -h gives:
Usage: downSampleFreqSum <NAME> <N_AFTER> downSampleFreqSum version 1.2.7: A tool for downsampling a freqSum file. If a column is -1, the downsampled column will also have -1. Reads from stdin Available options: -h,--help Show this help text <NAME> the name of the population to sample from <N_AFTER> the new number of haplotypes to downsample to
This can be used to downsample the number of samples in a particular sample or group in the input file, read from stdin. The two arguments are the name of the column to downsample and the number of chromosomes to sample from that column. Downsampling is performed without replacement.
This is the key tool to convert a freqSumFile to an allele sharing histogram, as used by rarecoal. Type
-h for getting help:
Usage: freqSum2Histogram [-m|--maxM INT] (-N|--nrCalledSites ARG) [-r|--removeMissing] [-t|--removeTransitions] [-j|--jackknife] freqSum2Histogram version 1.2.7: a tool to convert a freqSum file, read from stdin, to to a histogram file as needed for rarecoal. Available options: -h,--help Show this help text -m,--maxM INT Specify the maximum allele count per population (default: 10) -N,--nrCalledSites ARG set the total nr of called sites. This sets the number of non-variant sites (via the pattern consisting of zeros only) such that the total number of sites matches the number given. This number is important for estimating population sizes correctly, see the README for instructions. -r,--removeMissing remove sites at which any selected column is missing (-1). By default, missing data is interpreted as reference alleles. -t,--removeTransitions remove transition SNPs -j,--jackknife run with weighted jackknife error estimate using chromosome-drop-out. Warning: This assumes that input is ordered by chromosome. It doesn't necessarily have to be sorted, but all sites from one chromosome need to be consecutive.
One key ingredient in this tool is the total number of sites, specified via
-N. This is an important input, as it will set the number of non-variant counts in your histogram, specified by the pattern consisting of zeros only, e.g.
0,0,0,0,0 in five populations. This number is important for estimating population sizes, which relies on knowing the ratio of variants and non-variants. If you are processing modern sequencing data (say from the 1000 Genomes Project), you can more or less assume that the entire mappable and callable genome is covered in all individuals. For humans, the size of the mappable autosomal genome is close to 2,500,000,000, but the details depend on your specific data set and processing.
selectInFreqSum -h gives:
Usage: selectInFreqSum (-n|--names NAME1,NAME2,...) selectInFreqSum version 1.2.7: a tool to select columns from a freqSum file, read from stdin Available options: -h,--help Show this help text -n,--names NAME1,NAME2,... comma-separated list of names to select
This tool can be used to select and/or re-order specific columns in the freqSum file, using the names passed to option
-n. This is useful before running freqSum2Histogram to select a number of populations from a large freqSum file containing dozens of populations.
samleHist -h gives:
Usage: sampleHist (-n|--name <NAME>) (-n|--howMany <INT>) (-i|--hist <path-to-histogram>) sampleHist version 1.2.7: a tool to sample a number of haplotypes (independently at each site) from a population in a histogram Available options: -h,--help Show this help text -n,--name <NAME> the population (by name) from which to sample -n,--howMany <INT> how many samples should be drawn at each site -i,--hist <path-to-histogram> the input histogram file, set - for stdin
This extracts samples from a subpopulation in the histogram, by sampling without replacement independently at every site underlying the histogram. The extracted samples therefore do not represent real individuals, but “average” individuals with genotypes sampled independently at every site from a full population. This can be useful if you need to extract individuals from histograms which were generated from data for which only allele frequencies but not genotypes are given.
Usage: combineHistograms histogram_file Tool to combine multiple histogram files, for help add option -h Available options: -h,--help Show this help text histogram_file histogram file, put as many as you want to add up
This simply adds up multiple histograms.
Usage: concatFreqSum FILES concatenates multiple freqSum files and checks that header lines are the same in all input files. Available options: -h,--help Show this help text FILES input file(s)
eigenstrat2FreqSum version 1.2.7: a program to convert eigenstrat files into a freqSum file Usage: eigenstrat2FreqSum (-g|--geno GENO) (-s|--snp SNP) (-i|--ind IND) Available options: -h,--help Show this help text -g,--geno GENO input eigenstrat-geno file -s,--snp SNP input eigenstrat-snp file -i,--ind IND input eigenstrat-ind file
filterFreqSum version 1.2.7: a program to filter freqSum files. Usage: filterFreqSum [-b|--bed BED] [-m|--missingness MISSINGNESS] [-s|--sampleMissingness SAMPLEMISSINGNESS] Available options: -h,--help Show this help text -b,--bed BED a bed file that contains the regions to be included -m,--missingness MISSINGNESS the maximum missingness allowed (0-1). Default=1 -s,--sampleMissingness SAMPLEMISSINGNESS an optional sample name to condition on having no missingness