Rarecoal Tools¶
Tip
All tools described here have an online help, as shown by typing the command name and then -h
.
vcf2FreqSum¶
This tool converts a multi-sample VCF file, read from stdin, to a simpler file which I call “freqSum”. The first lines of an example are:
#CHROM POS REF ALT IND1(2) IND2(2) IND1(2) IND2(2)
1 10539 C A 0 0 0 0
1 11008 C G 0 0 0 0
1 11012 C G 0 0 0 0
1 13110 G A 1 1 0 0
1 13116 T G 0 0 0 0
Here, the first line is a header line that describes the columns and individuals. The number in brackets behind each individual denotes the number of chromosomes sampled, which can be higher than 2 if individuals are grouped together (see groupFreqSum below).
A -1
denotes missing data. The tool fails if you pass indels or multi-allelic SNPs. You should therefore combine this script with a filtering step, for example using bcftools (v1.2):
bcftools view -m2 -M2 -c1 -v snps -S <sample_list.txt> <vcf_file> | vcf2FreqSum > output.txt
And here is the help text output using vcf2freqSum -h
:
vcf2FreqSum version 1.2.7: a program to convert a multi-sample VCF, read from stdin, into a freqSum file
Usage: vcf2FreqSum [-n|--names NAMES]
Available options:
-h,--help Show this help text
-n,--names NAMES A comma-separated list of names to replace the VCF
names with, if given.
groupFreqSum¶
This tools groups freqSum columns (read from stdin) into sample groups, giving the total allele count of all samples in that group. Here is the help file (type groupFreqSum -h
):
groupFreqSum version 1.2.7: group columns of a freqSum file into groups, summing up the allele counts
Usage: groupFreqSum ([-g|--group GROUP] | [-f|--groupFile GROUPFILE])
[-m|--missingThreshold MISSINGTHRESHOLD]
Available options:
-h,--help Show this help text
-g,--group GROUP A group definition of the form
groupname:sample1,sample2,..., where groupname is the
name for the group, and sample1, sample2 denote
columns of the incoming freqSum file. Those can also
be already groups. This option can be given several
times, once for each new group. Note that columns not
listed in any group are not output, so if you want to
keep individual columns as they are you need to
define degenerate groups with just one column and the
same name as before. Note that explicitly a sample
can appear in multiple groups.
-f,--groupFile GROUPFILE A file with two columns: First the sample or column
name, and second the group name. This is a convenient
name to input a complex setup with many columns and
many groups. Note that a sample can occur in multiple
groups.
-m,--missingThreshold MISSINGTHRESHOLD
Sets the level of missingness needed to declare a
group as missing. If set to 0, it means that if any
sample in a group has missing data at a site, declare
that group at that site as missing. 1 means that only
if all in a group have missing data, declare the
group as missing. Default=0.0
The most convenient way to use this tool is by providing a group file, which has a simple layout of the form:
individual1 groupA
individual2 groupA
individual3 groupB
individual4 groupB
...
Tip
The group definitions in the group file do not have to be ordered by group. This may be useful if you want to alphabetically list all individuals.
Tip
And an individual can actually appear in multiple groups, in which case the genotypes enter both groups. This can be useful in some contexts where you for example want to test both individuals as well as grouped rare allele sharing with some reference populations.
Tip
You can leave individuals ungrouped, by putting them into their own group. That way, groupFreqSum will essentially copy them over into the output without grouping them.
The output from groupFreqSum
will still be a freqSum file, but with groups instead of individuals given in the columns (and/or individuals if specified each within their own group). The advantage of this data format is that it is very general with respect to individuals vs. groups. In the first example above (in vcf2FreqSum), the output contained a single column for each individual, with allele counts not exceeding 2 (for homozygous non-ref genotypes). However, once you used groupFreqSum, you end up with columns describing allele counts at least partly in groups. The format is still the same. If a single individual in a group has missing data in the input, by default the entire group in the output will be declared missing. You can change this behaviour using --missingThreshold
as described in the help above.
mergeFreqSum¶
This tools merges two freqSum files. It takes four arguments, as shown by typing mergeFreqSum -h
:
Usage: mergeFreqSum freqSumFile1 freqSumFile2 [--conditionOn POP] [--fillHomRef]
mergeFreqSum version 1.2.7: a tool to merge two freqSumFiles into one.
Available options:
-h,--help Show this help text
freqSumFile1 file 1, put - for stdin
freqSumFile2 file 2
--conditionOn POP three options: 1 (keep only sites that are present in
the first file); 2 (keep only sites that are present
in the second file), Both (keep only sites that are
present in both files). By default, no conditioning
happens, so the union of both files is output, with
missing sites flagged as missing or homRef according
to the --fillHomRef option
--fillHomRef treat sites that are missing in one file as hom-ref
instead of missing
This tool merges two data sets, interleaving sites if necessary. If a site is present in one file but not the other, the sites in the latter data set are added as missing (by default), or as hom-ref if the --fillHomRef
flag is set.
Note
Note that both freqSum files are expected to be ordered, first by chromosome and then by position.
downSampleFreqSum¶
Typing downSampleFreqSum -h
gives:
Usage: downSampleFreqSum <NAME> <N_AFTER>
downSampleFreqSum version 1.2.7: A tool for downsampling a freqSum file. If a
column is -1, the downsampled column will also have -1. Reads from stdin
Available options:
-h,--help Show this help text
<NAME> the name of the population to sample from
<N_AFTER> the new number of haplotypes to downsample to
This can be used to downsample the number of samples in a particular sample or group in the input file, read from stdin. The two arguments are the name of the column to downsample and the number of chromosomes to sample from that column. Downsampling is performed without replacement.
freqSum2Histogram¶
This is the key tool to convert a freqSumFile to an allele sharing histogram, as used by rarecoal. Type -h
for getting help:
Usage: freqSum2Histogram [-m|--maxM INT] (-N|--nrCalledSites ARG)
[-r|--removeMissing] [-t|--removeTransitions]
[-j|--jackknife]
freqSum2Histogram version 1.2.7: a tool to convert a freqSum file, read from
stdin, to to a histogram file as needed for rarecoal.
Available options:
-h,--help Show this help text
-m,--maxM INT Specify the maximum allele count per
population (default: 10)
-N,--nrCalledSites ARG set the total nr of called sites. This sets the
number of non-variant sites (via the pattern
consisting of zeros only) such that the total number
of sites matches the number given. This number is
important for estimating population sizes correctly,
see the README for instructions.
-r,--removeMissing remove sites at which any selected column is missing
(-1). By default, missing data is interpreted as
reference alleles.
-t,--removeTransitions remove transition SNPs
-j,--jackknife run with weighted jackknife error estimate using
chromosome-drop-out. Warning: This assumes that input
is ordered by chromosome. It doesn't necessarily have
to be sorted, but all sites from one chromosome need
to be consecutive.
One key ingredient in this tool is the total number of sites, specified via -N
. This is an important input, as it will set the number of non-variant counts in your histogram, specified by the pattern consisting of zeros only, e.g. 0,0,0,0,0
in five populations. This number is important for estimating population sizes, which relies on knowing the ratio of variants and non-variants. If you are processing modern sequencing data (say from the 1000 Genomes Project), you can more or less assume that the entire mappable and callable genome is covered in all individuals. For humans, the size of the mappable autosomal genome is close to 2,500,000,000, but the details depend on your specific data set and processing.
selectInFreqSum¶
Typing selectInFreqSum -h
gives:
Usage: selectInFreqSum (-n|--names NAME1,NAME2,...)
selectInFreqSum version 1.2.7: a tool to select columns from a freqSum file,
read from stdin
Available options:
-h,--help Show this help text
-n,--names NAME1,NAME2,...
comma-separated list of names to select
This tool can be used to select and/or re-order specific columns in the freqSum file, using the names passed to option -n
. This is useful before running freqSum2Histogram to select a number of populations from a large freqSum file containing dozens of populations.
sampleHist¶
Typing samleHist -h
gives:
Usage: sampleHist (-n|--name <NAME>) (-n|--howMany <INT>)
(-i|--hist <path-to-histogram>)
sampleHist version 1.2.7: a tool to sample a number of haplotypes
(independently at each site) from a population in a histogram
Available options:
-h,--help Show this help text
-n,--name <NAME> the population (by name) from which to sample
-n,--howMany <INT> how many samples should be drawn at each site
-i,--hist <path-to-histogram>
the input histogram file, set - for stdin
This extracts samples from a subpopulation in the histogram, by sampling without replacement independently at every site underlying the histogram. The extracted samples therefore do not represent real individuals, but “average” individuals with genotypes sampled independently at every site from a full population. This can be useful if you need to extract individuals from histograms which were generated from data for which only allele frequencies but not genotypes are given.
combineHistograms¶
Usage: combineHistograms histogram_file
Tool to combine multiple histogram files, for help add option -h
Available options:
-h,--help Show this help text
histogram_file histogram file, put as many as you want to add up
This simply adds up multiple histograms.
concatFreqSum¶
Usage: concatFreqSum FILES
concatenates multiple freqSum files and checks that header lines are the same in all input files.
Available options:
-h,--help Show this help text
FILES input file(s)
eigenstrat2FreqSum¶
eigenstrat2FreqSum version 1.2.7: a program to convert eigenstrat files into a freqSum file
Usage: eigenstrat2FreqSum (-g|--geno GENO) (-s|--snp SNP) (-i|--ind IND)
Available options:
-h,--help Show this help text
-g,--geno GENO input eigenstrat-geno file
-s,--snp SNP input eigenstrat-snp file
-i,--ind IND input eigenstrat-ind file
filterFreqSum¶
filterFreqSum version 1.2.7: a program to filter freqSum files.
Usage: filterFreqSum [-b|--bed BED] [-m|--missingness MISSINGNESS]
[-s|--sampleMissingness SAMPLEMISSINGNESS]
Available options:
-h,--help Show this help text
-b,--bed BED a bed file that contains the regions to be included
-m,--missingness MISSINGNESS
the maximum missingness allowed (0-1). Default=1
-s,--sampleMissingness SAMPLEMISSINGNESS
an optional sample name to condition on having no
missingness