Rarecoal Tools

Tip

All tools described here have an online help, as shown by typing the command name and then -h.

vcf2FreqSum

This tool converts a multi-sample VCF file, read from stdin, to a simpler file which I call “freqSum”. The first lines of an example are:

#CHROM      POS     REF     ALT     IND1(2) IND2(2) IND1(2) IND2(2)
1   10539   C       A       0       0       0       0
1   11008   C       G       0       0       0       0
1   11012   C       G       0       0       0       0
1   13110   G       A       1       1       0       0
1   13116   T       G       0       0       0       0

Here, the first line is a header line that describes the columns and individuals. The number in brackets behind each individual denotes the number of chromosomes sampled, which can be higher than 2 if individuals are grouped together (see groupFreqSum below). A -1 denotes missing data. The tool fails if you pass indels or multi-allelic SNPs. You should therefore combine this script with a filtering step, for example using bcftools (v1.2):

bcftools view -m2 -M2 -c1 -v snps -S <sample_list.txt> <vcf_file> | vcf2FreqSum > output.txt

And here is the help text output using vcf2freqSum -h:

vcf2FreqSum version 1.2.7: a program to convert a multi-sample VCF, read from stdin, into a freqSum file

Usage: vcf2FreqSum [-n|--names NAMES]

Available options:
  -h,--help                Show this help text
  -n,--names NAMES         A comma-separated list of names to replace the VCF
                           names with, if given.

groupFreqSum

This tools groups freqSum columns (read from stdin) into sample groups, giving the total allele count of all samples in that group. Here is the help file (type groupFreqSum -h):

groupFreqSum version 1.2.7: group columns of a freqSum file into groups, summing up the allele counts

Usage: groupFreqSum ([-g|--group GROUP] | [-f|--groupFile GROUPFILE])
                    [-m|--missingThreshold MISSINGTHRESHOLD]

Available options:
  -h,--help                Show this help text
  -g,--group GROUP         A group definition of the form
                           groupname:sample1,sample2,..., where groupname is the
                           name for the group, and sample1, sample2 denote
                           columns of the incoming freqSum file. Those can also
                           be already groups. This option can be given several
                           times, once for each new group. Note that columns not
                           listed in any group are not output, so if you want to
                           keep individual columns as they are you need to
                           define degenerate groups with just one column and the
                           same name as before. Note that explicitly a sample
                           can appear in multiple groups.
  -f,--groupFile GROUPFILE A file with two columns: First the sample or column
                           name, and second the group name. This is a convenient
                           name to input a complex setup with many columns and
                           many groups. Note that a sample can occur in multiple
                           groups.
  -m,--missingThreshold MISSINGTHRESHOLD
                           Sets the level of missingness needed to declare a
                           group as missing. If set to 0, it means that if any
                           sample in a group has missing data at a site, declare
                           that group at that site as missing. 1 means that only
                           if all in a group have missing data, declare the
                           group as missing. Default=0.0

The most convenient way to use this tool is by providing a group file, which has a simple layout of the form:

individual1 groupA
individual2 groupA
individual3 groupB
individual4 groupB
...

Tip

The group definitions in the group file do not have to be ordered by group. This may be useful if you want to alphabetically list all individuals.

Tip

And an individual can actually appear in multiple groups, in which case the genotypes enter both groups. This can be useful in some contexts where you for example want to test both individuals as well as grouped rare allele sharing with some reference populations.

Tip

You can leave individuals ungrouped, by putting them into their own group. That way, groupFreqSum will essentially copy them over into the output without grouping them.

The output from groupFreqSum will still be a freqSum file, but with groups instead of individuals given in the columns (and/or individuals if specified each within their own group). The advantage of this data format is that it is very general with respect to individuals vs. groups. In the first example above (in vcf2FreqSum), the output contained a single column for each individual, with allele counts not exceeding 2 (for homozygous non-ref genotypes). However, once you used groupFreqSum, you end up with columns describing allele counts at least partly in groups. The format is still the same. If a single individual in a group has missing data in the input, by default the entire group in the output will be declared missing. You can change this behaviour using --missingThreshold as described in the help above.

mergeFreqSum

This tools merges two freqSum files. It takes four arguments, as shown by typing mergeFreqSum -h:

Usage: mergeFreqSum freqSumFile1 freqSumFile2 [--conditionOn POP] [--fillHomRef]
  mergeFreqSum version 1.2.7: a tool to merge two freqSumFiles into one.

Available options:
  -h,--help                Show this help text
  freqSumFile1             file 1, put - for stdin
  freqSumFile2             file 2
  --conditionOn POP        three options: 1 (keep only sites that are present in
                           the first file); 2 (keep only sites that are present
                           in the second file), Both (keep only sites that are
                           present in both files). By default, no conditioning
                           happens, so the union of both files is output, with
                           missing sites flagged as missing or homRef according
                           to the --fillHomRef option
  --fillHomRef             treat sites that are missing in one file as hom-ref
                           instead of missing

This tool merges two data sets, interleaving sites if necessary. If a site is present in one file but not the other, the sites in the latter data set are added as missing (by default), or as hom-ref if the --fillHomRef flag is set.

Note

Note that both freqSum files are expected to be ordered, first by chromosome and then by position.

downSampleFreqSum

Typing downSampleFreqSum -h gives:

Usage: downSampleFreqSum <NAME> <N_AFTER>
  downSampleFreqSum version 1.2.7: A tool for downsampling a freqSum file. If a
  column is -1, the downsampled column will also have -1. Reads from stdin

Available options:
  -h,--help                Show this help text
  <NAME>                   the name of the population to sample from
  <N_AFTER>                the new number of haplotypes to downsample to

This can be used to downsample the number of samples in a particular sample or group in the input file, read from stdin. The two arguments are the name of the column to downsample and the number of chromosomes to sample from that column. Downsampling is performed without replacement.

freqSum2Histogram

This is the key tool to convert a freqSumFile to an allele sharing histogram, as used by rarecoal. Type -h for getting help:

Usage: freqSum2Histogram [-m|--maxM INT] (-N|--nrCalledSites ARG)
                         [-r|--removeMissing] [-t|--removeTransitions]
                         [-j|--jackknife]
  freqSum2Histogram version 1.2.7: a tool to convert a freqSum file, read from
  stdin, to to a histogram file as needed for rarecoal.

Available options:
  -h,--help                Show this help text
  -m,--maxM INT            Specify the maximum allele count per
                           population (default: 10)
  -N,--nrCalledSites ARG   set the total nr of called sites. This sets the
                           number of non-variant sites (via the pattern
                           consisting of zeros only) such that the total number
                           of sites matches the number given. This number is
                           important for estimating population sizes correctly,
                           see the README for instructions.
  -r,--removeMissing       remove sites at which any selected column is missing
                           (-1). By default, missing data is interpreted as
                           reference alleles.
  -t,--removeTransitions   remove transition SNPs
  -j,--jackknife           run with weighted jackknife error estimate using
                           chromosome-drop-out. Warning: This assumes that input
                           is ordered by chromosome. It doesn't necessarily have
                           to be sorted, but all sites from one chromosome need
                           to be consecutive.

One key ingredient in this tool is the total number of sites, specified via -N. This is an important input, as it will set the number of non-variant counts in your histogram, specified by the pattern consisting of zeros only, e.g. 0,0,0,0,0 in five populations. This number is important for estimating population sizes, which relies on knowing the ratio of variants and non-variants. If you are processing modern sequencing data (say from the 1000 Genomes Project), you can more or less assume that the entire mappable and callable genome is covered in all individuals. For humans, the size of the mappable autosomal genome is close to 2,500,000,000, but the details depend on your specific data set and processing.

selectInFreqSum

Typing selectInFreqSum -h gives:

Usage: selectInFreqSum (-n|--names NAME1,NAME2,...)
  selectInFreqSum version 1.2.7: a tool to select columns from a freqSum file,
  read from stdin

Available options:
  -h,--help                Show this help text
  -n,--names NAME1,NAME2,...
                           comma-separated list of names to select

This tool can be used to select and/or re-order specific columns in the freqSum file, using the names passed to option -n. This is useful before running freqSum2Histogram to select a number of populations from a large freqSum file containing dozens of populations.

sampleHist

Typing samleHist -h gives:

Usage: sampleHist (-n|--name <NAME>) (-n|--howMany <INT>)
                  (-i|--hist <path-to-histogram>)
  sampleHist version 1.2.7: a tool to sample a number of haplotypes
  (independently at each site) from a population in a histogram

Available options:
  -h,--help                Show this help text
  -n,--name <NAME>         the population (by name) from which to sample
  -n,--howMany <INT>       how many samples should be drawn at each site
  -i,--hist <path-to-histogram>
                           the input histogram file, set - for stdin

This extracts samples from a subpopulation in the histogram, by sampling without replacement independently at every site underlying the histogram. The extracted samples therefore do not represent real individuals, but “average” individuals with genotypes sampled independently at every site from a full population. This can be useful if you need to extract individuals from histograms which were generated from data for which only allele frequencies but not genotypes are given.

combineHistograms

Usage: combineHistograms histogram_file
  Tool to combine multiple histogram files, for help add option -h

Available options:
  -h,--help                Show this help text
  histogram_file           histogram file, put as many as you want to add up

This simply adds up multiple histograms.

concatFreqSum

Usage: concatFreqSum FILES
  concatenates multiple freqSum files and checks that header lines are the same in all input files.

Available options:
  -h,--help                Show this help text
  FILES                     input file(s)

eigenstrat2FreqSum

eigenstrat2FreqSum version 1.2.7: a program to convert eigenstrat files into a freqSum file

Usage: eigenstrat2FreqSum (-g|--geno GENO) (-s|--snp SNP) (-i|--ind IND)

Available options:
  -h,--help                Show this help text
  -g,--geno GENO           input eigenstrat-geno file
  -s,--snp SNP             input eigenstrat-snp file
  -i,--ind IND             input eigenstrat-ind file

filterFreqSum

filterFreqSum version 1.2.7: a program to filter freqSum files.

Usage: filterFreqSum [-b|--bed BED] [-m|--missingness MISSINGNESS]
                     [-s|--sampleMissingness SAMPLEMISSINGNESS]

Available options:
  -h,--help                Show this help text
  -b,--bed BED             a bed file that contains the regions to be included
  -m,--missingness MISSINGNESS
                           the maximum missingness allowed (0-1). Default=1
  -s,--sampleMissingness SAMPLEMISSINGNESS
                           an optional sample name to condition on having no
                           missingness