swarm

A robust and fast clustering method for amplicon-based studies.

The purpose of swarm is to provide a novel clustering algorithm to handle large sets of amplicons. The results of traditional clustering algorithms are strongly input-order dependent and rely on an arbitrary global clustering threshold. swarm results are resilient to input-order changes and rely on a small local linking threshold d, the maximum number of differences between two amplicons. swarm forms stable high-resolution clusters, with a high yield of biological information.
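To make the idea of a local linking threshold concrete, here is a minimal Python sketch. It is not swarm's implementation: it assumes that a "difference" is a substitution, an insertion or a deletion, it compares all pairs naively, and the function names differences and grow_swarm are hypothetical. It only illustrates how amplicons are linked step by step when each link spans at most d differences.

```python
def differences(a, b):
    """Edit distance (substitutions, insertions, deletions) between two sequences."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # (mis)match
        previous = current
    return previous[-1]


def grow_swarm(seed, pool, d=1):
    """Link to the swarm every amplicon within d differences of a current member."""
    swarm, frontier, remaining = [seed], [seed], set(pool) - {seed}
    while frontier:
        # Amplicons reachable in one step from the most recently added members.
        linked = sorted(s for s in remaining
                        if any(differences(s, member) <= d for member in frontier))
        swarm.extend(linked)
        remaining -= set(linked)
        frontier = linked
    return swarm


amplicons = ["ACGT", "ACGA", "ACTA", "TTTT"]
print(grow_swarm("ACGT", amplicons, d=1))  # → ['ACGT', 'ACGA', 'ACTA']
```

In this toy example, ACTA differs from the seed ACGT by two positions, yet it joins the swarm because it is one difference away from ACGA. TTTT is never within one difference of any member and stays out.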

Quick start

The simplest way to use swarm (with default parameters; use -h to get help, or see the user manual for details) is:

./swarm amplicons.fasta

Install

Get the source code and a swarm binary from GitHub using the ZIP button or git:

git clone https://github.com/torognes/swarm.git
cd swarm/

Use the command make to compile swarm from scratch.

If you have administrator privileges, you can make swarm accessible for all users. Simply copy the binary to /usr/bin/. The man page can be installed this way:

gzip -c swarm.1 > swarm.1.gz
mv swarm.1.gz /usr/share/man/man1/

Once installed, the man page can be accessed with the command man swarm.

Prepare amplicon fasta files

To facilitate the use of swarm, we provide examples of shell commands that can be used to format and check the input fasta file (warning: these commands may not be suitable for very large files). The amplicon clipping step (adaptor and primer removal) and the filtering step are not discussed here.

Linearization

Swarm accepts wrapped fasta files as well as linear fasta files. However, linear fasta files where amplicons are written on two lines (one line for the fasta header, one line for the sequence) are much easier to manipulate. For instance, many post-clustering queries can be easily done with grep when fasta files are linear. You can use the following code to linearize your fasta files. Code tested with GNU Awk 4.0.1.

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' amplicons.fasta > amplicons_linearized.fasta
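If GNU Awk is not available, the same linearization can be sketched in Python. This is a minimal illustration, not part of swarm: the function name linearize is hypothetical, and the sketch assumes the input is given as an iterable of text lines.

```python
def linearize(lines):
    """Yield (header, sequence) pairs from wrapped or linear fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line, []
        else:
            chunks.append(line)  # accumulate the wrapped sequence
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


wrapped = [">a_3", "ACGT", "ACGT", ">b_1", "TTTT"]
for header, sequence in linearize(wrapped):
    print(header)
    print(sequence)
```

To process a file, pass an open file object instead of the list, for example linearize(open("amplicons.fasta")).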

Dereplication

To speed up the clustering process, strictly identical amplicons should be merged. This step is not mandatory, but it is an important time saver, especially for highly redundant high-throughput sequencing results.

grep -v "^>" amplicons_linearized.fasta | \
grep -v "[^ACGTacgt]" | sort -d | uniq -c | \
while read abundance sequence ; do
    hash=$(printf "${sequence}" | sha1sum)
    hash=${hash:0:40}
    printf ">%s_%d_%s\n" "${hash}" "${abundance}" "${sequence}"
done | sort -t "_" -k2,2nr -k1.2,1d | \
sed -e 's/\_/\n/2' > amplicons_linearized_dereplicated.fasta

Amplicons containing characters other than "ACGT" are discarded. The dereplicated amplicons receive a meaningful unique name (hash codes), and are sorted by decreasing number of copies and by hash codes (to guarantee a stable sorting). The use of a hashing function also provides an easy way to compare sets of amplicons. If two amplicons from two different sets have the same hash code, it means that the sequences they represent are identical.
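The same dereplication logic can be sketched in Python. This is an illustrative equivalent of the shell pipeline, not a tool shipped with swarm: the function name dereplicate is hypothetical, and the sketch assumes the sequences are already linearized and held in memory.

```python
import hashlib
import re
from collections import Counter


def dereplicate(sequences):
    """Return (name, sequence) pairs named <sha1>_<abundance>, most abundant first."""
    valid = re.compile(r"[ACGTacgt]+")
    # Discard sequences with characters other than ACGT, then count identical ones.
    counts = Counter(s for s in sequences if valid.fullmatch(s))
    records = [(hashlib.sha1(seq.encode("ascii")).hexdigest(), n, seq)
               for seq, n in counts.items()]
    # Decreasing abundance first, then hash code to guarantee a stable order.
    records.sort(key=lambda r: (-r[1], r[0]))
    return [(f"{h}_{n}", seq) for h, n, seq in records]


for name, sequence in dereplicate(["ACGT", "ACGT", "ACNT", "TTGA"]):
    print(f">{name}")
    print(sequence)
```

Because the name is derived from the sequence alone, two amplicons from different data sets get the same hash code exactly when their sequences are identical.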

Launch swarm

To partition your dataset with the finest resolution (a local number of differences d = 1, with post-processing to eliminate potential chained OTUs) on a quad-core CPU, run:

./swarm -d 1 -t 4 amplicons.fasta > amplicons.swarms
python ../scripts/swarm_breaker.py -f amplicons.fasta \
    -s amplicons.swarms 2> amplicons.log > amplicons.swarms2

See the user manual (man page and PDF) for details on swarm's options and parameters.

Parse swarm results

To facilitate the use of swarm, we provide examples of shell commands that can be used to parse swarm's output. We assume that the amplicon fasta file was prepared as described above (linearization and dereplication).

Refine swarm OTUs

Chains of amplicons can form when using short sequences, a slowly-evolving marker or a high d value. Using amplicon abundance information, these chains can be easily identified and broken to improve swarm's precision. Until we include that functionality directly in swarm, we provide a simple-to-use Python script and recommend applying it to all swarm results. The script, tested with Python 2.7, is located in the scripts folder and can be used as follows:

python swarm_breaker.py --help
python swarm_breaker.py -f amplicons.fasta -s amplicons.swarms 2> amplicons.log > amplicons.swarms2

The script produces refined OTUs and writes them to the standard output. It also writes debugging information to the standard error, which can be redirected to a log file or to /dev/null. As of today, swarm_breaker.py is fast enough to deal with 454 data sets, but might be too slow for large Illumina data sets. We are currently testing a much faster algorithm, and plan to release it as soon as possible.

Count the number of amplicons per OTU

You might want to check the size distribution of OTUs (number of amplicons in each OTU), and count the number of singletons (OTUs containing only one amplicon).

awk '{print NF}' amplicons.swarms | sort -n | uniq -c
awk 'NF == 1 {sum+=1} END {print sum}' amplicons.swarms

The number of amplicons in each OTU and several other metrics are available in the statistics file produced by swarm when using the -s option.
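For readers who prefer Python, the two counts above can be sketched as follows. The sketch assumes swarm's native output format as used throughout this section (one OTU per line, amplicon names separated by spaces); the function name otu_size_distribution is hypothetical.

```python
from collections import Counter


def otu_size_distribution(swarm_lines):
    """Map OTU size (number of amplicons on a line) to the number of OTUs of that size."""
    return Counter(len(line.split()) for line in swarm_lines if line.strip())


swarms = ["a_10 b_2 c_1", "d_5", "e_1", "f_3 g_1"]
distribution = otu_size_distribution(swarms)
for size in sorted(distribution):
    print(distribution[size], size)           # number of OTUs, OTU size
print("singletons:", distribution[1])         # → singletons: 2
```

A Counter returns 0 for sizes that never occur, so the singleton count is still a number when no OTU contains a single amplicon.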

Get the seed sequence for each OTU

Subsequent analyses often keep only one representative amplicon per OTU (usually the seed) to reduce the computational burden. That operation is easily done with swarm results.

SEEDS=$(mktemp)
cut -d " " -f 1 amplicons.swarms | sed -e 's/^/>/' > "${SEEDS}"
grep -A 1 -F -f "${SEEDS}" amplicons.fasta | sed -e '/^--$/d' > amplicons_seeds.fasta
rm "${SEEDS}"
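The seed extraction can also be sketched in Python, using exact name lookups instead of fixed-string pattern matching. This is only an illustration under stated assumptions: the fasta file is linearized (one header line, one sequence line), the first name on each swarm line is the seed, and the function names read_linear_fasta and seed_records are hypothetical.

```python
def read_linear_fasta(lines):
    """Build a {name: sequence} dict from linearized fasta lines (header, then sequence)."""
    lines = [line.strip() for line in lines if line.strip()]
    return {header[1:]: seq for header, seq in zip(lines[0::2], lines[1::2])}


def seed_records(swarm_lines, sequences):
    """Return (name, sequence) for the first amplicon (the seed) of each OTU line."""
    seeds = (line.split()[0] for line in swarm_lines if line.strip())
    return [(name, sequences[name]) for name in seeds]


fasta = [">a_10", "ACGT", ">b_2", "ACGA", ">c_5", "TTTT"]
swarms = ["a_10 b_2", "c_5"]
for name, sequence in seed_records(swarms, read_linear_fasta(fasta)):
    print(f">{name}")
    print(sequence)
```

Unlike the grep-based command, which keeps the order of the fasta file, this sketch returns the seeds in the order of the swarm file.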

Get fasta sequences for all amplicons in an OTU

For each OTU, get the fasta sequences for all amplicons. Warning: this loop can generate a very large number of files. To limit the number of files, a test can be added to exclude swarms with fewer than n elements.

INPUT_SWARM="amplicons.swarms"
INPUT_FASTA="amplicons.fasta"
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
mkdir "${OUTPUT_FOLDER}"
while read swarm ; do
    tr " " "\n" <<< "${swarm}" | sed -e 's/^/>/' > "${AMPLICONS}"
    seed=$(head -n 1 "${AMPLICONS}")
    grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${seed/>/}.fasta"
done < "${INPUT_SWARM}"
rm "${AMPLICONS}"
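The size test mentioned above (excluding swarms with fewer than n elements) can be sketched in Python as follows. The function name swarms_to_fasta and its min_size parameter are hypothetical, and the sketch assumes the sequences are already available as a {name: sequence} dictionary.

```python
def swarms_to_fasta(swarm_lines, sequences, min_size=1):
    """Map each seed name to the fasta text of its OTU, skipping small OTUs."""
    result = {}
    for line in swarm_lines:
        names = line.split()
        if not names or len(names) < min_size:
            continue  # exclude empty lines and swarms with fewer than min_size amplicons
        result[names[0]] = "".join(f">{n}\n{sequences[n]}\n" for n in names)
    return result


sequences = {"a_10": "ACGT", "b_2": "ACGA", "c_5": "TTTT"}
swarms = ["a_10 b_2", "c_5"]
for seed, fasta_text in swarms_to_fasta(swarms, sequences, min_size=2).items():
    print(seed)
    print(fasta_text, end="")
```

Each value can then be written to a file named after its seed, as the shell loop does. With min_size=2, the single-amplicon OTU c_5 is skipped and only one file would be produced.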

Troubleshooting

If swarm exits with an error message saying "This program requires a processor with SSE2", your computer is too old to run swarm (or is based on a non-x86-64 architecture). swarm only runs on CPUs with the SSE2 instructions, i.e. most Intel and AMD CPUs released since 2004.

New features

version 1.2.19

swarm 1.2.19 fixes a problem related to abundance information when the sequence identifier includes multiple underscore characters.

version 1.2.18

swarm 1.2.18 re-enables the possibility of reading sequences from stdin if no file name is specified on the command line. It also fixes a bug related to CPU feature detection.

version 1.2.17

swarm 1.2.17 fixes a memory allocation bug introduced in version 1.2.15.

version 1.2.16

swarm 1.2.16 fixes a bug in the abundance sort introduced in version 1.2.15.

version 1.2.15

swarm 1.2.15 sorts the input sequences in order of decreasing abundance unless they are detected to be sorted already. When using the alternative algorithm for d=1 it also sorts all subseeds in order of decreasing abundance.

version 1.2.14

swarm 1.2.14 fixes a bug in the output with the swarm_breaker option (-b) when using the alternative algorithm (-a).

version 1.2.13

swarm 1.2.13 updates the citation.

version 1.2.12

swarm 1.2.12 improves the speed of the new search strategy for d=1.

version 1.2.11

swarm 1.2.11 corrects the number of differences reported in the break_swarms output.

version 1.2.10

swarm 1.2.10 allows amplicon abundances to be specified using the usearch style in the sequence header (e.g. ">id;size=1") when the -z option is chosen. It also fixes the bad URL shown in the previous version of swarm.

version 1.2.9

swarm 1.2.9 includes a parallelized variant of the new search strategy for d=1. It seems to be fairly scalable up to about 16 threads for longer reads (~400bp), and up to about 8 threads for shorter reads (~150bp). Using about 50% more threads than available physical cores is recommended. This version also includes the d parameter at the beginning of the mothur-style output (e.g. swarm_1). Also, in the break_swarms output, the real number of differences between the seed and the amplicon is indicated in the last column.

version 1.2.8

swarm 1.2.8 fixes an error with the gap extension penalty. Previous versions effectively used a gap penalty twice as large as intended. This version also introduces an experimental new search strategy in the case where d=1 that appears to be almost linear and faster at least for datasets of about half a million sequences or more. The new strategy can be turned on with the -a option.

version 1.2.7

swarm 1.2.7 incorporates a few small changes and improvements to make it ready for integration into QIIME.

version 1.2.6

swarm 1.2.6 adds an option (-r or --mothur) to format the output file as a mothur-compatible list file instead of the native swarm format. When swarm encounters an illegal character in the input sequences, it will now report the illegal character and the line number.

version 1.2.5

swarm 1.2.5 can be run on CPUs without the POPCNT feature. It automatically checks whether the CPU feature is available and uses the appropriate code. The code that avoids POPCNT is just slightly slower. Only basic SSE2 is now required.

version 1.2.4

swarm 1.2.4 changes the name of the new option from --break_swarms to --break-swarms for consistency with other options, and also adds a companion script swarm_breaker.py to refine swarm results (scripts folder).

version 1.2.3

swarm 1.2.3 adds an option (-b or --break_swarms) to output all pairs of amplicons to stderr. The data can be used for post-processing of the results to refine the swarms. The syntax of the inline assembly code is also changed for compatibility with more compilers.

version 1.2.2

swarm 1.2.2 fixes an issue with incorrect values in the statistics file (maximum generation and radius of swarms). This version is also a bit faster.

version 1.2.1

swarm 1.2.1 removes the need for a SSE4.1 capable CPU and should now be able to run on most servers, desktops and laptops.

version 1.2.0

swarm 1.2.0 introduces a pre-filtering of similar amplicons based on k-mers. This eliminates most of the time-consuming pairwise alignments and greatly improves speed. The speedup can be more than 100-fold compared to previous swarm versions when using a single thread with a large set of amplicons. Using multiple threads induces a computational overhead, but becomes more and more efficient as the size of the amplicon set increases.

version 1.1.1

swarm now works on Apple computers. This version also corrects an issue in the pairwise global alignment step that could lead to sub-optimal alignments. Slightly different alignments may result relative to the previous version, giving slightly different swarms.

version 1.1.0

swarm 1.1.0 introduces new optimizations and is 20% faster than the previous version on our test dataset. It also introduces two new output options: statistics and uclust-like format.

Statistics

When the -s option is specified, swarm now outputs detailed statistics about each swarm to the specified file. It prints the number of unique amplicons, the number of copies, the name of the seed and its abundance, the number of singletons (amplicons with an abundance of 1), the number of iterations and the maximum radius of the swarm (i.e. the number of differences between the seed and the furthermost amplicon). When using input data sorted by decreasing abundance, the seed is the most abundant amplicon in the swarm.

Uclust-like output format

Some pipelines use the uclust output format as input for subsequent analyses. swarm can now output results in this format to a specified file with the -u option.

Citation

To cite swarm, please refer to:

Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: robust and fast clustering method for amplicon-based studies. PeerJ 2:e593 http://dx.doi.org/10.7717/peerj.593

Contact

You are welcome to: