This dissertation develops technical and governance infrastructure for a "free factory" by building on parallels with free and open source software and related communities. By viewing varied technologies and people as comprising free factories—or a federation of co-operating and competing factories with certain common ideals and infrastructure—I argue many scientific questions become easier to answer.
In the first chapter, I briefly summarize the dissertation. I then describe the hardware, staff and other resources required to implement the computational aspects of a free factory with reasonable economies of scale. In the next chapter, I use the infrastructure to search for DNA and RNA editing events in more than 600 million genomic traces from ten organisms at NCBI. I find numerous traces that support the existence of these phenomena and set the stage for a more comprehensive investigation. The subsequent chapter uses the same tools to analyze four individual human genomes for variants of clinical interest. This work demonstrates that such analyses need not lead to costly or harmful medical workup. In the last chapter, I describe the initial data release of the Personal Genome Project. The release is derived from two gigabases of targeted sequence data from ten individuals. I assess the quality of the data by comparison with Affymetrix 500K SNPs and discuss one variant of clinical interest. This data release, which links scientists, physicians and members of the general public, demonstrates the utility of free factories for advancing the state of the art in personalized, genomic medicine.
In Appendix A, I indicate how the Quantum Coreworld (earlier work on a digital evolution system consistent with the rules of quantum information processing) could efficiently use free factories. Such projects could allow free factories to fully utilize idle resources. Finally, in Appendix B, I use a novel, open-source primary data analysis pipeline to reprocess 100 gigabytes of image data derived from the exome of a Personal Genome Project participant. On this PGP sample, the approach yields a 14% increase in placeable reads over the vendor's pipeline.