Prototype bio-forensic analysis with WEKA
Final project for Wright State University CS4840/6840, Intro to Machine Learning, Spring 2016
GNU GPL v3 (see separate license file). Weka, which this work references, is also licensed under GNU GPL v3.
The MixtureMining project is intended to explore methods for estimating/inferring the number of contributors present in mixed DNA samples. This project includes:
> ruby driver.rb min_contributors max_contributors mixtures_per_class -f [AS/ASB/PC] -n features_to_keep -c BS'
All paths relative to ./preprocessing/
Mixture simulation
Ex.
> ruby mix_gen.rb --infile ./path_to/genotypes.csv --outfile ./path_to/mixture_output.csv --per num_samples_per_mix --mixtures num_mixture_to_make [--seed PRNG_seed_value]
> ruby mix_gen.rb --infile single_source/361_caucasian_identifiler_loci.csv --outfile mixtures/361_cau_id_2_mix_500.csv --per 2 --mixtures 500
Feature creation
2.1 Allele frequency feature creation: uses [—aftable aftable.csv] flag, requires allele frequency table to be passed to script.
2.2 Allele counting feature creation: uses [—ac] flag
> ruby locus_info.rb --infile ./path_to/mixture_output.csv --outfile ./path_to/preprocessed_mixtures.csv [--aftable ./path_to/allele_frequencies_table.csv] [--ac]
Ex.
> ruby locus_info.rb --infile mixtures/361_cau_id_mix_2_3_4_1000each.csv --outfile mixes_preprocessed/361_cau_id_mix_2_3_4_1000each_preprocessed.csv --aftable frequencies/361_cau.csv --ac
> java -jar MixtureMining.jar training_file test_file -f [AS/ASB/PC] -n features_to_keep -c BS
Example data taken from NIST genotype dataset and accompanying allele frequencies, available at http://www.cstl.nist.gov/strbase/NISTpop.htm