I work on clinical bioinformatics research in immunology and skeletal medicine. I'm passionate about data science and machine learning in a clinical research setting, and about getting the most out of the valuable data we collect.
I'm highly proficient in R and love taking on new and complex data science problems. Take a look at my CV and projects, and if you ever want a chat about a project or a job, get in touch!
I deal with large-scale genomic data and apply a range of statistical methods, so working in R (or R Notebooks nowadays...) is a natural choice. R gives me the flexibility to run code anywhere R is installed, to rapidly develop and prototype methods, and even to tie it all together into a robust application using Shiny. I'm a highly competent R programmer, these days more of a functional programmer than an OO one. I'm very familiar with many of the genomic (especially human) databases hosted by the EBI, the Broad, and Ensembl, all of which use SQL in some way. If R is my bread, then Bash is my butter: most of the sequencing data I analyse needs some form of preprocessing or normalisation before it's usable, and most of the tools that provide that functionality are C++ or Java based, so elegant Bash scripting is essential. Visualisation is another key part of my day-to-day work, and I'm extremely familiar with the ggplot2 framework in R, designing and implementing informative, publication-quality plots.
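As a flavour of the kind of ggplot2 work described above, here's a minimal sketch of a publication-style plot. The data, variable names, and theme choices are illustrative only, not taken from any real analysis:

```r
library(ggplot2)

# Toy expression data -- purely illustrative, not real study data
set.seed(7)
expr <- data.frame(
  group      = rep(c("Control", "Treated"), each = 50),
  expression = c(rnorm(50, mean = 5), rnorm(50, mean = 6))
)

# A clean, publication-style boxplot with jittered points overlaid
p <- ggplot(expr, aes(x = group, y = expression)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  labs(x = NULL, y = "Normalised expression (log2)") +
  theme_classic(base_size = 12)

# ggsave("expression_boxplot.pdf", p, width = 4, height = 4)
```

The `theme_classic()` base and explicit axis labelling are typical starting points before fine-tuning a figure for a journal's requirements.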
I work day to day in a Linux environment and am familiar with both Debian and Red Hat flavours of Linux. I've been responsible for system administration of Ubuntu-based servers, development on a CentOS-based compute cluster, and management of NAS storage arrays. I've also used AWS on past projects, automating the spin-up and shutdown of instances alongside general instance management.
Experimental Design
I've been lead analyst on a number of projects, and a key responsibility of mine has been experimental design, on which I'm consulted by PIs, professors, and PhD students. Typically these are array studies and high-throughput RNA/miRNA/DNA sequencing experiments. One recent project hit a problem: the number of sample types exceeded the number of unique samples that could be sequenced at once (the multiplexing limit). I designed a function to simulate the experimental parameters and fit a model that was full rank, based on the study's hypothesis. This process allowed samples to be allocated to optimal groups, so that between-batch variance could be absorbed in the model design. The project code is linked below.
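The core idea can be sketched in a few lines of R. This is a simplified illustration, not the project code: the factor names, group sizes, and multiplexing limit are invented, and the real function simulated the actual study's parameters. The principle is the same, though: randomly allocate samples to batches, build the design matrix with `model.matrix()`, and only accept an allocation whose design is full rank (checked via `qr()`), so batch can be modelled alongside the biology rather than confounded with it.

```r
# Sketch: find a sample-to-batch allocation whose design matrix is full
# rank, so batch effects can be estimated alongside the biological factor.
# Factor names and sizes below are illustrative, not the real study's.
set.seed(42)

n_batches   <- 4  # hypothetical multiplexing limit per sequencing run
sample_type <- rep(c("bone", "cartilage", "synovium"), length.out = 12)

is_full_rank <- function(design) {
  qr(design)$rank == ncol(design)
}

repeat {
  batch  <- sample(rep(seq_len(n_batches), length.out = length(sample_type)))
  design <- model.matrix(~ factor(sample_type) + factor(batch))
  if (is_full_rank(design)) break
}

table(sample_type, batch)  # inspect the accepted allocation
```

A rank-deficient design here would mean a sample type is perfectly nested within a batch, making the two effects inseparable in the model.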
Statistical analysis is a core part of my current role, and I've become highly capable of taking high-level concepts and applying them to very large datasets. One example is a recent eQTL study I analysed, where the PI wanted a bespoke analysis and standard packages couldn't offer a solution. An eQTL analysis looks for a trend in expression, conditional on a patient's genotype, in order to identify disease traits. The bespoke analysis involved running a simple linear regression over hundreds of thousands of combinations, and later evolved into a multinomial log-linear model, which allowed us to identify disease-specific eQTLs. The GitHub project is available in the link below.
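The first, linear-regression stage of an eQTL scan can be sketched as follows. Everything here is simulated and simplified: the real analysis ran over hundreds of thousands of SNP-gene pairs, and the later multinomial stage (e.g. via something like `nnet::multinom`) is not shown.

```r
# Sketch of the simple-regression stage of an eQTL scan: for each SNP,
# regress a gene's expression on genotype dosage (0/1/2) and record the
# p-value for the genotype slope. Simulated data, illustrative only.
set.seed(1)
n_samples <- 100
n_snps    <- 50

genotypes  <- matrix(rbinom(n_samples * n_snps, size = 2, prob = 0.3),
                     nrow = n_samples)
expression <- rnorm(n_samples) + 1 * genotypes[, 1]  # SNP 1 is a true eQTL

eqtl_scan <- function(expr, geno) {
  # One lm() per SNP; extract the genotype-slope p-value from each fit
  apply(geno, 2, function(g) {
    fit <- summary(lm(expr ~ g))
    coef(fit)["g", "Pr(>|t|)"]
  })
}

pvals <- eqtl_scan(expression, genotypes)
which.min(pvals)  # the simulated eQTL should rank among the strongest hits
```

At scale, vectorised or matrix-based approaches replace the per-pair `lm()` call, but the statistical model being fitted is the same.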
Paediatric patients with complex Primary Immunodeficiency (PID) are seen at the Great North Children's Hospital in Newcastle, where they're typically screened for a number of known genetic variants that mark various sub-classifications of PID. Often a genetic screen gives no definitive diagnosis, in which case the patient, and sometimes family members, are exome sequenced in house. I designed and implemented the analysis pipeline, covering 34 families and 95 singletons and managing over 200 samples. The pipeline elegantly handles samples from different sequencers, different chemistries, and different pedigrees, and implements sanity checks for inferred sex and for relatedness within pedigrees. The system is built on the foundations of GATK 3.4 and is optimised to run on Son of Grid Engine (SoGE). It allows incremental batches of samples to be added, with an average of 9 samples (45 GB of compressed raw data) arriving per month.
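As an illustration of the sex-inference sanity check mentioned above, here is a minimal R sketch. The thresholds, column names, and sample values are all hypothetical, not those of the production pipeline; the underlying idea is simply that XY samples show near-zero heterozygosity on non-pseudoautosomal chrX, so a mismatch with the recorded sex flags a possible sample swap.

```r
# Sketch: infer sample sex from the fraction of heterozygous chrX calls
# and flag disagreements with the recorded sex. Illustrative values only.
infer_sex <- function(x_het_rate, het_threshold = 0.1) {
  ifelse(x_het_rate < het_threshold, "male", "female")
}

samples <- data.frame(
  id           = c("S01", "S02", "S03"),
  recorded_sex = c("male", "female", "female"),
  x_het_rate   = c(0.01, 0.24, 0.02)  # fraction of heterozygous chrX calls
)

samples$inferred_sex <- infer_sex(samples$x_het_rate)
samples$mismatch     <- samples$inferred_sex != samples$recorded_sex
subset(samples, mismatch)  # S03 would be flagged for manual review
```

In practice such checks sit alongside pedigree-wide relatedness estimates, so both swapped and mislabelled samples are caught before variant interpretation.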
Poster | Github Project | Client - Prof. Sophie Hambleton (Newcastle University, NHS) | Category - Computational Diagnostics