Feb 11: Identifying KDIGO trajectory subphenotypes after acute kidney injury with increased mortality rates
Presenter: Taylor Smith
Mar 4: Medication application pattern mining using recurrent neural networks
Presenter: Sajjad Fouladvand
Mar 25: Computational Tools for the Untargeted Assignment of FT-MS Metabolomics Datasets
Presenter: Joshua Mitchell
Apr 8: Evidence of Peroxidase Catalysed Formation of Cysteine-Tyrosine and Dityrosine Cross-Linking in Mammalian Sperm Protamines
Presenter: Christian Powell
Apr 15: CCTS Spring Conference
Presenter: Brian Davis
Fall 2018 (6 Presentations)
Sept 10: Mining emerging phenomena on large-scale longitudinal phenomics data
Presenter: Chen Jin
Type: Research Seminar
Sept 17: Review of annotation enrichment analyses for omics-level datasets
Presenter: Hunter Moseley
Sept 24: A Machine Learning Approach to Computational Polypharmacology and Lessons Learned
It is common knowledge that drugs have polypharmacological properties that can be explored for new insights in drug discovery. That is, a given drug interacts with many different proteins and a given protein interacts with multiple drugs. The polypharmacological, promiscuous nature of pharmaceuticals can have both beneficial and detrimental consequences. This attribute can be exploited to improve drug efficacy and prevent drug resistance. In addition to the ability of chemical compounds to interact with an array of protein targets, many diseases have multiple genetic determinants, and individual genetic determinants may be involved in multiple diseases. Furthermore, protein function and expression are controlled by a regulatory network of other proteins. Even when targeted therapies work initially, patients often develop resistance due to secondary mutations or compensation from other parts of the underlying biological network. This illustrates the potential benefits of establishing computational polypharmacology methods: discovering drugs that intentionally target multiple proteins for a beneficial therapeutic result. Many adverse drug reactions (ADRs) result from drugs interacting with non-therapeutic off-targets (unintended interactions). Animal studies during preclinical trials are not always a good indication of these adverse interactions in humans, and such adverse effects are generally not discovered until a drug has reached clinical trials or is already on the market. With the number of different proteins in humans and the genetic variations observable in the population, a full understanding of all possible interactions through experiments and clinical testing alone is not feasible, making computational investigations particularly useful and relevant.
In our digitalized, data-driven world, there is a wealth of knowledge available that is beyond the processing power of an individual researcher or even a team of researchers. The abundance of available biomedical data, combined with the massive computing power available today in leadership-class supercomputers, provides great opportunities to advance computational drug research. A tool that reliably predicts protein and drug binding would revolutionize the pharmaceutical industry. An accurate representation of polypharmacological networks would provide a wealth of knowledge and insights on drug repurposing, side-effect prediction, and drug efficacy. This would lead the way to personalized polypharmacological networks that include an individual’s genetic variations, resulting in a breakthrough for precision medicine. However, there are still many obstacles to overcome when it comes to utilizing massive computational power and ensuring the accuracy of our predictions.
Using machine learning to score potential drug candidates may offer an advantage over traditional imprecise scoring functions because the parameters and model structure can be learned from the data. However, models may lack interpretability, are often overfit to the data, and are not generalizable to drug targets and chemotypes not in the training data. Benchmark datasets are prone to artificial enrichment and analogue bias due to the overrepresentation of certain scaffolds in experimentally determined active sets. Datasets can be evaluated using spatial statistics to quantify the dataset topology and better understand potential biases. Dataset clumping comprises a combination of self-similarity of actives and separation from decoys in chemical space and is associated with overoptimistic virtual screening results. This talk explores data, methods, and potential data biases relevant to computational drug binding predictions.
Type: Research Presentation
Presenter: Sally Ellingson
Oct 1: Untargeted lipidomics of NSCLC shows differentially abundant lipid categories in cancer vs non-cancer
Presenter: Joshua Mitchell
Type: Research Seminar
Oct 29: Investigating the role of iron and the tumor microenvironment in breast cancer progression
Presenter: Luis Sordo Vieira
Type: Research Seminar
Breast cancer cells are addicted to iron. The mechanisms by which malignant cells acquire and retain high levels of iron are not completely understood. Macrophages and fibroblasts in the tumor microenvironment are significant contributors to iron acquisition. In this talk, we will summarize some of our research in progress on a mathematical model of how iron affects breast cancer progression, and survey some of the results obtained so far.
Nov 5: Automatic 13C Chemical Shift Reference Correction of Protein NMR Spectral Data Using Data Mining and Bayesian Statistical Modeling
Presenter: Xi Chen
Type: Research Seminar
Nov 12: Computational exploration of the molecular basis of calmodulin-dependent calcineurin activation
Presenter: Bin Sun
Type: Research Seminar
Abstract: Calmodulin (CaM) binds to the CaM-recognition motif of calcineurin (CaN) with an affinity in the low picomolar range; however, this alone is insufficient to fully activate CaN. It has been shown that the CaN regulatory domain folds upon CaM binding and that there is a region C-terminal to the canonical CaM-binding region, the 'distal helix', that becomes helical and is critical for activation. Intriguingly, a soybean-derived CaM variant competes with mammalian CaM and suppresses CaN activation. Further, although the plant variant exhibits relatively high sequence homology with the mammalian isoform, many of the non-conserved amino acids are positioned far from the canonical binding site, which suggests that secondary protein-protein interactions are responsible for regulating CaN activation. We hypothesized that plant CaM variants exhibit impaired distal helix/CaM interactions that prevent CaN activation. To test this hypothesis, we utilized molecular simulations including replica-exchange molecular dynamics to model distal helix conformations, Brownian dynamics to generate trial distal helix/CaM poses, and conventional molecular dynamics to evaluate the stability of the predicted binding modes. From these simulations we have isolated a potential binding site (site IV), which yields poses characterized by strong interprotein interactions and comparatively small conformational fluctuations. Further, molecular simulations of distal helix (A454D) and site IV (soybean CaM-inspired substitutions K30E and G40D) variants exhibit impaired interactions that correlate with reduced CaN activation in those systems. This study therefore provides a potential structural basis for the role of secondary CaM/CaN interactions in mediating CaN activation.
Spring 2018 (9 Presentations)
Jan 29: Elevated RNA Editing Activity Is a Major Contributor to Transcriptomic Diversity in Tumors
Mar 5: Dual roles of electrostatic-steering and conformational dynamics in the binding of calcineurin’s intrinsically-disordered recognition domain to calmodulin
Presenter: Bin Sun
Abstract: Calcineurin (CaN) is a serine/threonine phosphatase that regulates a variety of physiological and pathophysiological processes in most mammalian tissue. It has been established that the calcineurin regulatory domain (RD) is highly disordered when inhibiting CaN, yet it undergoes a disorder-to-order transition upon binding calmodulin (CaM) to activate the phosphatase. The prevalence of polar and charged amino acids in the RD implicates electrostatic interactions in mediating CaM binding, yet it is unclear whether properties of the RD conformational ensemble, such as its effective volume and the accessibility of its CaM binding motif, help or hinder its ability to participate in protein-protein recognition events. In the present study, we investigated via computational modeling the extent to which electrostatics and structural disorder co-facilitate or hinder CaM/CaN association kinetics. We examined several peptides containing the CaM binding motif, for which lengths and amino acid charge distributions were varied, via microsecond-scale molecular dynamics (MD) and Brownian dynamics (BD) simulations, to isolate the contributions of electrostatics versus conformational diversity to predicted, diffusion-limited association rates. Our results indicate that the RD amino acid composition and sequence length influence both the dynamic availability of conformations amenable to CaM binding and the long-range electrostatic interactions that steer association. These findings provide intriguing insight into the interplay between conformational diversity and electrostatically-driven protein-protein association involving CaN, which is likely to extend to wide-ranging processes regulated by intrinsically-disordered proteins.
Mar 12: CANCELED! Automatic 13C Chemical Shift Reference Correction for Unassigned Protein NMR Spectra
Presenter: Xi Chen
Abstract: Poor chemical shift referencing, especially for 13C in protein Nuclear Magnetic Resonance (NMR) experiments, fundamentally limits and even prevents effective study of biomacromolecules via NMR, including protein structure determination and analysis of protein dynamics. To solve this problem, we constructed a Bayesian probabilistic framework that circumvents the limitations of previous reference correction methods that required protein resonance assignment and/or protein structure. Our software, named Bayesian Model Optimized Reference Correction (BaMORC), can detect and correct 13C chemical shift referencing errors on the order of +/- 0.45 ppm at a 90% confidence interval (CI) before the protein resonance assignment step of the analysis. By combining the BaMORC methodology with a new intra-peaklist grouping algorithm, we created a combined method referred to as SoBaMORC that can be applied to unassigned experimental peak lists. SoBaMORC kept all experimental three-dimensional HN(CO)CACB-type peak lists tested within +/- 0.4 ppm of the correct 13C reference value. SoBaMORC can be applied to correct 13C chemical shift referencing errors when it will have the most impact, right before protein resonance assignment and other downstream analyses are started. Moreover, this web application allows non-NMR experts to quickly detect and correct 13C referencing errors before they try to use the affected spectral data. Thus, this software lowers the bar of NMR expertise required to perform effective protein NMR studies. Software implementing SoBaMORC is available for download and through a web-based interface for use by the broader scientific community.
Mar 19: Mutational Characterization of Squamous Cell Lung Cancers from Appalachian Kentucky: Moving Closer to Personalized Treatment
Presenter: Hunter Moseley*
Mar 19: CANCELED! How can public engagement change the way you do and present your science
Presenter: Sylvie Garneau-Tsodikova
Mar 26: Detangling PPI networks to uncover functionally meaningful clusters
Presenter: Eugene Hinderer
April 2: CANCELED!
Presenter: Varun Dwaraka
April 9: Metabolic network segmentation: A probabilistic graphical modeling approach to identify the sites and sequential order of metabolic regulation from non-targeted metabolomics data
April 16: Developing a Global Homology Analysis for Comparative Genomics
Presenter: Kelly Sovacool
April 23: Determination of Protein Functional Regions from Pre-existing Protein-Level Annotations
Presenter: Christian Powell
Fall 2017 (9 presentations)
Aug 14 - Systems Biology Using Amazon Web Services
Presenter: Peyton Biggs, AWS Genomics
Sep 6 - The 22q11.2 Deletion Syndrome: Transmission and Variation Annotation
Presenter: Matthew Hestand
Abstract: The 22q11.2 deletion syndrome is the most common chromosomal deletion syndrome in humans, with an incidence of 1 in 2,000 to 4,000 live births, and shows extremely variable clinical presentations. Here we utilize multiple state-of-the-art genomic strategies, including fiber-FISH and whole-genome sequencing, to characterize the structure of the region. This includes identifying common and atypical deletion sizes, fine-tuning the deletion breakpoints, and identifying nested inversion polymorphisms that predispose parents to transmitting chromosome 22q11.2 deletions to their offspring. In addition, the hemizygous nature of the region offers the opportunity to evaluate mutations in relation to known recessive disorders, such as SNAP29 mutations causative for cerebral dysgenesis, neuropathy, ichthyosis and keratoderma, Kousseff, and a potentially autosomal recessive form of Opitz G/BBB syndrome. However, we also identify variation that appears damaging, but is actually benign, such as SCARF2 mutations in patients that do not clinically present signs of Van den Ende-Gupta syndrome. Overall, these in-depth studies on more than one thousand individuals are providing a rich annotation detailing structure, pathogenic and benign variation, and transmittance of the 22q11.2 deletion.
Sep 18 - Data Quality & Consistency in Various Scientific Repositories
Presenter: Andrey Smelter (& others)
Abstract: Our lab recently developed an API to the metabolomics repository Metabolomics Workbench, mwtab. Using this API package to investigate the data sets in Metabolomics Workbench revealed some interesting data issues. In addition to Metabolomics Workbench, we have also had experience working with data from the Biological Magnetic Resonance Bank, RefDB, Gene Ontology, Kyoto Encyclopedia of Genes and Genomes, and Protein Data Bank. Each of these data repositories has gotchas that casual users may not be aware of.
Sep 25 - Small Molecule Isotope Resolved Formula Enumerator (SMIRFE): a tool for assigning isotopologues and metabolites in Fourier transform mass spectra
Presenter: Joshua Mitchell
Abstract: Fourier-transform mass-spectrometry (FTMS) is often utilized in the detection of small molecules derived from biological samples. What is directly detected in the FTMS spectra are peaks for related sets of isotopologues, or molecules that differ only in their isotopic composition, for various adducted and charged species corresponding to specific molecules present in a given biological sample or introduced by contamination. The sheer complexity of what is detected, along with a variety of analytically-introduced variance, error, and artifacts, has hindered the systematic analysis of the complex patterns of detected peaks. We have developed and prototyped a novel algorithm, SMIRFE, that detects small biomolecules less than 2000 daltons in mass at a desired statistical confidence and determines their specific elemental molecular formula (EMF) using detected cliques of related isotopologue peaks with compatible isotope resolved molecular formulae (IMFs). The methodology works on mass spectra derived from experiments without stable isotope tracing, but especially on mass spectra from stable isotope tracing experiments that contain metabolites labeled with specific stable isotopes like 13C, 15N, and 2H from a given labeling source and/or from natural abundance. The current prototype efficiently searches a roughly 4.8 quintillion (4.8 x 10^18) IMF space for each peak's m/z, based on molecular masses <=2000 daltons, but larger IMF spaces are searchable. This approach has none of the limitations of current methods that can only detect known metabolites in a database. Thus, this new method enables the full interpretation of untargeted metabolomics studies through the identification of metabolites at the level of structural isomers representing the same EMF.
We validated the assignment performance using verified assignments from a peak list derived from a Thermo Orbitrap Fusion Tribrid FTMS spectrum of a biological sample that had been treated with the ECF (2Cl-CO2Et) chemoselection agent. The current SMIRFE prototype provided both high accuracy for untargeted assignment for verified metabolite cliques and unambiguous IMF assignment for over half of the detected peaks in the tested peak list.
Presenter: Bradley Stewart
Oct 9 - Between-scan peak correspondence and normalization for direct-injection Fourier transform mass spectrometry data
Abstract: Direct-injection Fourier-transform mass-spectrometry (FTMS) is employed by many research groups as a method in metabolomics, gathering abundance information about all possibly present metabolites in a biological system of interest. In many cases, the data are acquired in multiple scans, and the point intensities are averaged across scans for peak identification, fitting, and integration, resulting in upwards of tens of thousands of individual peaks for assignment. However, differences in the relative scale between scans can span multiple orders of magnitude, reducing the effectiveness of simple averaging approaches and leading to the introduction of a significant number of noise peaks. Previous work in our lab has identified peak artifacts present in FTMS acquired data. As a parallel to that work, we have developed methods for characterizing peaks in individual scans and subsequently combining peaks across multiple scans. The developed methods include peak identification, peak fitting and integration, noise peak determination, followed by peak correspondence, normalization, and averaging across scans. Through the combination of these methods, the data density for a given sample is greatly reduced, going from 13000 or 30000 peaks to 2000 or 4000 peaks, while gaining information about the reliability of a peak via the number of scans it was observed in, as well as the (relative) standard deviation of the peak mass-to-charge ratio, height (often reported as intensity), and area. These scan-level peak characterizations and aggregations significantly improve downstream data analyses, including assignment to specific metabolite isotopologues.
Feb 13 - Navigating the Evolving Computational Landscape
Presenter: Jim Griffioen
Feb 15 - CANCELED! - Modular Ontology Modeling for Data Access and Reuse (IBI Seminar)
Presenter: Pascal Hitzler, Wright State University
Location: Hardymon Theater in the Marksbury Building
Abstract: One of the original motivations for developing ontologies was that they were to act as generic domain models which can be easily reused and repurposed. However, ontology modeling for applications in practice is often driven by very concrete use cases, and thus the corresponding ontologies are often strongly tailored towards meeting very specific use case requirements. As a consequence, ontologies in practice are often not easy to repurpose, and their added value for data access and reuse is limited. In this presentation, we discuss how to model ontologies in such a way as to simplify future reuse. In particular, we will discuss modularization of ontologies, the role of ontology design patterns, and ontology views.
Feb 20 - Big Compute, Big Data, and Better Drugs, Beyond Docking: Increasing the Accuracy of Virtual Screens
Presenters: Sally Ellingson & Amir Kucharski
Feb 27 - Citizen Science, Data Science
Presenter: Jin Chen
Mar 6 - Exogenous Metabolic Enzymes as Therapies
Presenter: Chang-Guo Zhan
Mar 13 - Metabolomics of thrombotic myocardial infarction: systems characterization of plasma metabolome perturbations and the development of a diagnostic classifier
Presenter: Patrick Trainor, invited speaker from University of Louisville
Abstract: Heart disease is the leading cause of global mortality. Acute Myocardial Infarction (MI) is an acute disease event that is characterized by myocardial ischemia and necrosis. While myocardial necrosis is a unifying pathological characteristic and a central tenet of diagnostic criteria, detection of necrosis does not inform clinicians as to the antecedent cause. Specifically, MI may follow spontaneous atherosclerotic plaque disruption that results in a coronary thrombus (thrombotic MI) or may follow non-thrombotic causes such as coronary vasospasm that result in oxygen supply and demand mismatch (non-thrombotic MI). We sought to achieve two aims using untargeted metabolomic profiling of human plasma: (1) to determine the differential effects on human metabolism of thrombotic MI versus non-thrombotic MI and (2) to develop a preliminary diagnostic classifier capable of discriminating between thrombotic MI, non-thrombotic MI, and stable disease. We enrolled subjects presenting with thrombotic MI, non-thrombotic MI or stable coronary artery disease (CAD) and quantified plasma metabolites by untargeted UPLC-MS/MS and GC-MS in the acute event phase and a stable disease state for each subject. In this talk we discuss a systems approach for evaluating the dynamic change in modules of interrelated metabolites across the transition from a stable disease state to acute event. Modules were inferred by analyzing the topology of a weighted network constructed from plasma abundances. We then pivot to a discussion of methodology we have developed for automated metabolite selection for diagnostic classifiers, which borrows from related work in biologically inspired computing and artificial intelligence. We conclude with a discussion of our important findings and future research.
Mar 20 - Gene regulation in ischemia-reperfusion injury of the retina
Presenter: Kalina Andreeva
Abstract: Ischemia-reperfusion injuries are associated with several diseases/disorders of the retina. Current treatments often have poor outcomes, in part due to a lack of understanding of the molecular mechanisms and potential therapeutic targets. Our lab has generated mRNA and miRNA microarray expression data to investigate the transcriptional and post-transcriptional regulation of gene expression following induction of ischemia-reperfusion injury in the rat retina. We have identified several regulatory elements including transcription factors (TFs), microRNAs (miRs) and mRNAs, all of which play key roles in the early and late phases of IR injury. Recently, a new class of non-coding RNA regulators, termed circular RNAs (circRNAs), have been reported to be encoded in the genome and expressed in all tissues and cell types investigated thus far. We have examined the genome-wide expression of circRNAs in a rat model of retinal ischemia. The analyses revealed that thousands of circRNAs accumulate in the rat neural retina and that their accumulation was altered in the IR-injured eye when compared with the corresponding sham control.
Mar 27 - Comparative Transcriptomics of Limb Regeneration: Identification of Conserved Gene Expression Changes Among Three Species of Ambystoma
Presenter: Varun Dwaraka
Abstract: Advances in sequencing technologies and analyses are beginning to allow robust testing of the gene networks underlying limb regeneration. In order to elucidate a core set of genes that are commonly expressed among Ambystomatid salamanders that elicit a natural regenerative response, we used a comparative approach between close and distant relatives of Ambystoma mexicanum, the Mexican axolotl. We reasoned that it would be possible to identify and parse species-specific expression differences using newly developed expression analyses. Here we report commonly expressed genes among three naturally regenerating Ambystomatid species: A. mexicanum, A. andersoni, and A. maculatum at 24 hours of wound healing.
April 3 - Bioinformatic approaches to characterizing salamander sex chromosomes
April 10 - Tools for assigning mass spectra from labeled chemoselectively derivatized samples
Presenter: Joshua Mitchell
April 17 - Association Kinetics of CaN binding to CaM
Abstract: Calcineurin (CaN) is a serine/threonine phosphatase that regulates a variety of physiological and pathophysiological processes in most mammalian tissue. It has been established that the calcineurin (CaN) regulatory domain is highly disordered when inhibiting CaN, yet it undergoes a disorder-to-order transition upon binding calmodulin (CaM) to activate the phosphatase. Given the enrichments of negatively charged residues in CaM and positively charged residues in CaM-binding region in CaN, it is intuitive to postulate that the electrostatic interaction between these two binding partners should play an important role in association kinetics. Meanwhile, the conformational dynamics of highly-flexible CaN could dictate the availability of CaM-accessible CaN states, which could affect the overall association kinetics as well. In this presentation, I will talk about a series of computational studies we performed to explore the electrostatic and conformational roles in the CaN:CaM association process.
Presenter: Bin Sun
April 19 - IBI Seminar: High-throughput Biomedical Image Computing for Digital Health
Abstract: In biomedical informatics, a large amount of image data has been collected to support clinical diagnosis, treatment decisions and medical prognosis. The large volume and the diversity of informatics across different imaging modalities require advanced and high-throughput image computing technologies for more accurate disease detection, deeper understanding of the mechanisms of disease progression, and better healthcare in precision medicine. With the ever increasing amount of biomedical image data, it is very important to design and develop efficient technologies for large-scale biomedical image analysis. This talk will describe high-throughput biomedical image computing methods for digital health, focusing on three significant topics: object detection, segmentation, and image understanding in medical diagnosis. Specifically, I will present several novel machine learning and imaging informatics technologies to process biomedical big image data and introduce the applications of these technologies in medical diagnosis.
Presenter: Fuyong Xing, University of Florida
Location & Time: 12:00-12:50pm, April 19, 2017 in 170 Biopharm Complex (Todd Building)
April 24: The importance of experimental design and QC samples in large-scale and MS-driven untargeted metabolomic studies of humans
April 24: IBI Seminar - Feature Selection and Learning on High-Dimensional and Large-Scale Data
Presenter: Qiang Cheng, PhD, Southern Illinois University Carbondale
Abstract: Diverse areas of scientific research and everyday life, such as healthcare, biomedicine and finance, are now deluged with high-dimensional data and big data. There is a need for data mining and prediction techniques for finding patterns and discovering knowledge from such data. In this talk I will present our feature selection and learning methods for handling such data effectively and efficiently. The feature selection methods integrate intrinsic discriminative information and exploit global optimization techniques on Markov random fields, giving rise to a closed-form solution of linear complexity. The learning methods are built within our minimax pattern learning framework, extending lasso-type sparse representation and possessing efficient complexity and fast convergence. I will present both supervised and unsupervised models that jointly exploit representation and learning. It is expected that these methods will potentially have a significant impact on various fields such as medicine and science.
May 1 - Automating the semantic enumeration and extraction of concepts from ontologies
Presenter: Eugene Hinderer
Fall 2016 (13 presentations)
Aug 29 - Canceled due to no-one volunteering
Sep 5 - Labor Day, No Meeting
Sep 12 - Organizational Meeting
Sep 19 - Analysis of protein-coding genetic variation in 60,706 humans
Sep 26 - DeepSplice: Deep Classification of Novel Splice Junctions Revealed by RNA-seq
Abstract: Alternative splicing (AS) is a regulated process that enables the production of multiple mRNA transcripts from a single multi-exon gene. The availability of large-scale RNA-seq datasets has made it possible to predict splice junctions, as well as splice sites through spliced alignment to the reference genome. This greatly enhances the capability to decipher gene structures and explore the diversity of splicing variants. However, existing ab initio aligners are vulnerable to false positive spliced alignments as a result of sequence errors and random sequence matches. These spurious alignments can lead to a significant set of false positive splice junction predictions, confusing downstream analyses of splice variant detection and abundance estimation. In this work, we illustrate that splice junction sequence characteristics can be ascertained from experimental data with deep learning techniques. We employ deep convolutional neural networks for a novel splice junction classification tool named DeepSplice that (i) outperforms state-of-the-art methods for predicting splice sites, (ii) shows high computational efficiency and (iii) can be applied to self-defined training data by users.
Presenter: Yi Zhang
Oct 3 - Bhattacharyya distance: From concept to grant application
Abstract: Recent work by our group revisited feasible solution algorithms (FSAs) first popularized by Doug Hawkins in the early 1990s. We use FSAs to find interactions between explanatory variables in predictive models. Initial versions of the algorithm failed miserably for logistic regression in big data problems with n << p, where p is the number of explanatory variables. We were able to overcome this issue using the Bhattacharyya distance between two bivariate distributions, which allowed us to write grants to further investigate the idea.
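For readers unfamiliar with the Bhattacharyya distance mentioned in this abstract, the following is a minimal sketch of its closed form for two bivariate normal distributions. The normality assumption, the function name, and the example inputs are illustrative assumptions; the abstract does not say which bivariate distributions the group used or how the distance enters the feasible solution algorithm.

```python
import math

def bhattacharyya_gaussian_2d(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two bivariate normal distributions.

    Assumes both distributions are Gaussian (an illustrative assumption).
    mu1, mu2 are length-2 sequences; cov1, cov2 are 2x2 nested lists.
    """
    def det(m):
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]

    # Average covariance matrix, Sigma = (Sigma1 + Sigma2) / 2.
    avg = [[(cov1[i][j] + cov2[i][j]) / 2.0 for j in range(2)] for i in range(2)]
    d_avg = det(avg)

    # Closed-form inverse of the 2x2 average covariance.
    inv = [[avg[1][1] / d_avg, -avg[0][1] / d_avg],
           [-avg[1][0] / d_avg, avg[0][0] / d_avg]]

    # Quadratic form (mu1 - mu2)^T Sigma^-1 (mu1 - mu2).
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

    # Mean-separation term plus covariance-mismatch term.
    return quad / 8.0 + 0.5 * math.log(d_avg / math.sqrt(det(cov1) * det(cov2)))

identity = [[1.0, 0.0], [0.0, 1.0]]
same = bhattacharyya_gaussian_2d([0.0, 0.0], identity, [0.0, 0.0], identity)
shifted = bhattacharyya_gaussian_2d([0.0, 0.0], identity, [2.0, 0.0], identity)
```

The distance is zero for identical distributions and grows as the means separate or the covariances diverge, which is what makes it usable as a separation score between two groups.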
July 18 - A More Comprehensive Examination of the Human Genome by Long-Read Sequencing
Presenter: Matthew Hestand, Laboratory for Cytogenetics and Genome Research
Abstract: Cheap, high-throughput, short-read sequencing has brought forth a genomics revolution. However, the nature of this technology limits variant detection primarily to unphased single-nucleotide variants and small indels, as well as creating 'black boxes' in the genome due to low complexity sequences, repetitive elements, and regions of skewed GC content. Long-read technologies, by contrast, do enable sequencing through many of these difficult regions, including those of clinical relevance. For example, we have determined length and AGG interruptions in the CGG tandem repeat that causes FXTAS, Primary Ovarian Insufficiency, and the Fragile X syndrome. Long-read sequencing also enabled us to determine the structure and breakpoint sequences of previously unobserved chromothripsis-like chromosomes, and to postulate the underlying mechanism. Overall, we demonstrate the utility of PacBio long-read technology to evaluate structural variations, discriminate pseudogene sequences, directly phase single-nucleotide variants, identify variation in tandem repeats, and even de novo assemble a full human genome.
July 25 - Deep Convolutional Neural Networks: Concepts and Examples
Presenter: Nathan Jacobs, Associate Professor of Computer Science, Center for Visualization and Virtual Environments, University of Kentucky
Abstract: For the past 5 years, methods based on Deep Convolutional Neural Networks (CNNs) have been dramatically advancing the state of the art in computer vision, approaching, and often exceeding, human speed and accuracy. This has been made possible by a combination of factors, including novel neural network architectures, massive datasets, faster hardware, improved software abstractions, and various low-level algorithmic innovations. This talk will provide a technical introduction to CNNs, an overview of their recent rise in popularity in the field of computer vision, and examples of their use for a variety of semantic and geometric image understanding tasks.
Aug 1 - Understanding Cation Binding to SERCA using Molecular Dynamics
Presenter: Bradley Stewart
Aug 8 - Detailed gene-based association study of genes linked to hippocampal sclerosis of aging and cerebral age-related TDP-43 with sclerosis neuropathology: GRN, TMEM106B, ABCC9, and KCNMB2
Abstract: Hippocampal sclerosis of aging (HS-Aging) is a common and distinctive clinical-pathological entity that can cause dementia. To learn more about genetic risk of HS-Aging pathology, we tested gene-based associations of the GRN, TMEM106B, ABCC9, and KCNMB2 genes, which were reported to be associated with HS-Aging in previous studies. We used genetic data obtained from the Alzheimer's Disease Genetics Consortium (ADGC), linked to autopsy-derived neuropathological outcomes from the National Alzheimer's Coordinating Center (NACC). Of the 3,251 subjects who were included in the study and died after age 60 years, 271 (8.3%) were identified as HS-Aging cases. The highest association signals came from SNPs on the ABCC9 gene (rs7966849) and on the KCNMB2 gene (rs73183328). The ABCC9 gene had a significant gene-based association with HS-Aging assuming a recessive mode of inheritance (MOI) when applying the Bonferroni correction. We confirmed the same results in people aged 80 years or older. The significant gene-based association of the ABCC9 gene is driven by the region in which the most significant variants are introns, whereas our studies underscore the many different SNPs that are in linkage disequilibrium with the HS-Aging associated SNP in the TMEM106B gene.
Presenter: Yuriko Katsumata
Aug 15 - A new Booster Proposal for Inference in Population Genetics
Abstract: Importance sampling plays a fundamental role in likelihood-based inference for various population genetic models. Although exact algorithms are not practical for moderate datasets, they can provide valuable intuition for improving proposals. Using one such intuition, we propose a booster proposal that works with an existing proposal to bring about more than an order of magnitude improvement in accuracy under the standard neutral coalescent model of a single, well-mixed population of constant size over time, under the infinite-sites model of mutation. The improvement is consistent in both simulated and real datasets. The method is not based on resampling and thus preserves independence of the samples. It is also faster, and its memory requirements are comparable to those of existing methods. It is generic in nature and thus readily applicable to more complex models involving migration and recombination. It provides strong support for our continued advocacy, from earlier work, that a systems approach can be a viable solution to Felsenstein's 2^8 programs problem.
Presenter: Susanta Tewari
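For readers unfamiliar with the term, the role of a "proposal" in importance sampling can be illustrated with a minimal sketch. It is not the booster proposal from the talk and does not use a coalescent model. It assumes a toy problem instead: estimating the small tail probability P(X > 4) for a standard normal X by sampling from a hypothetical proposal N(4, 1) and reweighting each draw by the density ratio. The function name and parameters are illustrative only.

```python
import math
import random


def tail_probability_is(threshold=4.0, n_samples=20000, seed=0):
    """Estimate P(X > threshold), X ~ N(0, 1), by importance sampling.

    Assumption for illustration: the proposal is N(threshold, 1), so draws
    concentrate where the rare event happens.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # draw from the proposal q
        if x > threshold:
            # weight p(x) / q(x) for unit-variance normals with means 0 and threshold
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def tail_probability_exact(threshold=4.0):
    """Closed-form normal tail probability, used only as a reference value."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    print(tail_probability_is(), tail_probability_exact())
```

Sampling directly from N(0, 1) would almost never land beyond the threshold at this sample size, which is why the choice of proposal governs accuracy.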
Aug 22 - The FAIR Guiding Principles for scientific data management and stewardship
Feb 15 - Discrete Models for the Simulation and Control of Gene Regulatory Networks
Abstract: Understanding how the physiology of organisms arises through the dynamic interaction of the molecular constituents of life is an important goal of molecular systems biology, for which mathematical modeling can be very helpful. Different modeling strategies have been used for this purpose. Dynamic mathematical models can be broadly divided into two classes: continuous, such as systems of differential equations and their stochastic variants, and discrete, such as Boolean networks and their generalizations. This talk will focus on the discrete modeling approach, which employs techniques from discrete mathematics, combinatorics, graph theory, and computational algebra. Discrete models play an important role in modeling processes that can be viewed as evolving in discrete time, in which state variables have only finitely many possible states. This talk will present an approach for stochastic simulations of discrete models. This approach will be used to study optimal control techniques to identify a control policy to navigate the system so that the probability of reaching a desirable state is maximized. The algorithms assume a set of intervention targets represented by control nodes and edges in the wiring diagram and use techniques from Markov decision processes for the identification of a control policy that dictates how to move from one state to another.
Presenter: David Murrugarra
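As a rough illustration of what a stochastic simulation of a discrete model can look like, here is a minimal sketch. It is not the speaker's framework and does not cover the control step. It assumes a hypothetical three-gene Boolean network in which each node applies its update rule with probability p_update and otherwise keeps its current value, and it estimates by repeated simulation how often a target state is reached. The rules, names, and parameters are all invented for illustration.

```python
import random

# Hypothetical three-gene network; these rules are illustrative only.
RULES = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B jointly activate C
}


def step(state, p_update, rng):
    """Evaluate every rule on the old state; apply each with probability p_update."""
    new_state = dict(state)
    for node, rule in RULES.items():
        if rng.random() < p_update:
            new_state[node] = bool(rule(state))
    return new_state


def reach_probability(start, target, p_update=0.8, n_steps=10, n_runs=2000, seed=0):
    """Estimate the chance that a run from start visits target within n_steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        state = dict(start)
        for _ in range(n_steps):
            state = step(state, p_update, rng)
            if state == target:
                hits += 1
                break
    return hits / n_runs


if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    target = {"A": True, "B": True, "C": True}
    print(reach_probability(start, target))
```

A control study of the kind described in the abstract would compare such reach probabilities under different interventions; that part is left out here.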
Feb 22 - Programming for Multicore CPUs with Python
Abstract: A brief introduction to techniques in Python for taking advantage of multiple CPU cores to accelerate calculations. A description of the different types of parallel computing in Python will be provided, but the majority of the information will be on the different ways to use the multiprocessing library in Python. How to use linear algebra libraries (namely OpenBLAS) in NumPy will also be covered. Code examples will be provided.
Presenter: Joshua Mitchell
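The talk's own code examples are not reproduced in this listing. As a stand-in, here is a minimal sketch of one common pattern with the standard-library multiprocessing module: split a CPU-bound job into chunks and map them over a pool of worker processes. The prime-counting task and the function names are assumptions chosen for illustration, not material from the talk.

```python
import multiprocessing as mp


def count_primes(bounds):
    """Count primes in the half-open range [start, stop) by trial division."""
    start, stop = bounds
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def make_chunks(limit, n_chunks):
    """Split [0, limit) into at most n_chunks contiguous (start, stop) ranges."""
    step = max(1, -(-limit // n_chunks))  # ceiling division, at least 1
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]


def parallel_count_primes(limit, workers=None):
    """Count primes below limit by mapping chunks over worker processes."""
    chunks = make_chunks(limit, 4 * (workers or mp.cpu_count()))
    with mp.Pool(processes=workers) as pool:
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    # The guard matters: platforms that spawn workers re-import this module.
    print(parallel_count_primes(200_000))
```

Using more chunks than workers helps balance the load, since later ranges take longer to test than earlier ones.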
Feb 29 - Approaches to linkage mapping in diverse non-model vertebrates
Abstract: A relatively informal discussion of a few non-standard approaches that my lab has been using to generate dense linkage maps for non-model vertebrates, including outbred crossing designs, genotyping by sequencing, genotyping by RNAseq, and single sperm sequencing. Time permitting, I also plan to discuss a few recently developed technologies that were presented at the recent AGBT meeting.
Presenter: Jeramiah Smith
Mar 7 - Computational Prediction of Adverse Drug Reactions
Presenter: Sally Ellingson
Mar 14 - Zhx2, liver metabolism, and sex-biased gene expression
Abstract: A discussion of my current project within the Spear lab. Our newest data provide evidence for Zhx2 as a novel regulator of cytochrome p450 gene expression in the liver, as well as many other known sex-biased genes. These genes are important for lipid, drug, and steroidal metabolism in the liver and contribute to the development of non-alcoholic fatty liver disease (NAFLD) in our mouse model.
Presenter: Alexandra Nail
Mar 18 - Pheno-Informatics: A New Framework For Analyzing Phenomics Data
Location: Wethington 014 (Basement Auditorium)
Abstract: Nowadays, DNA sequence data are available for many species, but the systematic quantification and analysis of phenotypes remains a big challenge. My research aim is to bridge the genotype-phenotype gap by developing novel data mining techniques so that multi-omics data can be transformed into testable hypotheses to identify important genes in various aspects. In this talk, I will first introduce our recent progress in phenomics data modeling, including a new inter-functional phenomics clustering method and a new phenotype-environment relationship learning framework. I will illustrate how these tools have led us to discover new biological mechanisms. In the second part, I will discuss our future plans in bioinformatics and data science, and their applications in biomedical research.
Presenter: Jin Chen
Mar 21 - Non-lethal Inhibition of Gut Microbial Trimethylamine Production for the Treatment of Atherosclerosis
Mar 28 - Protein NMR Reference Correction: A statistical approach for an old problem
Presenter: Bill (Xi) Chen
April 4 - Classification of Cancer Using Metabolomics
Abstract: Metabolomics is being regularly applied to understand and characterize various cancers. This seminar will discuss the use of random forests to generate models able to discriminate normal from cancer samples using lipids from lung cancer tissue with small numbers of lipids, and the development of a non-parametric method to evaluate the power of the classification method.
Presenter: Robert M Flight
April 11 - Canceled due to scheduling problems
April 18 - Metagenomic assessment of possible microbial contamination in the equine reference genome assembly
Presenter: Scotty DePriest
April 25 - Co-occurring Genomic Alterations Define Major Subsets of KRAS-Mutant Lung Adenocarcinoma with Distinct Biology, Immune Profiles, and Therapeutic Vulnerabilities
Sep 14 - Working with STRING PPIs Offline for Cancer Network Analysis
Presenter: Robert Flight
Sep 21 - Big Data and Systems Biology Approaches to Explore Transcriptome and RNA Regulatory Networks
Presenter: Juw Won Park, University of Louisville
Abstract: High-throughput RNA sequencing (RNA-seq) has provided a powerful tool for transcriptome analysis. Due to the dramatic decrease in cost, it has become quite common to generate millions or billions of sequence reads from a given RNA sample to identify and quantify the abundance of mRNA isoforms across the entire transcriptome. Large consortium projects have also started generating massive RNA-seq data on tens of thousands of samples along with various other genomic and phenotypic measurements. However, the extraordinary potential embedded in these large, complex datasets cannot be fully realized without the development of proper methods for analyzing these big transcriptome and genome datasets. In this presentation, I will discuss my recent efforts in developing computational and statistical methods for the analysis of transcriptome isoform complexity and RNA regulatory networks using RNA-seq datasets.
Sep 28 - Applying data fusion approaches on multi-platform metabolomics data acquired from breast cancer tumours
Abstract: The PacBio single-molecule sequencing platform produces long (avg 12-15kb) error-prone reads, though the errors are randomly distributed. Therefore, either combining read coverage or circularizing a DNA molecule and repeatedly sequencing both strands produces highly accurate consensus sequences. We have utilized this circular sequencing approach to determine error rates and profiles across six commonly used polymerases. Besides accurately determining mutations in double strands, the platform permits the identification of heteroduplexes, where a base on one strand is not complementary to a base on the other strand. Interestingly, we observed that Watson-Crick base-pairing errors are not equally distributed, but that across most polymerases there is a bias for pyrimidine transitions over purine transitions. Moving from single-molecule errors to chromosome-spanning errors, the long reads also provide a unique resource to identify structural variation, including sequencing across repetitive elements. Indeed, we used PacBio to demonstrate an insertional translocation of chrX sequence into chrY, generating an extended pseudoautosomal region (PAR). The insertion is generated by non-allelic homologous recombination between a 548 bp LTR6B repeat within the chrY PAR1 and a second LTR6B repeat located 105 kb from the PAR boundary on chrX. PacBio phasing within the duplicated region also enabled identification of the paternally inherited insert sequence and findings of multiple haplotypes from ancestrally related individuals, demonstrating X/Y recombination. In a separate cohort, aCGH identified three patients containing distinct clusters of only copy number gains across a single chromosome 18 or 22. A combination of Illumina, PacBio, and Sanger sequencing was used to identify and characterize the breakpoints in these patients. For these highly rearranged chromosomes, breakpoint sequences lead to the hypothesis of an origin different from traditional chromothripsis and chromoanasynthesis, possibly a repair process driven by non-canonical non-homologous end joining mediated by polymerase theta. In conclusion, we demonstrate that the PacBio platform provides unique capabilities to detect variation, from single molecules to whole-chromosome rearrangements.
Oct 5 - Knowledge boosting: a graph-based integration approach with multi-omics data and genomic knowledge for cancer clinical outcome prediction
Nov 30 - Establishing Precise Evolutionary History of a Gene Improves Predicting Disease Causing Missense Mutations
Presenter: Igor Zhulin
Abstract: Predicting the phenotypic effects of mutations has become an important application in population genetics studies and clinical genetic diagnostics. Computational tools, such as PolyPhen and SIFT, utilize comparative genomics to evaluate the behavior of the variant over evolutionary time and assume that variants seen during the course of evolution are likely benign in humans. However, due to full automation and applicability to all human genes these tools do not reconstruct the detailed evolutionary history of any given gene, such as assignment of orthologous/paralogous relationships. On the other hand, it is known that paralogs have dramatically different roles in Mendelian diseases. For example, while inactivating mutations in the NPC1 gene cause the neurodegenerative disorder Niemann-Pick C, inactivating mutations in its paralog NPC1L1 are not disease causing and moreover are implicated in protection from coronary heart disease. We identified major events in NPC1 evolution and revealed and compared orthologs and paralogs of the human NPC1 gene through phylogenetic and protein sequence analyses. Based on the results, we built an algorithm to distinguish deleterious from neutral variants. We demonstrated that by removing the NPC1 paralogs and distant homologs from the analysis we can improve the overall performance of categorizing damaging and benign single amino acid substitutions. Our results show that a thorough analysis of gene history followed by identification of functionally equivalent orthologs improves the accuracy in predicting disease-causing missense mutations. We anticipate that this approach will be used as a reference in the interpretation of variants in other genetic diseases as well.
Dec 7 - Canceled - Scheduling mixup
Dec 14 - Canceled - Room 501C needed for exams
Summer 2015 (15 presentations)
May 4 - Canceled
May 11 - Finding biologically relevant interactions in a big data world
Presenter: Arnold Stromberg
May 18 - UK Bioinformatics Education Planning Meeting
May 25 - Memorial Day
June 1 - Batch Effects & Gene Expression Conservation
Abstract: Horses, zebras, and asses represent the only living members of the equid family. This family originated in Northern America some 55 million years ago and flourished into a large number of species during the Tertiary period. The deep evolutionary history of equids is well documented in the paleontological record and represents a textbook example of evolution. However, their recent evolutionary history remains largely unknown. By sequencing the genome of a 700,000 year-old horse, representing the oldest genome hitherto sequenced, our group has shown that the most recent common ancestor of extant equids lived some 4 million years ago. Further genome sequencing for each species within the family illuminated the patterns and processes of the equine radiation, from its early split in the New World to its subsequent migrations into Eurasia and Africa. This revealed large-scale demographic expansions and contractions following major climatic changes as well as the genetic toolkit that underlies the species' adaptations. Importantly, our comparative genome dataset revealed multiple cases of gene flow between species. This shows that the species barrier is not always watertight, and challenges current speciation models, which assume that changes in the chromosomal structure often result in full reproductive isolation. In addition to helping us better understand the processes driving the origins of species and their adaptation, the equid family, which includes no fewer than two domesticated species, also offers a fantastic opportunity to study how humans transformed wild animals into domesticates that best suit their purpose. Using ancient DNA, our group reconstructed the complete genomes from horses that lived prior to domestication and identified 125 genes that have been positively selected since. This conservative set of genes reveals the range of physical, physiological and behavioral functions that have been reshaped by humans during history and antiquity.
Jan 12 - A Negative Binomial Model-Based Method for Differential Expression Analysis Based on NanoString nCounter Data
Presenter: Hong Wang
Type: Research Presentation
Jan 19 - Martin Luther King Day, No seminar
Jan 26 - ComPPI: a cellular compartment-specific database for protein-protein interaction network analysis
Sept 29 - RESCHEDULED - Galaxy: An Open, Web-based platform for Data-Intensive Biological Analysis
Presenter: Jeremy Goecks, George Washington University
Type: Invited Speaker Research Presentation
Abstract: Galaxy (http://galaxyproject.org) is a popular Web-based analysis platform for data-driven biology and for genomics in particular. Galaxy's mission is to make computational biology analyses accessible, reproducible, and collaborative. Galaxy addresses the complete scientific data analysis process, with support for analysis tools and complete histories, reproducible workflows, data visualizations, and interactive publication supplements. Galaxy has been cited more than 1500 times in scientific publications, there are 62 active public servers, and our main public server (http://usegalaxy.org) processes ~130,000 analysis jobs each month. In this talk, I will describe the Galaxy platform, the Galaxy team and community, and future directions for the project.
Oct 6 - IBSeq: An island-based approach for RNA-seq differential expression analysis
Presenter: Abdallah Eteleeb, University of Louisville
Type: Invited Speaker Research Presentation
Abstract: High-throughput mRNA sequencing (also known as RNA-Seq) promises to be the technique of choice for studying transcriptome profiles. This technique provides the ability to develop precise methodologies for transcript and gene expression quantification, novel transcript and exon discovery, and splice variant detection. One of the limitations of current RNA-Seq methods is the dependency on annotated biological features (e.g. exons, transcripts, genes) to detect expression differences across samples. This forces the identification of expression levels and the detection of significant changes to known genomic regions. Any significant changes that occur in unannotated regions will not be captured. To overcome this limitation, we developed a novel segmentation approach, Island-Based (IBSeq), for analyzing differential expression in RNA-Seq and targeted sequencing (exome capture) data without specific knowledge of an isoform. The IBSeq segmentation determines individual islands of expression based on windowed read counts that can be co