The Data Science Facility at the Donald Danforth Plant Science Center is a computing and data analytics hub that develops and deploys technologies in computer science, mathematics, and statistics to accelerate discoveries from data and models in plant science.
The facility supports computing through several modalities: 1) high-performance computing and workflow management on an HTCondor cluster; 2) virtualized applications using machine- and container-level virtualization; 3) web/database applications and support. Currently, the infrastructure contains over 1,300 processing cores and 2,800 graphics processing cores, more than 8 terabytes of memory, and a single, high-performance 721-terabyte storage area network. These resources are shared in a managed, multi-user environment and communicate via a 10-gigabit Ethernet network. Management of the system is simplified through virtualization of key services, which also allows diverse applications and platforms to be deployed simultaneously.
Services offered by the Data Science Facility include 1) user services: authentication services/user accounts, software installation, patches and upgrades, troubleshooting, advising, Slack (virtual help desk), GitHub (version control), training (system usage, specific software, workflows), documentation, and outreach; 2) computing: cluster resources, web server hosting, database server hosting, maintenance/upgrades, system monitoring, and virtual machine and container management; 3) storage: monitoring, performance configuration, and maintenance. Additionally, the facility consults on the development of computational, data analysis, and experimental design components of proposals and assists with editing of computational and statistical analysis sections of manuscripts. The core facility also offers analysis services, ranging from whole project consulting to individual analyses.
Members of the facility support intellectual development through regular workshops and training events, custom application development for lab or group projects, and community-based sharing of software, ideas, and methods. In addition, the facility enhances interaction between groups at the center and partner institutions and facilitates interoperation between local computing and storage resources and public or private cloud and cyberinfrastructure platforms such as Amazon Web Services, CyVerse, and Open Science Grid.
The Data Science group at the Danforth Center uses and develops computational approaches and infrastructure that leverage large datasets to address biological problems. We emphasize the development of modular, reusable, and open-source tools through collaborator- and community-driven efforts. Our aim is to apply these tools to high-throughput genotyping and phenotyping data to identify the genetic basis of traits in research model plants and biofuel and food security crops.
Rapid, non-destructive measurement of plant physical and physiological features is a key bottleneck in plant research and breeding. Imaging coupled with computer vision algorithms and statistical analysis is a set of technologies with the potential to address the plant phenotyping bottleneck, but these technologies introduce their own computing, interpretation, and data management challenges. Our group develops tools to address these challenges so that the technologies can be used more broadly by the scientific community. Plant Computer Vision (PlantCV) is our primary platform for developing a plant phenotyping toolbox. Through PlantCV we are deploying computer vision, machine learning, and other data science algorithms to extract biologically relevant data from image and sensor datasets.
A major emphasis of the Data Science group is collaboration, which enables us to apply the tools we develop to a variety of plant systems. Diverse candidate biofuel feedstocks such as Camelina sativa (oilseed) and Sorghum bicolor (lignocellulosic feedstock) are major focuses in the group where we are utilizing natural variation and high-throughput phenotyping to study the genetic basis of traits that could improve these crops for bio-based fuels. We are also developing tools for model systems (e.g. Arabidopsis thaliana and Setaria viridis), food security crops (e.g. cassava), and other systems for producing plant natural products (e.g. indigo).