What is Data Science Modeling?

Data science is an interdisciplinary field that leverages statistics, computer science, and machine learning to extract information from both structured and unstructured data.


Statistics

Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting empirical data.

Computer Science

Computer Science is the study of computation, automation, and information. It spans both theoretical disciplines, such as algorithms, theory of computation, and information theory, and practical disciplines.

Machine Learning

Machine Learning is the study of algorithms that automatically identify patterns in data. These algorithms are capable of learning and improving over time as additional data is made available to the algorithm.

Why is Data Science Valuable in Reliability?

Adding data science to traditional reliability methods allows models to continually evolve and learn, preventing results from becoming stagnant. Data science and machine learning eliminate the need for humans to code rules that tend to be fragile and hard to work with. The goal of data science is to set up algorithms that learn and improve automatically without a significant amount of human intervention, ultimately creating a better solution than what could have been coded by hand. A Data-Driven Reliability framework can elevate your program's performance and allow you to connect data to business decisions.

Machine Learning Models Leveraged in the Data-Driven Reliability Framework

Linear Regression

Linear Regression is a statistical method that estimates the relationship between a dependent variable and one or more independent variables by modeling that relationship as a linear function.
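As a minimal sketch of the idea, the following fits a straight line by ordinary least squares using only the Python standard library. The function name `fit_line` and the wall-thickness readings are hypothetical and chosen purely for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical wall-thickness readings (inches) against years in service.
years = [0, 2, 4, 6, 8]
thickness = [0.500, 0.480, 0.460, 0.440, 0.420]

slope, intercept = fit_line(years, thickness)
print(f"Estimated thinning rate: {-slope:.3f} in/yr, starting thickness: {intercept:.3f} in")
```

In this made-up data set the slope is the thinning rate per year, which is the kind of quantity a degradation model would build on.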

Non-Linear Regression

Non-Linear Regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables.


Classification

Classification is the process of predicting a class-based outcome from a set of independent variables.

Support Vector Machines

Support Vector Machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

Pattern Recognition

Pattern Recognition is a data analysis method that uses machine learning algorithms to automatically recognize patterns and regularities in data.

Bayesian Modeling

Bayesian models are statistical models that make probability-based inferences from data. Unlike other statistical models, however, Bayesian models enable one to specify a prior distribution on the model parameters of interest. This prior distribution represents any beliefs that we have about how the model should behave. As an example, a prior distribution can encode the beliefs of a subject matter expert. The prior serves to regularize the statistical model and is especially useful when working with a relatively small amount of data.
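To make the role of the prior concrete, here is a minimal sketch of a conjugate normal-normal update, assuming the measurement noise is normal with a known standard deviation. The function name `update_normal`, the expert's prior, and the measured corrosion rates are all hypothetical; this illustrates the general technique, not any specific reliability model.

```python
def update_normal(prior_mean, prior_sd, observations, obs_sd):
    """Posterior mean and variance for a normal prior and normal likelihood
    with known observation standard deviation (conjugate update)."""
    prior_var = prior_sd ** 2
    obs_var = obs_sd ** 2
    n = len(observations)
    # Precisions (inverse variances) add; the posterior mean is a
    # precision-weighted blend of the prior mean and the data.
    post_precision = 1 / prior_var + n / obs_var
    post_mean = (prior_mean / prior_var + sum(observations) / obs_var) / post_precision
    return post_mean, 1 / post_precision

# Hypothetical: an expert believes the corrosion rate is about 5 mpy (sd 2),
# and three measurements suggest a higher rate.
measured_rates = [8.0, 7.0, 9.0]
post_mean, post_var = update_normal(5.0, 2.0, measured_rates, obs_sd=2.0)
print(f"Posterior mean: {post_mean:.2f} mpy, posterior variance: {post_var:.2f}")
```

With only three measurements, the posterior mean lands between the expert's prior (5) and the sample mean (8), which is the regularizing effect described above. As more data arrives, the data term dominates.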

Graph Theory

Graph Theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) that are connected by edges (also called links or lines). One common use of graph theory is finding the shortest path through a road system or other network.
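The shortest-path use case can be sketched with Dijkstra's algorithm, one standard approach for graphs with non-negative edge weights. The small directed network below and the function name `shortest_path` are hypothetical.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm on a dict of {node: {neighbor: weight}}."""
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue  # stale queue entry; a shorter route was already found
        visited.add(node)
        if node == goal:
            break
        for neighbor, weight in graph.get(node, {}).items():
            new_dist = d + weight
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                prev[neighbor] = node
                heapq.heappush(queue, (new_dist, neighbor))
    if goal not in dist:
        return float("inf"), []
    # Walk backwards from the goal to rebuild the route.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]

# Hypothetical directed network; weights could be distances or costs.
network = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}
distance, path = shortest_path(network, "A", "D")
print(distance, path)
```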


Resampling

Resampling covers a variety of methods that reuse the available data: estimating the precision of sample statistics from subsets of the data or by drawing randomly with replacement from a set of data points, running permutation tests, and validating models on random subsets. These methods are useful when data is scarce or contains gaps.

    • Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
    • Jackknifing is used in statistical inference to estimate the bias and standard error of a statistic calculated from a random sample of observations, by recomputing the statistic with one observation left out at a time.
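Both methods can be sketched in a few lines of standard-library Python. The helper names `bootstrap_se` and `jackknife_se`, the resample count, the seed, and the sample of corrosion rates are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def bootstrap_se(data, stat, n_resamples=2000, seed=0):
    """Standard error of `stat`, estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [
        stat([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    ]
    return stdev(estimates)

def jackknife_se(data, stat):
    """Standard error of `stat`, estimated by leaving out one point at a time."""
    n = len(data)
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return sqrt((n - 1) / n * sum((e - center) ** 2 for e in leave_one_out))

# Hypothetical sample of measured corrosion rates (mpy).
rates = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.6]
boot = bootstrap_se(rates, mean)
jack = jackknife_se(rates, mean)
print(f"bootstrap SE: {boot:.3f}, jackknife SE: {jack:.3f}")
```

For the sample mean, the jackknife estimate coincides with the textbook formula s / sqrt(n), which makes it a convenient sanity check. Either function accepts other statistics, such as `statistics.median`.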

Applications of Data Science in Reliability

Data-Driven Reliability Models

Lifetime Variability Curves (LVCs) are an example of this type of model. LVCs predict when assets are likely to fail based on a facility’s data. That data usually arrives in a wide range of formats, and Bayesian techniques help Subject Matter Experts (SMEs) steer the models toward something fit for use.

Big Data in Mechanical Integrity: The Next Generation of Corrosion Models

Listen to Andrew Waters (Director, Data Science) and Fred Addington (Principal of Corrosion Technology) discuss how recent advances in technology have transformed the way the industry can use “big data.” From CML Optimization to Corrosion Modeling, “big data” is shaping the future of how we build reliability.

Condition Monitoring Location (CML) Prioritization

Unlike traditional CML Optimization methods, CML Prioritization takes degradation rates, historical inspection data, maintenance histories, and the uncertainty of the facility’s data into account. As part of this method, an LVC is created for every CML to predict a range of dates in which the CML is likely to fail. From this information, CML Prioritization recommends which CMLs require inspection activities in the near term and which can be deferred to a future date.
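The LVC method itself is not detailed here, so the following is only a simplified stand-in for the ranking step: it assumes each CML already has a low and high thinning-rate estimate, converts those into a range of remaining life, and sorts CMLs by the earliest plausible failure. All field names, values, and the function names `remaining_life_range` and `prioritize` are hypothetical.

```python
def remaining_life_range(thickness, t_min, rate_low, rate_high):
    """Years until thickness reaches t_min, under the fastest and slowest rates."""
    margin = thickness - t_min
    return margin / rate_high, margin / rate_low  # (earliest, latest)

def prioritize(cmls):
    """Rank CMLs so the one that could fail soonest comes first."""
    scored = []
    for c in cmls:
        earliest, latest = remaining_life_range(
            c["thickness"], c["t_min"], c["rate_low"], c["rate_high"]
        )
        scored.append({"id": c["id"], "earliest": earliest, "latest": latest})
    return sorted(scored, key=lambda r: r["earliest"])

# Hypothetical CMLs: thicknesses in inches, rates in inches per year.
cmls = [
    {"id": "CML-01", "thickness": 0.40, "t_min": 0.25, "rate_low": 0.005, "rate_high": 0.015},
    {"id": "CML-02", "thickness": 0.30, "t_min": 0.25, "rate_low": 0.002, "rate_high": 0.010},
    {"id": "CML-03", "thickness": 0.45, "t_min": 0.25, "rate_low": 0.004, "rate_high": 0.008},
]

ranked = prioritize(cmls)
for r in ranked:
    print(f'{r["id"]}: {r["earliest"]:.0f} to {r["latest"]:.0f} years')
```

A real implementation would derive the rate range from inspection history and its uncertainty rather than taking it as a given input.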

Task Optimization

A facility must complete a variety of tasks, such as inspections, repairs, and maintenance, to ensure its assets function properly. Task Optimization (TO) finds the “optimal” set of tasks that should be done and tells a facility when to do them to reduce downtime and costs and improve overall facility availability and throughput. TO is a deliverable of Quantitative Reliability Optimization (QRO) and leverages multiple data science techniques, such as Bayesian Modeling and LVCs, to create a regression model.

Throughput Modeling

This deliverable of QRO utilizes graph theory, network analysis, and Bayesian modeling to analyze a facility’s assets, how they are connected, and how much product they can produce. Additionally, throughput modeling predicts when assets are likely to fail and need to be repaired, and forecasts how much total product the facility can expect to produce over time. This can ultimately help facilities look for bottlenecks that are hurting their production capacity and make improvements as needed.

Natural Language Processing

Natural Language Processing (NLP) and classification can be leveraged for a broad range of applications such as:

  • Inspection Grading
  • Quality Checks
  • U-1 Form Mining
  • Incident Log Analysis


NLP can improve inspection grading by eliminating the need for people to read reports and make subjective judgments about inspection quality. More broadly, NLP applies machine learning to many kinds of document-related tasks, such as mining U-1 forms. These tasks are tiresome for humans and prone to error. Using data science techniques to have a computer perform them automatically can substantially reduce costs and improve accuracy.
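As a rough illustration of text classification, the sketch below trains a tiny multinomial naive Bayes model with add-one smoothing to label report snippets. This is a generic textbook technique, not the method used in any product described here; the training snippets, labels, and function names `train` and `predict` are invented for the example.

```python
from collections import Counter, defaultdict
from math import log

def train(docs):
    """docs: list of (text, label). Returns counts for multinomial naive Bayes."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled snippets from inspection reports.
training = [
    ("thickness readings recorded at all locations", "complete"),
    ("photos and thickness readings attached", "complete"),
    ("no readings recorded", "incomplete"),
    ("missing photos and missing measurements", "incomplete"),
]
model = train(training)
print(predict(model, "missing readings"))
print(predict(model, "thickness readings attached"))
```

A production system would need far more training data, better tokenization, and validation against expert-graded reports.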

Case Study Highlight

North American Refiner Leverages Data Science to Improve Quality Control in PMI Program

A refiner was experiencing significant failures in its HF Alky units across several sites and traced part of the problem back to errors on material intake forms. Pinnacle delivered a data science pilot project that efficiently performed quality checks on over 420K documents and reduced the organization’s workload by over 95% through Machine Learning and Natural Language Processing. These methods can be used in several additional applications, including inspection grading, data mining, and document analysis.

Corrosion Modeling

Data science can aid in the creation of unit studies that use data to estimate how quickly an asset is expected to degrade as a function of inputs like temperature, pressure, metallurgy, and stream information. It can also assign damage mechanisms automatically and reveal how subjective these assignments currently are when left to human discretion. These tools are leveraged to support Subject Matter Expert (SME) knowledge and create consistency in the industry.

Anomaly Detection

Imagine a rotating asset, such as a pump, fitted with a variety of sensors (pressure, temperature, flow, etc.). When things are working normally, the sensors report certain streams of information. Data science can be used to establish the baseline performance of an asset and then sound the alarm when anomalies occur. Additionally, anomaly detection gives facilities the ability to identify what type of problem is occurring (for example, a failing impeller or worn bearings) and then predict failure dates and recommend maintenance activities based on the specific problem.
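One simple way to express the baseline-then-alarm idea is a z-score check: learn the mean and standard deviation from readings taken during normal operation, then flag new readings that fall too far from that baseline. This is a minimal sketch; the threshold of 3 standard deviations, the function name `find_anomalies`, and the pressure values are assumptions, and real sensor streams usually call for more robust methods.

```python
from statistics import mean, stdev

def find_anomalies(baseline, readings, threshold=3.0):
    """Return (index, value) for readings more than `threshold` standard
    deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [
        (i, x) for i, x in enumerate(readings)
        if abs(x - mu) / sigma > threshold
    ]

# Hypothetical discharge-pressure readings (psi) during normal operation.
baseline_pressure = [100.2, 99.8, 100.1, 99.9, 100.0, 100.3, 99.7, 100.0]
# Hypothetical new readings, including one spike and one drop.
new_pressure = [100.1, 99.9, 103.5, 100.0, 96.0]

anomalies = find_anomalies(baseline_pressure, new_pressure)
print(anomalies)
```

Identifying the type of problem, as opposed to merely detecting that something is off, would require a classifier trained on labeled failure data.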

Case Study Highlight

CML Optimization Pilot Project Helps Refinery Reduce Risk and Identify Minimum Reduced Inspection Spend of $384K

Pinnacle leveraged an intelligence model rooted in data science and informed by Subject Matter Expert (SME) knowledge to enhance traditional CML Optimization. CML Prioritization, the solution presented in this case study, is a new CML Optimization methodology that predicts future thinning based on past performance from that same CML across repairs and replacements, prioritizing CMLs by the risk posed to safety and production prior to the next scheduled action. Unlike traditional methods, CML Prioritization takes degradation rates and the uncertainty of the facility’s data into account, both of which affect the accuracy of predictions. This solution also creates a model that can be dynamically updated to maximize reliability and return on investment (ROI) as the latest information becomes available.
