Binary data has only two possible categories, most often represented as a "0" or a "1". A common example is a yes-or-no outcome, where 0 means no and 1 means yes.
Categorical Data is made up of observations belonging to one of a finite number of possible categories. For example, flavors of ice cream in an ice cream shop.
A coefficient is a constant value that multiplies a variable in the scoring formula. A positive coefficient indicates a positive relationship between that variable and your Y-variable, while a negative coefficient indicates a negative relationship.
Veera Predict compares pairs of records, one whose historic outcome was a 1 and one whose outcome was a 0. The more frequently the model scores the actual 1's as more likely than the actual 0's, the more concordant the model is with real outcomes. For more detail, read about Percent Concordance.
A variable that represents data on a continuous scale. For example, age, height, and weight.
Correlation analysis (also called “Pearson Correlation”)
Correlation identifies the relationship between two variables, represented on a scale between -1 and 1. A perfectly positive correlation has an outcome of 1, and a perfectly negative correlation an outcome of -1; in either case, every change in one variable is matched by an exactly proportionate response in the second variable. Generally, the further a correlation is from 0, the stronger that correlation is, positive or negative.
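The calculation behind this scale can be sketched in plain Python; the function name and sample values below are illustrative, not part of Veera Predict:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly positive linear relationship scores 1,
# a perfectly negative one scores -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```

Values between the extremes indicate a weaker, noisier relationship.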
A ranking based on the predicted outcomes of the historical records. Decile 1 consists of the 10% of records that had the highest probability of the model outcome, and decile 10 consists of the 10% of records with the lowest probability of the model outcome.
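The ranking can be sketched with an illustrative helper (not Veera Predict's implementation), which sorts records by predicted probability and labels the top 10% as decile 1:

```python
def assign_deciles(scores):
    """Rank records by predicted probability: decile 1 holds the top 10%,
    decile 10 the bottom 10%."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    deciles = [0] * len(scores)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(scores) + 1
    return deciles

# Ten made-up predicted probabilities, one record per decile.
probs = [0.9, 0.1, 0.5, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 0.05]
print(assign_deciles(probs))  # [1, 9, 5, 2, 7, 3, 8, 4, 6, 10]
```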
Veera Predict compares pairs of records, one whose historic outcome was a 1 and one whose outcome was a 0. The more frequently the model incorrectly scores the actual 1's as less likely than the actual 0's, the more discordant the model is with real outcomes. For more detail, read about Percent Concordance.
A character used to separate columns in one row of a dataset. Common delimiters include commas, tab spacing, and the vertical line (|), also called a "pipe". For instance, a name is often delimited with a comma, as in Smith, John.
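Standard parsing libraries accept the delimiter as a parameter. A small Python sketch using the standard-library csv module, with made-up records; note that when a field itself contains a comma (as in Smith, John), the comma-delimited form must quote it:

```python
import csv
import io

# The same record expressed with two common delimiters.
comma_text = 'id,name,age\n1,"Smith, John",42\n'
pipe_text = 'id|name|age\n1|Smith, John|42\n'

comma_rows = list(csv.reader(io.StringIO(comma_text)))
pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter='|'))

print(comma_rows[1])  # ['1', 'Smith, John', '42']
print(pipe_rows[1])   # ['1', 'Smith, John', '42']
```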
By default, when a model is generated, 50% of the records are randomly selected and used to create the model. Then, the model is scored for accuracy against the other 50%. The sample that is withheld is called the hold-out sample. Holdout sample sizes can be customized within the Modeling Options window underneath the automine tab.
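The split can be sketched in Python as a generic random partition (not Veera Predict's exact sampling code); the seed and fraction are illustrative:

```python
import random

def split_holdout(records, holdout_frac=0.5, seed=42):
    """Randomly withhold a fraction of the records as the hold-out sample."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (modeling sample, hold-out sample)

build, holdout = split_holdout(list(range(100)))
print(len(build), len(holdout))  # 50 50
```

The model is fit on the first half and its accuracy is checked against the withheld half.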
The intercept is the expected mean value of your outcome, when all the variables used in the model equal 0.
Takes a single variable with a “yes” or “no” (binary) outcome and compares how the behavior of the other variables in the data set contribute to the single variable’s “yes” or “no” outcome.
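Scoring one record with a fitted logistic model can be sketched as follows; the intercept, coefficients, and input values below are made up for illustration:

```python
import math

def logistic_probability(intercept, coefs, xs):
    """Score one record: the linear combination of intercept and
    coefficient-weighted inputs, passed through the logistic function."""
    z = intercept + sum(c * x for c, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-z))

# Illustrative model with two predictors.
p = logistic_probability(-1.0, [0.8, 0.5], [2.0, 1.0])
print(round(p, 2))  # 0.75
```

The output is always between 0 and 1, so it can be read as the probability of the "yes" outcome.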
Commonly called the “average”, this is the sum of all the observations divided by the number of observations. Common examples would be mean (average) of age, or mean (average) of income.
An analysis of more than one variable at a time. This analysis explores the behavior of a y-variable (a continuous or binary variable) against changes in another characteristic, or x-variable (continuous, binary, or categorical). A common example would be exploring height (the y-variable) as it relates to shoe size (an x-variable, a continuous characteristic).
Ordinary Least Square (OLS) Regression
Uses a standard regression analysis, with the goal of minimizing the differences between the observed responses (values of the variable being predicted) and those that are predicted. Used when the outcome you’re predicting is continuous, like a test score or the height of a person.
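For a single predictor, the OLS solution that minimizes the squared differences has a closed form. A minimal Python sketch with illustrative numbers:

```python
def ols_fit(xs, ys):
    """Closed-form simple OLS: the line minimizing the sum of squared
    differences between observed and predicted values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

# Made-up data lying exactly on the line y = 2x + 1.
b0, b1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```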
Data values that fall far outside the range of what would be considered normal for the observed population.
The way in which significance is reported statistically. P-values of .01 and .05 are common thresholds. If an outcome has a P-value of less than .01, then there is less than a 1% chance of observing that result by chance alone.
Every statistical test evaluates some hypothesis. In modeling you expect that your outcome variable has a relationship to some, if not all, of the predictive variables. The p-value is a way to identify whether or not that assumption is accurate. In other words, a low p-value provides evidence that the relationship between that variable and your outcome variable is not due to chance.
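One intuitive way to see where a p-value comes from is a permutation test. This sketch (a generic illustration, not a Veera Predict feature) asks how often random shuffling alone produces a difference in group means at least as large as the observed one:

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm

# Clearly separated made-up groups: chance rarely reproduces the gap.
print(permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4]))
```

A small result here means the observed difference is unlikely to be due to chance.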
The percent concordance is the percentage of the time the model's predictions agree with the observed outcomes in the dataset. The concordance is calculated by pairing all records in the modeling dataset based on the actual outcome and determining how often the model ranked those pairs correctly. For example, if you were predicting customer churn, the concordance value would be the percentage of the time that those that churned had a higher probability of churn than those that did not.
Customers that Churned    Customers that Retained
CustomerID  Probability   CustomerID  Probability   Concordant
1           75%           2           83%           No
1           75%           3           41%           Yes
1           75%           4           32%           Yes
1           75%           5           28%           Yes
1           75%           6           15%           Yes
Concordance Rate: 80%
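The pairing logic in the table above can be sketched in Python (the function name is illustrative); it pairs every churned record with every retained record and counts how often the churned record received the higher probability:

```python
def concordance_rates(churned_probs, retained_probs):
    """Pair each churned record with each retained record and return
    (concordance rate, discordance rate)."""
    pairs = [(c, r) for c in churned_probs for r in retained_probs]
    concordant = sum(1 for c, r in pairs if c > r)
    discordant = sum(1 for c, r in pairs if c < r)
    n = len(pairs)
    return concordant / n, discordant / n

# The five pairs from the table: one churner at 75% vs. five retainers.
conc, disc = concordance_rates([0.75], [0.83, 0.41, 0.32, 0.28, 0.15])
print(conc, disc)  # 0.8 0.2
```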
The percent discordance is the percentage of the time the model's predictions disagree with the observed outcomes in the dataset. The discordance is calculated by pairing all records in the modeling dataset based on the actual outcome and determining how often the model ranked those pairs incorrectly. For example, if you were predicting customer churn, the discordance value would be the percentage of the time that those that churned had a lower probability of churn than those that did not.
Customers that Churned    Customers that Retained
CustomerID  Probability   CustomerID  Probability   Discordant
1           75%           2           83%           Yes
1           75%           3           41%           No
1           75%           4           32%           No
1           75%           5           28%           No
1           75%           6           15%           No
Discordance Rate: 20%
An analysis of a single variable. This analysis doesn’t look at causes or relationships, but is rather used to describe data. A common example would be exploring the range and frequency of test scores seen among test-takers.
Wald Chi-Square Test
The Wald Chi-Square Test is used to test the relationship between variables. This is determined by observing values in a dataset and comparing them to the outcome expected under a normal distribution. The variance between the observed outcomes and the expected outcome under a normal distribution is then used to measure the relationship. The larger the Wald Chi-Square score, the stronger the relationship between the Y-variable and the dataset variable in question.
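For a single fitted coefficient, the Wald chi-square statistic is the squared ratio of the coefficient to its standard error, compared against a chi-square distribution with one degree of freedom. A hedged sketch with made-up values:

```python
import math

def wald_chi_square(coefficient, standard_error):
    """Wald chi-square statistic for one coefficient, with its p-value
    under a chi-square distribution with 1 degree of freedom."""
    w = (coefficient / standard_error) ** 2
    # Survival function of chi-square(1 df): erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2))
    return w, p_value

# Illustrative: a coefficient of 0.8 with a standard error of 0.2.
w, p = wald_chi_square(0.8, 0.2)
print(round(w, 2), p)
```

A large statistic (and correspondingly tiny p-value) indicates a strong relationship between that variable and the outcome.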