Data Selection, Cleansing, and Preprocessing: All modeling projects begin with an evaluation of your data to assess its completeness, consistency, and accuracy, and the need for any preprocessing. This is typically the most time-consuming step in a predictive modeling project. Although all data must be evaluated and cleaned where necessary, the modeling tools that PreMo employs can dramatically reduce the need for preprocessing, making projects quicker and more cost-efficient.
Cleansing your data means answering several questions. 1) Is the data complete, and what specific strategies and steps will be used to fill in missing values? 2) Does the data contain format, keying, or reference errors that make some values incorrect, and what strategies and steps will be used to correct them? 3) What will our strategy be for outlier values (values so far outside the statistical norm that they may be invalid or may degrade model accuracy)?
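The three cleansing questions above can be illustrated with a small sketch. This is not PreMo's actual tooling. It is a minimal example using only the Python standard library, and it assumes median fill for missing values, whitespace and case normalization for keying errors, and Tukey fences (1.5 times the interquartile range) for outliers. The function names and the sample values are hypothetical, chosen for illustration only.

```python
from statistics import median, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values (one possible strategy)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def normalize_code(raw):
    """Repair simple format and keying errors: stray whitespace and inconsistent case."""
    return raw.strip().upper()

def flag_outliers(values, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Made-up order sizes with two missing entries and one suspicious value.
order_sizes = [120.0, None, 95.0, 110.0, 5000.0, 105.0, None, 98.0]
filled = fill_missing(order_sizes)
outlier_flags = flag_outliers(filled)

# Made-up state codes with keying inconsistencies.
states = [normalize_code(s) for s in [" ca", "Ca ", "NY"]]
```

Flagging outliers rather than deleting them keeps the decision open: a flagged value may be invalid, or it may be a genuine extreme that the model should still see.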
Where appropriate, enriching your data with appended external data also occurs at this step.
Additional preprocessing, or integrating data from different client databases, expands the data's potential. You probably have more information than you realize. For example, in CRM models for existing customers, we will look to add information about the recency, frequency, and size of recent orders, and even the method of payment.
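As a rough illustration of deriving recency, frequency, and order-size information from raw order history, the sketch below summarizes orders per customer. It is an assumption-laden example, not a description of PreMo's process: the `rfm_features` function, the `(customer_id, order_date, amount)` record layout, and the sample orders are all hypothetical.

```python
from datetime import date

def rfm_features(orders, as_of):
    """Summarize each customer's orders as recency, frequency, and average order size.

    orders: iterable of (customer_id, order_date, amount) tuples (hypothetical layout).
    as_of:  the reference date from which recency is measured.
    """
    totals = {}
    for customer_id, order_date, amount in orders:
        t = totals.setdefault(
            customer_id, {"last": order_date, "frequency": 0, "total": 0.0}
        )
        t["last"] = max(t["last"], order_date)  # most recent order seen so far
        t["frequency"] += 1
        t["total"] += amount
    return {
        cid: {
            "recency_days": (as_of - t["last"]).days,
            "frequency": t["frequency"],
            "avg_order_size": t["total"] / t["frequency"],
        }
        for cid, t in totals.items()
    }

# Made-up order history for two customers.
orders = [
    ("C1", date(2024, 1, 5), 100.0),
    ("C1", date(2024, 3, 1), 300.0),
    ("C2", date(2024, 2, 10), 50.0),
]
rfm = rfm_features(orders, as_of=date(2024, 3, 31))
```

Other attributes mentioned above, such as method of payment, could be added to each customer's record in the same pass if the source data carries them.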
Although the methodology is represented as steps along a line, the steps can also be iterative. If preliminary model building does not yield the desired accuracy, we may retrace our steps to add data or perform additional preprocessing to make the model more sensitive to the data's hidden patterns.