Data Driven Decision Making with SAS Enterprise Miner Case Study

Problem description

The health care facility provides health care services to approximately five million veterans. Of these, approximately 200,000 are HIV positive. Providing HIV care services effectively requires accurate data on the number of infected and registered persons. A large clinical database, the Immunology Case Registry (ICR), currently holds approximately 20,000 HIV-infected patients. The current algorithm for entering patients into the ICR is based on positive HIV test results and on a small number of selected HIV and AIDS ICD-9 codes. The accuracy and completeness of this algorithm is a key determinant of how well patients are managed. However, the algorithm is prone to errors from both random and systematic sources, which calls the accuracy and completeness of the ICR into question.

Applying supervised data mining methods to develop an algorithm for patient identification and testing is intended to address the shortfalls of the preceding methodology. The current algorithm depends entirely on diagnostic codes to determine the number of persons with HIV infection, whereas additional variables, including socioeconomic, geographic, laboratory, service utilization and pharmacy data, also play a major role. The proposed algorithm is developed by incorporating logistic regression (LR), neural network (NN) and decision tree (DT) models in SAS Enterprise Miner. The aim is to predict a binary outcome variable that determines HIV status and entry into the ICR.

Data needs

Data requirements are met through a data mining approach, that is, a process of finding meaningful patterns in large data sets that explain past events.
The process is carried out so that the patterns can be applied to new data to predict future events. A source of cases with known status is needed for training and validating predictive models. The data requirements are met through a series of steps that fall into two broad groups: data preparation and data analysis.

Data preparation consists of four steps: building the analysis dataset, preparing the data, identifying pre-classified cases and creating the modeling sample. The first step entails building an Analysis Dataset (AD) from varied data sources. The AD has a single record for each patient. The VA databases include the National Patient Care Database (NPCD), the Decision Support System (DSS) and the Pharmacy Benefits Management (PBM) package. The data preparation step entails imputing missing values, cleaning the data and reducing its dimensionality, deriving new variables, and finally summarizing and transforming the data to keep one record per patient. This is done in consultation with clinical experts, who have better knowledge of the most important variables. Using available records, it is possible to identify pre-classified cases, which are either True Positive (TP) or True Negative (TN). True positives are patients who have already tested positive, while true negatives are patients who have tested negative. Once the data has been pre-classified, samples can be created from it. The samples represent the different variables identified.

After data preparation is complete, data analysis can commence. The steps are: developing and evaluating case-finding models, comparing model performance with a reference model, and applying the new algorithms. Data analysis can be conducted using SAS Enterprise Miner 5.2, SAS 9.3.1 and R software.
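The data preparation step described above (collapsing source records to one record per patient, deriving new variables and imputing missing values) can be illustrated with a minimal sketch. It is written in Python rather than SAS purely for illustration. The field names (`hiv_icd9`, `cd4_count`, `hiv_rx`), the sample rows and the choice of median imputation are all assumptions made for this example and do not come from the case study.

```python
from collections import defaultdict
from statistics import median

# Hypothetical visit-level rows, as might be pulled from several source systems.
# Field names and values are illustrative only.
visits = [
    {"patient_id": 1, "hiv_icd9": 1, "cd4_count": 350, "hiv_rx": 1},
    {"patient_id": 1, "hiv_icd9": 0, "cd4_count": None, "hiv_rx": 1},
    {"patient_id": 2, "hiv_icd9": 0, "cd4_count": None, "hiv_rx": 0},
    {"patient_id": 3, "hiv_icd9": 0, "cd4_count": 800, "hiv_rx": 0},
]


def build_analysis_dataset(rows):
    """Summarize visit-level rows into one record per patient."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient_id"]].append(row)

    records = []
    for pid, group in sorted(grouped.items()):
        cd4 = [r["cd4_count"] for r in group if r["cd4_count"] is not None]
        records.append({
            "patient_id": pid,
            "n_visits": len(group),                             # derived variable
            "any_hiv_icd9": max(r["hiv_icd9"] for r in group),  # derived indicator
            "any_hiv_rx": max(r["hiv_rx"] for r in group),      # derived indicator
            "min_cd4": min(cd4) if cd4 else None,
        })

    # Impute missing lab values with the median of the observed values
    # (an assumed strategy), keeping a flag so the model can see what was imputed.
    observed = [r["min_cd4"] for r in records if r["min_cd4"] is not None]
    fill = median(observed)
    for r in records:
        r["cd4_missing"] = int(r["min_cd4"] is None)
        if r["min_cd4"] is None:
            r["min_cd4"] = fill
    return records


analysis_dataset = build_analysis_dataset(visits)
for record in analysis_dataset:
    print(record)
```

Keeping a missing-value flag next to the imputed value is one common design choice, since in clinical data the absence of a test can itself be informative.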
As identified in the pre-classification stage, the sampled TP and TN cases can be used for model development, comparison and validation. Case-finding models that predict the binary target, HIV status, are developed and evaluated using three data mining methods: logistic regression, artificial neural networks and decision trees. For each method, several sets of input variables, tuning parameters and model specifications are used. All models are then cross-validated using the training and validation datasets.

Logistic regression predicts the probability that a binary target variable will have the event of interest as a function of one or more independent inputs. Decision tree models, on the other hand, are procedures that repeatedly segment the data. The partitioning is done using algorithms and rules that maximize the uniformity of the target variable within each segment. The decision tree provides a set of rules that can be applied to new datasets to give new predictions. Artificial neural networks (NN) are a set of flexible non-linear models that allow interactions between input variables. They are computer models inspired by the interactions of neurons in the human brain. The model uses the multi-layer perceptron architecture, which has three layers: input, hidden and output. Neurons within a layer are not linked to one another; instead, they are connected to the subsequent layer through activation functions.

The reference model (RM) approximates the current algorithm for entry into the registry. It has one input variable, the indicator for HIV-specific ICD-9 codes. The next step in the analysis is comparing model performance. In the previous steps, the best models, those with the lowest error rates, are selected. The method currently in use is represented by the reference model, so in this stage the three selected models are compared with the reference model.
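To make the comparison against the reference model concrete, the sketch below scores a candidate model and a single-indicator reference model on the same labelled cases. It uses two measures that are commonly applied to binary classifiers: the misclassification rate taken from a confusion matrix, and the AUC computed as the share of positive/negative pairs in which the positive case receives the higher score. The labels, the scores and the 0.5 cut-off are invented for illustration and are not results from the case study. No significance test for the difference between AUCs is included.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (TP, FP, TN, FN) after thresholding the scores."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def misclassification_rate(y_true, y_score, threshold=0.5):
    """Share of cases whose predicted class differs from the actual class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return (fp + fn) / (tp + fp + tn + fn)


def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the Mann-Whitney formulation.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out cases: 1 = true positive case, 0 = true negative case.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical predicted probabilities from a candidate model.
candidate = [0.9, 0.8, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1]
# Reference model: the ICD-9 indicator itself is used as the score.
reference = [1, 1, 0, 0, 0, 0, 0, 1]

for name, scores in [("candidate", candidate), ("reference", reference)]:
    sens, spec = sensitivity_specificity(y_true, scores)
    print(name, misclassification_rate(y_true, scores), sens, spec,
          auc(y_true, scores))
```

The pairwise AUC loop is quadratic in the number of cases, which is fine for a small illustration; a rank-based computation would be preferable for samples of tens of thousands of patients.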
Two criteria can be used to compare these models: the misclassification rate, which is compared on the validation set, and the AUC index, which is compared on the test set. For the first criterion, a confusion matrix is used to compare actual and predicted values of the binary target variable. It classifies each case into one of four possible outcomes: False Negative, False Positive, True Negative and True Positive. Specificity and sensitivity are further useful measures that can be derived from the confusion matrix. The AUC index, on the other hand, evaluates the overall performance of each model. It can be understood as the probability that, if one positive and one negative case are picked at random, the model assigns a higher score to the positive case than to the negative one. A Mann-Whitney two-sample statistic is used to test the difference in the AUC indices between the models and the RM.

Model and application

After the data had been pre-classified, a sample consisting of 30,950 patients was created. These patients comprised all TPs; the same number of TNs was selected at random. The sample was then partitioned into three subsets: 40% for training, 30% for validation and 30% for testing. These subsets were used in the development of the model.

Education is an important sector with many pitfalls, especially in information management and handling. It is the backbone of any country and of the world as a whole. The more educated people are, the better the learning and education system becomes. The problem I have decided to look into is the strategies employed in schools and how decisions are made about the general situation that prevails in them. Let us get this straight: the systems adopted in schools seem to be lacking (Ikemoto, & Marsh, 2007).
The data collected are mostly specific to a particular country or sector and take little account of other areas or of the world as a whole. Decisions should be based on the data obtained from these systems, so as to give a broader perspective and to foster flexibility in learners regardless of the area and system a learner passes through. The problems that arise from poor decisions are wastage of resources and a lack of specific, logical aims for the educators in any particular system. There is a great deal of data on schools that do not perform well. The general performance of students is affected by many important factors, such as ethnicity, background, poverty and race. These are not seriously taken into account in decision making, yet they should be. The other issue is the performance of schools worldwide. The kind of performance is as shown in the graph below: The performance of the students is clearly low, judging from the graph. This shows that a lot needs to be done, not only in one region but in Asia, Africa and even Latin America. Decisions based on data of this kind will ensure that all the needs of the schools are catered for. The data needed to address this issue successfully include assessments of classrooms for learning, the differing education systems, and how those systems have affected learning. Other data include the challenges that schools face at all levels, from the district to the highest levels. This will ensure that the main challenges, such as ethnicity, people's beliefs, race and poverty levels, are catered for when plans for improving education levels are drawn up (Park, & Datnow, 2009).
This data could be gained through research and one-on-one talks with the schools. The research might use technology such as Skype for well-resourced schools, together with questionnaires and interviews, to ensure that accurate results are obtained. Another method of obtaining the data would be collaborative inquiry into the leadership employed in the schools. Data collection would also involve considerable research into other issues, including the quality of e-learning. The process to be used could be as follows: development of a leadership team, collection and organization of the different types of data, analysis of the data patterns, generation of hypotheses, development of goal-setting guidelines, design of specific strategies for the action plan, evaluation of the data and, finally, implementation. The data required here must be accurate to ensure that resources are not wasted (Hedgebeth, 2007). Alongside the long-term goals in the plans, there will also be short-term goals, which should be achieved within a specific time frame so that the process stays on track and the needed adjustments are made. The teams and sectors involved in collecting the data should also have goals and aims of their own, since their effectiveness will be crucial to the collection of the data. There are many variations in the data that will be obtained; therefore, the analysis will have to be multivariate, given the many variables under consideration, so that all the important aspects of the data are covered. This will make it easier to analyze decisions that cut across areas and are related to one another.
It allows related issues to be dealt with jointly where possible, which supports quality enhancement and assurance (Bertsimas, & Thiele, 2006). Multivariate analysis would be the best option because the data, and the conclusions drawn from it, can easily be examined from different perspectives. My conclusion regarding this very important sector of the economy of the country, and even of the world, is this: if decisions are made with deep insight into the data, the aspects that need to be looked into first and require fast action will be given the weight they need, and steps can be taken towards improving the quality of education in all sectors. Education is important, but if the systems do not favor the kind and quality of education needed to foster flexibility and research in learners on prevailing issues, the number of people who are uneducated, or educated but unemployed, will continue to rise. SAS Enterprise Miner is a data analysis tool that offers not only insight into the data obtained but also predictions from it. The data here would include the schools' performance over the years, the kind of leadership, the classrooms, the teaching staff and the general challenges that arise in all sectors of the schools. It can be collected at district, provincial, county and even country level. SAS Enterprise Miner can be very important for this kind of research simply because the knowledge needed can be easily obtained and compared against the needs of our schools, their level of syllabus coverage and the conditions under which learners study, among other issues that determine quality performance and the relevance of the knowledge learners obtain (Marsh, Pane, & Hamilton, 2006).
SAS Enterprise Miner is fast, so any adjustments required during the process can be analyzed easily, and adjustments during the implementation of the plan can be made quickly and specifically as issues arise.

References

Bertsimas, D., & Thiele, A. (2006). Robust and data-driven optimization: Modern decision-making under uncertainty. INFORMS Tutorials in Operations Research: Models, Methods, and Applications for Innovative Decision Making.

Hedgebeth, D. (2007). Data-driven decision making for the enterprise: An overview of business intelligence applications. VINE, 37(4), 414-420.

Ikemoto, G. S., & Marsh, J. A. (2007). Chapter 5: Cutting through the "data-driven" mantra: Different conceptions of data-driven decision making. Yearbook of the National Society for the Study of Education, 106(1), 105-131.

Isaacs, M. L. (2003). Data-driven decision making: The engine of accountability. Professional School Counseling, 6(4), 288-95.

Mandinach, E. B., Honey, M., & Light, D. (2006, April). A theoretical framework for data-driven decision making. In annual meeting of the American Educational Research Association, San Francisco, CA.

Marsh, J. A., Pane, J. F., & Hamilton, L. S. (2006). Making sense of data-driven decision making in education.

Park, V., & Datnow, A. (2009). Co-constructing distributed leadership: District and school connections in data-driven decision-making. School Leadership and Management, 29(5), 477-494.

Wayman, J. C. (2005). Involving teachers in data-driven decision making: Using computer data systems to support teacher inquiry and reflection. Journal of Education for Students Placed at Risk, 10(3), 295-308.

WEB, S. L. O., & LESSON, W. P. Data-Driven Decision Making.

Wohlstetter, P., Datnow, A., & Park, V. (2008). Creating a system for data-driven decision-making: Applying the principal-agent framework. School Effectiveness and School Improvement, 19(3), 239-259.
