Adaptive Business Intelligence


Introduction

A survey is carried out on the distribution of household incomes. The survey is supposed to predict the income of each household member in an attempt to determine whether a person makes more than 50,000 a year. Clark and Gelfand (2006) state that new tools for storing statistical data are changing the ways in which scientists interpret and analyze models and data. The report analyzes the income data by work class, age, education, education number (the number of years the income earner was in school), occupation, marital status, relationship, capital gain and loss, race, and working hours per week. The data are analyzed with IBM SPSS Modeler 14.2, and the results are used to predict, from the income test data, whether an income earner makes more than 50,000 annually, so as to help the bank provide lower interest rates to individuals from households making less than 50,000. Beforehand, SPSS Modeler is used to clean up the data, that is, to select and bin it: the Modeler first clears away some of the missing values and removes extreme figures and outliers from the table so as to smooth the given data. Doing so helps increase accuracy for the analyst. Afterwards, some of the data are grouped and reclassified so that the new fields indicate the relevant factors in a clearer and more comprehensible manner.

The best approach to this problem is the use of classification (decision tree analysis) and clustering rules, together with data cleaning, in which inaccurate data in a table are detected and corrected. Clustering is an unsupervised data mining technique: it groups data into different groups without a target attribute, unlike classification. The groups are created according to the overall similarity of the input attribute values. Clustering is particularly useful for placing similar things in the same group so that any interesting group can be studied separately. The clustering technique does not use the training and testing idea that is part of classification. As there are no target variables, there is no distinction between dependent and independent variables. Clusters describe underlying structures in the data for which there is no single 'right' description. For example, one could separate playing cards according to their color, but one could equally choose four clusters and separate the cards according to suit, i.e. clubs, hearts, spades and diamonds.

Classification is a more direct data mining technique. A target variable is present, and the data are classified according to each of their variables. The target variable is called the class variable, and one needs to choose which of the remaining variables will be used as classification inputs. Decision trees are simple but powerful classification techniques. They are considered the easiest technique to interpret, since changes caused by human reasoning or a change of mind can be accommodated within the tree.
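The report carries out this classification in IBM SPSS Modeler 14.2 rather than in code. Purely as a minimal illustrative sketch of the decision tree idea described above, the following Python snippet uses pandas and scikit-learn on a hypothetical income.csv; the file name, the column names and the ">50K" income label are assumptions, not the report's actual field names.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical extract of the income data with fields like those described above.
df = pd.read_csv("income.csv")

# Inputs and target: does the person earn more than 50,000 a year?
X = df[["age", "education-num", "hours-per-week", "capital-gain", "capital-loss"]]
y = (df["income"] == ">50K").astype(int)

# Split into training and test partitions, mirroring the report's partitioning step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a simple decision tree and report accuracy on the test partition.
tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))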
Main Section

Data Cleaning

Some of the income data are picked from Excel; the table is formatted in commonly used spreadsheet software.

There are data for three participants (1, 2 and 3) in column A, four trials in column B, two experimental conditions in column C, the actual reaction times (RT) in column D and, finally, column E coding whether the response was correct or incorrect (1 = correct, 0 = error). As a general rule, reaction times from trials on which the participant made an error should not be included in any subsequent analysis. The only exceptions to this rule are certain tasks where the error trials are of particular interest, such as Go/No-Go tasks. RTs from error trials are thought to be unreliable because an added component process, whatever created the error, operates on those trials. The easiest way to accomplish this is to insert an additional column: code all trials with errors as 0 and trials without an error with their original time.

IF (error = 1) THEN RT = RT ELSE RT = 0

In the table below, I entered the formula =IF(E2=1, D2, 0) in cell F2 and then copied it down the remaining column so that it applies to all of the subsequent data.

Subject  Trial  Condition  RT  Correct  Errors removed
53       1      1          35  1        35
51       1      2          40  1        40
71       1      2          10  0        0
31       1      1          35  1        35
33       1      2          40  1        40
39       1      2          20  1        20
52       1      1          40  0        0
27       1      1          24  1        24
54       1      2          60  1        60
30       1      1          40  1        40
64       1      2          40  1        40
26       1      1          40  1        40

From a theoretical point of view, it is desirable to remove both long and short outliers; for example, since most people earn less than 50,000, a limit of 30,000 can be set. Columns G and H are added, again using IF statements, to remove trials outside the chosen cut-offs:

Subject  Trial  Condition  RT  Correct  Errors removed  Remove < 35  Remove > 40
53       1      1          35  1        35              35           35
51       1      2          40  1        40              40           40
71       1      2          10  0        10              0            10
31       1      1          35  1        35              35           35
33       1      2          40  1        40              40           40
39       1      2          20  1        20              0            20
52       1      1          40  0        40              40           40
27       1      1          24  1        24              0            24
54       1      2          60  1        60              60           0
30       1      1          40  1        40              40           40
64       1      2          40  1        40              40           40
26       1      1          40  1        40              40           40

It can be seen that two short RTs and one long one have been removed and recoded as 0. Reformatting the analyzed data:

Subject  Trial  Condition  RT  Correct  Errors removed  Remove < 35  Remove > 40
53       1      1          35  1        35              35           35
31       1      1          35  1        35              35           35
51       1      2          40  1        40              40           40
71       1      2          10  0        10              0            10
52       1      1          40  0        40              40           40
27       1      1          24  1        24              0            24
33       1      2          40  1        40              40           40
39       1      2          20  1        20              0            20
30       1      1          40  1        40              40           40
26       1      1          40  1        40              40           40
54       1      2          60  1        60              60           60
64       1      2          40  1        40              40           40

Restructuring the data from the last column:

Subject  RT1  RT2  RT3  RT4
53       35   0    40   0
39       0    0    0    20
30       40   40   60   40
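The recoding above is done with spreadsheet IF formulas. As an illustration only, the same recoding could be reproduced in pandas; the column names below simply mirror the tables in this section and are assumptions, and the sketch follows the formula as described (error trials coded 0), not the worked figures in every row.

import pandas as pd

# A few rows of the example data above.
df = pd.DataFrame({
    "subject":   [53, 51, 71, 31, 33, 39],
    "trial":     [1, 1, 1, 1, 1, 1],
    "condition": [1, 2, 2, 1, 2, 2],
    "rt":        [35, 40, 10, 35, 40, 20],
    "correct":   [1, 1, 0, 1, 1, 1],
})

# =IF(E2=1, D2, 0): keep the RT only when the response was correct.
df["errors_removed"] = df["rt"].where(df["correct"] == 1, 0)

# Recode short and long outliers to 0, using the same cut-offs as above.
df["remove_lt_35"] = df["errors_removed"].where(df["errors_removed"] >= 35, 0)
df["remove_gt_40"] = df["errors_removed"].where(df["errors_removed"] <= 40, 0)

print(df)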
Dahan (2014) argues that decision trees require proactively prepared data if predictions are to be made easily. Decision tree models allow one to develop systems that classify future observations or predictions according to a set of rules. The process automatically includes in its rules only the variables that really matter to the decision; variables that do not contribute to the accuracy of the tree are ignored. It is a good way of yielding useful data and of reducing the data to the relevant fields before trying another learning technique, e.g. a neural net.

1. Insert the data into a stream. Depending on the format in which the data are stored, choose the appropriate node from the Sources palette; in this case, for a .csv file, choose the Var. File node, then drag and drop its icon into the stream.
2. Double-click on the Var. File node to insert the data. Click on the ellipsis icon, browse the drive to find the Credit2.csv file and open it.
3. Check the data storage. Choose the Data tab and check the storage of each field. Tick the Override box for the DOB field, choose Date from the drop-down menu, and click Apply and then OK. The stream icon will now show the name of the data file.
4. Check the data types. Choose the Types tab and press the Read Values button; a data type is chosen automatically and can be changed to whatever the task requires.
5. Specify how missing values are recognized. Press Ctrl+A to select the fields, click in any of the fields and select the On option from the pop-up menu. By default, null and blank values are identified as missing. Click the Apply button and then press OK.
6. Insert an output node to view the raw data. Add a Table node from the Output palette and connect it to the data by selecting the source node, pressing F2 and then clicking on the Table node that was just inserted. Double-click on the Table node and click the Execute button in the window that opens. A new window will appear with the model data.
7. Data Audit node and descriptive statistics. To audit the data, use a special node called Data Audit, which allows several features of the database to be checked. From the Output palette, add the Data Audit node and connect it to the data source, i.e. the Credit2.csv node. Double-click on the Data Audit node and specify what is to be included in the data audit. First, tick the Graphs and Basic Statistics boxes. Second, click on the Quality tab and tick the 'Interquartile range from lower/upper quartiles' box; this allows the program to identify the extreme and outlier values within the data. Click Apply and Execute.

There are 30725 values, compared with 32561 in the original data. It can be noted that capital gain and loss, as well as working hours per week, contain some unusual and extreme data. However, as most families have a capital surplus equal to zero, those figures cannot be treated as unusually extreme, so the Modeler is used to clear out only the hours-per-week section (Kahraman, 2012).
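The Data Audit node reports basic statistics and flags outliers and extremes based on the interquartile range. As a rough illustration only, and not the Modeler's exact rule, the following pandas sketch applies a common interquartile-range convention to a hypothetical hours-per-week column; the file name, column name and the 1.5x/3x thresholds are assumptions.

import pandas as pd

df = pd.read_csv("income.csv")  # hypothetical file name

col = df["hours-per-week"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1

# Common convention: outliers beyond 1.5*IQR, extremes beyond 3*IQR from the quartiles.
outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
extremes = df[(col < q1 - 3 * iqr) | (col > q3 + 3 * iqr)]

print(col.describe())
print("outliers:", len(outliers), "extremes:", len(extremes))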
1. 75.1% of household members make an income that is less than 50,000 and only 24.9% make more than 50,000.
2. Clear the entire outlier and extreme section.
3. After clearing the outliers and extremes, select the Partition node to divide the data into two separate sections: one for training data and the other for testing data. In the training section, the factors of age, hours per week, education number and capital surplus (capital gain minus capital loss) are all considered so as to examine their impact on income earning.
4. Add a Type node to the stream, setting income as the target. Then use three different deployment phases, namely CHAID, QUEST and CRT, to test the model and assist in reaching the best decision. Michalewicz (2007) insisted that the algorithms provided generally perform the same function: they examine all database fields to find the best prediction or classification by splitting the data into smaller subgroups. To find the algorithm best suited to our data, each algorithm needs to be tested to determine its classification power. Each algorithm builds an alternative decision tree able to classify the cases in the training sample into one of the initial categories. However, these recommendations can turn out to be erroneous.
5. Figures 1, 2 and 3 show the training data analyzed using the three different deployment phases. Figure 1 shows the CHAID analysis results, with an accuracy of 81.6%. Figure 2 shows the QUEST analysis results, with an accuracy of 79.74%, and lastly, Figure 3 shows the CRT analysis results, with an accuracy of 82.85%.

Model   Overall accuracy   Position
CRT     82.85%             1st
CHAID   81.6%              2nd
QUEST   79.74%             3rd

From the summary table above, CRT and CHAID are the models with the best performance in the training phase.

6. The accuracy of the phases must then be tested to determine the most suitable one to use. The other partition, the test data, is used for this purpose, and the three nodes are applied to it.
7. For accuracy testing, the same three nodes, CHAID, CRT and QUEST, are used; they also help determine the best deployment phase for training. The CHAID node produced accuracies of 82.1% and 81.67%, the CRT node produced 81.65% and, finally, the QUEST node produced an accuracy of 79.42%.

Phase           CHAID    CRT      QUEST
1st (training)  81.67%   82.85%   79.74%
2nd (test)      81.67%   81.65%   79.42%

The two sets of results above show that CRT produces the most accurate model according to the performance evaluation; CRT's error figure is much smaller than CHAID's, 0.107 compared with 0.129 on the test data. Kantardzic (2011) argues that data mining has been made easier by manipulating data and models to create the best results.

8. During the process, the CHAID and CRT trees did not reach a single last terminal node, meaning that the bottom of the branch contains more than one terminal node. As a result, it is very challenging to decide which branch to follow; this report therefore uses the QUEST deployment phase.
9. According to the results achieved with QUEST, individuals with more than 14 years of education have the highest probability of earning more than 50,000 annually. Those who have been in school for fewer than 14 years, though more than 10, are more likely to earn more than individuals with no more than 10 years of education. Individuals with between 10 and 14 years of education who are older than 45 and in a stable relationship have a higher probability of earning more than 50,000 a year than individuals younger than 45. Individuals between 39.5 and 45 years of age who are married have a higher probability of earning more than 50,000 than those younger than 39.5. All this information can be seen below.

Now that it is finally clear which model is most suitable for use, it is a good idea to go through the decision tree and its rules for a better understanding. In this case, QUEST proved to be the most suitable.
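The report compares CHAID, CRT and QUEST trees inside SPSS Modeler. scikit-learn does not implement CHAID or QUEST, so, as an illustration only and in the same spirit as the comparison above, the following sketch contrasts training and test accuracy for a few CART-style tree configurations; the file name and column names are the same hypothetical ones used in the earlier sketches.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("income.csv")  # hypothetical file name
X = df[["age", "education-num", "hours-per-week", "capital-gain", "capital-loss"]]
y = (df["income"] == ">50K").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Compare a few tree depths on the training and test partitions.
for depth in (3, 5, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")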
Conclusion

Most scientific problems require labeling items with one of a finite set of classes depending on the features of the items in the data. For example, oncologists label tumors as different identifiable cancers using patient records, biopsies and other assays. Shimizu et al. (2006) stated that decision trees predict outcomes and assist in making clearer decisions based on the findings from the tree. They are constructed by analyzing a set of training and testing examples for which the classes are known, and are later applied to classify new examples. IBM SPSS Modeler provides a versatile and powerful text and data analysis workbench that assists in building accurate predictive models in the shortest time possible without programming. The tool helps people discover trends and patterns in structured and unstructured data easily through a unique visual interface supported by advanced analytics. Understanding the outcome of the models allows one to take advantage of the opportunities they reveal and to mitigate risks quickly.

References

Clark, J. S., & Gelfand, A. E. (2006). Hierarchical modelling for the environmental sciences: statistical methods and applications. Oxford: Oxford University Press. http://site.ebrary.com/id/10271492
Dahan, H. (2014). Proactive data mining with decision trees. http://dx.doi.org/10.1007/978-1-4939-0539-3
IBM looks to predict success: IBM strengthens its predictive analytics offering with SPSS acquisition. (2009). Insurance & Technology, 34, 9.
Unnikrishnan, S., Surve, S., & Bhoir, D. (Eds.). (2013). Advances in computing, communication, and control: Third International Conference, ICAC3 2013, Mumbai, India, January 18-19, 2013, proceedings. Berlin: Springer. http://dx.doi.org/10.1007/978-3-642-36321-4
Kahraman, C. (2012). Computational intelligence systems in industrial engineering with recent theory and applications. Paris: Atlantis Press. http://dx.doi.org/10.2991/978-94-91216-77-0
Kantardzic, M. (2011). Data mining: concepts, models, methods, and algorithms. http://dx.doi.org/10.1002/9781118029145
Loshin, D. (2012). Business intelligence. San Francisco, CA: Morgan Kaufmann.
Michalewicz, Z. (2007). Adaptive business intelligence. Berlin: Springer. http://site.ebrary.com/id/10152655
Ngwenyama, O., & Osei-Bryson, K.-M. (2014). Advances in research methods for information systems research: data mining, data envelopment analysis, value focused thinking. http://alltitles.ebrary.com/Doc?id=10813381
Shimizu, T., Carvalho, M. M. D., & Laurindo, F. J. B. (2006). Strategic alignment process and decision support systems: theory and case studies. Hershey, PA: IRM Press.
Spiegel, M. R., Srinivasan, R. A., & Schiller, J. J. (2000). Schaum's outline of theory and problems of probability and statistics. New York: McGraw-Hill.