
Literature Review


The implementation of technology has made operations in the Oil and Gas industry, and by extension the energy sector, more efficient. Production levels have remained intact because data can be used to assist in making new discoveries or predictions, which allows for more effective decision making. This paper reviews the data collection process, as well as the data cleaning and data mining processes.


The data collection process is the most crucial aspect of any developmental purpose. Data collection is defined as a systematic method of gathering data from multiple sources to obtain meaningful information. Accurate data collection is essential, as it maintains the integrity of research, allows for informed business decisions, and ensures quality assurance. Fairweather and Rukenbrod, in their respective articles, emphasize the grave importance of ensuring the collection of accurate and appropriate data. The consequences of inaccurate data include being unable to correctly answer research questions, hampering the reliability and validity of the study, wasting time and resources, and potentially creating legal, moral and ethical problems. Most, Craddick, Crawford, Redican, Rhodes, Rukenbrod, and Laws (2003) describe ‘quality assurance’ and ‘quality control’ as two approaches that assist in preserving data integrity and ensuring the scientific validity of study results. There are several types of data collection instruments that can be used to collect data. No single method can be identified as the best; the most suitable data collection instrument depends on the field of research and on whether qualitative or quantitative data is preferred.

After data has been collected, it should be cleaned. Cleansing data of impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intended to enhance the accuracy, and thereby the usability, of existing data (Müller and Freytag, n.d.). Erhard Rahm and Hong Hai Do define data cleaning as the detection and removal of errors and inconsistencies from data in order to improve its quality. Data cleaning can also be referred to as data preparation or data cleansing. Connolly and Begg, the authors of Database Systems: A Practical Approach to Design, Implementation, and Management, state that the steps for data preparation are as follows: select data, clean data, construct data, integrate data, and format data. Data selection is the procedure of determining the suitable data type and a credible source. Cleaning data is the process of finding corrupt, inaccurate or incomplete records and correcting them. Constructing data involves fact checking: ensuring that the previously cleaned data is valid, credible and reliable. Integrating data involves combining technical and business processes to merge data from disparate sources and turn it into meaningful and valuable information. Formatting data is the process of logically determining how the data should be manipulated. An IEEE Computer Society journal on Data Engineering describes the data cleansing steps as data analysis, definition of transformation workflow and mapping rules, verification, transformation, and backflow of cleaned data. Müller and Freytag describe these steps as data auditing, workflow specification, workflow execution and post-processing control. Other authors and database specialists name the specific stages differently; however, regardless of the name given to each phase, in-depth reading indicates that the fundamental procedure is the same.
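The preparation steps described above can be illustrated with a short sketch in plain Python. The well records, field names and validation rules below are hypothetical examples invented for illustration; they do not come from any of the cited sources.

```python
# Illustrative sketch of the data-preparation steps attributed to Connolly
# and Begg: select, clean, construct, integrate, format.
# All records and rules here are hypothetical.

raw_source_a = [
    {"well": "A-1", "output_bbl": "1200", "date": "2021-01-05"},
    {"well": "A-2", "output_bbl": "", "date": "2021-01-05"},     # incomplete record
    {"well": "A-1", "output_bbl": "-50", "date": "2021-01-06"},  # corrupt value
]
raw_source_b = [
    {"well": "B-7", "output_bbl": "980", "date": "2021-01-05"},
]

# 1. Select: keep only the fields relevant to the analysis.
selected = [{k: r[k] for k in ("well", "output_bbl")} for r in raw_source_a]

# 2. Clean: drop incomplete records and remove corrupt values.
cleaned = []
for r in selected:
    if r["output_bbl"] == "":
        continue                      # incomplete record removed
    value = int(r["output_bbl"])
    if value < 0:
        continue                      # impossible (corrupt) value removed
    cleaned.append({"well": r["well"], "output_bbl": value})

# 3. Construct: fact-check the cleaned data against known constraints.
assert all(r["output_bbl"] >= 0 for r in cleaned)

# 4. Integrate: combine data from disparate sources.
integrated = cleaned + [
    {"well": r["well"], "output_bbl": int(r["output_bbl"])} for r in raw_source_b
]

# 5. Format: arrange the data in the shape the analysis expects.
formatted = sorted(integrated, key=lambda r: r["well"])
print(formatted)
```

Each numbered comment corresponds to one of the five preparation steps; in practice the same operations would typically be carried out with a database or a data-manipulation library rather than hand-written loops.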

Once the data has been collected and cleaned, it can be mined. Data mining is the process of identifying patterns and establishing relationships in data sets. There are five common data mining techniques. The first is Classification Analysis, where data records are sorted into different segments; these segments are analysed to generate a set of grouping rules, which are then used to classify new or future data. The second is Association Rule Learning. This technique is used to identify hidden patterns in data variables that may appear unrelated, and it can be used for examining and forecasting customer behaviour. The next is Anomaly or Outlier Detection, which involves observing items in a dataset that do not match an expected pattern. Anomalies are also known as outliers, novelties, noise, deviations and exceptions, and their observation provides critical and actionable information. The fourth is Clustering Analysis, the process of discovering groups, or clusters, in the data such that the degree of association between two objects is highest if they belong to the same cluster and lowest otherwise. Lastly, there is Regression Analysis, which assists in understanding how the characteristic value of the dependent variable changes when any one of the independent variables is varied. This means one variable depends on another, but not vice versa.
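As a concrete illustration of the last technique, regression analysis on a single independent variable can be sketched in a few lines of plain Python. The data points below are hypothetical, and the calculation shown is ordinary least squares, one common form of regression.

```python
# A minimal sketch of regression analysis: ordinary least squares with one
# independent variable. The data points are hypothetical examples.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # independent variable
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # dependent variable

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# The slope expresses how the dependent variable changes
# as the independent variable is varied.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))   # slope ≈ 1.99, intercept ≈ 0.09
```

The fitted slope quantifies the one-directional dependence described above: it predicts the dependent variable from the independent one, not the reverse.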

In conclusion, it is noted that the processes of data collection, data cleaning and data mining are interdependent. Applying the correct technique to cleaned data can result in useful information that solves a variety of business problems and needs.