Big data resembles a flood of data: the abundance of data grows day by day, and the term refers to this enormous scale. The data may be structured, unstructured, or semi-structured. Structured data consists of records that can be displayed in rows and columns, and it can be processed easily. Unstructured data is the opposite: it cannot be stored in a relational database. Examples of unstructured data include word-processing documents, presentations, audio, video, email, and many other business documents. The third category, semi-structured data, includes XML, JSON, and NoSQL databases. The term big data is strongly linked with unstructured data; roughly 80% of the data in big data is unstructured.

Strictly speaking, big data refers to data that cannot be handled by a traditional database. Traditional database systems hold data on the order of gigabytes, while big data systems store data in petabytes, exabytes, zettabytes, and beyond. Companies need to retain or hire highly experienced staff to gain a deep analytical view of big data. The volume of big data keeps increasing on the most popular social sites such as Facebook and Twitter, and the understanding of big data differs across business, technology, and industry contexts. McKinsey identified five sectors in which data grows rapidly: healthcare, the public sector, retail, manufacturing, and personal location data. The main advantages of big data are scalability and data analytics. Real-life examples of big data include banks, social media, web data, and daily transactions of every kind.

The definition of big data is completed by five V's: volume, variety, velocity, veracity, and value. Here are the V's of big data, explained in plain language.
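To make the difference between structured and semi-structured data concrete, here is a minimal Python sketch; the record fields are invented for illustration, with the same person shown once as a fixed structured row and once as nested, self-describing JSON:

```python
import json

# Structured: a fixed schema of rows and columns, as in a relational table.
structured_row = ("1001", "Alice", "2024-01-15")  # (id, name, signup_date)

# Semi-structured: JSON carries its own field names and can nest freely,
# which a flat relational row cannot express directly.
semi_structured = json.loads(
    '{"id": "1001", "name": "Alice", '
    '"orders": [{"item": "book", "qty": 2}]}'
)

# Nested fields are reached by key, not by column position.
print(semi_structured["orders"][0]["item"])  # book
```

The structured row only makes sense together with an external schema, whereas the JSON document describes itself, which is why formats like JSON and XML are classed as semi-structured.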
Volume: In big data terms, the word "big" refers to volume. In the future, data will be measured in zettabytes. A large amount of data is shared through social networking sites. Here are some interesting statistics from Internet Live Stats showing the volume of data generated in one second (unless noted otherwise):
· 64,551 Google searches
· 7,886 tweets on Twitter
· 822 Instagram photos uploaded
· 72,179 YouTube videos viewed
· 2,655,007 emails sent, including spam
· 52,180 GB of internet traffic
· 2.5 million pieces of content shared by Facebook users
· 571 websites created every minute of the day
Variety: As discussed above, data comes in structured, semi-structured, and unstructured forms. These mixed types of data are difficult to handle with a traditional database system. This diversity of data types is called variety. These days, a great deal of unstructured data is being generated.
Velocity: The speed at which data is created is known as velocity. A few examples of data spawned by social networking sites are tweets on Twitter and statuses, comments, and shares on Facebook, among many others. Data is generated in real time, near real time, hourly, daily, weekly, monthly, yearly, in batches, and so on.
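One simple way to quantify velocity is as an arrival rate. The sketch below, using made-up timestamps standing in for a stream of tweets or shares, computes events per second over an observed window:

```python
from datetime import datetime, timedelta

# Hypothetical event timestamps: ten events arriving 200 ms apart,
# standing in for a real-time feed such as tweets or Facebook shares.
start = datetime(2024, 1, 1, 12, 0, 0)
events = [start + timedelta(milliseconds=200 * i) for i in range(10)]

# Velocity as a simple arrival rate: events per second over the window
# spanned by the first and last event.
window = (events[-1] - events[0]).total_seconds()
rate = len(events) / window if window else float("inf")
print(f"{rate:.1f} events/sec")  # 5.6 events/sec
```

A real pipeline would compute this over a sliding window on a live stream; the point here is only that velocity is a rate, distinct from the total volume of data.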
Veracity: The trustworthiness of data. The attributes of veracity include the accuracy, integrity, and authenticity of data. Low veracity leads to uncertainty about whether the data has been verified or not.
Vagueness: Confusion about big data is called vagueness. Many different tools are used to handle big data: should it be Hadoop, Hive, MapReduce, Apache Pig, or something else?
Value: Last but not least, value is the most important characteristic of big data. It asks whether the data obtained is actually useful to the organization. Value-added information has a great effect in boosting an organization.
At this stage the question arises: how does big data work? All recent big data technologies focus on environments that are low cost, low complexity, highly stable, and well integrated. These emerging technologies and tools help users handle big data cost-effectively; they determine how big data is stored, processed, and analyzed. Here are the top technologies for big data.
Hadoop: An Apache project that provides a free, open-source, Java-based programming framework. It supports the storage and processing of very large data sets and fundamentally works in a distributed environment. From a functional point of view, the Hadoop project includes:
· Hadoop Common
· Hadoop Distributed File System (HDFS)
· Hadoop YARN
· Hadoop MapReduce
Each of these Hadoop components performs an individual task: Hadoop Common provides the essential libraries that support the Hadoop framework; the Hadoop Distributed File System (HDFS) stores extremely large data sets across distributed machines; Hadoop YARN (Yet Another Resource Negotiator) handles job scheduling and resource supervision; and Hadoop MapReduce is used for parallel processing of large data sets. Hadoop is vastly scalable, reliable, low cost, and secure, and it offers colossal storage for any kind of data.
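The map/shuffle/reduce model behind Hadoop MapReduce can be sketched in a few lines. The classic example is a word count; this single-process Python version only illustrates the phases, whereas Hadoop runs the same three phases across a distributed cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two toy "input splits"; in Hadoop these would live in HDFS blocks.
documents = ["big data big ideas", "data flows fast"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"], counts["data"])  # 2 2
```

Because each map call and each reduce call is independent, Hadoop can scatter them across many machines, which is exactly what makes the model scale to very large data sets.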