
What is Big Data Analysis?

Big data analytics is the process of examining large and varied data sets -- i.e., big data (black box data, social media data, stock exchange data, search engine data, transport data, power grid data) -- to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more informed business decisions. Volume, Variety, and Velocity are the three V’s of Big Data.

A person involved in this kind of job is called a “Data Analyst”. To earn this tag, he/she needs to be very good at statistics, and if that is accompanied by software development skills, he/she would be called a “Data Scientist”.

How to be a renowned and efficient data scientist?

Technical Skills that the person needs to be good at are:
  • Statistical methods and packages (e.g. SPSS)
  • R and/or SAS languages
  • Data warehousing and business intelligence platforms
  • SQL databases and database querying languages
  • Programming (e.g. XML, JavaScript or ETL frameworks)
  • Database design
  • Data mining
  • Data cleaning and munging
  • Data visualization and reporting techniques
  • Working knowledge of Hadoop & MapReduce
  • Machine learning techniques
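As a small taste of the “data cleaning and munging” item above, here is a minimal Python sketch (standard library only; the field names and records are made up for illustration) that normalizes messy text fields and drops records missing a required field:

```python
# Toy data-cleaning pass: trim whitespace, normalize case,
# and drop records whose required "name" field is missing.
raw_records = [
    {"name": "  Alice ", "city": "DELHI"},
    {"name": "", "city": "Mumbai"},        # missing name -> dropped
    {"name": "Bob", "city": " pune "},
]

def clean(record):
    # Strip surrounding whitespace and title-case every text field.
    return {key: value.strip().title() for key, value in record.items()}

cleaned = [clean(r) for r in raw_records if r["name"].strip()]
```

Real pipelines would use a library such as pandas, but the idea is the same: define small, repeatable transformations and filter out unusable rows.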

Weapons (tools) they need to be aware of:

And when we talk about big data analytics, Hadoop is something we can’t afford to miss. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware using the MapReduce programming model. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. It is currently managed by the Apache Software Foundation, a global community of software developers and contributors.
Currently, four core modules are included in the basic framework from the Apache Foundation:
Hadoop Common – the libraries and utilities used by other Hadoop modules.
Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
YARN – (Yet Another Resource Negotiator) provides resource management for the processes running on Hadoop.
MapReduce – a parallel processing software framework. It comprises two steps: in the Map step, a master node takes the input, partitions it into smaller subproblems, and distributes them to worker nodes, which process them in parallel. In the Reduce step, the answers to all of the subproblems are collected and combined to produce the output.
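The classic introductory MapReduce job is counting words. The sketch below simulates the two phases in plain Python on a single machine (real Hadoop distributes the pairs across a cluster and shuffles them between phases, which this toy version does not do):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    pairs = []
    for doc in documents:
        for word in doc.lower().split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    # Reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big insights", "data drives decisions"]
result = reduce_phase(map_phase(docs))
```

The key design idea is that both phases operate on independent key-value pairs, so each can be spread across many worker nodes without coordination beyond the shuffle.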
Operational data: “NoSQL” databases provide a mechanism for storing and retrieving such data.
MongoDB and Cassandra are examples of NoSQL databases (both free and open-source).
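To get a rough feel for the document model these NoSQL databases use, here is a toy in-memory “document store” in plain Python. This is only an illustration of the idea (schema-free records queried by field values), not the actual MongoDB or Cassandra API:

```python
# Toy document store: schema-free dicts in a list, queried by field match.
collection = []

def insert(doc):
    # Documents need not share a schema -- note the optional "skills" field.
    collection.append(doc)

def find(**criteria):
    # Return every document whose fields match all given criteria.
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

insert({"name": "Alice", "role": "analyst"})
insert({"name": "Bob", "role": "scientist", "skills": ["R", "SQL"]})
matches = find(role="scientist")
```

Real document databases add persistence, indexing, and replication on top of this basic record-and-query model.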
Jaspersoft BI Suite: it can generate reports from the databases.
Karmasphere Studio and Analyst: an IDE that makes creating and running Hadoop jobs easy. Karmasphere Analyst, for instance, is designed to simplify the process of plowing through all of the data in a Hadoop cluster. It has subroutines for uncompressing zipped log files, then strings them together and parameterizes the Hive calls to produce a table of output for perusing.
And there are many other tools that you can put to use according to your convenience.
Remember, it’s not just about learning these tools; application matters more, and that depends on your own creativity and thinking skills. Self-help tutorials won’t teach you the secret sauce -- which statistics to consider and which to ignore -- so stay in touch with some analytics professionals (and with today’s abundance of social media platforms over the internet, that is possible!).

Want to learn it? Check out these links:
(If there exist any other helpful resource not mentioned here please write in the comment section.)







Comments

  1. Contents are very simplified with a brief description. This will provide a great overview for absolute beginners.

