"Big data" is a data set with a particularly large volume and data categories, and such data sets cannot be captured, managed and processed by traditional database tools. "Big data" first refers to data volumes? Large refers to a large data set, usually in 10TB? However, in practical application, many enterprise users put multiple data sets together, which has formed PB-level data volume; Secondly, it refers to the large variety of data, which comes from a variety of data sources, and the types and formats of data are increasingly rich. It has broken through the previously defined structured data category, including semi-structured and unstructured data. Secondly, the speed of data processing is fast, and the real-time processing of data can be achieved even when the amount of data is huge. The last feature is the high authenticity of data. With the interest of new data sources such as social data, enterprise content, transaction and application data, the limitations of traditional data sources have been broken, and enterprises increasingly need effective information power to ensure their authenticity and security.
Data collection: ETL tools extract data from distributed, heterogeneous data sources, such as relational databases and flat data files, into a temporary staging layer, where it is cleaned, transformed, and integrated, and finally loaded into a data warehouse or data mart to serve as the basis for online analysis and data mining.
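The extract-transform-load flow described above can be sketched in a few lines of Python. This is a minimal illustration only: the inline CSV text and the `sales` table are hypothetical, and real pipelines use dedicated ETL tooling against actual source systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a source system.
RAW_CSV = """region,amount
north, 100
south,200
north,50
"""

def etl(raw_text: str) -> sqlite3.Connection:
    """Extract rows from CSV text, clean and transform them, load into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        region = row["region"].strip().lower()   # clean: normalize text
        amount = float(row["amount"])            # transform: cast to a number
        rows.append((region, amount))
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn

conn = etl(RAW_CSV)
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)  # {'north': 150.0, 'south': 200.0}
```

Once loaded, the warehouse-side table can be queried for analysis, which is exactly the "basis of online analysis and data mining" role the text describes.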
Data access: relational databases, NoSQL, SQL, etc.
Infrastructure: Cloud storage, distributed file storage, etc.
Data processing: NLP (Natural Language Processing) is the discipline that studies language problems in human-computer interaction. The key to natural language processing is to make computers "understand" natural language, so the field is also called NLU (Natural Language Understanding), or computational linguistics. On the one hand, it is a branch of language information processing; on the other hand, it is one of the core topics of artificial intelligence.
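As a purely illustrative first step, many NLP pipelines begin with tokenization and word-frequency counting, which can be sketched with Python's standard library (the sample sentence and regex tokenizer are simplifications; real systems handle far richer language):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

doc = "Big data needs big tools; big tools need big data."
freq = Counter(tokenize(doc))
print(freq.most_common(2))  # [('big', 4), ('data', 2)]
```

Frequency tables like this feed later stages such as classification or topic analysis.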
Statistics: hypothesis testing, significance testing, analysis of variance, correlation analysis, t-tests, chi-square analysis, partial correlation analysis, distance analysis, regression analysis (simple regression, multiple regression, stepwise regression, regression prediction with residual analysis, ridge regression, logistic regression, curve estimation), factor analysis, principal component analysis, and cluster analysis (including fast clustering methods)
Data mining: classification, estimation, prediction, affinity grouping (association rules), clustering, description and visualization, and mining of complex data types (text, Web, graphics, video, audio, etc.)
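Clustering, one of the tasks listed above, can be illustrated with a deliberately tiny k-means on one-dimensional data. The fixed initial centers keep the sketch deterministic; real implementations handle multi-dimensional data, random initialization, and convergence checks:

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny k-means on 1-D data with fixed initial centers (illustrative only)."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 0.5, 9.0, 9.5, 8.5]
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(centers)  # [1.0, 9.0]
```

The alternating assignment/update loop is the essence of the algorithm; everything else in production clustering libraries is scale and robustness engineering.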
Prediction: prediction models, machine learning, modeling and simulation.
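The simplest prediction model is an ordinary least-squares line fitted to past observations and extrapolated forward. The data points here are made up for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b for 1-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Fit historical observations (exactly y = 2x + 1), then predict x = 5.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a * 5 + b)  # 11.0
```

Real predictive modeling adds feature engineering, regularization, and validation on held-out data, but the fit-then-extrapolate pattern is the same.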
Presentation of results: cloud computing, tag clouds, relationship diagrams, etc.
To understand the concept of big data, we should first start with "big", which refers to data scale. Big data generally refers to data volumes above 10 TB (1 TB = 1024 GB). Big data differs from the massive data of the past, and its basic characteristics can be summarized by four V's (Volume, Variety, Value, and Velocity): large volume, diversity, low value density, and high velocity.
First, the data volume is huge, ranging from the TB level to the PB level.
Secondly, there are many types of data, such as weblogs, videos, pictures, geographical location information, and so on.
Third, the value density is low. Take video as an example: during continuous monitoring, the useful data may amount to only one or two seconds of footage.
Fourth, the processing speed is fast, following the so-called "1-second rule". This last point is also fundamentally different from traditional data mining technology. The Internet of Things, cloud computing, the mobile Internet, the Internet of Vehicles, mobile phones, tablets, PCs, and the various sensors spread across the globe are all sources or carriers of data.