Apache Spark and PySpark
Some reading materials:
Spark Documentation: Spark Official Documentation
PySpark Documentation: PySpark API Documentation
Books: "Learning Spark", "Advanced Analytics with Spark"

Big Data: Data volumes in TBs, PBs and more. Hadoop is an older system used for processing big data; it uses a file system to store and process data.

Characteristics:
Volume - size of data (bytes < KB < MB < GB < TB < PB ...). Ex: digital payments, social media data, e-commerce data, etc.
Velocity - speed at which data travels and arrives
Variety - Excel, RDBMS, txt, JSON, XML, HTML, documents, images, audio, video, geo maps, etc.
Veracity - quality and accuracy of data; whether the data is trustworthy or not
Value - actionable information, or any data that provides meaning for business decisions or can be considered useful

Company requirements:
Data storage
Data processing speed
Scalability

Hadoop: (HDFS - filesystem) (MapReduce (computation) - programming framework) Hand...
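To make the MapReduce idea concrete, here is a minimal sketch of the model in plain Python (not Hadoop's actual Java API): a map phase emits (word, 1) pairs, pairs are grouped by key, and a reduce phase sums the counts. The function names `map_phase` and `reduce_phase` are illustrative, not part of any framework.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle: sort/group the pairs by key (the word),
    # then Reduce: sum the counts for each word
    pairs.sort(key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

lines = ["spark processes big data", "spark uses memory"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(intermediate)
print(counts)  # "spark" appears twice, every other word once
```

In real Hadoop, the map and reduce steps run in parallel across many machines, and the shuffle moves intermediate pairs over the network; Spark follows the same logical model but keeps intermediate data in memory.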