Monday, August 17, 2015

MongoDB




Week 5:

https://github.com/uisso/course-mongodb-M101J/blob/master/notes/week5.md
https://github.com/jrgcubano/mongoeduc/tree/master/m101js/week5/examples/byme/aggregation_project/quiz

Homework
http://mongodbchamps.blogspot.com/2014/09/m101j-mongodb-for-java-developers_43.html

https://github.com/jrgcubano/mongoeduc/tree/master/m101js/week5/homework/hw5.4



// Group zip documents by state and sum their pop fields
db.zips.aggregate([ { $group : { _id : "$state", population : { $sum : "$pop" } } } ])
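As a variation (not taken from the linked notes), the same pipeline can be extended with $sort and $limit stages, for example to list the five most populous states:

db.zips.aggregate([
    { $group : { _id : "$state", population : { $sum : "$pop" } } },
    { $sort  : { population : -1 } },  // most populous first
    { $limit : 5 }
])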

Creating Indexes Quiz
// Compound index: class ascending, then student_name ascending
db.students.createIndex( { "class" : 1, "student_name" : 1 } )
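A query that filters on the index prefix and sorts on the second key can be answered from this index alone ("physics" here is just a hypothetical class name):

db.students.find( { "class" : "physics" } ).sort( { "student_name" : 1 } )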

QUIZ: DOT NOTATION AND MULTIKEY

Suppose you have a collection called people in the database earth whose documents contain a work_history field holding an array of embedded documents, each with a company field. The following creates a descending multikey index on those embedded values:

// Multikey index: one index entry per work_history array element, descending by company
db.people.createIndex({ "work_history.company" : -1 });
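Queries on the embedded field can then use the multikey index (the company value below is a hypothetical example):

db.people.find( { "work_history.company" : "10gen" } )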

Week 4 Answers: http://hao-deng.blogspot.com/2013/06/mongodb-course-note-4.html
http://mongodbchamps.blogspot.com/2014/09/m101j-mongodb-for-java-developers.html


https://github.com/checkcheckzz/MongoUCourseVideo#w4
https://github.com/ulrich/M101J/tree/master/hw4
https://github.com/jrgcubano/mongoeduc/tree/master/m101p/week4/homework/hw4.2
http://ogankeskiner.blogspot.com/2015/04/m101n-mongodb-for-net-developers.html
https://github.com/olange/learning-mongodb/blob/master/course-m101p/hw4-4-answer.md

Week 5 Answers: http://johnsjavapda.blogspot.com/2013/12/mongo-week-5-aggregation-framework.html


Homework:

http://da8y01.github.io/gh-blog/2014/05/07/m101j-week1-introduction.html

Monday, June 15, 2015

Hive Exercise

CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/training/movie';


CREATE EXTERNAL TABLE movierating (userid INT, movieid INT, rating INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/training/movierating';


INSERT OVERWRITE LOCAL DIRECTORY '/home/training/movie' SELECT * FROM movie;

INSERT OVERWRITE LOCAL DIRECTORY '/home/training/movierating' SELECT * FROM movierating;
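A natural follow-up (not part of the original exercise) is to join the two tables, for example to compute the average rating per movie:

-- average rating per movie title
SELECT m.name, AVG(mr.rating) AS avg_rating
FROM movie m
JOIN movierating mr ON (m.id = mr.movieid)
GROUP BY m.name;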

Monday, June 8, 2015

CCD-410 Study Guide

Exam Sections


These are the current exam sections and the percentage of the exam devoted to each topic.

·           Core Hadoop Concepts (25%)
·           Storing Files in Hadoop (7%)
·           Job Configuration and Submission (7%)
·           Job Execution Environment (10%)
·           Input and Output (6%)
·           Job Lifecycle (18%)
·           Data Processing (6%)
·           Key and Value Types (6%)
·           Common Algorithms and Design Patterns (7%)
·           The Hadoop Ecosystem (8%)

1. Core Hadoop Concepts (CCD-410: 25% | CCD-470: 33%)


Objectives

·           Recognize and identify Apache Hadoop daemons and how they function in both data storage and data processing, under both CDH3 and CDH4.

·           Understand how Apache Hadoop exploits data locality, including rack placement policy.

·           Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.

·           Identify the role and use of both MapReduce v1 (MRv1) and MapReduce v2 (MRv2 / YARN) daemons (JobTracker and TaskTracker in MRv1; ResourceManager, NodeManager, and the per-job ApplicationMaster in MRv2).


Section Study Resources

·         Hadoop File System Shell Guide

·         Apache YARN docs

·         CDH4 YARN deployment docs

·         CDH4 update including MapReduce v2 (MRv2).


2. Storing Files in Hadoop (7%)

Objectives

·         Analyze the benefits and challenges of the HDFS architecture

·         Analyze how HDFS implements file sizes, block sizes, and block abstraction.

·         Understand default replication values and storage requirements for replication.

·         Determine how HDFS stores, reads, and writes files (see the shell commands after this list).

·         Given a sample architecture, determine how HDFS handles hardware failure.
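A few file system shell commands make these behaviors concrete; a minimal sketch, assuming the /user/training home directory used elsewhere in these notes (purchases.txt is a placeholder file name):

hadoop fs -put purchases.txt /user/training/          # write: the client streams blocks to DataNodes
hadoop fs -cat /user/training/purchases.txt           # read: blocks are fetched directly from DataNodes
hadoop fs -setrep -w 2 /user/training/purchases.txt   # change the replication factor (the default is 3)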


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd edition: Chapter 3

·         Hadoop Operations: Chapter 2

·         Hadoop in Practice: Appendix C: HDFS Dissected


3. Job Configuration and Submission (7%)

Objectives

·         Construct proper job configuration parameters (see the driver sketch after this list).

·         Identify the correct procedures for MapReduce job submission.

·         Use the various commands involved in job submission.
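A minimal driver sketch, assuming the new org.apache.hadoop.mapreduce API (CDH4 era; on CDH3 the equivalent of Job.getInstance is new Job(conf, name)). WordCountMapper and WordCountReducer are hypothetical placeholder classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);    // tells Hadoop which jar to ship
        job.setMapperClass(WordCountMapper.class);   // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class); // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit and block until done
    }
}

The job is then submitted from the command line, e.g. hadoop jar wordcount.jar WordCountDriver /user/training/input /user/training/output.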


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 5


4. Job Execution Environment (10%)

Objectives
·         Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer.

·         Understand the key fault tolerance principles at work in a MapReduce job.

·         Identify the role of Apache Hadoop Classes, Interfaces, and Methods.

·         Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.


Section Study Resources
·         Hadoop in Action: Chapter 3

·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 6


5. Input and Output (6%)

Objectives

·         Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements (a configuration sketch follows this list).

·         Understand the role of the RecordReader, and of sequence files and compression.
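A short configuration sketch on an existing Job; the particular formats and codec chosen here are illustrative, not prescribed by the guide:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {
    // Read plain text lines; write compressed binary key/value pairs.
    static void configureFormats(Job job) {
        job.setInputFormatClass(TextInputFormat.class);           // key = byte offset, value = line
        job.setOutputFormatClass(SequenceFileOutputFormat.class); // binary, splittable output
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);              // compress runs of records
    }
}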


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 7

·         Hadoop in Action: Chapter 3

·         Hadoop in Practice: Chapter 3


6. Job Lifecycle (18%)

Objectives
·         Analyze the order of operations in a MapReduce job.

·         Analyze how data moves through a job.

·         Understand how partitioners and combiners function, and recognize appropriate use cases for each (see the partitioner sketch after this list).

·         Recognize the processes and role of the sort and shuffle process.
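A minimal custom partitioner sketch (the routing rule is an arbitrary illustration): it sends every key starting with the same character to the same reducer, so all such keys land in the same output file.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) return 0;
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions; // non-negative bucket
    }
}

It is registered with job.setPartitionerClass(FirstCharPartitioner.class); a combiner is registered the same way via job.setCombinerClass(...), typically reusing the reducer class when the reduce function is commutative and associative.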


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 6

·         Hadoop in Practice: Techniques in section 6.4

Two blog posts from Philippe Adjiman’s Hadoop Tutorial Series

·         Tutorial on combiners

·         Tutorial on partitioners


7. Data Processing (6%)

Objectives

·         Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values.

·         Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s). For example, a word count over the two input lines "a b" and "b c" emits (a,1), (b,1), (b,1), (c,1) from the mappers; with a single reducer, the job writes a 1, b 2, c 1 to one file, part-r-00000.


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 7 on Input Formats and Output Formats

·         Hadoop in Practice: Chapter 3


8. Key and Value Types (6%)

Objectives
·         Given a scenario, analyze and determine which of Hadoop’s data types for keys and values are appropriate for the job.

·         Understand common key and value types in the MapReduce framework and the interfaces they implement (a mapper sketch using the standard types follows this list).
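A mapper sketch using the standard box types (class and field names are illustrative): keys must implement WritableComparable because the framework sorts them; values need only implement Writable.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE); // emit (Text word, IntWritable 1)
        }
    }
}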


Section Study Resources
·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 4

·         Hadoop in Practice: Chapter 3


9. Common Algorithms and Design Patterns (7%)

Objectives

·         Evaluate whether an algorithm is well-suited for expression in MapReduce.

·         Understand the implementation of, limitations of, and strategies for joining datasets in MapReduce.

·         Analyze the role of DistributedCache and Counters (see the snippets after this list).
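Two small snippets for the facilities named above; the file path is a hypothetical placeholder:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class JoinSetup {
    // Driver side: ship a small lookup file to every task node for a map-side join.
    static void addLookupFile(Job job) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/training/lookup.txt"),
                                      job.getConfiguration());
    }
    // Task side, e.g. inside a Mapper, count records with no match in the lookup:
    //     context.getCounter("Join", "UNMATCHED").increment(1);
}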


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd Edition: Chapter 8

·         Hadoop in Practice: Chapter 4, 5, 7

·         Hadoop in Action: Chapter 5.2


10. The Hadoop Ecosystem (8%)

Objectives

·         Analyze a workflow scenario and determine how and when to leverage ecosystem projects, including Apache Hive, Apache Pig, Sqoop and Oozie.

·         Understand how Hadoop Streaming might apply to a job workflow (a sample invocation follows this list).
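For reference, a minimal streaming invocation; the jar location varies by distribution, and the mapper and reducer here are ordinary executables chosen only for illustration:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input  /user/training/input \
    -output /user/training/streaming-out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc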


Section Study Resources

·         Hadoop: The Definitive Guide, 3rd Edition: Chapters 11, 12, 14, 15

·         Hadoop in Practice: Chapters 10, 11

·         Hadoop in Action: Chapters 10, 11



·         Apache Hive docs

·         Apache Pig docs

·         Introduction to Pig Video

·         Apache Sqoop docs



·         Each project in the Hadoop ecosystem has at least one book devoted to it. The exam scope does not require deep knowledge of programming in Hive, Pig, Sqoop, Cloudera Manager, Flume, etc.; rather, it covers how those projects contribute to the overall big data ecosystem.