Doing Data Science by Cathy O'Neil


Doing Data Science
Title : Doing Data Science
Author : Rachel Schutt, Cathy O'Neil
ISBN-10 : 1449358659
ISBN-13 : 9781449358655
Language : English
Format Type : Paperback
Number of Pages : 408
Publication : First published January 1, 2013

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

Topics include:


Statistical inference, exploratory data analysis, and the data science process
Algorithms
Spam filters, Naive Bayes, and data wrangling
Logistic regression
Financial modeling
Recommendation engines and causality
Data visualization
Social networks and data journalism
Data engineering, MapReduce, Pregel, and Hadoop

Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.


Doing Data Science Reviews


  • Suhrob

    Not recommended at all. If the authors do not put much time into a book, neither should the reader.

    Rachel Schutt and Cathy O'Neil put together a book with an ambitious title about the emerging field of data science. They cover a broad ground spanning some basic statistics, machine learning, data acquisition, cleaning and visualization and finally also the ethics and sociology of this field.

    That sounds great doesn't it?

    Except - this is the single laziest book I have ever read. The book is essentially a set of hastily put-together blog posts (which originally appeared, and are freely available, on O'Neil's blog) on Rachel Schutt's data science classes (most of the classes were actually prepared by miscellaneous people from the data science business).

    All material is covered very superficially - while large chunks are devoted to machine learning there is no chance you could learn anything from it since there is not much system to the material and the coverage is largely theoretical and wonder-your-arms-don't-fall-off hand-wavy. Machine Learning in Action is Summa Theologica compared to this. It is not supposed to be a machine learning textbook? All right, it sure isn't. But it is not much else either.

    I follow Cathy O'Neil's blog so the ethics/sociology stuff was not new for me (again - do not expect them to put any extra effort into the book). And let us not be mistaken for a moment - this is still far from a thought-through critique of the field - just a bunch of blog snippets pasted together.

    Top it off with a self-congratulatory chapter of essays by the students who participated in the course (way to go to fill more pages with no content that you don't even have to write yourself!).

    Cathy O'Neil once explained her high efficiency by saying that, e.g., when preparing a 40 minute talk she will devote max. 20 minutes of work to it. If they write a book you read in 5 hours they probably spent a full 2 hours writing it...

    It is unfortunate. The true high gain in efficiency comes from providing good content to your audience that goes on to create great things from it. Spend 8 hours on your 40 minute talk if the 50 people in the audience go on to do great stuff with it.... Or spend 20 minutes just to waste 51 x 40 minutes of time...

  • Philipp

    Pretty cool introductory course over all the things "data science" does nowadays, i.e., mostly data cleaning, machine learning, prediction etc. It's good to see that the authors spend so much time talking about the pitfalls of dirty data, how important it is to be skeptical of the model's output, overfitting, correlation != causation and much more; much stuff I've read so far just focuses on how the model is used, not how the model is interpreted.

    This book will certainly not turn you into a data scientist, but it will a) give you starting points for most common tasks (in other words, when someone asks you for a fairly robust classification method, you might think of random forests or k-NN after this book), and b) point you to lots of books for each of the sub-fields to further your studies.

    My biggest annoyance is the constant love for Google, like every 50 pages or so Google is mentioned and how great it is to work there, and how fun it is to work on Google+ etc. pp.

    Things you need to know before reading:

    - Math: you should know what common symbols like Sigma mean, and be able to solve some equations
    - How to read programs: Most of the book's code is in R, with a bit of Python, and 4 or 5 lines of Go. You definitely need to understand programming to understand these examples, but the authors encourage readers to come up with their own in response to the exercise at hand.

  • Sebastian Gebski

    Good beginning, followed by a string of fails.

    The introduction to linear regression, K-NN & K-Means (AFAIR) sharpened my appetite: it was concise, clear & I had my expectations' bar set for more details about far more sophisticated scientific methods. But ... that was it. The chapter about logistic regression was total crap & the next ones weren't significantly better. Why?
    a.) very poor examples with barely any visualisations
    b.) code examples are bloated & non-illustrative - some are in R, some are in Python - they have barely any descriptions even if they span for several pages
    c.) formulas are just quoted, without derivation or any other clarification - take it or leave it
    d.) each chapter is based on the "contribution" of different people - it's very clear that some of them didn't really have a clear vision of what they wanted to present

    I am very disappointed. It really could have been a good book. But it isn't.

  • Bojan Tunguz

    “Data Science” has become one of the most trendy research fields in recent years, as well as a catchall rubric for various job descriptions and work functions. The cynics and skeptics, and there are many of those, contend that “Data Science” is nothing more than repackaged Statistics, with a bit of coding and hacking thrown in. Its proponents, however, point out that most practicing data scientists use a variety of skills and techniques in their daily work, and come from a vast spectrum of career paths and backgrounds. I tend to side with the latter group, but I too am an outsider to this field and am still trying to get a better understanding of what it really entails.

    “Doing Data Science: Straight Talk from the Frontline” is a compendium of chapters that deal with data science as it is practiced in the real world. Each chapter is written by a different author, all of whom have significant practical experience and are acknowledged authorities on data science. Most of the contributors work in industry, but data science is still so fresh and new that there is a lot of crossing over between academia and the corporate world.

    A few of the chapters include exercises, but these tend to be too advanced and assume too much background material for an introductory book. The exercises still give you a good idea of what kinds of problems data scientists tend to grapple with. However, this book is definitely not a textbook and cannot be effectively used as such. The book doesn’t provide any background on R, statistics, data scrubbing, machine learning, and various other techniques used by data scientists. It is highly unlikely that any single textbook would be able to do justice to all of that material anyway, but a book of that sort could still have a lot of potential use.

    There are two groups of people who would benefit from this book. The first are people who have absolutely no background in data science or any of its related fields, but would like to get a flavor of what data science is all about and are interested in exploring it for career purposes. The second group are people with significant technical background in one of the fields related to data science (programming, statistics, machine learning, etc.) who are interested in broadening their skills and would like to see how their particular strengths would fit within the broader data science field.

  • BCS

    Computer applications have become increasingly pervasive, collecting vast quantities of data relating to many diverse activities. Advances in computer hardware have provided solutions that enable this data to be collected and stored, and data owners expect that important and possibly commercially advantageous information is contained within these data sets.

    Data science is an emerging discipline that, while not yet having a strict and common definition, broadly relates to the preparation and exploration of typically large data sets in order to identify patterns, predict outcomes, classify and derive meaning using a variety of mathematical and software tools.

    This book, which is based on a course developed and delivered by the authors at Columbia University, is a practical introduction to data science and explores the subject from a variety of theoretical, application-specific and professional perspectives.

    Introductory chapters set the scene by exploring the question ‘what is data science?’, outlining the data science landscape and the role and skillset of a data scientist.

    The authors then examine general features of data sources, including the kinds of statistical limitations that can be present in data sets, and outline a data science process that defines a practical, general approach to data science projects. Methods of exploratory data analysis and data modelling are described and supported by practical exercises, which also introduce the reader to the R language.

    As might be expected, a significant part of the book is devoted to the discussion of common probabilistic and statistical approaches used in solving classification and prediction problems, along with discussion of the suitability of different algorithms for different scenarios, how models can be fitted to data sets and ways in which algorithms and models can be evaluated. The book also outlines the principles and use cases of the map/reduce approach.

    In addition to data analysis techniques, the book contains introductions to general data science applications, including recommendation systems, data visualisation and approaches to deriving meaning and causality.

    As well as general data science topics, the authors also consider application-specific topics in data science. These chapters consider the differing features and analytical requirements of data sets such as social network data, epidemiological data and time-stamped financial data.

    Many chapters include additional content from subject matter experts who describe the ways in which the techniques discussed in the book are being applied to solve real problems.

    This book covers a lot of ground and contains many links to other sources. Where applicable, the content is supplemented by practical examples, using tools such as R, bash and Python, and the authors have provided downloadable data sets for readers to explore.

    The text is well written, with an authoritative but engaging informal style. However, it is worth noting that some chapters require the reader to have a reasonable mathematical background, including linear algebra. There is great emphasis on practical application of the content, supported by advice based on the authors’ own experiences. The book is a good introduction to the emerging field of data science, which encourages readers to delve deeper into the subject.

    Reviewed by Patrick Hill CEng MBCS CITP FIAP MIEEE

  • Alex

    Pretty hit-or-miss, chapter to chapter, and not fantastic overall. For a much better introductory overview (which is more in-depth and comprehensible than most of this book, even the parts of this book that intentionally delve into more detail), I'd recommend Fawcett and Provost's "Data Science for Business," even if business is not the reader's area of work or primary interest. The level of detail in this book ranges too oddly to be generally useful; the authors (PhDs in math and stats) switch from colloquial discussion of some topics to a highly mathematical explanation of a very specific data problem that they explored in their personal careers, for example, and tend to give off the impression that one needs to understand everything that a math PhD understands in order to be able to use the techniques they cover (one doesn't).

  • Gavin

    The first third is: Talking About Data Science. But that's good; two careful, socially conscious techies talking is nice, and you would never get the dozens of handy heuristics in this from a usual STEM textbook. Crunchier than it looks - half the value is in the dull-looking, unannotated code samples at the end of each chapter, and isn't spelled out. Pedagogy!

    It is galling, then, that the data for chapters 6 and 8 has already link-rotted away. And half of the cool startups who came to talk to the class are dead and forgotten already.

    Only worth it if you can find the data.

    [
    Thinking #1, Theory 5 #2]

  • Bas

    Disjointed and random

  • briz

    Thus endeth my lunchtime reading book. I intermittently read this, over the course of many months, usually over a sandwich at lunch. For this style of reading, it holds up well: the chapters are discrete packets of data science chat. That said, I agree with other critiques of this book: if you're an aspiring data scientist, this book is NOT sufficient to get you off the ground. It's not a good beginner's book. It's maybe a good "pop data science" book, a pre-beginner's book. It's very light on the technical stuff, and, if anything, it's more like an anthropological survey of the state of the field.

    Each chapter covers a technique or common challenge or strategy, describes the general gist of what's going on, and then points you in the direction of papers, other books, or tutorials online. Early chapters have some "exercises", though they're more like general pointers of "oh, you could try this, I guess?" Later chapters don't even bother.

    For an O'Reilly book, I was disappointed that the GitHub repo didn't have, for example, the code examples mentioned in the book, or the exercises and toy datasets. (What? Are we supposed to manually copy down several pages of R code?!) Or even just a README.md with a bibliography (given how many shortened Google links are used as citations)? This makes it a starkly UNFRIENDLY book, which is weird since O'Reilly books (well, the good ones) can be very, very rich resources. This, instead, felt thin - and the repo is basically pointless.

    I *will* say that I enjoyed the banter-y tone of the book, and some of the discussions of techniques (e.g. there was a great, intuitive explanation of Principal Component Analysis) and "real world" issues (e.g. how Kaggle competitions are basically data science in a vacuum; what it's like to be a lady data scientist) were quite good. But, overall, yeah, this isn't really a "good enough" data science book.

  • Roberto Rigolin F Lopes

    We are in 2013, no one knows what the heck is "data science" but there are plenty of jobs out there. Here is a course for you, future data rockstar. Rachel and Cathy invited a bunch of people from industry to talk about a wide range of topics: from statistical inference to data visualization with plenty of algorithms, R code and data sets. This is therefore a hands on course with good theoretical depth. And the take away message is: if the world is a bunch of data pipes, don't just be a plumber. Rather behave like the freaking Mario Bros!

  • Stano

    The book is a very nice overview of the data science topic. It’s not a textbook of algorithms but it definitely does a good job explaining some of them. My favourite aspect is the recurrent narrative of the skills of a data scientist - how various skills can be combined and are actually desired.

  • Stanley Choi

    It will be a good fit for those wanting to understand data science in plain English, without alien code.

  • Ferhat Culfaz

    Nice overview, covers a range of interesting topics. Not so technical and an easy read.

  • Sean

    Interesting book with highlights from Schutt's course at Columbia. Full-color graphs. Covers various aspects of what real data scientists do. Good overview book, but light on technical details.

  • Mohammad Javad Jafari

    Not bad for those who are new to the data science world; however, I recommend Data Science for Dummies for them.

  • Rodrigo Rivera

    Data science is still a very fuzzy term: some claim it is a fancy name for statistics, others say it is the new business intelligence but for big data, and still others believe that data science is a completely new subject. There is neither consensus nor an official definition. It is, however, a very sexy topic at the moment, and almost every publisher now has a book on it in its catalog; O'Reilly is no exception.

    Doing Data Science: Straight Talk from the Frontline tries to be an introduction to the topic without heavy mathematical theory behind it. Rather, it aims to describe the everyday work of data scientists and to build a basic understanding of data science.

    The authors are well known in the scene: Cathy O'Neil is a well-known blogger (mathbabe) with a very strong mathematical background, and Rachel Schutt teaches at Columbia University in New York. These are all the right conditions for a good book on the subject: experienced experts who write very well and work in data science.

    But this is exactly where the problem lies: the book is neither a textbook nor a novel. It feels just like a collection of blog posts or a longer magazine article, because some chapters are guest contributions from other experts or from students of the course. On top of that, the book itself is a collection of presentations and talks from the data science lecture series at Columbia University, so the style differs somewhat from chapter to chapter. In addition, many things in the book are unfortunately either explained extremely superficially or are even wrong.

    It is a great pity, since the book has a great deal of potential. Nevertheless, Doing Data Science: Straight Talk from the Frontline is, on the whole, a positive contribution. Anyone can understand the topics, it is entertaining, and the reading recommendations are very comprehensive. Anyone interested in data science can therefore read the book quickly and get a good overview. Afterwards, one can learn the theory behind data science from a more comprehensive textbook on the recommendation list or from “An Introduction to Statistical Learning”.

  • Louis

    Doing Data Science is about the practice of data science, not its implementation. It is based on a course on data science that featured a guest lecturer on each topic. This leads to the guest lecturers (and chapters) focusing more on important concepts rather than the methodology. So, this is not a textbook or a how-to-do-this type of book, rather it is a how-to-think-when-doing book.

    A problem with books like this where each chapter is written by someone different is the need for coherence. A second is that each author typically has something to say, and she has to say it in her chapter. So, compared to other data science books, it suffers from the chapters not building on each other in a systematic way and having multiple messages that appear as you go through the book.

    One benefit from this is that each author has something to say. While I find the book thin on how to do things, this is a good source of wisdom in why things are done and issues that come up along the way in real life. I am teaching data science for the first time and I find myself turning here for topics of discussion which my chosen textbooks don't cover (as they have more focus on how to do things).

    I don't think this is the book to use to learn how to do data science, and I suspect the students at Columbia learned how to find other sources to help them figure things out. But it provides wisdom, which is harder to find and worth quite a bit.

    Note: I received a free electronic copy of this book from the publisher as part of the O'Reilly Bloggers program.

  • Leland William

    What is all the clamor about data science? What even is data science? Rachel Schutt addresses these questions in the introductory chapter of Doing Data Science. As a data padawan, naive and idealistic, I came to this book with the expectation that it would give me the prestidigitation of a powerful sorcerer. Needless to say, the book disappointed in that aspect. Turns out I'm going to have to sidle up to my computer and math books for 10,000 hours before I can do anything magical. Nonetheless, this book is an excellent introduction to the concept of data science and should be a go-to resource for anyone interested in learning more about the subject. The book is a compendium of lectures by figures in the data science community about different aspects of the field. Each of these lectures gives a broad overview of a subject. Math competency is recommended but not required to get the gist of most of the chapters.

  • Risto Hinno

    Interesting book, because it has many authors (the book is based on lectures) and different viewpoints on data science. I liked it because it is like real life: sometimes inconsistent and demanding. After reading this book you understand that you don't understand a bunch of things. It is not like the usual articles/books about data science that show only its glory. It showed what the pitfalls in data science are and how f****** hard it can be to get practical value from it. I liked that the book discussed many themes and gave code examples and exercises which really get you thinking. Books written by practitioners are usually exciting to read. The only minus is that some of the code samples are old and some of the datasets the book refers to can no longer be found on the internet.

  • Sefa

    Comprehensive intro on "What is data science?". It is nice that the book is not fully technical but also analyzes the hype around the field and its deep connections to math, statistics and computer science. Interesting that in each chapter, the book presents one or two data science people from industry and how they use the described methods in their work. The way machine learning methods are described formally could have been better; it is hard to follow the math/notation the book uses. Also, the level of detail is not consistent (or monotonically increasing) through the book. The R/Python/bash code snippets are nice and useful to have for reference.

  • Ji

    This book evolved from a course on data science, whose blog I followed last year. It was fun to read them in blog posts, but the same list of suggested resources with scattered code on particular topics pieced together does not necessarily make a very useful or even meaningful book - after all, as "data science" has become such a business cliche and sales pitch for career crawlers (myself included), it's no surprise to see books with such a title more nonsensical than average. With that expectation in mind, this book should be considered rather well written. (Mark it read so that I stop wasting time on it)