笔记 2 数据科学家工具箱

CLT
name of root is represented by a slash: /
home directory is represented by a tilde: ~
pwd print working directory
recipe: command -flags arguments
clear: clear out the commands in your current CLI window
ls lists files and folders in the current directory

-a lists hidden and unhidden files and folders

-al lists details for hidden and unhidden files and folders
cd stands for “change directory”

cd takes as an argument the directory you want to visit

cd with no argument takes you to your home directory

cd .. allows you to chnage directory to one level above your current directory
mkdir stands for “make directory”
touch creates an empty file
cp stands for “copy”

cp takes as its first argument a file, and as its second argument the path to where you want the file to be copied

cp can also be used for copying the contents of directories, but you must use the -r flag
rm stands for “remove”

use rm to delete entire directories and their contents by using the -r flag
mv stands for “move”

move files between directories

use mv to rename files
echo will print whatever arguments you provide
date will print today’s date
git

$ git config --global user.name "Your Name Here" # 输入用户名
$ git config --global user.email "your_email@example.com" # 输入邮箱
$ git config --list # 检查
$ git init # 初始化目录
$ git add . # 添加新文件
$ git add -u # 更新改名或删除的文件
$ git add -A|git add --all # 添加所有改动
$ git commit -m "your message goes here" # 描述并缓存本地工作区改动到上一次commit
$ git log # 查看commit记录 用Q退出
$ git status # 查看状态
$ git remote add # 添加服务器端地址
$ git remote -v # 查看远端状态
$ git push # 将本地commit推送到github服务器端
$ git pull|fetch|merge|clone # 本地获取远端repo
$ exit # 退出

Git = Local (on your computer); GitHub = Remote (on the web)
基本问题
描述分析：对数据进行描述但不解释
探索分析：寻找未知的变量间关系（相关不代表因果）
推断分析：用小样本推断总体统计模型的目标强依赖采样过程
预测分析：用一组变量预测另一变量不一定有因果关系
因果分析：改变一个变量引发另一个变量变化的分析随机实验平均效果
机理分析：对个体改变一个变量所导致另一个变量的精确变化公式模拟与参数拟合
数据次于问题
大数据依赖科学而不是数据
实验设计重视可重复性随机与分组预测与推断不同不要选数据

2.1 基础知识

层次：操作系统 - shell - 终端 - 命令行工具
分类：可执行文件、shell 内置命令、脚本、shell 函数、宏

2.2 数据获取

复制：cp 或 scp（安全复制） > scp -i mykey.pem ~/Desktop/logs.csv ubuntu@ec2-184-73-72-150.compute-1.amazonaws.com:data
解压：unpack > unpack logs.tar.gz
转化 excel 为csv：in2csv、csvcut、csvlook > in2csv data/imdb-250.xlsx | head | csvcut -c Title,Year,Rating | csvlook
查询关系数据库：sql2csv > sql2csv –db ‘sqlite:///data/iris.db’ –query ‘SELECT * FROM iris’ ‘WHERE sepal_length > 7.5’
互联网下载：curl -u 登录 -L 链接跳转 -I http头文件 > curl -s http://www.gutenberg.org/cache/epub/76/pg76.txt | head -n 10 > curl -u username:password ftp://host/file > curl -L j.mp/locatbbar
API：curlicue 来进行认证

2.3 可重复性工具

!! 可重复上次命令
chmod 增加权限
2.3 !/usr/bin/env bash 增加状况说明
NUM_WORDS=“$1” 增加参数

Notes

笔记 2 数据科学家工具箱

2.1 基础知识

2.2 数据获取

2.3 可重复性工具

2.3 !/usr/bin/env bash 增加状况说明

2.4 链接